Run scTail
scTail includes three steps: identify polyadenylation sites (PAS) for individual samples, merge PASs across multiple samples, and quantify PAS for individual cells.
The input files include:
filtered alignment file (bam file)
annotation file (gtf file)
reference genome file (fasta file)
cell barcode list file (tsv file)
chromosome size file (tsv file)
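If you do not already have a chromosome size file, one can be derived from the reference FASTA. The sketch below assumes the file is a two-column, tab-separated table (sequence name, length in bp), as the test file name hg38.chromsize suggests. That format is an assumption, and fasta_to_chromsize is a hypothetical helper, not part of scTail.

```shell
#!/bin/bash
# Hypothetical helper (not an scTail command): print "<name><TAB><length>"
# for every record of the FASTA file given as the first argument.
fasta_to_chromsize() {
    awk '
        /^>/ {
            # A new record starts: emit the previous one, if any.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # drop ">" and anything after the first space
            len = 0
            next
        }
        { len += length($0) }      # sequence line: add its length
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Usage (paths are placeholders):
# fasta_to_chromsize "$download/GRCh38.primary_assembly.genome.fa" > "$download/hg38.chromsize"
```

Compare the sequence names in the result with those in the bam file and the gtf file, since all three need to use the same naming (for example chr1 rather than 1).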
The output files include:
paraclu_input.tsv : a file with 4 columns: chromosome, strand, position, reads_count. This is an intermediate file that is passed to paraclu.
paraclu_output.tsv : a file with 8 columns: the sequence name, the strand, the first position in the cluster, the last position in the cluster, the number of positions with data in the cluster, the sum of the data values in the cluster, the cluster’s “minimum density” and the cluster’s “maximum density”.
input_to_DP.tsv : a file with 4 columns: chromosome, strand, PAS and cluster_id. This file is the input to the deep learning neural network.
predict_result.tsv : a file with 3 columns: cluster_id, the label predicted by the CNN model, and the probability of being a positive sample.
positive_result.bed : a bed file with 6 columns: chromosome, cluster_start, cluster_end, cluster_name, score and strand. It contains only the clusters predicted as positive samples.
lognorm_fit.pdf : a plot of the fragment size distribution together with the fitted log-normal distribution.
cluster_mapped_gene.bed : a file with 11 columns: chromosome of PAS cluster, start of PAS cluster, end of PAS cluster, PAS cluster id, score of PAS cluster, strand of PAS cluster, chromosome of gene, start of gene, end of gene, gene id, score.
all_cluster.h5ad : the final result, containing the cell-by-PAS matrix for all PAS. In most situations, this is the only file you need to focus on.
two_cluster.h5ad : the cell-by-PAS matrix restricted to alternative PAS (genes with two or more PAS).
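To get a feel for the clustering output, the eight columns of paraclu_output.tsv can be summarised with awk. The sketch below prints each cluster's width and its maximum/minimum density ratio, which is the quantity that the --densityFC option of scTail-callPeak sets a minimum for. It assumes whitespace-separated columns in the order listed above and skips any line starting with #. summarise_paraclu is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: summarise a paraclu-style 8-column table.
# Output columns: sequence, strand, first position, last position,
# cluster width, max density / min density.
summarise_paraclu() {
    awk '
        /^#/ || NF < 8 { next }        # skip comment and incomplete lines
        {
            width = $4 - $3 + 1        # first and last positions are inclusive
            ratio = ($7 > 0) ? $8 / $7 : "NA"
            print $1 "\t" $2 "\t" $3 "\t" $4 "\t" width "\t" ratio
        }
    ' "$1"
}

# Usage (path is a placeholder):
# summarise_paraclu paraclu_output.tsv | head
```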
A small test dataset is provided so that you can try scTail quickly.
Download test file
You can download the test files from figshare.
Here, you can also download some large files, including genome.fa, the annotation gtf, a bam file and so on. Please note that you should also download the index file (.bam.bai) for the bam file, or index it yourself.
Alternatively, you can download the reference genome fasta file from Ensembl or GENCODE. Please make sure you use the same genome.fa and annotation gtf file that were used to build the STAR index.
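Because the bam file has to be accompanied by its index, a quick check before running scTail can save a failed run. The sketch below looks for the .bam.bai file next to the bam file and calls samtools index only when it is missing. samtools must be installed separately, and ensure_bam_index is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: make sure <file>.bam.bai exists next to <file>.bam.
ensure_bam_index() {
    bam=$1
    if [ -f "$bam.bai" ]; then
        echo "index found: $bam.bai"
    else
        echo "index missing, running samtools index"
        samtools index "$bam"      # writes $bam.bai for a sorted bam file
    fi
}

# Usage (path is a placeholder):
# ensure_bam_index "$download/PBMC866_GX_CB_UB_filtered_test.bam"
```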
Run scTail
There are three steps in scTail: scTail-callPeak, scTail-peakMerge and scTail-count.
scTail-callPeak : identify PAS signals in an individual bam file.
scTail-peakMerge : merge PAS signals from different samples.
scTail-count : quantify PAS signals for individual cells.
You can run scTail-callPeak on the test files with the following code.
#!/bin/bash
# $download is the directory that holds the downloaded test files
gtfFile=$download/gencode.v44.annotation.gtf
fastaFile=$download/GRCh38.primary_assembly.genome.fa
bamFile=$download/PBMC866_GX_CB_UB_filtered_test.bam
cellbarcodeFile=$download/barcodes.tsv
chromosomeSize=$download/hg38.chromsize
outputfile=your_favorite_output
scTail-callPeak -b "$bamFile" --gtf "$gtfFile" --cellbarcode "$cellbarcodeFile" -f "$fastaFile" --species human --chromoSize "$chromosomeSize" -o "$outputfile" --minCount 50 -p 20
Before running scTail-peakMerge, you should create a positivebed_list.tsv that lists the positive bed file of each sample, one path per line. Replace the placeholders with real paths. The file looks like this.
$your_favorite_output/count/positive_result.bed
$download/onesample_positive.bed
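A minimal sketch for writing this list and checking that every listed bed file exists is shown here. The helper names write_sample_list and check_sample_list are hypothetical, and the paths in the usage comment are the placeholders from the example above, so adapt them to your own output directory.

```shell
#!/bin/bash
# Hypothetical helper: write one path per line into the list file
# given as the first argument.
write_sample_list() {
    list=$1
    shift
    printf '%s\n' "$@" > "$list"
}

# Hypothetical helper: report listed files that do not exist and
# return non-zero if any of them is missing.
check_sample_list() {
    missing=0
    while IFS= read -r bed; do
        [ -z "$bed" ] && continue          # ignore empty lines
        if [ ! -f "$bed" ]; then
            echo "missing: $bed" >&2
            missing=1
        fi
    done < "$1"
    return "$missing"
}

# Usage (placeholder paths):
# write_sample_list positivebed_list.tsv \
#     "$outputfile/count/positive_result.bed" "$download/onesample_positive.bed"
# check_sample_list positivebed_list.tsv
```

Writing the list from shell variables expands them to real paths, so the list file itself never contains unexpanded $ placeholders.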
Afterwards, you can run scTail-peakMerge on the test files with the following code.
#!/bin/bash
positivebed_list=path_to/positivebed_list.tsv  # the list file you created yourself
outputfile=same_as_the_output_of_scTail-callPeak
scTail-peakMerge --sampleList "$positivebed_list" -o "$outputfile"
You can run scTail-count on the test files with the following code.
#!/bin/bash
cellbarcodeFile=$download/barcodes.tsv
bamFile=$download/PBMC866_GX_CB_UB_filtered_test.bam
outputfile=same_as_the_output_of_scTail-callPeak
# merged clusters from the scTail-peakMerge step
mergebedFile=$outputfile/merge/merged_cluster.bed
scTail-count --cellbarcode "$cellbarcodeFile" --bam "$bamFile" --outdir "$outputfile" --PAScluster "$mergebedFile"
Options
More parameters can be set (scTail-callPeak -h always shows the options for the version you are using):
Usage: scTail-callPeak [options]
Options:
-h, --help show this help message and exit
-g GTF_FILE, --gtf=GTF_FILE
The annotation gtf file for your analysing species.
--cellbarcode=CELLBARCODE
The file include cell barcode which users want to keep
in the downstream analysis.
-f FASTA, --fasta=FASTA
The reference genome file
-b BAM_FILE, --bam=BAM_FILE
The bam file of aligned from STAR or other single cell
aligned software.
-o OUT_DIR, --outdir=OUT_DIR
The directory for output [default : $bam_file]
--chromoSize=CHROMOSIZE
The file which includes chromosome length
--species=SPECIES This indicates the species that you want to analysis.
Only human and mouse are supportted. You should input
human or mouse
Optional arguments:
--minCount=MINCOUNT
Minimum UMI counts for one cluster in all cells
[default: 50]
-p NPROC, --nproc=NPROC
Number of subprocesses [default: 4]
-d DEVICE, --device=DEVICE
If your server has the GPU, then the default card 0
will be used. If your server did not have the GPU,
then cpu will be used.
--maxReadCount=MAXREADCOUNT
For each gene, the maxmium read count kept for
clustering [default: 10000]
--densityFC=DENSITYFC
Minimum value for maximum density / minimum density
[default: 0]
--InnerDistance=INNERDISTANCE
The resolution of each cluster [default: 100]
More parameters can be set (scTail-peakMerge -h always shows the options for the version you are using):
Usage: scTail-peakMerge [options]
Options:
-h, --help show this help message and exit
--sampleList=SAMPLELIST
The pathway of tsv file include the path of all
samples
-o OUT_DIR, --outdir=OUT_DIR
The directory for output merge bed file [default :
$bam_file]
Optional arguments:
--maxDistance=MAXDISTANCE
Maximum distance between clusters allowed for clusters
to be merged. [default : 40]
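To illustrate what --maxDistance controls, the sketch below merges neighbouring clusters on the same chromosome and strand when the gap between them is at most a given distance. It is only an illustration of distance-based merging written with sort and awk. It is not scTail-peakMerge's actual implementation, whose exact rules (for example how the gap is measured or how merged clusters are named) may differ. It assumes six-column bed input, and merge_clusters is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: merge bed clusters on the same chromosome and strand
# whose gap is <= maxdist (second argument, default 40).
# Output columns: chromosome, start, end, strand.
merge_clusters() {
    bed=$1
    maxdist=${2:-40}
    # Group by chromosome and strand, then order by start position.
    LC_ALL=C sort -k1,1 -k6,6 -k2,2n "$bed" | awk -v d="$maxdist" '
        BEGIN { OFS = "\t" }
        # Open a new merged cluster when the chromosome or strand changes,
        # or when the gap to the current merged cluster exceeds d.
        $1 != chrom || $6 != strand || $2 - cur_end > d + 0 {
            if (chrom != "") print chrom, cur_start, cur_end, strand
            chrom = $1; strand = $6
            cur_start = $2 + 0; cur_end = $3 + 0
            next
        }
        # Otherwise extend the current merged cluster if needed.
        { if ($3 + 0 > cur_end) cur_end = $3 + 0 }
        END { if (chrom != "") print chrom, cur_start, cur_end, strand }
    '
}

# Usage (path is a placeholder):
# merge_clusters all_samples_positive.bed 40
```

With a distance of 40, two clusters at 100-120 and 150-170 on the same strand are reported as one cluster spanning 100-170, while a cluster starting at 300 stays separate.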
More parameters can be set (scTail-count -h always shows the options for the version you are using):
Usage: scTail-count [options]
Options:
-h, --help show this help message and exit
--cellbarcode=CELLBARCODE
The file include cell barcode which users want to keep
in the downstream analysis.
-b BAM_FILE, --bam=BAM_FILE
The bam file of aligned from STAR or other single cell
aligned software.
-o OUT_DIR, --outdir=OUT_DIR
The directory for output [default : $bam_file]
--PAScluster=PASCLUSTER
The bed file of PAS cluster
Optional arguments:
-p NPROC, --nproc=NPROC
Number of subprocesses [default: 4]
--maxReadCount=MAXREADCOUNT
For each gene, the maxmium read count kept for
clustering [default: 10000]