==============
Run scTail
==============

scTail runs in three steps: identify PAS for individual samples, merge PASs across multiple samples, and quantify PAS for individual cells.

The input files are:

* filtered alignment file (bam file)
* annotation file (gtf file)
* reference genome file (fasta file)
* cell barcode list file (tsv file)
* chromosome size file (tsv file)

The output files are:

* paraclu_input.tsv : a file with 4 columns: chromosome, strand, position and reads_count. This is an intermediate file that is passed to paraclu.
* paraclu_output.tsv : a file with 8 columns: the sequence name, the strand, the first position in the cluster, the last position in the cluster, the number of positions with data in the cluster, the sum of the data values in the cluster, the cluster's "minimum density" and the cluster's "maximum density".
* input_to_DP.tsv : a file with 4 columns: chromosome, strand, PAS and cluster_id. This file is the input to the deep learning neural network.
* predict_result.tsv : a file with 3 columns: cluster_id, the label predicted by the CNN model, and the probability of being a positive sample.
* positive_result.bed : a file with 6 columns: chromosome, cluster_start, cluster_end, cluster_name, score and strand. It contains only the positive samples.
* lognorm_fit.pdf : a plot of the fragment size distribution together with the log-normal distribution fitted to it.
* cluster_mapped_gene.bed : a file with 11 columns: chromosome of PAS cluster, start of PAS cluster, end of PAS cluster, PAS cluster id, score of PAS cluster, strand of PAS cluster, chromosome of gene, start of gene, end of gene, gene id and score.
* all_cluster.h5ad : the final result, a cell-by-PAS matrix covering all PASs. In most situations this is the only file you need.
* two_cluster.h5ad : a cell-by-PAS matrix that includes only alternative PASs (two or more PASs).

A small test data set is available for a quick trial run; the following sections show how to download and use it.

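Among the output files listed above, paraclu_output.tsv stores the minimum and maximum density of each cluster in its last two columns. The sketch below shows one way to inspect the ratio between them with ``awk``. The helper name ``filter_by_density_fc`` is hypothetical and not part of scTail, and skipping lines that start with ``#`` is only a precaution in case the file carries a header.

.. code-block:: bash

    #!/bin/bash
    # Sketch: list clusters whose max/min density ratio reaches a threshold.
    # Column positions follow the 8-column layout described above.
    # filter_by_density_fc is a hypothetical helper, not an scTail command.
    filter_by_density_fc() {
        local table=$1      # e.g. paraclu_output.tsv
        local min_fc=$2     # minimum value of max density / min density
        awk -v fc="$min_fc" '
            /^#/   { next }               # skip comment or header lines, if any
            NF < 8 { next }               # skip rows without all 8 columns
            $7 > 0 && $8 / $7 >= fc {     # column 7: min density, column 8: max density
                printf "%s\t%s\t%d\t%d\t%.2f\n", $1, $2, $3, $4, $8 / $7
            }
        ' "$table"
    }

For example, ``filter_by_density_fc paraclu_output.tsv 2`` prints the sequence name, strand, first position, last position and density ratio of every cluster whose ratio is at least 2.
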
Download test file
===================

You can download the test files from figshare_.

.. _figshare: https://doi.org/10.6084/m9.figshare.25902508.v1

The download includes several large files, such as genome.fa, the annotation gtf and a bam file. Note that you also need the index file (.bam.bai) for the bam file: either download it as well or build the index yourself.

Alternatively, you can download the reference genome fasta file from Ensembl or GENCODE. Make sure that you use the same genome.fa and annotation gtf files that were used to build the STAR index.

Run scTail
=============

scTail has three steps: **scTail-callPeak**, **scTail-peakMerge** and **scTail-count**.

* scTail-callPeak : identify PAS signal for an individual bam file.
* scTail-peakMerge : merge PAS signals from different samples.
* scTail-count : quantify PAS signal for individual cells.

You can run scTail-callPeak on the test files with the following code.

.. code-block:: bash

    #!/bin/bash
    # $download is the directory that holds the downloaded test files
    gtfFile=$download/gencode.v44.annotation.gtf
    fastaFile=$download/GRCh38.primary_assembly.genome.fa
    bamFile=$download/PBMC866_GX_CB_UB_filtered_test.bam
    cellbarcodeFile=$download/barcodes.tsv
    chromosomeSize=$download/hg38.chromsize

    outputfile=your_favorite_output

    scTail-callPeak -b $bamFile --gtf $gtfFile --cellbarcode $cellbarcodeFile -f $fastaFile --species human --chromoSize $chromosomeSize -o $outputfile --minCount 50 -p 20

Before running scTail-peakMerge, you need to create a positivebed_list.tsv file that lists one positive bed file per line. The file looks like this.

.. code-block:: bash

    $your_favorite_output/count/positive_result.bed
    $download/onesample_positive.bed

Afterwards, you can run scTail-peakMerge on the test files with the following code.

.. code-block:: bash

    #!/bin/bash
    # path to the positivebed_list.tsv file created above
    positivebed_list=path/to/positivebed_list.tsv
    # same output directory as used for scTail-callPeak
    outputfile=your_favorite_output

    scTail-peakMerge --sampleList $positivebed_list -o $outputfile

You can run scTail-count on the test files with the following code.

.. code-block:: bash

    #!/bin/bash
    cellbarcodeFile=$download/barcodes.tsv
    bamFile=$download/PBMC866_GX_CB_UB_filtered_test.bam
    # same output directory as used for scTail-callPeak
    outputfile=your_favorite_output
    # set after outputfile so that the path expands correctly
    mergebedFile=$outputfile/merge/merged_cluster.bed

    scTail-count --cellbarcode $cellbarcodeFile --bam $bamFile --outdir $outputfile --PAScluster $mergebedFile

Options
========

scTail-callPeak accepts further parameters (``scTail-callPeak -h`` always shows the options of the version you are using):

.. code-block:: text

    Usage: scTail-callPeak [options]

    Options:
      -h, --help            show this help message and exit
      -g GTF_FILE, --gtf=GTF_FILE
                            The annotation gtf file for your analysing species.
      --cellbarcode=CELLBARCODE
                            The file include cell barcode which users want to keep
                            in the downstream analysis.
      -f FASTA, --fasta=FASTA
                            The reference genome file
      -b BAM_FILE, --bam=BAM_FILE
                            The bam file of aligned from STAR or other single cell
                            aligned software.
      -o OUT_DIR, --outdir=OUT_DIR
                            The directory for output [default : $bam_file]
      --chromoSize=CHROMOSIZE
                            The file which includes chromosome length
      --species=SPECIES     This indicates the species that you want to analysis.
                            Only human and mouse are supportted. You should input
                            human or mouse

      Optional arguments:
        --minCount=MINCOUNT
                            Minimum UMI counts for one cluster in all cells
                            [default: 50]
        -p NPROC, --nproc=NPROC
                            Number of subprocesses [default: 4]
        -d DEVICE, --device=DEVICE
                            If your server has the GPU, then the default card 0
                            will be used. If your server did not have the GPU,
                            then cpu will be used.
        --maxReadCount=MAXREADCOUNT
                            For each gene, the maxmium read count kept for
                            clustering [default: 10000]
        --densityFC=DENSITYFC
                            Minimum value for maximum density / minimum density
                            [default: 0]
        --InnerDistance=INNERDISTANCE
                            The resolution of each cluster [default: 100]

scTail-peakMerge accepts further parameters (``scTail-peakMerge -h`` always shows the options of the version you are using):

.. code-block:: text

    Usage: scTail-peakMerge [options]

    Options:
      -h, --help            show this help message and exit
      --sampleList=SAMPLELIST
                            The pathway of tsv file include the path of all
                            samples
      -o OUT_DIR, --outdir=OUT_DIR
                            The directory for output merge bed file [default :
                            $bam_file]

      Optional arguments:
        --maxDistance=MAXDISTANCE
                            Maximum distance between clusters allowed for
                            clusters to be merged. [default : 40]

scTail-count accepts further parameters (``scTail-count -h`` always shows the options of the version you are using):

.. code-block:: text

    Usage: scTail-count [options]

    Options:
      -h, --help            show this help message and exit
      --cellbarcode=CELLBARCODE
                            The file include cell barcode which users want to keep
                            in the downstream analysis.
      -b BAM_FILE, --bam=BAM_FILE
                            The bam file of aligned from STAR or other single cell
                            aligned software.
      -o OUT_DIR, --outdir=OUT_DIR
                            The directory for output [default : $bam_file]
      --PAScluster=PASCLUSTER
                            The bed file of PAS cluster

      Optional arguments:
        -p NPROC, --nproc=NPROC
                            Number of subprocesses [default: 4]
        --maxReadCount=MAXREADCOUNT
                            For each gene, the maxmium read count kept for
                            clustering [default: 10000]

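
The sample list passed to ``--sampleList`` can also be generated rather than written by hand. The following is a minimal sketch that collects ``count/positive_result.bed`` from several scTail-callPeak output directories. The helper name ``make_sample_list`` and the directory names in the usage note are hypothetical; the ``count/positive_result.bed`` layout is taken from the positivebed_list.tsv example shown earlier.

.. code-block:: bash

    #!/bin/bash
    # Sketch: write one positive_result.bed path per line into a sample list.
    # make_sample_list is a hypothetical helper, not an scTail command.
    make_sample_list() {
        local list=$1       # sample list to create, e.g. positivebed_list.tsv
        shift               # remaining arguments: scTail-callPeak output directories
        : > "$list"         # start from an empty list
        local dir bed
        for dir in "$@"; do
            bed=$dir/count/positive_result.bed
            if [ -s "$bed" ]; then
                printf '%s\n' "$bed" >> "$list"
            else
                echo "warning: $bed is missing or empty, skipped" >&2
            fi
        done
    }

For example, ``make_sample_list positivebed_list.tsv sampleA_output sampleB_output`` writes the list, which can then be given to ``scTail-peakMerge --sampleList positivebed_list.tsv``. Directories without a non-empty positive_result.bed are reported on stderr and left out, so it is worth checking the warnings before merging.
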