rMATS : RNA-seq 可变剪切分析 (可比较两有重复样本)

Reference :

Installation

cd 
git clone https://github.com/Xinglab/rmats-turbo.git
cd rmats-turbo
./build_rmats --conda
conda activate {my.bio-tools.path}/rmats-turbo/conda_envs/rmats
rmats.py 

cd 
git clone https://github.com/Xinglab/rmats2sashimiplot.git
cd rmats2sashimiplot/
bash 2to3.sh 
pip uninstall rmats2sashimiplot
python ./setup.py install
rmats2sashimiplot

Usage

  • Take bam files & gtf file as input.
#-------------------------------------------
# generate bam files from STAR, arguments from script of rmats.
## --s1 --s2 take fastq files as input
wd=
output=$wd/results
index=refdata/cellranger_ref/hg38/star
gtf=refdata/cellranger_ref/hg38/genes/genes.gtf
cores=19
cd $wd 
for i in `cat SRR_Acc_List.txt`
do
STAR --readFilesIn $output/01cleandata/${i}_1_paired_clean.fastq.gz $output/01cleandata/${i}_2_paired_clean.fastq.gz /
--chimSegmentMin 2 --outFilterMismatchNmax 3 /
--alignEndsType EndToEnd /
--runThreadN $cores /
--outSAMstrandField intronMotif /
--outSAMtype BAM Unsorted /  # Failed with BAM SortedByCoordinate
--alignSJDBoverhangMin 6 /
--alignIntronMax 299999 /
--genomeDir $index /
--sjdbGTFfile $gtf /
--outFileNamePrefix  $output/02alignment-hg38/${i}_ /
--readFilesCommand zcat


samtools sort -@ $cores $output/02alignment-hg38/${i}_Aligned.out.bam -o $output/02alignment-hg38/${i}.bam
samtools index -@ $cores $output/02alignment-hg38/${i}.bam


done
#-------------------------------------------
vim rmats.sh
#!/bin/bash
echo "/path/to/bam/ctr_rep1.bam,/path/to/bam/ctr_rep2.bam" > b1.txt 
echo "/path/to/bam/treat_rep1.bam,/path/to/bam/treat_rep2.bam" > b2.txt
gtf=/path/to/gtf/xxxx.gtf
out_dir=/path/to/output
tmp_dir=/path/to/tmp
core=16
read_length=150

rmats.py --b1 b1.txt --b2 b2.txt  --od $out_dir --tmp $tmp_dir /
--readLength $read_length --variable-read-length --nthread $core
#--------------------------------------------
python rmats2sashimiplot/src/rmats2sashimiplot/rmats2sashimiplot.py /
-o plot-RI-1-2/ /
--l1 ctr / #label the sample1
--l2 treat /
--b1 /path/to/bam/ctr_rep1.bam,/path/to/bam/ctr_rep2.bam /
--b2 /path/to/bam/treat_rep1.bam,/path/to/bam/treat_rep2.bam /
-e /path/to/output/RI.MATS.JC.txt /
--event-type RI
python rmats.py -h
usage: rmats.py [options]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --gtf GTF             An annotation of genes and transcripts in GTF format
  --b1 B1               A text file containing a comma separated list of the
                        BAM files for sample_1. (Only if using BAM)
  --b2 B2               A text file containing a comma separated list of the
                        BAM files for sample_2. (Only if using BAM)
  --s1 S1               A text file containing a comma separated list of the
                        FASTQ files for sample_1. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --s2 S2               A text file containing a comma separated list of the
                        FASTQ files for sample_2. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --od OD               The directory for final output from the post step
  --tmp TMP             The directory for intermediate output such as ".rmats"
                        files from the prep step
  -t {paired,single}    Type of read used in the analysis: either "paired" for
                        paired-end data or "single" for single-end data.
                        Default: paired
  --libType {fr-unstranded,fr-firststrand,fr-secondstrand}
                        Library type. Use fr-firststrand or fr-secondstrand
                        for strand-specific data. Only relevant to the prep
                        step, not the post step. Default: fr-unstranded
  --readLength READLENGTH
                        The length of each read
  --variable-read-length
                        Allow reads with lengths that differ from --readLength
                        to be processed. --readLength will still be used to
                        determine IncFormLen and SkipFormLen
  --anchorLength ANCHORLENGTH
                        The "anchor length" or "overhang length" used when
                        counting the number of reads spanning splice
                        junctions. A minimum number of "anchor length"
                        nucleotides must be mapped to each end of a given
                        junction. The minimum value is 1 and the default value
                        is set to 1 to make use of all possible splice
                        junction reads.
  --tophatAnchor TOPHATANCHOR
                        The "anchor length" or "overhang length" used in the
                        aligner. At least "anchor length" NT must be mapped to
                        each end of a given junction. The default is 1. (Only
                        if using fastq)
  --bi BINDEX           The directory name of the STAR binary indices (name of
                        the directory that contains the SA file). (Only if
                        using fastq)
  --nthread NTHREAD     The number of threads. The optimal number of threads
                        should be equal to the number of CPU cores. Default: 1
  --tstat TSTAT         The number of threads for the statistical model. If
                        not set then the value of --nthread is used
  --cstat CSTAT         The cutoff splicing difference. The cutoff used in the
                        null hypothesis test for differential splicing. The
                        default is 0.0001 for 0.01% difference. Valid: 0 <=
                        cutoff < 1. Does not apply to the paired stats model
  --task {prep,post,both,inte,stat}
                        Specify which step(s) of rMATS to run. Default: both.
                        prep: preprocess BAMs and generate a .rmats file.
                        post: load .rmats file(s) into memory, detect and
                        count alternative splicing events, and calculate P
                        value (if not --statoff). both: prep + post. inte
                        (integrity): check that the BAM filenames recorded by
                        the prep task(s) match the BAM filenames for the
                        current command line. stat: run statistical test on
                        existing output files
  --statoff             Skip the statistical analysis
  --paired-stats        Use the paired stats model
  --novelSS             Enable detection of novel splice sites (unannotated
                        splice sites). Default is no detection of novel splice
                        sites
  --mil MIL             Minimum Intron Length. Only impacts --novelSS
                        behavior. Default: 50
  --mel MEL             Maximum Exon Length. Only impacts --novelSS behavior.
                        Default: 500
  --allow-clipping      Allow alignments with soft or hard clipping to be used
  --fixed-event-set FIXED_EVENT_SET
                        A directory containing fromGTF.[AS].txt files to be
                        used instead of detecting a new set of events```

Output

  • rMATS 能识别5种Altenative Splicing events (AS : SE / RI / A5SS/A3SS/MXE,对于每种AS,都会有给出比对到exon junction的count。[AS].MATS.JC.txt or [AS].MATS.JCEC.txt

  • 23列,值得关注 :PValueIncLevelDifference=average(IncLevel1) - average(IncLevel2)
    。有几列描述这个剪切事件的染色体位置,名称根据不同AS有点不同。

  • 类似于gene-level的RNA-seq 分析,需要用长度进行 normalization。在rMATS中,用IncLevel (Inclusion Level)对 AS进行定量描述。

    image.png
  • 用likelihood-ratio test ,将sample1和sample2的某event的IncLevel 的差值进行检验。(--cstat)

    I/S
intro : 三个外显子,虚线为intron。看比对到junction 的reads,如果 1-3 ,则跳过了中间,如果 1-2 ;2-3 这个连接部位被检测到了,就表明包含了中间外显子。
Inclusion level ,不同 AS ,长度算法点不一样
paper-stat
vis

版权声明:
作者:Mr李
链接:https://www.techfm.club/p/61773.html
来源:TechFM
文章版权归作者所有,未经允许请勿转载。

THE END
分享
二维码
< <上一篇
下一篇>>