RNA-seq Data Processing Workflow (Part 1)

Zad • 2023-07-20 21:14 • Essays

1. Building the index

Download the reference genome and the matching annotation file:

wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

mv Homo_sapiens.GRCh38.dna.primary_assembly.fa genome.fa

wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz

gzip -d Homo_sapiens.GRCh38.84.gtf.gz

mv Homo_sapiens.GRCh38.84.gtf genome.gtf

Extract the splice-site and exon files from the annotation:

hisat2_extract_splice_sites.py genome.gtf > genome.ss

hisat2_extract_exons.py genome.gtf > genome.exon

Build the index:

hisat2-build -p 16 --exon genome.exon --ss genome.ss genome.fa genome_tran

# -p: number of threads; genome_tran is the basename of the index files that are written.

2. Running the alignment

hisat2 [options]* -x HISAT2_IDX {-1 M1 -2 M2 | -U R | --sra-acc SRA_ACCESSION} [-S HIT]

-x # The basename of the index for the reference genome.

-1 # Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq.

-2 # Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq.

-S # File to write SAM alignments to.
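As a minimal sketch of how these options fit together for one paired-end sample, the helper below only prints the command line instead of running it. The function name `hisat2_paired_cmd`, the sample prefix `sampleA`, and the `_1.fq` / `_2.fq` naming are assumptions made for illustration; `genome_tran` is the index basename built in step 1.

```shell
# Print (not run) a paired-end HISAT2 command line.
# $1 = index basename, $2 = sample prefix (hypothetical naming), $3 = threads
hisat2_paired_cmd() {
    printf 'hisat2 -p %s -x %s -1 %s_1.fq -2 %s_2.fq -S %s.sam\n' \
        "$3" "$1" "$2" "$2" "$2"
}

# Show the command for a hypothetical sample called sampleA.
hisat2_paired_cmd genome_tran sampleA 8
```

Once the printed command looks right, it can be pasted into the terminal or run with `eval "$(hisat2_paired_cmd genome_tran sampleA 8)"`.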

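3. Interpreting the output file

Column 1: read name;

Column 2: FLAG. The FLAG is the sum of all applicable bit values listed below (each one a power of two), and each bit has its own meaning: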

1: The read is one of a pair

2: The alignment is one end of a proper paired-end alignment

4: The read has no reported alignments

8: The read is one of a pair and its mate has no reported alignments

16: The alignment is to the reverse reference strand

32: The other mate in the paired-end alignment is aligned to the reverse reference strand

64: The read is mate 1 in a pair

128: The read is mate 2 in a pair
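Because the FLAG is a sum of bit values, it can be decoded by testing each bit in turn. The sketch below does this with POSIX shell arithmetic; the function name `decode_flag` and the short labels (`paired`, `proper_pair`, and so on) are my own shorthand for the eight meanings listed above, not names used by HISAT2.

```shell
# Decode a SAM FLAG value into a comma-separated list of labels.
# The labels are illustrative shorthand for the bit meanings listed above.
decode_flag() {
    flag=$1
    out=""
    for pair in "1:paired" "2:proper_pair" "4:unmapped" "8:mate_unmapped" \
                "16:reverse" "32:mate_reverse" "64:mate1" "128:mate2"; do
        bit=${pair%%:*}     # numeric bit value before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_flag 99     # prints: paired,proper_pair,mate_reverse,mate1
decode_flag 147    # prints: paired,proper_pair,reverse,mate2
```

Flags 99 and 147 are a common combination for the two mates of a properly paired read: 99 = 1 + 2 + 32 + 64 and 147 = 1 + 2 + 16 + 128.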

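Column 3: name of the chromosome (reference sequence) the read is aligned to;

Column 4: 1-based leftmost position of the alignment on the reference genome;

Column 5: mapping quality (0-60);

Column 6: CIGAR string (records insertions, deletions, and so on). The CIGAR string describes how every base of the read aligns to the reference. For example, CIGAR = 150M means that all 150 bases of the read are aligned to the reference genome (M covers both matches and mismatches);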
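To make the CIGAR description concrete, here is a small sketch that walks through a CIGAR string one operation at a time and adds up how many reference bases the alignment spans. It assumes the standard SAM operation letters (M, I, D, N, S, H, P, =, X), of which M, D, N, = and X consume reference positions; the function name `cigar_ref_span` is made up for this example.

```shell
# Sum the reference bases covered by a CIGAR string.
# Operations M, D, N, = and X advance along the reference; I, S, H, P do not.
cigar_ref_span() {
    printf '%s\n' "$1" | awk '{
        cigar = $0
        span = 0
        # Peel off one "<length><operation>" token per iteration.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op  = substr(cigar, RLENGTH, 1)
            if (op ~ /[MDN=X]/) span += len
            cigar = substr(cigar, RLENGTH + 1)
        }
        print span
    }'
}

cigar_ref_span 150M            # prints: 150
cigar_ref_span 50M1000N100M    # prints: 1150
```

The second call shows a spliced read: the 1000N operation is the intron skipped between two aligned blocks, which is why the read covers 1150 reference bases even though it is only 150 bases long.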

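Column 7: chromosome the mate is aligned to; = means the same chromosome as this read;

Column 8: position of the mate's alignment on that chromosome;

Column 9: template length (library insert size);

Column 10: read sequence;

Column 11: base quality string.

Column 12 onward: optional fields.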

AS:i: Alignment score. Can be negative. Only present if SAM record is for an aligned read.

ZS:i: Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than AS:i.

YS:i: Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.

XN:i: The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.

XM:i: The number of mismatches in the alignment. Only present if SAM record is for an aligned read.

XO:i: The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.

XG:i: The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.

NM:i: The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read.

YF:Z: String indicating reason why the read was filtered out. Only appears for reads that were filtered out.

YT:Z: Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.

MD:Z: A string representation of the mismatched reference bases in the alignment. See the SAM format specification for details. Only present if SAM record is for an aligned read.

XS:A: Values of + and - indicate the read is mapped to transcripts on sense and anti-sense strands, respectively. Spliced alignments need to have this field, which is required by Cufflinks and StringTie.

HISAT2 can report this field for canonical splice sites (GT/AG), but not for non-canonical splice sites. You can direct HISAT2 not to output alignments involving non-canonical splice sites by using --pen-noncansplice 1000000.

NH:i: The number of mapped locations for the read or the pair.

Zs:Z: When the alignment of a read involves SNPs that are in the index, this field indicates where exactly the read involves the SNPs. It is similar to the MD:Z field above. For example, Zs:Z:1|S|rs3747203,97|S|rs16990981 indicates that the second base of the read corresponds to a known SNP (ID: rs3747203), and that 97 bases after the third base (the base following that SNP), the 100th base of the read involves another known SNP (ID: rs16990981). 'S' indicates a single nucleotide polymorphism; 'D' and 'I' indicate a deletion and an insertion, respectively.
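Since the optional fields start at column 12 and can appear in any order, it is usually easier to look them up by tag than by position. The following is a rough sketch of that idea with awk: it reads SAM lines on standard input, skips header lines, and prints the read name plus the value of one tag (or NA if the tag is absent). The function name `sam_tag` and the sample record are made up for illustration.

```shell
# Print "read_name<TAB>value" for one optional tag (e.g. NH) per SAM record.
# Reads SAM text on stdin; prints NA when the tag is missing.
sam_tag() {
    awk -F '\t' -v tag="$1" '
        /^@/ { next }                          # skip header lines
        {
            val = "NA"
            prefix_len = length(tag) + 3       # e.g. "NH:i:" is 5 characters
            for (i = 12; i <= NF; i++) {       # optional fields start at column 12
                if (substr($i, 1, length(tag) + 1) == (tag ":")) {
                    val = substr($i, prefix_len + 1)
                    break
                }
            }
            print $1 "\t" val
        }'
}

# Hypothetical record: a 4-base read reported at one location (NH:i:1).
printf 'read1\t99\t1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tAS:i:0\tNH:i:1\n' | sam_tag NH
```

Reads with NH equal to 1 were reported at a single location, so piping a real file through `sam_tag NH` is a quick way to see how many reads are multi-mapped.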

Copyright notice:
Author: Zad
Link: https://www.techfm.club/p/61251.html
Source: TechFM
Copyright belongs to the author. Do not repost without permission.
