2. Nanopore三代测序fastq基因组比对

倾城 • 2025-03-10 07:04 • 杂文

因为预处理会过滤掉90%以上的reads，所以分别用原始数据和过滤后的数据进行基因组比对，看是否有差异。

一、先用未经过预处理的fastq文件进行比对（Test）

查看文件里有多少读段

cat BC07.fastq|grep "^@"|wc -l
818523
cat BC08.fastq|grep "^@"|wc -l
502760

建立比对数据库
（1）准备基因序列和基因注释文件

从Ensemble数据库下载小鼠基因组序列
GRCm39 (GCA_000001635.9)
从Gencode网站下载基因注释GTF文件
GRCm39 Comprehensive gene annotation(GHR)
解压基因组序列和基因注释

gunzip Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
gunzip gencode.vM36.annotation.gtf.gz

（2）Hisat2比对RNA-seq数据

师兄说Hisat2并不是一个很好用的比对工具，对于三代测序的长reads，minimap2才是更好的比对工具。

安装Hisat2软件

conda install -c bioconda hisat2

使用Hisat2构建索引

hisat2-build -p 8 Mus_musculus.GRCm39.dna_sm.primary_assembly.fa GRCm39_index

GRCm39_index → 生成的索引前缀（将创建多个索引文件）

Hisat2比对单端reads

hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC07.fastq -S BC07.sam
802807 reads; of these:
  802807 (100.00%) were unpaired; of these:
    802224 (99.93%) aligned 0 times
    462 (0.06%) aligned exactly 1 time
    121 (0.02%) aligned >1 times
0.07% overall alignment rate 

hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC08.fastq -S BC08.sam
479463 reads; of these:
  479463 (100.00%) were unpaired; of these:
    479102 (99.92%) aligned 0 times
    348 (0.07%) aligned exactly 1 time
    13 (0.00%) aligned >1 times
0.08% overall alignment rate 
#比对效率极低

比对结果分析（Samtools）
（1）samtools生成bam二进制文件，并建立索引

samtools view -@ 8 -bS BC07.sam | samtools sort -@ 8 -o BC07_sorted.bam
samtools index BC07_sorted.bam

samtools view -@ 8 -bS BC08.sam | samtools sort -@ 8 -o BC08_sorted.bam
samtools index BC08_sorted.bam

（2）通过samtools输出比对的统计信息

samtools flagstat BC07_sorted.bam                                
802965 + 0 in total (QC-passed reads + QC-failed reads)
802807 + 0 primary
158 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
741 + 0 mapped (0.09% : N/A)
583 + 0 primary mapped (0.07% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat BC08_sorted.bam
479481 + 0 in total (QC-passed reads + QC-failed reads)
479463 + 0 primary
18 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
379 + 0 mapped (0.08% : N/A)
361 + 0 primary mapped (0.08% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
#为什么比对效率这么低呢？

IGV可视化
（1）打开IGV软件，选择Mouse基因组

IGV

（2）导入：File>Load from file...>.bam文件（.bam.bai要和.bam文件在同一目录下）

因为比对上的序列数量太少了，用samtools查看比对序列在染色体上的位置

samtools idxstats BC07_sorted.bam
1       195154279       30      0
10      130530862       15      0
11      121973369       29      0
12      120092757       210     0
13      120883175       11      0
14      125139656       14      0
15      104073951       20      0
16      98008968        23      0
17      95294699        122     0
18      90720763        9       0
19      61420004        6       0
2       181755017       26      0
3       159745316       12      0
4       156860686       25      0
5       151758149       16      0
6       149588044       17      0
7       144995196       26      0
8       130127694       25      0
9       124359700       34      0
MT      16299   45      0
X       169476592       8       0
Y       91455967        0       0

samtools view BC07_sorted.bam | awk '$3=="12" {print $3, $4}' | sort | uniq -c | sort -nr | head -n 10
     77 12 69206069
     13 12 69408101
      9 12 69408102
      7 12 69408093
      4 12 69408136
      4 12 69206071
      3 12 69408139
      3 12 69408134
      3 12 69408111
      3 12 69408108

查看第12号染色体上比对的位置，把位置复制到IGV中即可查看

所以IGV可视化可以看到比对上的都是短片段，不是长片段，这可能就和Hisat2的工具的不适合有关系。

二、用经过预处理的fastq文件进行比对（Test）

预处理fastq来源于“1. Nanopore三代测序fastq数据预处理”中的*output.fastq文件

查看预处理文件的reads数

cat BC07_output.fastq |grep "^@"|wc -l
105907

cat BC08_output.fastq |grep "^@"|wc -l
33509

进一步去掉可能的rRNA污染（可省略）

安装SortMeRNA

conda install -c bioconda sortmerna

mkdir -p ~/sortmerna_db && cd ~/sortmerna_db
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-bac-16s-id90.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-bac-23s-id98.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-euk-18s-id95.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-euk-28s-id98.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/rfam-5s-database-id98.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/rfam-5.8s-database-id98.fasta

如果下载的特别慢，则手动下载rRNA参考基因组，保存在新建的sortmerna_db文件夹中
silva-bac-16s-id90.fasta
silva-bac-23s-id98.fasta
silva-euk-18s-id95.fasta
silva-euk-28s-id98.fasta
rfam-5s-database-id98.fasta
rfam-5.8s-database-id98.fasta
sortmerna做rRNA基因组比对

sortmerna --ref silva-bac-16s-id90.fasta /
          --ref silva-bac-23s-id98.fasta /
          --ref silva-euk-18s-id95.fasta /
          --ref silva-euk-28s-id98.fasta /
          --ref rfam-5s-database-id98.fasta /
          --ref rfam-5.8s-database-id98.fasta /
          --reads ../nanopore-data/FMR1/barcode-241126/BC08_output.fastq /
          --aligned ../nanopore-data/FMR1/barcode-241126/BC08_output_rRNA /
          --other ../nanopore-data/FMR1/barcode-241126/BC08_output_non_rRNA /
          --fastx

sortmerna --ref silva-bac-16s-id90.fasta /
          --ref silva-bac-23s-id98.fasta /
          --ref silva-euk-18s-id95.fasta /
          --ref silva-euk-28s-id98.fasta /
          --ref rfam-5s-database-id98.fasta /
          --ref rfam-5.8s-database-id98.fasta /
          --reads ../nanopore-data/FMR1/barcode-241126/BC07_output.fastq /
          --aligned ../nanopore-data/FMR1/barcode-241126/BC07_output_rRNA /
          --other ../nanopore-data/FMR1/barcode-241126/BC07_output_non_rRNA /
          --fastx

再次检查处理后的reads数

cat BC07_output_non_rRNA.fq |grep "^@"|wc -l
68184

cat BC08_output_non_rRNA.fq |grep "^@"|wc -l
21547

BC08去掉了60%的rRNA，BC07去掉了20%的rRNA，但是BC07是input样品，可能BC08就是富集了rRNA的mRNA，所以考虑不用去除rRNA基因组的fastq文件，直接使用*output.fastq文件。

Hisat2比对RNA-seq数据

hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC08_output.fastq -S BC08_output.sam
32834 reads; of these:
  32834 (100.00%) were unpaired; of these:
    32682 (99.54%) aligned 0 times
    144 (0.44%) aligned exactly 1 time
    8 (0.02%) aligned >1 times
0.46% overall alignment rate

hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC07_output.fastq -S BC07_output.sam
105179 reads; of these:
  105179 (100.00%) were unpaired; of these:
    104630 (99.48%) aligned 0 times
    350 (0.33%) aligned exactly 1 time
    199 (0.19%) aligned >1 times
0.52% overall alignment rate

预处理和不预处理基因组比对的reads比较：

samtools生成bam二进制文件，并建立索引

samtools view -@ 8 -bS BC07_output.sam | samtools sort -@ 8 -o BC07_output_sorted.bam          
samtools index BC07_output_sorted.bam

samtools view -@ 8 -bS BC08_output.sam | samtools sort -@ 8 -o BC08_output_sorted.bam
samtools index BC08_output_sorted.bam

IGV可视化（同前）

主要问题：

经过质量控制后能比对上的reads比例上升了，但是总数下降了
不知道是不是RIP建库的问题，基因组比对的方法能比对上的reads数太太太太少了
按同样的流程把BC07和BC08的重复数据再跑一遍pipeline。

版权声明：
作者：倾城
链接：https://www.techfm.club/p/196991.html
来源：TechFM
文章版权归作者所有，未经允许请勿转载。

THE END

GitHub 网站

二维码

【喷嚏图卦20250310】领导没日没夜为人民服务诶

< <上一篇

逛黄沙，买海鲜～

下一篇>>

搜索内容

2. Nanopore三代测序fastq基因组比对

一、先用未经过预处理的fastq文件进行比对（Test）

二、用经过预处理的fastq文件进行比对（Test）

取消回复

共有 0 条评论

Ads