bed基因注释

日常瞎掰

  工欲善其事必先利其器,好用的工具对做事来说很重要,可以让你做起事来收到事半功倍的效果。前段时间分析数据时,需要给intervals做基因注释,需要操作gtf文件,最后发现一款非常好用的python包 — pyranges。该包可以很轻松地将bedgtfgff3等格式文件读取为PyRanges对象,该对象有点类似潘大师的DataFrame,熟悉潘大师的同学应该能体会到DataFrame操作数据的快乐。同样,皮软杰斯内置操作intervals的方法也绝对能给你带来无语伦比的畅快。废话不多说,下面来感受一下使用皮软杰斯做注释基因有多方便。

读取文件

  read_bed方法可以用来读取bed格式的文件,read_gtfread_gff3分别用来读取gtfgff3文件:

import pyranges as pr

bed = pr.read_bed('data.bed')
bed
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

gtf = pr.read_gtf('gencode.vM24.annotation.gtf')
gtf
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
| Chromosome   | Source     | Feature     | Start     | End       | Score      | Strand       | Frame      | gene_id              | gene_type      | +15   |
| (category)   | (object)   | (object)    | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)             | (object)       | ...   |
|--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------|
| chr1         | HAVANA     | gene        | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | HAVANA     | transcript  | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | HAVANA     | exon        | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | ENSEMBL    | gene        | 3102015   | 3102125   | .          | +            | .          | ENSMUSG00000064842.1 | snRNA          | ...   |
| ...          | ...        | ...         | ...       | ...       | ...        | ...          | ...        | ...                  | ...            | ...   |
| chrY         | ENSEMBL    | CDS         | 90838871  | 90839177  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | start_codon | 90839174  | 90839177  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | stop_codon  | 90838868  | 90838871  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | UTR         | 90838868  | 90838871  | .          | -            | .          | ENSMUSG00000096850.1 | protein_coding | ...   |
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
Stranded PyRanges object has 1,870,769 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

  上面说过PyRanges对象类似潘大师的DataFrame,所以可以用类似的方法来过滤数据,如下面只保留gtf里面Feature属性为gene的行:

gene = gtf[gtf.Feature == 'gene']
gene
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
| Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_id              | gene_type            | +15   |
| (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)             | (object)             | ...   |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------|
| chr1         | HAVANA     | gene       | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC                  | ...   |
| chr1         | ENSEMBL    | gene       | 3102015   | 3102125   | .          | +            | .          | ENSMUSG00000064842.1 | snRNA                | ...   |
| chr1         | HAVANA     | gene       | 3252756   | 3253236   | .          | +            | .          | ENSMUSG00000102851.1 | processed_pseudogene | ...   |
| chr1         | HAVANA     | gene       | 3466586   | 3513553   | .          | +            | .          | ENSMUSG00000089699.1 | lncRNA               | ...   |
| ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                  | ...                  | ...   |
| chrY         | HAVANA     | gene       | 90603500  | 90605864  | .          | -            | .          | ENSMUSG00000099619.6 | lncRNA               | ...   |
| chrY         | HAVANA     | gene       | 90665345  | 90667625  | .          | -            | .          | ENSMUSG00000099399.6 | lncRNA               | ...   |
| chrY         | HAVANA     | gene       | 90752426  | 90755467  | .          | -            | .          | ENSMUSG00000095366.2 | lncRNA               | ...   |
| chrY         | ENSEMBL    | gene       | 90838868  | 90839177  | .          | -            | .          | ENSMUSG00000096850.1 | protein_coding       | ...   |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
Stranded PyRanges object has 55,385 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

基因注释

  皮软杰斯里面有nearestk_nearest两种方法可以用。通过函数名我们就可以猜到这两个函数都可以用来查找interval1集合里每一个区间在interval2集合里面距离最近的区间,默认是返回有交集的区间,而k_nearest可以指定返回的区间个数。通过代码看一下区别:

bed_test = pr.PyRanges(chromosomes="chr1", strands="+", starts=[3074300], ends=[3074322])
bed
+--------------+-----------+-----------+--------------+
| Chromosome   |     Start |       End | Strand       |
| (category)   |   (int32) |   (int32) | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         |   3074300 |   3074322 | +            |
+--------------+-----------+-----------+--------------+

bed_test.nearest(gtf,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_gtf |   End_gtf | Score      | Strand_gtf   | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |     (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |     3073252 |   3074322 | .          | +            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+

bed_test.nearest(gene,overlap=False,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_gtf |   End_gtf | Score      | Strand_gtf   | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |     (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | ENSEMBL    | gene       |     3102015 |   3102125 | .          | +            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+

bed_test.k_nearest(gene,k=5,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_b |     End_b | Score      | Strand_b     | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |   (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3073252 |   3074322 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | ENSEMBL    | gene       |   3102015 |   3102125 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3205900 |   3671498 | .          | -            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3252756 |   3253236 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3365730 |   3368549 | .          | -            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+

  从上面的代码可知有无overlap=False参数的区别,以及nearestk_nearest函数的区别。当然,还有一些比较实用的参数如strandedness:"same",指定链是否相同;how:"upstream"、 "downstream"分别指定选取最近区间的方位,nb_cpu:指定使用的cpu数。除此之外,k_nearest还有一个参数ties可以指定有距离相同的区间时如何返回,可选值为"first"、"last"、"different"。
  suffix参数指定当两个集合有相同列名时,后一个集合列名会添加的字段,默认是'_b'。不过,好像k_nearest函数有BUG这个参数无效,但并不影响结果可以忽略。
  从上面可以看出k_nearest函数功能上略胜一筹,注释基因时用此函数相对更自由一些:

anno = bed.k_nearest(gene, ties='different')
anno = anno[['Chromosome', 'Start', 'End, 'Strand', 'gene_id']]
anno
+--------------+-----------+-----------+--------------+----------------------+
| Chromosome   | Start     | End       | Strand       | gene_id              |
| (category)   | (int32)   | (int32)   | (category)   | (object)             |
|--------------+-----------+-----------+--------------+----------------------|
| chr1         | 212609534 | 212609559 | +            | ENSMUSG00000102307.1 |
| chr1         | 169887529 | 169887554 | +            | ENSMUSG00000103110.1 |
| chr1         | 216711011 | 216711036 | +            | ENSMUSG00000102307.1 |
| chr1         | 144227079 | 144227104 | +            | ENSMUSG00000100389.1 |
| ...          | ...       | ...       | ...          | ...                  |
| chrY         | 15224235  | 15224260  | -            | ENSMUSG00000102835.5 |
| chrY         | 13517892  | 13517917  | -            | ENSMUSG00000102174.1 |
| chrY         | 8010951   | 8010976   | -            | ENSMUSG00000099861.1 |
| chrY         | 7405376   | 7405401   | -            | ENSMUSG00000118447.1 |
+--------------+-----------+-----------+--------------+----------------------+

anno.to_csv("data_anno.txt", sep="/t", header=True)

结束语

  pyranges包的功能还是挺多的,不仅包含区间的交集、差集,还可以进行一些统计运算如JaccardFisher等。虽然,目前有不少现成的软件如homerchipseeker可以做基因注释,很多时候我们可以直接使用这些软件即可,但pyranges还是值得学习收藏一下,也许做个性化数据处理的时候使用它会来得更为方便些。

往期回顾

scanpy踩坑实录
差异基因密度分布
R绘图配色总结
saddleplot | A/B compartments
双曲线火山图一键拿捏

版权声明:
作者:主机优惠
链接:https://www.techfm.club/p/44970.html
来源:TechFM
文章版权归作者所有,未经允许请勿转载。

THE END
分享
二维码
< <上一篇
下一篇>>