<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<id>https://xiaobin-phd.github.io</id>
<title>阿宾的BLOG</title>
<updated>2024-11-03T01:33:16.097Z</updated>
<generator>https://github.com/jpmonette/feed</generator>
<link rel="alternate" href="https://xiaobin-phd.github.io"/>
<link rel="self" href="https://xiaobin-phd.github.io/atom.xml"/>
<logo>https://xiaobin-phd.github.io/images/avatar.png</logo>
<icon>https://xiaobin-phd.github.io/favicon.ico</icon>
<rights>All rights reserved 2024, 阿宾的BLOG</rights>
<entry>
<title type="html"><![CDATA[11/02 Meeting Summary]]></title>
<id>https://xiaobin-phd.github.io/post/1102-hui-yi-zong-jie/</id>
<link href="https://xiaobin-phd.github.io/post/1102-hui-yi-zong-jie/">
</link>
<updated>2024-11-03T01:21:23.000Z</updated>
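<content type="html"><![CDATA[<h3 id="1-会议概括">1. Meeting overview</h3>
<p>Recently the lab has seen a lack of initiative, poor follow-through on tasks, weak personal drive, and waste of reagents and consumables. Professor Shen criticized lab members for this and set out explicit corrective requirements. Everyone is asked to review and correct their own conduct and to take effective measures to change the current situation, so as to improve the lab's efficiency and sense of responsibility.</p>
<h3 id="2-整改内容">2. Corrective measures</h3>
<h4 id="1-实验室记录本规范">1. Lab notebook standards (★★★★★)</h4>
<p>Follow the current Huazhong Agricultural University standard strictly; entries must be truthful and made promptly. Group leaders will be appointed to check the notebooks every month to make sure they are compliant and accurate.</p>
<h4 id="2-上班考勤制度">2. Attendance (★★★★)</h4>
<p>Morning clock-in: 8:00-12:00. Afternoon clock-in: 14:00-22:00.</p>
<h4 id="3-经费节约">3. Saving funds (★★★★)</h4>
<ol>
<li>
<p>Pointless failed and repeated experiments are the main source of waste. Standard operating procedures (SOPs) for the lab's routine experiments need to be compiled, with a designated person responsible for each, so that every experiment is carried out to standard. Xue Lilan coordinates the division of work.</p>
</li>
<li>
<p>Sequencing and primer orders need a designated reviewer; the experiments of international students also need to be managed properly.</p>
</li>
<li>
<p>Each group holds regular review meetings to analyze spending and close gaps, on the principle of "save where you should, spend where you should".</p>
</li>
<li>
<p>Each person should check their practice against the previously issued corrective measures, standardize their experimental operations, and build the habit of saving funds.</p>
</li>
</ol>
<h4 id="4-个人问题">4. Personal issues (★★★)</h4>
<ol>
<li>
<p>Set clear goals, act on them, and correct your work attitude promptly.</p>
</li>
<li>
<p>Tasks assigned by the supervisor must be carried out promptly and reported on regularly, so that the loop is closed.</p>
</li>
<li>
<p>Back up all raw data and result files to the lab server right away, and build good data-management habits.</p>
</li>
<li>
<p>Junior students must read at least 5 relevant papers closely to improve their literature-reading skills.</p>
</li>
<li>
<p>Everyone should get into the habit of writing summaries, such as project reviews and experimental notes, to improve their writing.</p>
</li>
<li>
<p>In group-meeting presentations, add a 5-minute new-technology segment covering the basic principle, how the technique is applied, and personal reflections, to foster individual creativity.</p>
</li>
<li>
<p>The lab needs to build a culture of passing on knowledge: senior students are responsible for mentoring junior students, including teaching experiments and guiding projects; junior students should take an active part, report proactively, and respect the effort of their seniors. Both sides should put themselves in the other's position and consider the other's needs, to foster good cooperation and strengthen team cohesion.</p>
</li>
</ol>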
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Expanding the Metabolic Databases]]></title>
<id>https://xiaobin-phd.github.io/post/dai-xie-shu-ju-ku-tuo-zhan/</id>
<link href="https://xiaobin-phd.github.io/post/dai-xie-shu-ju-ku-tuo-zhan/">
</link>
<updated>2024-09-17T00:37:33.000Z</updated>
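<content type="html"><![CDATA[<h3 id="代谢相关数据库拓展">Expanding the metabolism-related databases</h3>
<p><strong>Initial annotation is complete: using the eggNOG database, we obtained 3964 genes (with EC numbers).</strong></p>
<p><strong>The remaining work has two steps: 1. Manually verify the initial results; some genes may be only loosely related to metabolism and can be removed where appropriate. 2. Survey the literature, annotate against additional databases, and compare with the initial results, to end up with a comprehensive yet stringent set of metabolism-related genes.</strong></p>
<pre><code class="language-bash"># Common databases
1. HMDB: The Human Metabolome Database (HMDB) is a free electronic database with detailed information on small-molecule metabolites found in the human body. It is used in metabolomics, clinical chemistry, and biomarker discovery. The database holds 114,026 metabolite entries, covering both water-soluble and lipid-soluble metabolites, as well as metabolites regarded as abundant (> 1 uM) or relatively rare (< 1 nM). In addition, 5702 protein sequences are linked to these metabolites.
http://www.hmdb.ca/
2. KEGG: the Kyoto Encyclopedia of Genes and Genomes, which collects the metabolic pathways of all known metabolites across organisms. It supports searching metabolic networks and mapping onto metabolic pathways. The modules most relevant to metabolomics are KEGG PATHWAY, KEGG DISEASE, KEGG COMPOUND, and KEGG REACTION.
https://www.genome.jp/kegg/ligand.html
3. Reactome: Reactome is an open-source, open-access, manually curated, peer-reviewed pathway database. It contains signaling and metabolic molecules and the relationships that organize them into biological pathways and processes. The core unit of the Reactome data model is the reaction. The entities taking part in reactions (nucleic acids, proteins, complexes, vaccines, anti-cancer therapeutics, and small molecules) form a network of biological interactions and are grouped into pathways. Examples of biological pathways in Reactome include classical intermediary metabolism, signaling, transcriptional regulation, apoptosis, and disease. (Complementary to KEGG.)
Analysis reference: https://blog.csdn.net/weixin_43839173/article/details/125318973
https://reactome.org/
# Update 09/11: dropped this database; there is no particularly good way to annotate against it locally.
</code></pre>
<pre><code class="language-bash"># Other databases, following systems-biology papers
1. BiGG Models: an integrated, metabolomics-based systems-biology database created at the University of California, San Diego (USA). Its main feature is that it holds metabolic map models for a range of model organisms. Users can pull up the whole metabolic network of an organism, inspect an individual biochemical reaction, or search for metabolites. The database currently contains 2766 metabolites and 3311 metabolic reactions.
http://bigg.ucsd.edu/
ref: https://doi.org/10.1371/journal.pcbi.1009870
2. BRENDA: a database dedicated to functional information on enzymes, including metabolic enzymes. It is an online resource with extensive enzyme information such as nomenclature, classification, reaction type, catalyzed substances, substrates, and products. BRENDA also covers catalytic mechanisms, reaction kinetics, and enzyme structure and sequence information.
https://www.brenda-enzymes.org/
doi:10.1038/nprot.2009.203
# Update 24/09/11: there is no need to keep enlarging the database set in pursuit of methodological perfection, so these two databases were dropped; we want a gene set as early as possible so that experiments can start.
</code></pre>
<p><u>Six databases were shortlisted to explore whether they can be used for local annotation.</u></p>
<p><strong>Update 09/11: settled on eggNOG, KEGG, and HMDB as the three databases for the analysis.</strong></p>
<h3 id="1-hmdb官网蛋白文件下载界面httpshmdbcadownloads">1. HMDB (official protein file download page: https://hmdb.ca/downloads)</h3>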
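<pre><code class="language-bash"># sed -i 's/old/new/g' FILE  (the FASTA is malformed: headers do not start with ">")
sed -i 's/HMDBP/>HMDBP/g' hmdb.fasta
# The downloaded protein file has only 5629 sequences, whereas the website shows 8299. The HMDB team has been contacted about this; the downloaded file is used for now.
grep "HMDB" hmdb.fasta |wc -l
# 5629
# cd-hit check of redundancy in this database: at 95% identity only a few hundred sequences are redundant, as one would hope for a human metabolome database!
makeblastdb -in hmdb.fasta -dbtype prot -title hm -parse_seqids -out ./hm
nohup blastp -query /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_protein.faa -db ~/0_rawdata/database/hmdb/hm -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 8 > pig.hm.tab &
cat pig.hm.tab|cut -f1 > id.list
sort -n id.list|uniq > rmdup.list
# Remove the meaningless "Warning..." lines
sed '/Warning/d' rmdup.list > delete.rmdup.list # this yields as many as 38706 protein sequences, versus only 12539 from eggNOG.
# Reciprocal best hits are probably needed, otherwise far too many genes come out (in a test, the one-way alignment ended with 12458 gene sequences, whereas eggNOG gave only 3694, which is closer to what the literature reports)
</code></pre>
<p><strong>Reciprocal best hits, with a database built from the pig proteins</strong></p>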
<pre><code class="language-bash">mkdir pig && cd pig
makeblastdb -in ../GCF_000003025.6_Sscrofa11.1_protein.faa -dbtype prot -title pig -parse_seqids -out ./pig
nohup blastp -query ~/0_rawdata/database/hmdb/hmdb.fasta -db /data/xb/1_pig_genome/pig/pig -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 8 > hmdb.pig.tab &
sed -i '/Warning/d' hmdb.pig.tab
</code></pre>
<pre><code class="language-python"># Reciprocal best hits
import pandas as pd
# Paths to the forward and reverse BLAST results
forward_blast_path = '/home/xb/1_results/240906_pig_genome_annot/hmdb/pig.hm.tab'
reverse_blast_path = '/home/xb/1_results/240906_pig_genome_annot/hmdb/rbh/hmdb.pig.tab'
# Read the BLAST result files (outfmt 6 assumed)
blast_forward = pd.read_csv(forward_blast_path, sep='\t', header=None)
blast_reverse = pd.read_csv(reverse_blast_path, sep='\t', header=None)
# Set the column names
columns = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
blast_forward.columns = columns
blast_reverse.columns = columns
# Keep the query/subject pairs from each direction and make them unique
best_hits_forward = blast_forward[['qseqid', 'sseqid']].drop_duplicates()
best_hits_reverse = blast_reverse[['sseqid', 'qseqid']].drop_duplicates()
# Find the reciprocal best hits: forward query == reverse subject and forward subject == reverse query
rbh = pd.merge(best_hits_forward, best_hits_reverse, left_on=['qseqid', 'sseqid'], right_on=['sseqid', 'qseqid'])
# Save the result to the given path
output_path = '/home/xb/1_results/240906_pig_genome_annot/hmdb/rbh/rbh.tab'
rbh.to_csv(output_path, sep='\t', index=False)
</code></pre>
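<p>The script above relies on <code>-max_target_seqs 1</code> to leave one subject per query. As a minimal sketch of the same idea with the "best" hit made explicit (highest bitscore per query), the following standard-library version may help; the function names and the toy IDs are hypothetical and not part of the original pipeline.</p>
<pre><code class="language-python">import csv


def read_outfmt6(path):
    """Yield (qseqid, sseqid, bitscore) from a BLAST outfmt 6 file."""
    with open(path, newline='') as handle:
        for fields in csv.reader(handle, delimiter='\t'):
            if len(fields) == 12:  # skip stray non-tabular lines such as warnings
                yield fields[0], fields[1], float(fields[11])


def best_hits(rows):
    """Map each query to the subject of its highest-bitscore hit."""
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Return sorted (pig, hmdb) pairs that are each other's best hit."""
    fwd = best_hits(forward_rows)  # pig protein to HMDB protein
    rev = best_hits(reverse_rows)  # HMDB protein to pig protein
    return sorted((pig, hmdb) for pig, hmdb in fwd.items() if rev.get(hmdb) == pig)


# Toy data with made-up IDs: only P1 and H1 pick each other.
forward = [('P1', 'H1', 200.0), ('P1', 'H2', 150.0), ('P2', 'H2', 90.0)]
reverse = [('H1', 'P1', 210.0), ('H2', 'P3', 80.0)]
pairs = reciprocal_best_hits(forward, reverse)
print(pairs)
</code></pre>
<p>On real data, <code>read_outfmt6(path)</code> would supply the rows for each direction. Ties in bitscore are resolved by first occurrence here, which is a simplification.</p>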
<p><strong>Extract the Gene_id for each protein ID from the GBFF file</strong></p>
<pre><code class="language-python">from Bio import SeqIO


def extract_gene_ids(gbff_file, protein_ids_file, output_file):
    # Read the list of protein IDs
    with open(protein_ids_file, 'r') as f:
        protein_ids = [line.strip() for line in f.readlines()]
    # Set copy of the IDs for fast membership tests while scanning the GBFF file
    wanted = set(protein_ids)
    # Dictionary mapping protein ID to gene ID
    protein_to_gene = {}
    # Parse the GBFF file
    for record in SeqIO.parse(gbff_file, "genbank"):
        for feature in record.features:
            if feature.type == "CDS":
                # Extract the protein_id
                if "protein_id" in feature.qualifiers:
                    protein_id = feature.qualifiers["protein_id"][0]
                    if protein_id in wanted:
                        # Extract the GeneID
                        gene_ids = [xref.split(":")[1] for xref in feature.qualifiers.get("db_xref", []) if xref.startswith("GeneID")]
                        if gene_ids:
                            protein_to_gene[protein_id] = gene_ids[0]  # keep only the first GeneID
    # Write the result to a file
    with open(output_file, 'w') as out_f:
        out_f.write("Protein_ID\tGene_ID\n")
        for protein_id in protein_ids:
            gene_id = protein_to_gene.get(protein_id, "Not found")
            out_f.write(f"{protein_id}\t{gene_id}\n")


# Example usage
gbff_file = "/data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_genomic.gbff"  # replace with the path to your GBFF file
protein_ids_file = "/home/xb/1_results/240906_pig_genome_annot/hmdb/rbh/rbh.list"  # replace with the path to the txt file of protein IDs
output_file = "/home/xb/1_results/240906_pig_genome_annot/hmdb/extract_from_gbff/gene.id.txt"  # path of the output file
extract_gene_ids(gbff_file, protein_ids_file, output_file)
</code></pre>
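<p>The GeneID lookup is the step most likely to trip on odd qualifiers, so here is a small stand-alone sketch of just that part, testable without Biopython or a GBFF file. The helper name and the example values are made up for illustration.</p>
<pre><code class="language-python">def gene_id_from_xrefs(db_xrefs):
    """Return the first GeneID in a list of db_xref strings, or None if absent."""
    for xref in db_xrefs:
        database, _, identifier = xref.partition(':')
        if database == 'GeneID' and identifier:
            return identifier
    return None


# Hypothetical db_xref lists as they might appear on CDS features.
print(gene_id_from_xrefs(['taxon:9823', 'GeneID:12345']))
print(gene_id_from_xrefs(['taxon:9823']))
</code></pre>
<p>Returning <code>None</code> instead of raising lets the caller decide how to record proteins without a GeneID, much as the script above writes "Not found".</p>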
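<pre><code class="language-bash"># Extract the gene_id column (column 2) and de-duplicate
cat gene.id.txt|cut -f2 > id.list
sort -n id.list |uniq > rmdup.gene.list # final result: 4930 genes
# Test 09/11: after clustering the pig protein FASTA with cd-hit at 95%, the number of proteins dropped from 63575 to 28102. Re-running the reciprocal best hits gave 4929 proteins, nearly the same as without clustering, so clustering saves compute and has little effect on the result.
# Test 09/13: with the e-value threshold set to 1e-10 we got 4948 proteins, close to the 4953 obtained at 1e-5.
</code></pre>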
<pre><code class="language-bash">import os
import pandas as pd
# Read the two ID tables
table1 = pd.read_csv('/home/xb/1_results/240906_pig_genome_annot/eggnog/extract_from_gbff/rmdup.list', header=None)
table2 = pd.read_csv('/home/xb/1_results/240906_pig_genome_annot/hmdb/extract_from_gbff/rmdup.gene.list', header=None)
# Name the single column 'ID'
table1.columns = ['ID']
table2.columns = ['ID']
# IDs found only in table 1
unique_table1 = table1[~table1['ID'].isin(table2['ID'])]
# IDs found only in table 2
unique_table2 = table2[~table2['ID'].isin(table1['ID'])]
# IDs found in both
common = table1[table1['ID'].isin(table2['ID'])]
# Make sure the output folder exists
output_folder = '/home/xb/1_results/240906_pig_genome_annot/diff_compare/gene'
os.makedirs(output_folder, exist_ok=True)
# Save the results
unique_table1.to_csv(f'{output_folder}/unique_table1.csv', index=False, header=False)
unique_table2.to_csv(f'{output_folder}/unique_table2.csv', index=False, header=False)
common.to_csv(f'{output_folder}/common.csv', index=False, header=False)
</code></pre>
<p><strong>eggNOG annotation gave 3691 gene IDs and HMDB gave 4930; 1130 are eggNOG-only, 2369 are HMDB-only, and 2561 are shared.</strong></p>
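<p>A quick sanity check on these counts: each database total should equal its own unique part plus the shared part. The sketch below only re-does that arithmetic with the numbers reported above; the dictionary keys are illustrative names.</p>
<pre><code class="language-python">counts = {
    'eggnog_total': 3691,
    'hmdb_total': 4930,
    'eggnog_only': 1130,
    'hmdb_only': 2369,
    'shared': 2561,
}


def totals_consistent(c):
    """True when only + shared adds up to the total for both databases."""
    eggnog_ok = c['eggnog_only'] + c['shared'] == c['eggnog_total']
    hmdb_ok = c['hmdb_only'] + c['shared'] == c['hmdb_total']
    return eggnog_ok and hmdb_ok


# Size of the union of the two gene sets
union_size = counts['eggnog_only'] + counts['hmdb_only'] + counts['shared']
print(totals_consistent(counts), union_size)
</code></pre>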
<p><strong>Split the gene IDs and batch-download the gene sequences from NCBI</strong></p>
<pre><code class="language-python"># Splitting script: cut the ID table into batches of batch_size IDs (600 here) for later use.
import csv
import os


def split_list(lst, n):
    """Split a list into smaller lists of at most n elements each."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def save_split_files(gene_ids, batch_size, output_dir):
    """Split the gene ID list and save the batches as separate CSV files."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    for idx, batch in enumerate(split_list(gene_ids, batch_size), start=1):
        file_path = os.path.join(output_dir, f'batch_{idx}.csv')
        with open(file_path, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            for gene_id in batch:
                writer.writerow([gene_id])


if __name__ == "__main__":
    csv_file_path = "/home/xb/1_results/240906_pig_genome_annot/hmdb/extract_from_gbff/rmdup.gene.list"  # input CSV file
    output_dir = "/home/xb/1_results/240906_pig_genome_annot/hmdb/extract_from_gbff/split_files"  # output directory
    batch_size = 600  # number of gene IDs per file
    gene_ids_from_csv = []
    with open(csv_file_path, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if len(row) > 0:
                gene_id = int(row[0])  # the gene ID is in the first column
                gene_ids_from_csv.append(gene_id)
    save_split_files(gene_ids_from_csv, batch_size, output_dir)
</code></pre>
<pre><code class="language-python"># Download script: fetch gene sequences by ID. Some gene IDs lie on the sex chromosomes, so one ID can return two gene sequences.
import sys
import csv
import io
import os
from typing import List
from zipfile import ZipFile
from ncbi.datasets.openapi import ApiClient as DatasetsApiClient
from ncbi.datasets.openapi import ApiException as DatasetsApiException
from ncbi.datasets.openapi import GeneApi as DatasetsGeneApi


def download_and_extract_genes(gene_ids: List[int], zipfile_name: str, output_file_name: str):
    """Download the gene package and append its protein FASTA to the output file."""
    with DatasetsApiClient() as api_client:
        gene_api = DatasetsGeneApi(api_client)
        try:
            gene_dataset_download = gene_api.download_gene_package_without_preload_content(
                gene_ids,
                include_annotation_type=["FASTA_GENE", "FASTA_PROTEIN"],  # data formats to download
            )
            with open(zipfile_name, "wb") as f:
                f.write(gene_dataset_download.data)
        except DatasetsApiException as e:
            sys.exit(f"Exception when calling GeneApi: {e}\n")
    try:
        with ZipFile(zipfile_name) as dataset_zip:
            zinfo = dataset_zip.getinfo(os.path.join("ncbi_dataset/data", "protein.faa"))
            with io.TextIOWrapper(dataset_zip.open(zinfo), encoding="utf8") as fh:
                with open(output_file_name, "a", encoding="utf8") as output_file:
                    output_file.write(fh.read())
    except KeyError as e:
        sys.exit(f"File protein.faa not found in zipfile {zipfile_name}: {e}")


def process_csv_files(input_dir: str, output_file_name: str):
    """Process every CSV file in the directory and fetch the gene data."""
    csv_files = [f for f in os.listdir(input_dir) if f.endswith('.csv')]
    if not csv_files:
        sys.exit(f"No CSV files found in directory: {input_dir}")
    # Create (or empty) the output file before appending to it
    open(output_file_name, 'w').close()
    for csv_file in csv_files:
        csv_file_path = os.path.join(input_dir, csv_file)
        with open(csv_file_path, newline='') as csvfile:
            reader = csv.reader(csvfile)
            gene_ids_from_csv = [int(row[0]) for row in reader if len(row) > 0]
        # Use the CSV file name as part of the zip name so each batch gets its own archive
        zipfile_name = f"gene_{os.path.splitext(csv_file)[0]}.zip"
        download_and_extract_genes(gene_ids_from_csv, zipfile_name, output_file_name)


if __name__ == "__main__":
    input_dir = "/home/xb/1_results/240906_pig_genome_annot/hmdb/extract_from_gbff/split_files"  # directory holding the CSV files
    output_file_name = "combined_protein.faa"  # merged output file
    process_csv_files(input_dir, output_file_name)
</code></pre>
<p><strong>Unzip, rename, and merge the fna files</strong></p>
<pre><code class="language-bash">#!/bin/bash
# Unzip and rename
# Make sure the target folder exists
mkdir -p gene_output
# Loop over all zip files whose names start with gene_batch_
for zip_file in gene_batch_*.zip; do
    # File name without the extension
    base_name=$(basename "$zip_file" .zip)
    # Extract the wanted file and rename it
    unzip -j "$zip_file" ncbi_dataset/data/gene.fna -d ./gene_output/ &&
    mv ./gene_output/gene.fna ./gene_output/${base_name}.fna
done
</code></pre>
<pre><code class="language-bash">cd gene_output/
cat *.fna > output.fna
seqtk seq -A output.fna | sort -u > output.rmdup.hmdb.fna
grep ">" output.rmdup.hmdb.fna |wc -l # ends with 4934 genes; the result file is output.rmdup.hmdb.fna
</code></pre>
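<p>One caveat worth keeping in mind: <code>sort -u</code> in the pipeline above compares individual lines, so headers and sequences can end up separated from each other in the sorted file even though the header count stays meaningful. A record-level de-duplication that keeps each header together with its sequence could look like the sketch below (standard library only; function names and demo records are hypothetical).</p>
<pre><code class="language-python">def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def dedupe_records(records):
    """Keep the first occurrence of each (header, sequence) pair, in input order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Made-up records: g1 appears twice, once wrapped over two lines.
demo = ['>g1 geneA', 'ACGT', 'AC', '>g2 geneB', 'TTTT', '>g1 geneA', 'ACGTAC']
unique = dedupe_records(read_fasta(demo))
for header, sequence in unique:
    print(header)
    print(sequence)
</code></pre>
<p>For a real file, <code>read_fasta(open("output.fna"))</code> would replace the demo list, and the printed lines could be redirected to a new FASTA file.</p>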
<h3 id="2-kegg本地注释">2. Local KEGG annotation</h3>
<p><strong>Reference: [比较转录组分析(四)—— 组装的 GO 及 KEGG 注释 | Juse's Blog (biojuse.com)](https://biojuse.com/2022/11/28/比较转录组分析(四)—— 组装的注释/)</strong></p>
<p><strong>Database download:</strong></p>
<pre><code class="language-bash">mkdir -p /home/xb/0_rawdata/database/kegg/db
cd /home/xb/0_rawdata/database/kegg/db
wget -c ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
wget -c ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
wget -c ftp://ftp.genome.jp/pub/db/kofam/README
db_Version=`grep "Last update: " README|perl -p -e 's/Last update: //g;s/\//_/g'`
gunzip ko_list.gz
tar xvzf profiles.tar.gz
touch db_Version_${db_Version}
cd ..
wget -c ftp://ftp.genome.jp/pub/tools/kofam_scan/kofam_scan-1.3.0.tar.gz
tar -zxvf kofam_scan-1.3.0.tar.gz
</code></pre>
<p><strong>Install dependencies</strong></p>
<pre><code class="language-bash">conda create -n kofam -y -c bioconda ruby hmmer parallel
conda activate kofam
cd kofam_scan-1.3.0
cp config-template.yml config.yml
vim config.yml
# Contents of config.yml:
# Path to your KO-HMM database
# A database can be a .hmm file, a .hal file or a directory in which
# .hmm files are. Omit the extension if it is .hal or .hmm file
profile: /home/xb/0_rawdata/database/kegg/db/profiles
# Path to the KO list file
ko_list: /home/xb/0_rawdata/database/kegg/db/ko_list
# Path to an executable file of hmmsearch
# You do not have to set this if it is in your $PATH
hmmsearch: /home/xb/miniconda3/envs/kofam/bin/hmmsearch
# Path to an executable file of GNU parallel
# You do not have to set this if it is in your $PATH
parallel: /home/xb/miniconda3/envs/kofam/bin/parallel
# Number of hmmsearch processes to be run parallelly
cpu: 8
</code></pre>
<p><strong>Run the annotation</strong></p>
<pre><code class="language-bash">mkdir -p /home/xb/1_results/240906_pig_genome_annot/kegg
cd /home/xb/1_results/240906_pig_genome_annot/kegg
ln -s /home/xb/0_rawdata/database/kegg/kofam_scan-1.3.0/exec_annotation kofamscan
./kofamscan -o group_rep.kofam.out --cpu 8 --format mapper -e 1e-5 /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_protein.faa # feasibility test of the tool; it failed with an error saying the Ruby version is too old.
conda install ruby=2.7 # runs after upgrading to 2.7
nohup ./kofamscan -o group_rep.kofam.out --cpu 8 --format mapper -e 1e-5 /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_protein.faa &
</code></pre>
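<p>The <code>--format mapper</code> output is a plain two-column table: a gene ID, followed by a K number when one was assigned. A gene with no KO appears with the ID alone, which is how the downstream script treats it. A minimal sketch for loading such a file follows; the function name and the demo protein IDs are made up, and the layout is assumed from the way the file is consumed later in this post.</p>
<pre><code class="language-python">from collections import defaultdict


def parse_mapper(lines):
    """Return ({gene: [K numbers]}, [genes without a KO]) from mapper-format lines."""
    annotated = defaultdict(list)
    unannotated = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if not fields[0]:
            continue  # skip blank lines
        if len(fields) >= 2 and fields[1]:
            annotated[fields[0]].append(fields[1])
        else:
            unannotated.append(fields[0])
    return dict(annotated), unannotated


# Hypothetical lines; protA carries two K numbers, protB none.
demo = ['protA\tK00844\n', 'protB\n', 'protA\tK12407\n']
ko_by_gene, no_ko = parse_mapper(demo)
print(ko_by_gene, no_ko)
</code></pre>
<p>With the real result, <code>parse_mapper(open("group_rep.kofam.out"))</code> gives a quick count of annotated versus unannotated proteins before any pathway mapping.</p>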
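<p><strong>Once Kofamscan has produced the KO annotation file, the pathway-KO file of the species in question can be used to assign the corresponding pathways. The script pathway.py does this:</strong></p>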
<pre><code>#!/usr/bin/env python3
import os
import sys
import re

# Usage text printed when the script is called with the wrong number of arguments
doc = "Usage: python3 pathway.py K_LIST_FILE [SPECIES_CODE]"


def pathway_map(sp="ko"):
    """
    :param sp:The default is'ko', which means downloading 'ko00001.keg'
    https://www.genome.jp/kegg-bin/download_htext?htext=ko00001.keg&format=htext&filedir=
    :return: K_ko_map
    """
    #url = r"https://www.genome.jp/kegg-bin/download_htext?htext=asa00001.keg&format=htext&filedir="
    keg_file = sp +"00001.keg"
    cmd = r"wget 'http://www.kegg.jp/kegg-bin/download_htext?htext=" +sp + "00001.keg&format=htext&filedir=' -O " + keg_file
    if not os.path.exists(keg_file):
        # os.system returns the exit status of the command:
        # 0 when it succeeded, non-zero (for example 256) when it failed
        res = os.system(cmd)
        if res != 0:
            print("Failed to run\n" + str(cmd) +"\nplease check the network")
            sys.exit()
    in_keg = open(keg_file, "r").readlines()
    K_ko_map = {}
    for line in in_keg:
        if line.startswith("A"):
            # 'A09100 Metabolism'
            level_1 = re.match(r'^A(.+?)\s(.+)\n', line).group(2)
        elif line.startswith("B "):
            # B 09102 Energy metabolism
            level_2 = re.match(r'^B\s*(.+?)\s(.*)\n', line).group(2)
        elif line.startswith("C "):
            # 'C 00010 Glycolysis / Gluconeogenesis [PATH:asa00010]' or 'C 99980 Enzymes with EC numbers'
            pathway_info = re.match(r'^C\s*(\d+?)\s(.*)\n', line)
            pathway_id = "ko" + str(pathway_info.group(1))
            pathway_desc = str(pathway_info.group(2)).split(" [")[0]
            pathway_info_list = [pathway_desc, level_1, level_2]
        elif line.startswith("D ") and in_keg[0] == '+D\tGENES\tKO\n':
            # 'D ASA_1323 glk; glucokinase\tK00845 glk; glucokinase [EC:2.7.1.2]\n'
            K_info = re.match(r'^D\s*.*\t(K\d+?)\s(.*)\n', line)
            K_id = K_info.group(1)
            K_desc = K_info.group(2)
            K_info_list = [K_desc, pathway_id, pathway_desc, level_1, level_2]
            if K_id not in K_ko_map.keys():
                K_ko_map[K_id] = [K_info_list]
            else:
                l_tem = K_ko_map[K_id]
                if K_info_list not in l_tem:
                    l_tem.append(K_info_list)
                K_ko_map[K_id] = l_tem
        elif line.startswith("D ") and in_keg[0] == '+D\tKO\n':
            # 'D K00844 HK; hexokinase [EC:2.7.1.1]'
            K_info = re.match(r'^D\s*(K\d+?)\s(.*)\n', line)
            K_id = K_info.group(1)
            K_desc = K_info.group(2)
            K_info_list = [K_desc, pathway_id, pathway_desc, level_1, level_2]
            if K_id not in K_ko_map.keys():
                K_ko_map[K_id] = [K_info_list]
            else:
                l_tem = K_ko_map[K_id]
                if K_info_list not in l_tem:
                    l_tem.append(K_info_list)
                K_ko_map[K_id] = l_tem
    return (K_ko_map)


def ko_class_map():
    # https://www.genome.jp/kegg-bin/download_htext?htext=br08901.keg&format=htext&filedir= htext
    # https://www.genome.jp/kegg-bin/download_htext?htext=br08901.keg&format=json&filedir= json
    # https://www.genome.jp/dbget-bin/get_linkdb?-t+orthology+path:ko00040
    cmd = r"wget 'https://www.genome.jp/kegg-bin/download_htext?htext=br08901.keg&format=htext&filedir=' -O br08901.keg"
    if not os.path.exists("br08901.keg"):
        # os.system returns the exit status of the command:
        # 0 when it succeeded, non-zero (for example 256) when it failed
        res = os.system(cmd)
        if res != 0:
            print("Failed to run\n" + str(cmd) + "\nplease check the network")
            sys.exit()
    in_keg = open("br08901.keg", "r").readlines()
    pathway_map_f = open("kegg_pathway_map.xls","w+")
    ko_class = {}
    for line in in_keg:
        if line.startswith("A"):
            # 'A09100 Metabolism'
            level_1 = re.match(r'^A<b>(.*)</b>\n', line).group(1)
        elif line.startswith("B "):
            # B 09102 Energy metabolism
            level_2 = re.match(r'^B\s*(.*)\n', line).group(1)
        elif line.startswith("C "):
            # 'C 00010 Glycolysis / Gluconeogenesis [PATH:asa00010]' or 'C 99980 Enzymes with EC numbers'
            pathway_info = re.match(r'^C\s*(\d+?)\s(.*)\n', line)
            pathway_id = "ko" + str(pathway_info.group(1))
            pathway_desc = str(pathway_info.group(2)).split(" [")[0]
            line_out = pathway_id + "\t" + pathway_desc + "\t" + level_1 + "\t" +level_2 + "\t"
            pathway_map_f.write(line_out + "\n")
            pathway_info_list = [pathway_desc, level_1, level_2]
            if pathway_id not in ko_class.keys():
                ko_class[pathway_id] = pathway_info_list
    pathway_map_f.close()
    return(ko_class)


def K_list_Parser(K_list,sp="ko"):
    file_name = os.path.split(K_list)[1].rsplit(".",1)[0]
    out_kegg_anno = file_name + ".kegg_anno.xls"
    not_in_pathway_map = file_name + "K_codes_not_in_pathway_map.list"
    out_kegg_pathway = file_name + ".kegg_pathway_stata.xls"
    anno_f = open(out_kegg_anno, "w+")
    not_in_pathway_f = open(not_in_pathway_map, "w+")
    pathway_f = open(out_kegg_pathway, "w+")
    anno_f.write("gene_id\tK_id\tK_desc\tpathway_id\tpathway_desc\tlevel_1\tlevel_2\n")
    pathway_f.write("Pathway\tGenes annoted in term\tPathway ID\tLevel1\tLevel2\tKOs\tGenes\n")
    map_kegg = pathway_map(sp)
    ko_class = ko_class_map()
    infile_list = open(K_list, "r").readlines()
    infile_list = [ term.rstrip("\n").split("\t") for term in infile_list ]
    k_num_dict = {}
    line_out_tem = ""
    for line in infile_list:
        gene_id = line[0]
        if len(line) == 2:
            K_id = line[1]
            if K_id in map_kegg.keys():
                pathway_info = map_kegg[K_id]
                for K_info_list in pathway_info :
                    # K_desc, pathway_id, pathway_desc, level_1, level_2
                    string = "\t"
                    line_out =gene_id + "\t" + K_id + "\t" +string.join(K_info_list) + "\n"
                    if line_out != line_out_tem:
                        anno_f.write(line_out)
                        line_out_tem = line_out
            else:
                not_in_pathway_f.write(gene_id + "\t" + K_id + "\n")
                line_out = gene_id + "\t" * 6 + "\n"
                if line_out != line_out_tem:
                    anno_f.write(line_out)
                    line_out_tem = line_out
            # k_id 2 gene_id list
            if K_id not in k_num_dict.keys():
                k_num_dict[K_id] = gene_id
            else:
                k_num_dict[K_id] = k_num_dict[K_id] + ';' + gene_id
        else:
            anno_f.write(gene_id +"\t"*6 + "\n")
    ko_sample_dict = {}
    for K_id in list(k_num_dict.keys()):
        if K_id not in list(map_kegg.keys()):
            continue
        ko_num_sample = [ term[1] for term in map_kegg[K_id]]
        for ko in ko_num_sample:
            if ko not in list(ko_sample_dict.keys()):
                ko_sample_dict[ko] = K_id
            else:
                ko_sample_dict[ko] = ko_sample_dict[ko] + ';' + K_id
    for ko in [item for item in list(ko_sample_dict.keys()) if item in list(ko_class.keys()) ] :
        pathway = ko_class[ko][0]
        level1 = ko_class[ko][1]
        level2 = ko_class[ko][2]
        k_num_list = ko_sample_dict[ko].split(';')
        gene_str = ''
        for k_num_sample in k_num_list:
            gene_str = gene_str + k_num_dict[k_num_sample] + ";"
        num_gene = gene_str.count(';')
        # pathway_f.write("Pathway\tGenes annoted in term\tPathway ID\tLevel1\tLevel2\tKOs\tGenes\n")
        pathway_f.write(pathway + '\t' + str(num_gene)+ '\t' + ko +'\t' +level1 + '\t'+level2 +'\t'+ko_sample_dict[ko].rstrip(';')+ '\t' + gene_str.rstrip(';')+'\n')
    anno_f.close()
    not_in_pathway_f.close()
    pathway_f.close()


if len(sys.argv) < 2:  # called without arguments: print the help text
    print(doc)
    sys.exit()
elif len(sys.argv) == 2:
    kaas_inflie = sys.argv[1]
    K_list_Parser(kaas_inflie)
elif len(sys.argv) == 3:
    kaas_inflie = sys.argv[1]
    sp = sys.argv[2]
    K_list_Parser(kaas_inflie,sp)
else:
    print(doc)
    sys.exit()
</code></pre>
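<p>The heart of pathway.py is a pair of regular expressions that pull the pathway number out of "C" lines and the K number out of "D" lines of the .keg hierarchy. The sketch below isolates those two patterns on the example lines quoted in the script's own comments, so they can be checked without downloading anything. The wrapper function names are hypothetical.</p>
<pre><code class="language-python">import re

C_LINE = re.compile(r'^C\s*(\d+?)\s(.*)\n')
D_LINE = re.compile(r'^D\s*(K\d+?)\s(.*)\n')


def parse_pathway_line(line):
    """Return (pathway_id, description) for a 'C' line, or None if it does not match."""
    match = C_LINE.match(line)
    if match is None:
        return None
    pathway_id = 'ko' + match.group(1)
    description = match.group(2).split(' [')[0]  # drop the trailing [PATH:...] tag
    return pathway_id, description


def parse_ko_line(line):
    """Return (K_id, description) for a 'D' line of the reference hierarchy, or None."""
    match = D_LINE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_pathway_line('C 00010 Glycolysis / Gluconeogenesis [PATH:asa00010]\n'))
print(parse_ko_line('D K00844 HK; hexokinase [EC:2.7.1.1]\n'))
</code></pre>
<p>Returning <code>None</code> on a non-matching line is a choice made for this sketch; the script itself assumes that every such line matches.</p>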
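<pre><code class="language-bash"># Species-specific annotation, pig (ssc)
python3 pathway.py group_rep.kofam.out ssc
# Output files
# br08901.keg                                        KEGG pathway hierarchy file
# group_rep.kofam.kegg_anno.xls                      K and ko annotation (per gene)
# group_rep.kofam.kegg_pathway_stata.xls             pathway annotation statistics (per pathway)
# group_rep.kofamK_codes_not_in_pathway_map.list     Orthology entries (K id) not assigned to any pathway
# kegg_pathway_map.xls                               KEGG pathway hierarchy table
# ko00001.keg                                        KEGG Orthology (KO) information (reference or species-specific)
# Further analysis of group_rep.kofam.kegg_pathway_stata.xls: download it locally, restrict Level1 to Metabolism, then extract the associated genes and de-duplicate.
</code></pre>
<p><strong>Write the extraction script extract.py</strong></p>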
<pre><code class="language-python">import pandas as pd

input_file = '/home/xb/1_results/240906_pig_genome_annot/kegg/metabolism.pro.list'
output_file = '/home/xb/1_results/240906_pig_genome_annot/kegg/metabolism.pro.id'
# Read the single-column file
df = pd.read_csv(input_file, header=None)
# Take the only column (protein IDs)
protein_ids = df.iloc[:, 0]
# Split the protein IDs and write them to a new file, one per line
with open(output_file, 'w') as outfile:
    for protein_id in protein_ids:
        # Entries holding several IDs are separated by semicolons
        ids = protein_id.split(';')
        for single_id in ids:
            outfile.write(f"{single_id}\n")
</code></pre>
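<pre><code class="language-bash">python3 extract.py
sort -n metabolism.pro.id |uniq > rmdup.pro.id # ends with 4926 protein IDs
</code></pre>
<p><strong>Extract the Gene_id for each protein ID from the GBFF, ending with 1740 gene sequences. (Annotating to pathways without restricting the species was also tested and gave 1759 genes.)</strong></p>
<p><strong>Split the gene IDs and batch-download the gene sequences from NCBI as described above; the steps are not repeated here.</strong></p>
<h4 id="比较三个数据库的注释结果以韦恩图和表格的形式展示">Compare the annotation results of the three databases, shown as a Venn diagram and tables.</h4>
<p><strong>First assemble, on a PC, one table of the gene IDs annotated by the three databases (all.gene.tab), then analyze it with Python.</strong></p>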
<pre><code class="language-bash">mkdir diff_compare && cd diff_compare
python3 analysis.py
</code></pre>
<pre><code class="language-python">import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

# Read the table
file_path = '/home/xb/1_results/240906_pig_genome_annot/diff_compare/all.gene.tab'
df = pd.read_csv(file_path, sep='\t')
# Turn each column of gene IDs into a set
kegg_set = set(df['KEGG'].dropna())
hmdb_set = set(df['HMDB'].dropna())
eggnog_set = set(df['EGGNOG'].dropna())
# Draw the Venn diagram
plt.figure(figsize=(8, 8))
venn = venn3([kegg_set, hmdb_set, eggnog_set], ('KEGG', 'HMDB', 'EGGNOG'))
plt.title('KEGG vs HMDB vs EGGNOG Gene IDs')
output_venn_path = '/home/xb/1_results/240906_pig_genome_annot/diff_compare/venn_diagram.png'
plt.savefig(output_venn_path)  # save the figure directly; no need to display it
plt.close()  # close the figure
# Gene IDs in each region of the Venn diagram
venn_data = {
    'KEGG_only': kegg_set - hmdb_set - eggnog_set,
    'HMDB_only': hmdb_set - kegg_set - eggnog_set,
    'EGGNOG_only': eggnog_set - kegg_set - hmdb_set,
    'KEGG_HMDB': kegg_set & hmdb_set - eggnog_set,
    'KEGG_EGGNOG': kegg_set & eggnog_set - hmdb_set,
    'HMDB_EGGNOG': hmdb_set & eggnog_set - kegg_set,
    'All_three': kegg_set & hmdb_set & eggnog_set
}
# Save the gene IDs of each region as one sheet of an Excel workbook
output_table_path = '/home/xb/1_results/240906_pig_genome_annot/diff_compare/venn_data.xlsx'
with pd.ExcelWriter(output_table_path) as writer:
    for section, genes in venn_data.items():
        pd.DataFrame(list(genes), columns=[section]).to_excel(writer, sheet_name=section, index=False)
</code></pre>
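<p>In Python, <code>-</code> binds more tightly than <code>&amp;</code>, so an expression such as <code>kegg_set &amp; hmdb_set - eggnog_set</code> is evaluated as <code>kegg_set &amp; (hmdb_set - eggnog_set)</code>. For sets this yields the same elements as <code>(kegg_set &amp; hmdb_set) - eggnog_set</code>. The toy example below, with made-up integer IDs, checks that and also confirms that the seven regions partition the union.</p>
<pre><code class="language-python">def venn_regions(a, b, c):
    """Return the seven disjoint regions of a three-set Venn diagram."""
    return {
        'a_only': a - b - c,
        'b_only': b - a - c,
        'c_only': c - a - b,
        'a_b': (a & b) - c,
        'a_c': (a & c) - b,
        'b_c': (b & c) - a,
        'all_three': a & b & c,
    }


# Made-up gene IDs
kegg = {1, 2, 3, 4}
hmdb = {3, 4, 5, 6}
eggnog = {4, 6, 7}

regions = venn_regions(kegg, hmdb, eggnog)
region_total = sum(len(genes) for genes in regions.values())
same_result = (kegg & hmdb - eggnog) == ((kegg & hmdb) - eggnog)
print(region_total, len(kegg | hmdb | eggnog), same_result)
</code></pre>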
<figure data-type="image" tabindex="1"><img src="https://xiaobin-phd.github.io/post-images/1726533555852.png" alt="" loading="lazy"></figure>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Annotating Metabolism-Related Genes in the Pig]]></title>
<id>https://xiaobin-phd.github.io/post/zhu-dai-xie-xiang-guan-ji-yin-zhu-shi/</id>
<link href="https://xiaobin-phd.github.io/post/zhu-dai-xie-xiang-guan-ji-yin-zhu-shi/">
</link>
<updated>2024-09-09T00:43:40.000Z</updated>
<content type="html"><![CDATA[<h3 id="1-参考基因组下载">1. Downloading the reference genome</h3>
<p><strong>Genome choice: Genome assembly Sscrofa11.1 (the official reference genome, a female Duroc pig)</strong></p>
<pre><code>wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_genomic.fna.gz
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_protein.faa.gz
# MD5 check
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/md5checksums.txt
md5sum -c md5checksums.txt
</code></pre>
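<p><code>md5sum -c</code> checks every file listed in md5checksums.txt, including files that were never downloaded. If only the downloaded files should be verified, a small script along the following lines could be used. It is a sketch that assumes the usual "checksum, whitespace, ./filename" layout of the checksum file; the function names are hypothetical.</p>
<pre><code class="language-python">import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(lines):
    """Parse 'md5  ./filename' lines into a {filename: md5} dictionary."""
    table = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name = parts[1].strip()
        if name.startswith('./'):
            name = name[2:]
        table[name] = parts[0]
    return table


def verify_present_files(checksum_lines, directory='.'):
    """Return {filename: True/False} for listed files that exist in the directory."""
    results = {}
    for name, expected in parse_checksums(checksum_lines).items():
        path = os.path.join(directory, name)
        if os.path.exists(path):
            results[name] = md5_of_file(path) == expected
    return results
</code></pre>
<p>Typical use would be <code>verify_present_files(open("md5checksums.txt"))</code> in the download directory; files that are listed but absent are skipped silently.</p>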
<h3 id="2-基因组注释">2. Genome annotation</h3>
<p><strong>UniProt annotation (three parts: download, database build, and annotation)</strong></p>
<pre><code class="language-bash">wget -c https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
makeblastdb -in ~/0_rawdata/database/uniprot/uniprot_sprot.fasta -dbtype prot -out ~/0_rawdata/database/uniprot/uni
blastdbcmd -info -db /home/xb/0_rawdata/database/uniprot/uni # 检查数据库构建是否正确
nohup blastx -query /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_genomic.fna -db ~/0_rawdata/database/uniprot/uni -max_target_seqs 1 -outfmt 6 -evalue 1e-5 > uniprot.out &
# Abandoned: slow, and it does not give the result we want.
jobs -l
kill -9 20239
# Annotate the faa file instead
diamond makedb --in uniprot_sprot.fasta -d uni
nohup diamond blastp -d ~/0_rawdata/database/uniprot/uni.dmnd -q /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_protein.faa -o pig.uni.xml -f 5 --sensitive --max-target-seqs 20 -e 1e-5 --id 20 --tmpdir /dev/shm --index-chunks 1 -p 8 &
parsing_blast_result.pl --evalue 1e-5 --HSP-num 1 --out-hit-confidence --suject-annotation pig.uni.xml > pig.uni.tab
</code></pre>
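<p>Following the method in the reference paper, eggNOG annotation looks like a better fit for our needs, since it directly yields proteins with EC numbers.</p>
<p>Software installation and usage reference: https://www.yunbios.net/eggNOG.html (covers download and installation, database download, and alignment)</p>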
<pre><code class="language-bash">conda create -n eggnog
conda activate eggnog
conda install -c bioconda -y eggnog-mapper
wget -c http://eggnog6.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz
gzip -d eggnog_proteins.dmnd.gz # test failed; the database has to be downloaded with the script shipped with the software.
# First create a data folder by hand; the script does not create it by default.
mkdir -p /home/xb/miniconda3/envs/eggnog/lib/python3.12/site-packages/data
download_eggnog_data.py # download the database with the official script
nohup emapper.py -i /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_protein.faa -o pig --cpu 4 --seed_ortholog_evalue 1e-5 --dmnd_db /home/xb/miniconda3/envs/eggnog/lib/python3.12/site-packages/data/eggnog_proteins.dmnd &
# Runtime is about 30 min. Download pig.emapper.annotations locally, pull out the proteins annotated with an EC number (12359), and extract their sequences.
sort -n id.list |uniq > rmdup.list # de-duplicate, although eggNOG annotation does not seem to produce duplicate sequences the way BLAST does
fasta_extract_subseqs_from_list.pl /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_protein.faa rmdup.list > target.faa
# Remove redundant sequences with cd-hit, then align back to the genome.
conda install -c bioconda cd-hit
cd-hit -i target.faa -o target.0.95.faa -c 0.95 # 5270 remain after redundancy removal
mkdir blastx && cd blastx
makeblastdb -in ../target.0.95.faa -dbtype prot -out ./ec
blastdbcmd -info -db ./ec
nohup blastx -query /data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_genomic.fna -db ./ec -max_target_seqs 1 -outfmt 5 -evalue 1e-5 -num_threads 8 > pig.out & # No results: after conversion to tab format there were no hits. The run takes too long, so it could not be reproduced.
</code></pre>
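<p>The step "extract the proteins annotated with an EC number" from pig.emapper.annotations can be sketched as below. This is only a minimal sketch under stated assumptions, not the method actually used here: it assumes the annotation file is tab-separated, that metadata lines start with <code>##</code>, that the header line starts with <code>#query</code> and contains a column named <code>EC</code>, and that a missing value is written as <code>-</code>. The function name <code>proteins_with_ec</code> and the example rows are made up for illustration.</p>

```python
def proteins_with_ec(annotation_lines):
    """Return query IDs whose EC column is non-empty (assumed missing marker: '-')."""
    ids = []
    header = None
    for line in annotation_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and '##' metadata lines
        fields = line.split("\t")
        if line.startswith("#"):
            # Header line, e.g. '#query<TAB>seed_ortholog<TAB>...'; drop the leading '#'
            header = [name.lstrip("#") for name in fields]
            continue
        if header is None:
            raise ValueError("header line starting with '#query' not found")
        row = dict(zip(header, fields))
        if row.get("EC", "-") not in ("", "-"):
            ids.append(row["query"])
    return ids


# Tiny made-up example in the assumed layout (not real pig annotations).
example = (
    "## emapper run metadata\n"
    "#query\tseed_ortholog\tEC\n"
    "protA\t9823.X1\t1.1.1.1\n"
    "protB\t9823.X2\t-\n"
    "protC\t9823.X3\t2.7.11.1,2.7.11.24\n"
)
print(proteins_with_ec(example.splitlines()))
```

<p>With a real file, the same function could be fed an open file handle, and the returned IDs written one per line to id.list.</p>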
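<h3 id="3-根据protein_id提取对应的gene_id新思路">3. Extract the corresponding gene_id from protein_id (new approach)</h3>
<p><strong>After some research, I found that gene_id can be extracted directly from the gbff file, so I put together a script to test this</strong> (I love GPT!)</p>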
<pre><code class="language-python">from Bio import SeqIO


def extract_gene_ids(gbff_file, protein_ids_file, output_file):
    # Read the list of protein IDs (blank lines are skipped)
    with open(protein_ids_file, 'r') as f:
        protein_ids = [line.strip() for line in f if line.strip()]
    # Set for fast membership tests; the list keeps the input order for the output
    wanted = set(protein_ids)
    # Dictionary mapping protein ID to gene ID
    protein_to_gene = {}
    # Parse the GBFF file
    for record in SeqIO.parse(gbff_file, "genbank"):
        for feature in record.features:
            if feature.type == "CDS":
                # Extract protein_id
                if "protein_id" in feature.qualifiers:
                    protein_id = feature.qualifiers["protein_id"][0]
                    if protein_id in wanted:
                        # Extract GeneID from the db_xref qualifiers
                        gene_ids = [xref.split(":")[1] for xref in feature.qualifiers.get("db_xref", []) if xref.startswith("GeneID:")]
                        if gene_ids:
                            protein_to_gene[protein_id] = gene_ids[0]  # keep only the first GeneID
    # Write the results to a file
    with open(output_file, 'w') as out_f:
        out_f.write("Protein_ID\tGene_ID\n")
        for protein_id in protein_ids:
            gene_id = protein_to_gene.get(protein_id, "Not found")
            out_f.write(f"{protein_id}\t{gene_id}\n")


# Example usage
gbff_file = "/data/xb/1_pig_genome/GCF_000003025.6_Sscrofa11.1_genomic.gbff"  # replace with the path to your GBFF file
protein_ids_file = "/home/xb/1_results/0906_pig_genome_annot/eggnog/rmdup.list"  # replace with the path to the txt file of protein IDs
output_file = "/home/xb/1_results/0906_pig_genome_annot/eggnog/extract_from_gbff/gene.id.txt"  # path of the output file
extract_gene_ids(gbff_file, protein_ids_file, output_file)
</code></pre>
<pre><code class="language-bash"># Extract the gene_id in the second column and remove duplicates
cut -f2 gene.id.txt > id.list
sort -n id.list | uniq > rmdup.list # contains 3691 genes
</code></pre>
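<p>One caveat: gene.id.txt as written by the script above starts with a <code>Protein_ID / Gene_ID</code> header line and may contain <code>Not found</code> entries, and <code>cut -f2</code> keeps both, while the later scripts call <code>int()</code> on every line of rmdup.list. A minimal Python sketch of a stricter cleanup is shown below; the function name <code>unique_gene_ids</code> and the example rows are hypothetical.</p>

```python
def unique_gene_ids(lines):
    """Collect unique numeric Gene_ID values from 'Protein_ID<TAB>Gene_ID' lines."""
    seen = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # malformed or blank line
        gene_id = parts[1].strip()
        if gene_id.isdigit():  # drops the 'Gene_ID' header and 'Not found'
            seen.add(int(gene_id))
    return sorted(seen)


# Made-up rows in the same layout as gene.id.txt
example = [
    "Protein_ID\tGene_ID\n",
    "protA\t100\n",
    "protB\t100\n",
    "protC\tNot found\n",
    "protD\t42\n",
]
print(unique_gene_ids(example))
```

<p>Keeping only purely numeric values means every line that ends up in the list can safely be converted with <code>int()</code> later on.</p>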
<p><strong>Dead end, kept for the record only.</strong></p>
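<pre><code class="language-bash"># Put together another script to batch-fetch the gene sequence files from NCBI by gene_id (an API key has to be obtained first).
# The approach GPT suggested could not be made to work.
# Extracting a gene's FASTA sequence from a GeneID takes two steps: the GeneID first has to be converted to the corresponding nucleotide ID or RefSeq ID. The GeneID itself is mainly used for annotation and search in the `gene` database, while the actual sequence data lives in the `nucleotide` database.
# GPT's follow-up answer still did not work.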
</code></pre>
<p><strong>A blog post provided the inspiration: <a href="https://blog.csdn.net/qq_65680034/article/details/136400958">How to batch-download gene sequence data from the NCBI Gene database (CSDN blog)</a>. Official guide: <a href="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/languages/">Supported programming languages (nih.gov)</a></strong></p>
<p><strong>The hands-on procedure is as follows. Step 1: build the Python NCBI Datasets API v2alpha library using the OpenAPI Java libraries</strong></p>
<pre><code>#!/usr/bin/env bash
# Save this as a bash script and run it directly (install path: /home/xb/3_opt/ncbi_api).
OUTPUT_DIR="python_lib"
wget https://www.ncbi.nlm.nih.gov/datasets/docs/v2/openapi3/openapi3.docs.yaml
wget https://repo1.maven.org/maven2/org/openapitools/openapi-generator-cli/7.2.0/openapi-generator-cli-7.2.0.jar -O openapi-generator-cli.jar
java -jar openapi-generator-cli.jar generate -g python -i openapi3.docs.yaml --package-name "ncbi.datasets.openapi" --additional-properties=pythonAttrNoneIfUnset=true,projectName="ncbi-datasets-pylib"
</code></pre>
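<p><strong>After this step, run <u>pip install .</u> (very important: the official documentation does not mention it, but the later steps fail with an error if it is skipped)</strong></p>
<p><strong>Step 2: write the extraction script gene_get_info.py</strong></p>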
<pre><code class="language-python">import sys
import csv
import io
import os
from typing import List
from zipfile import ZipFile
from ncbi.datasets.openapi import ApiClient as DatasetsApiClient
from ncbi.datasets.openapi import ApiException as DatasetsApiException
from ncbi.datasets.openapi import GeneApi as DatasetsGeneApi

zipfile_name = "gene.zip"  # name of the downloaded zip archive
output_file_name = "protein.faa"  # name of the output file

# Read the gene IDs from the list file into a list of integers
gene_ids: List[int] = []
csv_file_path = "/home/xb/1_results/0906_pig_genome_annot/eggnog/extract_from_gbff/rmdup.list"  # path to the gene ID list
with open(csv_file_path, newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row:  # skip blank lines
            gene_ids.append(int(row[0]))  # the gene ID is in the first column

with DatasetsApiClient() as api_client:
    gene_api = DatasetsGeneApi(api_client)
    try:
        gene_dataset_download = gene_api.download_gene_package_without_preload_content(
            gene_ids,
            include_annotation_type=["FASTA_GENE", "FASTA_PROTEIN"],  # data formats to download
        )
        with open(zipfile_name, "wb") as f:
            f.write(gene_dataset_download.data)
    except DatasetsApiException as e:
        sys.exit(f"Exception when calling GeneApi: {e}\n")

# Copy protein.faa out of the downloaded archive
try:
    with ZipFile(zipfile_name) as dataset_zip:
        zinfo = dataset_zip.getinfo(os.path.join("ncbi_dataset/data", "protein.faa"))
        with io.TextIOWrapper(dataset_zip.open(zinfo), encoding="utf8") as fh:
            with open(output_file_name, "w", encoding="utf8") as output_file:
                output_file.write(fh.read())
except KeyError as e:
    sys.exit(f"File protein.faa not found in {zipfile_name}: {e}")
</code></pre>
<p><strong>Step 3: run the program.</strong></p>
<pre><code class="language-bash">source ~/3_opt/cobra/bin/activate
pip install python_lib
chmod 755 gene_get_info.py
python3 gene_get_info.py
</code></pre>
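<p><strong>Step 4: performance testing. Because of NCBI limits, the 3691 genes cannot be downloaded in a single request. Testing showed that one request can fetch at most 700 genes (only 600 when downloading in a loop), so the gene ID list has to be split first and the gene sequences fetched batch by batch. The improved scripts are as follows:</strong></p>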
<pre><code class="language-python"># Splitting script: cut the ID list into chunks of at most 600 IDs and write each chunk to its own small CSV for later use.
import csv
import os


def split_list(lst, n):
    """Split a list into smaller lists of at most n elements each."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def save_split_files(gene_ids, batch_size, output_dir):
    """Split the gene ID list and save the chunks to separate CSV files."""
    os.makedirs(output_dir, exist_ok=True)
    for idx, batch in enumerate(split_list(gene_ids, batch_size), start=1):
        file_path = os.path.join(output_dir, f'batch_{idx}.csv')
        with open(file_path, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            for gene_id in batch:
                writer.writerow([gene_id])


if __name__ == "__main__":
    csv_file_path = "/home/xb/1_results/0906_pig_genome_annot/eggnog/extract_from_gbff/rmdup.list"  # input gene ID list
    output_dir = "/home/xb/1_results/0906_pig_genome_annot/eggnog/extract_from_gbff/split_files"  # output directory
    batch_size = 600  # number of gene IDs per file
    gene_ids_from_csv = []
    with open(csv_file_path, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if len(row) > 0:
                gene_id = int(row[0])  # the gene ID is in the first column
                gene_ids_from_csv.append(gene_id)
    save_split_files(gene_ids_from_csv, batch_size, output_dir)
</code></pre>
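<p>As a quick sanity check of the batching arithmetic, the sketch below computes how many batch files to expect for 3691 gene IDs at 600 per batch. The helper name <code>batch_sizes</code> is made up for illustration and is not part of the scripts in this post.</p>

```python
import math


def batch_sizes(total, batch_size):
    """Return the size of each batch when `total` items are cut into chunks of `batch_size`."""
    full, rest = divmod(total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(3691, 600)
# Six full batches of 600 plus one final batch with the remaining 91 IDs
print(len(sizes), sizes[-1])
assert len(sizes) == math.ceil(3691 / 600)
```

<p>So the split directory should contain seven files, batch_1.csv through batch_7.csv, with the last one holding 91 IDs.</p>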
<p><strong>Then call the NCBI API to download the data in batches (as zip archives, one per batch file)</strong></p>
<pre><code class="language-python"># Download script: fetch gene sequences by ID. Some gene IDs are on the sex chromosomes, so a single ID can return two gene sequences.
import sys
import csv
import io
import os
from typing import List
from zipfile import ZipFile
from ncbi.datasets.openapi import ApiClient as DatasetsApiClient
from ncbi.datasets.openapi import ApiException as DatasetsApiException
from ncbi.datasets.openapi import GeneApi as DatasetsGeneApi


def download_and_extract_genes(gene_ids: List[int], zipfile_name: str, output_file_name: str):
    """Download the gene data package and append its protein FASTA to the output file."""
    with DatasetsApiClient() as api_client:
        gene_api = DatasetsGeneApi(api_client)
        try:
            gene_dataset_download = gene_api.download_gene_package_without_preload_content(
                gene_ids,
                include_annotation_type=["FASTA_GENE", "FASTA_PROTEIN"],  # data formats to download
            )
            with open(zipfile_name, "wb") as f:
                f.write(gene_dataset_download.data)
        except DatasetsApiException as e:
            sys.exit(f"Exception when calling GeneApi: {e}\n")
    try:
        with ZipFile(zipfile_name) as dataset_zip:
            zinfo = dataset_zip.getinfo(os.path.join("ncbi_dataset/data", "protein.faa"))
            with io.TextIOWrapper(dataset_zip.open(zinfo), encoding="utf8") as fh:
                with open(output_file_name, "a", encoding="utf8") as output_file:
                    output_file.write(fh.read())
    except KeyError as e:
        sys.exit(f"File protein.faa not found in {zipfile_name}: {e}")


def process_csv_files(input_dir: str, output_file_name: str):
    """Process every CSV file in the directory and extract the gene data."""
    csv_files = [f for f in os.listdir(input_dir) if f.endswith('.csv')]
    if not csv_files:
        sys.exit(f"No CSV files found in directory: {input_dir}")
    # Create (or truncate) the output file before appending to it
    open(output_file_name, 'w').close()
    for csv_file in csv_files:
        csv_file_path = os.path.join(input_dir, csv_file)
        with open(csv_file_path, newline='') as csvfile:
            reader = csv.reader(csvfile)
            gene_ids_from_csv = [int(row[0]) for row in reader if len(row) > 0]
        # Use the CSV file name as part of the unique zip name
        zipfile_name = f"gene_{os.path.splitext(csv_file)[0]}.zip"
        download_and_extract_genes(gene_ids_from_csv, zipfile_name, output_file_name)


if __name__ == "__main__":
    input_dir = "/home/xb/1_results/0906_pig_genome_annot/eggnog/extract_from_gbff/split_files"  # directory containing the CSV files
    output_file_name = "combined_protein.faa"  # merged output file
    process_csv_files(input_dir, output_file_name)
</code></pre>
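<p>Because the download script appends every batch to the same output file, and a single gene ID can return more than one sequence, it may be worth counting the FASTA records afterwards. The following is a minimal sketch, assuming standard FASTA headers in which the record ID is the first whitespace-delimited token after <code>&gt;</code>; the function name <code>fasta_id_counts</code> and the example records are hypothetical.</p>

```python
from collections import Counter


def fasta_id_counts(lines):
    """Count how often each record ID appears among FASTA header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # ignore a bare '>' with no ID
                counts[tokens[0]] += 1
    return counts


# Made-up records; P1 appears twice, as it would if two batches returned it
example = [">P1 desc\n", "MKV\n", ">P2\n", "MAA\n", ">P1 desc\n", "MKV\n"]
counts = fasta_id_counts(example)
duplicates = sorted(rec_id for rec_id, n in counts.items() if n > 1)
print(sum(counts.values()), duplicates)
```

<p>On the real data the same function could be applied to an open handle on combined_protein.faa to see the total record count and any IDs that occur more than once.</p>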
<p><strong>Unzip, rename, and merge the fna files</strong></p>
<pre><code class="language-bash">#!/bin/bash
# Unzip and rename
# Make sure the target folder exists
mkdir -p gene_output
# Loop over all zip files whose names start with gene_batch_
for zip_file in gene_batch_*.zip; do
    # Get the file name without the extension
    base_name=$(basename "$zip_file" .zip)
    # Extract only gene.fna and rename it after the batch
    unzip -j "$zip_file" ncbi_dataset/data/gene.fna -d ./gene_output/ &&
    mv ./gene_output/gene.fna "./gene_output/${base_name}.fna"
done
</code></pre>
<pre><code class="language-bash">cd gene_output/