Prokka, a Genome Annotation Tool

Published at 2023-03-07 08:26

Author: zhixy


Introduction

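Whole-genome annotation is the process of identifying features of interest in a set of genomic DNA sequences and labeling them with useful information. Prokka is a software tool that rapidly annotates bacterial, archaeal, and viral genomes and produces standards-compliant output files.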

Recommended installation method:

(base) [user@server ~]# conda install -c bioconda prokka

After installation, run prokka without arguments to print its help text:

(base) [user@user ~]# prokka
Name:
  Prokka 1.12 by Torsten Seemann <[email protected]>
Synopsis:
  rapid bacterial genome annotation
Usage:
  prokka [options] <contigs.fasta>
General:
  --help            This help
  --version         Print version and exit
  --docs            Show full manual/documentation
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
  --outdir [X]      Output folder [auto] (default '')
  --force           Force overwriting existing output folder (default OFF)
  --prefix [X]      Filename output prefix [auto] (default '')
  --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
  --addmrna         Add 'mRNA' features for each 'CDS' feature (default OFF)
  --locustag [X]    Locus tag prefix [auto] (default '')
  --increment [N]   Locus tag counter increment (default '1')
  --gffver [N]      GFF version (default '3')
  --compliant       Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]      Sequencing centre ID. (default '')
  --accver [N]      Version to put in Genbank file (default '1')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --gram [X]        Gram: -/neg /pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    FASTA or GBK file to use as 1st priority (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
  --cdsrnaolap      Allow [tr]RNA to overlap CDS (default OFF)
Computation:
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --fast            Fast mode - only use basic BLASTP databases (default OFF)
  --noanno          For CDS just set /product="unannotated protein" (default OFF)
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with InfernalRfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

Key parameters

  • --outdir output folder for the results
  • --prefix filename prefix for the output files
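  • --locustag locus tag prefix for the predicted genes; Prokka appends an underscore and a zero-padded counter, so with --locustag ORF the first gene gets locus ID ORF_00001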
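  • --cpus number of CPUs to use (0 = all)
  • --fast fast mode, using only the basic BLASTP databases
  • --noanno skip functional annotation (every CDS gets /product="unannotated protein")
  • --norrna skip the rRNA search
  • --notrna skip the tRNA search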

Example

(base) [user@server ~]# prokka --outdir GCA_002902505.1 --prefix GCA_002902505.1 --noanno --cpus 8 --locustag 'GCA_002902505.1|ORF' GCA_002902505.1.fna

Setting --locustag 'GCA_002902505.1|ORF' embeds the genome ID (GCA_002902505.1) in every gene ID, which makes it easy to tell sequences apart when annotation results from multiple genomes are merged.
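When several genomes are annotated this way, the same command can be generated once per input file. The following is a minimal sketch, not part of Prokka itself: the function name annotate_genome, the DRY_RUN variable, and the convention that the genome ID equals the FASTA file name without its .fna extension are all assumptions made for illustration. By default it only prints each command so it can be checked before anything is run.

```shell
#!/bin/sh
set -eu

# Build the prokka command for one genome FASTA (hypothetical helper).
# Assumption: the genome ID is the file name without the .fna extension.
annotate_genome() {
    genome_fasta="$1"
    genome_id="$(basename "$genome_fasta" .fna)"

    # Store the command as the function's positional parameters so it can
    # be either printed or executed without writing it twice.
    set -- prokka --outdir "$genome_id" --prefix "$genome_id" \
        --noanno --cpus 8 --locustag "${genome_id}|ORF" "$genome_fasta"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Printed for inspection only: the locus tag contains '|' and
        # would need quoting if the line were pasted into a shell.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Process every FASTA file given on the command line.
for f in "$@"; do
    annotate_genome "$f"
done
```

Saved under a name of your choosing (for example annotate_all.sh, a hypothetical name), it could be called as `sh annotate_all.sh genomes/*.fna` to preview the commands, and as `DRY_RUN=0 sh annotate_all.sh genomes/*.fna` to actually run Prokka.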

The annotation produces the following files:

(base) [user@server ~]# ls -l GCA_002902505.1
total 27M
-rw-rw-r-- 1 user user   84 Apr 14 15:37 errorsummary.val
-rw-rw-r-- 1 user user 950K Apr 14 15:37 GCA_002902505.1.err 
-rw-rw-r-- 1 user user 847K Apr 14 15:37 GCA_002902505.1.faa # predicted protein sequences
-rw-rw-r-- 1 user user 2.3M Apr 14 15:37 GCA_002902505.1.ffn # predicted nucleotide sequences, including CDS, rRNA, tRNA, tmRNA, misc_RNA
-rw-rw-r-- 1 user user 2.6M Apr 14 15:37 GCA_002902505.1.fna # original genome sequence
-rw-rw-r-- 1 user user 2.6M Apr 14 15:37 GCA_002902505.1.fsa # same as above (contig/scaffold IDs differ)
-rw-rw-r-- 1 user user 5.1M Apr 14 15:37 GCA_002902505.1.gbf # annotation in GenBank format
-rw-rw-r-- 1 user user 3.0M Apr 14 15:37 GCA_002902505.1.gff # annotation in GFF format
-rw-rw-r-- 1 user user 7.4K Apr 14 15:37 GCA_002902505.1.log 
-rw-rw-r-- 1 user user 8.1M Apr 14 15:37 GCA_002902505.1.sqn # sqn format, suitable for submission to NCBI GenBank
-rw-rw-r-- 1 user user 332K Apr 14 15:37 GCA_002902505.1.tbl # Feature Table file; can be converted to sqn with "tbl2asn"
-rw-rw-r-- 1 user user 124K Apr 14 15:37 GCA_002902505.1.tsv
-rw-rw-r-- 1 user user   96 Apr 14 15:37 GCA_002902505.1.txt
-rw-rw-r-- 1 user user 643K Apr 14 15:37 GCA_002902505.1.val
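
A quick sanity check of these outputs is to count the predicted proteins and tally the feature types. The sketch below is only illustrative: the function name summarize_prokka is hypothetical, and it assumes that the .tsv file starts with a header line and lists the feature type (CDS, tRNA, ...) in its second tab-separated column, which should be confirmed against the files produced by your Prokka version.

```shell
#!/bin/sh
set -eu

# Summarize one Prokka output folder (hypothetical helper).
#   $1 = output folder (--outdir), $2 = file prefix (--prefix)
summarize_prokka() {
    out_dir="$1"
    prefix="$2"

    # Each protein in the .faa file starts with a '>' header line.
    # 'grep -c' exits non-zero when the count is 0, hence '|| true'.
    n_proteins="$(grep -c '^>' "$out_dir/$prefix.faa" || true)"
    printf 'proteins\t%s\n' "$n_proteins"

    # Tally feature types from the .tsv table, skipping the header line.
    # Assumption: the feature type is in column 2.
    awk -F '\t' 'NR > 1 { n[$2]++ }
                 END { for (t in n) printf "%s\t%d\n", t, n[t] }' \
        "$out_dir/$prefix.tsv" | sort
}
```

For the run above, this would be called as `summarize_prokka GCA_002902505.1 GCA_002902505.1`.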

Reference

Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. DOI:10.1093/bioinformatics/btu153