Dereplicating microbial genomes with dRep

Published at 2023-05-29 11:18

Author: liujr

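1. Introduction

dRep uses Mash and FastANI to compute genome distances and average nucleotide identity (ANI). It first groups genomes into coarse primary clusters using the fast but approximate Mash distance, then runs the slower, more accurate ANI comparison only within each primary cluster, which greatly reduces the number of pairwise comparisons and the overall run time. dRep clusters the genomes and selects a representative genome for each cluster, so a genome set can be dereplicated quickly and only high-quality genomes are carried into downstream analyses.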

2. Installation

dRep can be installed with pip or conda:

From PyPI:

$ pip install drep

From source:

$ git clone https://github.com/MrOlm/drep.git
$ cd drep
$ pip install .

From bioconda:

$ conda config --add channels bioconda; conda install drep

For more installation details, see

https://drep.readthedocs.io/en/latest/installation.html

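3. Usage

3.1 Quick start

To run with default parameters, give the subcommand (dereplicate), then the output directory (out_directory), then the input FASTA files. A wildcard (*) selects every matching file in a directory: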

[user@server ~] # dRep dereplicate out_directory -g path/to/genomes/*.fasta
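With thousands of genomes, wildcard expansion can exceed the shell's command-line length limit. According to dRep's help text, `-g` also accepts a text file with one genome path per line. The following is a minimal sketch of building such a list with `find`; the directory and file names are hypothetical placeholders, and the dRep call is left as a comment so the snippet runs without dRep installed.

```shell
set -eu

# Hypothetical demo directory with empty placeholder genomes; in practice,
# point genome_dir at your own directory of FASTA files.
workdir=$(mktemp -d)
genome_dir="$workdir/genomes"
mkdir -p "$genome_dir"
touch "$genome_dir/binA.fasta" "$genome_dir/binB.fasta" "$genome_dir/notes.txt"

# Collect the FASTA paths, one per line, in a stable order.
genome_list="$workdir/genome_list.txt"
find "$genome_dir" -maxdepth 1 -type f -name '*.fasta' | sort > "$genome_list"

n_genomes=$(wc -l < "$genome_list" | tr -d ' ')
echo "Listed $n_genomes genomes in $genome_list"

# With real genomes, the list file takes the place of the wildcard:
# dRep dereplicate out_directory -g "$genome_list"
```

Only files ending in `.fasta` are listed here; adjust the `-name` pattern if your genomes use another extension such as `.fa` or `.fna`.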

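3.2 Common parameters

-pa ANI threshold for forming primary (Mash) clusters (default: 0.9)
-sa ANI threshold for forming secondary (FastANI) clusters (default: 0.95 in dRep v3; 0.99 in earlier releases)
--genomeInfo import genome quality information from an external CSV file, for example CheckM results; the file only needs the genome name plus completeness, contamination and strain heterogeneity

Adjust these parameters to suit the genomes at hand to get a better dereplication result. For example, a -sa of 0.95 is commonly used for roughly species-level clusters, while a higher value such as 0.99 keeps closely related strains in separate clusters.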
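As an illustration of the `--genomeInfo` input, the sketch below converts a simplified CheckM-style summary into a CSV. Per dRep's help text, the file needs the columns `genome` (the FASTA file's basename, including its extension), `completeness` and `contamination`. The layout of the input table, the `.fasta` suffix and the `strain_heterogeneity` column name are assumptions made for this example, and the values are made up.

```shell
set -eu

workdir=$(mktemp -d)
summary="$workdir/checkm_summary.tsv"

# Made-up, simplified CheckM-style summary (tab-separated): bin ID,
# completeness, contamination, strain heterogeneity. A real CheckM
# table has more columns, so adjust the field numbers accordingly.
printf 'Bin Id\tCompleteness\tContamination\tStrain heterogeneity\n' > "$summary"
printf 'binA\t98.50\t1.20\t0.00\n' >> "$summary"
printf 'binB\t76.10\t4.75\t25.00\n' >> "$summary"

# Rewrite it as a CSV keyed by FASTA file name (assumed suffix: .fasta).
awk -F'\t' -v ext=".fasta" '
    BEGIN  { OFS = ","; print "genome", "completeness", "contamination", "strain_heterogeneity" }
    NR > 1 { print $1 ext, $2, $3, $4 }
' "$summary" > "$workdir/genomeInfo.csv"

cat "$workdir/genomeInfo.csv"

# The CSV could then be passed to dRep together with example thresholds:
# dRep dereplicate out_directory -g path/to/genomes/*.fasta \
#     --genomeInfo "$workdir/genomeInfo.csv" -pa 0.9 -sa 0.95
```

The names in the `genome` column must match the input FASTA file names exactly, otherwise dRep cannot link the quality values to the genomes.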

3.3 Output

The output directory contains:

data/ holds the intermediate results of each dRep step, including

checkM
Clustering_files
fastANI_files
MASH_files
prodigal

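data_tables/ holds summary tables of basic genome information, including completeness, contamination, strain heterogeneity and genome length.

dereplicated_genomes/ holds the genomes that remain after dereplication; this is dRep's main result.

figures/ holds the plots, including:

Primary clustering dendrogram: Primary_clustering_dendrogram.pdf
Secondary clustering dendrograms: Secondary_clustering_dendrograms.pdf
Cluster scores: Cluster_scoring.pdf
Genome comparison statistics: Clustering_scatterplots.pdf
The "best" genome of each replicate set, plus a few quick overall statistics: Winning_genomes.pdf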
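The tables in data_tables/ can also be summarized from the command line. The sketch below tallies genomes per secondary cluster from a made-up excerpt styled after data_tables/Cdb.csv. The column names (`genome`, `secondary_cluster`, `primary_cluster`) are assumed from dRep's output and may differ between versions, so the script looks up the column by its header name instead of a fixed position.

```shell
set -eu

workdir=$(mktemp -d)

# Made-up excerpt in the style of data_tables/Cdb.csv (not real results).
cat > "$workdir/Cdb.csv" <<'EOF'
genome,secondary_cluster,primary_cluster
binA.fasta,1_1,1
binB.fasta,1_1,1
binC.fasta,1_2,1
binD.fasta,2_0,2
EOF

# Find the secondary_cluster column in the header, then count genomes per cluster.
cluster_sizes=$(awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "secondary_cluster") col = i; next }
    { count[$col]++ }
    END { for (c in count) print c, count[c] }
' "$workdir/Cdb.csv" | sort)

echo "$cluster_sizes"

n_clusters=$(printf '%s\n' "$cluster_sizes" | wc -l | tr -d ' ')
echo "$n_clusters secondary clusters"
```

Since dRep keeps one representative per cluster, the number of secondary clusters should match the number of genomes in dereplicated_genomes/, which makes this a quick sanity check on a finished run.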

4. References

Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. The ISME Journal 11, 2864-2868, doi:10.1038/ismej.2017.126 (2017).
