多序列比对结果的串联

Published at 2020-04-18 11:24

Author：zhixy

在Phylogenomics分析中，基于SuperMatrix的物种树构建是最主要的方法。所谓SuperMatrix，即将单个基因/蛋白的多序列比对结果（可视为matrix）进行串联，以形成SuperMatrix。借此输入进化树软件，如IQ-TREE，构建物种树。

在序列串联中，考虑到序列比对结果中，每条序列可能出现换行符（这是常见的alignment呈现格式，一般为60/80字符一行）。因此需要有一个处理序列换行符的方法。示例如下：

def remove_return_from_sequence(sequences_set):
    sequences_long_str = ""
    for line in sequences_set:
        if ">" in line:
            sequences_long_str = line.strip()">" #在序列ID行末尾（除去换行符）追加一个">"
        else:
            sequences_long_str = line.strip()
        sequences_set_new = sequences_long_str.split(">") #长字符串按">"分隔，实现同一序列的合并，即实现序列中换行符的去除
        sequences_set_new.pop(0) # 因第一条序列的">"在分隔时，会让结果列表中的第一个元素为一个空字符串 ""，所以用 pop() 去除
        sequences_set_new_output = list() # 让输出结果在格式上与输入sequences_set保持一致
        for i in range(0, len(sequences_set_new)/2):
            sequences_set_new_output.append(">"sequences_set_new[2*i]"\n")
            sequences_set_new_output.append(sequences_set_new[2*i1]"\n")
    return sequences_set_new_output

串联的一种的算法(Python)如下：

import os

def read_alignment(alignment_file_name):
    file = open(alignment_file_name)
    alignment = file.readlines()
    file.close()
    return remove_return_from_sequence(alignment)

files = os.listdir("./alignments") # 获取alignments文件中的文件名，每一个文件对应一个基因/蛋白的比对结果

genome_names = []
# 存放所有比对结果中的genome id，其中的genome id，应该在每一个基因/蛋白的比对结果出现且出现一次，
# 否则对应的基因/蛋白则不符合单拷贝直系同源的标准。

alignments_all = []
# 存放所有比对结果，需要注意的是其中的元素顺序，与files列表中的元素顺序，即基因/蛋白的名称顺序一致

for fname in files:
    # fname = 基因/蛋白名
    alignment = read_alignment(fname)
    alignments_all.append(alignment)
    for i in range(0,len(alignment)/2):
        sequence_id = alignment[2*i]
        genome_id = sequence_id.split("|")[0]
        if genome_id not in genome_names:
            genome_names.append(genome_id)

concatination = [] # 存放串联结果
for gname in genome_names:
    concatenated_seq = ""
    for alignment in alignments_all:
        for i in range(0,len(alignment)/2):
            sequence_id = alignment[2*i]
            sequence = alignment[2*i1]
            genome_id = sequence_id.split("|")[0] # 注意其中包含">"符号
            if genome_id == gname:
                concatenated_seq = sequence
    concatination.append(">"  gname  "\n")
    concatination.append(concatenated_seq  "\n")

# 串联结果写出
file = open(os.getcwd()"/concatenation.fasta","w")
for line in concatination:
    file.write(line)
file.close()

去除换行符的，另一简单方法如下：

seq_dict = {}

for line in sequences: # sequences 序列文件读入的结果
    if line.startwith(">"):
        seqid = line.strip()
        seq_dict[seqid] = ''
    else:
        seq_dict[seqid] = line.strip()

Columns

Python ICNP Others R Linux Phylogenetics Phylogenomics Genomics Reference Evolution Bioinformatics Protocols Metagenomics Statistics

A Lab of Microbial Systematics and Evolution

多序列比对结果的串联

Columns

Search Document