Hello, I am having trouble removing redundancy from my assemblies. Could you give me some advice?
I am doing routine transcriptome-based virus discovery. After removing host sequences, I assemble each dataset downloaded from the SRA separately, and I plan to pool these assemblies for candidate virus identification by alignment. For redundancy removal I found two MMseqs2 workflows, easy-cluster and easy-linclust, as well as cd-hit-est. I am not sure whether MMseqs2 is suitable for de-duplicating and merging my transcriptome assemblies. If I want a stricter clustering threshold, which parameters should I pay attention to? I would appreciate your help.
I have already tried mmseqs2 easy-linclust, which is much faster than cd-hit-est:
mmseqs easy-linclust virus.candidate.fasta mmseqs.cluster ./mmseqs.tmp --threads 60
This produced mmseqs.cluster_all_seqs.fasta, mmseqs.cluster_cluster.tsv, and mmseqs.cluster_rep_seq.fasta. I understand that mmseqs.cluster_rep_seq.fasta is the de-duplicated result, but I also want the cluster membership so that I can see how each virus sequence is distributed across samples. Which file should I look at, or which parameters should I set?
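To make the goal concrete, here is a minimal sketch of the per-sample summary I have in mind. It assumes that mmseqs.cluster_cluster.tsv holds two tab-separated columns (cluster representative ID, member ID), and that every contig ID starts with its sample accession followed by an underscore, e.g. SRR0000001_contig_1. That naming scheme and the helper names sample_of and cluster_sample_counts are hypothetical and only for illustration; real headers may need a different rule.

```python
import csv
from collections import defaultdict


def sample_of(seq_id, sep="_"):
    """Return the sample prefix of a contig ID.

    Assumption (hypothetical naming scheme): IDs look like '<sample>_<contig>'.
    """
    return seq_id.split(sep, 1)[0]


def cluster_sample_counts(tsv_path):
    """Map each cluster representative to {sample: number of member contigs}."""
    counts = defaultdict(lambda: defaultdict(int))
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            representative, member = row[0], row[1]
            counts[representative][sample_of(member)] += 1
    # Convert the nested defaultdicts to plain dicts for easier printing.
    return {rep: dict(samples) for rep, samples in counts.items()}
```

Calling cluster_sample_counts("mmseqs.cluster_cluster.tsv") would then give, for each representative sequence, the samples that contributed members to its cluster and how many contigs each contributed.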