Add nucleotide database support to result2msa by antonvnv · Pull Request #1084 · soedinglab/MMseqs2

antonvnv · 2026-03-07T01:12:13Z

result2msa (and result2profile) were written for amino acid workflows,
with all scoring parameters hardcoded to aminoacid(). They crash when
used with nucleotide search results (--search-type 3) without -a.

The nucleotide search pipeline (blastn.sh) always runs offsetalignment
as a final step, which writes 14-column output (10 base fields + 4 ORF
position fields) without backtrace, or 15 columns with backtrace.

The existing code assumed columns > 10 implies backtrace is present.
For 14-column nucleotide results, it would parse the record but find
an empty backtrace, causing result2msa to emit "DUMMY" placeholders.

The backtrace recomputation fallback also failed for nucleotides: it
used an amino acid SubstitutionMatrix (wrong aa2num mapping for ACGT)
and passed INT_MAX as the diagonal, but the banded nucleotide aligner
requires a valid diagonal.

With -a the output was likely correct in practice, since padded
databases bypass subMat->aa2num via activePrimaryRemap and the
recomputation path is never entered. However the wrong matrix was
still passed to MsaFilter, MultipleAlignment, and other components.

Fix (result2msa only):

Detect DBTYPE_NUCLEOTIDES and use NucleotideMatrix with nucleotide
gap penalties instead of the amino acid SubstitutionMatrix.
Check for backtrace by exact column count (11 or 15) rather than
assuming columns > 10 means backtrace is present.
When recomputing a missing backtrace from a parsed record (14
columns), derive the diagonal from the parsed alignment coordinates
(dbStartPos - qStartPos) and the strand from qStartPos > qEndPos.

result2profile is not fixed: PSSMCalculator is fundamentally hardcoded
to a 20-letter amino acid alphabet (PROFILE_AA_SIZE), with all its
internal pseudocount, prior, and scoring arrays sized accordingly. A
nucleotide matrix (alphabetSize=5) causes out-of-bounds access in
PSSMCalculator and SubstitutionMatrix::calcGlobalAaBiasCorrection.
Fixing this would require either refactoring PSSMCalculator to be
alphabet-generic or writing a separate nucleotide profile calculator.

result2msa (and result2profile) were written for amino acid workflows, with all scoring parameters hardcoded to aminoacid(). They crash when used with nucleotide search results (--search-type 3) without -a. The nucleotide search pipeline (blastn.sh) always runs offsetalignment as a final step, which writes 14-column output (10 base fields + 4 ORF position fields) without backtrace, or 15 columns with backtrace. The existing code assumed columns > 10 implies backtrace is present. For 14-column nucleotide results, it would parse the record but find an empty backtrace, causing result2msa to emit "DUMMY" placeholders. The backtrace recomputation fallback also failed for nucleotides: it used an amino acid SubstitutionMatrix (wrong aa2num mapping for ACGT) and passed INT_MAX as the diagonal, but the banded nucleotide aligner requires a valid diagonal. With -a the output was likely correct in practice, since padded databases bypass subMat->aa2num via activePrimaryRemap and the recomputation path is never entered. However the wrong matrix was still passed to MsaFilter, MultipleAlignment, and other components. Fix (result2msa only): - Detect DBTYPE_NUCLEOTIDES and use NucleotideMatrix with nucleotide gap penalties instead of the amino acid SubstitutionMatrix. - Check for backtrace by exact column count (11 or 15) rather than assuming columns > 10 means backtrace is present. - When recomputing a missing backtrace from a parsed record (14 columns), derive the diagonal from the parsed alignment coordinates (dbStartPos - qStartPos) and the strand from qStartPos > qEndPos. result2profile is not fixed: PSSMCalculator is fundamentally hardcoded to a 20-letter amino acid alphabet (PROFILE_AA_SIZE), with all its internal pseudocount, prior, and scoring arrays sized accordingly. A nucleotide matrix (alphabetSize=5) causes out-of-bounds access in PSSMCalculator and SubstitutionMatrix::calcGlobalAaBiasCorrection. Fixing this would require either refactoring PSSMCalculator to be alphabet-generic or writing a separate nucleotide profile calculator.

antonvnv added 2 commits March 6, 2026 17:08

Advance util/regression submodule commit

3df5eab

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add nucleotide database support to result2msa#1084

Add nucleotide database support to result2msa#1084
antonvnv wants to merge 2 commits into
soedinglab:masterfrom
antonvnv:result2msa

antonvnv commented Mar 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

antonvnv commented Mar 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant