Skip to content

Add nucleotide database support to result2msa#1084

Open
antonvnv wants to merge 2 commits intosoedinglab:masterfrom
antonvnv:result2msa
Open

Add nucleotide database support to result2msa#1084
antonvnv wants to merge 2 commits intosoedinglab:masterfrom
antonvnv:result2msa

Conversation

@antonvnv
Copy link

@antonvnv antonvnv commented Mar 7, 2026

result2msa (and result2profile) were written for amino acid workflows,
with all scoring parameters hardcoded to aminoacid(). They crash when
used with nucleotide search results (--search-type 3) without -a.

The nucleotide search pipeline (blastn.sh) always runs offsetalignment
as a final step, which writes 14-column output (10 base fields + 4 ORF
position fields) without backtrace, or 15 columns with backtrace.

The existing code assumed columns > 10 implies backtrace is present.
For 14-column nucleotide results, it would parse the record but find
an empty backtrace, causing result2msa to emit "DUMMY" placeholders.

The backtrace recomputation fallback also failed for nucleotides: it
used an amino acid SubstitutionMatrix (wrong aa2num mapping for ACGT)
and passed INT_MAX as the diagonal, but the banded nucleotide aligner
requires a valid diagonal.

With -a the output was likely correct in practice, since padded
databases bypass subMat->aa2num via activePrimaryRemap and the
recomputation path is never entered. However the wrong matrix was
still passed to MsaFilter, MultipleAlignment, and other components.

Fix (result2msa only):

  • Detect DBTYPE_NUCLEOTIDES and use NucleotideMatrix with nucleotide
    gap penalties instead of the amino acid SubstitutionMatrix.

  • Check for backtrace by exact column count (11 or 15) rather than
    assuming columns > 10 means backtrace is present.

  • When recomputing a missing backtrace from a parsed record (14
    columns), derive the diagonal from the parsed alignment coordinates
    (dbStartPos - qStartPos) and the strand from qStartPos > qEndPos.

result2profile is not fixed: PSSMCalculator is fundamentally hardcoded
to a 20-letter amino acid alphabet (PROFILE_AA_SIZE), with all its
internal pseudocount, prior, and scoring arrays sized accordingly. A
nucleotide matrix (alphabetSize=5) causes out-of-bounds access in
PSSMCalculator and SubstitutionMatrix::calcGlobalAaBiasCorrection.
Fixing this would require either refactoring PSSMCalculator to be
alphabet-generic or writing a separate nucleotide profile calculator.

antonvnv added 2 commits March 6, 2026 17:08
result2msa (and result2profile) were written for amino acid workflows,
with all scoring parameters hardcoded to aminoacid(). They crash when
used with nucleotide search results (--search-type 3) without -a.

The nucleotide search pipeline (blastn.sh) always runs offsetalignment
as a final step, which writes 14-column output (10 base fields + 4 ORF
position fields) without backtrace, or 15 columns with backtrace.

The existing code assumed columns > 10 implies backtrace is present.
For 14-column nucleotide results, it would parse the record but find
an empty backtrace, causing result2msa to emit "DUMMY" placeholders.

The backtrace recomputation fallback also failed for nucleotides: it
used an amino acid SubstitutionMatrix (wrong aa2num mapping for ACGT)
and passed INT_MAX as the diagonal, but the banded nucleotide aligner
requires a valid diagonal.

With -a the output was likely correct in practice, since padded
databases bypass subMat->aa2num via activePrimaryRemap and the
recomputation path is never entered. However the wrong matrix was
still passed to MsaFilter, MultipleAlignment, and other components.

Fix (result2msa only):

- Detect DBTYPE_NUCLEOTIDES and use NucleotideMatrix with nucleotide
  gap penalties instead of the amino acid SubstitutionMatrix.

- Check for backtrace by exact column count (11 or 15) rather than
  assuming columns > 10 means backtrace is present.

- When recomputing a missing backtrace from a parsed record (14
  columns), derive the diagonal from the parsed alignment coordinates
  (dbStartPos - qStartPos) and the strand from qStartPos > qEndPos.

result2profile is not fixed: PSSMCalculator is fundamentally hardcoded
to a 20-letter amino acid alphabet (PROFILE_AA_SIZE), with all its
internal pseudocount, prior, and scoring arrays sized accordingly. A
nucleotide matrix (alphabetSize=5) causes out-of-bounds access in
PSSMCalculator and SubstitutionMatrix::calcGlobalAaBiasCorrection.
Fixing this would require either refactoring PSSMCalculator to be
alphabet-generic or writing a separate nucleotide profile calculator.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant