Add nucleotide database support to result2msa#1084
Open
antonvnv wants to merge 2 commits intosoedinglab:masterfrom
Open
Add nucleotide database support to result2msa#1084antonvnv wants to merge 2 commits intosoedinglab:masterfrom
antonvnv wants to merge 2 commits intosoedinglab:masterfrom
Conversation
result2msa (and result2profile) were written for amino acid workflows, with all scoring parameters hardcoded to aminoacid(). They crash when used with nucleotide search results (--search-type 3) without -a. The nucleotide search pipeline (blastn.sh) always runs offsetalignment as a final step, which writes 14-column output (10 base fields + 4 ORF position fields) without backtrace, or 15 columns with backtrace. The existing code assumed columns > 10 implies backtrace is present. For 14-column nucleotide results, it would parse the record but find an empty backtrace, causing result2msa to emit "DUMMY" placeholders. The backtrace recomputation fallback also failed for nucleotides: it used an amino acid SubstitutionMatrix (wrong aa2num mapping for ACGT) and passed INT_MAX as the diagonal, but the banded nucleotide aligner requires a valid diagonal. With -a the output was likely correct in practice, since padded databases bypass subMat->aa2num via activePrimaryRemap and the recomputation path is never entered. However the wrong matrix was still passed to MsaFilter, MultipleAlignment, and other components. Fix (result2msa only): - Detect DBTYPE_NUCLEOTIDES and use NucleotideMatrix with nucleotide gap penalties instead of the amino acid SubstitutionMatrix. - Check for backtrace by exact column count (11 or 15) rather than assuming columns > 10 means backtrace is present. - When recomputing a missing backtrace from a parsed record (14 columns), derive the diagonal from the parsed alignment coordinates (dbStartPos - qStartPos) and the strand from qStartPos > qEndPos. result2profile is not fixed: PSSMCalculator is fundamentally hardcoded to a 20-letter amino acid alphabet (PROFILE_AA_SIZE), with all its internal pseudocount, prior, and scoring arrays sized accordingly. A nucleotide matrix (alphabetSize=5) causes out-of-bounds access in PSSMCalculator and SubstitutionMatrix::calcGlobalAaBiasCorrection. Fixing this would require either refactoring PSSMCalculator to be alphabet-generic or writing a separate nucleotide profile calculator.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
result2msa (and result2profile) were written for amino acid workflows,
with all scoring parameters hardcoded to aminoacid(). They crash when
used with nucleotide search results (--search-type 3) without -a.
The nucleotide search pipeline (blastn.sh) always runs offsetalignment
as a final step, which writes 14-column output (10 base fields + 4 ORF
position fields) without backtrace, or 15 columns with backtrace.
The existing code assumed columns > 10 implies backtrace is present.
For 14-column nucleotide results, it would parse the record but find
an empty backtrace, causing result2msa to emit "DUMMY" placeholders.
The backtrace recomputation fallback also failed for nucleotides: it
used an amino acid SubstitutionMatrix (wrong aa2num mapping for ACGT)
and passed INT_MAX as the diagonal, but the banded nucleotide aligner
requires a valid diagonal.
With -a the output was likely correct in practice, since padded
databases bypass subMat->aa2num via activePrimaryRemap and the
recomputation path is never entered. However the wrong matrix was
still passed to MsaFilter, MultipleAlignment, and other components.
Fix (result2msa only):
Detect DBTYPE_NUCLEOTIDES and use NucleotideMatrix with nucleotide
gap penalties instead of the amino acid SubstitutionMatrix.
Check for backtrace by exact column count (11 or 15) rather than
assuming columns > 10 means backtrace is present.
When recomputing a missing backtrace from a parsed record (14
columns), derive the diagonal from the parsed alignment coordinates
(dbStartPos - qStartPos) and the strand from qStartPos > qEndPos.
result2profile is not fixed: PSSMCalculator is fundamentally hardcoded
to a 20-letter amino acid alphabet (PROFILE_AA_SIZE), with all its
internal pseudocount, prior, and scoring arrays sized accordingly. A
nucleotide matrix (alphabetSize=5) causes out-of-bounds access in
PSSMCalculator and SubstitutionMatrix::calcGlobalAaBiasCorrection.
Fixing this would require either refactoring PSSMCalculator to be
alphabet-generic or writing a separate nucleotide profile calculator.