Skip to content

change to have hts wide depth filtering #2016

Open
vasudeva8 wants to merge 1 commit into
samtools:developfrom
vasudeva8:dpth5
Open

change to have hts wide depth filtering #2016
vasudeva8 wants to merge 1 commit into
samtools:developfrom
vasudeva8:dpth5

Conversation

@vasudeva8
Copy link
Copy Markdown
Contributor

This implements a caching mechanism for hts wide depth filtering.
Reads are added to a cache and on tid change/eof/a window size, the
cached reads are processed to find which reads should be used.
Any read for which pair is present in the window will also be selected.
At end of processing, selected reads are sent to user and read continues
until next process event.

The data is expected to be sorted by position.

The cache is allocated first and increased as and when required. Reads have their
order of read set in them. They are kept in order of position and based on length
when position are same. Internally it is a linked list where reads are cached.
Selected reads are moved to a selected list and others to non-selected list.
Any read selected out of turn from non-selected list, due to pair processing, are
moved to an insert list. Post processing, reads are retrieved from selected and insert
list based on their ordinal. Once all selected are transferred, the read and cache cycle
continues.

The same works with iterators as well, with new hts_itr_usecache method.

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented May 28, 2026

It's working on some real world data, but often significantly over-stepping the depth. Eg overshoot.sam.gz from #1084

Region     orig         m=1000  m=100   m=10    m=1                             
104524809  1-4591       1-1045  1-156   1-67    1-67  (mpileup -d $depth)                            
                        1-4591  1-3509  1-529   1-59  (-i hts_maxdepth=$depth)                          

I then produced some synthetic data and can verify this.
The awk summary is showing "position depth" or "range depth", for brevity.

perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=($i>=450 && $i<=454)?200:1;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}'| ./test/test_view -i hts_maxdepth=200 - | samtools depth -|awk 'BEGIN {p=1;l=0} {if ($3==l) next;if (p==$2-1) {printf("%d         \t%d\n",p,l)} else {printf("%d - %d   \t%d\n",p,$2-1,l)};p=$2;l=$3}'|awk '$1>=100 && $1<1000'
100 - 449   	100
450         	299
451         	498
452         	697
453         	896
454 - 549   	1095
550         	896
551         	697
552         	498
553         	299
554 - 999   	100

This basically lets all reads through the cache, and it's the same depth as unfiltered data.

Compare to the mpileup equivalent:

perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=($i>=450 && $i<=454)?200:1;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}'| samtools mpileup -d 200 -| cut -f 1,2,4|awk 'BEGIN {p=1;l=0} {if ($3==l) next;if (p==$2-1) {printf("%d         \t%d\n",p,l)} else {printf("%d - %d   \t%d\n",p,$2-1,l)};p=$2;l=$3}'|awk '$1>=100 && $1<1000'
[mpileup] 1 samples in 1 input files
100 - 449   	100
450 - 549   	199
550 - 999   	100

@jkbonfield
Copy link
Copy Markdown
Contributor

I can verify on my simulated amplicon sequencing data it works.

Simulated with:

#!/usr/bin/perl -w
use strict;

my $glen = 5000; # genome length
my $tlen = 350;  # template length
my $rlen = 250;  # read length
my $dist = 300;  # distance between amplicons.  <= $tlen
my $depth = 1000; # number of templates per amplicon

print "\@SQ\tSN:1\tLN:$glen\n";
my $seq = "A" x $rlen;
my $qual = "I" x $rlen;
my $tnum = 0;
for (my $pos = 1; $pos < $glen; $pos+=$dist) {
    my $pright = $pos + $tlen - $rlen;

    for (my $i = 0; $i < $depth; $i++) {
	# Left: 99 = PAIRED,PROPER_PAIR,MREVERSE,READ1
	print "read_$tnum\t99\t1\t$pos\t60\t${rlen}M\t=\t$pright\t$tlen\t$seq\t$qual\n";

	# Right: 147 = PAIRED,PROPER_PAIR,REVERSE,READ2
	print "read_$tnum\t147\t1\t$pright\t60\t${rlen}M\t=\t$pos\t-$tlen\t$seq\t$qual\n";

	$tnum++;
    }
}

We plot the raw depth ("_orig"), mpileup -d 100, and this PR with test_view -hts_maxdepth=1 | samtools depth.

This was one scenario that broke the original depth reduction code by crashing down to depth 1 following an over-deep section. The new PR doesn't have this and the depth neatly tracks the up/down nature we saw in the unfiltered data, but at a reduced scale (so 1000 / 2000 alternating depth becomes 100 / 200 alternating depth). This is ideal.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants