change to have hts wide depth filtering by vasudeva8 · Pull Request #2016 · samtools/htslib

vasudeva8 · 2026-05-21T09:59:43Z

This implements a caching mechanism for hts wide depth filtering.
Reads are added to a cache and on tid change/eof/a window size, the
cached reads are processed to find which reads should be used.
Any read for which pair is present in the window will also be selected.
At end of processing, selected reads are sent to user and read continues
until next process event.

The data is expected to be sorted by position.

The cache is allocated first and increased as and when required. Reads have their
order of read set in them. They are kept in order of position and based on length
when position are same. Internally it is a linked list where reads are cached.
Selected reads are moved to a selected list and others to non-selected list.
Any read selected out of turn from non-selected list, due to pair processing, are
moved to an insert list. Post processing, reads are retrieved from selected and insert
list based on their ordinal. Once all selected are transferred, the read and cache cycle
continues.

The same works with iterators as well, with new hts_itr_usecache method.

jkbonfield · 2026-05-28T11:43:10Z

It's working on some real world data, but often significantly over-stepping the depth. Eg overshoot.sam.gz from #1084

Region     orig         m=1000  m=100   m=10    m=1                             
104524809  1-4591       1-1045  1-156   1-67    1-67  (mpileup -d $depth)                            
                        1-4591  1-3509  1-529   1-59  (-i hts_maxdepth=$depth)

I then produced some synthetic data and can verify this.
The awk summary is showing "position depth" or "range depth", for brevity.

perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=($i>=450 && $i<=454)?200:1;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}'| ./test/test_view -i hts_maxdepth=200 - | samtools depth -|awk 'BEGIN {p=1;l=0} {if ($3==l) next;if (p==$2-1) {printf("%d         \t%d\n",p,l)} else {printf("%d - %d   \t%d\n",p,$2-1,l)};p=$2;l=$3}'|awk '$1>=100 && $1<1000'
100 - 449   	100
450         	299
451         	498
452         	697
453         	896
454 - 549   	1095
550         	896
551         	697
552         	498
553         	299
554 - 999   	100

This basically lets all reads through the cache, and it's the same depth as unfiltered data.

Compare to the mpileup equivalent:

perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=($i>=450 && $i<=454)?200:1;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}'| samtools mpileup -d 200 -| cut -f 1,2,4|awk 'BEGIN {p=1;l=0} {if ($3==l) next;if (p==$2-1) {printf("%d         \t%d\n",p,l)} else {printf("%d - %d   \t%d\n",p,$2-1,l)};p=$2;l=$3}'|awk '$1>=100 && $1<1000'
[mpileup] 1 samples in 1 input files
100 - 449   	100
450 - 549   	199
550 - 999   	100

jkbonfield · 2026-05-28T13:40:36Z

I can verify on my simulated amplicon sequencing data it works.

Simulated with:

#!/usr/bin/perl -w
use strict;

my $glen = 5000; # genome length
my $tlen = 350;  # template length
my $rlen = 250;  # read length
my $dist = 300;  # distance between amplicons.  <= $tlen
my $depth = 1000; # number of templates per amplicon

print "\@SQ\tSN:1\tLN:$glen\n";
my $seq = "A" x $rlen;
my $qual = "I" x $rlen;
my $tnum = 0;
for (my $pos = 1; $pos < $glen; $pos+=$dist) {
    my $pright = $pos + $tlen - $rlen;

    for (my $i = 0; $i < $depth; $i++) {
	# Left: 99 = PAIRED,PROPER_PAIR,MREVERSE,READ1
	print "read_$tnum\t99\t1\t$pos\t60\t${rlen}M\t=\t$pright\t$tlen\t$seq\t$qual\n";

	# Right: 147 = PAIRED,PROPER_PAIR,REVERSE,READ2
	print "read_$tnum\t147\t1\t$pright\t60\t${rlen}M\t=\t$pos\t-$tlen\t$seq\t$qual\n";

	$tnum++;
    }
}

We plot the raw depth ("_orig"), mpileup -d 100, and this PR with test_view -hts_maxdepth=1 | samtools depth.

This was one scenario that broke the original depth reduction code by crashing down to depth 1 following an over-deep section. The new PR doesn't have this and the depth neatly tracks the up/down nature we saw in the unfiltered data, but at a reduced scale (so 1000 / 2000 alternating depth becomes 100 / 200 alternating depth). This is ideal.

rebased and updated

dda5c0b

vasudeva8 force-pushed the dpth5 branch from 904e5b9 to dda5c0b Compare May 21, 2026 10:03

daviesrob assigned jkbonfield May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

change to have hts wide depth filtering #2016

change to have hts wide depth filtering #2016
vasudeva8 wants to merge 1 commit into
samtools:developfrom
vasudeva8:dpth5

vasudeva8 commented May 21, 2026

Uh oh!

jkbonfield commented May 28, 2026 •

edited

Loading

Uh oh!

jkbonfield commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vasudeva8 commented May 21, 2026

Uh oh!

jkbonfield commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkbonfield commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jkbonfield commented May 28, 2026 •

edited

Loading