Parallel cram2bam by jkbonfield · Pull Request #2015 · samtools/htslib

jkbonfield · 2026-05-20T16:45:24Z

The main thread does a lot of work in cram_to_bam and bam_set1. Io_lib's model for this was to put this code into the worker threads. I thought I'd done that here too (I have for the writer already), but apparently not. This PR fixes that.

A profile of the main thread:

$ ./test/test_view.dev -B -@32  ~/scratch/data/novaseq.10m.cram & perf record -t $! 2>/dev/null;perf report -n 2>/dev/null| head -20 | egrep -v '^#'
[1] 3634487
[1]+  Done                    ./test/test_view.dev -B -@32 ~/scratch/data/novaseq.10m.cram
    49.75%          2100  test_view.dev  test_view.dev      [.] bam_set1
     9.51%           407  test_view.dev  test_view.dev      [.] cram_to_bam
     5.93%           266  test_view.dev  libc.so.6          [.] __memmove_avx512_unaligned_erms
     3.13%           129  test_view.dev  libc.so.6          [.] __strncpy_evex
     2.50%            92  test_view.dev  [unknown]          [k] 0xffffffff82cb9b7e
     2.37%            97  test_view.dev  test_view.dev      [.] cram_get_seq
     2.25%            92  test_view.dev  test_view.dev      [.] sam_read1
     2.11%            92  test_view.dev  test_view.dev      [.] bam_cigar2rqlens
     1.65%            64  test_view.dev  test_view.dev      [.] cram_get_bam_seq

With this PR:

@ qpg-gpu-01[samtools.../htslib]; ./test/test_view.cram2bam -B -@32  ~/scratch/data/novaseq.10m.cram & perf record -t $! 2>/dev/null;perf report -n 2>/dev/null| head -20 | egrep -v '^#'
[1] 3634842
[1]+  Done                    ./test/test_view.cram2bam -B -@32 ~/scratch/data/novaseq.10m.cram
    15.65%           101  test_view.cram2  test_view.cram2bam  [.] cram_get_bam_seq
    11.31%            65  test_view.cram2  libc.so.6           [.] _int_free
     6.69%            38  test_view.cram2  [unknown]           [k] 0xffffffff829076fb
     5.17%            34  test_view.cram2  test_view.cram2bam  [.] cram_get_seq
     5.12%            29  test_view.cram2  [unknown]           [k] 0xffffffff8291f048
     4.73%            32  test_view.cram2  test_view.cram2bam  [.] sam_read1
     4.24%            24  test_view.cram2  [unknown]           [k] 0xffffffff828d0d41
     4.23%            24  test_view.cram2  [unknown]           [k] 0xffffffff8292a32a
     3.35%            19  test_view.cram2  [unknown]           [k] 0xffffffff82929c09

Wall clock benchmarks with 32 cores show a doubling in throughput (making it slightly faster than BAM, but that's due to the same main-thread excessive work problem there too):

$ for i in `seq 1 10`;do /usr/bin/time -f '%e %U %S'  ./test/test_view.dev 
-B -@32  ~/scratch/data/novaseq.10m.cram;done
1.14 10.34 0.76
1.06 9.13 0.69
1.11 9.42 0.72
1.12 10.20 0.62
1.00 8.72 0.57
1.23 8.85 0.77
1.46 9.13 0.79
1.17 8.67 0.63
1.30 8.73 0.69
0.99 9.64 0.54


$ for i in `seq 1 10`;do /usr/bin/time -f '%e %U %S'  ./test/test_view.cram2bam -B -@32  ~/scratch/data/novaseq.10m.cram;done
0.50 11.83 0.98
0.50 10.32 0.82
0.45 10.13 0.82
0.42 10.10 0.78
0.50 10.19 0.87
0.47 10.07 0.81
0.48 10.15 0.98
0.49 10.03 1.00
0.51 10.06 0.85
0.49 10.28 0.76

This means the creation of BAM objects is done in the worker thread rather than in the main thread. However we still incur the cost of copy that data over

- Cache CG tag as sam_tag2cigar is a significant main thread cost - Check BAM memory policy Signed-off-by: James Bonfield <jkb@sanger.ac.uk>

- Spare_bams is now bam_list as we use the struct directly in slices too. - We no longer use a single malloc block of memory for the entire bam list. - We now copy the struct and swap the pointers in cram_get_bam_seq (provided bam_get_mempolicy is ok). This is now workable as every bam data object is its own malloc block. It's not as efficient as before, but it's safe.

@128

This gives us just one malloc instead of 2 mallocs per BAM object, which still gives us the ability to do pointer-swapping on the bam->data but without paying the penalty of the bam struct allocation. However it's still slow than d62b978 (even with bam_copy1 always in use) with very large numbers of threads simply due to the sheer pressure that bam_init1 puts on the malloc system. It's faster to memcpy than it is to malloc with lots of threads. In glibc it can be partially mitigated with export MALLOC_TOP_PAD_=100000000 export MALLOC_MMAP_MAX_=0 Similarly using libtcmalloc.so mitigates it a bit, but it's still slower here than earlier when using -@128 on my test file. -@16 gives a faster time however, so we need to figure out a better balance somewhere. Signed-off-by: James Bonfield <jkb@sanger.ac.uk>

daviesrob

Looks like this does a good job of moving more work into the thread pool. Would it be worth doing the CG processing there too? I think it would just need bam_tag2cigar() to be made non-static so that it can be called from cram/cram_decode.c.

daviesrob · 2026-05-29T10:33:18Z

+
+    for (i = 0; i < s->hdr->num_records; i++) {
+        int sz = bam_size(bfd, fd, &s->crecs[i]);
+        realloc_bam_data(&s->bl->bams[i], sz);


Is this needed? bam_set1() (via cram2bam()) would calculate the size and do the same realloc_bam_data() call, so this is just moving that work a bit earlier (and duplicating some of it).

daviesrob · 2026-05-29T10:36:37Z

+    if (bl) {
+        // Reuse an old bam list, possibly growing it
+        if (s->hdr->num_records > bl->nbams) {
+            bl->bams = realloc(bl->bams, s->hdr->num_records *


As #2006 has been merged, it might be good to rebase this so it can take advantage of the hts_realloc_p() interface.

jkbonfield added 4 commits May 13, 2026 17:27

WIP: bulk decode CRAM to BAM when threading.

a5830f7

This means the creation of BAM objects is done in the worker thread rather than in the main thread. However we still incur the cost of copy that data over

Improve CRAM decode thread speed further

d62b978

- Cache CG tag as sam_tag2cigar is a significant main thread cost - Check BAM memory policy Signed-off-by: James Bonfield <jkb@sanger.ac.uk>

daviesrob self-assigned this May 21, 2026

daviesrob reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallel cram2bam#2015

Parallel cram2bam#2015
jkbonfield wants to merge 4 commits into
samtools:developfrom
jkbonfield:parallel-cram2bam

jkbonfield commented May 20, 2026

Uh oh!

daviesrob left a comment

Uh oh!

daviesrob May 29, 2026

Uh oh!

daviesrob May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jkbonfield commented May 20, 2026

Uh oh!

daviesrob left a comment

Choose a reason for hiding this comment

Uh oh!

daviesrob May 29, 2026

Choose a reason for hiding this comment

Uh oh!

daviesrob May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants