Skip to content

Parallel cram2bam#2015

Open
jkbonfield wants to merge 4 commits into
samtools:developfrom
jkbonfield:parallel-cram2bam
Open

Parallel cram2bam#2015
jkbonfield wants to merge 4 commits into
samtools:developfrom
jkbonfield:parallel-cram2bam

Conversation

@jkbonfield
Copy link
Copy Markdown
Contributor

The main thread does a lot of work in cram_to_bam and bam_set1. Io_lib's model for this was to put this code into the worker threads. I thought I'd done that here too (I have for the writer already), but apparently not. This PR fixes that.

A profile of the main thread:

$ ./test/test_view.dev -B -@32  ~/scratch/data/novaseq.10m.cram & perf record -t $! 2>/dev/null;perf report -n 2>/dev/null| head -20 | egrep -v '^#'
[1] 3634487
[1]+  Done                    ./test/test_view.dev -B -@32 ~/scratch/data/novaseq.10m.cram
    49.75%          2100  test_view.dev  test_view.dev      [.] bam_set1
     9.51%           407  test_view.dev  test_view.dev      [.] cram_to_bam
     5.93%           266  test_view.dev  libc.so.6          [.] __memmove_avx512_unaligned_erms
     3.13%           129  test_view.dev  libc.so.6          [.] __strncpy_evex
     2.50%            92  test_view.dev  [unknown]          [k] 0xffffffff82cb9b7e
     2.37%            97  test_view.dev  test_view.dev      [.] cram_get_seq
     2.25%            92  test_view.dev  test_view.dev      [.] sam_read1
     2.11%            92  test_view.dev  test_view.dev      [.] bam_cigar2rqlens
     1.65%            64  test_view.dev  test_view.dev      [.] cram_get_bam_seq

With this PR:

@ qpg-gpu-01[samtools.../htslib]; ./test/test_view.cram2bam -B -@32  ~/scratch/data/novaseq.10m.cram & perf record -t $! 2>/dev/null;perf report -n 2>/dev/null| head -20 | egrep -v '^#'
[1] 3634842
[1]+  Done                    ./test/test_view.cram2bam -B -@32 ~/scratch/data/novaseq.10m.cram
    15.65%           101  test_view.cram2  test_view.cram2bam  [.] cram_get_bam_seq
    11.31%            65  test_view.cram2  libc.so.6           [.] _int_free
     6.69%            38  test_view.cram2  [unknown]           [k] 0xffffffff829076fb
     5.17%            34  test_view.cram2  test_view.cram2bam  [.] cram_get_seq
     5.12%            29  test_view.cram2  [unknown]           [k] 0xffffffff8291f048
     4.73%            32  test_view.cram2  test_view.cram2bam  [.] sam_read1
     4.24%            24  test_view.cram2  [unknown]           [k] 0xffffffff828d0d41
     4.23%            24  test_view.cram2  [unknown]           [k] 0xffffffff8292a32a
     3.35%            19  test_view.cram2  [unknown]           [k] 0xffffffff82929c09

Wall clock benchmarks with 32 cores show a doubling in throughput (making it slightly faster than BAM, but that's due to the same main-thread excessive work problem there too):

$ for i in `seq 1 10`;do /usr/bin/time -f '%e %U %S'  ./test/test_view.dev 
-B -@32  ~/scratch/data/novaseq.10m.cram;done
1.14 10.34 0.76
1.06 9.13 0.69
1.11 9.42 0.72
1.12 10.20 0.62
1.00 8.72 0.57
1.23 8.85 0.77
1.46 9.13 0.79
1.17 8.67 0.63
1.30 8.73 0.69
0.99 9.64 0.54


$ for i in `seq 1 10`;do /usr/bin/time -f '%e %U %S'  ./test/test_view.cram2bam -B -@32  ~/scratch/data/novaseq.10m.cram;done
0.50 11.83 0.98
0.50 10.32 0.82
0.45 10.13 0.82
0.42 10.10 0.78
0.50 10.19 0.87
0.47 10.07 0.81
0.48 10.15 0.98
0.49 10.03 1.00
0.51 10.06 0.85
0.49 10.28 0.76

This means the creation of BAM objects is done in the worker thread
rather than in the main thread.  However we still incur the cost of
copy that data over
- Cache CG tag as sam_tag2cigar is a significant main thread cost
- Check BAM memory policy

Signed-off-by: James Bonfield <jkb@sanger.ac.uk>
- Spare_bams is now bam_list as we use the struct directly in slices too.
- We no longer use a single malloc block of memory for the entire bam list.
- We now copy the struct and swap the pointers in cram_get_bam_seq
  (provided bam_get_mempolicy is ok).  This is now workable as every
  bam data object is its own malloc block.

It's not as efficient as before, but it's safe.
This gives us just one malloc instead of 2 mallocs per BAM object,
which still gives us the ability to do pointer-swapping on the
bam->data but without paying the penalty of the bam struct
allocation.

However it's still slow than d62b978
(even with bam_copy1 always in use) with very large numbers of threads
simply due to the sheer pressure that bam_init1 puts on the malloc
system.  It's faster to memcpy than it is to malloc with lots of
threads.

In glibc it can be partially mitigated with

    export MALLOC_TOP_PAD_=100000000
    export MALLOC_MMAP_MAX_=0

Similarly using libtcmalloc.so mitigates it a bit, but it's still
slower here than earlier when using -@128 on my test file.  -@16 gives
a faster time however, so we need to figure out a better balance
somewhere.

Signed-off-by: James Bonfield <jkb@sanger.ac.uk>
@daviesrob daviesrob self-assigned this May 21, 2026
Copy link
Copy Markdown
Member

@daviesrob daviesrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this does a good job of moving more work into the thread pool. Would it be worth doing the CG processing there too? I think it would just need bam_tag2cigar() to be made non-static so that it can be called from cram/cram_decode.c.

Comment thread cram/cram_decode.c

for (i = 0; i < s->hdr->num_records; i++) {
int sz = bam_size(bfd, fd, &s->crecs[i]);
realloc_bam_data(&s->bl->bams[i], sz);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? bam_set1() (via cram2bam()) would calculate the size and do the same realloc_bam_data() call, so this is just moving that work a bit earlier (and duplicating some of it).

Comment thread cram/cram_decode.c
if (bl) {
// Reuse an old bam list, possibly growing it
if (s->hdr->num_records > bl->nbams) {
bl->bams = realloc(bl->bams, s->hdr->num_records *
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As #2006 has been merged, it might be good to rebase this so it can take advantage of the hts_realloc_p() interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants