Parallel cram2bam#2015
Open
jkbonfield wants to merge 4 commits into
Open
Conversation
This means the creation of BAM objects is done in the worker thread rather than in the main thread. However we still incur the cost of copy that data over
- Cache CG tag as sam_tag2cigar is a significant main thread cost - Check BAM memory policy Signed-off-by: James Bonfield <jkb@sanger.ac.uk>
- Spare_bams is now bam_list as we use the struct directly in slices too. - We no longer use a single malloc block of memory for the entire bam list. - We now copy the struct and swap the pointers in cram_get_bam_seq (provided bam_get_mempolicy is ok). This is now workable as every bam data object is its own malloc block. It's not as efficient as before, but it's safe.
This gives us just one malloc instead of 2 mallocs per BAM object, which still gives us the ability to do pointer-swapping on the bam->data but without paying the penalty of the bam struct allocation. However it's still slow than d62b978 (even with bam_copy1 always in use) with very large numbers of threads simply due to the sheer pressure that bam_init1 puts on the malloc system. It's faster to memcpy than it is to malloc with lots of threads. In glibc it can be partially mitigated with export MALLOC_TOP_PAD_=100000000 export MALLOC_MMAP_MAX_=0 Similarly using libtcmalloc.so mitigates it a bit, but it's still slower here than earlier when using -@128 on my test file. -@16 gives a faster time however, so we need to figure out a better balance somewhere. Signed-off-by: James Bonfield <jkb@sanger.ac.uk>
daviesrob
reviewed
May 29, 2026
Member
daviesrob
left a comment
There was a problem hiding this comment.
Looks like this does a good job of moving more work into the thread pool. Would it be worth doing the CG processing there too? I think it would just need bam_tag2cigar() to be made non-static so that it can be called from cram/cram_decode.c.
|
|
||
| for (i = 0; i < s->hdr->num_records; i++) { | ||
| int sz = bam_size(bfd, fd, &s->crecs[i]); | ||
| realloc_bam_data(&s->bl->bams[i], sz); |
Member
There was a problem hiding this comment.
Is this needed? bam_set1() (via cram2bam()) would calculate the size and do the same realloc_bam_data() call, so this is just moving that work a bit earlier (and duplicating some of it).
| if (bl) { | ||
| // Reuse an old bam list, possibly growing it | ||
| if (s->hdr->num_records > bl->nbams) { | ||
| bl->bams = realloc(bl->bams, s->hdr->num_records * |
Member
There was a problem hiding this comment.
As #2006 has been merged, it might be good to rebase this so it can take advantage of the hts_realloc_p() interface.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The main thread does a lot of work in
cram_to_bamandbam_set1. Io_lib's model for this was to put this code into the worker threads. I thought I'd done that here too (I have for the writer already), but apparently not. This PR fixes that.A profile of the main thread:
With this PR:
Wall clock benchmarks with 32 cores show a doubling in throughput (making it slightly faster than BAM, but that's due to the same main-thread excessive work problem there too):