RUBY-3894 Fix heap buffer overflow in put_string and related methods#372
Conversation
RSTRING_LEN returns long, but the native string encoder was storing the result in int32_t. A string of exactly 2**31 bytes silently wraps to INT32_MIN, which is then passed to rb_bson_utf8_validate as size_t, wrapping to near UINT64_MAX and driving reads far past the heap. Change the length variable to long in pvt_bson_encode_to_utf8, rb_bson_byte_buffer_put_string, rb_bson_byte_buffer_put_cstring, rb_bson_byte_buffer_put_symbol, and pvt_put_bson_key. Add an explicit bounds check in each before the value is narrowed back to int32_t for the BSON wire format (BSON strings are limited to INT32_MAX bytes by spec). Strings exceeding that limit now raise ArgumentError. Regression tests are in spec/bson/string_length_overflow_spec.rb, gated behind STRESS=1 since each example requires ~2 GB of free memory.
There was a problem hiding this comment.
Pull request overview
This PR addresses a critical memory-safety issue in the BSON Ruby native extension where Ruby string lengths (RSTRING_LEN, a long) were truncated into int32_t, potentially wrapping into enormous size_t values and causing out-of-bounds reads during UTF-8 validation and write operations.
Changes:
- Adds a centralized length guard (
pvt_check_string_length) and switches length variables fromint32_ttolongin several string-writing paths, raisingArgumentErrorwhen oversized. - Narrows to
int32_tonly after validation to match BSON wire format constraints. - Adds stress-gated regression specs covering
put_string,put_cstring, andput_symboloverflow protection.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
ext/bson/write.c |
Prevents int32 truncation of Ruby string lengths and adds explicit bounds checks before encoding/writing. |
spec/bson/string_length_overflow_spec.rb |
Adds STRESS-gated regression coverage for oversized string/symbol inputs triggering the prior overflow path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The binary-string encoding path writes length + 5 bytes total (4-byte int32 length prefix + payload + 1-byte null terminator). Allowing length == INT32_MAX caused that addition to overflow signed 32-bit arithmetic, which is undefined behavior in C and could in principle allow a miscompiled bounds check to reintroduce the class of memory safety issue this fix set out to close.
| VALUE utf8_string; | ||
| const char *str; | ||
| int32_t length; | ||
| long length; |
There was a problem hiding this comment.
Why do we want to change this?
There was a problem hiding this comment.
It's because RSTRING_LEN returns long, not int32_t, so the result could (in extreme cases) be cast from long to a large negative int32_t, which could (potentially) cause problems.
Description
RSTRING_LENreturnslong, but five sites inext/bson/write.cstored the result inint32_t, silently truncating strings ≥ 2³¹ bytes to a negative value. That value was then passed torb_bson_utf8_validateassize_t, wrapping to nearUINT64_MAXand driving reads far past the heap allocation (confirmed heap-buffer-overflow with AddressSanitizer).pvt_bson_encode_to_utf8,rb_bson_byte_buffer_put_string,rb_bson_byte_buffer_put_cstring,rb_bson_byte_buffer_put_symbol, andpvt_put_bson_keyby changing the length variable tolongand adding an explicit> INT32_MAXbounds check before narrowing back toint32_tfor the BSON wire format. Strings exceeding the limit now raiseArgumentError.spec/bson/string_length_overflow_spec.rb, gated behindSTRESS=1(each example requires ~2 GB of free memory).