TPCH round 2 (work in progress) #3128
base: main
Conversation
Claude finished @davidbuniat's task in 58s

**PR Review Complete**

I've reviewed the TPCH optimization changes, focusing on the new zero-copy string handling logic, and found 6 significant issues related to memory safety and concurrency. All issues have been commented inline with specific recommendations.

Note: The zero-copy optimization is well documented and the performance intent is clear, but these issues are critical for correctness and should be addressed before merge.
```cpp
const auto local_idx = range_offsets_[0] + batch_start;
const auto* offsets = offsets_cache_[0];
const auto start = offsets[local_idx];
const auto end = offsets[local_idx + count];
```
**Potential buffer overrun:** `offsets[local_idx + count]` is accessed without bounds checking. If `local_idx + count` exceeds the valid offset array size, this causes undefined behavior.

Add bounds validation before the access: `if (local_idx + count >= offset_array_size) return 0;`
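A minimal sketch of the guard, assuming a hypothetical `offset_array_size_` member that holds the number of entries in the cached offset array (the real name is not visible in the diff):

```cpp
const auto local_idx = range_offsets_[0] + batch_start;
const auto* offsets = offsets_cache_[0];
// offset_array_size_ is an assumed member; valid indices are [0, offset_array_size_).
if (local_idx + count >= offset_array_size_) {
    return 0;  // bail out instead of reading past the end of the offsets array
}
const auto start = offsets[local_idx];
const auto end = offsets[local_idx + count];
```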
```cpp
for (duckdb::idx_t row_in_batch = 0; row_in_batch < batch_size; ++row_in_batch) {
    const auto local_idx = start_idx + row_in_batch;
    const auto str_start = offsets[local_idx] - base_offset;
    const auto str_end = offsets[local_idx + 1] - base_offset;
```
**Potential buffer overrun:** `offsets[local_idx + 1]` is accessed without validating that `local_idx < offset_array_length - 1`. When `row_in_batch` reaches `batch_size - 1`, `local_idx + 1` may exceed the array bounds.

Add an assertion or bounds check before the loop: `ASSERT(start_idx + batch_size < offset_array_length);`
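A sketch of the pre-loop check, assuming a hypothetical `offset_array_length` for the number of entries in the offsets array; the loop's last iteration reads `offsets[start_idx + batch_size]`:

```cpp
#include <cassert>

// offset_array_length is an assumed name for the offsets array entry count.
assert(start_idx + batch_size < offset_array_length);
for (duckdb::idx_t row_in_batch = 0; row_in_batch < batch_size; ++row_in_batch) {
    const auto local_idx = start_idx + row_in_batch;
    const auto str_start = offsets[local_idx] - base_offset;
    const auto str_end = offsets[local_idx + 1] - base_offset;
    // ... construct the string from [str_start, str_end) ...
}
```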
```cpp
for (duckdb::idx_t row_in_batch = 0; row_in_batch < batch_size; ++row_in_batch) {
    const auto local_idx = start_idx + row_in_batch;
    const auto str_start = offsets[local_idx] - base_offset;
    const auto str_end = offsets[local_idx + 1] - base_offset;
```
**Potential buffer overrun (duplicate):** Same issue as line 582: `offsets[local_idx + 1]` is accessed without bounds validation while iterating through `batch_size` elements.

Add a bounds check or assertion before the loop; the sketch above applies here as well.
```cpp
const auto* offsets = offsets_cache_[0];
const auto start_offset = offsets[0];
const auto str_start = offsets[local_idx] - start_offset;
const auto str_end = offsets[local_idx + 1] - start_offset;
```
**Potential buffer overrun:** `offsets[local_idx + 1]` is accessed without bounds checking. The function requires `is_single_chunk() == true` but doesn't validate that `local_idx + 1` is within the valid offset array range.

Add bounds validation before the access to prevent undefined behavior.
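A sketch of the guard, again assuming a hypothetical `offset_array_size_` for the entry count of the single chunk's offset array:

```cpp
#include <stdexcept>

// The single-chunk precondition does not bound local_idx, so validate it
// against the (assumed) offset_array_size_ before dereferencing.
const auto* offsets = offsets_cache_[0];
if (local_idx + 1 >= offset_array_size_) {
    throw std::out_of_range("string offset index out of range");
}
const auto start_offset = offsets[0];
const auto str_start = offsets[local_idx] - start_offset;
const auto str_end = offsets[local_idx + 1] - start_offset;
```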
```cpp
// This allows string_t to point directly into our buffer.
duckdb::StringVector::AddBuffer(
    output_vector,
    duckdb::make_buffer<DeeplakeStringBuffer>(nd::array(samples)));
```
**Potential use-after-free risk:** `nd::array(samples)` creates a copy that is moved into `DeeplakeStringBuffer`, but the original `samples` parameter is an rvalue reference. If the caller doesn't keep the original array alive, the buffer may be freed while DuckDB still references it.

Consider moving `samples` directly, `duckdb::make_buffer<DeeplakeStringBuffer>(std::move(samples))`, to ensure proper ownership transfer.
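A sketch of the suggested ownership transfer; `DeeplakeStringBuffer`, `output_vector`, and `samples` come from the diff, and the fix simply replaces the copy with a move:

```cpp
// Moving samples makes DeeplakeStringBuffer the sole owner of the storage,
// so the string_t values handed to DuckDB stay valid for the buffer's lifetime.
duckdb::StringVector::AddBuffer(
    output_vector,
    duckdb::make_buffer<DeeplakeStringBuffer>(std::move(samples)));
```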
```cpp
if (!batch.initialized_.load(std::memory_order_acquire)) [[unlikely]] {
    std::lock_guard lock(col_data.mutex_);
    for (int64_t i = 0; i <= batch_index; ++i) {
        if (!col_data.batches[i].initialized_.load(std::memory_order_relaxed)) {
```
**Race condition risk:** The double-checked locking pattern initializes multiple batches (looping from 0 to `batch_index`) while only checking `batch.initialized_` for the target batch. If two threads request different batches concurrently, both may enter the lock and initialize overlapping ranges, potentially causing data races on `batch.owner_` and `batch.holder_`.

Consider checking and setting the initialized flag for each batch `i` inside the loop to prevent concurrent initialization of the same batch.
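A minimal sketch of the per-batch check-and-set under the lock; member names follow the diff, while `init_batch` is a hypothetical stand-in for the existing initialization code:

```cpp
if (!batch.initialized_.load(std::memory_order_acquire)) [[unlikely]] {
    std::lock_guard lock(col_data.mutex_);
    for (int64_t i = 0; i <= batch_index; ++i) {
        auto& b = col_data.batches[i];
        // Re-check under the lock: another thread may have initialized this
        // batch while we were waiting on the mutex.
        if (!b.initialized_.load(std::memory_order_relaxed)) {
            init_batch(b);  // hypothetical: sets b.owner_ and b.holder_
            b.initialized_.store(true, std::memory_order_release);
        }
    }
}
```

The release store pairs with the acquire load on the fast path, so a thread that sees `initialized_ == true` also sees the fully initialized batch.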