
Conversation

@davidbuniat (Member)

No description provided.

@claude (bot) commented Jan 30, 2026

Claude finished @davidbuniat's task in 58s.


PR Review Complete

I've reviewed the TPCH optimization changes focusing on the new zero-copy string handling logic. Found 6 significant issues related to memory safety and concurrency:

Issues Found:

  • Buffer overrun risks (4 locations): Missing bounds checks when accessing offsets[local_idx + 1] in string processing loops
  • Use-after-free risk: Potential lifetime issue with nd::array(samples) copy in DeeplakeStringBuffer
  • Race condition: Double-checked locking pattern may allow concurrent batch initialization

All issues have been commented inline with specific recommendations.

Note:

The zero-copy optimization is well-documented and the performance intent is clear. These issues are critical for correctness and should be addressed before merge.


const auto local_idx = range_offsets_[0] + batch_start;
const auto* offsets = offsets_cache_[0];
const auto start = offsets[local_idx];
const auto end = offsets[local_idx + count];

Potential buffer overrun: Accessing offsets[local_idx + count] without bounds checking. If local_idx + count exceeds the valid offset array size, this causes undefined behavior.

Add bounds validation: if (local_idx + count >= offset_array_size) return 0; before the access.
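A minimal sketch of that guard in context; offset_array_size is a hypothetical name for the number of entries in the offsets array and would need to be wired to the real size:

const auto local_idx = range_offsets_[0] + batch_start;
const auto* offsets = offsets_cache_[0];
// Hypothetical guard: never read past the end of the offsets array.
if (local_idx + count >= offset_array_size) {
    return 0;
}
const auto start = offsets[local_idx];
const auto end = offsets[local_idx + count];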

for (duckdb::idx_t row_in_batch = 0; row_in_batch < batch_size; ++row_in_batch) {
    const auto local_idx = start_idx + row_in_batch;
    const auto str_start = offsets[local_idx] - base_offset;
    const auto str_end = offsets[local_idx + 1] - base_offset;

Potential buffer overrun: Accessing offsets[local_idx + 1] without validating that local_idx < offset_array_size - 1. When row_in_batch reaches batch_size - 1, local_idx + 1 may exceed array bounds.

Add an assertion or bounds check before the loop: ASSERT(start_idx + batch_size < offset_array_length);
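A minimal sketch of that pre-loop check, using DuckDB's D_ASSERT debug macro in place of the generic ASSERT; offset_array_length is a hypothetical name for the number of offset entries:

// The loop reads offsets[local_idx + 1]; its largest index is
// start_idx + batch_size, which must stay inside the offsets array.
D_ASSERT(start_idx + batch_size < offset_array_length);
for (duckdb::idx_t row_in_batch = 0; row_in_batch < batch_size; ++row_in_batch) {
    const auto local_idx = start_idx + row_in_batch;
    const auto str_start = offsets[local_idx] - base_offset;
    const auto str_end = offsets[local_idx + 1] - base_offset;
    // ... build string_t values as before ...
}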

for (duckdb::idx_t row_in_batch = 0; row_in_batch < batch_size; ++row_in_batch) {
    const auto local_idx = start_idx + row_in_batch;
    const auto str_start = offsets[local_idx] - base_offset;
    const auto str_end = offsets[local_idx + 1] - base_offset;

Potential buffer overrun (duplicate): Same issue as line 582 - accessing offsets[local_idx + 1] without bounds validation when iterating through batch_size elements.

Add a bounds check or assertion before the loop, as in the sketch above.

const auto* offsets = offsets_cache_[0];
const auto start_offset = offsets[0];
const auto str_start = offsets[local_idx] - start_offset;
const auto str_end = offsets[local_idx + 1] - start_offset;

Potential buffer overrun: Accessing offsets[local_idx + 1] without bounds checking. The function requires is_single_chunk() == true but doesn't validate that local_idx + 1 is within the valid offset array range.

Add bounds validation before access to prevent undefined behavior.
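The same guard works for the single-chunk path; num_offsets is a hypothetical name for the entry count of offsets_cache_[0]:

const auto* offsets = offsets_cache_[0];
// Hypothetical guard for the single-chunk fast path.
D_ASSERT(local_idx + 1 < num_offsets);
const auto start_offset = offsets[0];
const auto str_start = offsets[local_idx] - start_offset;
const auto str_end = offsets[local_idx + 1] - start_offset;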

// This allows string_t to point directly into our buffer.
duckdb::StringVector::AddBuffer(
    output_vector,
    duckdb::make_buffer<DeeplakeStringBuffer>(nd::array(samples)));

Potential use-after-free risk: nd::array(samples) creates a copy that is moved into DeeplakeStringBuffer, while the original samples parameter is an rvalue reference. If the caller does not keep the original array alive, the buffer may be freed while DuckDB still references it.

Consider moving samples directly: duckdb::make_buffer<DeeplakeStringBuffer>(std::move(samples)) to ensure proper ownership transfer.
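A minimal sketch of that transfer, assuming the enclosing function takes samples as nd::array&&; moving instead of copying leaves the buffer as the single owner, with its lifetime managed by DuckDB:

// This allows string_t to point directly into our buffer.
// std::move hands ownership of samples to DeeplakeStringBuffer, so no
// second copy can be freed out from under DuckDB.
duckdb::StringVector::AddBuffer(
    output_vector,
    duckdb::make_buffer<DeeplakeStringBuffer>(std::move(samples)));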

if (!batch.initialized_.load(std::memory_order_acquire)) [[unlikely]] {
    std::lock_guard lock(col_data.mutex_);
    for (int64_t i = 0; i <= batch_index; ++i) {
        if (!col_data.batches[i].initialized_.load(std::memory_order_relaxed)) {

Race condition risk: The double-checked locking pattern initializes multiple batches (loop from 0 to batch_index) while only checking batch.initialized_ for the target batch. If two threads request different batches concurrently, both may enter the lock and initialize overlapping ranges, potentially causing data races on batch.owner_ and batch.holder_.

Consider checking and setting the initialized_ flag for each batch i inside the loop, as sketched below, to prevent concurrent initialization of the same batch.
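A minimal sketch of that per-batch pattern; init_batch is a hypothetical stand-in for whatever code currently populates owner_ and holder_:

if (!batch.initialized_.load(std::memory_order_acquire)) [[unlikely]] {
    std::lock_guard lock(col_data.mutex_);
    for (int64_t i = 0; i <= batch_index; ++i) {
        auto& b = col_data.batches[i];
        // Re-check under the lock so a batch that another thread already
        // initialized is never set up twice.
        if (!b.initialized_.load(std::memory_order_relaxed)) {
            init_batch(b); // hypothetical: populates b.owner_ / b.holder_
            // The release store publishes fully-initialized batch state to
            // the acquire load on the fast path above.
            b.initialized_.store(true, std::memory_order_release);
        }
    }
}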
