Skip to content

feat: add aimd throttled object store#6266

Merged
westonpace merged 6 commits intolance-format:mainfrom
westonpace:feat/aimd-throttle
Mar 25, 2026
Merged

feat: add aimd throttled object store#6266
westonpace merged 6 commits intolance-format:mainfrom
westonpace:feat/aimd-throttle

Conversation

@westonpace
Copy link
Copy Markdown
Member

This does not hook the throttle up anywhere yet, that will come in a future PR.

Closes #6237
Closes #6238

@github-actions github-actions bot added the enhancement New feature or request label Mar 24, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Code Review

Clean, well-documented PR overall. The AIMD algorithm and token bucket are well-structured. Below are the issues I'd flag before merging.

P0: acquire_token thundering herd under contention

In throttle.rs, the token acquisition loop sleeps then retries:

async fn acquire_token(&self) {
    loop {
        let sleep_duration = { /* lock, check, compute */ };
        tokio::time::sleep(sleep_duration).await;
    }
}

If N tasks are all waiting for 1 token, they wake simultaneously, re-acquire the lock, and only one succeeds. The rest compute a very short sleep and retry, degrading to a busy-wait spin loop with repeated lock acquisitions. Consider using tokio::sync::Notify or adding jitter to the sleep duration.

P1: is_throttle_error string matching is fragile

The function matches substrings like "429" and "503" in the error's display string. This can false-positive on object paths containing those substrings (e.g., data_429_backup.parquet) or error messages using those numbers in other contexts (e.g., "expected 503 bytes"). If the object_store crate exposes structured HTTP status codes, prefer matching on those. If string matching is unavoidable, use more specific patterns like "status: 429".

P1: Multipart uploads are only throttled on initiation

put_multipart and put_multipart_opts throttle the initial create_multipart call, but the returned MultipartUpload handle is unwrapped — subsequent put_part calls bypass the throttle entirely. For large files with many parts, writes are effectively unthrottled. The MultipartUpload should be wrapped to throttle each part upload.

P1: unwrap() on std::sync::Mutex::lock() will panic on poison

In aimd.rs, self.state.lock().unwrap() will panic if any thread previously panicked while holding the lock, cascading a single failure into a process crash. Use .lock().unwrap_or_else(|e| e.into_inner()) to recover from poisoning, or switch to parking_lot::Mutex which doesn't poison.

P1: No eager validation in AimdThrottleWrapper::new()

AimdThrottleWrapper::wrap() calls .expect("invalid AIMD config"), assuming validation was already done — but new() never validates. An invalid config (e.g., min_rate > max_rate) will panic lazily at wrap() time. Either validate eagerly in new() or propagate the error.

P1: Debug impl for AimdController acquires the mutex

AimdController::fmt() calls current_rate() which locks the mutex. If Debug is ever invoked while the lock is held (e.g., debug-logging inside a method that holds the lock), it will deadlock since std::sync::Mutex is not reentrant.

Minor: streaming list/delete_stream bypass throttling

These are passed through without throttling. list() can generate many paginated HTTP requests internally, each subject to cloud provider throttling. Worth at least a doc comment explaining the rationale for the pass-through.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 79.19598% with 207 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-io/src/object_store/throttle.rs 68.00% 194 Missing and 6 partials ⚠️
rust/lance-core/src/utils/aimd.rs 98.10% 6 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

///
/// The `object_store` crate surfaces these as `Error::Generic` with the HTTP
/// status or cloud-specific message embedded in the source chain. We match
/// against known patterns from S3, GCS, and Azure.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could upstream an improvement for this

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've simplified this logic a bit, which should work as long as there are retries configured on the object store. However, I like the opendal approach, which is roughly the equivalent of adding is_temporary to object_store::Generic. I'll work on an upstream PR.

if let object_store::Error::Generic { source, .. } = err {
let msg = source.to_string();
// Check for common throttle patterns from cloud stores
msg.contains("429")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the bot had a good comment about these. 429 or 503 could occur in the paths of index files or transaction files, so if those paths are also present in the errors then we could make an incorrect inference here.

I wonder if we could find the specific errors for each cloud provider documented. AWS ones seem to be here: https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting to the HTTP response from object store is rather tricky and actually probably impossible due to the fact that most errors are private types hiding behind the Error trait so all you can do is grab the error as a string.

I've simplified the logic for now. We look for the word retries, max_retries which is still potentially in a URI but given the whitespace in there I think we're probably ok (I suspect any whitespace in the URI will be %20 or +) We can worry about a better way after we upstream.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we could contribute up to object_store?

@westonpace
Copy link
Copy Markdown
Member Author

@claude review this PR

@github-actions
Copy link
Copy Markdown
Contributor

Claude Code is working…

I'll analyze this and get back to you.

View job run

@westonpace westonpace requested a review from wkalt March 24, 2026 17:11
Copy link
Copy Markdown
Contributor

@wkalt wkalt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@westonpace westonpace merged commit e2231fd into lance-format:main Mar 25, 2026
31 checks passed
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026
This does not hook the throttle up anywhere yet, that will come in a
future PR.

Closes lance-format#6237
Closes lance-format#6238
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create a proxy ObjectStore that applies the AIMD throttle Implement a generic AIMD rate limit throttle

3 participants