[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61776
[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61776dataroaring wants to merge 1 commit intoapache:branch-4.1from
Conversation
…uests (apache#60414) - For S3 paths without wildcards (`*`, `?`, `[...]`), use HEAD requests instead of ListObjectsV2 to avoid requiring `s3:ListBucket` permission - Brace patterns like `{1..10}` are expanded to concrete file paths and verified individually with HEAD requests - This enables loading data from S3 when only `s3:GetObject` permission is granted S3 `ListBucket` permission is often more restricted than `GetObject` in enterprise environments. When users specify exact file paths or deterministic patterns like `file{1..3}.csv`, listing is unnecessary since the file names can be determined from the input. | File | Description | |------|-------------| | `S3Util.java` | Added `isDeterministicPattern()` to detect paths without wildcards, and `expandBracePatterns()` to expand brace patterns to concrete paths | | `S3ObjStorage.java` | Modified `globListInternal()` to use HEAD requests for deterministic paths | | `S3UtilTest.java` | Added unit tests for new utility methods | | Path | Deterministic? | Behavior | |------|----------------|----------| | `s3://bucket/data/file.csv` | ✅ Yes | Single HEAD request | | `s3://bucket/data/file{1..3}.csv` | ✅ Yes | 3 HEAD requests | | `s3://bucket/data/*.csv` | ❌ No | Falls back to LIST | - [x] Added unit tests for `isDeterministicPattern()` - [x] Added unit tests for `expandBracePatterns()` - [ ] Manual testing with S3 TVF and Broker Load 🤖 Generated with [Claude Code](https://claude.ai/code)
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR cherry-picks an optimization to avoid ListObjectsV2/list-permission requirements for object-storage paths that can be deterministically expanded, by using per-object metadata requests (S3 HEAD, Azure getProperties) when safe.
Changes:
- Add deterministic-pattern detection plus bracket/brace expansion utilities in
S3Util. - Use S3
HeadObject/ AzuregetPropertiesto validate expanded deterministic paths, guarded by new FE configs. - Add unit tests for the new
S3Utilpattern utilities.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java | Adds deterministic-path fast path using S3 HEAD requests with configurable expansion limits. |
| fe/fe-core/src/main/java/org/apache/doris/fs/obj/AzureObjStorage.java | Adds analogous deterministic-path fast path using Azure getProperties requests. |
| fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java | Introduces deterministic-pattern checks and bracket/brace expansion helpers. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Adds configs to enable/disable the optimization and cap expanded path counts. |
| fe/fe-core/src/test/java/org/apache/doris/common/util/S3UtilTest.java | Adds unit tests for deterministic detection and pattern expansion. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } catch (Exception e) { | ||
| LOG.warn("Failed to use HEAD requests, falling back to listing: {}", e.getMessage()); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
If an exception occurs after some successful HEADs, this method returns null to fall back to LIST but leaves already-added entries in result, which can lead to duplicates or partial results when listing adds the same files again. Consider accumulating results in a temporary list and only appending to result if the whole HEAD pass succeeds (or rollback to the original size before returning null).
| // First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces | ||
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | ||
| expandedPattern = S3Util.extendGlobs(expandedPattern); | ||
| List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern); | ||
|
|
||
| // Fall back to listing if too many paths to avoid overwhelming S3 with HEAD requests | ||
| // Controlled by config: s3_head_request_max_paths | ||
| if (expandedPaths.size() > Config.s3_head_request_max_paths) { | ||
| LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST", | ||
| expandedPaths.size(), Config.s3_head_request_max_paths); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
s3_head_request_max_paths is checked only after fully expanding bracket/range/brace patterns. For very large deterministic patterns, expansion itself can be expensive (CPU/memory) before you ever hit the limit. Consider adding an expansion routine that can short-circuit once the max is exceeded (or pre-compute/estimate expansion cardinality) so the config actually protects the system.
| } catch (Exception e) { | ||
| LOG.warn("Failed to use getProperties requests, falling back to listing: {}", e.getMessage()); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
If an exception occurs after some successful getProperties calls, this method returns null to fall back to LIST but leaves already-added entries in result, which can lead to duplicates/partial results when the listing path runs. Consider building results in a temporary list and only appending on success (or rollback result before returning null).
| // First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces | ||
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | ||
| expandedPattern = S3Util.extendGlobs(expandedPattern); | ||
| List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern); | ||
|
|
||
| // Fall back to listing if too many paths to avoid overwhelming Azure with requests | ||
| // Controlled by config: s3_head_request_max_paths | ||
| if (expandedPaths.size() > Config.s3_head_request_max_paths) { | ||
| LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST", | ||
| expandedPaths.size(), Config.s3_head_request_max_paths); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
s3_head_request_max_paths is checked only after expanding patterns into the full expandedPaths list, so extremely large deterministic patterns can still consume significant CPU/memory during expansion before triggering the fallback. Consider adding short-circuiting/early-abort expansion once the configured max is exceeded.
Summary
Cherry-pick of #60414 to branch-4.1.
*,?,[...]), use HEAD requests instead of ListObjectsV2 to avoid requirings3:ListBucketpermission{1..10}are expanded to concrete file paths and verified individually with HEAD requestss3:GetObjectpermission is granted