[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61775
[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61775dataroaring wants to merge 1 commit intoapache:branch-4.0from
Conversation
…uests (apache#60414) - For S3 paths without wildcards (`*`, `?`, `[...]`), use HEAD requests instead of ListObjectsV2 to avoid requiring `s3:ListBucket` permission - Brace patterns like `{1..10}` are expanded to concrete file paths and verified individually with HEAD requests - This enables loading data from S3 when only `s3:GetObject` permission is granted S3 `ListBucket` permission is often more restricted than `GetObject` in enterprise environments. When users specify exact file paths or deterministic patterns like `file{1..3}.csv`, listing is unnecessary since the file names can be determined from the input. | File | Description | |------|-------------| | `S3Util.java` | Added `isDeterministicPattern()` to detect paths without wildcards, and `expandBracePatterns()` to expand brace patterns to concrete paths | | `S3ObjStorage.java` | Modified `globListInternal()` to use HEAD requests for deterministic paths | | `S3UtilTest.java` | Added unit tests for new utility methods | | Path | Deterministic? | Behavior | |------|----------------|----------| | `s3://bucket/data/file.csv` | ✅ Yes | Single HEAD request | | `s3://bucket/data/file{1..3}.csv` | ✅ Yes | 3 HEAD requests | | `s3://bucket/data/*.csv` | ❌ No | Falls back to LIST | - [x] Added unit tests for `isDeterministicPattern()` - [x] Added unit tests for `expandBracePatterns()` - [ ] Manual testing with S3 TVF and Broker Load 🤖 Generated with [Claude Code](https://claude.ai/code)
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
Cherry-pick that optimizes object-storage glob listing for deterministic S3/Azure paths by avoiding bucket/container listing calls and instead probing concrete keys with per-object metadata requests (HEAD / getProperties), reducing required permissions (notably avoiding ListBucket when only GetObject is available).
Changes:
- Add deterministic-pattern detection plus bracket/brace expansion helpers in
S3Util. - Update S3/Azure glob listing to use HEAD/getProperties for deterministic patterns behind new config flags.
- Add unit tests covering deterministic detection and expansion behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java | Adds deterministic-pattern detection and pattern expansion utilities used by the new fast paths. |
| fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java | Introduces HEAD-based glob listing optimization for deterministic S3 keys (config-gated). |
| fe/fe-core/src/main/java/org/apache/doris/fs/obj/AzureObjStorage.java | Introduces getProperties-based glob listing optimization for deterministic Azure blob keys (config-gated). |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Adds configs to enable/disable deterministic-path optimization and cap the expansion fan-out. |
| fe/fe-core/src/test/java/org/apache/doris/common/util/S3UtilTest.java | Adds unit tests for deterministic detection and brace/bracket expansion pipeline. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public static List<String> expandBracePatterns(String pathPattern) { | ||
| List<String> result = new ArrayList<>(); | ||
| expandBracePatternsRecursive(pathPattern, result); | ||
| return result; | ||
| } |
There was a problem hiding this comment.
expandBracePatterns() always fully expands into a List with no way to cap the number of generated paths. Since this is used to guard HEAD/getProperties fan-out, consider adding a limit-aware variant (e.g. expandBracePatterns(pattern, maxPaths)) or stopping recursion once a threshold is exceeded, to avoid large allocations on patterns that will later be rejected.
| String keyPattern = uri.getKey(); | ||
| if (Config.s3_skip_list_for_deterministic_path | ||
| && !isUsePathStyle | ||
| && S3Util.isDeterministicPattern(keyPattern) | ||
| && !hasLimits && startFile == null) { | ||
| GlobListResult headResult = globListByHeadRequests( | ||
| bucket, keyPattern, result, fileNameOnly, startTime); | ||
| if (headResult != null) { | ||
| return headResult; | ||
| } | ||
| // If headResult is null, fall through to use listing | ||
| } |
There was a problem hiding this comment.
The HEAD fast-path returns early, but the outer globListInternal() finally block will still log elementCnt/matchCnt from the LIST code path (which remain 0). This makes debug logs misleading when the HEAD optimization is used. Consider tracking a usedHeadPath flag and either skip the LIST-path metrics log or update the counters based on the HEAD results before returning.
| // First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces | ||
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | ||
| expandedPattern = S3Util.extendGlobs(expandedPattern); | ||
| List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern); | ||
|
|
||
| // Fall back to listing if too many paths to avoid overwhelming S3 with HEAD requests | ||
| // Controlled by config: s3_head_request_max_paths | ||
| if (expandedPaths.size() > Config.s3_head_request_max_paths) { | ||
| LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST", | ||
| expandedPaths.size(), Config.s3_head_request_max_paths); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
expandedPaths is fully materialized before checking s3_head_request_max_paths. Large brace/range expansions (e.g. {1..100000} or multi-brace cartesian products) can cause high CPU/memory usage even though you later fall back to LIST. To make s3_head_request_max_paths actually protective, consider adding an early-stop/limit-aware expansion API in S3Util (or short-circuit during recursion) that stops once the limit is exceeded.
|
|
||
| return new GlobListResult(Status.OK, "", bucket, ""); | ||
| } catch (Exception e) { | ||
| LOG.warn("Failed to use HEAD requests, falling back to listing: {}", e.getMessage()); |
There was a problem hiding this comment.
The fallback log drops the stack trace (LOG.warn(..., e.getMessage())), which makes diagnosing why HEAD failed (auth vs throttling vs networking) much harder. Log the exception as the throwable (and ideally include bucket/keyPattern) so operators can debug without reproducing.
| LOG.warn("Failed to use HEAD requests, falling back to listing: {}", e.getMessage()); | |
| LOG.warn("Failed to use HEAD requests for bucket={}, keyPattern={}, falling back to listing", | |
| bucket, keyPattern, e); |
| // Optimization: For deterministic paths (no wildcards like *, ?), | ||
| // use getProperties requests instead of listing to avoid requiring list permission. | ||
| // Controlled by config: s3_skip_list_for_deterministic_path | ||
| // Note: Skip when using path style (see S3ObjStorage for detailed explanation) | ||
| String keyPattern = uri.getKey(); | ||
| if (Config.s3_skip_list_for_deterministic_path | ||
| && !isUsePathStyle | ||
| && S3Util.isDeterministicPattern(keyPattern)) { | ||
| Status headStatus = globListByGetProperties(bucket, keyPattern, result, fileNameOnly, startTime); | ||
| if (headStatus != null) { | ||
| return headStatus; | ||
| } | ||
| // If headStatus is null, fall through to use listing | ||
| } |
There was a problem hiding this comment.
The early return for the deterministic-path optimization means the finally block still logs LIST-path counters (elementCnt/matchCnt) which remain 0. This makes the INFO log misleading when the optimization is taken. Consider skipping the LIST metrics log when returning from globListByGetProperties() or updating the counters based on expandedPaths/result size.
| // First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces | ||
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | ||
| expandedPattern = S3Util.extendGlobs(expandedPattern); | ||
| List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern); | ||
|
|
||
| // Fall back to listing if too many paths to avoid overwhelming Azure with requests | ||
| // Controlled by config: s3_head_request_max_paths | ||
| if (expandedPaths.size() > Config.s3_head_request_max_paths) { | ||
| LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST", | ||
| expandedPaths.size(), Config.s3_head_request_max_paths); | ||
| return null; |
There was a problem hiding this comment.
Same as the S3 implementation: expandedPaths is fully expanded into a list before comparing to s3_head_request_max_paths. Very large brace/range expansions can consume significant CPU/memory even though you then fall back to LIST. Consider implementing a limit-aware expansion that stops once the threshold is exceeded.
| // First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces | |
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | |
| expandedPattern = S3Util.extendGlobs(expandedPattern); | |
| List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern); | |
| // Fall back to listing if too many paths to avoid overwhelming Azure with requests | |
| // Controlled by config: s3_head_request_max_paths | |
| if (expandedPaths.size() > Config.s3_head_request_max_paths) { | |
| LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST", | |
| expandedPaths.size(), Config.s3_head_request_max_paths); | |
| return null; | |
| // First expand [...] brackets to {...} braces, then expand {..} ranges. | |
| // NOTE: We intentionally avoid unbounded brace expansion here to prevent | |
| // potential CPU/memory exhaustion on very large patterns. | |
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | |
| expandedPattern = S3Util.extendGlobs(expandedPattern); | |
| final List<String> expandedPaths; | |
| if (expandedPattern.contains("{")) { | |
| // Pattern still contains brace/range expressions. To avoid potentially | |
| // huge expansions, conservatively fall back to LIST-based globbing. | |
| LOG.info("Pattern '{}' contains brace/range expressions after normalization; " | |
| + "skipping deterministic getProperties optimization and falling back to LIST", | |
| expandedPattern); | |
| return null; | |
| } else { | |
| // Deterministic single path; no brace expansion needed. | |
| expandedPaths = new ArrayList<>(1); | |
| expandedPaths.add(expandedPattern); |
| props.getBlobSize(), | ||
| props.getBlobSize(), | ||
| props.getLastModified() != null | ||
| ? props.getLastModified().toEpochSecond() : 0 |
There was a problem hiding this comment.
RemoteFile modification times elsewhere in FE are generally in epoch milliseconds (e.g. HDFS FileStatus#getModificationTime(), S3 toEpochMilli()), and FileGroupInfo passes this through to scan ranges. Using toEpochSecond() here will make Azure deterministic-path results inconsistent and likely incorrect by 1000x. Prefer props.getLastModified().toInstant().toEpochMilli() (and consider aligning the existing Azure LIST path too).
| ? props.getLastModified().toEpochSecond() : 0 | |
| ? props.getLastModified().toInstant().toEpochMilli() : 0 |
|
|
||
| return Status.OK; | ||
| } catch (Exception e) { | ||
| LOG.warn("Failed to use getProperties requests, falling back to listing: {}", e.getMessage()); |
There was a problem hiding this comment.
The fallback log drops the exception stack trace (LOG.warn(..., e.getMessage())), which makes it hard to understand why getProperties failed (permission vs missing blob vs transient errors). Log the throwable (and include container/keyPattern) so operational debugging is possible.
| LOG.warn("Failed to use getProperties requests, falling back to listing: {}", e.getMessage()); | |
| LOG.warn("Failed to use getProperties requests for container '{}' and keyPattern '{}', " | |
| + "falling back to listing", bucket, keyPattern, e); |
Summary
Cherry-pick of #60414 to branch-4.0.
*,?,[...]), use HEAD requests instead of ListObjectsV2 to avoid requirings3:ListBucketpermission{1..10}are expanded to concrete file paths and verified individually with HEAD requestss3:GetObjectpermission is granted