Skip to content

[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61775

Open
dataroaring wants to merge 1 commit intoapache:branch-4.0from
freemandealer:pick/60414-branch-4.0
Open

[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61775
dataroaring wants to merge 1 commit intoapache:branch-4.0from
freemandealer:pick/60414-branch-4.0

Conversation

@dataroaring
Copy link
Contributor

Summary

Cherry-pick of #60414 to branch-4.0.

  • For S3 paths without wildcards (*, ?, [...]), use HEAD requests instead of ListObjectsV2 to avoid requiring s3:ListBucket permission
  • Brace patterns like {1..10} are expanded to concrete file paths and verified individually with HEAD requests
  • This enables loading data from S3 when only s3:GetObject permission is granted

…uests (apache#60414)

- For S3 paths without wildcards (`*`, `?`, `[...]`), use HEAD requests
instead of ListObjectsV2 to avoid requiring `s3:ListBucket` permission
- Brace patterns like `{1..10}` are expanded to concrete file paths and
verified individually with HEAD requests
- This enables loading data from S3 when only `s3:GetObject` permission
is granted

S3 `ListBucket` permission is often more restricted than `GetObject` in
enterprise environments. When users specify exact file paths or
deterministic patterns like `file{1..3}.csv`, listing is unnecessary
since the file names can be determined from the input.

| File | Description |
|------|-------------|
| `S3Util.java` | Added `isDeterministicPattern()` to detect paths
without wildcards, and `expandBracePatterns()` to expand brace patterns
to concrete paths |
| `S3ObjStorage.java` | Modified `globListInternal()` to use HEAD
requests for deterministic paths |
| `S3UtilTest.java` | Added unit tests for new utility methods |

| Path | Deterministic? | Behavior |
|------|----------------|----------|
| `s3://bucket/data/file.csv` | ✅ Yes | Single HEAD request |
| `s3://bucket/data/file{1..3}.csv` | ✅ Yes | 3 HEAD requests |
| `s3://bucket/data/*.csv` | ❌ No | Falls back to LIST |

- [x] Added unit tests for `isDeterministicPattern()`
- [x] Added unit tests for `expandBracePatterns()`
- [ ] Manual testing with S3 TVF and Broker Load

🤖 Generated with [Claude Code](https://claude.ai/code)
@dataroaring dataroaring requested a review from yiguolei as a code owner March 26, 2026 12:07
Copilot AI review requested due to automatic review settings March 26, 2026 12:07
@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Cherry-pick that optimizes object-storage glob listing for deterministic S3/Azure paths by avoiding bucket/container listing calls and instead probing concrete keys with per-object metadata requests (HEAD / getProperties), reducing required permissions (notably avoiding ListBucket when only GetObject is available).

Changes:

  • Add deterministic-pattern detection plus bracket/brace expansion helpers in S3Util.
  • Update S3/Azure glob listing to use HEAD/getProperties for deterministic patterns behind new config flags.
  • Add unit tests covering deterministic detection and expansion behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java Adds deterministic-pattern detection and pattern expansion utilities used by the new fast paths.
fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java Introduces HEAD-based glob listing optimization for deterministic S3 keys (config-gated).
fe/fe-core/src/main/java/org/apache/doris/fs/obj/AzureObjStorage.java Introduces getProperties-based glob listing optimization for deterministic Azure blob keys (config-gated).
fe/fe-common/src/main/java/org/apache/doris/common/Config.java Adds configs to enable/disable deterministic-path optimization and cap the expansion fan-out.
fe/fe-core/src/test/java/org/apache/doris/common/util/S3UtilTest.java Adds unit tests for deterministic detection and brace/bracket expansion pipeline.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +598 to +602
public static List<String> expandBracePatterns(String pathPattern) {
List<String> result = new ArrayList<>();
expandBracePatternsRecursive(pathPattern, result);
return result;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expandBracePatterns() always fully expands into a List with no way to cap the number of generated paths. Since this is used to guard HEAD/getProperties fan-out, consider adding a limit-aware variant (e.g. expandBracePatterns(pattern, maxPaths)) or stopping recursion once a threshold is exceeded, to avoid large allocations on patterns that will later be rejected.

Copilot uses AI. Check for mistakes.
Comment on lines +597 to +608
String keyPattern = uri.getKey();
if (Config.s3_skip_list_for_deterministic_path
&& !isUsePathStyle
&& S3Util.isDeterministicPattern(keyPattern)
&& !hasLimits && startFile == null) {
GlobListResult headResult = globListByHeadRequests(
bucket, keyPattern, result, fileNameOnly, startTime);
if (headResult != null) {
return headResult;
}
// If headResult is null, fall through to use listing
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HEAD fast-path returns early, but the outer globListInternal() finally block will still log elementCnt/matchCnt from the LIST code path (which remain 0). This makes debug logs misleading when the HEAD optimization is used. Consider tracking a usedHeadPath flag and either skip the LIST-path metrics log or update the counters based on the HEAD results before returning.

Copilot uses AI. Check for mistakes.
Comment on lines +758 to +769
// First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces
String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
expandedPattern = S3Util.extendGlobs(expandedPattern);
List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern);

// Fall back to listing if too many paths to avoid overwhelming S3 with HEAD requests
// Controlled by config: s3_head_request_max_paths
if (expandedPaths.size() > Config.s3_head_request_max_paths) {
LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST",
expandedPaths.size(), Config.s3_head_request_max_paths);
return null;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expandedPaths is fully materialized before checking s3_head_request_max_paths. Large brace/range expansions (e.g. {1..100000} or multi-brace cartesian products) can cause high CPU/memory usage even though you later fall back to LIST. To make s3_head_request_max_paths actually protective, consider adding an early-stop/limit-aware expansion API in S3Util (or short-circuit during recursion) that stops once the limit is exceeded.

Copilot uses AI. Check for mistakes.

return new GlobListResult(Status.OK, "", bucket, "");
} catch (Exception e) {
LOG.warn("Failed to use HEAD requests, falling back to listing: {}", e.getMessage());
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback log drops the stack trace (LOG.warn(..., e.getMessage())), which makes diagnosing why HEAD failed (auth vs throttling vs networking) much harder. Log the exception as the throwable (and ideally include bucket/keyPattern) so operators can debug without reproducing.

Suggested change
LOG.warn("Failed to use HEAD requests, falling back to listing: {}", e.getMessage());
LOG.warn("Failed to use HEAD requests for bucket={}, keyPattern={}, falling back to listing",
bucket, keyPattern, e);

Copilot uses AI. Check for mistakes.
Comment on lines +363 to +376
// Optimization: For deterministic paths (no wildcards like *, ?),
// use getProperties requests instead of listing to avoid requiring list permission.
// Controlled by config: s3_skip_list_for_deterministic_path
// Note: Skip when using path style (see S3ObjStorage for detailed explanation)
String keyPattern = uri.getKey();
if (Config.s3_skip_list_for_deterministic_path
&& !isUsePathStyle
&& S3Util.isDeterministicPattern(keyPattern)) {
Status headStatus = globListByGetProperties(bucket, keyPattern, result, fileNameOnly, startTime);
if (headStatus != null) {
return headStatus;
}
// If headStatus is null, fall through to use listing
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early return for the deterministic-path optimization means the finally block still logs LIST-path counters (elementCnt/matchCnt) which remain 0. This makes the INFO log misleading when the optimization is taken. Consider skipping the LIST metrics log when returning from globListByGetProperties() or updating the counters based on expandedPaths/result size.

Copilot uses AI. Check for mistakes.
Comment on lines +470 to +480
// First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces
String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
expandedPattern = S3Util.extendGlobs(expandedPattern);
List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern);

// Fall back to listing if too many paths to avoid overwhelming Azure with requests
// Controlled by config: s3_head_request_max_paths
if (expandedPaths.size() > Config.s3_head_request_max_paths) {
LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST",
expandedPaths.size(), Config.s3_head_request_max_paths);
return null;
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the S3 implementation: expandedPaths is fully expanded into a list before comparing to s3_head_request_max_paths. Very large brace/range expansions can consume significant CPU/memory even though you then fall back to LIST. Consider implementing a limit-aware expansion that stops once the threshold is exceeded.

Suggested change
// First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces
String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
expandedPattern = S3Util.extendGlobs(expandedPattern);
List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern);
// Fall back to listing if too many paths to avoid overwhelming Azure with requests
// Controlled by config: s3_head_request_max_paths
if (expandedPaths.size() > Config.s3_head_request_max_paths) {
LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST",
expandedPaths.size(), Config.s3_head_request_max_paths);
return null;
// First expand [...] brackets to {...} braces, then expand {..} ranges.
// NOTE: We intentionally avoid unbounded brace expansion here to prevent
// potential CPU/memory exhaustion on very large patterns.
String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
expandedPattern = S3Util.extendGlobs(expandedPattern);
final List<String> expandedPaths;
if (expandedPattern.contains("{")) {
// Pattern still contains brace/range expressions. To avoid potentially
// huge expansions, conservatively fall back to LIST-based globbing.
LOG.info("Pattern '{}' contains brace/range expressions after normalization; "
+ "skipping deterministic getProperties optimization and falling back to LIST",
expandedPattern);
return null;
} else {
// Deterministic single path; no brace expansion needed.
expandedPaths = new ArrayList<>(1);
expandedPaths.add(expandedPattern);

Copilot uses AI. Check for mistakes.
props.getBlobSize(),
props.getBlobSize(),
props.getLastModified() != null
? props.getLastModified().toEpochSecond() : 0
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RemoteFile modification times elsewhere in FE are generally in epoch milliseconds (e.g. HDFS FileStatus#getModificationTime(), S3 toEpochMilli()), and FileGroupInfo passes this through to scan ranges. Using toEpochSecond() here will make Azure deterministic-path results inconsistent and likely incorrect by 1000x. Prefer props.getLastModified().toInstant().toEpochMilli() (and consider aligning the existing Azure LIST path too).

Suggested change
? props.getLastModified().toEpochSecond() : 0
? props.getLastModified().toInstant().toEpochMilli() : 0

Copilot uses AI. Check for mistakes.

return Status.OK;
} catch (Exception e) {
LOG.warn("Failed to use getProperties requests, falling back to listing: {}", e.getMessage());
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback log drops the exception stack trace (LOG.warn(..., e.getMessage())), which makes it hard to understand why getProperties failed (permission vs missing blob vs transient errors). Log the throwable (and include container/keyPattern) so operational debugging is possible.

Suggested change
LOG.warn("Failed to use getProperties requests, falling back to listing: {}", e.getMessage());
LOG.warn("Failed to use getProperties requests for container '{}' and keyPattern '{}', "
+ "falling back to listing", bucket, keyPattern, e);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants