Skip to content

[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61776

Open
dataroaring wants to merge 1 commit intoapache:branch-4.1from
freemandealer:pick/60414-branch-4.1
Open

[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests (#60414)#61776
dataroaring wants to merge 1 commit intoapache:branch-4.1from
freemandealer:pick/60414-branch-4.1

Conversation

@dataroaring
Copy link
Contributor

Summary

Cherry-pick of #60414 to branch-4.1.

  • For S3 paths without wildcards (*, ?, [...]), use HEAD requests instead of ListObjectsV2 to avoid requiring s3:ListBucket permission
  • Brace patterns like {1..10} are expanded to concrete file paths and verified individually with HEAD requests
  • This enables loading data from S3 when only s3:GetObject permission is granted

…uests (apache#60414)

- For S3 paths without wildcards (`*`, `?`, `[...]`), use HEAD requests
instead of ListObjectsV2 to avoid requiring `s3:ListBucket` permission
- Brace patterns like `{1..10}` are expanded to concrete file paths and
verified individually with HEAD requests
- This enables loading data from S3 when only `s3:GetObject` permission
is granted

S3 `ListBucket` permission is often more restricted than `GetObject` in
enterprise environments. When users specify exact file paths or
deterministic patterns like `file{1..3}.csv`, listing is unnecessary
since the file names can be determined from the input.

| File | Description |
|------|-------------|
| `S3Util.java` | Added `isDeterministicPattern()` to detect paths
without wildcards, and `expandBracePatterns()` to expand brace patterns
to concrete paths |
| `S3ObjStorage.java` | Modified `globListInternal()` to use HEAD
requests for deterministic paths |
| `S3UtilTest.java` | Added unit tests for new utility methods |

| Path | Deterministic? | Behavior |
|------|----------------|----------|
| `s3://bucket/data/file.csv` | ✅ Yes | Single HEAD request |
| `s3://bucket/data/file{1..3}.csv` | ✅ Yes | 3 HEAD requests |
| `s3://bucket/data/*.csv` | ❌ No | Falls back to LIST |

- [x] Added unit tests for `isDeterministicPattern()`
- [x] Added unit tests for `expandBracePatterns()`
- [ ] Manual testing with S3 TVF and Broker Load

🤖 Generated with [Claude Code](https://claude.ai/code)
Copilot AI review requested due to automatic review settings March 26, 2026 12:07
@dataroaring dataroaring requested a review from yiguolei as a code owner March 26, 2026 12:07
@Thearas
Copy link
Contributor

Thearas commented Mar 26, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR cherry-picks an optimization to avoid ListObjectsV2/list-permission requirements for object-storage paths that can be deterministically expanded, by using per-object metadata requests (S3 HEAD, Azure getProperties) when safe.

Changes:

  • Add deterministic-pattern detection plus bracket/brace expansion utilities in S3Util.
  • Use S3 HeadObject / Azure getProperties to validate expanded deterministic paths, guarded by new FE configs.
  • Add unit tests for the new S3Util pattern utilities.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java Adds deterministic-path fast path using S3 HEAD requests with configurable expansion limits.
fe/fe-core/src/main/java/org/apache/doris/fs/obj/AzureObjStorage.java Adds analogous deterministic-path fast path using Azure getProperties requests.
fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java Introduces deterministic-pattern checks and bracket/brace expansion helpers.
fe/fe-common/src/main/java/org/apache/doris/common/Config.java Adds configs to enable/disable the optimization and cap expanded path counts.
fe/fe-core/src/test/java/org/apache/doris/common/util/S3UtilTest.java Adds unit tests for deterministic detection and pattern expansion.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +823 to +826
} catch (Exception e) {
LOG.warn("Failed to use HEAD requests, falling back to listing: {}", e.getMessage());
return null;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an exception occurs after some successful HEADs, this method returns null to fall back to LIST but leaves already-added entries in result, which can lead to duplicates or partial results when listing adds the same files again. Consider accumulating results in a temporary list and only appending to result if the whole HEAD pass succeeds (or rollback to the original size before returning null).

Copilot uses AI. Check for mistakes.
Comment on lines +758 to +769
// First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces
String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
expandedPattern = S3Util.extendGlobs(expandedPattern);
List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern);

// Fall back to listing if too many paths to avoid overwhelming S3 with HEAD requests
// Controlled by config: s3_head_request_max_paths
if (expandedPaths.size() > Config.s3_head_request_max_paths) {
LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST",
expandedPaths.size(), Config.s3_head_request_max_paths);
return null;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3_head_request_max_paths is checked only after fully expanding bracket/range/brace patterns. For very large deterministic patterns, expansion itself can be expensive (CPU/memory) before you ever hit the limit. Consider adding an expansion routine that can short-circuit once the max is exceeded (or pre-compute/estimate expansion cardinality) so the config actually protects the system.

Copilot uses AI. Check for mistakes.
Comment on lines +530 to +533
} catch (Exception e) {
LOG.warn("Failed to use getProperties requests, falling back to listing: {}", e.getMessage());
return null;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an exception occurs after some successful getProperties calls, this method returns null to fall back to LIST but leaves already-added entries in result, which can lead to duplicates/partial results when the listing path runs. Consider building results in a temporary list and only appending on success (or rollback result before returning null).

Copilot uses AI. Check for mistakes.
Comment on lines +470 to +481
// First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces
String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
expandedPattern = S3Util.extendGlobs(expandedPattern);
List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern);

// Fall back to listing if too many paths to avoid overwhelming Azure with requests
// Controlled by config: s3_head_request_max_paths
if (expandedPaths.size() > Config.s3_head_request_max_paths) {
LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST",
expandedPaths.size(), Config.s3_head_request_max_paths);
return null;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3_head_request_max_paths is checked only after expanding patterns into the full expandedPaths list, so extremely large deterministic patterns can still consume significant CPU/memory during expansion before triggering the fallback. Consider adding short-circuiting/early-abort expansion once the configured max is exceeded.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants