branch-4.1: [improve](partition) Increase partition limit defaults to 20000 and add near-limit metrics #61511 #61765

Open
dataroaring wants to merge 1 commit into branch-4.1 from pick/61511-branch-4.1
@dataroaring
Contributor

Summary

Cherry-pick of #61511 to branch-4.1.

  • Raise max_dynamic_partition_num default from 500 to 20000 and max_auto_partition_num from 2000 to 20000 to match modern production workloads
  • Add warning logs when partition counts exceed 80% of their configured limits, enabling proactive detection before hard failures
  • Add Prometheus counter metrics (auto_partition_near_limit_count, dynamic_partition_near_limit_count) for monitoring/alerting

Conflict Resolution

  • Config.java: Trivial context conflict in max_auto_partition_num description formatting — resolved by taking the incoming change (20000 default + updated English description).

Test plan

  • Verify existing dynamic partition tests pass with new default
  • Verify auto-partition limit check still errors correctly when exceeded
  • Verify warning logs appear when partition count is between 80%-100% of limit
  • Verify new metrics appear in /metrics Prometheus endpoint
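
The last item in the test plan could be scripted. Below is a minimal sketch, assuming the body of the `/metrics` response has already been fetched as a string. `MetricsCheck` and `missingMetrics` are hypothetical names, not part of Doris. The name `doris_fe_auto_partition_near_limit_count` comes from the alert rule in the commit message; the `doris_fe_` prefix on the dynamic counter is an assumption by analogy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for the test plan; not part of the Doris code base.
final class MetricsCheck {
    // Returns the expected metric names that do not occur as sample lines
    // in a Prometheus text-format body.
    static List<String> missingMetrics(String body, List<String> expected) {
        Set<String> seen = new HashSet<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            // The metric name ends at the first '{' (labels) or whitespace (value).
            int end = 0;
            while (end < trimmed.length()) {
                char c = trimmed.charAt(end);
                if (c == '{' || Character.isWhitespace(c)) {
                    break;
                }
                end++;
            }
            seen.add(trimmed.substring(0, end));
        }
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            if (!seen.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample body for illustration only; real output comes from the FE.
        String sample = "# TYPE doris_fe_auto_partition_near_limit_count counter\n"
                + "doris_fe_auto_partition_near_limit_count 3\n";
        System.out.println(missingMetrics(sample, List.of(
                "doris_fe_auto_partition_near_limit_count",
                "doris_fe_dynamic_partition_near_limit_count"))); // assumed name
    }
}
```

An empty result would mean both counters are exposed; anything else lists the names still missing.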

…dd near-limit metrics (#61511)

- Raise `max_dynamic_partition_num` default from 500 to 20000 and
`max_auto_partition_num` from 2000 to 20000 to match modern production
workloads
- Add warning logs when partition counts exceed 80% of their configured
limits, enabling proactive detection before hard failures
- Add Prometheus counter metrics (`auto_partition_near_limit_count`,
`dynamic_partition_near_limit_count`) for monitoring/alerting

- [ ] Verify existing dynamic partition tests pass with new default
(tests explicitly set config values, so unaffected)
- [ ] Verify auto-partition limit check still errors correctly when
exceeded
- [ ] Verify warning logs appear when partition count is between
80%-100% of limit
- [ ] Verify new metrics appear in `/metrics` Prometheus endpoint
- [ ] Test Prometheus alert rule:
`rate(doris_fe_auto_partition_near_limit_count[5m]) > 0`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: dataroaring <dataroaring@users.noreply.github.com>
@dataroaring requested a review from yiguolei as a code owner March 26, 2026 08:03
Copilot AI review requested due to automatic review settings March 26, 2026 08:03
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

This PR cherry-picks improvements to partition limit handling on branch-4.1 by raising default partition limits and adding early-warning signals (logs + Prometheus counters) when partition counts approach configured caps.

Changes:

  • Increased default max_dynamic_partition_num and max_auto_partition_num to 20000.
  • Added warning logs when partition counts exceed 80% of the configured limit.
  • Added Prometheus counter metrics for near-limit events for both auto and dynamic partitions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java | Adds near-limit warning log + metric increment for auto-partition count checks. |
| fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java | Registers two new Prometheus counter metrics for partition near-limit warnings. |
| fe/fe-core/src/main/java/org/apache/doris/common/util/DynamicPartitionUtil.java | Adds near-limit warning log + metric increment for dynamic partition count checks. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Raises default partition limit config values to 20000 (and updates English description for auto partitions). |

@@ -2966,9 +2966,8 @@ public class Config extends ConfigBase {
@ConfField(mutable = true, masterOnly = true, description = {
"对于自动分区表,防止用户意外创建大量分区,每个 OLAP 表允许的分区数量为`max_auto_partition_num`。默认 2000。",

Copilot AI Mar 26, 2026

The Chinese description still says the default is 2000 (“默认 2000。”) but the actual default was changed to 20000. Update the Chinese string to match the new default to avoid misleading operators.

Suggested change
- "对于自动分区表,防止用户意外创建大量分区,每个 OLAP 表允许的分区数量为`max_auto_partition_num`。默认 2000。",
+ "对于自动分区表,防止用户意外创建大量分区,每个 OLAP 表允许的分区数量为`max_auto_partition_num`。默认 20000。",

Copilot uses AI. Check for mistakes.
    result.setStatus(errorStatus);
    LOG.warn("send create partition error status: {}", result);
    return result;
} else if (partitionNum > autoPartitionLimit * 8 / 10) {

Copilot AI Mar 26, 2026

autoPartitionLimit * 8 / 10 is computed using int arithmetic and can overflow if max_auto_partition_num is configured to a large value (this is a mutable config). Use long arithmetic (cast before multiply) or compare via partitionNum * 10 > autoPartitionLimit * 8 using long to avoid overflow/truncation pitfalls.

Suggested change
- } else if (partitionNum > autoPartitionLimit * 8 / 10) {
+ } else if ((long) partitionNum * 10 > (long) autoPartitionLimit * 8) {
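
To make the overflow concrete, here is a minimal standalone sketch comparing the two forms of the check. `NearLimitCheck` and its method names are hypothetical; only the two comparison expressions are taken from the diff. In `int` arithmetic `limit * 8` wraps once the limit exceeds 268435455 (`Integer.MAX_VALUE / 8`), so the computed threshold can collapse to zero or go negative.

```java
// Hypothetical demo class; only the two comparison expressions come from the PR.
final class NearLimitCheck {
    // Original form: int arithmetic, so limit * 8 can wrap around.
    static boolean nearLimitInt(int partitionNum, int limit) {
        return partitionNum > limit * 8 / 10;
    }

    // Suggested form: widen to long before multiplying, no division needed.
    static boolean nearLimitLong(int partitionNum, int limit) {
        return (long) partitionNum * 10 > (long) limit * 8;
    }

    public static void main(String[] args) {
        int limit = Integer.MAX_VALUE; // extreme, but a mutable config permits it
        // In int arithmetic limit * 8 wraps to -8, and -8 / 10 truncates to 0.
        System.out.println(limit * 8 / 10);
        System.out.println(nearLimitInt(100, limit));  // misfires: 100 > 0
        System.out.println(nearLimitLong(100, limit)); // correct: far below 80%
        // With the new default of 20000 both forms agree.
        System.out.println(nearLimitInt(16001, 20000) == nearLimitLong(16001, 20000));
    }
}
```

For non-negative values that do not overflow, the two forms are equivalent, so the suggested change should not alter behavior at the default limit.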

Comment on lines +3810 to +3816
} else if (partitionNum > autoPartitionLimit * 8 / 10) {
    LOG.warn("Table {}.{} auto partition count {} is approaching limit {} (>80%)."
            + " Consider increasing max_auto_partition_num.",
            db.getFullName(), olapTable.getName(), partitionNum, autoPartitionLimit);
    if (MetricRepo.isInit) {
        MetricRepo.COUNTER_AUTO_PARTITION_NEAR_LIMIT.increase(1L);
    }

Copilot AI Mar 26, 2026

This warning (and counter increment) will trigger on every createPartition call once the table is above the 80% threshold, which can produce noisy logs and rapidly increasing counters in busy clusters. Consider adding throttling/deduping (e.g., log at most once per table per time window, or only when crossing the threshold) and similarly gate the metric increment to threshold-crossing events rather than per-request.

Suggested change
- } else if (partitionNum > autoPartitionLimit * 8 / 10) {
-     LOG.warn("Table {}.{} auto partition count {} is approaching limit {} (>80%)."
-             + " Consider increasing max_auto_partition_num.",
-             db.getFullName(), olapTable.getName(), partitionNum, autoPartitionLimit);
-     if (MetricRepo.isInit) {
-         MetricRepo.COUNTER_AUTO_PARTITION_NEAR_LIMIT.increase(1L);
-     }
+ } else {
+     // Only emit the warning and increment the metric when crossing the 80% threshold.
+     // Estimate the partition count before this request by subtracting the number of
+     // partitions requested to be added. This avoids noisy logs/metrics when the table
+     // is already above the threshold.
+     int prevPartitionNumEstimate = partitionNum - addPartitionClauseMap.size();
+     if (prevPartitionNumEstimate < 0) {
+         prevPartitionNumEstimate = 0;
+     }
+     int threshold80 = autoPartitionLimit * 8 / 10;
+     if (partitionNum > threshold80 && prevPartitionNumEstimate <= threshold80) {
+         LOG.warn("Table {}.{} auto partition count {} is approaching limit {} (>80%)."
+                 + " Consider increasing max_auto_partition_num.",
+                 db.getFullName(), olapTable.getName(), partitionNum, autoPartitionLimit);
+         if (MetricRepo.isInit) {
+             MetricRepo.COUNTER_AUTO_PARTITION_NEAR_LIMIT.increase(1L);
+         }
+     }
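
The other option mentioned in this comment, logging at most once per table per time window, might look roughly like the sketch below. `WarnThrottle`, `shouldWarn`, and the `"db.table"` key format are hypothetical and not part of Doris; the caller would wrap the existing `LOG.warn` and counter increment in `if (throttle.shouldWarn(key, System.currentTimeMillis()))`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of Doris: allows one warning per key per time window.
final class WarnThrottle {
    private final long windowMillis;
    private final ConcurrentMap<String, Long> lastWarnAt = new ConcurrentHashMap<>();

    WarnThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if no warning was allowed for this key within the last window.
    // compute() runs atomically per key, so concurrent callers cannot both pass.
    boolean shouldWarn(String tableKey, long nowMillis) {
        boolean[] allowed = {false};
        lastWarnAt.compute(tableKey, (key, last) -> {
            if (last == null || nowMillis - last >= windowMillis) {
                allowed[0] = true;
                return nowMillis; // start a new window
            }
            return last; // still inside the current window
        });
        return allowed[0];
    }

    public static void main(String[] args) {
        WarnThrottle throttle = new WarnThrottle(60_000L); // one warning per minute
        System.out.println(throttle.shouldWarn("db.tbl", 0L));      // first call passes
        System.out.println(throttle.shouldWarn("db.tbl", 30_000L)); // suppressed
        System.out.println(throttle.shouldWarn("db.tbl", 60_000L)); // new window
    }
}
```

Passing the clock value in keeps the helper easy to test. The map grows with the number of distinct tables that ever came near the limit, so a real implementation would probably need eviction.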

Comment on lines +652 to +657
LOG.warn("Dynamic partition count {} is approaching limit {} (>80%)."
        + " Consider increasing max_dynamic_partition_num.",
        expectCreatePartitionNum, dynamicPartitionLimit);
if (MetricRepo.isInit) {
    MetricRepo.COUNTER_DYNAMIC_PARTITION_NEAR_LIMIT.increase(1L);
}

Copilot AI Mar 26, 2026

Similar to the auto-partition path, this will warn and increment the counter on every analysis call above the 80% threshold, which can be very frequent (DDL validations and retries). Consider throttling/deduping, or incrementing only on threshold crossing to keep logs/metrics actionable and avoid alert fatigue.
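
One way to increment only on threshold crossing is to remember which tables are currently above 80% and report just the transition. The sketch below is a rough illustration under stated assumptions: `NearLimitTracker`, `crossedThreshold`, and the per-table key are hypothetical, and the dynamic-partition log line in this diff does not carry a table name, so a key would have to be passed in from the caller.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Doris: edge-triggered near-limit detection.
final class NearLimitTracker {
    // Keys of tables currently known to be above the 80% threshold.
    private final Set<String> above = ConcurrentHashMap.newKeySet();

    // Returns true only on the call that moves a table from at-or-below 80% to above it.
    boolean crossedThreshold(String tableKey, int partitionNum, int limit) {
        // long arithmetic avoids int overflow for very large limits
        boolean near = (long) partitionNum * 10 > (long) limit * 8;
        if (near) {
            return above.add(tableKey); // add() is true only the first time
        }
        above.remove(tableKey); // re-arm once the count drops back under the threshold
        return false;
    }

    public static void main(String[] args) {
        NearLimitTracker tracker = new NearLimitTracker();
        System.out.println(tracker.crossedThreshold("db.tbl", 16001, 20000)); // crossing
        System.out.println(tracker.crossedThreshold("db.tbl", 17000, 20000)); // already above
    }
}
```

The state lives in memory only, so a crossing would be reported again after a restart; whether that is acceptable depends on how the counter is used for alerting.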

@dataroaring
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 40.91% (9/22) 🎉
Increment coverage report
Complete coverage report

3 participants