branch-4.1: [improve](partition) Increase partition limit defaults to 20000 and add near-limit metrics #61511 #61765
dataroaring wants to merge 1 commit into branch-4.1
…dd near-limit metrics (#61511)
- Raise `max_dynamic_partition_num` default from 500 to 20000 and `max_auto_partition_num` from 2000 to 20000 to match modern production workloads
- Add warning logs when partition counts exceed 80% of their configured limits, enabling proactive detection before hard failures
- Add Prometheus counter metrics (`auto_partition_near_limit_count`, `dynamic_partition_near_limit_count`) for monitoring/alerting
- [ ] Verify existing dynamic partition tests pass with new default (tests explicitly set config values, so unaffected)
- [ ] Verify auto-partition limit check still errors correctly when exceeded
- [ ] Verify warning logs appear when partition count is between 80%-100% of limit
- [ ] Verify new metrics appear in `/metrics` Prometheus endpoint
- [ ] Test Prometheus alert rule: `rate(doris_fe_auto_partition_near_limit_count[5m]) > 0`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: dataroaring <dataroaring@users.noreply.github.com>
Pull request overview
This PR cherry-picks improvements to partition limit handling on branch-4.1 by raising default partition limits and adding early-warning signals (logs + Prometheus counters) when partition counts approach configured caps.
Changes:
- Increased default `max_dynamic_partition_num` and `max_auto_partition_num` to 20000.
- Added warning logs when partition counts exceed 80% of the configured limit.
- Added Prometheus counter metrics for near-limit events for both auto and dynamic partitions.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java | Adds near-limit warning log + metric increment for auto-partition count checks. |
| fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java | Registers two new Prometheus counter metrics for partition near-limit warnings. |
| fe/fe-core/src/main/java/org/apache/doris/common/util/DynamicPartitionUtil.java | Adds near-limit warning log + metric increment for dynamic partition count checks. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Raises default partition limit config values to 20000 (and updates English description for auto partitions). |
| @@ -2966,9 +2966,8 @@ public class Config extends ConfigBase { | |||
| @ConfField(mutable = true, masterOnly = true, description = { | |||
| "对于自动分区表,防止用户意外创建大量分区,每个 OLAP 表允许的分区数量为`max_auto_partition_num`。默认 2000。", | |||
The Chinese description still says the default is 2000 (“默认 2000。”) but the actual default was changed to 20000. Update the Chinese string to match the new default to avoid misleading operators.
| "对于自动分区表,防止用户意外创建大量分区,每个 OLAP 表允许的分区数量为`max_auto_partition_num`。默认 2000。", | |
| "对于自动分区表,防止用户意外创建大量分区,每个 OLAP 表允许的分区数量为`max_auto_partition_num`。默认 20000。", |
| result.setStatus(errorStatus); | ||
| LOG.warn("send create partition error status: {}", result); | ||
| return result; | ||
| } else if (partitionNum > autoPartitionLimit * 8 / 10) { |
`autoPartitionLimit * 8 / 10` is computed using int arithmetic and can overflow if `max_auto_partition_num` is configured to a large value (this is a mutable config). Use long arithmetic (cast before multiply) or compare via `partitionNum * 10 > autoPartitionLimit * 8` using long to avoid overflow/truncation pitfalls.
| } else if (partitionNum > autoPartitionLimit * 8 / 10) { | |
| } else if ((long) partitionNum * 10 > (long) autoPartitionLimit * 8) { |
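The overflow described in this comment can be reproduced in isolation. The sketch below is illustrative only: the class `NearLimitCheck` and its method names are hypothetical and not part of the Doris code base; only the two comparison expressions are taken from the diff and the suggestion above.

```java
// Hypothetical stand-alone demo; not Doris code.
final class NearLimitCheck {
    // Mirrors the original expression: limit * 8 is evaluated in int,
    // so it wraps around once limit exceeds Integer.MAX_VALUE / 8.
    static boolean isNearLimitNaive(int partitionNum, int limit) {
        return partitionNum > limit * 8 / 10;
    }

    // Mirrors the suggested expression: both sides are widened to long
    // before multiplying, so no wrap-around can occur for int inputs.
    static boolean isNearLimitSafe(int partitionNum, int limit) {
        return (long) partitionNum * 10 > (long) limit * 8;
    }

    public static void main(String[] args) {
        int hugeLimit = 300_000_000; // an operator-configured value, assumed for illustration
        int partitionNum = 1_000;    // far below 80% of the limit
        // The naive check reports "near limit" because limit * 8 wrapped to a negative int.
        System.out.println("naive: " + isNearLimitNaive(partitionNum, hugeLimit));
        System.out.println("safe:  " + isNearLimitSafe(partitionNum, hugeLimit));
    }
}
```

For limits small enough that `limit * 8` fits in an int, such as the new default of 20000, the two checks agree; they diverge only for very large configured values.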
| } else if (partitionNum > autoPartitionLimit * 8 / 10) { | ||
| LOG.warn("Table {}.{} auto partition count {} is approaching limit {} (>80%)." | ||
| + " Consider increasing max_auto_partition_num.", | ||
| db.getFullName(), olapTable.getName(), partitionNum, autoPartitionLimit); | ||
| if (MetricRepo.isInit) { | ||
| MetricRepo.COUNTER_AUTO_PARTITION_NEAR_LIMIT.increase(1L); | ||
| } |
This warning (and counter increment) will trigger on every createPartition call once the table is above the 80% threshold, which can produce noisy logs and rapidly increasing counters in busy clusters. Consider adding throttling/deduping (e.g., log at most once per table per time window, or only when crossing the threshold) and similarly gate the metric increment to threshold-crossing events rather than per-request.
| } else if (partitionNum > autoPartitionLimit * 8 / 10) { | |
| LOG.warn("Table {}.{} auto partition count {} is approaching limit {} (>80%)." | |
| + " Consider increasing max_auto_partition_num.", | |
| db.getFullName(), olapTable.getName(), partitionNum, autoPartitionLimit); | |
| if (MetricRepo.isInit) { | |
| MetricRepo.COUNTER_AUTO_PARTITION_NEAR_LIMIT.increase(1L); | |
| } | |
| } else { | |
| // Only emit the warning and increment the metric when crossing the 80% threshold. | |
| // Estimate the partition count before this request by subtracting the number of | |
| // partitions requested to be added. This avoids noisy logs/metrics when the table | |
| // is already above the threshold. | |
| int prevPartitionNumEstimate = partitionNum - addPartitionClauseMap.size(); | |
| if (prevPartitionNumEstimate < 0) { | |
| prevPartitionNumEstimate = 0; | |
| } | |
| int threshold80 = autoPartitionLimit * 8 / 10; | |
| if (partitionNum > threshold80 && prevPartitionNumEstimate <= threshold80) { | |
| LOG.warn("Table {}.{} auto partition count {} is approaching limit {} (>80%)." | |
| + " Consider increasing max_auto_partition_num.", | |
| db.getFullName(), olapTable.getName(), partitionNum, autoPartitionLimit); | |
| if (MetricRepo.isInit) { | |
| MetricRepo.COUNTER_AUTO_PARTITION_NEAR_LIMIT.increase(1L); | |
| } | |
| } |
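The threshold-crossing idea in this suggestion can be isolated into a small pure function. The following is a minimal sketch under assumptions: `ThresholdCrossing` and `crossedNearLimit` are hypothetical names, and the previous count is passed in directly rather than estimated from `addPartitionClauseMap.size()` as in the suggestion. It compares in long arithmetic so that it does not reintroduce the overflow raised in the earlier comment.

```java
// Hypothetical helper; not Doris code.
final class ThresholdCrossing {
    // Returns true only when this request moves the count from at or below
    // 80% of the limit to above it. Both sides are scaled by 10 and compared
    // as long values to avoid int overflow and integer-division truncation.
    static boolean crossedNearLimit(int prevCount, int newCount, int limit) {
        long threshold = (long) limit * 8;
        return (long) newCount * 10 > threshold
                && (long) Math.max(prevCount, 0) * 10 <= threshold;
    }

    public static void main(String[] args) {
        int limit = 20_000; // the new default from this PR
        System.out.println(crossedNearLimit(15_990, 16_005, limit)); // crosses 80%
        System.out.println(crossedNearLimit(16_005, 16_010, limit)); // already above
        System.out.println(crossedNearLimit(100, 200, limit));       // well below
    }
}
```

With a check like this, a table that stays above 80% for a long time would produce one warning and one counter increment per crossing instead of one per request. The trade-off is that the signal disappears from rate-based alerts once the crossing has passed.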
| LOG.warn("Dynamic partition count {} is approaching limit {} (>80%)." | ||
| + " Consider increasing max_dynamic_partition_num.", | ||
| expectCreatePartitionNum, dynamicPartitionLimit); | ||
| if (MetricRepo.isInit) { | ||
| MetricRepo.COUNTER_DYNAMIC_PARTITION_NEAR_LIMIT.increase(1L); | ||
| } |
Similar to the auto-partition path, this will warn and increment the counter on every analysis call above the 80% threshold, which can be very frequent (DDL validations and retries). Consider throttling/deduping, or incrementing only on threshold crossing to keep logs/metrics actionable and avoid alert fatigue.
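One way to implement the throttling mentioned here is a per-key time window. The sketch below is an assumption-laden illustration, not the approach taken in the PR: `NearLimitThrottle`, `shouldReport`, and the `"db.table"` key format are all hypothetical. The caller passes the current time explicitly, which keeps the class easy to test.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-key throttle; not Doris code.
final class NearLimitThrottle {
    private final long windowMillis;
    // Last time (in millis) a warning was reported for each key.
    private final ConcurrentHashMap<String, Long> lastReported = new ConcurrentHashMap<>();

    NearLimitThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true at most once per window for a given key.
    // ConcurrentHashMap.compute runs atomically per key, so two concurrent
    // callers cannot both claim the same window.
    boolean shouldReport(String key, long nowMillis) {
        boolean[] report = {false};
        lastReported.compute(key, (k, last) -> {
            if (last == null || nowMillis - last >= windowMillis) {
                report[0] = true;
                return nowMillis;
            }
            return last;
        });
        return report[0];
    }

    public static void main(String[] args) {
        NearLimitThrottle throttle = new NearLimitThrottle(60_000L);
        long now = System.currentTimeMillis();
        System.out.println(throttle.shouldReport("db.table", now));          // first report
        System.out.println(throttle.shouldReport("db.table", now + 1_000L)); // suppressed
    }
}
```

A caller would wrap both the `LOG.warn` call and the counter increment in `if (throttle.shouldReport(...))`. Note that the map grows with the number of distinct keys, so a real implementation would likely need eviction for dropped tables.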
run buildall
Summary
Cherry-pick of #61511 to branch-4.1.
- Raise `max_dynamic_partition_num` default from 500 to 20000 and `max_auto_partition_num` from 2000 to 20000 to match modern production workloads
- Add Prometheus counter metrics (`auto_partition_near_limit_count`, `dynamic_partition_near_limit_count`) for monitoring/alerting
Conflict Resolution
- `Config.java`: Trivial context conflict in `max_auto_partition_num` description formatting, resolved by taking the incoming change (20000 default + updated English description).
Test plan
- Verify new metrics appear in `/metrics` Prometheus endpoint