Skip to content

feat(common): Add Policy for cleanup/rollback before each write#18197

Merged
nsivabalan merged 8 commits intoapache:masterfrom
kbuci:PR-17901
Mar 11, 2026
Merged

feat(common): Add Policy for cleanup/rollback before each write#18197
nsivabalan merged 8 commits intoapache:masterfrom
kbuci:PR-17901

Conversation

@kbuci
Copy link
Copy Markdown
Contributor

@kbuci kbuci commented Feb 14, 2026

Describe the issue this Pull Request addresses

Adds an option to run CLEAN and/or rollback of failed writes at the start of each write (when startCommit or startCommitWithTime is used), so that writers that repeatedly UPSERT or INSERT_OVERWRITE the same partition can avoid buildup of uncleaned or failed-write data when a previous run failed before finishing clean/rollback (Without having to switch to eager clean and single writer concurrency mode)

Context in #17901

Summary and Changelog

Summary: New config hoodie.cleaner.prewrite.cleaner.policy controls whether to run a CLEAN and/or rollback of failed writes before starting a new ingestion commit. Supports "clean" (run CLEAN, which also rolls back failed writes) and "rollback_failed_writes" (only roll back failed writes). Default is empty (no pre-write clean/rollback). Intended for temporary use to “force” a clean before the next write when needed; not recommended as a default because a slow or failing clean can further block or delay the next ingestion write.

Changelog:

  • HoodieCleanConfig: Added PREWRITE_CLEANER_POLICY and withPreWriteCleanerPolicy(String).
  • HoodieWriteConfig: Added getPreWriteCleanerPolicy().
  • BaseHoodieWriteClient: Invoke pre-write cleaner policy in startCommit
  • TestCleaner: Added parameterized test testPreWriteCleanPolicy for both policy values

Impact

  • Public API / user-facing: New config only. No existing config defaults or method signatures changed.
  • Behavior: When hoodie.cleaner.prewrite.cleaner.policy is set to "clean" or "rollback_failed_writes", each call to startCommit() or startCommitWithTime() runs the corresponding clean/rollback before creating the new commit. Unset (default) preserves current behavior. Not intended for writes to metadata table.
  • Performance: If enabled, each write start may run one CLEAN and/or one rollback of failed writes; impact depends on table size and clean/rollback cost.

Risk Level

Low. Change is opt-in and backward compatible. Default is empty (no pre-write clean/rollback)

Documentation Update

  • Config: Document new config in the Hudi config reference (or equivalent) under Cleaner config:
    • Config key: hoodie.cleaner.prewrite.cleaner.policy
    • Description: If set, run clean and/or rollback of failed writes before starting a new ingestion commit when startCommit or startCommitWithTime is used. "clean" = run CLEAN (and thus also roll back failed writes); "rollback_failed_writes" = only roll back failed writes. Empty (default) = no pre-write clean/rollback. Use only when needed to avoid buildup of uncleaned/failed-write data; not recommended as default because a slow or failing clean can block the next ingestion.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Feb 14, 2026
@kbuci kbuci changed the title Add Policy for cleanup/rollback before each write feature(common) Add Policy for cleanup/rollback before each write Feb 17, 2026
@kbuci kbuci marked this pull request as ready for review February 17, 2026 21:00
@nsivabalan
Copy link
Copy Markdown
Contributor

Are you going to revisit the patch based on #17901 (comment) ?
if yes, we don't want to introduce another policy right

@kbuci
Copy link
Copy Markdown
Contributor Author

kbuci commented Feb 21, 2026

Are you going to revisit the patch based on #17901 (comment) ? if yes, we don't want to introduce another policy right

Yes I updated that issue to reflect this PR:
We aren't adding a new policy, just a new standalone config that will apply regardless of cleaner rolllback failed writes policy.

@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Mar 2, 2026
@github-actions github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Mar 2, 2026
@kbuci kbuci changed the title feature(common) Add Policy for cleanup/rollback before each write feature(common): Add Policy for cleanup/rollback before each write Mar 2, 2026
@kbuci kbuci changed the title feature(common): Add Policy for cleanup/rollback before each write feat(common): Add Policy for cleanup/rollback before each write Mar 3, 2026
}
switch (policy) {
case CLEAN:
tableServiceClient.clean(Option.empty(), true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need clean here?
if we have clean, it will automatically take care of rolling back failed writes as well right?
So, I see the need only for either of the enum values.

if you agree, we can just keep it a boolean config flag as well.
lmk wdyt

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I was thinking of clean as an option is since a user might have a scenario where

  • They are repeatedly insert_overwriting the same partition
  • The ingestion job fails after completing the commit but before/while performing clean
  • older files start to buildup in DFS folder

so having an option to clean before each write allows users a way of preventing/mitigating this without needing to launch/orchestrate/handhold a separate "clean job"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, don't we need to support a policy for both clean and rollback_failed_writes ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since clean implicitly does rollback of failed writes, I think it makes sense to keep the option of clean vs rollback_failed_writes vs do nothing mutually exclusive.

Copy link
Copy Markdown
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets align on the config before I can review the tests.

@kbuci kbuci requested a review from nsivabalan March 4, 2026 23:08
@kbuci kbuci requested a review from nsivabalan March 7, 2026 02:38
Krishen Bhan added 8 commits March 9, 2026 23:41
Guard runPreWriteCleanerPolicy() in BaseHoodieWriteClient to log and
return early when table services are disabled, rather than overriding
the config value in HoodieWriteConfig.Builder.

Made-with: Cursor
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit 39cb726 into apache:master Mar 11, 2026
72 checks passed
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 41.02564% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.30%. Comparing base (b5daa30) to head (df23ba4).
⚠️ Report is 29 commits behind head on master.

Files with missing lines Patch % Lines
...hudi/common/model/HoodiePreWriteCleanerPolicy.java 35.29% 6 Missing and 5 partials ⚠️
.../org/apache/hudi/client/BaseHoodieWriteClient.java 23.07% 8 Missing and 2 partials ⚠️
...java/org/apache/hudi/config/HoodieCleanConfig.java 75.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18197      +/-   ##
============================================
- Coverage     57.31%   57.30%   -0.02%     
- Complexity    18631    18632       +1     
============================================
  Files          1955     1956       +1     
  Lines        106932   106971      +39     
  Branches      13231    13236       +5     
============================================
+ Hits          61291    61296       +5     
- Misses        39854    39878      +24     
- Partials       5787     5797      +10     
Flag Coverage Δ
hadoop-mr-java-client 45.21% <41.02%> (-0.03%) ⬇️
spark-java-tests 47.48% <41.02%> (-0.02%) ⬇️
spark-scala-tests 45.57% <41.02%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...java/org/apache/hudi/config/HoodieWriteConfig.java 85.01% <100.00%> (+0.01%) ⬆️
...java/org/apache/hudi/config/HoodieCleanConfig.java 88.20% <75.00%> (-0.63%) ⬇️
.../org/apache/hudi/client/BaseHoodieWriteClient.java 72.24% <23.07%> (-1.21%) ⬇️
...hudi/common/model/HoodiePreWriteCleanerPolicy.java 35.29% <35.29%> (ø)

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants