feat(common): Add Policy for cleanup/rollback before each write#18197
feat(common): Add Policy for cleanup/rollback before each write#18197nsivabalan merged 8 commits intoapache:masterfrom
Conversation
|
Are you going to revisit the patch based on #17901 (comment) ? |
Yes I updated that issue to reflect this PR: |
| } | ||
| switch (policy) { | ||
| case CLEAN: | ||
| tableServiceClient.clean(Option.empty(), true); |
There was a problem hiding this comment.
why do we need clean here?
if we have clean, it will automatically take care of rolling back failed writes as well right?
So, I see the need only for either of the enum values.
if you agree, we can just keep it a boolean config flag as well.
lmk wdyt
There was a problem hiding this comment.
The reason I was thinking of clean as an option is since a user might have a scenario where
- They are repeatedly insert_overwriting the same partition
- The ingestion job fails after completing the commit but before/while performing clean
- older files start to buildup in DFS folder
so having an option to clean before each write allows users a way of preventing/mitigating this without needing to launch/orchestrate/handhold a separate "clean job"
There was a problem hiding this comment.
then, don't we need to support a policy for both clean and rollback_failed_writes ?
There was a problem hiding this comment.
Since clean implicitly does rollback of failed writes, I think it makes sense to keep the option of clean vs rollback_failed_writes vs do nothing mutually exclusive.
nsivabalan
left a comment
There was a problem hiding this comment.
lets align on the config before I can review the tests.
Guard runPreWriteCleanerPolicy() in BaseHoodieWriteClient to log and return early when table services are disabled, rather than overriding the config value in HoodieWriteConfig.Builder. Made-with: Cursor
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18197 +/- ##
============================================
- Coverage 57.31% 57.30% -0.02%
- Complexity 18631 18632 +1
============================================
Files 1955 1956 +1
Lines 106932 106971 +39
Branches 13231 13236 +5
============================================
+ Hits 61291 61296 +5
- Misses 39854 39878 +24
- Partials 5787 5797 +10
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
Adds an option to run CLEAN and/or rollback of failed writes at the start of each write (when
startCommitorstartCommitWithTimeis used), so that writers that repeatedly UPSERT or INSERT_OVERWRITE the same partition can avoid buildup of uncleaned or failed-write data when a previous run failed before finishing clean/rollback (Without having to switch to eager clean and single writer concurrency mode)Context in #17901
Summary and Changelog
Summary: New config
hoodie.cleaner.prewrite.cleaner.policycontrols whether to run a CLEAN and/or rollback of failed writes before starting a new ingestion commit. Supports"clean"(run CLEAN, which also rolls back failed writes) and"rollback_failed_writes"(only roll back failed writes). Default is empty (no pre-write clean/rollback). Intended for temporary use to “force” a clean before the next write when needed; not recommended as a default because a slow or failing clean can further block or delay the next ingestion write.Changelog:
PREWRITE_CLEANER_POLICYandwithPreWriteCleanerPolicy(String).getPreWriteCleanerPolicy().startCommittestPreWriteCleanPolicyfor both policy valuesImpact
hoodie.cleaner.prewrite.cleaner.policyis set to"clean"or"rollback_failed_writes", each call tostartCommit()orstartCommitWithTime()runs the corresponding clean/rollback before creating the new commit. Unset (default) preserves current behavior. Not intended for writes to metadata table.Risk Level
Low. Change is opt-in and backward compatible. Default is empty (no pre-write clean/rollback)
Documentation Update
hoodie.cleaner.prewrite.cleaner.policystartCommitorstartCommitWithTimeis used."clean"= run CLEAN (and thus also roll back failed writes);"rollback_failed_writes"= only roll back failed writes. Empty (default) = no pre-write clean/rollback. Use only when needed to avoid buildup of uncleaned/failed-write data; not recommended as default because a slow or failing clean can block the next ingestion.Contributor's checklist