CEP-45: Incremental repair for mutation tracking by aweisberg · Pull Request #4696 · apache/cassandra

aweisberg · 2026-03-27T19:59:39Z

No description provided.


maedhroz · 2026-03-31T19:34:36Z

src/java/org/apache/cassandra/replication/MutationTrackingSyncCoordinator.java

+        for (Shard shard : overlappingShards)
+        {
+            ShardSyncState state = new ShardSyncState(shard, liveHostIds);
+            shardStates.put(shard.range, state);


Trying to reason about the thread safety of shardStates here...

The assignment is clearly visible after the CAS above, but are the iterations inside the callbacks later guaranteed to see the results of the put()s here?

Registering the sync coordinator is a write barrier because it does a put in a ConcurrentHashMap? So any prior writes will be visible? As long as shardStates is effectively immutable after that, that particular map should be OK.

This could be an ImmutableMap which might make it a little clearer so I'll make that change.
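A minimal sketch of the safe-publication argument above, using only the JDK. The names `SafePublicationSketch`, `REGISTRY`, and `register` are hypothetical and not from the patch, and `Map.copyOf` stands in for Guava's `ImmutableMap`. It assumes the state map is fully built before the object is put into a `ConcurrentHashMap` and is never written afterwards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a coordinator whose per-shard state is built once, then only read.
final class SafePublicationSketch
{
    // Assumption: the registry is a ConcurrentHashMap, as suggested in the thread above.
    static final ConcurrentHashMap<String, SafePublicationSketch> REGISTRY = new ConcurrentHashMap<>();

    // Unmodifiable after construction, so callbacks can iterate it without extra synchronization.
    final Map<String, Integer> shardStates;

    SafePublicationSketch(Map<String, Integer> states)
    {
        this.shardStates = Map.copyOf(states);
    }

    static SafePublicationSketch register(String id, Map<String, Integer> states)
    {
        SafePublicationSketch coordinator = new SafePublicationSketch(states);
        // Everything written before put() happens-before a later get() of this key on another thread.
        REGISTRY.put(id, coordinator);
        return coordinator;
    }

    // Reads the published state from a different thread and returns the sum of its values.
    static int sumSeenByOtherThread(String id)
    {
        int[] sum = new int[1];
        Thread reader = new Thread(() -> {
            for (int value : REGISTRY.get(id).shardStates.values())
                sum[0] += value;
        });
        reader.start();
        try
        {
            reader.join(); // join() makes the reader's write to sum visible here
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum[0];
    }
}
```

Making the field an immutable copy turns the "nobody writes after publication" assumption into something the type enforces, which is the clarity benefit mentioned above.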

maedhroz · 2026-04-01T04:17:28Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+            finally
+            {
+                if (!allSucceeded)
+                    syncCoordinator.cancel();


What happens if the try block above produces InterruptedException? Do we need to cancel the rest of the sync coordinators (that hadn't been processed yet)?

We should clean them up just so they don't have to wait for their timeout to elapse to clean up. I'll rework the exception handling here to catch Exception instead of RuntimeException.
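A rough sketch of the cleanup described above, assuming each sync coordinator can be modelled as a `CompletableFuture`. `CancelRemainingSketch` and `awaitAll` are hypothetical names, not the patch's API. Catching `Exception` rather than `RuntimeException` means an `InterruptedException` also reaches the cleanup path, and coordinators that were never waited on are cancelled instead of running until their own timeout.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class CancelRemainingSketch
{
    // Waits for each sync in order; on any failure, cancels every sync that has not completed.
    static boolean awaitAll(List<CompletableFuture<Void>> syncs, long timeoutMillis)
    {
        boolean allSucceeded = false;
        try
        {
            for (CompletableFuture<Void> sync : syncs)
                sync.get(timeoutMillis, TimeUnit.MILLISECONDS);
            allSucceeded = true;
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        catch (Exception e)
        {
            // ExecutionException, TimeoutException, or an unexpected runtime failure
        }
        finally
        {
            if (!allSucceeded)
                for (CompletableFuture<Void> sync : syncs)
                    sync.cancel(true); // has no effect on syncs that already completed
        }
        return allSucceeded;
    }
}
```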

maedhroz · 2026-04-01T04:38:47Z

test/distributed/org/apache/cassandra/distributed/test/sai/PartialUpdateHandlingTest.java

-            CLUSTER.get(1).nodetoolResult("repair", specification.keyspaceName()).asserts().success();
+            // Background reconciliation doesn't exist/work so incremental repair just hangs waiting for reconciliation that never occurs
+            if (specification.replicationType.isTracked())
+                CLUSTER.get(1).nodetoolResult("repair", "-full", specification.keyspaceName()).asserts().success();


Should an incremental repair request succeed after a successful full repair? I tried this, and it appears to hang, but I'm not sure why yet...

"node1_Repair#4:1" #270 daemon prio=5 os_prio=31 cpu=0.08ms elapsed=92.27s tid=0x000000013de75200 nid=0x2530b waiting on condition [0x000000036944a000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@11.0.19/Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.19/LockSupport.java:357)
    at org.apache.cassandra.utils.concurrent.AsyncFuture.awaitUntil(AsyncFuture.java:221)
    at org.apache.cassandra.utils.concurrent.Awaitable$Defaults.await(Awaitable.java:114)
    at org.apache.cassandra.utils.concurrent.AbstractFuture.await(AbstractFuture.java:482)
    at org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:252)
    at org.apache.cassandra.replication.MutationTrackingSyncCoordinator.awaitCompletion(MutationTrackingSyncCoordinator.java:351)
    at org.apache.cassandra.repair.MutationTrackingIncrementalRepairTask.waitForSyncCompletion(MutationTrackingIncrementalRepairTask.java:127)

Anyway, I think the lack of background reconciliation still means that this won't work. The transfer IDs are only there to make sure read reconciliation works.

I misunderstood the test. I don't think IR should hang in this test because we aren't relying on background reconciliation. There aren't any down nodes at all.

Ah right, now I remember. The test inserts data using executeInternal, which gives the mutation an id and applies it locally correctly, but because it's only applied locally it never propagates, since there is no background reconciliation.

Mutations applied via execute/StorageProxy are given to ActiveLogReconciler which is basically in-memory hinted handoff for mutation tracking.

So this is working as intended for now in that we need to use full repair here instead of IR since IR can't complete until background reconciliation is done.

maedhroz · 2026-04-01T19:36:47Z

src/java/org/apache/cassandra/repair/messages/MutationTrackingSyncResponse.java

+    public int hashCode()
+    {
+        return Objects.hash(desc, offsetsByShard);
+    }


nit: Do we ever actually put MutationTrackingSyncResponse in a collection?

I'll remove hashCode and equals. I ran the tests and I don't think they get used anymore.

It's needed for RepairMessageSerializationsTest

maedhroz · 2026-04-01T19:38:38Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+        {
+            logger.warn("Mutation tracking sync failed for keyspace {}", keyspace, error);
+            resultPromise.tryFailure(error);
+            return;


nit: Coverage tooling indicates this might not be tested.

I'll add a test that allows the timeout to elapse.

Ah, the test that is supposed to test timeouts blocks all verbs, so it times out on the prepare rather than during the actual sync. I'll fix that test.

maedhroz · 2026-04-01T19:38:55Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+            catch (RuntimeException e)
+            {
+                allSucceeded = false;
+                error = Throwables.merge(error, e);


nit: Coverage tooling indicates this might not be tested.

I'll convert timeouts to exceptions so it can be exercised.
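One way to read "convert timeouts to exceptions" is sketched below. It assumes the wait is a bounded `get` on a future; `TimeoutAsExceptionSketch`, `awaitOrThrow`, and `SyncTimeoutException` are hypothetical names and not taken from the patch. With this shape, a timeout travels through the same `catch` branch as any other failure, so a test that lets the timeout elapse exercises that branch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutAsExceptionSketch
{
    // Hypothetical unchecked exception used to report an elapsed sync timeout.
    static final class SyncTimeoutException extends RuntimeException
    {
        SyncTimeoutException(String message)
        {
            super(message);
        }
    }

    // Returns the sync result, or throws instead of reporting the timeout as a boolean.
    static <T> T awaitOrThrow(CompletableFuture<T> sync, long timeoutMillis)
    {
        try
        {
            return sync.get(timeoutMillis, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e)
        {
            throw new SyncTimeoutException("sync did not complete within " + timeoutMillis + " ms");
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        catch (ExecutionException e)
        {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```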

maedhroz · 2026-04-01T19:39:05Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+            catch (Exception e)
+            {
+                logger.error("Error during mutation tracking repair", e);
+                resultPromise.tryFailure(e);


nit: Coverage tooling indicates this might not be tested.

The errors are put in resultPromise and don't get surfaced by allowing them to bubble up. To make this branch fire I could let exceptions bubble up and then get handled here. Would be less exception handling in general and then it would show up as tested.

I'll do that.
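A small sketch of the "let exceptions bubble up" shape discussed above, under the assumption that the task body can be expressed as a single `Callable`. `BubbleUpSketch` and `perform` are hypothetical names. The inner steps throw freely and one outer `catch (Exception e)` is the only place that fails the promise.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;

final class BubbleUpSketch
{
    // Runs the steps and reports any exception through the returned promise in one place.
    static CompletableFuture<String> perform(Callable<String> steps)
    {
        CompletableFuture<String> resultPromise = new CompletableFuture<>();
        try
        {
            resultPromise.complete(steps.call());
        }
        catch (Exception e)
        {
            // Single handler: every failure from the steps above ends up here.
            resultPromise.completeExceptionally(e);
        }
        return resultPromise;
    }
}
```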

maedhroz · 2026-04-01T19:39:14Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+        if (allRanges.isEmpty())
+        {
+            logger.info("No common ranges to repair for keyspace {}", keyspace);
+            return new AsyncPromise<CoordinatedRepairResult>().setSuccess(CoordinatedRepairResult.create(List.of(), List.of()));


nit: Coverage tooling indicates this might not be tested.

maedhroz · 2026-04-01T19:39:54Z

src/java/org/apache/cassandra/replication/MutationTrackingSyncCoordinator.java

+        if (overlappingShards.isEmpty())
+        {
+            completionFuture.setSuccess(null);
+            return;


nit: Coverage tooling indicates this might not be tested.

maedhroz · 2026-04-01T19:40:13Z

src/java/org/apache/cassandra/replication/MutationTrackingSyncCoordinator.java

+                                                     public void onFailure(InetAddressAndPort from, RequestFailure failure)
+                                                     {
+                                                         fail(new RuntimeException(
+                                                             String.format("Mutation tracking sync failed: participant %s returned failure %s", from, failure.reason)));


nit: Coverage tooling indicates this might not be tested.

maedhroz · 2026-04-01T19:40:33Z

src/java/org/apache/cassandra/replication/MutationTrackingSyncCoordinator.java

+            Shard currentShard = getCurrentShard(state.shard.range);
+            if (currentShard != state.shard)
+            {
+                failWithTopologyChange();


nit: Coverage tooling indicates this might not be tested.

Topology changes aren't supported yet https://issues.apache.org/jira/browse/CASSANDRA-20386

I'll take a look and see if I can at least induce one to exercise this failure path.

It might end up being more of a unit test than an end-to-end test.

maedhroz · 2026-04-01T19:41:56Z

src/java/org/apache/cassandra/repair/messages/MutationTrackingSyncRequest.java

+ * their current witnessed offsets. This establishes a happens-before relationship: the
+ * participant's response contains offsets captured after receiving this request, which is
+ * sent after the repair starts.
+ *


nit:

Suggested change

*

* <p>

maedhroz · 2026-04-01T20:36:24Z

src/java/org/apache/cassandra/io/sstable/format/SSTableWriter.java

+                inMigrationPendingRange = migrationInfo.isRangeInPendingMigration(metadata().id,
+                                                                                   first.getToken(),
+                                                                                   last.getToken());
+            }


nit: Could replace the above w/

KeyspaceMigrationInfo migrationInfo = ClusterMetadata.current().mutationTrackingMigrationState.getKeyspaceInfo(metadata().keyspace);
boolean inMigrationPendingRange = migrationInfo != null && migrationInfo.isRangeInPendingMigration(metadata().id, first.getToken(), last.getToken());

maedhroz · 2026-04-01T20:46:10Z

src/java/org/apache/cassandra/db/lifecycle/Tracker.java

+        // when incremental repair streams SSTables that were written before tracking was enabled.
+        Preconditions.checkState(!cfstore.metadata().replicationType().isTracked()
+                                 || ClusterMetadata.current().mutationTrackingMigrationState
+                                    .getKeyspaceInfo(cfstore.metadata().keyspace) != null);


nit: Might be nice to have something like an isMigrating(String) on MTMS, but just a matter of taste I guess.

I'll update it to use a helper.
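A sketch of what such a helper could look like. `MigrationStateSketch` is a hypothetical stand-in for the migration state class, and the `Object` value type replaces `KeyspaceMigrationInfo` only to keep the example self-contained; the actual helper in the patch may differ.

```java
import java.util.Map;

// Hypothetical stand-in for the migration state consulted by the precondition above.
final class MigrationStateSketch
{
    private final Map<String, Object> infoByKeyspace;

    MigrationStateSketch(Map<String, Object> infoByKeyspace)
    {
        this.infoByKeyspace = Map.copyOf(infoByKeyspace);
    }

    Object getKeyspaceInfo(String keyspace)
    {
        return infoByKeyspace.get(keyspace);
    }

    // The helper: a keyspace is migrating exactly when migration info exists for it.
    boolean isMigrating(String keyspace)
    {
        return getKeyspaceInfo(keyspace) != null;
    }

    // Mirrors the checkState above: untracked tables always pass, tracked ones only while migrating.
    static void checkUntrackedOrMigrating(boolean tracked, MigrationStateSketch state, String keyspace)
    {
        if (tracked && !state.isMigrating(keyspace))
            throw new IllegalStateException("tracked keyspace " + keyspace + " is not migrating");
    }
}
```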

maedhroz · 2026-04-01T20:46:49Z

src/java/org/apache/cassandra/db/lifecycle/Tracker.java

    {
-        Preconditions.checkState(!cfstore.metadata().replicationType().isTracked());
+        // Tracked tables may legitimately use this path during migration from untracked to tracked,
+        // when incremental repair streams SSTables that were written before tracking was enabled.


Does this mean that during migration to tracked, we'd expect these SSTables to have no coordinator log offsets then? Is that worth asserting?

It shouldn't matter for imports, since the keyspace being currently tracked means we'll avoid this method.

No, we will actually hit this method during migration. The SSTables might actually have offsets in them, since tracked writes have already started and the incremental repair starts after.

maedhroz · 2026-04-02T16:38:46Z

src/java/org/apache/cassandra/db/AbstractMutationVerbHandler.java

+        // flag on the mutation hasn't been set yet at this point — it's set later in
+        // applyMutation() — so we check the handler type instead.
+        if (this instanceof ReadRepairVerbHandler)
+            return metadata;


nit: I guess the other option would be something like a handlesReadRepair() method that only ReadRepairVerbHandler overrides, but it's literally called ReadRepairVerbHandler, and we probably won't have something else handle RR mutations.

In any case, I'm remembering blocking RR is going to be reworked for migration anyway, so ignore me :D

maedhroz · 2026-04-02T16:52:22Z

src/java/org/apache/cassandra/service/replication/migration/KeyspaceMigrationInfo.java

+                                              @Nonnull Collection<String> columnFamilies)
+    {
+        Iterable<TableMetadata> tables;
+        if (!columnFamilies.isEmpty())


nit: This is almost a case where null would be nice to indicate "all tables", in the sense that an empty collection might be more likely than null to indicate incorrect argument construction.

maedhroz · 2026-04-02T17:17:46Z

src/java/org/apache/cassandra/repair/RepairCoordinator.java

+            RepairTask task = new PreviewRepairTask(this, state.id, neighborsAndRanges.filterCommonRanges(state.keyspace, cfnames), neighborsAndRanges.shouldExcludeDeadParticipants, cfnames);
+            return task.perform(executor, validationScheduler)
+                       .<Pair<CoordinatedRepairResult, Supplier<String>>>map(r -> Pair.create(r, task::successMessage))
+                       .addCallback((s, f) -> executor.shutdown());


This block here is duplicated 3 more times below. The original code here avoided that by returning after the if/else stuff, but we could just delegate to a submitRepairTask() or something similar.
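A sketch of the suggested delegation, written against JDK types so it stands alone. `SubmitRepairTaskSketch`, its `Task` interface, and the use of `CompletableFuture` and `Map.Entry` in place of Cassandra's `Future` and `Pair` are all assumptions made for illustration; only the name `submitRepairTask` comes from the comment above.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

final class SubmitRepairTaskSketch
{
    // Hypothetical minimal view of a repair task: run it, and describe success afterwards.
    interface Task
    {
        CompletableFuture<String> perform(ExecutorService executor);
        String successMessage();
    }

    // The shared tail of every branch: run the task, pair its result with the
    // success message supplier, and shut the executor down when finished.
    static CompletableFuture<Map.Entry<String, Supplier<String>>> submitRepairTask(Task task, ExecutorService executor)
    {
        return task.perform(executor)
                   .thenApply(result -> {
                       Supplier<String> message = task::successMessage;
                       Map.Entry<String, Supplier<String>> pair = new SimpleImmutableEntry<>(result, message);
                       return pair;
                   })
                   .whenComplete((success, failure) -> executor.shutdown());
    }
}
```

Each branch of the if/else would then only construct its task and return `submitRepairTask(task, executor)`.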

maedhroz · 2026-04-02T17:26:57Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+                RepairJobDesc desc = new RepairJobDesc(parentSession, TimeUUID.Generator.nextTimeUUID(),
+                                                       keyspace, "Mutation Tracking Sync", List.of(range));
+                MutationTrackingSyncCoordinator syncCoordinator = new MutationTrackingSyncCoordinator(
+                    coordinator.ctx, desc, commonRange.endpoints, metadata);


MutationTrackingSyncCoordinator syncCoordinator = new MutationTrackingSyncCoordinator(coordinator.ctx, desc, commonRange.endpoints, metadata);

...might be a little easier on the eyes.

maedhroz · 2026-04-02T17:34:56Z

src/java/org/apache/cassandra/repair/RepairCoordinator.java

+                            Pair<CoordinatedRepairResult, Supplier<String>> irPair = Pair.create(irResult, incrementalTask::successMessage);
+                            mtTask.perform(executor, validationScheduler)
+                                    .addCallback(
+                                            mtResult -> result.trySuccess(irPair),


Do we need to handle partial failure here? (i.e. Do we just return the irPair result if the MT task partially fails?)
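To make the question concrete, here is one possible policy sketched with `CompletableFuture`: the combined result succeeds with the incremental repair result only if the mutation tracking step also succeeds, and fails otherwise. `PairedRepairSketch` and `runBoth` are hypothetical names, and this is not a statement of what the patch does.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class PairedRepairSketch
{
    // Runs incremental repair first, then the mutation tracking task.
    // A failure in either step fails the combined future.
    static CompletableFuture<String> runBoth(Supplier<CompletableFuture<String>> incremental,
                                             Supplier<CompletableFuture<Void>> mutationTracking)
    {
        return incremental.get()
                          .thenCompose(irResult -> mutationTracking.get()
                                                                   .thenApply(ignored -> irResult));
    }
}
```

With this shape the mutation tracking task is never started if incremental repair fails, and a mutation tracking failure is not masked by the earlier success.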

maedhroz · 2026-04-02T17:36:57Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+     * Determines if this keyspace should use mutation tracking incremental repair.
+     * Returns true if:
+     * - Keyspace uses mutation tracking replication, OR
+     * - Keyspace is currently migrating (either direction)


nit: Not strictly true if migrating to untracked?

maedhroz · 2026-04-02T17:38:46Z

src/java/org/apache/cassandra/repair/MutationTrackingIncrementalRepairTask.java

+            for (Range<Token> range : commonRange.ranges)
+            {
+                RepairJobDesc desc = new RepairJobDesc(parentSession, TimeUUID.Generator.nextTimeUUID(),
+                                                       keyspace, "Mutation Tracking Sync", List.of(range));


Table name is meaningless here, right?

Yes but I figured for debugging purposes it's clearer to not leave it empty.

maedhroz · 2026-04-02T19:04:27Z

src/java/org/apache/cassandra/replication/MutationTrackingSyncCoordinator.java

+
+        if (overlappingShards.isEmpty())
+        {
+            completionFuture.setSuccess(null);


nit: Might be nice to have a DEBUG level log message to indicate this happened.

maedhroz · 2026-04-02T19:07:01Z

src/java/org/apache/cassandra/replication/MutationTrackingSyncCoordinator.java

+            }
+            // Always include the local node
+            liveHostIds.add(metadata.directory.peerId(ctx.broadcastAddressAndPort()).id());
+        }


nit: If we just build the liveHostIds at construction time, could we make it final?

maedhroz · 2026-04-02T19:19:23Z

src/java/org/apache/cassandra/replication/MutationTrackingSyncCoordinator.java

+        if (completionFuture.isDone())
+            return;
+
+        recaptureTargets();


It looks like this is called from updateReplicatedOffsets(), but does that mean we keep expanding the targets after the initial round of sync requests? (i.e. If there are ongoing writes, can this cause the whole IR to time out?)
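The concern can be illustrated with a toy model, which makes no claim about how `recaptureTargets()` actually behaves. It assumes a single replica that reconciles at the same rate as new writes arrive; `MovingTargetSketch` and all of its numbers are invented for illustration.

```java
final class MovingTargetSketch
{
    // Returns the round in which the replica reaches the target, or -1 if it never does within maxRounds.
    static int roundsToComplete(boolean recaptureEachRound, int maxRounds)
    {
        long coordinatorOffset = 10; // highest offset written so far
        long replicaOffset = 0;      // highest offset reconciled on the replica
        long target = coordinatorOffset; // captured once when the sync starts

        for (int round = 1; round <= maxRounds; round++)
        {
            coordinatorOffset += 5; // ongoing writes
            replicaOffset += 5;     // reconciliation keeps pace but does not catch up
            if (recaptureEachRound)
                target = coordinatorOffset; // target moves with every update
            if (replicaOffset >= target)
                return round;
        }
        return -1;
    }
}
```

In this model a target captured once is reached in the second round, while a target recaptured on every update stays permanently ahead of the replica, which is the timeout scenario raised above.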

aparnanaik0522 and others added 30 commits March 18, 2026 16:23

CEP-45: Incremental Repair Blocking Wait for offsets

a249e80

end--no-edit

9e750e4

Create new test to validate inc repair on ALL replicas

5f51ed1

SyncCoordinatorTest file fix

4aced30

Change shardStates from CHM -> HM

42b465e

Fix possible shard staleness

30cf06f

Fix for happens-before

92ab440

Fix MutationTrackingIncrementalRepairTask file

6d68ef0

Fix IR still doing anti-compaction and TCM consulted multiple times non-atomically

3ab3b55

Collect offsets via message exchange

f0f68c3

Remove extra timeouts, make top level timeout configurable hot prop

44a3bd3

Fix timeout calculation and clean up error handling

3c457fc

Mutation tracking sync timeout doesn't need to be that long. If it didn't work in 2 minutes it won't work.

17095cd

Using IncrementalRepairTask directly instead of embedding inside MutationTrackingIncrementalRepairTask

65996e4

Add support for force and with hosts

f99c9d5

Fix MutationTrackingSyncCoordinatorTest

4774648

Clean up/fix result handling when pairing incremental repair with MT repair

d591087

Fix java.lang.IllegalStateException: Attempted to create a new keyspace shard for keyspace distributed_test_keyspace, but it already exists on bounce

42e23c2

Add MutationTrackingRepairTest

65a366e

checkpoint before big mess

886e4bd

During migration incremental repair might legitimately need to add sstables

2e27603

During migration SSTableWriter.finalizeMetadata() needs to accept IR marking sstables repaired to effect the migration

d42314e

Can't run MT sync during migration away from mutation tracking because it will hang, so just use IR instead

94cf54b

RepairCoordinator add logging

9cf77b9

During mutation tracking migration restrict repairs to being either entirely inside migrated ranges or entirely outside, but not both

adf3638

Spurious error trying to set failure on promise when it has already been completed, use tryFailure instead

b4b06b3

Allow read repair mutations to be applied during migration

3f6f640

Finish MutationTrackingRepairTest

20c464c

Final test and bug fixes after rebase

04c49ba

A bunch of self review changes/improvements

15a98da

Fix flaky testFailedMutationRedelivery

76a6aac

aweisberg requested a review from maedhroz March 27, 2026 19:59

Reset filters only after checking data is inconsistent

5ac9e2f

maedhroz mentioned this pull request Mar 31, 2026

CEP-45: Incremental Repair Blocking Wait for offsets #4569

Open

maedhroz reviewed Mar 31, 2026

maedhroz reviewed Apr 1, 2026

Fix MutationTrackingRepairTest style

4ab0f2b

maedhroz reviewed Apr 1, 2026

maedhroz reviewed Apr 2, 2026

Conversation

aweisberg commented Mar 27, 2026
