Skip to content

Add defensive sequence number validation in replicated event persistence#2815

Draft
He-Pin wants to merge 1 commit intoapache:mainfrom
He-Pin:improve/validate-replica-seqnr
Draft

Add defensive sequence number validation in replicated event persistence#2815
He-Pin wants to merge 1 commit intoapache:mainfrom
He-Pin:improve/validate-replica-seqnr

Conversation

@He-Pin
Copy link
Copy Markdown
Member

@He-Pin He-Pin commented Mar 28, 2026

Motivation

When handleExternalReplicatedEventPersist receives a replicated event, it updates seenPerReplica with the event's originSequenceNr without validating it matches the expected next sequence number. This was flagged as a FIXME referencing akka#29259.

If events arrive out of order (e.g., seqNr 7 arrives when expecting 6), seenPerReplica advances past the gap, and the missed event (seqNr 6) will later be filtered as "already seen" — permanently lost.

Modification

  • Replaced the FIXME with a proper defensive validation check
  • Added warning log when originSequenceNr != expectedSeqNr for the replica
  • Documented the caller contracts explaining why only gap events (not duplicates) can reach this validation:
    • onPublishedEvent: Rejects both duplicates and gaps before calling — only exact-match seqNr passes
    • onReplicatedEvent: Filters duplicates via alreadySeen(), but gaps may pass through
  • Preserved backward compatibility: Events are still persisted even on mismatch (rejecting could stall replication)

Design Decision: Why != instead of separate < and > branches

Analysis confirmed that:

  1. The < expectedSeqNr (duplicate) branch would be dead code — callers already filter duplicates
  2. Only the > expectedSeqNr (gap) scenario can actually fire
  3. Using a single != check avoids dead code while remaining defensive against future caller changes

Result

  • Operators now get actionable warning logs when sequence number gaps occur in replication
  • The warning message includes the expected vs actual seqNr and notes that earlier events may be skipped
  • No behavior change for normal operation (exact-match events processed silently)
  • All 18 replicated event tests pass (ReplicatedEventSourcingSpec + ReplicatedEventPublishingSpec)

References

  • Resolves FIXME from akka#29259 (which is now Apache licensed)

Replace the FIXME (akka#29259) comment in handleExternalReplicatedEventPersist
with a proper defensive validation of the replica sequence number before
updating seenPerReplica.

The validation logs a warning when the incoming event's originSequenceNr
does not match the expected next sequence number for that replica. This
covers the gap scenario where events from a replica may arrive out of
order via the replication stream (onReplicatedEvent path). The event is
still persisted for backward compatibility — rejecting it could stall
the replication stream.

Key design decisions (confirmed by cross-review from GPT-5.4 and Sonnet 4.6):
- Only gap detection (seqNr > expected) can fire from current callers;
  onPublishedEvent filters both duplicates and gaps before calling.
  onReplicatedEvent filters duplicates via alreadySeen() but allows gaps.
- Uses != check (not separate < and > branches) to avoid dead code.
- Warning message includes the advancing seqNr to help operators diagnose
  potential event loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@He-Pin He-Pin force-pushed the improve/validate-replica-seqnr branch from afbf92c to 14f3ab7 Compare March 28, 2026 12:50
Copy link
Copy Markdown
Member Author

@He-Pin He-Pin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work resolving the long-standing FIXME from akka#29259. The approach is sensible.

Concerns:

  1. The warning uses warnN (no-arg interpolation) -- verify this is the correct method for the internal logger. Some Pekko internal loggers use different interpolation syntax.

  2. Silently advancing with a warning may not be sufficient -- the PR acknowledges that events are still persisted to avoid stalling replication, but this means the gap problem isn't actually fixed, just documented. The warning message says 'Advancing seenPerReplica to [X]; earlier events from this replica may be skipped' which is accurate but operators who see this warning have no automatic recovery path. Consider whether a metric or counter should also be incremented so this can be monitored.

  3. The != vs separate < and > analysis is correct -- dead code elimination is good. However, if future caller changes introduce a case where a duplicate (seqNr < expected) reaches this code, the warning message would be misleading ('earlier events may be skipped' doesn't apply to duplicates). Consider adding a guard comment or a separate branch for < that logs a different message, even if it's currently unreachable.

  4. The caller contract documentation in the comments is excellent -- very helpful for future maintainers.

Copy link
Copy Markdown
Member Author

@He-Pin He-Pin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep CR: PR #2815 - Add defensive sequence number validation in replicated event persistence

Architecture Review

This PR resolves a long-standing FIXME from akka/akka-core#29259. The issue is that handleExternalReplicatedEventPersist updates seenPerReplica with the event's originSequenceNr without validating it is the expected next number. If events arrive out of order (e.g., seqNr 7 when expecting 6), seenPerReplica advances past the gap, permanently losing seqNr 6.

Design decision analysis:

The implementation uses a single != check rather than separate < and > branches. The PR description provides solid justification:

  • < expectedSeqNr (duplicate) is dead code - callers already filter duplicates via alreadySeen()
  • Only > expectedSeqNr (gap) can actually reach this code
  • A single != check is defensive against future caller changes

However, the warning message says "Advancing seenPerReplica to [X]; earlier events from this replica may be skipped." This message is accurate for the gap scenario but misleading if a future code change introduces a duplicate path. The current caller analysis is correct, but this should be documented at the call site to prevent future confusion.

Binary Compatibility

No public API changes. This is an internal behavior change in persistence-typed internals (Running.scala). No MiMa impact.

Test Coverage

Missing explicit tests: The PR description mentions "All 18 replicated event tests pass" but there are no NEW tests specifically exercising the warning path. This is a gap - the warning logic should have a test that:

  1. Simulates an out-of-order event delivery (seqNr gap)
  2. Verifies the warning is logged
  3. Verifies the event is still persisted (backward compatibility)
  4. Verifies seenPerReplica advances to the gap value

Since this is in persistence-typed/internal, testing requires either:

  • Multi-node replication tests (expensive, already covered by existing ReplicatedEventSourcingSpec)
  • Unit-level testing of the Running state machine (complex due to internal APIs)

Recommendation: At minimum, add a comment in the test file explaining why this path is not directly tested and how the existing tests indirectly exercise it.

Performance Impact

Minimal. One additional getOrElse lookup and comparison per replicated event. The comparison is O(1) and the seenPerReplica map is bounded by the number of replicas (typically 2-5).

Operational Impact

Warning log frequency: In a healthy replication setup, this warning should NEVER fire. If it fires frequently, it indicates a serious problem in the replication layer. The warning message is actionable and includes all necessary debugging information.

No automatic recovery: The warning logs the issue but does not attempt recovery. Operators seeing this warning have no built-in remediation path. Consider whether a metric counter (e.g., pekko.persistence.replicated-event-sequence-gaps) should be added for monitoring/alerting.

Suggestions

  1. Add a test case that exercises the warning path, or document why it cannot be tested directly.
  2. Consider adding a metric counter for sequence number gaps for operational monitoring.
  3. Add a comment at the call site (onReplicatedEvent) documenting why gaps are not filtered before reaching this code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant