fix: Fail fast when executor ahead #2966

alpe · 2026-01-12T13:27:29Z

Ensure consistent state.

Reverts part of 9d4c64c#diff-c3cd6bebbc2af5fea33694c828f08dedbebb4e613f93ecf587e74bc56f9b76b6R78

github-actions · 2026-01-12T13:27:42Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Jan 12, 2026, 1:28 PM

claude · 2026-01-12T13:27:45Z

Claude finished @alpe's task in 2m 6s —— View job

Comprehensive Code Review

I've completed a thorough review of this PR, which changes the behavior when the execution layer is ahead of the target height from a warning to a fail-fast error. This is a critical correctness improvement for maintaining state consistency.

Summary

Overall Assessment: ✅ Approved with Suggestions

This PR correctly enforces state consistency by failing fast when the execution layer is ahead of the target height. The change is well-tested and properly documented. I've identified one actionable improvement around error handling patterns and several observations about the broader system design.

1. Code Quality & Architecture ✅

Strengths:

Clear separation of concerns: Replayer is focused on synchronization logic
Proper use of optional interface pattern (HeightProvider) for backward compatibility
Good naming and clear intent in the code
Appropriate logging at different levels (Info → Error for the divergence case)

Observations:

The function SyncToHeight at block/internal/common/replay.go:46 now has three distinct exit paths (skip if no HeightProvider, skip if at genesis, fail if ahead, sync if behind). This is reasonable but worth noting for maintainability.
The comment at lines 70-71 correctly explains the rationale for the fail-fast behavior, which is excellent documentation.

2. Error Handling & Resilience ⚠️

Critical Issue - Structured Error Type:

At block/internal/common/replay.go:73-74, the code returns a generic fmt.Errorf error. This makes it difficult for callers to distinguish between this specific error condition and other failures programmatically.

Impact: Looking at the callers:

block/internal/executing/executor.go:225-228 wraps and returns the error, then calls sendCriticalError
block/internal/syncing/syncer.go:285-287 wraps and returns the error

Both callers treat this as a fatal error during startup, which is correct. However, they cannot differentiate between "execution layer ahead" (requires manual rollback) and other sync failures (might be transient).

Recommendation: Define a custom error type to enable programmatic error handling:

// ExecutorAheadError is returned when the execution layer is ahead of the target height,
// indicating state divergence that requires manual intervention.
type ExecutorAheadError struct {
    ExecHeight   uint64
    TargetHeight uint64
}

func (e *ExecutorAheadError) Error() string {
    return fmt.Sprintf("execution layer height (%d) ahead of target height (%d): manually rollback execution layer to height %d",
        e.ExecHeight, e.TargetHeight, e.TargetHeight)
}

// In SyncToHeight:
return &ExecutorAheadError{ExecHeight: execHeight, TargetHeight: targetHeight}

Benefits:

Callers can use errors.As() to detect this specific condition
More precise error testing (see test improvement below)
Error contains structured data (heights) for programmatic access
Follows Go best practices for error handling (see Go Blog on Error Handling)

Fix this →

Context Handling:

Good use of context.Context parameter in SyncToHeight
Context is properly passed down to GetLatestHeight and replayBlock
No context cancellation checks in the sync loop (lines 86-90), but this is acceptable since each replayBlock call respects the context

3. Security Considerations ✅

Positive Findings:

No sensitive data exposure in logs or error messages
State hash validation at replay.go:162-174 properly detects state divergence during replay
The fail-fast behavior prevents operating with inconsistent state, which is a security improvement
No injection vulnerabilities or unsafe operations

Additional Observations:

The error message explicitly tells operators to "manually rollback execution layer to height X", which is good operational guidance
The logging at line 65-68 (Info level) provides visibility without exposing sensitive state data

4. Performance & Resource Efficiency ✅

Positive Aspects:

Early returns for skip conditions (lines 50-51, 55-57) avoid unnecessary work
Height comparison is O(1) before any expensive operations
The sync loop (86-90) processes blocks sequentially, which is correct for state consistency

Considerations:

For large height gaps, the sequential sync could be slow, but this is necessary for correctness
The replayBlock function makes multiple store reads per block (lines 113, 136), but this is unavoidable for safe replay
No goroutine leaks or unbounded resource allocation

5. Testing & Reliability ✅

Strong Test Coverage:

The updated test at block/internal/common/replay_test.go:124-151 correctly validates the new behavior:

✅ Tests the error return (line 146)
✅ Verifies error message contains "ahead of target height" (line 147)
✅ Confirms no replay is attempted (line 150)

Suggested Test Improvement:

If implementing the structured error type, update the test to assert the specific error type:

func TestReplayer_SyncToHeight_ExecutorAhead(t *testing.T) {
    // ... existing setup ...
    
    err := syncer.SyncToHeight(ctx, targetHeight)
    require.Error(t, err)
    
    // Assert specific error type
    var execAheadErr *common.ExecutorAheadError
    require.True(t, errors.As(err, &execAheadErr))
    require.Equal(t, execHeight, execAheadErr.ExecHeight)
    require.Equal(t, targetHeight, execAheadErr.TargetHeight)
    
    // No replay should be attempted
    mockExec.AssertNotCalled(t, "ExecuteTxs", ...)
}

Additional Test Coverage:

Existing tests cover: executor behind, in sync, ahead, no HeightProvider, at genesis, multiple blocks, first block, app hash mismatch
Edge cases are well covered
No obvious gaps in test scenarios

6. Documentation & Developer Experience ✅

Excellent Documentation:

Function comment at lines 41-45 clearly explains the purpose and return conditions
Inline comment at lines 70-71 explains the why behind the fail-fast behavior
Error message provides actionable guidance ("manually rollback execution layer to height X")
Logging messages are informative and include relevant context (heights)

Suggestions:

Consider adding a link to operational runbooks in the error message or code comment for how to perform the manual rollback
The CLAUDE.md file could be updated with guidance on handling this error condition during operations

Impact Analysis

Callers Affected:

block/internal/executing/executor.go:225 - Called during executor initialization
- Impact: Executor will fail to start if execution layer is ahead
- Behavior: Calls sendCriticalError and returns error, preventing node startup
- ✅ Correct: This is the desired behavior - preventing operation with inconsistent state
block/internal/syncing/syncer.go:285 - Called during syncer initialization
- Impact: Syncer will fail to start if execution layer is ahead
- Behavior: Returns error during startup, preventing syncer from running
- ✅ Correct: Same rationale as above

Breaking Change Assessment:

This is a behavior change that could break existing deployments where the execution layer has gotten ahead
However, this is a correctness fix - the previous behavior was unsafe
Operators will need to manually intervene if this condition occurs, but that's the right trade-off for data integrity

Additional Observations

Previous Behavior (being reverted):
The previous code (from commit 9d4c64c) allowed the node to continue when the execution layer was ahead, just logging a warning. This PR correctly identifies this as unsafe and returns an error instead.

Commit Message:
The second commit "Remove verbose error log" (a5bee2e) removes the redundant Error log since the error is now returned and will be logged by the callers.

Summary of Recommendations

High Priority: Implement structured error type (ExecutorAheadError) for better error handling
Medium Priority: Update test to assert specific error type when using structured errors
Low Priority: Consider adding operational documentation reference in error message or comments

Conclusion

This is a well-implemented correctness improvement that properly enforces state consistency. The change is appropriately tested and clearly documented. The main suggestion is to use a structured error type for better programmatic error handling, but the current implementation is functionally correct.

The fail-fast approach is the right design choice for maintaining system integrity, even though it may require manual intervention in edge cases.

Review Status: ✅ Approved with suggestions for structured error handling
| Branch: alex/fail_state_ahead

gemini-code-assist · 2026-01-12T13:27:46Z

Summary of Changes

Hello @alpe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the synchronization logic within the Replayer component to strictly enforce state consistency. It changes the behavior from a warning to an immediate error when the execution layer is found to be ahead of the target height, thereby preventing potential state divergence and requiring explicit manual resolution.

Highlights

Changed Error Handling for Execution Layer Ahead: The Replayer.SyncToHeight function now explicitly returns an error and halts execution if the execution layer's height is found to be ahead of the target height. Previously, this condition would only log a warning and allow the process to continue.
Enforced State Consistency: This modification ensures that the system fails fast when a state divergence is detected, preventing the node from operating with an inconsistent execution layer and requiring manual intervention (rollback) to resolve the issue.
Updated Test Case: The TestReplayer_SyncToHeight_ExecutorAhead unit test has been updated to assert that an error is indeed returned when the execution layer is ahead, validating the new 'fail fast' behavior.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request correctly changes the behavior to fail fast when the execution layer is ahead of the target height, which is a crucial improvement for maintaining state consistency. The implementation is clear and the accompanying test validates the new error condition. I've added one suggestion to further improve the error handling by using a structured error type, which would make the API more robust and easier for callers to handle this specific failure case.

I am having trouble creating individual review comments. Click here to see my feedback.

block/internal/common/replay.go (77-78)

Returning a generic error from fmt.Errorf makes it difficult for callers to handle this specific error condition programmatically. It's a best practice in Go to use a structured error or a sentinel error for such cases.

Consider defining a custom error type. This allows callers to use errors.As() for type-safe error handling and would also enable more precise testing in TestReplayer_SyncToHeight_ExecutorAhead by asserting the specific error type and its fields.

For example, you could define and use a custom error like this:

// ExecutorAheadError is returned when the execution layer is ahead of the target height.
type ExecutorAheadError struct {
	ExecHeight   uint64
	TargetHeight uint64
}

func (e *ExecutorAheadError) Error() string {
	return fmt.Sprintf("execution layer height (%d) ahead of target height (%d): manually rollback execution layer to height %d",
		e.ExecHeight, e.TargetHeight, e.TargetHeight)
}

// Then, in SyncToHeight, you would return:
return &ExecutorAheadError{ExecHeight: execHeight, TargetHeight: targetHeight}

codecov · 2026-01-12T13:31:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.75%. Comparing base (aaae087) to head (a5bee2e).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2966      +/-   ##
==========================================
- Coverage   57.88%   57.75%   -0.14%     
==========================================
  Files          97       97              
  Lines        9306     9303       -3     
==========================================
- Hits         5387     5373      -14     
- Misses       3315     3326      +11     
  Partials      604      604

Flag	Coverage Δ
combined	`57.75% <100.00%> (-0.14%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

julienrbrt

ACK. We need to fix the EVM rollback command in a follow up.
I'll log an issue.

* main: fix: Fail fast when executor ahead (#2966) feat(block): async epoch fetching (#2952) perf: tune badger defaults and add db bench (#2950)

Fail fast when executor ahead

d2c45ab

github-actions bot assigned alpe Jan 12, 2026

Remove verbose error log

a5bee2e

gemini-code-assist bot reviewed Jan 12, 2026

View reviewed changes

alpe changed the title ~~Fail fast when executor ahead~~ fix: Fail fast when executor ahead Jan 12, 2026

tac0turtle approved these changes Jan 12, 2026

View reviewed changes

alpe added this pull request to the merge queue Jan 12, 2026

alpe removed this pull request from the merge queue due to a manual request Jan 12, 2026

randygrok approved these changes Jan 12, 2026

View reviewed changes

julienrbrt approved these changes Jan 12, 2026

View reviewed changes

julienrbrt mentioned this pull request Jan 12, 2026

Improve EVM rollback #2967

Open

alpe added this pull request to the merge queue Jan 12, 2026

Merged via the queue into main with commit 6d132bf Jan 12, 2026
30 of 32 checks passed

alpe deleted the alex/fail_state_ahead branch January 12, 2026 14:51

alpe added a commit that referenced this pull request Jan 12, 2026

Merge branch 'main' into alex/adr023_ha-failover_raft8

d7fbf79

* main: fix: Fail fast when executor ahead (#2966) feat(block): async epoch fetching (#2952) perf: tune badger defaults and add db bench (#2950)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Fail fast when executor ahead #2966

fix: Fail fast when executor ahead #2966

alpe commented Jan 12, 2026

Uh oh!

github-actions bot commented Jan 12, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 12, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Jan 12, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

codecov bot commented Jan 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

julienrbrt left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

fix: Fail fast when executor ahead #2966

fix: Fail fast when executor ahead #2966

Conversation

alpe commented Jan 12, 2026

Uh oh!

github-actions bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comprehensive Code Review

Summary

1. Code Quality & Architecture ✅

2. Error Handling & Resilience ⚠️

3. Security Considerations ✅

4. Performance & Resource Efficiency ✅

5. Testing & Reliability ✅

6. Documentation & Developer Experience ✅

Impact Analysis

Additional Observations

Summary of Recommendations

Conclusion

Uh oh!

gemini-code-assist bot commented Jan 12, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

block/internal/common/replay.go (77-78)

Uh oh!

codecov bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

julienrbrt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

github-actions bot commented Jan 12, 2026 •

edited

Loading

claude bot commented Jan 12, 2026 •

edited

Loading

codecov bot commented Jan 12, 2026 •

edited

Loading