fix Bookkeeper can't start up caused by 'IOException: Recovery log xxx is missing' #4740

Open
void-ptr974 wants to merge 4 commits into apache:master from void-ptr974:fix_journal_missing

Conversation

@void-ptr974

@void-ptr974 void-ptr974 commented Apr 7, 2026

Fix #4105

  Problem

  When singleLedgerDirs=true, concurrent checkpointComplete calls can cause lastMark to regress backwards, pointing to a journal file that has already been garbage-collected. If the
  bookie crashes before the next checkpoint corrects lastMark, restart fails with "Recovery log is missing".

  Root cause

  SyncThread.checkpoint() captures a checkpoint before heavy flush I/O (which holds flushMutex for seconds), then calls checkpointComplete with that now-stale mark after I/O completes.
   Meanwhile, triggerFlushAndAddEntry may invoke SingleDirectoryDbLedgerStorage.flush() on a separate executor thread, which captures a newer checkpoint during or after the I/O.

  After SyncThread releases flushMutex, both threads race to call checkpointComplete:

  ┌──────┬──────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────┐
  │ Time │                               SyncThread executor                                │                       SingleDirectoryDbLedgerStorage executor                        │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T0   │ checkpoint = newCheckpoint() → mark(5)                                           │                                                                                      │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T1   │ SDLS.checkpoint(): acquires flushMutex, starts I/O                               │                                                                                      │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T2   │ (I/O in progress, journal advances to file 7)                                    │ writeCache full → SDLS.flush(): cp = newCheckpoint() → mark(7), blocks on flushMutex │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T3   │ releases flushMutex                                                              │ acquires flushMutex, writeCache empty, returns immediately                           │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T4   │                                                                                  │ checkpointComplete(7, true) → lastMark=7, deletes journal 5,6                        │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T5   │ checkpointComplete(5, true) → rollLog overwrites lastMark from 7 to 5            │                                                                                      │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T6   │ lastMark=5 but journal 5 no longer exists. Crash in this window → unrecoverable. │                                                                                      │
  └──────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────┘

  Trigger conditions

  - singleLedgerDirs=true (journal and ledger on the same disk)
  - writeCache fills during SyncThread.checkpoint() I/O, triggering triggerFlushAndAddEntry on a separate thread
  - Journal file rotates during the I/O window (sufficient write throughput)
  - The SingleDirectoryDbLedgerStorage executor thread reaches checkpointComplete before SyncThread
  - Bookie crashes (OOM, kill, hardware failure) before the next checkpoint advances lastMark

  Fix

  Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far (lastPersistedMark), and skip any call whose mark is not newer. This prevents rollLog from
  overwriting lastMark backwards regardless of call ordering.
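
The guard can be sketched as a small standalone class. `CheckpointGuard` and the bare `(fileId, offset)` pair below are illustrative stand-ins for the Journal's LogMark state, not BookKeeper's actual API:

```java
// Illustrative sketch of the monotonic guard; CheckpointGuard and the
// (fileId, offset) pair stand in for BookKeeper's Journal/LogMark types.
public class CheckpointGuard {
    private long lastFileId = -1;
    private long lastOffset = -1;
    private final Object checkpointLock = new Object();

    /** Advances the persisted mark only if (fileId, offset) is strictly newer. */
    public boolean tryAdvance(long fileId, long offset) {
        synchronized (checkpointLock) {
            boolean stale = fileId < lastFileId
                    || (fileId == lastFileId && offset <= lastOffset);
            if (stale) {
                return false; // skip: never move lastMark backwards
            }
            lastFileId = fileId;
            lastOffset = offset;
            return true;
        }
    }

    public static void main(String[] args) {
        CheckpointGuard g = new CheckpointGuard();
        if (!g.tryAdvance(5, 100)) throw new AssertionError();
        if (!g.tryAdvance(7, 200)) throw new AssertionError(); // newer mark wins
        if (g.tryAdvance(5, 100)) throw new AssertionError();  // stale mark skipped
        System.out.println("lastMark stays at file 7");
    }
}
```

With this check in place, the T5 call in the table above (`checkpointComplete(5, true)`) returns without touching the lastMark file.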

When the journal and DbLedgerStorage are on the same disk, concurrent checkpointComplete calls from SyncThread and SingleDirectoryDbLedgerStorage.flush() can overwrite lastMark backwards, causing journal files to be deleted while still referenced by the older mark.
@void-ptr974 void-ptr974 changed the title fix Recovery log xxx is missing fix Bookkeeper can't startup cause by 'IOException: Recovery log xxx is missing' Apr 7, 2026
@StevenLuMT StevenLuMT requested a review from Copilot April 8, 2026 01:32
@StevenLuMT
Copy link
Copy Markdown
Member

Good job. Could you describe the triggering scenario of the problem more clearly?


Copilot AI left a comment


Pull request overview

This PR addresses BookKeeper startup failures where a referenced recovery journal file has been garbage-collected due to a race between concurrent checkpointComplete() calls (e.g., SyncThread vs SingleDirectoryDbLedgerStorage.flush()), causing lastMark to move backwards and point at a deleted journal.

Changes:

  • Adds a monotonic-advance guard to Journal.checkpointComplete() intended to prevent lastMark regression under concurrent checkpoints.
  • Adds a regression test attempting to reproduce the “recovery log is missing” scenario and validate the fix.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java Introduces a lock + persisted-mark tracking intended to keep lastMark monotonic across concurrent checkpoints.
bookkeeper-server/src/test/java/org/apache/bookkeeper/bookie/storage/ldb/DbLedgerStorageTest.java Adds a test scenario for the journal-missing failure mode and expected monotonic behavior.
Comments suppressed due to low confidence (1)

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java:804

  • The monotonic guard is not sufficient to prevent lastMark regression: if an older checkpoint enters this method first (passes the check and updates lastPersistedMark) but reaches rollLog() after a newer checkpoint has already persisted a later mark, this call will still execute rollLog() and can overwrite the lastMark file backwards. Consider serializing the whole rollLog + journal GC block under the same lock, or re-checking the mark under the lock immediately before persisting/deleting so stale checkpoints cannot write.
        // Monotonic check: only advance lastMark forward, never backwards.
        synchronized (checkpointLock) {
            if (mark.getCurMark().compare(lastPersistedMark) <= 0) {
                return;
            }
            lastPersistedMark.setLogMark(
                    mark.getCurMark().getLogFileId(), mark.getCurMark().getLogFileOffset());
        }

        mark.rollLog(mark);
        if (compact) {
            // list the journals that have been marked
            List<Long> logs = listJournalIds(journalDirectory, new JournalRollingFilter(mark));
            // keep MAX_BACKUP_JOURNALS journal files before marked journal
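
The review comment above suggests keeping the mark check, the lastMark write, and the journal GC inside one critical section. A runnable simulation of that serialized variant, where `JournalSim`, its in-memory journal set, and the `lastMark` field are stand-ins for the real Journal state, not BookKeeper's API:

```java
import java.util.TreeSet;

// Simulation of the serialized variant: check, persist, and GC all happen
// under checkpointLock, so an older checkpoint can never overwrite a newer
// persisted mark. JournalSim and its fields are stand-ins, not Journal's API.
public class JournalSim {
    private long lastMark = -1;                       // simulated lastMark file
    private final TreeSet<Long> journals = new TreeSet<>();
    private final Object checkpointLock = new Object();

    public void addJournal(long id) { journals.add(id); }

    public void checkpointComplete(long mark, boolean compact) {
        synchronized (checkpointLock) {
            if (mark <= lastMark) {
                return;                               // stale: newer mark already persisted
            }
            lastMark = mark;                          // rollLog equivalent
            if (compact) {
                journals.headSet(mark).clear();       // journal GC under the same lock
            }
        }
    }

    public long lastMark() { return lastMark; }
    public boolean hasJournal(long id) { return journals.contains(id); }

    public static void main(String[] args) {
        JournalSim j = new JournalSim();
        j.addJournal(5); j.addJournal(6); j.addJournal(7);
        j.checkpointComplete(7, true);   // newer checkpoint persists mark 7, GCs 5 and 6
        j.checkpointComplete(5, false);  // stale checkpoint: no-op under the lock
        if (j.lastMark() != 7) throw new AssertionError();
        if (!j.hasJournal(7)) throw new AssertionError();
        System.out.println("lastMark=7, journal 7 still present");
    }
}
```

Because the stale call cannot interleave between the newer call's check and its persist, the window described in the review comment is closed.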


@void-ptr974
Author

@StevenLuMT Thanks for the feedback. I've updated the PR description with a clearer explanation of the triggering scenario.

When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.

  Root cause

  SyncThread.flush() has a nested call pattern that causes lastMark regression:

  SyncThread.flush():
    1. outerCheckpoint = newCheckpoint()        → mark = (file 5, offset 100)
    2. ledgerStorage.flush()
       → SingleDirectoryDbLedgerStorage.flush():
         a. innerCheckpoint = newCheckpoint()   → mark = (file 7, offset 200), journal has advanced
         b. flushes data to disk
         c. checkpointComplete(mark=7, compact=true)
            → persists lastMark=7, deletes journal files with id < 7 (including file 5)
    3. checkpointComplete(mark=5, compact=false)
       → persists lastMark=5, overwriting the 7 written in step 2c
    4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash

  Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
  position whose journal file was already deleted.

  Conditions

  - singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
  - Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
  - maxBackupJournals small enough for old files to actually be deleted

  Fix

  Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
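
The four numbered steps above can be replayed against a toy model to show why the guard matters; `NestedFlushReplay` and the bare file-id marks are illustrative only, not BookKeeper code:

```java
// Replays steps 1-4 above against a toy lastMark, with and without the
// monotonic guard; all names here are illustrative, not BookKeeper's API.
public class NestedFlushReplay {
    public static long replay(boolean guarded) {
        long lastMark = -1;
        long highest = -1;
        long outer = 5;  // step 1: outer checkpoint captured at journal file 5
        long inner = 7;  // step 2a: inner checkpoint after journal advanced to file 7
        // step 2c: inner checkpointComplete persists 7 and GCs older journals
        if (!guarded || inner > highest) { lastMark = inner; highest = Math.max(highest, inner); }
        // step 3: outer checkpointComplete runs with the stale mark 5
        if (!guarded || outer > highest) { lastMark = outer; highest = Math.max(highest, outer); }
        // step 4: on restart the bookie reads whatever lastMark holds
        return lastMark;
    }

    public static void main(String[] args) {
        if (replay(false) != 5) throw new AssertionError(); // unguarded: regresses to deleted file 5
        if (replay(true) != 7) throw new AssertionError();  // guarded: stays at file 7
    }
}
```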
@StevenLuMT
Member

Good job, I have rerun the CI.



Development

Successfully merging this pull request may close these issues.

Recovery log is missing

3 participants