Fix BookKeeper can't start up, caused by 'IOException: Recovery log xxx is missing' #4740
void-ptr974 wants to merge 4 commits into apache:master
Conversation
When the journal and DbLedgerStorage are on the same disk, concurrent checkpointComplete() calls from SyncThread and SingleDirectoryDbLedgerStorage.flush() can overwrite lastMark backwards, causing journal files to be deleted while still referenced by the older mark.
Good job.
Pull request overview
This PR addresses BookKeeper startup failures where a referenced recovery journal file has been garbage-collected due to a race between concurrent checkpointComplete() calls (e.g., SyncThread vs SingleDirectoryDbLedgerStorage.flush()), causing lastMark to move backwards and point at a deleted journal.
Changes:
- Adds a monotonic-advance guard to Journal.checkpointComplete() intended to prevent lastMark regression under concurrent checkpoints.
- Adds a regression test attempting to reproduce the "recovery log is missing" scenario and validate the fix.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java | Introduces a lock + persisted-mark tracking intended to keep lastMark monotonic across concurrent checkpoints. |
| bookkeeper-server/src/test/java/org/apache/bookkeeper/bookie/storage/ldb/DbLedgerStorageTest.java | Adds a test scenario for the journal-missing failure mode and expected monotonic behavior. |
Comments suppressed due to low confidence (1)
bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java:804
- The monotonic guard is not sufficient to prevent lastMark regression: if an older checkpoint enters this method first (passes the check and updates lastPersistedMark) but reaches rollLog() after a newer checkpoint has already persisted a later mark, this call will still execute rollLog() and can overwrite the lastMark file backwards. Consider serializing the whole rollLog + journal GC block under the same lock, or re-checking the mark under the lock immediately before persisting/deleting so stale checkpoints cannot write.
```java
// Monotonic check: only advance lastMark forward, never backwards.
synchronized (checkpointLock) {
    if (mark.getCurMark().compare(lastPersistedMark) <= 0) {
        return;
    }
    lastPersistedMark.setLogMark(
        mark.getCurMark().getLogFileId(), mark.getCurMark().getLogFileOffset());
}
mark.rollLog(mark);
if (compact) {
    // list the journals that have been marked
    List<Long> logs = listJournalIds(journalDirectory, new JournalRollingFilter(mark));
    // keep MAX_BACKUP_JOURNALS journal files before marked journal
```
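The review's point is that a check-then-persist split across the lock boundary is still racy: a stale checkpoint can pass the check, lose the CPU, and then persist after a newer mark has been written. A minimal sketch of the suggested alternative, serializing the check and the persist step under one lock (hypothetical names, simplified to a single long mark, not the actual Journal.java code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the review suggestion: do the monotonic check AND the persist
// inside the same critical section, so a stale checkpoint can never reach
// the persist step after a newer mark has already been written.
class GuardedCheckpoint {
    private final Object checkpointLock = new Object();
    private long lastPersistedMark = -1;
    private final AtomicLong persistedToDisk = new AtomicLong(-1);

    void checkpointComplete(long mark) {
        synchronized (checkpointLock) {
            if (mark <= lastPersistedMark) {
                return; // stale or duplicate checkpoint: skip entirely
            }
            lastPersistedMark = mark;
            rollLog(mark); // persist while still holding the lock
        }
    }

    private void rollLog(long mark) {
        persistedToDisk.set(mark); // stands in for writing the lastMark file
    }

    long persisted() {
        return persistedToDisk.get();
    }
}
```

Holding the lock across the persist is the key difference from the PR's snippet above, where rollLog() runs outside the synchronized block.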
Review threads (resolved):
- bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java (1 thread)
- bookkeeper-server/src/test/java/org/apache/bookkeeper/bookie/storage/ldb/DbLedgerStorageTest.java (3 threads, one outdated)
When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.
Root cause
SyncThread.flush() has a nested call pattern that causes lastMark regression:
```
SyncThread.flush():
  1. outerCheckpoint = newCheckpoint() → mark = (file 5, offset 100)
  2. ledgerStorage.flush()
     → SingleDirectoryDbLedgerStorage.flush():
        a. innerCheckpoint = newCheckpoint() → mark = (file 7, offset 200), journal has advanced
        b. flushes data to disk
        c. checkpointComplete(mark=7, compact=true)
           → persists lastMark=7, deletes journal files with id < 7 (including file 5)
  3. checkpointComplete(mark=5, compact=false)
     → persists lastMark=5, overwriting the 7 written in step 2c
  4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash
```
Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a position whose journal file was already deleted.
Conditions
- singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
- Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
- maxBackupJournals small enough for old files to actually be deleted
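Under these conditions, a bk_server.conf fragment along these lines would be a plausible reproduction setup. All values are illustrative, and the property names (journalMaxSizeMB, journalMaxBackups) are assumptions that should be checked against the BookKeeper version in use; the prose above refers to the backup limit as maxBackupJournals.

```properties
# bk_server.conf fragment (illustrative; verify property names for your version)
# journal and ledger directories deliberately on the same disk
journalDirectories=/data/bk/journal
ledgerDirectories=/data/bk/ledgers
# small journal files so rotation happens between the two newCheckpoint() calls
journalMaxSizeMB=16
# keep few backup journals so superseded journal files are actually deleted
journalMaxBackups=1
```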
Fix
Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
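The guard described above can be sketched as a small standalone class that replays the four-step trace: the inner checkpoint (file 7) persists first, and the stale outer checkpoint (file 5) is then skipped instead of regressing lastMark. Names are hypothetical; this is not the PR's actual code.

```java
// Sketch of the monotonic lastMark guard: a checkpoint may only advance the
// persisted mark, never move it backwards (hypothetical class for illustration).
class MonotonicMark {
    private long lastPersistedFileId = -1;
    private long lastPersistedOffset = -1;

    // Returns true if the mark was persisted, false if it was stale and skipped.
    synchronized boolean checkpointComplete(long fileId, long offset) {
        if (fileId < lastPersistedFileId
                || (fileId == lastPersistedFileId && offset <= lastPersistedOffset)) {
            return false; // older (or equal) mark: never overwrite lastMark backwards
        }
        lastPersistedFileId = fileId;
        lastPersistedOffset = offset;
        return true;
    }

    synchronized long fileId() {
        return lastPersistedFileId;
    }
}
```

With this in place, step 3 of the trace becomes a no-op, so lastMark keeps pointing at journal file 7, which still exists after garbage collection.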
@StevenLuMT Thanks for the feedback. I've updated the PR description with a clearer explanation of the triggering scenario.
Good job, I have rerun CI.
Fix #4105