fix Bookkeeper can't start up caused by 'IOException: Recovery log xxx is missing' #4740

Open
void-ptr974 wants to merge 4 commits into apache:master from void-ptr974:fix_journal_missing

Conversation

@void-ptr974

@void-ptr974 void-ptr974 commented Apr 7, 2026

Fix #4105

  Problem

  When singleLedgerDirs=true, concurrent checkpointComplete calls can cause lastMark to regress backwards, pointing to a journal file that has already been garbage-collected. If the
  bookie crashes before the next checkpoint corrects lastMark, restart fails with "Recovery log is missing".

  Root cause

  SyncThread.checkpoint() captures a checkpoint before heavy flush I/O (which holds flushMutex for seconds), then calls checkpointComplete with that now-stale mark after I/O completes.
   Meanwhile, triggerFlushAndAddEntry may invoke SingleDirectoryDbLedgerStorage.flush() on a separate executor thread, which captures a newer checkpoint during or after the I/O.

  After SyncThread releases flushMutex, both threads race to call checkpointComplete:

  ┌──────┬──────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────┐
  │ Time │                               SyncThread executor                                │                       SingleDirectoryDbLedgerStorage executor                        │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T0   │ checkpoint = newCheckpoint() → mark(5)                                           │                                                                                      │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T1   │ SDLS.checkpoint(): acquires flushMutex, starts I/O                               │                                                                                      │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T2   │ (I/O in progress, journal advances to file 7)                                    │ writeCache full → SDLS.flush(): cp = newCheckpoint() → mark(7), blocks on flushMutex │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T3   │ releases flushMutex                                                              │ acquires flushMutex, writeCache empty, returns immediately                           │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T4   │                                                                                  │ checkpointComplete(7, true) → lastMark=7, deletes journal 5,6                        │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T5   │ checkpointComplete(5, true) → rollLog overwrites lastMark from 7 to 5            │                                                                                      │
  ├──────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ T6   │ lastMark=5 but journal 5 no longer exists. Crash in this window → unrecoverable. │                                                                                      │
  └──────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────┘

  Trigger conditions

  - singleLedgerDirs=true (journal and ledger on the same disk)
  - writeCache fills during SyncThread.checkpoint() I/O, triggering triggerFlushAndAddEntry on a separate thread
  - Journal file rotates during the I/O window (sufficient write throughput)
  - The SingleDirectoryDbLedgerStorage executor thread reaches checkpointComplete before SyncThread
  - Bookie crashes (OOM, kill, hardware failure) before the next checkpoint advances lastMark

  Fix

  Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far (lastPersistedMark), and skip any call whose mark is not newer. This prevents rollLog from
  overwriting lastMark backwards regardless of call ordering.
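
The guard can be sketched as a small standalone class. `CheckpointGuard` and the bare `(fileId, offset)` pair below are illustrative stand-ins for the Journal's LogMark state, not BookKeeper's actual API:

```java
// Illustrative sketch of the monotonic guard; CheckpointGuard and the
// (fileId, offset) pair stand in for BookKeeper's Journal/LogMark types.
public class CheckpointGuard {
    private long lastFileId = -1;
    private long lastOffset = -1;
    private final Object checkpointLock = new Object();

    /** Advances the persisted mark only if (fileId, offset) is strictly newer. */
    public boolean tryAdvance(long fileId, long offset) {
        synchronized (checkpointLock) {
            boolean stale = fileId < lastFileId
                    || (fileId == lastFileId && offset <= lastOffset);
            if (stale) {
                return false; // skip: never move lastMark backwards
            }
            lastFileId = fileId;
            lastOffset = offset;
            return true;
        }
    }

    public static void main(String[] args) {
        CheckpointGuard g = new CheckpointGuard();
        if (!g.tryAdvance(5, 100)) throw new AssertionError();
        if (!g.tryAdvance(7, 200)) throw new AssertionError(); // newer mark wins
        if (g.tryAdvance(5, 100)) throw new AssertionError();  // stale mark skipped
        System.out.println("lastMark stays at file 7");
    }
}
```

With this check in place, the T5 call in the table above (`checkpointComplete(5, true)`) returns without touching the lastMark file.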

When the journal and DbLedgerStorage are on the same disk, concurrent checkpointComplete calls from SyncThread and SingleDirectoryDbLedgerStorage.flush() can overwrite lastMark backwards, causing journal files to be deleted while still referenced by the older mark.
@void-ptr974 void-ptr974 changed the title fix Recovery log xxx is missing fix Bookkeeper can't startup cause by 'IOException: Recovery log xxx is missing' Apr 7, 2026
@StevenLuMT StevenLuMT requested a review from Copilot April 8, 2026 01:32
@StevenLuMT
Copy link
Copy Markdown
Member

Good job. Could you describe the triggering scenario of the problem more clearly?


Copilot AI left a comment


Pull request overview

This PR addresses BookKeeper startup failures where a referenced recovery journal file has been garbage-collected due to a race between concurrent checkpointComplete() calls (e.g., SyncThread vs SingleDirectoryDbLedgerStorage.flush()), causing lastMark to move backwards and point at a deleted journal.

Changes:

  • Adds a monotonic-advance guard to Journal.checkpointComplete() intended to prevent lastMark regression under concurrent checkpoints.
  • Adds a regression test attempting to reproduce the “recovery log is missing” scenario and validate the fix.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java Introduces a lock + persisted-mark tracking intended to keep lastMark monotonic across concurrent checkpoints.
bookkeeper-server/src/test/java/org/apache/bookkeeper/bookie/storage/ldb/DbLedgerStorageTest.java Adds a test scenario for the journal-missing failure mode and expected monotonic behavior.
Comments suppressed due to low confidence (1)

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java:804

  • The monotonic guard is not sufficient to prevent lastMark regression: if an older checkpoint enters this method first (passes the check and updates lastPersistedMark) but reaches rollLog() after a newer checkpoint has already persisted a later mark, this call will still execute rollLog() and can overwrite the lastMark file backwards. Consider serializing the whole rollLog + journal GC block under the same lock, or re-checking the mark under the lock immediately before persisting/deleting so stale checkpoints cannot write.
        // Monotonic check: only advance lastMark forward, never backwards.
        synchronized (checkpointLock) {
            if (mark.getCurMark().compare(lastPersistedMark) <= 0) {
                return;
            }
            lastPersistedMark.setLogMark(
                    mark.getCurMark().getLogFileId(), mark.getCurMark().getLogFileOffset());
        }

        mark.rollLog(mark);
        if (compact) {
            // list the journals that have been marked
            List<Long> logs = listJournalIds(journalDirectory, new JournalRollingFilter(mark));
            // keep MAX_BACKUP_JOURNALS journal files before marked journal
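
The review comment above suggests keeping the mark check, the lastMark write, and the journal GC inside one critical section. A runnable simulation of that serialized variant, where `JournalSim`, its in-memory journal set, and the `lastMark` field are stand-ins for the real Journal state, not BookKeeper's API:

```java
import java.util.TreeSet;

// Simulation of the serialized variant: check, persist, and GC all happen
// under checkpointLock, so an older checkpoint can never overwrite a newer
// persisted mark. JournalSim and its fields are stand-ins, not Journal's API.
public class JournalSim {
    private long lastMark = -1;                       // simulated lastMark file
    private final TreeSet<Long> journals = new TreeSet<>();
    private final Object checkpointLock = new Object();

    public void addJournal(long id) { journals.add(id); }

    public void checkpointComplete(long mark, boolean compact) {
        synchronized (checkpointLock) {
            if (mark <= lastMark) {
                return;                               // stale: newer mark already persisted
            }
            lastMark = mark;                          // rollLog equivalent
            if (compact) {
                journals.headSet(mark).clear();       // journal GC under the same lock
            }
        }
    }

    public long lastMark() { return lastMark; }
    public boolean hasJournal(long id) { return journals.contains(id); }

    public static void main(String[] args) {
        JournalSim j = new JournalSim();
        j.addJournal(5); j.addJournal(6); j.addJournal(7);
        j.checkpointComplete(7, true);   // newer checkpoint persists mark 7, GCs 5 and 6
        j.checkpointComplete(5, false);  // stale checkpoint: no-op under the lock
        if (j.lastMark() != 7) throw new AssertionError();
        if (!j.hasJournal(7)) throw new AssertionError();
        System.out.println("lastMark=7, journal 7 still present");
    }
}
```

Because the stale call cannot interleave between the newer call's check and its persist, the window described in the review comment is closed.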


@void-ptr974
Author

@StevenLuMT Thanks for the feedback. I've updated the PR description with a clearer explanation of the triggering scenario.

When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.

  Root cause

  SyncThread.flush() has a nested call pattern that causes lastMark regression:

  SyncThread.flush():
    1. outerCheckpoint = newCheckpoint()        → mark = (file 5, offset 100)
    2. ledgerStorage.flush()
       → SingleDirectoryDbLedgerStorage.flush():
         a. innerCheckpoint = newCheckpoint()   → mark = (file 7, offset 200), journal has advanced
         b. flushes data to disk
         c. checkpointComplete(mark=7, compact=true)
            → persists lastMark=7, deletes journal files with id < 7 (including file 5)
    3. checkpointComplete(mark=5, compact=false)
       → persists lastMark=5, overwriting the 7 written in step 2c
    4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash

  Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
  position whose journal file was already deleted.

  Conditions

  - singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
  - Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
  - maxBackupJournals small enough for old files to actually be deleted

  Fix

  Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
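
The four numbered steps above can be replayed against a toy model to show why the guard matters; `NestedFlushReplay` and the bare file-id marks are illustrative only, not BookKeeper code:

```java
// Replays steps 1-4 above against a toy lastMark, with and without the
// monotonic guard; all names here are illustrative, not BookKeeper's API.
public class NestedFlushReplay {
    public static long replay(boolean guarded) {
        long lastMark = -1;
        long highest = -1;
        long outer = 5;  // step 1: outer checkpoint captured at journal file 5
        long inner = 7;  // step 2a: inner checkpoint after journal advanced to file 7
        // step 2c: inner checkpointComplete persists 7 and GCs older journals
        if (!guarded || inner > highest) { lastMark = inner; highest = Math.max(highest, inner); }
        // step 3: outer checkpointComplete runs with the stale mark 5
        if (!guarded || outer > highest) { lastMark = outer; highest = Math.max(highest, outer); }
        // step 4: on restart the bookie reads whatever lastMark holds
        return lastMark;
    }

    public static void main(String[] args) {
        if (replay(false) != 5) throw new AssertionError(); // unguarded: regresses to deleted file 5
        if (replay(true) != 7) throw new AssertionError();  // guarded: stays at file 7
    }
}
```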
@StevenLuMT
Member

Good job, I have rerun the CI.



Development

Successfully merging this pull request may close these issues.

Recovery log is missing

3 participants