[Issue #1327] pixels retina recovery protocol #1328

Draft
gengdy1545 wants to merge 3 commits into pixelsdb:master from gengdy1545:feature/recovery

Conversation

gengdy1545 commented May 12, 2026

Pixels Retina Recovery Protocol

Summary

This PR introduces a recovery protocol for Pixels Retina that defines how the system restores internal consistency after process crashes, machine restarts, CDC crashes, or simultaneous CDC + Retina crashes. On startup, Retina first cleanses the RG-visibility snapshot in the latest recovery checkpoint against the catalog, producing a self-consistent internal recovery starting point; CDC then replays the source-side events that were lost, starting from the recovery replay timestamp derived from that checkpoint. The system finally converges to the pre-crash state under READY. Queries are fully fail-closed during RECOVERING and become available only after a successful MarkReady barrier.

This document describes the final recovery-capable design — there is no online dual-track or fall-back to the legacy recovery path. A one-shot schema/catalog/bootstrap upgrade is allowed during rollout, but the runtime code after the upgrade must obey the semantics defined here.


Task List

The implementation is organized as a sequence of commit-level tasks. The numbering encodes only dependency and suggested merge order, not priority. No intermediate commit may allow an unfinished recovery-capable path to reach a queryable READY state.

  • C00 — Baseline hardening of the current code

    • C00.1 Make MetadataService.addFiles / updateFile / deleteFiles return true on success and treat false/exception as publish-barrier failure in all callers
    • C00.2 Fix MainIndex flush durability: MainIndexBuffer snapshot → SQLite transaction commit → drop buffer/cache only after commit; row_id_ranges writes idempotent on retry
    • C00.3 Fix PixelsWriteBuffer.addRow RowLocation race (capture currentMemTable at append time)
    • C00.4 Stop FileWriterManager.finish() from publishing REGULAR directly; centralize publish in PixelsWriteBuffer
    • C00.5 Tighten query-visible enumeration from FILE_TYPE <> 0 to FILE_TYPE = REGULAR
    • C00.6 Make initialization and background GC fail-closed (no half-initialized service, no GC scheduler started from constructor)
  • C01 — Metadata schema and enumeration APIs

    • C01.1 Extend file types to TEMPORARY_INGEST / TEMPORARY_GC / REGULAR / RETIRED and add FILE_CLEANUP_AT
    • C01.2 Add recovery file metadata: tableId, virtualNodeId, ingestSeq/firstBlockId, fileMinCommitTs, fileMaxCommitTs, rowIdStart/rowCount
    • C01.3 Add typed/admin APIs: getRegularFiles, listFilesByType, listTemporaryFilesBefore, listRetiredFilesBefore, deleteTemporaryFiles, deleteRetiredFiles
    • C01.4 Migrate planner / cache / Storage GC candidate scan / Retina startup off getFiles(pathId); recovery-capable startup must not auto-fill baseline via getRegularFiles
    • C01.5 Make query-facing enumeration depend on a server-side gated read-only transaction
    • C01.6 Implement atomicSwapFiles(newFileId, oldFileIds, cleanupAt) in a single metadata transaction
    • C01.7 One-shot catalog upgrade mapping legacy TEMPORARY to TEMPORARY_INGEST / TEMPORARY_GC
  • C02 — Ingest publish ordering and write-buffer prefix visibility

    • C02.1 FileWriterManager.finish() only does object-block flush, writer.close(), footer/length/checksum check
    • C02.2 Maintain ingest-file runtime state (physicalClosed / indexFlushed / metadataRegular …)
    • C02.3 Maintain append-order fileMinCommitTs / fileMaxCommitTs at the write boundary; footer hidden-timestamp stats are audit-only
    • C02.4 Add AppendSegmentState.visibleSize, batch-level append handle, and publishPendingAppend(handle, visible|hidden)
    • C02.5 Add per-stream file publisher ordered by firstBlockId; out-of-order physical close / index flush, but in-order TEMPORARY_INGEST → REGULAR publish
    • C02.6 Centralize publish in PixelsWriteBuffer: physical close → LocalIndexService.flushIndexEntriesOfFile → atomic metadata update (file type + rowId range + commit-ts range) → object cleanup
    • C02.7 Unify scheduler flush and PixelsWriteBuffer.close() paths through the same publisher
  • C03 — LocalIndexService staging and Visibility replay semantics

    • C03.1 Add staged primary/MainIndex APIs: resolvePrimary, putMainIndexEntries, putPrimaryEntriesOnly, tombstonePrimaryResolved, updatePrimaryResolved, restorePrimaryEntries, deleteMainIndexRange
    • C03.2 resolvePrimary returns tri-state FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR against the current baseline visible file set
    • C03.3 DELETE order: resolvePrimary → Visibility delete → tombstonePrimaryResolved
    • C03.4 INSERT order: append pending → putMainIndexEntries → putPrimaryEntriesOnly → publishPendingAppend(visible); failure path compensates via Visibility delete + primary tombstone + publish-hidden or fail-closed
    • C03.5 CDC replace must be DELETE + INSERT, not legacy UpdateData
    • C03.6 Single-source commit timestamp: validate IndexKey.timestamp == TableUpdateData.timestamp and use it everywhere
    • C03.7 In RGVisibility / JNI / native TileVisibility, fork DELETE by T <= baseTimestamp (COW fold into baseBitmap) vs T > baseTimestamp (append deletion chain)
    • C03.8 Explicitly exclude secondary index from the recovery-correctness scope of this stage
  • C04 — Recovery checkpoint and startup cleansing

    • C04.1 Define checkpoint body binary format with checksum: header (retinaNodeId, checkpointId, writerEpoch, writeTime, checkpointAppliedTs, topologyHash) + fileEntries[] / ingestSegmentEntries[] / rgEntries[]
    • C04.2 Per-node etcd current / previous slots under /pixels/retina/recovery/checkpoint/${encodedNodeId}/; publish order: write body + fsync → atomic etcd swap → async cleanup of replaced body
    • C04.3 Checkpoint generator reads TransService HWM, sets checkpointAppliedTs = HWM - 1, then dumps RGVisibilityIndex + checkpoint-admitted REGULAR catalog metadata + ingest segment snapshot
    • C04.4 Startup loads body from current/previous and validates magic / schema / checksum / retinaNodeId / topologyHash / checkpointAppliedTs
    • C04.5 Cleanse FileEntry / VisibilityEntry / IngestSegmentEntry against the catalog; rebuild RGVisibilityIndex and MainIndex baseline; produce baseline visible file set
    • C04.6 Compute scopeReplayFromTs / vnodeReplayFromTs / nodeReplayFromTs per §3.4; degrade to MIN_REPLAY_TS when segment chain is untrustworthy
    • C04.7 Mark catalog-only ingest REGULAR files (not covered by checkpoint, not protected by Storage GC journal) as RETIRED and enqueue cleanup; FAILED if no checkpoint but catalog has REGULAR
  • C05 — Lifecycle, query gate and startup gating

    • C05.1 Introduce RetinaLifecycleState and lifecycle coordinator publishing RECOVERING / READY / FAILED to a leased /pixels/retina/lifecycle/<host:retinaPort> key
    • C05.2 Replace RetinaServerImpl constructor logic with the recovery coordinator; remove metadata full-preload from the recovery-capable path
    • C05.3 Move the periodic GC scheduler out of RetinaResourceManager constructor into the lifecycle READY hook
    • C05.4 Gate QueryVisibility / GetWriteBuffer / RegisterOffload / UpdateRecord / StreamUpdateRecord / AddVisibility / AddWriteBuffer on lifecycle state
    • C05.5 Add a server-side QueryAvailabilityGate to TransServiceImpl.beginTrans / beginTransBatch for readOnly=true
    • C05.6 Make planner / cache / Trino connector / Turbo planner enumeration depend on a successful gated read-only transaction
    • C05.7 Land a fail-closed markReady(...) skeleton; production paths must never observe READY until C06
  • C06 — CDC recovery replay and MarkReady barrier

    • C06.1 Extend proto/retina.proto with GetRetinaStatus / GetRecoveryReplayTs / MarkReady, state enum, recoveryAttemptToken / recoveryEpoch / checkpointId / replayTsReady / vnodeReplayFromTs / nodeReplayFromTs
    • C06.2 Add optional RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode) to UpdateRecordRequest
    • C06.3 In the replay sub-phase of RECOVERING, accept only DELETE / CDC-replace requests with matching context and timestamp >= replayFromTs; reject standalone INSERT
    • C06.4 Multi-table / multi-stream requests are committed with request-level all-or-nothing ack
    • C06.5 Track in-flight recovery replay requests; close the replay write entry when MarkReady barrier starts
    • C06.6 Implement lifecycle coordinator markReady(recoveryAttemptToken, checkpointId): validate → close entry → drain in-flight → switch to READY → start GC scheduler → invalidate split/cache → unblock queries
    • C06.7 Empty-DB / no-replay-needed startup also goes through the same self-MarkReady path; no shortcut allowed
  • C07 — Storage GC recovery

    • C07.1 Switch GC temporary file type to TEMPORARY_GC; candidate scan via getRegularFiles; rewrite cutoff = TransService.getSafeGcTimestamp()
    • C07.2 Rewrite filtering on delete_ts <= safeGcTs; new file Visibility initialized with baseTimestamp = safeGcTs and chain items for delete_ts > safeGcTs
    • C07.3 Move Storage GC's rowId allocation, MainIndex write/flush, primary switch, rollback and range deletion onto LocalIndexService
    • C07.4 Write-ahead rollback journal before primary switch: persist rollback entry first, then switch, then mark updated
    • C07.5 After atomicSwapFiles, journal becomes SWAPPED_NOT_CHECKPOINTED; promote to CHECKPOINTED only after a durable recovery checkpoint baseline accepts the new file
    • C07.6 Startup processes the journal: INDEX_SWITCHING rolls back, SWAPPED_NOT_CHECKPOINTED either promotes or rolls back depending on checkpoint baseline acceptance, missing journal with primary pointing to a non-baseline newRowId is FAILED
  • C08 — Background cleanup, testing and operations

    • C08.1 Background sweep listTemporaryFilesBefore for long-hanging temporary files
    • C08.2 Background sweep listRetiredFilesBefore(now, limit) cleaning old Visibility → old MainIndex range → physical file → retired catalog, in that order
    • C08.3 Metrics, orphan-index counters and alerts; only emit hashes / short prefixes of recoveryAttemptToken
    • C08.4 Runbook: no-checkpoint, broken checkpoint, bootstrap, CDC lag, Storage GC journal mismatch, query-gate unavailable
    • C08.5 Crash-injection and protocol tests covering ingest, checkpoint, cleansing, replay, MarkReady, Storage GC, query gate, background cleanup

Detailed Design

1. Goals and boundaries

The recovery protocol covers four crash scenarios:

  • Retina-only crash;
  • CDC-only crash;
  • simultaneous CDC + Retina crash;
  • a second crash during recovery.

After Retina restarts, it loads the latest valid recovery checkpoint, cleanses its RG-visibility snapshot against the catalog, and rebuilds an internal recovery starting point. CDC then replays source events starting from the replay timestamp derived from that checkpoint (inclusive). Once CDC has signaled completion via MarkReady, Retina runs an internal barrier that closes the replay write entry, drains in-flight replay requests, and switches to READY, the only edge where queries become visible.

Recovery is bound to a fixed Retina topology. The expected node set comes from the static configuration $PIXELS_HOME/etc/retina, and each checkpoint records a topologyHash. Within one recovery attempt, the expected Retina set, retina.server.port, node.virtual.num and the vnode-to-Retina mapping must stay constant; otherwise the attempt fails closed or restarts.

READY only guarantees per-effect visibility (a single INSERT or single DELETE). CDC replace's DELETE + INSERT pair does not provide combined atomic scan visibility; the short live intermediate state (DELETE applied, INSERT not yet, or vice versa) is normal READY freshness behavior, not a recovery-correctness violation.

2. Lifecycle and file states

Retina has three external lifecycle states:

State        CDC writes                                                      Queries
RECOVERING   rejected before cleansing completes; only recovery-replay       rejected
             writes afterwards
READY        accepted                                                        accepted
FAILED       rejected                                                        rejected

RECOVERING is the only externally visible recovery state. Internally it is split into checkpoint cleansing, waiting for CDC recovery replay, and the MarkReady barrier. Even after cleansing completes and the replay timestamp is computed, queries remain rejected; otherwise the same read timestamp could observe different snapshots as replay progresses.

File states in the catalog are extended to four:

  • REGULAR — published, query-visible data file;
  • TEMPORARY_INGEST — pre-allocated or in-progress ingest file;
  • TEMPORARY_GC — Storage GC rewrite file, governed by the GC journal;
  • RETIRED — old REGULAR retired by GC swap or recovery cleansing, with FILE_CLEANUP_AT driving delayed cleanup.

Query-visible enumeration is done only through getRegularFiles(pathId); non-REGULAR files are reachable only through typed/admin APIs (e.g. listFilesByType, listRetiredFilesBefore).

3. Recovery checkpoint

A recovery checkpoint is not a query baseline — it is the only source of truth Retina uses to bootstrap its internal recovery state. The contract is:

Provide a starting point such that "cleansing + CDC replay" can converge to a queryable READY.

Each checkpoint body captures, for one Retina node:

  • the RGVisibilityIndex snapshot ((fileId, rgId) -> bitmap);
  • checkpointAppliedTs = HWM - 1 (the visibility-applied cut for DELETE / UPDATE-old-row);
  • a per-(tableId, virtualNodeId) ingest segment chain (REGULAR-admitted, pending, open);
  • topologyHash, retinaNodeId, writerEpoch / leaseId, checksum, length.

Commit protocol uses an immutable body object + per-node etcd two-slot pointer: write body + fsync, then atomically swap current/previous in a single etcd transaction. Recovery only reads its own node's current then previous; it never lists storage directly.
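The two-slot commit can be sketched as follows. The key layout under /pixels/retina/recovery/checkpoint/ comes from C04.2, but the in-memory map standing in for etcd and all method names are illustrative assumptions, not the actual client code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the per-node two-slot checkpoint pointer (C04.2). The in-memory
// map stands in for etcd; in the real protocol the current/previous swap is
// a single etcd transaction executed after the body is written and fsynced.
public class CheckpointSlots
{
    private final Map<String, String> kv = new ConcurrentHashMap<>();
    private final String prefix;

    public CheckpointSlots(String encodedNodeId)
    {
        this.prefix = "/pixels/retina/recovery/checkpoint/" + encodedNodeId + "/";
    }

    /**
     * Publish a checkpoint body that is already durable. Returns the body
     * path displaced out of the previous slot, for asynchronous cleanup.
     */
    public synchronized String publish(String newBodyPath)
    {
        String displaced = kv.get(prefix + "previous");
        String current = kv.get(prefix + "current");
        if (current != null)
        {
            kv.put(prefix + "previous", current); // old current becomes previous
        }
        kv.put(prefix + "current", newBodyPath);
        return displaced; // cleaned up asynchronously after the swap
    }

    public String current() { return kv.get(prefix + "current"); }
    public String previous() { return kv.get(prefix + "previous"); }
}
```

Because the body object is immutable and the pointer swap is atomic, a crash between body write and swap leaves only an unreferenced body object, which matches the "unswitched body objects are ignored" failure case in §8.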

checkpointAppliedTs requires a co-design with the write path: a write transaction can only enter committed state (and be covered by HWM) after Retina has synchronously completed its visibility / index / write-buffer effects and acked CDC. This invariant is what makes HWM - 1 safe to use directly.

Recovery replay timestamps are derived, not stored:

RecoveryScope = (retinaNodeId, tableId, virtualNodeId)

scopeReplayFromTs = min(checkpointAppliedTs, earliestUnsafeInsertTs)
vnodeReplayFromTs[v] = min(scopeReplayFromTs over tables on this Retina)
nodeReplayFromTs    = min(vnodeReplayFromTs)

earliestUnsafeInsertTs is the minCommitTs of the first non-empty pending/open ingest segment after the last checkpoint-admitted REGULAR segment in that scope. All-DELETE scopes produce +INF here, so the replay starts from checkpointAppliedTs and never falls back to 0. Untrustworthy segment chains degrade to a conservative MIN_REPLAY_TS = 0 — only ever increasing replay, never reducing it.

CDC consumes either VNODE mode (vnodeReplayFromTs[v]) or NODE mode (nodeReplayFromTs). Both are inclusive.
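The derivation above can be sketched as three pure min-folds; class and constant names are assumptions, and NO_UNSAFE_INSERT models the +INF produced by all-DELETE scopes:

```java
import java.util.Collection;

// Sketch of the derived replay timestamps (§3.4). Names are assumptions.
public final class ReplayTs
{
    public static final long MIN_REPLAY_TS = 0L;                 // conservative degradation
    public static final long NO_UNSAFE_INSERT = Long.MAX_VALUE;  // "+INF": no unsafe inserts

    /** Per (retinaNodeId, tableId, virtualNodeId) scope. */
    public static long scopeReplayFromTs(long checkpointAppliedTs, long earliestUnsafeInsertTs)
    {
        return Math.min(checkpointAppliedTs, earliestUnsafeInsertTs);
    }

    /** VNODE mode: min over the scopes of the tables on this vnode. */
    public static long vnodeReplayFromTs(Collection<Long> scopeTs)
    {
        return scopeTs.stream().mapToLong(Long::longValue).min().orElse(MIN_REPLAY_TS);
    }

    /** NODE mode: min over all vnode replay timestamps of this node. */
    public static long nodeReplayFromTs(Collection<Long> vnodeTs)
    {
        return vnodeTs.stream().mapToLong(Long::longValue).min().orElse(MIN_REPLAY_TS);
    }
}
```

An all-DELETE scope passes NO_UNSAFE_INSERT, so its replay starts at checkpointAppliedTs instead of falling back to 0.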

4. Cleansing rules

For every entry in the loaded checkpoint body:

Condition                                                          Action
fileId not in catalog                                              discard entry, WARN
fileId exists but FILE_TYPE != REGULAR                             discard entry
fileId is REGULAR but recordNum / rowIdStart / rowCount mismatch   discard entry, WARN
fileId is REGULAR and matches                                      keep entry

Catalog files that are REGULAR but have no entry in the checkpoint body and are not protected by Storage GC journal must be atomically marked RETIRED (FILE_CLEANUP_AT = now) before READY; their source events are redone by CDC replay. This avoids long-lived catalog-only REGULAR orphans being picked up by getRegularFiles.

Discarding individual entries does not invalidate the checkpoint. Only header / checksum / format errors invalidate the entire body and force a fallback to previous.
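The per-entry rule table can be sketched as a pure decision function; the catalog record shape and its field names here are assumptions:

```java
// Sketch of the per-entry cleansing rule (§4) as a pure decision function.
public final class CleansingRule
{
    public enum Action { KEEP, DISCARD, DISCARD_WARN }

    public static final class CatalogFile
    {
        final String fileType;
        final long recordNum, rowIdStart, rowCount;

        public CatalogFile(String fileType, long recordNum, long rowIdStart, long rowCount)
        {
            this.fileType = fileType;
            this.recordNum = recordNum;
            this.rowIdStart = rowIdStart;
            this.rowCount = rowCount;
        }
    }

    public static Action decide(CatalogFile catalog, long recordNum, long rowIdStart, long rowCount)
    {
        if (catalog == null)
        {
            return Action.DISCARD_WARN; // fileId not in catalog
        }
        if (!"REGULAR".equals(catalog.fileType))
        {
            return Action.DISCARD; // TEMPORARY_INGEST / TEMPORARY_GC / RETIRED
        }
        if (catalog.recordNum != recordNum || catalog.rowIdStart != rowIdStart
                || catalog.rowCount != rowCount)
        {
            return Action.DISCARD_WARN; // REGULAR but metadata mismatch
        }
        return Action.KEEP;
    }
}
```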

5. Subsystem recovery

Visibility. Each kept entry rebuilds an RGVisibility with baseTimestamp and baseBitmap taken from the body and an empty deletion chain. CDC replay DELETEs are forked at the apply path:

  • T <= baseTimestamp → COW fold into baseBitmap, no chain item;
  • T > baseTimestamp → append to deletion chain at original T.

The replay start time has no required ordering with baseTimestamp.
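A minimal sketch of the DELETE fork, over a simplified bitmap-plus-chain structure (the real RGVisibility / JNI / native TileVisibility representations differ). Once a delete folds into the base, per-timestamp granularity below baseTimestamp is lost, which is acceptable because recovery rebuilds the base from the checkpoint cut and readers below it are not served:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the replay-time DELETE fork (§5). Names are assumptions.
public class RGVisibilitySketch
{
    private final long baseTimestamp;
    private BitSet baseBitmap;                                    // replaced on COW fold
    private final List<long[]> deletionChain = new ArrayList<>(); // {timestamp, rowOffset}

    public RGVisibilitySketch(long baseTimestamp, BitSet baseBitmap)
    {
        this.baseTimestamp = baseTimestamp;
        this.baseBitmap = baseBitmap;
    }

    public void applyDelete(long t, int rowOffset)
    {
        if (t <= baseTimestamp)
        {
            BitSet copy = (BitSet) baseBitmap.clone(); // copy-on-write fold, no chain item
            copy.set(rowOffset);
            baseBitmap = copy;
        }
        else
        {
            deletionChain.add(new long[]{t, rowOffset}); // keep the original timestamp
        }
    }

    /** Is rowOffset deleted for a reader at readTs? Assumes readTs >= baseTimestamp. */
    public boolean deletedAt(long readTs, int rowOffset)
    {
        if (baseBitmap.get(rowOffset))
        {
            return true;
        }
        for (long[] item : deletionChain)
        {
            if (item[0] <= readTs && item[1] == rowOffset)
            {
                return true;
            }
        }
        return false;
    }
}
```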

WriteBuffer. Memtables are dropped; pre-allocated TEMPORARY_INGEST objects are best-effort deleted; new empty buffers are created. To cleanly handle in-flight appends across crashes, every memtable carries an AppendSegmentState separating the physical rowBatch.size from a query-visible, monotonically increasing visibleSize. GetWriteBuffer and object flush only consume the [0, visibleSize) prefix; failed appends are compensated via Visibility delete + primary tombstone and then publishPendingAppend(handle, hidden), or the writer fails closed.
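The visibleSize prefix discipline can be sketched as follows; the handle encoding (start offset of the pending batch) and the class name are assumptions:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the visibleSize prefix (C02.4): physical size grows at append
// time, but readers and object flush only consume the published prefix.
public class AppendSegmentStateSketch
{
    private final AtomicInteger physicalSize = new AtomicInteger(); // rows appended
    private final AtomicInteger visibleSize = new AtomicInteger();  // rows published

    /** Append rows physically; returns the batch handle (start offset). */
    public int appendPending(int rows)
    {
        return physicalSize.getAndAdd(rows);
    }

    /** publishPendingAppend(handle, visible): visibleSize only moves forward. */
    public void publishVisible(int handleStart, int rows)
    {
        int end = handleStart + rows;
        visibleSize.updateAndGet(v -> Math.max(v, end));
    }

    /** GetWriteBuffer and object flush consume only the [0, visibleSize) prefix. */
    public int visibleSize() { return visibleSize.get(); }

    public int physicalSize() { return physicalSize.get(); }
}
```

Publish-hidden is simply never advancing visibleSize for that batch, so a compensated append leaves no query-visible trace.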

Ingest publish. A file becomes REGULAR only after physical close + footer/length/checksum check + MainIndex durable flush + becoming nextCommitFirstBlockId in its stream. The catalog metadata update atomically persists FILE_TYPE = REGULAR, rowIdStart / rowCount, and append-order fileMinCommitTs / fileMaxCommitTs. The conservative replay rule is that a file is REGULAR_ADMITTED only after a durable recovery checkpoint has captured it; otherwise its source events are redone by replay.

Index. Recovery rebuilds MainIndex baseline from the kept (fileId) set using catalog rowIdStart / rowCount. All write paths — normal writes, recovery replay, Storage GC — go through LocalIndexService only. The service exposes staged primitives (resolvePrimary, putMainIndexEntries, putPrimaryEntriesOnly, tombstonePrimaryResolved, updatePrimaryResolved, restorePrimaryEntries, deleteMainIndexRange). resolvePrimary returns a tri-state FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR, where NOT_FOUND_OR_ORPHAN covers missing keys, primary tombstones, RowLocations into non-baseline / retired / cleansed-out files. Only BACKEND_ERROR may fail the request. Secondary index is explicitly out of scope for recovery correctness in this stage.
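How a caller consumes the tri-state can be sketched as follows; only BACKEND_ERROR fails the request, while NOT_FOUND_OR_ORPHAN is an idempotent no-op for a replayed DELETE (method names are assumptions):

```java
// Sketch of consuming the tri-state resolvePrimary result (§5).
public final class ResolveHandling
{
    public enum ResolveState { FOUND, NOT_FOUND_OR_ORPHAN, BACKEND_ERROR }

    /** Returns true when the DELETE half should actually be applied. */
    public static boolean shouldApplyDelete(ResolveState s)
    {
        switch (s)
        {
            case FOUND:
                return true;  // proceed: Visibility delete + tombstonePrimaryResolved
            case NOT_FOUND_OR_ORPHAN:
                return false; // missing key, tombstone, or non-baseline location: ack as no-op
            case BACKEND_ERROR:
            default:
                throw new IllegalStateException("index backend error: fail the request");
        }
    }
}
```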

Unified write order.

  • DELETE: resolvePrimary → Visibility delete → tombstonePrimaryResolved
  • INSERT: append pending → putMainIndexEntries → putPrimaryEntriesOnly → publishPendingAppend(visible)
  • Optional compatibility UPDATE: resolvePrimary → append pending → putMainIndexEntries → Visibility delete → updatePrimaryResolved → publishPendingAppend(visible)

CDC replace = DELETE + INSERT at the same source timestamp T, in the same all-or-nothing request, in the same stream, DELETE first.
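The ordering above, with CDC replace combining both halves at one timestamp, can be sketched against an abstracted interface; the real LocalIndexService signatures differ, and the failure compensation path (Visibility delete + tombstone + publish-hidden) is elided:

```java
// Sketch of the unified write order (§5). Interface and names are assumptions.
public final class UnifiedWriteOrder
{
    public interface Ops
    {
        boolean resolvePrimaryFound(byte[] key, long t); // tri-state reduced to found/not
        void visibilityDelete(byte[] key, long t);
        void tombstonePrimaryResolved(byte[] key, long t);
        void appendPending(byte[] row, long t);
        void putMainIndexEntries(byte[] key, long t);
        void putPrimaryEntriesOnly(byte[] key, long t);
        void publishPendingAppendVisible(long t);
    }

    /** CDC replace = DELETE + INSERT at the same timestamp t, DELETE first. */
    public static void cdcReplace(Ops ops, byte[] key, byte[] row, long t)
    {
        if (ops.resolvePrimaryFound(key, t)) // orphan/missing: skip the DELETE half
        {
            ops.visibilityDelete(key, t);
            ops.tombstonePrimaryResolved(key, t);
        }
        ops.appendPending(row, t);
        ops.putMainIndexEntries(key, t);
        ops.putPrimaryEntriesOnly(key, t);
        ops.publishPendingAppendVisible(t);
    }
}
```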

Storage GC. Candidate scan and rewrite cutoff use the runtime TransService.getSafeGcTimestamp(), decoupled from recovery checkpoint. New-file Visibility starts at baseTimestamp = safeGcTs. Storage GC writes a write-ahead rollback journal (INDEX_SWITCHING / SWAPPED_NOT_CHECKPOINTED / CHECKPOINTED / ABORTED) before primary switch. After atomicSwapFiles(newFileId, oldFileIds, cleanupAt), the new file is online but the journal stays at SWAPPED_NOT_CHECKPOINTED until a durable recovery checkpoint baseline accepts it. Recovery decides per-task: keep + advance, roll back to old-file baseline, or fail closed if rollback anchors are unavailable.
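The startup decision over the journal states can be sketched as a small state machine; state and decision names follow this section, and baselineAcceptsNewFile stands in for checking the new file against the checkpoint baseline:

```java
// Sketch of the startup decision over the Storage GC rollback journal (C07.6).
public final class GcJournalRecovery
{
    public enum State { INDEX_SWITCHING, SWAPPED_NOT_CHECKPOINTED, CHECKPOINTED, ABORTED }

    public enum Decision { ROLL_BACK, PROMOTE, KEEP }

    public static Decision onStartup(State s, boolean baselineAcceptsNewFile)
    {
        switch (s)
        {
            case INDEX_SWITCHING:
                return Decision.ROLL_BACK; // primary switch never completed
            case SWAPPED_NOT_CHECKPOINTED:
                return baselineAcceptsNewFile ? Decision.PROMOTE : Decision.ROLL_BACK;
            case CHECKPOINTED:
            case ABORTED:
            default:
                return Decision.KEEP;      // nothing left to undo
        }
    }
}
```

A missing journal while the primary index already points at a non-baseline newRowId cannot be decided here; per C07.6 that case is FAILED.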

6. CDC ↔ Retina protocol

New RPCs on the Retina side:

  • GetRetinaStatus returns lifecycle state, recoveryAttemptToken (CSPRNG, in-memory only), checkpointId, recoveryEpoch (diagnostic), replayTsReady;
  • GetRecoveryReplayTs(token, checkpointId, mode=VNODE|NODE) returns the replay starting points;
  • MarkReady(token, checkpointId) finishes the recovery cycle.

Recovery replay writes carry an optional RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode). In RECOVERING, Retina accepts only requests matching the current context with timestamp >= replayFromTs (per the chosen mode). Standalone INSERTs are protocol errors; CDC must always emit DELETE + INSERT.

Multi-stream requests are acked all-or-nothing at the request level, but request-level failure does not imply physical rollback. Already-applied prefix effects must converge idempotently on retry; otherwise Retina fails closed.

MarkReady performs an internal barrier:

  1. validate recoveryAttemptToken and checkpointId against the current attempt;
  2. close the recovery-replay write entry inside RECOVERING;
  3. drain in-flight replay requests;
  4. switch to READY;
  5. fire the previously dormant hooks (start GC scheduler, invalidate planner/cache, unblock queries).
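The five steps above can be sketched as a minimal barrier; all names are assumptions, the busy-wait drain only keeps the sketch short, and a real implementation would make the entry check and counter increment atomic:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the MarkReady barrier ordering (§6).
public class MarkReadyBarrier
{
    public enum LifecycleState { RECOVERING, READY, FAILED }

    private final String attemptToken;
    private final long checkpointId;
    private final AtomicInteger inFlightReplay = new AtomicInteger();
    private volatile boolean replayEntryOpen = true;
    private volatile LifecycleState state = LifecycleState.RECOVERING;

    public MarkReadyBarrier(String attemptToken, long checkpointId)
    {
        this.attemptToken = attemptToken;
        this.checkpointId = checkpointId;
    }

    /** A replay write must enter before applying and exit after acking. */
    public boolean tryEnterReplay()
    {
        if (!replayEntryOpen)
        {
            return false;
        }
        inFlightReplay.incrementAndGet();
        return true;
    }

    public void exitReplay()
    {
        inFlightReplay.decrementAndGet();
    }

    public boolean markReady(String token, long ckptId, Runnable readyHooks)
    {
        if (!attemptToken.equals(token) || checkpointId != ckptId)
        {
            return false;                // 1. validate against the current attempt
        }
        replayEntryOpen = false;         // 2. close the replay write entry
        while (inFlightReplay.get() > 0) // 3. drain in-flight replay requests
        {
            Thread.onSpinWait();
        }
        state = LifecycleState.READY;    // 4. the only edge to READY
        readyHooks.run();                // 5. GC scheduler, cache invalidation, query unblock
        return true;
    }

    public LifecycleState state() { return state; }
}
```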

CDC unilateral failures never push Retina into RECOVERING; backlog catchup happens entirely under READY with the same DELETE + INSERT idempotent encoding.

7. Query gate

The query gate's authoritative boundary is TransServiceImpl.beginTrans / beginTransBatch for readOnly = true. With retina.enable = true:

  • Expected Retina set comes from $PIXELS_HOME/etc/retina;
  • TransServer maintains a local QueryAvailabilitySnapshot from the static expected membership and a watch on /pixels/retina/lifecycle/<host:retinaPort>;
  • Read-only transactions are allowed only when every expected node is present in the snapshot and is READY. Anything else — RECOVERING, FAILED, missing, stale, malformed, or watch-not-yet-initialized — fails closed.
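The admission rule above can be sketched as a pure predicate over the static expected set and the watched lifecycle snapshot; the shapes and the state string are assumptions:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the read-only admission rule (§7): every statically expected
// Retina node must be present and READY, and the lifecycle watch must have
// completed its first sync; anything else fails closed.
public final class QueryAvailabilityGate
{
    public static boolean allowReadOnly(Set<String> expectedNodes,
                                        Map<String, String> lifecycleSnapshot,
                                        boolean watchInitialized)
    {
        if (!watchInitialized)
        {
            return false; // fail closed before the first watch synchronization
        }
        for (String node : expectedNodes)
        {
            if (!"READY".equals(lifecycleSnapshot.get(node)))
            {
                return false; // missing, RECOVERING, FAILED, or malformed state
            }
        }
        return true;
    }
}
```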

Planner / cache / Trino connector / Turbo planner must produce file lists, splits, cache fills, or scan inputs only under a successful gated read-only transaction. Bypassing the gate to call metadata enumeration is a recovery-correctness violation, not an ordinary protocol bug.

8. Failure scenarios covered

  • CDC-only crash → Retina stays READY; CDC replays under normal backlog catchup.
  • Retina-only crash → Retina re-enters RECOVERING, cleanses checkpoint, exposes replay timestamps, waits for CDC MarkReady.
  • Both crash → both recover independently and rendezvous via GetRetinaStatus.
  • Crash during recovery → next start re-enters RECOVERING; all cleanup steps are idempotent.
  • Crash during checkpoint write → unswitched body objects are ignored; the previous valid current is used.
  • Crash during ingest publish → file stays non-REGULAR and is redone by CDC replay.
  • Crash during Storage GC → the rollback journal + checkpoint baseline acceptance decide rollback vs keep.
  • Crash before/after ack → covered by idempotent write paths and CDC re-send.

9. Invariants and acceptance

The acceptance section in the design enumerates ~50 invariants. Highlights:

  • No queries can read Retina data in RECOVERING or FAILED.
  • MarkReady barrier is the only edge where queries are unblocked.
  • getRegularFiles(pathId) only returns REGULAR after READY; non-REGULAR files are unreachable from query paths.
  • Replay timestamps are always derived; no single persistent replay field exists in the checkpoint header.
  • CDC replace is the only idempotent write encoding for source INSERT/UPDATE/UPSERT during recovery and backlog catchup.
  • READY does not provide combined atomic scan visibility for CDC replace — short windows where DELETE is applied but INSERT is not (or vice versa) are normal live freshness, not bugs.
  • Phase 4 (lifecycle / gate) and Phase 5 (MarkReady barrier) ship as a single closed loop; no production binary in between is allowed to expose READY.

@gengdy1545 gengdy1545 added this to the Real-time CRUD milestone May 12, 2026
@gengdy1545 gengdy1545 self-assigned this May 12, 2026
@gengdy1545 gengdy1545 added the enhancement New feature or request label May 12, 2026
Successfully merging this pull request may close these issues.

[pixels-retina] comprehensive fault recovery protocol must be implemented