[Issue #1327] pixels retina recovery protocol #1328

Draft
gengdy1545 wants to merge 3 commits into pixelsdb:master from gengdy1545:feature/recovery

Conversation

gengdy1545 commented May 12, 2026

Pixels Retina Recovery Protocol

Summary

This PR introduces a recovery protocol for Pixels Retina that defines how the system restores internal consistency after process crashes, machine restarts, CDC crashes, or simultaneous CDC + Retina crashes. On startup, Retina first cleanses the RG-visibility snapshot in the latest recovery checkpoint against the catalog, producing a self-consistent internal recovery starting point; CDC then replays the source-side events that were lost, starting from the recovery replay timestamp derived from that checkpoint. The system finally converges to the pre-crash state under READY. Queries are fully fail-closed during RECOVERING and become available only after a successful MarkReady barrier.

This document describes the final recovery-capable design — there is no online dual-track or fall-back to the legacy recovery path. A one-shot schema/catalog/bootstrap upgrade is allowed during rollout, but the runtime code after the upgrade must obey the semantics defined here.


Task List

The implementation is organized as a sequence of commit-level tasks. The numbering encodes only dependency and suggested merge order, not priority. No intermediate commit may allow an unfinished recovery-capable path to reach a queryable READY state.

  • C00 — Baseline hardening of the current code

    • C00.1 Make MetadataService.addFiles / updateFile / deleteFiles return true on success and treat false/exception as publish-barrier failure in all callers
    • C00.2 Fix MainIndex flush durability: MainIndexBuffer snapshot → SQLite transaction commit → drop buffer/cache only after commit; row_id_ranges writes idempotent on retry
    • C00.3 Fix PixelsWriteBuffer.addRow RowLocation race (capture currentMemTable at append time)
    • C00.4 Stop FileWriterManager.finish() from publishing REGULAR directly; centralize publish in PixelsWriteBuffer
    • C00.5 Tighten query-visible enumeration from FILE_TYPE <> 0 to FILE_TYPE = REGULAR
    • C00.6 Make initialization and background GC fail-closed (no half-initialized service, no GC scheduler started from constructor)
  • C01 — Metadata schema and enumeration APIs

    • C01.1 Extend file types to TEMPORARY_INGEST / TEMPORARY_GC / REGULAR / RETIRED and add FILE_CLEANUP_AT
    • C01.2 Add recovery file metadata: tableId, virtualNodeId, ingestSeq/firstBlockId, fileMinCommitTs, fileMaxCommitTs, rowIdStart/rowCount
    • C01.3 Add typed/admin APIs: getRegularFiles, listFilesByType, listTemporaryFilesBefore, listRetiredFilesBefore, deleteTemporaryFiles, deleteRetiredFiles
    • C01.4 Migrate planner / cache / Storage GC candidate scan / Retina startup off getFiles(pathId); recovery-capable startup must not auto-fill baseline via getRegularFiles
    • C01.5 Make query-facing enumeration depend on a server-side gated read-only transaction
    • C01.6 Implement atomicSwapFiles(newFileId, oldFileIds, cleanupAt) in a single metadata transaction
    • C01.7 One-shot catalog upgrade mapping legacy TEMPORARY to TEMPORARY_INGEST / TEMPORARY_GC
  • C02 — Ingest publish ordering and write-buffer prefix visibility

    • C02.1 FileWriterManager.finish() only does object-block flush, writer.close(), footer/length/checksum check
    • C02.2 Maintain ingest-file runtime state (physicalClosed / indexFlushed / metadataRegular …)
    • C02.3 Maintain append-order fileMinCommitTs / fileMaxCommitTs at the write boundary; footer hidden-timestamp stats are audit-only
    • C02.4 Add AppendSegmentState.visibleSize, batch-level append handle, and publishPendingAppend(handle, visible|hidden)
    • C02.5 Add per-stream file publisher ordered by firstBlockId; out-of-order physical close / index flush, but in-order TEMPORARY_INGEST → REGULAR publish
    • C02.6 Centralize publish in PixelsWriteBuffer: physical close → LocalIndexService.flushIndexEntriesOfFile → atomic metadata update (file type + rowId range + commit-ts range) → object cleanup
    • C02.7 Unify scheduler flush and PixelsWriteBuffer.close() paths through the same publisher
  • C03 — LocalIndexService staging and Visibility replay semantics

    • C03.1 Add staged primary/MainIndex APIs: resolvePrimary, putMainIndexEntries, putPrimaryEntriesOnly, tombstonePrimaryResolved, updatePrimaryResolved, restorePrimaryEntries, deleteMainIndexRange
    • C03.2 resolvePrimary returns tri-state FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR against the current baseline visible file set
    • C03.3 DELETE order: resolvePrimary → Visibility delete → tombstonePrimaryResolved
    • C03.4 INSERT order: append pending → putMainIndexEntries → putPrimaryEntriesOnly → publishPendingAppend(visible); failure path compensates via Visibility delete + primary tombstone + publish-hidden or fail-closed
    • C03.5 CDC replace must be DELETE + INSERT, not legacy UpdateData
    • C03.6 Single-source commit timestamp: validate IndexKey.timestamp == TableUpdateData.timestamp and use it everywhere
    • C03.7 In RGVisibility / JNI / native TileVisibility, fork DELETE by T <= baseTimestamp (COW fold into baseBitmap) vs T > baseTimestamp (append deletion chain)
    • C03.8 Explicitly exclude secondary index from the recovery-correctness scope of this stage
  • C04 — Recovery checkpoint and startup cleansing

    • C04.1 Define checkpoint body binary format with checksum: header (retinaNodeId, checkpointId, writerEpoch, writeTime, checkpointAppliedTs, topologyHash) + fileEntries[] / ingestSegmentEntries[] / rgEntries[]
    • C04.2 Per-node etcd current / previous slots under /pixels/retina/recovery/checkpoint/${encodedNodeId}/; publish order: write body + fsync → atomic etcd swap → async cleanup of replaced body
    • C04.3 Checkpoint generator reads TransService HWM, sets checkpointAppliedTs = HWM - 1, then dumps RGVisibilityIndex + checkpoint-admitted REGULAR catalog metadata + ingest segment snapshot
    • C04.4 Startup loads body from current/previous and validates magic / schema / checksum / retinaNodeId / topologyHash / checkpointAppliedTs
    • C04.5 Cleanse FileEntry / VisibilityEntry / IngestSegmentEntry against the catalog; rebuild RGVisibilityIndex and MainIndex baseline; produce baseline visible file set
    • C04.6 Compute scopeReplayFromTs / vnodeReplayFromTs / nodeReplayFromTs per §3.4; degrade to MIN_REPLAY_TS when segment chain is untrustworthy
    • C04.7 Mark catalog-only ingest REGULAR files (not covered by checkpoint, not protected by Storage GC journal) as RETIRED and enqueue cleanup; FAILED if no checkpoint but catalog has REGULAR
  • C05 — Lifecycle, query gate and startup gating

    • C05.1 Introduce RetinaLifecycleState and lifecycle coordinator publishing RECOVERING / READY / FAILED to a leased /pixels/retina/lifecycle/<host:retinaPort> key
    • C05.2 Replace RetinaServerImpl constructor logic with the recovery coordinator; remove metadata full-preload from the recovery-capable path
    • C05.3 Move the periodic GC scheduler out of RetinaResourceManager constructor into the lifecycle READY hook
    • C05.4 Gate QueryVisibility / GetWriteBuffer / RegisterOffload / UpdateRecord / StreamUpdateRecord / AddVisibility / AddWriteBuffer on lifecycle state
    • C05.5 Add a server-side QueryAvailabilityGate to TransServiceImpl.beginTrans / beginTransBatch for readOnly=true
    • C05.6 Make planner / cache / Trino connector / Turbo planner enumeration depend on a successful gated read-only transaction
    • C05.7 Land a fail-closed markReady(...) skeleton; production paths must never observe READY until C06
  • C06 — CDC recovery replay and MarkReady barrier

    • C06.1 Extend proto/retina.proto with GetRetinaStatus / GetRecoveryReplayTs / MarkReady, state enum, recoveryAttemptToken / recoveryEpoch / checkpointId / replayTsReady / vnodeReplayFromTs / nodeReplayFromTs
    • C06.2 Add optional RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode) to UpdateRecordRequest
    • C06.3 In the replay sub-phase of RECOVERING, accept only DELETE / CDC-replace requests with matching context and timestamp >= replayFromTs; reject standalone INSERT
    • C06.4 Multi-table / multi-stream requests are committed with request-level all-or-nothing ack
    • C06.5 Track in-flight recovery replay requests; close the replay write entry when MarkReady barrier starts
    • C06.6 Implement lifecycle coordinator markReady(recoveryAttemptToken, checkpointId): validate → close entry → drain in-flight → switch to READY → start GC scheduler → invalidate split/cache → unblock queries
    • C06.7 Empty-DB / no-replay-needed startup also goes through the same self-MarkReady path; no shortcut allowed
  • C07 — Storage GC recovery

    • C07.1 Switch GC temporary file type to TEMPORARY_GC; candidate scan via getRegularFiles; rewrite cutoff = TransService.getSafeGcTimestamp()
    • C07.2 Rewrite filtering on delete_ts <= safeGcTs; new file Visibility initialized with baseTimestamp = safeGcTs and chain items for delete_ts > safeGcTs
    • C07.3 Move Storage GC's rowId allocation, MainIndex write/flush, primary switch, rollback and range deletion onto LocalIndexService
    • C07.4 Write-ahead rollback journal before primary switch: persist rollback entry first, then switch, then mark updated
    • C07.5 After atomicSwapFiles, journal becomes SWAPPED_NOT_CHECKPOINTED; promote to CHECKPOINTED only after a durable recovery checkpoint baseline accepts the new file
    • C07.6 Startup processes the journal: INDEX_SWITCHING rolls back, SWAPPED_NOT_CHECKPOINTED either promotes or rolls back depending on checkpoint baseline acceptance, missing journal with primary pointing to a non-baseline newRowId is FAILED
  • C08 — Background cleanup, testing and operations

    • C08.1 Background sweep listTemporaryFilesBefore for long-hanging temporary files
    • C08.2 Background sweep listRetiredFilesBefore(now, limit) cleaning old Visibility → old MainIndex range → physical file → retired catalog, in that order
    • C08.3 Metrics, orphan-index counters and alerts; only emit hashes / short prefixes of recoveryAttemptToken
    • C08.4 Runbook: no-checkpoint, broken checkpoint, bootstrap, CDC lag, Storage GC journal mismatch, query-gate unavailable
    • C08.5 Crash-injection and protocol tests covering ingest, checkpoint, cleansing, replay, MarkReady, Storage GC, query gate, background cleanup

Detailed Design

1. Goals and boundaries

The recovery protocol covers four crash scenarios:

  • Retina-only crash;
  • CDC-only crash;
  • simultaneous CDC + Retina crash;
  • a second crash during recovery.

After Retina restarts, it loads the latest valid recovery checkpoint, cleanses its RG-visibility snapshot against the catalog, and rebuilds an internal recovery starting point. CDC then replays source events starting from the replay timestamp derived from that checkpoint (inclusive). Once CDC has signaled completion via MarkReady, Retina runs an internal barrier that closes the replay write entry, drains in-flight replay requests, and switches to READY, the only edge where queries become visible.

Recovery is bound to a fixed Retina topology. The expected node set comes from the static configuration $PIXELS_HOME/etc/retina, and each checkpoint records a topologyHash. Within one recovery attempt, the expected Retina set, retina.server.port, node.virtual.num and the vnode-to-Retina mapping must stay constant; otherwise the attempt fails closed or restarts.

READY only guarantees per-effect visibility (a single INSERT or single DELETE). CDC replace's DELETE + INSERT pair does not provide combined atomic scan visibility; the short live intermediate state (DELETE applied, INSERT not yet, or vice versa) is normal READY freshness behavior, not a recovery-correctness violation.

2. Lifecycle and file states

Retina has three external lifecycle states:

State        CDC writes                                                      Queries
RECOVERING   rejected before cleansing completes; only recovery-replay       rejected
             writes afterwards
READY        accepted                                                        accepted
FAILED       rejected                                                        rejected

RECOVERING is the only externally visible recovery state. Internally it is split into checkpoint cleansing, waiting for CDC recovery replay, and the MarkReady barrier. Even after cleansing completes and the replay timestamp is computed, queries remain rejected; otherwise the same read timestamp could observe different snapshots as replay progresses.

File states in the catalog are extended to four:

  • REGULAR — published, query-visible data file;
  • TEMPORARY_INGEST — pre-allocated or in-progress ingest file;
  • TEMPORARY_GC — Storage GC rewrite file, governed by the GC journal;
  • RETIRED — old REGULAR retired by GC swap or recovery cleansing, with FILE_CLEANUP_AT driving delayed cleanup.

Query-visible enumeration is done only through getRegularFiles(pathId); non-REGULAR files are reachable only through typed/admin APIs (e.g. listFilesByType, listRetiredFilesBefore).

3. Recovery checkpoint

A recovery checkpoint is not a query baseline — it is the only source of truth Retina uses to bootstrap its internal recovery state. The contract is:

Provide a starting point such that "cleansing + CDC replay" can converge to a queryable READY.

Each checkpoint body captures, for one Retina node:

  • the RGVisibilityIndex snapshot ((fileId, rgId) -> bitmap);
  • checkpointAppliedTs = HWM - 1 (the visibility-applied cut for DELETE / UPDATE-old-row);
  • a per-(tableId, virtualNodeId) ingest segment chain (REGULAR-admitted, pending, open);
  • topologyHash, retinaNodeId, writerEpoch / leaseId, checksum, length.

Commit protocol uses an immutable body object + per-node etcd two-slot pointer: write body + fsync, then atomically swap current/previous in a single etcd transaction. Recovery only reads its own node's current then previous; it never lists storage directly.
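The two-slot commit can be sketched as follows. The key layout under /pixels/retina/recovery/checkpoint/ comes from C04.2, but the in-memory map standing in for etcd and all method names are illustrative assumptions, not the actual client code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the per-node two-slot checkpoint pointer (C04.2). The in-memory
// map stands in for etcd; in the real protocol the current/previous swap is
// a single etcd transaction executed after the body is written and fsynced.
public class CheckpointSlots
{
    private final Map<String, String> kv = new ConcurrentHashMap<>();
    private final String prefix;

    public CheckpointSlots(String encodedNodeId)
    {
        this.prefix = "/pixels/retina/recovery/checkpoint/" + encodedNodeId + "/";
    }

    /**
     * Publish a checkpoint body that is already durable. Returns the body
     * path displaced out of the previous slot, for asynchronous cleanup.
     */
    public synchronized String publish(String newBodyPath)
    {
        String displaced = kv.get(prefix + "previous");
        String current = kv.get(prefix + "current");
        if (current != null)
        {
            kv.put(prefix + "previous", current); // old current becomes previous
        }
        kv.put(prefix + "current", newBodyPath);
        return displaced; // cleaned up asynchronously after the swap
    }

    public String current() { return kv.get(prefix + "current"); }
    public String previous() { return kv.get(prefix + "previous"); }
}
```

Because the body object is immutable and the pointer swap is atomic, a crash between body write and swap leaves only an unreferenced body object, which matches the "unswitched body objects are ignored" failure case in §8.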

checkpointAppliedTs requires a co-design with the write path: a write transaction can only enter committed state (and be covered by HWM) after Retina has synchronously completed its visibility / index / write-buffer effects and acked CDC. This invariant is what makes HWM - 1 safe to use directly.

Recovery replay timestamps are derived, not stored:

RecoveryScope = (retinaNodeId, tableId, virtualNodeId)

scopeReplayFromTs = min(checkpointAppliedTs, earliestUnsafeInsertTs)
vnodeReplayFromTs[v] = min(scopeReplayFromTs over tables on this Retina)
nodeReplayFromTs    = min(vnodeReplayFromTs)

earliestUnsafeInsertTs is the minCommitTs of the first non-empty pending/open ingest segment after the last checkpoint-admitted REGULAR segment in that scope. All-DELETE scopes produce +INF here, so the replay starts from checkpointAppliedTs and never falls back to 0. Untrustworthy segment chains degrade to a conservative MIN_REPLAY_TS = 0 — only ever increasing replay, never reducing it.

CDC consumes either VNODE mode (vnodeReplayFromTs[v]) or NODE mode (nodeReplayFromTs). Both are inclusive.
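The derivation above can be sketched as three pure min-folds; class and constant names are assumptions, and NO_UNSAFE_INSERT models the +INF produced by all-DELETE scopes:

```java
import java.util.Collection;

// Sketch of the derived replay timestamps (§3.4). Names are assumptions.
public final class ReplayTs
{
    public static final long MIN_REPLAY_TS = 0L;                 // conservative degradation
    public static final long NO_UNSAFE_INSERT = Long.MAX_VALUE;  // "+INF": no unsafe inserts

    /** Per (retinaNodeId, tableId, virtualNodeId) scope. */
    public static long scopeReplayFromTs(long checkpointAppliedTs, long earliestUnsafeInsertTs)
    {
        return Math.min(checkpointAppliedTs, earliestUnsafeInsertTs);
    }

    /** VNODE mode: min over the scopes of the tables on this vnode. */
    public static long vnodeReplayFromTs(Collection<Long> scopeTs)
    {
        return scopeTs.stream().mapToLong(Long::longValue).min().orElse(MIN_REPLAY_TS);
    }

    /** NODE mode: min over all vnode replay timestamps of this node. */
    public static long nodeReplayFromTs(Collection<Long> vnodeTs)
    {
        return vnodeTs.stream().mapToLong(Long::longValue).min().orElse(MIN_REPLAY_TS);
    }
}
```

An all-DELETE scope passes NO_UNSAFE_INSERT, so its replay starts at checkpointAppliedTs instead of falling back to 0.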

4. Cleansing rules

For every entry in the loaded checkpoint body:

Condition                                                          Action
fileId not in catalog                                              discard entry, WARN
fileId exists but FILE_TYPE != REGULAR                             discard entry
fileId is REGULAR but recordNum / rowIdStart / rowCount mismatch   discard entry, WARN
fileId is REGULAR and matches                                      keep entry

Catalog files that are REGULAR but have no entry in the checkpoint body and are not protected by Storage GC journal must be atomically marked RETIRED (FILE_CLEANUP_AT = now) before READY; their source events are redone by CDC replay. This avoids long-lived catalog-only REGULAR orphans being picked up by getRegularFiles.

Discarding individual entries does not invalidate the checkpoint. Only header / checksum / format errors invalidate the entire body and force a fallback to previous.
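The per-entry rule table can be sketched as a pure decision function; the catalog record shape and its field names here are assumptions:

```java
// Sketch of the per-entry cleansing rule (§4) as a pure decision function.
public final class CleansingRule
{
    public enum Action { KEEP, DISCARD, DISCARD_WARN }

    public static final class CatalogFile
    {
        final String fileType;
        final long recordNum, rowIdStart, rowCount;

        public CatalogFile(String fileType, long recordNum, long rowIdStart, long rowCount)
        {
            this.fileType = fileType;
            this.recordNum = recordNum;
            this.rowIdStart = rowIdStart;
            this.rowCount = rowCount;
        }
    }

    public static Action decide(CatalogFile catalog, long recordNum, long rowIdStart, long rowCount)
    {
        if (catalog == null)
        {
            return Action.DISCARD_WARN; // fileId not in catalog
        }
        if (!"REGULAR".equals(catalog.fileType))
        {
            return Action.DISCARD; // TEMPORARY_INGEST / TEMPORARY_GC / RETIRED
        }
        if (catalog.recordNum != recordNum || catalog.rowIdStart != rowIdStart
                || catalog.rowCount != rowCount)
        {
            return Action.DISCARD_WARN; // REGULAR but metadata mismatch
        }
        return Action.KEEP;
    }
}
```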

5. Subsystem recovery

Visibility. Each kept entry rebuilds an RGVisibility with baseTimestamp and baseBitmap taken from the body and an empty deletion chain. CDC replay DELETEs are forked at the apply path:

  • T <= baseTimestamp → COW fold into baseBitmap, no chain item;
  • T > baseTimestamp → append to deletion chain at original T.

The replay start time has no required ordering with baseTimestamp.
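A minimal sketch of the DELETE fork, over a simplified bitmap-plus-chain structure (the real RGVisibility / JNI / native TileVisibility representations differ). Once a delete folds into the base, per-timestamp granularity below baseTimestamp is lost, which is acceptable because recovery rebuilds the base from the checkpoint cut and readers below it are not served:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the replay-time DELETE fork (§5). Names are assumptions.
public class RGVisibilitySketch
{
    private final long baseTimestamp;
    private BitSet baseBitmap;                                    // replaced on COW fold
    private final List<long[]> deletionChain = new ArrayList<>(); // {timestamp, rowOffset}

    public RGVisibilitySketch(long baseTimestamp, BitSet baseBitmap)
    {
        this.baseTimestamp = baseTimestamp;
        this.baseBitmap = baseBitmap;
    }

    public void applyDelete(long t, int rowOffset)
    {
        if (t <= baseTimestamp)
        {
            BitSet copy = (BitSet) baseBitmap.clone(); // copy-on-write fold, no chain item
            copy.set(rowOffset);
            baseBitmap = copy;
        }
        else
        {
            deletionChain.add(new long[]{t, rowOffset}); // keep the original timestamp
        }
    }

    /** Is rowOffset deleted for a reader at readTs? Assumes readTs >= baseTimestamp. */
    public boolean deletedAt(long readTs, int rowOffset)
    {
        if (baseBitmap.get(rowOffset))
        {
            return true;
        }
        for (long[] item : deletionChain)
        {
            if (item[0] <= readTs && item[1] == rowOffset)
            {
                return true;
            }
        }
        return false;
    }
}
```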

WriteBuffer. Memtables are dropped; pre-allocated TEMPORARY_INGEST objects are best-effort deleted; new empty buffers are created. To cleanly handle in-flight appends across crashes, every memtable carries an AppendSegmentState separating the physical rowBatch.size from a query-visible, monotonically increasing visibleSize. GetWriteBuffer and object flush only consume the [0, visibleSize) prefix; failed appends are compensated via Visibility delete + primary tombstone and then publishPendingAppend(handle, hidden), or the writer fails closed.
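The visibleSize prefix discipline can be sketched as follows; the handle encoding (start offset of the pending batch) and the class name are assumptions:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the visibleSize prefix (C02.4): physical size grows at append
// time, but readers and object flush only consume the published prefix.
public class AppendSegmentStateSketch
{
    private final AtomicInteger physicalSize = new AtomicInteger(); // rows appended
    private final AtomicInteger visibleSize = new AtomicInteger();  // rows published

    /** Append rows physically; returns the batch handle (start offset). */
    public int appendPending(int rows)
    {
        return physicalSize.getAndAdd(rows);
    }

    /** publishPendingAppend(handle, visible): visibleSize only moves forward. */
    public void publishVisible(int handleStart, int rows)
    {
        int end = handleStart + rows;
        visibleSize.updateAndGet(v -> Math.max(v, end));
    }

    /** GetWriteBuffer and object flush consume only the [0, visibleSize) prefix. */
    public int visibleSize() { return visibleSize.get(); }

    public int physicalSize() { return physicalSize.get(); }
}
```

Publish-hidden is simply never advancing visibleSize for that batch, so a compensated append leaves no query-visible trace.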

Ingest publish. A file becomes REGULAR only after physical close + footer/length/checksum check + MainIndex durable flush + becoming nextCommitFirstBlockId in its stream. The catalog metadata update atomically persists FILE_TYPE = REGULAR, rowIdStart / rowCount, and append-order fileMinCommitTs / fileMaxCommitTs. The conservative replay rule is that a file is REGULAR_ADMITTED only after a durable recovery checkpoint has captured it; otherwise its source events are redone by replay.

Index. Recovery rebuilds MainIndex baseline from the kept (fileId) set using catalog rowIdStart / rowCount. All write paths — normal writes, recovery replay, Storage GC — go through LocalIndexService only. The service exposes staged primitives (resolvePrimary, putMainIndexEntries, putPrimaryEntriesOnly, tombstonePrimaryResolved, updatePrimaryResolved, restorePrimaryEntries, deleteMainIndexRange). resolvePrimary returns a tri-state FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR, where NOT_FOUND_OR_ORPHAN covers missing keys, primary tombstones, RowLocations into non-baseline / retired / cleansed-out files. Only BACKEND_ERROR may fail the request. Secondary index is explicitly out of scope for recovery correctness in this stage.
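How a caller consumes the tri-state can be sketched as follows; only BACKEND_ERROR fails the request, while NOT_FOUND_OR_ORPHAN is an idempotent no-op for a replayed DELETE (method names are assumptions):

```java
// Sketch of consuming the tri-state resolvePrimary result (§5).
public final class ResolveHandling
{
    public enum ResolveState { FOUND, NOT_FOUND_OR_ORPHAN, BACKEND_ERROR }

    /** Returns true when the DELETE half should actually be applied. */
    public static boolean shouldApplyDelete(ResolveState s)
    {
        switch (s)
        {
            case FOUND:
                return true;  // proceed: Visibility delete + tombstonePrimaryResolved
            case NOT_FOUND_OR_ORPHAN:
                return false; // missing key, tombstone, or non-baseline location: ack as no-op
            case BACKEND_ERROR:
            default:
                throw new IllegalStateException("index backend error: fail the request");
        }
    }
}
```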

Unified write order.

  • DELETE: resolvePrimary → Visibility delete → tombstonePrimaryResolved
  • INSERT: append pending → putMainIndexEntries → putPrimaryEntriesOnly → publishPendingAppend(visible)
  • Optional compatibility UPDATE: resolvePrimary → append pending → putMainIndexEntries → Visibility delete → updatePrimaryResolved → publishPendingAppend(visible)

CDC replace = DELETE + INSERT at the same source timestamp T, in the same all-or-nothing request, in the same stream, DELETE first.
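The ordering above, with CDC replace combining both halves at one timestamp, can be sketched against an abstracted interface; the real LocalIndexService signatures differ, and the failure compensation path (Visibility delete + tombstone + publish-hidden) is elided:

```java
// Sketch of the unified write order (§5). Interface and names are assumptions.
public final class UnifiedWriteOrder
{
    public interface Ops
    {
        boolean resolvePrimaryFound(byte[] key, long t); // tri-state reduced to found/not
        void visibilityDelete(byte[] key, long t);
        void tombstonePrimaryResolved(byte[] key, long t);
        void appendPending(byte[] row, long t);
        void putMainIndexEntries(byte[] key, long t);
        void putPrimaryEntriesOnly(byte[] key, long t);
        void publishPendingAppendVisible(long t);
    }

    /** CDC replace = DELETE + INSERT at the same timestamp t, DELETE first. */
    public static void cdcReplace(Ops ops, byte[] key, byte[] row, long t)
    {
        if (ops.resolvePrimaryFound(key, t)) // orphan/missing: skip the DELETE half
        {
            ops.visibilityDelete(key, t);
            ops.tombstonePrimaryResolved(key, t);
        }
        ops.appendPending(row, t);
        ops.putMainIndexEntries(key, t);
        ops.putPrimaryEntriesOnly(key, t);
        ops.publishPendingAppendVisible(t);
    }
}
```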

Storage GC. Candidate scan and rewrite cutoff use the runtime TransService.getSafeGcTimestamp(), decoupled from recovery checkpoint. New-file Visibility starts at baseTimestamp = safeGcTs. Storage GC writes a write-ahead rollback journal (INDEX_SWITCHING / SWAPPED_NOT_CHECKPOINTED / CHECKPOINTED / ABORTED) before primary switch. After atomicSwapFiles(newFileId, oldFileIds, cleanupAt), the new file is online but the journal stays at SWAPPED_NOT_CHECKPOINTED until a durable recovery checkpoint baseline accepts it. Recovery decides per-task: keep + advance, roll back to old-file baseline, or fail closed if rollback anchors are unavailable.
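The startup decision over the journal states can be sketched as a small state machine; state and decision names follow this section, and baselineAcceptsNewFile stands in for checking the new file against the checkpoint baseline:

```java
// Sketch of the startup decision over the Storage GC rollback journal (C07.6).
public final class GcJournalRecovery
{
    public enum State { INDEX_SWITCHING, SWAPPED_NOT_CHECKPOINTED, CHECKPOINTED, ABORTED }

    public enum Decision { ROLL_BACK, PROMOTE, KEEP }

    public static Decision onStartup(State s, boolean baselineAcceptsNewFile)
    {
        switch (s)
        {
            case INDEX_SWITCHING:
                return Decision.ROLL_BACK; // primary switch never completed
            case SWAPPED_NOT_CHECKPOINTED:
                return baselineAcceptsNewFile ? Decision.PROMOTE : Decision.ROLL_BACK;
            case CHECKPOINTED:
            case ABORTED:
            default:
                return Decision.KEEP;      // nothing left to undo
        }
    }
}
```

A missing journal while the primary index already points at a non-baseline newRowId cannot be decided here; per C07.6 that case is FAILED.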

6. CDC ↔ Retina protocol

New RPCs on the Retina side:

  • GetRetinaStatus returns lifecycle state, recoveryAttemptToken (CSPRNG, in-memory only), checkpointId, recoveryEpoch (diagnostic), replayTsReady;
  • GetRecoveryReplayTs(token, checkpointId, mode=VNODE|NODE) returns the replay starting points;
  • MarkReady(token, checkpointId) finishes the recovery cycle.

Recovery replay writes carry an optional RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode). In RECOVERING, Retina accepts only requests matching the current context with timestamp >= replayFromTs (per the chosen mode). Standalone INSERTs are protocol errors; CDC must always emit DELETE + INSERT.

Multi-stream requests are acked all-or-nothing at the request level, but request-level failure does not imply physical rollback. Already-applied prefix effects must converge idempotently on retry; otherwise Retina fails closed.

MarkReady performs an internal barrier:

  1. validate recoveryAttemptToken and checkpointId against the current attempt;
  2. close the recovery-replay write entry inside RECOVERING;
  3. drain in-flight replay requests;
  4. switch to READY;
  5. fire the previously dormant hooks (start GC scheduler, invalidate planner/cache, unblock queries).
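The five steps above can be sketched as a minimal barrier; all names are assumptions, the busy-wait drain only keeps the sketch short, and a real implementation would make the entry check and counter increment atomic:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the MarkReady barrier ordering (§6).
public class MarkReadyBarrier
{
    public enum LifecycleState { RECOVERING, READY, FAILED }

    private final String attemptToken;
    private final long checkpointId;
    private final AtomicInteger inFlightReplay = new AtomicInteger();
    private volatile boolean replayEntryOpen = true;
    private volatile LifecycleState state = LifecycleState.RECOVERING;

    public MarkReadyBarrier(String attemptToken, long checkpointId)
    {
        this.attemptToken = attemptToken;
        this.checkpointId = checkpointId;
    }

    /** A replay write must enter before applying and exit after acking. */
    public boolean tryEnterReplay()
    {
        if (!replayEntryOpen)
        {
            return false;
        }
        inFlightReplay.incrementAndGet();
        return true;
    }

    public void exitReplay()
    {
        inFlightReplay.decrementAndGet();
    }

    public boolean markReady(String token, long ckptId, Runnable readyHooks)
    {
        if (!attemptToken.equals(token) || checkpointId != ckptId)
        {
            return false;                // 1. validate against the current attempt
        }
        replayEntryOpen = false;         // 2. close the replay write entry
        while (inFlightReplay.get() > 0) // 3. drain in-flight replay requests
        {
            Thread.onSpinWait();
        }
        state = LifecycleState.READY;    // 4. the only edge to READY
        readyHooks.run();                // 5. GC scheduler, cache invalidation, query unblock
        return true;
    }

    public LifecycleState state() { return state; }
}
```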

CDC unilateral failures never push Retina into RECOVERING; backlog catchup happens entirely under READY with the same DELETE + INSERT idempotent encoding.

7. Query gate

The query gate's authoritative boundary is TransServiceImpl.beginTrans / beginTransBatch for readOnly = true. With retina.enable = true:

  • Expected Retina set comes from $PIXELS_HOME/etc/retina;
  • TransServer maintains a local QueryAvailabilitySnapshot from the static expected membership and a watch on /pixels/retina/lifecycle/<host:retinaPort>;
  • Read-only transactions are allowed only when every expected node is present in the snapshot and is READY. Anything else — RECOVERING, FAILED, missing, stale, malformed, or watch-not-yet-initialized — fails closed.
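The admission rule above can be sketched as a pure predicate over the static expected set and the watched lifecycle snapshot; the shapes and the state string are assumptions:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the read-only admission rule (§7): every statically expected
// Retina node must be present and READY, and the lifecycle watch must have
// completed its first sync; anything else fails closed.
public final class QueryAvailabilityGate
{
    public static boolean allowReadOnly(Set<String> expectedNodes,
                                        Map<String, String> lifecycleSnapshot,
                                        boolean watchInitialized)
    {
        if (!watchInitialized)
        {
            return false; // fail closed before the first watch synchronization
        }
        for (String node : expectedNodes)
        {
            if (!"READY".equals(lifecycleSnapshot.get(node)))
            {
                return false; // missing, RECOVERING, FAILED, or malformed state
            }
        }
        return true;
    }
}
```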

Planner / cache / Trino connector / Turbo planner must produce file lists, splits, cache fills, or scan inputs only under a successful gated read-only transaction. Bypassing the gate to call metadata enumeration is a recovery-correctness violation, not an ordinary protocol bug.

8. Failure scenarios covered

  • CDC-only crash → Retina stays READY; CDC replays under normal backlog catchup.
  • Retina-only crash → Retina re-enters RECOVERING, cleanses checkpoint, exposes replay timestamps, waits for CDC MarkReady.
  • Both crash → both recover independently and rendezvous via GetRetinaStatus.
  • Crash during recovery → next start re-enters RECOVERING; all cleanup steps are idempotent.
  • Crash during checkpoint write → unswitched body objects are ignored; the previous valid current is used.
  • Crash during ingest publish → file stays non-REGULAR and is redone by CDC replay.
  • Crash during Storage GC → the rollback journal + checkpoint baseline acceptance decide rollback vs keep.
  • Crash before/after ack → covered by idempotent write paths and CDC re-send.

9. Invariants and acceptance

The acceptance section in the design enumerates ~50 invariants. Highlights:

  • No queries can read Retina data in RECOVERING or FAILED.
  • MarkReady barrier is the only edge where queries are unblocked.
  • getRegularFiles(pathId) only returns REGULAR after READY; non-REGULAR files are unreachable from query paths.
  • Replay timestamps are always derived; no single persistent replay field exists in the checkpoint header.
  • CDC replace is the only idempotent write encoding for source INSERT/UPDATE/UPSERT during recovery and backlog catchup.
  • READY does not provide combined atomic scan visibility for CDC replace — short windows where DELETE is applied but INSERT is not (or vice versa) are normal live freshness, not bugs.
  • Phase 4 (lifecycle / gate) and Phase 5 (MarkReady barrier) ship as a single closed loop; no production binary in between is allowed to expose READY.

@gengdy1545 gengdy1545 added this to the Real-time CRUD milestone May 12, 2026
@gengdy1545 gengdy1545 self-assigned this May 12, 2026
@gengdy1545 gengdy1545 added the enhancement New feature or request label May 12, 2026
Successfully merging this pull request may close these issues.

[pixels-retina] comprehensive fault recovery protocol must be implemented