[Issue #1327] pixels retina recovery protocol (#1328)
gengdy1545 wants to merge 3 commits into
Pixels Retina Recovery Protocol
Summary
This PR introduces a recovery protocol for Pixels Retina that defines how the system recovers internal consistency after process crashes, machine restarts, CDC crashes, or simultaneous CDC + Retina crashes. On startup, Retina first cleanses the RG-visibility snapshot from the latest recovery checkpoint into a self-consistent internal recovery starting point; CDC then replays the source-side events that were dropped, starting from the recovery replay timestamp derived from that checkpoint. The system finally converges to the pre-crash state under `READY`. Queries are fully fail-closed during `RECOVERING` and only become available after a successful `MarkReady` barrier.

This document describes the final recovery-capable design: there is no online dual-track mode and no fallback to the legacy recovery path. A one-shot schema/catalog/bootstrap upgrade is allowed during rollout, but the runtime code after the upgrade must obey the semantics defined here.
Task List
The implementation is organized as a sequence of commit-level tasks. The numbering only encodes dependency and suggested merge order, not priority. No intermediate commit is allowed to let an unfinished recovery-capable path enter a queryable `READY`.

C00 — Baseline hardening of the current code
- `MetadataService.addFiles / updateFile / deleteFiles` return `true` on success; treat `false`/exception as publish-barrier failure in all callers
- `MainIndexBuffer` snapshot → SQLite transaction commit → drop buffer/cache only after commit; `row_id_ranges` writes idempotent on retry
- Fix the `PixelsWriteBuffer.addRow` RowLocation race (capture `currentMemTable` at append time)
- Stop `FileWriterManager.finish()` from publishing `REGULAR` directly; centralize publish in `PixelsWriteBuffer`
- Change `FILE_TYPE <> 0` to `FILE_TYPE = REGULAR`

C01 — Metadata schema and enumeration APIs
- Extend file states to `TEMPORARY_INGEST / TEMPORARY_GC / REGULAR / RETIRED` and add `FILE_CLEANUP_AT`
- New per-file metadata: `tableId`, `virtualNodeId`, `ingestSeq`/`firstBlockId`, `fileMinCommitTs`, `fileMaxCommitTs`, `rowIdStart`/`rowCount`
- New enumeration APIs: `getRegularFiles`, `listFilesByType`, `listTemporaryFilesBefore`, `listRetiredFilesBefore`, `deleteTemporaryFiles`, `deleteRetiredFiles`
- `getFiles(pathId)`: recovery-capable startup must not auto-fill the baseline via `getRegularFiles`
- Add `atomicSwapFiles(newFileId, oldFileIds, cleanupAt)` in a single metadata transaction
- Migrate `TEMPORARY` to `TEMPORARY_INGEST`/`TEMPORARY_GC`

C02 — Ingest publish ordering and write-buffer prefix visibility
- `FileWriterManager.finish()` only does object-block flush, `writer.close()`, and the footer/length/checksum check
- Record `fileMinCommitTs / fileMaxCommitTs` at the write boundary; footer hidden-timestamp stats are audit-only
- Add `AppendSegmentState.visibleSize`, a batch-level append handle, and `publishPendingAppend(handle, visible|hidden)`
- Order publication by `firstBlockId`: out-of-order physical close / index flush, but in-order `TEMPORARY_INGEST → REGULAR` publish
- `PixelsWriteBuffer`: physical close → `LocalIndexService.flushIndexEntriesOfFile` → atomic metadata update (file type + rowId range + commit-ts range) → object cleanup
- Route all `PixelsWriteBuffer.close()` paths through the same publisher

C03 — LocalIndexService staging and Visibility replay semantics
- Staged primitives: `resolvePrimary`, `putMainIndexEntries`, `putPrimaryEntriesOnly`, `tombstonePrimaryResolved`, `updatePrimaryResolved`, `restorePrimaryEntries`, `deleteMainIndexRange`
- `resolvePrimary` returns the tri-state `FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR` against the current baseline visible file set
- DELETE order: `resolvePrimary` → Visibility delete → `tombstonePrimaryResolved`
- INSERT order: append pending → `putMainIndexEntries` → `putPrimaryEntriesOnly` → `publishPendingAppend(visible)`; the failure path compensates via Visibility delete + primary tombstone + publish-hidden, or fails closed
- Assert `UpdateDataIndexKey.timestamp == TableUpdateData.timestamp` and use it everywhere
- `RGVisibility` / JNI / native `TileVisibility`: fork DELETE by `T <= baseTimestamp` (COW fold into `baseBitmap`) vs `T > baseTimestamp` (append deletion chain)

C04 — Recovery checkpoint and startup cleansing
- Checkpoint header (`retinaNodeId`, `checkpointId`, `writerEpoch`, `writeTime`, `checkpointAppliedTs`, `topologyHash`) + `fileEntries[]` / `ingestSegmentEntries[]` / `rgEntries[]`
- `current/previous` slots under `/pixels/retina/recovery/checkpoint/${encodedNodeId}/`; publish order: write body + fsync → atomic etcd swap → async cleanup of the replaced body
- The writer sets `checkpointAppliedTs = HWM - 1`, then dumps the RGVisibilityIndex + checkpoint-admitted REGULAR catalog metadata + the ingest segment snapshot
- Startup validates `retinaNodeId` / `topologyHash` / `checkpointAppliedTs`
- Cleanse every `FileEntry / VisibilityEntry / IngestSegmentEntry` against the catalog; rebuild the RGVisibilityIndex and MainIndex baseline; produce the baseline visible file set
- Derive `scopeReplayFromTs / vnodeReplayFromTs / nodeReplayFromTs` per §3.4; degrade to `MIN_REPLAY_TS` when the segment chain is untrustworthy
- Mark orphan `REGULAR` files (not covered by the checkpoint, not protected by the Storage GC journal) as `RETIRED` and enqueue cleanup; `FAILED` if there is no checkpoint but the catalog has `REGULAR` files

C05 — Lifecycle, query gate and startup gating
- `RetinaLifecycleState` and a lifecycle coordinator publishing `RECOVERING / READY / FAILED` to a leased `/pixels/retina/lifecycle/<host:retinaPort>` key
- Replace the `RetinaServerImpl` constructor logic with the recovery coordinator; remove metadata full-preload from the recovery-capable path
- Move the `RetinaResourceManager` constructor into the lifecycle `READY` hook
- Gate `QueryVisibility / GetWriteBuffer / RegisterOffload / UpdateRecord / StreamUpdateRecord / AddVisibility / AddWriteBuffer` on the lifecycle state
- Add the `QueryAvailabilityGate` to `TransServiceImpl.beginTrans / beginTransBatch` for `readOnly=true`
- `markReady(...)` skeleton; production paths must never observe `READY` until C06

C06 — CDC recovery replay and MarkReady barrier
- Extend `proto/retina.proto` with `GetRetinaStatus / GetRecoveryReplayTs / MarkReady`, the state enum, and `recoveryAttemptToken / recoveryEpoch / checkpointId / replayTsReady / vnodeReplayFromTs / nodeReplayFromTs`
- Add `RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode)` to `UpdateRecordRequest`
- In `RECOVERING`, accept only DELETE / CDC-replace requests with a matching context and `timestamp >= replayFromTs`; reject standalone INSERT
- The `MarkReady` barrier, `markReady(recoveryAttemptToken, checkpointId)`: validate → close entry → drain in-flight → switch to `READY` → start GC scheduler → invalidate split/cache → unblock queries
- Production paths reach `READY` only through the `MarkReady` path; no shortcut allowed

C07 — Storage GC recovery
- GC rewrite files use `TEMPORARY_GC`; candidate scan via `getRegularFiles`; rewrite cutoff = `TransService.getSafeGcTimestamp()`
- Fold `delete_ts <= safeGcTs`; the new file's Visibility is initialized with `baseTimestamp = safeGcTs` and chain items for `delete_ts > safeGcTs`
- Primary index updates go through `LocalIndexService`
- After `atomicSwapFiles`, the journal becomes `SWAPPED_NOT_CHECKPOINTED`; promote to `CHECKPOINTED` only after a durable recovery checkpoint baseline accepts the new file
- On recovery: `INDEX_SWITCHING` rolls back, `SWAPPED_NOT_CHECKPOINTED` either promotes or rolls back depending on checkpoint baseline acceptance, and a missing journal with the primary pointing to a non-baseline newRowId is `FAILED`

C08 — Background cleanup, testing and operations
- Scan `listTemporaryFilesBefore` for long-hanging temporary files
- `listRetiredFilesBefore(now, limit)`: clean old Visibility → old MainIndex range → physical file → retired catalog entry, in that order
- Test and operations coverage for `recoveryAttemptToken`

Detailed Design
1. Goals and boundaries
The recovery protocol covers four crash scenarios: a Retina process crash, a machine restart, a CDC crash, and a simultaneous CDC + Retina crash.
After Retina restarts, it loads the latest valid recovery checkpoint, cleanses its RG-visibility snapshot against the catalog, and rebuilds an internal recovery starting point. CDC then replays source events from the replay timestamp derived from that checkpoint, with inclusive semantics. Once CDC has acknowledged completion via `MarkReady`, Retina runs an internal barrier that closes the replay write entry, drains in-flight replay requests, and switches to `READY`: the only edge where queries become visible.

Recovery is bound to a fixed Retina topology. The expected node set comes from the static configuration `$PIXELS_HOME/etc/retina`, and each checkpoint records a `topologyHash`. Within one recovery attempt, the expected Retina set, `retina.server.port`, `node.virtual.num`, and the vnode-to-Retina mapping must stay constant; otherwise the attempt fails closed or restarts.

`READY` only guarantees per-effect visibility (a single INSERT or a single DELETE). CDC replace's DELETE + INSERT pair does not provide combined atomic scan visibility; the short-lived intermediate state (DELETE applied, INSERT not yet, or vice versa) is normal `READY` freshness behavior, not a recovery-correctness violation.

2. Lifecycle and file states
Retina has three external lifecycle states:
- `RECOVERING`
- `READY`
- `FAILED`

`RECOVERING` is the only externally visible recovery state. Internally it is split into checkpoint cleansing, waiting for CDC recovery replay, and the `MarkReady` barrier. Even after cleansing completes and the replay timestamp is computed, queries remain rejected; otherwise the same read timestamp could observe different snapshots as replay progresses.

File states in the catalog are extended to four:
- `REGULAR` — published, query-visible data file;
- `TEMPORARY_INGEST` — pre-allocated or in-progress ingest file;
- `TEMPORARY_GC` — Storage GC rewrite file, governed by the GC journal;
- `RETIRED` — an old `REGULAR` retired by GC swap or recovery cleansing, with `FILE_CLEANUP_AT` driving delayed cleanup.
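For orientation, the two state spaces can be read as plain Java enums. This is a minimal sketch; the class name and comments below are illustrative, not the actual Pixels types:

```java
/**
 * Sketch of the two state spaces described above. Names of the containing
 * class are illustrative; only the enum constants come from the design.
 */
public final class RecoveryStates
{
    /** External lifecycle states published to /pixels/retina/lifecycle/<host:retinaPort>. */
    public enum RetinaLifecycleState
    {
        RECOVERING, // fail-closed; internally: cleansing -> awaiting CDC replay -> MarkReady barrier
        READY,      // the only state in which queries are served
        FAILED      // fail-closed; requires operator intervention
    }

    /** Catalog file states; only REGULAR is query-visible. */
    public enum FileType
    {
        REGULAR,          // published, query-visible data file
        TEMPORARY_INGEST, // pre-allocated or in-progress ingest file
        TEMPORARY_GC,     // Storage GC rewrite file, governed by the GC journal
        RETIRED           // retired by GC swap or recovery cleansing; FILE_CLEANUP_AT drives cleanup
    }

    private RecoveryStates() { }
}
```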
Query-visible enumeration is done only through `getRegularFiles(pathId)`; non-`REGULAR` files are reachable only through typed/admin APIs (e.g. `listFilesByType`, `listRetiredFilesBefore`).

3. Recovery checkpoint
A recovery checkpoint is not a query baseline — it is the only source of truth Retina uses to bootstrap its internal recovery state. The contract is:
Each checkpoint body captures, for one Retina node:
- the `RGVisibilityIndex` snapshot (`(fileId, rgId) -> bitmap`);
- `checkpointAppliedTs = HWM - 1` (the visibility-applied cut for DELETE / UPDATE-old-row);
- the per-`(tableId, virtualNodeId)` ingest segment chain (REGULAR-admitted, pending, open);
- `topologyHash`, `retinaNodeId`, `writerEpoch` / `leaseId`, `checksum`, `length`.
The commit protocol uses an immutable body object plus a per-node etcd two-slot pointer: write the body + fsync, then atomically swap `current/previous` in a single etcd transaction. Recovery only reads its own node's `current` and then `previous`; it never lists storage directly.

`checkpointAppliedTs` requires a co-design with the write path: a write transaction may only enter the committed state (and be covered by the HWM) after Retina has synchronously completed its visibility / index / write-buffer effects and acked CDC. This invariant is what makes `HWM - 1` safe to use directly.
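A minimal sketch of the two-slot pointer swap, assuming the jetcd client; the key-layout helper, the compared value, and the body-ID parameters are illustrative, and writer-epoch/lease fencing is elided:

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;
import io.etcd.jetcd.op.Cmp;
import io.etcd.jetcd.op.CmpTarget;
import io.etcd.jetcd.op.Op;
import io.etcd.jetcd.options.PutOption;
import java.nio.charset.StandardCharsets;

public class CheckpointPointerSwap
{
    private static ByteSequence bs(String s)
    {
        return ByteSequence.from(s, StandardCharsets.UTF_8);
    }

    /**
     * Publish a new checkpoint body. The body object was already written and
     * fsync'ed by the caller; here we atomically move current -> previous and
     * point current at the new body in one etcd transaction.
     */
    public static void swap(Client etcd, String encodedNodeId,
                            String oldBodyId, String newBodyId) throws Exception
    {
        String prefix = "/pixels/retina/recovery/checkpoint/" + encodedNodeId + "/";
        KV kv = etcd.getKVClient();
        kv.txn()
          // Guard: only swap if 'current' still points at the body we read.
          .If(new Cmp(bs(prefix + "current"), Cmp.Op.EQUAL, CmpTarget.value(bs(oldBodyId))))
          .Then(Op.put(bs(prefix + "previous"), bs(oldBodyId), PutOption.DEFAULT),
                Op.put(bs(prefix + "current"), bs(newBodyId), PutOption.DEFAULT))
          .commit()
          .get(); // the body replaced out of 'previous' is cleaned up asynchronously
    }
}
```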
Recovery replay timestamps are derived, not stored: `earliestUnsafeInsertTs` is the `minCommitTs` of the first non-empty pending/open ingest segment after the last checkpoint-admitted REGULAR segment in that scope. All-DELETE scopes produce `+INF` here, so the replay starts from `checkpointAppliedTs` and never falls back to 0. Untrustworthy segment chains degrade to a conservative `MIN_REPLAY_TS = 0` — only ever increasing replay, never reducing it.
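Read as code, one plausible form of that derivation is the sketch below. The segment accessors and the exact way `earliestUnsafeInsertTs` combines with `checkpointAppliedTs` are my reading of §3.4, not the implementation:

```java
import java.util.List;

final class ReplayTsDerivation
{
    static final long MIN_REPLAY_TS = 0L;        // conservative floor
    static final long PLUS_INF = Long.MAX_VALUE; // "no unsafe insert in this scope"

    interface IngestSegment
    {
        boolean isEmpty();
        long minCommitTs();
    }

    /** Per-(tableId, virtualNodeId) scope; the result is an inclusive replay start. */
    static long scopeReplayFromTs(long checkpointAppliedTs,
                                  List<IngestSegment> pendingOrOpenAfterLastAdmitted,
                                  boolean segmentChainTrustworthy)
    {
        if (!segmentChainTrustworthy)
        {
            return MIN_REPLAY_TS; // degrade: only ever replay more, never less
        }
        long earliestUnsafeInsertTs = pendingOrOpenAfterLastAdmitted.stream()
                .filter(s -> !s.isEmpty())
                .mapToLong(IngestSegment::minCommitTs)
                .min().orElse(PLUS_INF);
        // All-DELETE scopes yield +INF, so the replay starts from
        // checkpointAppliedTs and never falls back to 0.
        return Math.min(earliestUnsafeInsertTs, checkpointAppliedTs);
    }

    /** NODE mode: the minimum over all vnode scopes on this node. */
    static long nodeReplayFromTs(long[] vnodeReplayFromTs)
    {
        long min = PLUS_INF;
        for (long ts : vnodeReplayFromTs)
        {
            min = Math.min(min, ts);
        }
        return min;
    }
}
```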
CDC consumes either VNODE mode (`vnodeReplayFromTs[v]`) or NODE mode (`nodeReplayFromTs`). Both are inclusive.

4. Cleansing rules
For every entry in the loaded checkpoint body:
- `fileId` not in the catalog → discard the entry;
- `fileId` exists but `FILE_TYPE != REGULAR` → discard the entry;
- `fileId` is `REGULAR` but `recordNum / rowIdStart / rowCount` mismatch → discard the entry;
- `fileId` is `REGULAR` and matches → keep the entry.
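A compact sketch of that decision; the record shapes and field names are hypothetical, and only the branch structure mirrors the rules above:

```java
import java.util.Optional;

final class CheckpointCleanser
{
    enum Decision { DISCARD_ENTRY, KEEP_ENTRY }

    // Hypothetical shapes for the catalog row and the checkpoint file entry.
    record CatalogFile(String fileType, long recordNum, long rowIdStart, long rowCount) { }
    record FileEntry(long fileId, long recordNum, long rowIdStart, long rowCount) { }

    static Decision cleanse(FileEntry entry, Optional<CatalogFile> catalog)
    {
        if (catalog.isEmpty())
        {
            return Decision.DISCARD_ENTRY;  // fileId not in the catalog
        }
        CatalogFile f = catalog.get();
        if (!"REGULAR".equals(f.fileType()))
        {
            return Decision.DISCARD_ENTRY;  // FILE_TYPE != REGULAR
        }
        if (f.recordNum() != entry.recordNum()
                || f.rowIdStart() != entry.rowIdStart()
                || f.rowCount() != entry.rowCount())
        {
            return Decision.DISCARD_ENTRY;  // stats mismatch against the catalog
        }
        return Decision.KEEP_ENTRY;         // REGULAR and matching
    }
}
```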
Catalog files that are `REGULAR` but have no entry in the checkpoint body and are not protected by the Storage GC journal must be atomically marked `RETIRED` (`FILE_CLEANUP_AT = now`) before `READY`; their source events are redone by CDC replay. This avoids long-lived catalog-only `REGULAR` orphans being picked up by `getRegularFiles`.

Discarding individual entries does not invalidate the checkpoint. Only header / checksum / format errors invalidate the entire body and force a fallback to `previous`.
5. Subsystem recovery
Visibility. Each kept entry rebuilds an `RGVisibility` with `baseTimestamp` and `baseBitmap` taken from the body and an empty deletion chain. CDC replay DELETEs are forked at the apply path: `T <= baseTimestamp` → COW fold into `baseBitmap`, no chain item; `T > baseTimestamp` → append to the deletion chain at the original `T`. The replay start time has no required ordering with `baseTimestamp`.
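As a sketch, with `java.util.BitSet` standing in for the real bitmap type and a hypothetical `DeletionChain`, the fork looks like this:

```java
import java.util.BitSet;

/**
 * Sketch of the DELETE fork at the Visibility apply path. BitSet stands in
 * for the real native bitmap, and DeletionChain is a hypothetical stand-in.
 */
final class RGVisibilitySketch
{
    private final long baseTimestamp;
    private volatile BitSet baseBitmap;        // rows deleted at or before baseTimestamp
    private final DeletionChain deletionChain; // (timestamp, row) items after baseTimestamp

    RGVisibilitySketch(long baseTimestamp, BitSet baseBitmap, DeletionChain chain)
    {
        this.baseTimestamp = baseTimestamp;
        this.baseBitmap = baseBitmap;
        this.deletionChain = chain;
    }

    /** Apply a replayed DELETE carrying its original timestamp t. */
    void applyReplayDelete(int rowOffsetInRg, long t)
    {
        if (t <= baseTimestamp)
        {
            // Copy-on-write fold into the base bitmap; no deletion-chain item,
            // so readers at ts >= baseTimestamp simply see the row as deleted.
            BitSet copy = (BitSet) baseBitmap.clone();
            copy.set(rowOffsetInRg);
            baseBitmap = copy;
        }
        else
        {
            // Keep the original timestamp so readers between baseTimestamp
            // and t still see the row.
            deletionChain.append(t, rowOffsetInRg);
        }
    }

    interface DeletionChain
    {
        void append(long timestamp, int rowOffsetInRg);
    }
}
```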
WriteBuffer. Memtables are dropped; pre-allocated `TEMPORARY_INGEST` objects are best-effort deleted; new empty buffers are created. To cleanly handle in-flight appends across crashes, every memtable carries an `AppendSegmentState` separating the physical `rowBatch.size` from a query-visible, monotonically increasing `visibleSize`. `GetWriteBuffer` and object flush only consume the `[0, visibleSize)` prefix; failed appends are compensated via Visibility delete + primary tombstone and then `publishPendingAppend(handle, hidden)`, or the writer fails closed.
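A minimal sketch of that prefix contract; the names are illustrative, and it assumes batches publish in append order:

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the prefix-visibility contract on a memtable: readers and the
 * flusher only ever see [0, visibleSize), which grows monotonically.
 */
final class AppendSegmentStateSketch
{
    private final AtomicLong visibleSize = new AtomicLong(0);  // query-visible prefix length
    private final AtomicLong physicalSize = new AtomicLong(0); // rowBatch.size may run ahead

    /** Reserve space for a batch; the returned start offset acts as the handle. */
    long beginAppend(int rows)
    {
        return physicalSize.getAndAdd(rows);
    }

    /** publishPendingAppend(handle, visible); assumes batches publish in append order. */
    boolean publishVisible(long handleStart, int rows)
    {
        // Advances only when this batch is exactly the next unpublished prefix,
        // keeping visibleSize a gap-free, monotonically increasing cut.
        return visibleSize.compareAndSet(handleStart, handleStart + rows);
    }

    /** publishPendingAppend(handle, hidden): compensation already ran upstream
        (Visibility delete + primary tombstone); the rows stay invisible. */
    void publishHidden(long handleStart, int rows)
    {
        // Intentionally no change to visibleSize.
    }

    /** GetWriteBuffer and object flush read only this prefix. */
    long visiblePrefix()
    {
        return visibleSize.get();
    }
}
```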
Ingest publish. A file becomes `REGULAR` only after physical close + footer/length/checksum check + MainIndex durable flush + becoming `nextCommitFirstBlockId` in its stream. The catalog metadata update atomically persists `FILE_TYPE = REGULAR`, `rowIdStart / rowCount`, and the append-order `fileMinCommitTs / fileMaxCommitTs`. The conservative replay rule is that a file is `REGULAR_ADMITTED` only after a durable recovery checkpoint has captured it; otherwise its source events are redone by replay.
Index. Recovery rebuilds the MainIndex baseline from the kept `fileId` set using the catalog `rowIdStart / rowCount`. All write paths — normal writes, recovery replay, Storage GC — go through `LocalIndexService` only. The service exposes staged primitives (`resolvePrimary`, `putMainIndexEntries`, `putPrimaryEntriesOnly`, `tombstonePrimaryResolved`, `updatePrimaryResolved`, `restorePrimaryEntries`, `deleteMainIndexRange`). `resolvePrimary` returns a tri-state `FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR`, where `NOT_FOUND_OR_ORPHAN` covers missing keys, primary tombstones, and RowLocations into non-baseline / retired / cleansed-out files. Only `BACKEND_ERROR` may fail the request. The secondary index is explicitly out of scope for recovery correctness at this stage.

Unified write order.
- DELETE: `resolvePrimary` → Visibility delete → `tombstonePrimaryResolved`
- INSERT: append pending → `putMainIndexEntries` → `putPrimaryEntriesOnly` → `publishPendingAppend(visible)`
CDC replace = DELETE + INSERT at the same source timestamp `T`, in the same all-or-nothing request, in the same stream, DELETE first.
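As a sketch under assumed interfaces — none of the stand-in types below are the real Pixels signatures — a replayed replace walks the DELETE leg and then the INSERT leg of the unified order:

```java
final class ReplaceReplaySketch
{
    enum Resolve { FOUND, NOT_FOUND_OR_ORPHAN, BACKEND_ERROR }

    // Hypothetical stand-ins for the staged primitives named above.
    interface IndexService
    {
        Resolve resolvePrimary(byte[] key);
        void tombstonePrimaryResolved(byte[] key, long ts);
        void putMainIndexEntries(long appendHandle);
        void putPrimaryEntriesOnly(byte[] key, long appendHandle);
    }

    interface Visibility { void delete(byte[] key, long ts); }

    interface WriteBuffer
    {
        long appendPending(byte[] row, long ts);
        void publishPendingAppend(long handle, boolean visible);
    }

    static void replayReplace(IndexService index, Visibility visibility, WriteBuffer buffer,
                              byte[] key, byte[] newRow, long t)
    {
        // DELETE leg first, as required by the stream ordering.
        switch (index.resolvePrimary(key))
        {
            case FOUND:
                visibility.delete(key, t);              // Visibility delete at T
                index.tombstonePrimaryResolved(key, t); // then tombstone the primary
                break;
            case NOT_FOUND_OR_ORPHAN:
                break;                                  // idempotent replay: nothing to delete
            case BACKEND_ERROR:
                throw new IllegalStateException("fail closed on backend error");
        }
        // INSERT leg, following the unified INSERT order.
        long handle = buffer.appendPending(newRow, t);
        index.putMainIndexEntries(handle);
        index.putPrimaryEntriesOnly(key, handle);
        buffer.publishPendingAppend(handle, true);      // publish into the visible prefix
    }
}
```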
Storage GC. The candidate scan and rewrite cutoff use the runtime `TransService.getSafeGcTimestamp()`, decoupled from the recovery checkpoint. New-file Visibility starts at `baseTimestamp = safeGcTs`. Storage GC writes a write-ahead rollback journal (`INDEX_SWITCHING / SWAPPED_NOT_CHECKPOINTED / CHECKPOINTED / ABORTED`) before the primary switch. After `atomicSwapFiles(newFileId, oldFileIds, cleanupAt)`, the new file is online but the journal stays at `SWAPPED_NOT_CHECKPOINTED` until a durable recovery checkpoint baseline accepts it. Recovery decides per task: keep + advance, roll back to the old-file baseline, or fail closed if rollback anchors are unavailable. The journal states and that per-task decision are sketched below.
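A sketch of the journal states and the recovery decision table; the method and parameter names are illustrative:

```java
final class GcJournalSketch
{
    enum JournalState
    {
        INDEX_SWITCHING,          // primary switch in progress
        SWAPPED_NOT_CHECKPOINTED, // atomicSwapFiles done, new file not yet in a durable checkpoint
        CHECKPOINTED,             // a durable recovery checkpoint baseline accepted the new file
        ABORTED                   // the GC task rolled back
    }

    enum RecoveryAction { ROLL_BACK, PROMOTE, FAIL_CLOSED, NONE }

    static RecoveryAction decide(JournalState state,
                                 boolean newFileInCheckpointBaseline,
                                 boolean rollbackAnchorsAvailable)
    {
        switch (state)
        {
            case INDEX_SWITCHING:
                // The switch never completed; undo it, or fail closed without anchors.
                return rollbackAnchorsAvailable ? RecoveryAction.ROLL_BACK
                                                : RecoveryAction.FAIL_CLOSED;
            case SWAPPED_NOT_CHECKPOINTED:
                // Keep + advance only if the checkpoint baseline accepted the new file.
                if (newFileInCheckpointBaseline)
                {
                    return RecoveryAction.PROMOTE;
                }
                return rollbackAnchorsAvailable ? RecoveryAction.ROLL_BACK
                                                : RecoveryAction.FAIL_CLOSED;
            default:
                return RecoveryAction.NONE; // CHECKPOINTED and ABORTED are already final
        }
    }
}
```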
6. CDC ↔ Retina protocol

New RPCs on the Retina side:
- `GetRetinaStatus` returns the lifecycle state, `recoveryAttemptToken` (CSPRNG, in-memory only), `checkpointId`, `recoveryEpoch` (diagnostic), and `replayTsReady`;
- `GetRecoveryReplayTs(token, checkpointId, mode=VNODE|NODE)` returns the replay starting points;
- `MarkReady(token, checkpointId)` finishes the recovery cycle.
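Put together, the CDC side of the handshake can be read as the following sketch; the stub interface and message accessors are placeholders, not the generated proto/retina.proto bindings:

```java
/**
 * Sketch of the CDC-side recovery handshake built on the three RPCs above.
 * RetinaStub and Status are hypothetical stand-ins for the gRPC stub and
 * response messages.
 */
final class CdcRecoveryDriver
{
    interface RetinaStub
    {
        Status getRetinaStatus();
        long getRecoveryReplayTs(String token, long checkpointId, String mode);
        void markReady(String token, long checkpointId);
    }

    record Status(String lifecycleState, String recoveryAttemptToken,
                  long checkpointId, boolean replayTsReady) { }

    void runOnce(RetinaStub retina) throws InterruptedException
    {
        Status s = retina.getRetinaStatus();
        if (!"RECOVERING".equals(s.lifecycleState()))
        {
            return; // READY: normal backlog catchup; FAILED: operator action
        }
        while (!s.replayTsReady())
        {
            Thread.sleep(200);            // wait for checkpoint cleansing to finish
            s = retina.getRetinaStatus();
        }
        long replayFromTs = retina.getRecoveryReplayTs(
                s.recoveryAttemptToken(), s.checkpointId(), "NODE");
        replaySourceEvents(replayFromTs); // inclusive; DELETE + INSERT encoding only
        retina.markReady(s.recoveryAttemptToken(), s.checkpointId());
    }

    private void replaySourceEvents(long fromTsInclusive) { /* elided */ }
}
```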
Recovery replay writes carry an optional `RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode)`. In `RECOVERING`, Retina accepts only requests matching the current context with `timestamp >= replayFromTs` (per the chosen mode). Standalone INSERTs are protocol errors; CDC must always emit DELETE + INSERT.
MarkReadyperforms an internal barrier:recoveryAttemptTokenandcheckpointIdagainst the current attempt;RECOVERING;READY;CDC unilateral failures never push Retina into
CDC unilateral failures never push Retina into `RECOVERING`; backlog catchup happens entirely under `READY` with the same DELETE + INSERT idempotent encoding.

7. Query gate
The query gate's authoritative boundary is `TransServiceImpl.beginTrans / beginTransBatch` for `readOnly = true`. With `retina.enable = true`:

- the expected Retina membership is read from the static `$PIXELS_HOME/etc/retina`;
- a `QueryAvailabilitySnapshot` is built from the static expected membership and a watch on `/pixels/retina/lifecycle/<host:retinaPort>`;
- a read-only transaction is admitted only when every expected node is `READY`. Anything else — `RECOVERING`, `FAILED`, missing, stale, malformed, or watch-not-yet-initialized — fails closed.

Planner / cache / Trino connector / Turbo planner must produce file lists, splits, cache fills, or scan inputs only under a successful gated read-only transaction. Bypassing the gate to call metadata enumeration is a recovery-correctness violation, not an ordinary protocol bug.
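A sketch of that fail-closed decision; the snapshot shape is assumed from the description above, not taken from the implementation:

```java
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the gate decision at beginTrans(readOnly = true): everything
 * that is not a positively verified READY fails closed.
 */
final class QueryAvailabilityGateSketch
{
    record Snapshot(boolean watchInitialized,
                    Map<String, String> lifecycleByExpectedNode) { }

    /** Admit a read-only transaction only if every expected node is READY. */
    static boolean admitReadOnly(Snapshot snapshot, Set<String> expectedNodes)
    {
        if (snapshot == null || !snapshot.watchInitialized())
        {
            return false; // watch not yet initialized: fail closed
        }
        for (String node : expectedNodes)
        {
            String state = snapshot.lifecycleByExpectedNode().get(node);
            // Missing, stale, malformed, RECOVERING, FAILED: all fail closed.
            if (!"READY".equals(state))
            {
                return false;
            }
        }
        return true;
    }
}
```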
8. Failure scenarios covered
- CDC crashes alone: Retina stays `READY`; CDC replays under normal backlog catchup.
- Retina crashes or the machine restarts: Retina enters `RECOVERING`, cleanses the checkpoint, exposes replay timestamps, and waits for CDC `MarkReady`.
- CDC and Retina crash together: the restarted CDC rediscovers the in-progress attempt via `GetRetinaStatus`.
- A crash during recovery itself restarts the attempt in `RECOVERING`; all cleanup steps are idempotent.
- A crash during checkpoint publish is harmless: the two-slot swap is atomic, so whichever body validates as `current` is used.
- A file published but not yet captured by a durable checkpoint loses `REGULAR` and is redone by CDC replay.

9. Invariants and acceptance
The acceptance section in the design enumerates ~50 invariants. Highlights:
- No query is served under `RECOVERING` or `FAILED`.
- The `MarkReady` barrier is the only edge where queries are unblocked.
- `getRegularFiles(pathId)` only returns `REGULAR` after `READY`; non-`REGULAR` files are unreachable from query paths.
- `READY` does not provide combined atomic scan visibility for CDC replace — short windows where DELETE is applied but INSERT is not (or vice versa) are normal live freshness, not bugs.
- The recovery-capable pieces (checkpoint, query gate, `MarkReady` barrier) ship as a single closed loop; no production binary in between is allowed to expose `READY`.