Skip to content

Make retried Onyx writes idempotent on storage retry#792

Open
elirangoshen wants to merge 4 commits into
Expensify:mainfrom
callstack-internal:elirangoshen/fix/retry-side-effects-idempotency
Open

Make retried Onyx writes idempotent on storage retry#792
elirangoshen wants to merge 4 commits into
Expensify:mainfrom
callstack-internal:elirangoshen/fix/retry-side-effects-idempotency

Conversation

@elirangoshen
Copy link
Copy Markdown
Contributor

@elirangoshen elirangoshen commented May 25, 2026

Details

Follow-up to #787. The cache-first / storage-second refactor of mergeCollectionWithPatches surfaced two pre-existing bugs that affect every Onyx write path that goes through retryOperation. Both were flagged on the #787 review and left for this PR (review comment):

  1. "new vs existing" routing is recomputed against an already-mutated cache on retry. In mergeCollectionWithPatches, setCollectionWithRetry, and partialSetCollection, the existing/new split is derived from getAllKeys() / cached state. When the first attempt's cache.set / cache.merge has already populated brand-new keys into the cache, the retry attempt reclassifies them as "existing" and routes them through multiMerge instead of multiSet. Benign on IDB (where multiMerge on a missing key behaves like set) but wrong semantics, and the inline comment in mergeCollectionWithPatches explicitly warns this throws on some storage backends.

  2. Subscriber notifications re-fire on every retry attempt. multiSetWithRetry, setCollectionWithRetry, mergeCollectionWithPatches, and partialSetCollection call keyChanged / keysChanged before the storage write. On a failed storage write, retryOperation re-enters the orchestrator, which fires keyChanged / keysChanged again with the same payload. The internal === dedup in keysChanged absorbs most subscribers, but Onyx.connect({ waitForCollectionCallback: true }) subscribers re-fire on every call by contract.

setWithRetry was already safe because broadcastUpdate short-circuits on hasChanged === false and on retry the cache already reflects the new value — that pattern is the design model.

Fix shape

For each of the four affected functions, split the body into:

  • Outer orchestrator (existing public name) — runs cache.set / cache.merge, fires keyChanged / keysChanged, prepares the storage key-value pairs, then delegates. Called exactly once per Onyx.X(...) invocation.
  • File-private write helper (new) — accepts the already-prepared storage pairs as params, issues the storage write, and on rejection calls retryOperation re-entering the write helper, never the orchestrator.

Concretely:

Orchestrator Write helper
mergeCollectionWithPatches persistMergedCollectionWrite
multiSetWithRetry persistMultiSetWrite
setCollectionWithRetry persistCollectionWrite
partialSetCollection persistCollectionWrite (shared — storage tail is identical)

Result: the existing/new split is computed once and captured in the write helper's params (fixes bug 1), subscribers are notified once from the orchestrator (fixes bug 2), and retryOperation's signature is unchanged — it's already generic over RetriableOnyxOperation (extended here with the three new helpers).

Related Issues

No separate GH issue — this PR addresses #787 review comment directly.

Linked E/App PR

Expensify/App#91626

Automated Tests

Added a new retry side-effect idempotency describe block in tests/unit/onyxUtilsTest.ts with 5 cases (one per affected path + one routing-specific assertion):

  1. mergeCollection — new keys still route through multiSet on retry, not downgraded to multiMerge. Seeds an existing member, mocks Storage.multiMerge to reject once then resolve, asserts the new key is never present in any multiMerge call's args.
  2. mergeCollectionwaitForCollectionCallback subscriber fires once across retries. Same setup, asserts subscriber call count is exactly 1.
  3. Onyx.multiSet — collection subscriber fires once across retries.
  4. Onyx.setCollection — collection subscriber fires once across retries.
  5. OnyxUtils.partialSetCollection — collection subscriber fires once across retries.

All 5 fail with Received number of calls: 2 (and the routing test fails differently — multiMerge is called with the new key on retry) if you revert just the lib changes. Confirmed locally via git stash of lib/OnyxUtils.ts lib/types.ts.

All 456 existing unit tests still pass — including the timing-sensitive retryOperation describe block, the mergeCollection cache-first ordering regression tests from #787, and the storage-eviction tests.

Manual Tests

The four refactored paths (mergeCollectionWithPatches, multiSetWithRetry, setCollectionWithRetry, partialSetCollection) are reached from the App via Pusher events, the OpenApp response, every Onyx.update batch that contains a MERGE_COLLECTION / MULTI_SET / SET_COLLECTION op, LHN refresh, Search filters, chat sends, mark-all-as-read, hold/unhold, workspace switching, etc. The companion App PR Expensify/App#91626 pins react-native-onyx to this branch's head for end-to-end exercise; the steps below are the same protocol used for #787.

Setup

  1. In the App repo, check out Expensify/App#91626 (elirangoshen/fix/retry-side-effects-idempotency on callstack-internal/Expensify-App), which bumps package.json's react-native-onyx to this branch's head.
  2. npm install under Node 20.20.0, then npm run web.
  3. Open https://dev.new.expensify.com:8082/ in Chrome with DevTools open.
  4. Sign in to a test account.

Functional smoke — no regression in the refactored paths

The refactor preserves the cache-first invariant from #787, so all of these should look indistinguishable from main to the user:

  1. Initial hydration — after login, LHN reports list and workspace switcher populate within a few seconds. No missing rows vs. a baseline session on main. (exercises mergeCollectionWithPatches via Pusher / OpenApp)
  2. Chat message — open a chat, send a message. Appears immediately (optimistic merge into reportActions_), confirms via Pusher, persists after reload. (mergeCollectionWithPatches)
  3. Mark all as read — click the mark-all-as-read action. All unread badges clear immediately and stay cleared after reload. (mergeCollectionWithPatches)
  4. Search & filter — open Search, apply a filter. Results populate within a few seconds; live updates as filters change; no duplicates. (mergeCollectionWithPatches + setCollectionWithRetry)
  5. Hold / unhold an expense — toggle hold on an expense in a report. Badge appears/clears immediately; persists after reload. (mergeCollectionWithPatches)
  6. Submit expense — FAB → Submit expense. Expense appears in the report immediately, no duplicate, persists after reload. (mergeCollectionWithPatches)
  7. Switch workspaces — switch via the workspace switcher. LHN reports filter to the new workspace within a few seconds. (mergeCollectionWithPatches)

Storage-failure simulation — core regression guard for this PR

This is the test that exercises the retry path the fix targets. With the fix in place, retries on storage failure should produce one subscriber notification per logical operation (not one per retry attempt), and brand-new keys should stay routed through multiSet even when an earlier multiMerge retry kicked in.

  1. With the App running and authenticated, open Chrome DevTools → Application tab → IndexedDB.
  2. Find the database used by Onyx (typically named OnyxDB).
  3. Right-click → Delete database while the App is still running and connected (do not reload).
  4. Immediately trigger a mergeCollection-driven action — switch workspaces, send a chat message, or apply a search filter.
  5. Expected with this fix: the UI updates correctly; the new state is visible to subscribers even though the underlying IDB write fails and retryOperation kicks in. Console shows storage error logs and retry attempts (look for Failed to save to storage. Error: ... retryAttempt: N/5 in Logger.logInfo), but no white screen, no stale UI, no data loss within the session, and no duplicate waitForCollectionCallback subscriber invocations (verify via React DevTools "Highlight updates when components render" or via a console.count instrumented in a temporary useOnyx callback if you want to be precise).
  6. Try the same with a multiSet-driven action (least common — typically a deep-link prep) and a setCollection-driven action (Search filter result population) to cover the other refactored paths.
  7. (Optional regression check) Revert this PR's commits locally and re-run step 4. Without the fix, you'll either see (a) Onyx.connect({waitForCollectionCallback: true}) callbacks firing twice in DevTools traces, or (b) on the routing bug specifically, no observable user-facing difference on IDB but the issue is fixed under the hood for non-IDB backends.

Cold-cache merge — preserves the pre-warm fast path from #787

  1. Sign out and clear site data to force a cold cache.
  2. Sign back in. LHN should populate within the same window as a baseline main session — the cache pre-warm path (existing in mergeCollectionWithPatches) is untouched by this PR.

Author Checklist

  • I linked the correct issue in the ### Related Issues section above
  • I linked the corresponding Expensify/App PR in the ### Linked E/App PR section above, and verified this change against it (E/App CI passed and manual testing completed) (companion PR [No QA] Pin react-native-onyx to PR #792 head for retry-idempotency end-to-end testing App#91626 pins this branch's HEAD for end-to-end exercise)
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability). (pending companion App PR #91626 manual run)
  • I included screenshots or videos for tests on all platforms (pending companion App PR #91626; this PR has no UI surface)
  • I ran the tests on all platforms & verified they passed on:
    • Android / native
    • Android / Chrome
    • iOS / native
    • iOS / Safari
    • MacOS / Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that the left part of a conditional rendering a React component is a boolean and NOT a string, e.g. myBool && <MyComponent />.
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers (orchestrator/write-helper split was suggested in Make mergeCollectionWithPatches cache-first #787 review)
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected) (459/459 unit tests green)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests) (persistCollectionWrite is shared between setCollectionWithRetry and partialSetCollection since their storage tails are byte-identical)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If a new component is created I verified that: (N/A — no new components)
    • A similar component doesn't exist in the codebase (N/A)
    • All props are defined accurately and each prop has a /** comment above it */ (N/A)
    • The file is named correctly (N/A)
    • The component has a clear name that is non-ambiguous and the purpose of the component can be inferred from the name alone (N/A)
    • The only data being stored in the state is data necessary for rendering and nothing else (N/A)
    • If we are not using the full Onyx data that we loaded, I've added the proper selector in order to ensure the component only re-renders when the data it is using changes (N/A)
    • For Class Components, any internal methods passed to components event handlers are bound to this properly so there are no scoping issues (i.e. for onClick={this.submit} the method this.submit should be bound to this in the constructor) (N/A)
    • Any internal methods bound to this are necessary to be bound (i.e. avoid this.submit = this.submit.bind(this); if this.submit is never passed to a component event handler like onClick) (N/A)
    • All JSX used for rendering exists in the render method (N/A)
    • The component has the minimum amount of code necessary for its purpose, and it is broken down into smaller components in order to separate concerns and functions (N/A)
  • If any new file was added I verified that: (N/A — no new files)
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory (N/A)
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases) (pending companion App PR #91626)
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps. (N/A — no merge from main yet)
  • I have checked off every checkbox in the PR author checklist, including those that don't apply to this PR.

Screenshots/Videos

Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari
Screen.Recording.2026-05-26.at.14.41.29.mov
Screen.Recording.2026-05-26.at.14.42.06.mov
Screen.Recording.2026-05-26.at.14.42.27.mov
Screen.Recording.2026-05-26.at.14.43.07.mov
Screen.Recording.2026-05-26.at.14.44.17.mov
Screen.Recording.2026-05-26.at.14.44.48.mov

🤖 Generated with Claude Code

Follow-up to Expensify#787. Splits each retriable collection write into an
orchestrator (cache + subscriber notify + storage prep) and a small
file-private write helper (Storage.multiSet/multiMerge + retryOperation
+ devtools log). retryOperation re-enters the write helper, not the
orchestrator, which fixes two pre-existing bugs flagged in Expensify#787 review:

1. New keys no longer get silently downgraded to multiMerge on retry.
   The existing/new key split was previously re-derived from getAllKeys
   against a cache that the first attempt had already mutated, routing
   brand-new keys through multiMerge instead of multiSet on retry.
   Benign on IDB but wrong semantics and a crash on backends that
   require multiSet for missing keys.

2. keysChanged/keyChanged subscribers (notably waitForCollectionCallback
   ones, which re-fire on every call by contract) are now notified
   exactly once per logical operation rather than once per retry attempt.

Affected paths: mergeCollectionWithPatches, multiSetWithRetry,
setCollectionWithRetry, partialSetCollection. setWithRetry was already
safe via its hasValueChanged guard. partialSetCollection shares the
new persistCollectionWrite helper with setCollectionWithRetry since
their storage tails are identical.

Adds a 'retry side-effect idempotency' test block covering all four
paths. Tests confirmed to fail without the lib changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`npm run build:docs` output for the three new file-private helpers
introduced in the previous commit: persistMultiSetWrite,
persistCollectionWrite, persistMergedCollectionWrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elirangoshen elirangoshen marked this pull request as ready for review May 26, 2026 12:50
@elirangoshen elirangoshen requested a review from a team as a code owner May 26, 2026 12:50
@melvin-bot melvin-bot Bot requested review from dangrous and removed request for a team May 26, 2026 12:50
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de62ae5eae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/OnyxUtils.ts
const {keyValuePairsToStore, newData} = params;

return Storage.multiSet(keyValuePairsToStore)
.catch((error) => OnyxUtils.retryOperation(error, persistMultiSetWrite, params, retryAttempt))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Retry orchestrator after eviction to restore cache state

When Storage.multiSet fails with a capacity error, retryOperation() evicts an LRU key via remove(), which immediately drops that key from cache and notifies subscribers before retrying. Retrying persistMultiSetWrite (and similarly persistCollectionWrite / persistMergedCollectionWrite) only re-attempts the storage write, so if the evicted key is part of the in-flight write, its cache value is never restored and subscribers keep the stale “removed” state even though storage eventually succeeds. This is a regression from the previous behavior where retries re-ran the orchestrator and re-applied cache updates.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 11bc0a6.

Fix: each orchestrator (multiSetWithRetry, setCollectionWithRetry, partialSetCollection, mergeCollectionWithPatches) now constructs a restoreEvictedKey(key) closure that re-applies its cache write + subscriber notification for a single in-flight key. The closure is passed via the write helper's params; retryOperation invokes it after remove(keyForRemoval) and before re-entering the write helper. No-op when the evicted key isn't part of this write (the typical case).

mergeCollection nuance: the closure snapshots the cache's post-merge state (not just the delta) so previously-merged-in fields survive when the eviction's cache.drop forces us to re-populate cache from scratch — otherwise we'd lose fields like {id: 1} when restoring after a mergeCollection({key: {value: 'new'}}) retry.

Regression coverage: added 3 tests under storage eviction — one per write helper — that mock Storage.multiSet/Storage.multiMerge to fail with database or disk is full once on a setup where the LRU evictable key is the same key being written, and assert both cache + subscriber reflect the restored value (not undefined). All 3 fail without the lib changes (Received: undefined / missing fields) and pass with them.

459/459 tests pass; tsc and prettier clean.

Addresses Codex P1 on Expensify#792 review:
Expensify#792 (comment)

When `retryOperation()` evicts an LRU key via `remove()` on a storage
capacity error, `remove()` immediately drops the key from cache and
fires `keyChanged(undefined)`. With the orchestrator/write-helper split
introduced in this PR, the retry only re-attempts the storage write --
it does not re-apply cache state. So if the evicted key is part of the
in-flight write, cache + subscribers stay in the "removed" state even
though the storage retry eventually succeeds, leaving the App's view
of the world inconsistent with persisted data.

Fix: each orchestrator now constructs a `restoreEvictedKey(key)`
closure that re-applies its cache write + subscriber notification for
a single in-flight key, and passes it via the write helper's params.
`retryOperation` invokes the closure after `remove(keyForRemoval)`,
before re-entering the write helper. The closure is a no-op when the
evicted key isn't part of this write (the typical case where some
unrelated evictable key gets selected).

For mergeCollection specifically, the closure snapshots the cache's
post-merge state for in-flight keys (rather than just the merge delta)
so that previously-merged-in fields are preserved when the eviction's
cache.drop forces us to re-populate cache from scratch.

Adds 3 regression tests under `storage eviction`:
- `multiSet` -- LRU evictable is the in-flight key (persistMultiSetWrite path)
- `setCollection` -- same scenario via persistCollectionWrite
- `mergeCollection` -- same scenario via persistMergedCollectionWrite,
  asserts the previously-existing `id` field survives the eviction

All 3 tests fail with `Received: undefined` / missing fields if the
lib changes are reverted; pass with them in place.

All 459 unit tests pass (456 existing + 3 new).
CI flagged 3 occurrences in the eviction-restoration regression tests
added in the previous commit. The repo's eslint config prefers the
shorthand for simple types via @typescript-eslint/array-type.
@fabioh8010
Copy link
Copy Markdown
Contributor

i will review it soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants