[world-local] Reduce sequential replay I/O#2152
Conversation
🦋 Changeset detectedLatest commit: d9a284c The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (1 failed)vite (1 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
❌ Some benchmark jobs failed:
Check the workflow run for details. |
| // Terminal events clear their run's entries so a long-lived dev server does | ||
| // not retain completed-run payloads. Externally written event paths are | ||
| // absent from this cache and still load from disk normally. | ||
| const eventCache = new Map<string, Event>(); |
e4860ff to
d9a284c
Compare
There was a problem hiding this comment.
Pull request overview
This PR targets the local filesystem-backed world (@workflow/world-local) to reduce sequential-step replay overhead by avoiding repeated disk reads of just-written event files and repeated recursive mkdir calls for stable directories.
Changes:
- Add a bounded in-memory cache for recently appended events and plumb it into
paginatedFileSystemQuery()via an optionalcachedItemsmap. - Add a process-local directory creation cache + retry-on-
ENOENTbehavior for atomic/exclusive writes to tolerate external data-dir deletion. - Add regression tests and a patch changeset covering cache isolation, oversize handling, normalization, and cache clearing on
world.clear()/ terminal runs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
packages/world-local/src/storage/events-storage.ts |
Introduces bounded recent-event caching and uses it to short-circuit replay pagination disk reads. |
packages/world-local/src/fs.ts |
Adds directory caching + retry-on-missing-dir writes; extends paginatedFileSystemQuery to optionally serve results from an in-memory cache. |
packages/world-local/src/storage/index.ts |
Extends local storage shape with clearCache() and wires it to the events cache. |
packages/world-local/src/index.ts |
Clears storage cache on world.clear()/world.close() and ensures write-path caches are cleared when removing the whole data directory. |
packages/world-local/src/storage.test.ts |
Adds tests to verify cached events don’t alias returned instances, oversize events bypass cache, normalization matches disk reads, and cache can be released. |
packages/world-local/src/fs.test.ts |
Adds tests for directory caching and for recreating cached directories after external removal for atomic/exclusive writes. |
packages/world-local/src/tag.test.ts |
Adds regression test ensuring untagged world.clear() doesn’t leave cached directories that break subsequent writes. |
.changeset/quick-local-replay.md |
Adds a patch changeset describing the optimization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -544,7 +582,11 @@ export async function paginatedFileSystemQuery<T extends { createdAt: Date }>( | |||
| const filePath = path.join(directory, `${fileId}.json`); | |||
| expect(result.data[3].eventType).toBe('hook_disposed'); | ||
| }); | ||
| }); | ||
|
|
karthikscale3
left a comment
There was a problem hiding this comment.
ai review: A few notes from a regression/coverage pass. One blocking correctness issue on the read-cache key, plus some test-coverage gaps (several marked optional).
| let item: T | null = null; | ||
| try { | ||
| item = await readJSON(filePath, schema); | ||
| const cachedItem = cachedItems?.get(filePath); |
There was a problem hiding this comment.
ai review (critical): this lookup key never matches the cache key when dataDir is relative, so the event read-cache is silently a no-op in the default config.
storeEvent keys the cache via taggedPath -> resolveWithinBase -> path.resolve(...) (absolute), but here filePath = path.join(directory, ...) where directory = path.join(basedir, 'events') stays relative when basedir is relative. Verified:
cacheKey : /…/workflow/.next/workflow-data/events/wrun_ABC-evt_123.json
lookupKey: .next/workflow-data/events/wrun_ABC-evt_123.json
match : false
This is the default path: world-local defaults to '.workflow-data' and the Next.js adapter sets WORKFLOW_LOCAL_DATA_DIR = '.next/workflow-data' (both relative). So in a Next.js dev server — the long-lived-server scenario this PR targets — events.list() re-reads every event from disk and the headline 240ms -> 115ms win does not apply. The mkdir cache is unaffected (it path.resolves on both sides).
Suggested fix: normalize both sides, e.g. key the lookup by path.resolve(directory, ${fileId}.json) here (defensive for any relative-directory caller), or build the cache key in storeEvent with the same path.join(basedir, 'events', ...) the query uses.
| }); | ||
| }); | ||
|
|
||
| it('reuses locally appended events without exposing cached instances', async () => { |
There was a problem hiding this comment.
ai review (test gap): every suite uses fs.mkdtemp(os.tmpdir(), …) (absolute), which is exactly why the relative-dataDir cache-key mismatch above isn't caught. Please add a regression test that builds storage with a relative basedir, writes an event, lists, and asserts eventFileReads is empty — that would have flagged the no-op.
Also, this test models a single run_created event, but the optimization targets sequential multi-step replay. A test that appends N events and asserts the incremental list() does zero event-file reads would actually exercise the benchmarked path.
| const result = await paginatedFileSystemQuery({ | ||
| directory: path.join(basedir, 'events'), | ||
| schema: EventSchema, | ||
| cachedItems: eventCache, |
There was a problem hiding this comment.
ai review (test gap, optional): listByCorrelationId now passes cachedItems: eventCache but there's no test asserting a correlation-id list actually hits the cache (the create/list paths are covered). Low priority since it shares the same code path, but worth a small case.
| expect(second.data[0]?.eventType).toBe('run_created'); | ||
| }); | ||
|
|
||
| it('reads oversized event payloads from disk instead of retaining them', async () => { |
There was a problem hiding this comment.
ai review (test gap, optional): the > maxCachedEventBytes skip is covered here, but the FIFO eviction loop (maxCachedEventEntries / totalCachedEventBytes accounting) has no direct test. Optional — a test that writes >1000 small events and asserts bounded cache size / no byte-counter drift would lock in the eviction logic, but it's not blocking.
TooTallNate
left a comment
There was a problem hiding this comment.
Request changes — critical bug confirmed: cache silently a no-op with relative dataDir
The structure of this PR is great — bounded LRU-ish cache, careful mutation isolation via structuredClone, proper terminal-run cleanup, sensible 4 MiB / 1000-entry caps, ENOENT retry for the createdDirectoriesCache. The test coverage for the in-cache scenarios is genuinely thorough (5 new storage tests + 3 new fs tests covering hit/miss/oversized/normalization/clearCache/terminal-release).
But there's a critical correctness bug that @karthikscale3 and Vade already flagged, and I've now reproduced it locally — see inline. The cache is silently a no-op in the default dataDir: '.workflow-data' config, which means the perf numbers in the PR description don't hold for production.
Verified locally
- 375/375 tests pass (the existing absolute-path tests don't exercise the bug)
- Reproduced the cache-key mismatch in a 6-line Node script —
path.resolvevspath.joinproduce different strings whenbasediris relative
Other points from the prior reviews
I agree with @karthikscale3 on the remaining items:
- Test gap — relative dataDir (high): Every existing suite uses
mkdtempSync(os.tmpdir(), ...)which gives absolute paths. Add a regression test that builds the world with a relativedataDir— inline comment includes a snippet. - Test gap —
listByCorrelationIdcache (optional): ThelistByCorrelationIdpath also passescachedItems: eventCachenow, but no test asserts it hits the cache. Easy to add. - Test gap — FIFO eviction (optional): The
> maxCachedEventBytesskip is tested, but thewhile (...)eviction loop has no direct coverage. Worth adding a test that writes >1000 events and asserts the oldest gets evicted. - Test mocks not restored (Copilot): The new
vi.spyOn(fs, 'readFile')calls should be wrapped intry { ... } finally { fs.readFile.mockRestore() }(or rely on a globalafterEach(() => vi.restoreAllMocks())if the suite has one).
What's good
The non-cache parts of the PR are solid:
createdDirectoriesCachewithwithEnsuredDirectoryENOENT retry is the right pattern — covers the dev-server-survives-rm scenario explicitly.- Cache release on terminal runs (
run_completed/run_failed/run_cancelled) is the right boundary — bounds memory growth for long-lived dev servers. structuredCloneon cache hit detaches caller-owned mutations from cached state — the test that mutatesfirst.data[0].eventType = 'run_failed'and asserts the next list returns the original is exactly right.- The
world.clear()+world.close()integration is correct — both clear caches, includingcreatedFilesCacheso newly-created files afterclear()work.
Request changes
Pending the relative-path bug fix + regression test. Once those are in, this is a clean perf improvement.
| @@ -544,7 +582,11 @@ export async function paginatedFileSystemQuery<T extends { createdAt: Date }>( | |||
| const filePath = path.join(directory, `${fileId}.json`); | |||
There was a problem hiding this comment.
Confirming the critical bug @karthikscale3 and Vade flagged: the event cache is silently a no-op when dataDir is relative.
I traced it through and reproduced the mismatch locally:
// taggedPath() (cache write key) uses path.resolve():
resolveWithinBase('.workflow-data', 'events', 'wrun_X-evt_Y.json')
// → '/Users/me/repo/.workflow-data/events/wrun_X-evt_Y.json'
// paginatedFileSystemQuery's filePath lookup uses path.join():
path.join(path.join('.workflow-data', 'events'), 'wrun_X-evt_Y.json')
// → '.workflow-data/events/wrun_X-evt_Y.json'The two never match. cachedItems?.get(filePath) always returns undefined, the code falls through to readJSON(filePath, schema), and the cache contributes zero perf benefit in the default config.
The perf numbers in the PR description (24% faster, 52% fewer events.list ms) were measured against a setup that probably used os.tmpdir() (absolute), which is exactly why the bug isn't caught by any existing test — every suite uses mkdtempSync(os.tmpdir(), ...).
Fix
Normalize the cache key on both sides. Simplest: have storeEvent use the same path.join(directory, ...) construction the lookup uses, OR have paginatedFileSystemQuery resolve filePath before consulting the cache:
// In paginatedFileSystemQuery:
const filePath = path.join(directory, `${fileId}.json`);
const cacheKey = path.resolve(filePath); // <- add this
const cachedItem = cachedItems?.get(cacheKey);And in storeEvent (already using taggedPath which resolves, so just keep that and don't change the read side). The cleaner version is to standardize on path.resolve(...) for cache keys everywhere.
Regression test
Karthik's suggestion to add a regression test that builds the world with a relative dataDir is exactly right. Suggested:
it('event cache works when dataDir is a relative path', async () => {
const cwd = process.cwd();
const relativeDir = path.relative(cwd, testDir); // make it relative
const localStorage = createStorage(relativeDir);
const run = await createRun(localStorage, {
deploymentId: 'dep',
workflowName: 'wf',
input: new Uint8Array([1]),
});
const readFileSpy = vi.spyOn(fs, 'readFile');
await localStorage.events.list({ runId: run.runId });
const eventFileReads = readFileSpy.mock.calls.filter(([fp]) =>
String(fp).includes(`${path.sep}events${path.sep}`)
);
expect(eventFileReads).toHaveLength(0); // cache hit
});Blocking on this — the perf claim doesn't hold for the default config until the cache keys align.
Summary
mkdirsyscallsRoot cause
The existing
sequentialStepsWorkflow(count, 0)benchmark reproduces the zero-work sequential-step shape. On currentmain, its local-world storage work is dominated by three persisted lifecycle events per step and the incrementalevents.list()call used for replay. The listing path rereads append-only event files that the same storage instance just wrote, while the write path repeatedly callsmkdir(..., { recursive: true })for fixed directories.This workload does not exercise streams; the previously landed stream metadata optimization is separate from this path.
Correctness and memory safeguards
Reviewing the initial caching implementation exposed two reproducible correctness issues:
events.create()returned an event object that was also retained in cache; mutating the returned payload changed a subsequent cachedevents.list()response.ENOENTwhen a live dev server's data directory was externally removed.This version addresses those issues and bounds retention:
EventSchema, so they are detached from caller mutations and have the same normalized shape as disk reads4 MiBand1000entries across active runs; oversized events are read from disk instead of retainedworld.clear()orworld.close()is calledMeasurement
I modeled the event/replay lifecycle for a no-delay sequential workflow directly through
@workflow/world-localstorage:For 200 steps, incremental
events.list()time fell from240.60 msto115.13 ms(52.1%lower).A 50-step filesystem-operation trace demonstrates the removed work:
readFilemkdirAn end-to-end workbench probe also showed that most remaining no-delay sequential-workflow latency occurs above this storage path: a 200-step run reported
22.8 sinside/.well-known/workflow/v1/flow.Validation
events.create()result surfaced inevents.list(); verified the amended implementation returns the persisted valueENOENTon the next event write); verified the amended implementation recreates the directorypnpm --filter @workflow/world-local typecheckpnpm --filter @workflow/world-local test(375tests passed)pnpm --filter '@workflow/world-local...' buildpnpm changeset status --since=origin-https/maingit diff --check