fix(threadpool): per-task nursery Scope so spawned coroutines can't outlive the task snapshot (UAF) by EdmondDantes · Pull Request #159 · true-async/php-async

EdmondDantes · 2026-06-06T16:00:00Z

Problem

ThreadPool::submit() deep-copies the task closure and every nested closure into a per-task snapshot arena (the bytecode/op_array). The non-coroutine worker freed that arena the instant the task body returned (thread_pool.c task_cleanup) — but any coroutine the task had spawn()ed was still pending in the scheduler. That coroutine's op_array lives in the just-freed arena, so the scheduler later ran it against freed memory:

0xC0000005 (hard crash) on the Windows debug heap (0xFEEEFEEE freed-fill).
Latent on Linux (glibc free() doesn't clobber) — and the server suite has no ASAN job, so it went unnoticed there.

Minimal, server-free repro (crashes before this fix):

$pool = new Async\ThreadPool(1);
$f = $pool->submit(function () {
    Async\spawn(function () { echo "child\n"; });  // pending when the body returns
});
Async\await($f);

Surfaced by true-async/server core/007-server-transfer-handler-dispatch (its task does spawn(stop-coroutine) + $server->start()).

Fix — structured concurrency (nursery)

Run each task body under its own per-task Scope in NOT-safe mode. Async\spawn parents to the current coroutine's scope, so coroutines the task spawns now land in task_scope. On task exit the worker:

restores its own scope,
cancels task_scope (NOT-safe → stragglers are cancelled, never awaited as zombies),
awaits the scope's teardown — ZEND_ASYNC_WAKER_NEW + zend_async_resume_when(&task_scope->event, …) + SUSPEND, mirroring Scope::awaitAfterCancellation at the C level,

…all before the snapshot is freed. So no spawned coroutine can outlive the snapshot. Un-awaited children are cancelled at task exit (bounded by the runtime cancellation grace); a task that wants a child to finish must await it.

Verification (Windows Debug_TS)

core/007-server-transfer-handler-dispatch: red → green
minimal repro: no longer crashes (exit 0)
ext/async/tests/thread_pool/: 65 pass / 0 fail (incl. the flock(): stack-use-after-return when coroutine is cancelled mid-flock (real ASAN finding) #146/ThreadPool/ThreadChannel: no test coverage for transferring $this-bound closures (incl. method-as-closure) #154 lifetime tests) — no regressions
new regression test tests/thread_pool/065-task_scope_nursery_no_uaf.phpt (will trip ASAN on Linux without the fix)

Follow-up (not in this PR)

Coroutine-mode submit already runs each task in its own task_scope, but frees the snapshot on the task coroutine's dispose rather than on scope completion — the same latent UAF (a coroutine-mode minimal repro also crashes today). Tracking separately.

…shot UAF) submit() deep-copies the task closure and every nested closure into a per-task snapshot arena. The non-coroutine worker freed that arena the instant the task body returned, but any coroutine the task had spawn()ed was still pending in the scheduler — its op_array lived in the just-freed arena, so the scheduler ran it against freed memory: 0xC0000005 on the Windows debug heap, latent (ASAN-detectable) on Linux. Fix: run each task body under its own per-task Scope in NOT-safe mode (a nursery). Async\spawn uses the current coroutine's scope, so coroutines the task spawns now land in task_scope. On task exit the worker restores its own scope, cancels task_scope and awaits its teardown (cancel + waker-resume on the scope event + SUSPEND, mirroring Scope::awaitAfterCancellation at the C level) BEFORE the snapshot is freed — so no spawned coroutine can outlive the snapshot. NOT-safe cancel means un-awaited children are cancelled at task exit, never awaited as zombies (bounded by the runtime grace). Verified on Windows: true-async/server core/007-server-transfer-handler- dispatch goes red->green; a minimal repro (ThreadPool task that spawns an un-awaited coroutine) no longer crashes; ext/async thread_pool suite 65 pass / 0 fail (incl. the #146/#154 lifetime tests). Regression test tests/thread_pool/065-task_scope_nursery_no_uaf.phpt. Follow-up: coroutine-mode submit already runs each task in its own scope but frees the snapshot on the task coroutine's dispose rather than on scope completion — same latent UAF, not yet fixed here.

codecov · 2026-06-06T16:07:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

…outine count) The first cut gated the post-cancel await on IS_COMPLETELY_DONE, which (can_be_disposed with_zombies=true) flips true the moment children are cancelled — while their coroutine objects, holding op_arrays that live in the snapshot arena, are still queued for disposal on a later scheduler tick. The worker skipped the await and freed the arena; the deferred coroutine_dispose then ran destroy_op_array against freed memory (ASAN heap-use-after-free on Linux; masked by timing on the Windows debug heap). It also called resume_when on an already-terminated scope event when the task body had run its own children to completion (core/007: "event cannot be used after it has been terminated"). Mirror Scope::awaitAfterCancellation: skip when the scope is already CLOSED (children disposed, event gone), otherwise await while live coroutines or child scopes remain (async_scope_t.coroutines.length). The SUSPEND now lets each child run to full disposal while the arena is alive; cooperative scheduling guarantees they finish before the worker resumes and frees it. core/007 green 3/3; ext/async thread_pool 66/0.

…free ASAN caught a second UAF — this one on the task scope itself. When the last child coroutine finishes during the worker's SUSPEND, async_scope_notify_ coroutine_finished -> scope_try_to_dispose -> scope_dispose frees the scope; the worker then resumes and touches the freed scope (CLR_OWNER_PINNED / RELEASE) -> heap-use-after-free at thread_pool.c. Pin the scope with ZEND_ASYNC_SCOPE_SET_OWNER_PINNED before the cancel/await (same pattern pool_scope uses to outlive its tasks) and clear it right before RELEASE, so the scope survives the await and the worker owns its disposal. core/007 green 3/3; ext/async thread_pool 66/0.

… task Two correctness fixes on top of the per-task nursery scope: 1. Fatal in a task body no longer UAFs the snapshot. The entry op_array's name strings (function_name, filename) are materialized into refcounted heap strings (thread_materialize_op_array_names), so holders that outlive the snapshot arena — the closure freed at request shutdown, and PG(last_error_file) — are freed by refcount instead of dangling into the freed arena. The sync body's bailout is caught and the future rejected. 2. A bailout through a parked SUSPEND no longer leaks the libuv loop. The ThreadChannel send/recv triggers and the worker's slot_event are uv_async handles; a fatal re-raising zend_bailout() through SUSPEND skipped their dispose, leaving an open uv_async that blocked uv_loop_close (EBUSY) and leaked the loop internals. Both paths now catch the bailout, dispose the trigger on the worker's reactor, and re-raise. Debug builds dump any handle that survives reactor shutdown. Tests: thread_pool/066 (sync fatal rejects future), 067 (coroutine fatal disposes channel trigger), 068 (concurrency fatal disposes slot_event). All green under ASAN+LSan.

…longer null) Two related improvements to how a ThreadPool task's fatal reaches the awaiter: 1. Coroutine-mode tasks no longer resolve to a silent null on a fatal. The pool_task_dispose completion path only checked coroutine->exception, but a bailout (OOM/fatal/exit) is not a thrown exception, so the future completed with UNDEF -> null. It now treats "no exception + UNDEF result" as a bailout and rejects the future with the cause, matching synchronous mode. 2. The cause crosses to the parent as a plain string, not a PHP object built in the dying worker. New async_future_shared_state_reject_bailout stores a pestrdup'd message (system malloc, not bound by memory_limit) on the shared state; shared_state_trigger_cb builds the ThreadTransferException on the healthy parent side. So the real fatal message survives even when the worker can no longer allocate a PHP object under the exhausted allocator. Tests thread_pool/067, 068 now assert the cause is delivered (was %ANULL).

…n_thread A transferred closure's use-variable names were copied with zend_string_dup(), which returns interned strings as-is. Those names are interned in the parent thread, so the worker's snapshot stored pointers into the parent's interned- string table. If the parent ended/aborted its request (zend_interned_strings_ deactivate frees the table) while a worker was still constructing the closure, the worker read freed memory in async_thread_create_closure (ZSTR_VAL(key)) — heap-use-after-free, reproducible under WSL2+ASAN, deterministic 6/6. Copy each key into a private persistent string (zend_string_init persistent=1) owned by the snapshot, independent of the parent's interned-string table. Verified: tests/thread/050,052 now pass 5/5 under --asan + detect_leaks=1 (were failing 6/6); full thread suite 178 passed / 0 failed.

…bug assert) The previous commit copied the captured-variable name into a persistent string with zend_string_init(persistent=1) but did not flag it GC_PERSISTENT_LOCAL. On the parent thread (where zend_rc_debug is enabled in --enable-rc-debug ASAN builds) the GC_ADDREF inside zend_hash_add then asserts "GC_PERSISTENT without GC_PERSISTENT_LOCAL" — failing every cross-thread test in the CI OpCache ASAN job (remote_future/*, thread/030,031, task_group/040). Mark the key PERSISTENT_LOCAL right after init, exactly as the dst->bound_vars hash itself is marked one line above. Verified under a full -DZEND_RC_DEBUG=1 + opcache.enable_cli=1 + ASAN/LSan rebuild: remote_future/thread/task_group/thread_pool/thread_channel = 236 passed, 0 failed.

The cross-thread bailout cause does not belong on the generic zend_future_shared_state_t (it is shared by RemoteFuture and every cross-thread future, not just ThreadPool). Drop the bailout_cause field, reject_bailout, the destination-side build branch, and the async_thread_create_transfer_exception helper. A task that bailed out now resolves its future the normal way: build a ThreadTransferException in the worker and reject via async_future_shared_state_ reject. Coroutine-mode still reports the cause instead of a silent null (pool_task_dispose detects "no exception + UNDEF result" and rejects). Also trims the verbose comments added during this work to 1-2 lines. Verified under -DZEND_RC_DEBUG=1 + opcache.enable_cli=1 + ASAN/LSan: remote_future/thread/task_group/thread_pool/thread_channel = 236 passed, 0 failed.

EdmondDantes added 2 commits June 6, 2026 19:32

EdmondDantes mentioned this pull request Jun 7, 2026

fix(threadpool): per-task nursery scope for sync tasks — snapshot UAF (no scope-pointer swap) #160

Merged

EdmondDantes closed this Jun 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(threadpool): per-task nursery Scope so spawned coroutines can't outlive the task snapshot (UAF)#159

fix(threadpool): per-task nursery Scope so spawned coroutines can't outlive the task snapshot (UAF)#159
EdmondDantes wants to merge 3 commits into
mainfrom
windows-threadpool-snapshot-uaf

EdmondDantes commented Jun 6, 2026

Uh oh!

codecov Bot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EdmondDantes commented Jun 6, 2026

Problem

Fix — structured concurrency (nursery)

Verification (Windows Debug_TS)

Follow-up (not in this PR)

Uh oh!

codecov Bot commented Jun 6, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant