Skip to content

fix(threadpool): per-task nursery Scope so spawned coroutines can't outlive the task snapshot (UAF)#159

Closed
EdmondDantes wants to merge 3 commits into
mainfrom
windows-threadpool-snapshot-uaf
Closed

fix(threadpool): per-task nursery Scope so spawned coroutines can't outlive the task snapshot (UAF)#159
EdmondDantes wants to merge 3 commits into
mainfrom
windows-threadpool-snapshot-uaf

Conversation

@EdmondDantes

Copy link
Copy Markdown
Contributor

Problem

ThreadPool::submit() deep-copies the task closure and every nested closure into a per-task snapshot arena (the bytecode/op_array). The non-coroutine worker freed that arena the instant the task body returned (thread_pool.c task_cleanup) — but any coroutine the task had spawn()ed was still pending in the scheduler. That coroutine's op_array lives in the just-freed arena, so the scheduler later ran it against freed memory:

  • 0xC0000005 (hard crash) on the Windows debug heap (0xFEEEFEEE freed-fill).
  • Latent on Linux (glibc free() doesn't clobber) — and the server suite has no ASAN job, so it went unnoticed there.

Minimal, server-free repro (crashes before this fix):

$pool = new Async\ThreadPool(1);
$f = $pool->submit(function () {
    Async\spawn(function () { echo "child\n"; });  // pending when the body returns
});
Async\await($f);

Surfaced by true-async/server core/007-server-transfer-handler-dispatch (its task does spawn(stop-coroutine) + $server->start()).

Fix — structured concurrency (nursery)

Run each task body under its own per-task Scope in NOT-safe mode. Async\spawn parents to the current coroutine's scope, so coroutines the task spawns now land in task_scope. On task exit the worker:

  1. restores its own scope,
  2. cancels task_scope (NOT-safe → stragglers are cancelled, never awaited as zombies),
  3. awaits the scope's teardown — ZEND_ASYNC_WAKER_NEW + zend_async_resume_when(&task_scope->event, …) + SUSPEND, mirroring Scope::awaitAfterCancellation at the C level,

…all before the snapshot is freed. So no spawned coroutine can outlive the snapshot. Un-awaited children are cancelled at task exit (bounded by the runtime cancellation grace); a task that wants a child to finish must await it.

Verification (Windows Debug_TS)

Follow-up (not in this PR)

Coroutine-mode submit already runs each task in its own task_scope, but frees the snapshot on the task coroutine's dispose rather than on scope completion — the same latent UAF (a coroutine-mode minimal repro also crashes today). Tracking separately.

…shot UAF)

submit() deep-copies the task closure and every nested closure into a
per-task snapshot arena. The non-coroutine worker freed that arena the
instant the task body returned, but any coroutine the task had spawn()ed
was still pending in the scheduler — its op_array lived in the just-freed
arena, so the scheduler ran it against freed memory: 0xC0000005 on the
Windows debug heap, latent (ASAN-detectable) on Linux.

Fix: run each task body under its own per-task Scope in NOT-safe mode (a
nursery). Async\spawn uses the current coroutine's scope, so coroutines the
task spawns now land in task_scope. On task exit the worker restores its own
scope, cancels task_scope and awaits its teardown (cancel + waker-resume on
the scope event + SUSPEND, mirroring Scope::awaitAfterCancellation at the C
level) BEFORE the snapshot is freed — so no spawned coroutine can outlive
the snapshot. NOT-safe cancel means un-awaited children are cancelled at
task exit, never awaited as zombies (bounded by the runtime grace).

Verified on Windows: true-async/server core/007-server-transfer-handler-
dispatch goes red->green; a minimal repro (ThreadPool task that spawns an
un-awaited coroutine) no longer crashes; ext/async thread_pool suite 65
pass / 0 fail (incl. the #146/#154 lifetime tests). Regression test
tests/thread_pool/065-task_scope_nursery_no_uaf.phpt.

Follow-up: coroutine-mode submit already runs each task in its own scope
but frees the snapshot on the task coroutine's dispose rather than on scope
completion — same latent UAF, not yet fixed here.
@codecov

codecov Bot commented Jun 6, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

…outine count)

The first cut gated the post-cancel await on IS_COMPLETELY_DONE, which
(can_be_disposed with_zombies=true) flips true the moment children are
cancelled — while their coroutine objects, holding op_arrays that live in
the snapshot arena, are still queued for disposal on a later scheduler tick.
The worker skipped the await and freed the arena; the deferred
coroutine_dispose then ran destroy_op_array against freed memory (ASAN
heap-use-after-free on Linux; masked by timing on the Windows debug heap).
It also called resume_when on an already-terminated scope event when the
task body had run its own children to completion (core/007: "event cannot
be used after it has been terminated").

Mirror Scope::awaitAfterCancellation: skip when the scope is already CLOSED
(children disposed, event gone), otherwise await while live coroutines or
child scopes remain (async_scope_t.coroutines.length). The SUSPEND now lets
each child run to full disposal while the arena is alive; cooperative
scheduling guarantees they finish before the worker resumes and frees it.

core/007 green 3/3; ext/async thread_pool 66/0.
…free

ASAN caught a second UAF — this one on the task scope itself. When the last
child coroutine finishes during the worker's SUSPEND, async_scope_notify_
coroutine_finished -> scope_try_to_dispose -> scope_dispose frees the scope;
the worker then resumes and touches the freed scope (CLR_OWNER_PINNED /
RELEASE) -> heap-use-after-free at thread_pool.c.

Pin the scope with ZEND_ASYNC_SCOPE_SET_OWNER_PINNED before the cancel/await
(same pattern pool_scope uses to outlive its tasks) and clear it right
before RELEASE, so the scope survives the await and the worker owns its
disposal. core/007 green 3/3; ext/async thread_pool 66/0.
EdmondDantes added a commit that referenced this pull request Jun 8, 2026
… task

Two correctness fixes on top of the per-task nursery scope:

1. Fatal in a task body no longer UAFs the snapshot. The entry op_array's
   name strings (function_name, filename) are materialized into refcounted
   heap strings (thread_materialize_op_array_names), so holders that outlive
   the snapshot arena — the closure freed at request shutdown, and
   PG(last_error_file) — are freed by refcount instead of dangling into the
   freed arena. The sync body's bailout is caught and the future rejected.

2. A bailout through a parked SUSPEND no longer leaks the libuv loop. The
   ThreadChannel send/recv triggers and the worker's slot_event are uv_async
   handles; a fatal re-raising zend_bailout() through SUSPEND skipped their
   dispose, leaving an open uv_async that blocked uv_loop_close (EBUSY) and
   leaked the loop internals. Both paths now catch the bailout, dispose the
   trigger on the worker's reactor, and re-raise. Debug builds dump any
   handle that survives reactor shutdown.

Tests: thread_pool/066 (sync fatal rejects future), 067 (coroutine fatal
disposes channel trigger), 068 (concurrency fatal disposes slot_event).
All green under ASAN+LSan.
EdmondDantes added a commit that referenced this pull request Jun 8, 2026
…longer null)

Two related improvements to how a ThreadPool task's fatal reaches the awaiter:

1. Coroutine-mode tasks no longer resolve to a silent null on a fatal. The
   pool_task_dispose completion path only checked coroutine->exception, but a
   bailout (OOM/fatal/exit) is not a thrown exception, so the future completed
   with UNDEF -> null. It now treats "no exception + UNDEF result" as a bailout
   and rejects the future with the cause, matching synchronous mode.

2. The cause crosses to the parent as a plain string, not a PHP object built
   in the dying worker. New async_future_shared_state_reject_bailout stores a
   pestrdup'd message (system malloc, not bound by memory_limit) on the shared
   state; shared_state_trigger_cb builds the ThreadTransferException on the
   healthy parent side. So the real fatal message survives even when the worker
   can no longer allocate a PHP object under the exhausted allocator.

Tests thread_pool/067, 068 now assert the cause is delivered (was %ANULL).
EdmondDantes added a commit that referenced this pull request Jun 8, 2026
…n_thread

A transferred closure's use-variable names were copied with zend_string_dup(),
which returns interned strings as-is. Those names are interned in the parent
thread, so the worker's snapshot stored pointers into the parent's interned-
string table. If the parent ended/aborted its request (zend_interned_strings_
deactivate frees the table) while a worker was still constructing the closure,
the worker read freed memory in async_thread_create_closure (ZSTR_VAL(key)) —
heap-use-after-free, reproducible under WSL2+ASAN, deterministic 6/6.

Copy each key into a private persistent string (zend_string_init persistent=1)
owned by the snapshot, independent of the parent's interned-string table.

Verified: tests/thread/050,052 now pass 5/5 under --asan + detect_leaks=1
(were failing 6/6); full thread suite 178 passed / 0 failed.
EdmondDantes added a commit that referenced this pull request Jun 8, 2026
…bug assert)

The previous commit copied the captured-variable name into a persistent string
with zend_string_init(persistent=1) but did not flag it GC_PERSISTENT_LOCAL.
On the parent thread (where zend_rc_debug is enabled in --enable-rc-debug ASAN
builds) the GC_ADDREF inside zend_hash_add then asserts
"GC_PERSISTENT without GC_PERSISTENT_LOCAL" — failing every cross-thread test
in the CI OpCache ASAN job (remote_future/*, thread/030,031, task_group/040).

Mark the key PERSISTENT_LOCAL right after init, exactly as the dst->bound_vars
hash itself is marked one line above.

Verified under a full -DZEND_RC_DEBUG=1 + opcache.enable_cli=1 + ASAN/LSan
rebuild: remote_future/thread/task_group/thread_pool/thread_channel = 236
passed, 0 failed.
EdmondDantes added a commit that referenced this pull request Jun 8, 2026
The cross-thread bailout cause does not belong on the generic
zend_future_shared_state_t (it is shared by RemoteFuture and every cross-thread
future, not just ThreadPool). Drop the bailout_cause field, reject_bailout, the
destination-side build branch, and the async_thread_create_transfer_exception
helper.

A task that bailed out now resolves its future the normal way: build a
ThreadTransferException in the worker and reject via async_future_shared_state_
reject. Coroutine-mode still reports the cause instead of a silent null
(pool_task_dispose detects "no exception + UNDEF result" and rejects).

Also trims the verbose comments added during this work to 1-2 lines.

Verified under -DZEND_RC_DEBUG=1 + opcache.enable_cli=1 + ASAN/LSan:
remote_future/thread/task_group/thread_pool/thread_channel = 236 passed, 0 failed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant