Skip to content

fix(threadpool): per-task nursery scope for sync tasks — snapshot UAF (no scope-pointer swap)#160

Merged
EdmondDantes merged 7 commits into
mainfrom
windows-threadpool-snapshot-uaf-clean
Jun 8, 2026
Merged

fix(threadpool): per-task nursery scope for sync tasks — snapshot UAF (no scope-pointer swap)#160
EdmondDantes merged 7 commits into
mainfrom
windows-threadpool-snapshot-uaf-clean

Conversation

@EdmondDantes

Copy link
Copy Markdown
Contributor

Problem

A synchronous ThreadPool task (new ThreadPool(N), i.e. coroutine_mode = false) ran its body inline in the worker's main coroutine and freed the per-task snapshot arena the instant the body returned. If the body had Async\spawn()ed a coroutine it never awaited, that coroutine was still pending in the scheduler — and its op_array lives in the just-freed arena. Running it later dereferenced freed memory: hard crash on the Windows debug heap, latent + ASAN-caught on Linux.

Approach (supersedes #159)

#159 fixed this by temporarily swapping the worker coroutine's scope pointer to redirect Async\spawn() into a per-task nursery, plus a hand-rolled cancel/await/pin dance. This PR removes that swap entirely.

The task body now runs as a real coroutine in its own per-task nursery Scope. Because CURRENT SCOPE follows the running coroutine, Async\spawn() inside the body lands in that scope on its own — no coro->scope hijacking. On task exit the worker cancels the scope and drains it — awaiting the physical disposal of every spawned coroutine — before freeing the snapshot.

Invariant: no coroutine spawned by the task outlives the snapshot. (Whether such a coroutine got to run at all is timing-dependent and intentionally not asserted — the only guarantee is no use-after-free.)

The drain reuses the canonical zombie-aware logic of Scope::awaitAfterCancellation instead of re-implementing it: its body is factored into a C function exposed on the engine API.

Changes

  • php-src (ABI v0.20.0, already on true-async-stable): new zend_async_scope_await_after_cancellation_fn + ZEND_ASYNC_SCOPE_AWAIT_AFTER_CANCELLATION macro. Scope::awaitAfterCancellation is now a thin wrapper over the shared C core.
  • ext/async: sync task path in thread_pool.c rewritten (NEW_SCOPE nursery → SPAWN_WITH → await body → cancel → drain → free snapshot); scope.c/.h factor + register the C core; async_API.c registration. pool_task_dispose/coroutine-mode path untouched.

Tests

tests/thread_pool/065-task_scope_nursery_no_uaf.phpt (spawned-not-awaited coroutine still pending at task return). Full tests/thread_pool/ 66/66 (+1 pre-existing XFAIL) and scope/task_group/task_set/coroutine/spawn/await/future/pool/channel (≈480) green under ASAN+ZTS.

Requires php-src true-async-stable at ABI v0.20.0 (pushed first so this PR's CI clones it).

…shot UAF)

A synchronous ThreadPool task ran its body inline in the worker and freed
the per-task snapshot arena the moment the body returned — while a coroutine
the body spawned but never awaited was still pending. That coroutine's
op_array lived in the freed arena (Windows debug-heap crash; ASAN-caught on
Linux).

Run the body as a coroutine in its own per-task nursery Scope instead:
Async\spawn() inside the body lands in that scope on its own (no coro->scope
hijacking). On task exit the worker cancels the scope and drains it via the
new ZEND_ASYNC_SCOPE_AWAIT_AFTER_CANCELLATION — awaiting physical disposal of
every spawned coroutine — before freeing the snapshot. Invariant: no spawned
coroutine outlives the snapshot.

Requires php-src ABI v0.20.0 (zend_async_scope_await_after_cancellation_fn).
Regression test tests/thread_pool/065-task_scope_nursery_no_uaf.phpt.
@codecov

codecov Bot commented Jun 7, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Running the body as a coroutine meant a fatal (e.g. OOM) re-raised
zend_bailout() out of the task coroutine and longjmped to the worker's
bailout handler, past the future resolution — and that handler only drains
the channel, not the already-dequeued in-flight task, so the awaiter hung.
Track the in-flight task's future in a volatile and reject it from the
bailout handler (ThreadTransferException), restoring the prior behaviour.

Also trim the over-verbose comments from the previous commit.

Test tests/thread_pool/066-task_fatal_rejects_future.phpt.
… task

Two correctness fixes on top of the per-task nursery scope:

1. Fatal in a task body no longer UAFs the snapshot. The entry op_array's
   name strings (function_name, filename) are materialized into refcounted
   heap strings (thread_materialize_op_array_names), so holders that outlive
   the snapshot arena — the closure freed at request shutdown, and
   PG(last_error_file) — are freed by refcount instead of dangling into the
   freed arena. The sync body's bailout is caught and the future rejected.

2. A bailout through a parked SUSPEND no longer leaks the libuv loop. The
   ThreadChannel send/recv triggers and the worker's slot_event are uv_async
   handles; a fatal re-raising zend_bailout() through SUSPEND skipped their
   dispose, leaving an open uv_async that blocked uv_loop_close (EBUSY) and
   leaked the loop internals. Both paths now catch the bailout, dispose the
   trigger on the worker's reactor, and re-raise. Debug builds dump any
   handle that survives reactor shutdown.

Tests: thread_pool/066 (sync fatal rejects future), 067 (coroutine fatal
disposes channel trigger), 068 (concurrency fatal disposes slot_event).
All green under ASAN+LSan.
…longer null)

Two related improvements to how a ThreadPool task's fatal reaches the awaiter:

1. Coroutine-mode tasks no longer resolve to a silent null on a fatal. The
   pool_task_dispose completion path only checked coroutine->exception, but a
   bailout (OOM/fatal/exit) is not a thrown exception, so the future completed
   with UNDEF -> null. It now treats "no exception + UNDEF result" as a bailout
   and rejects the future with the cause, matching synchronous mode.

2. The cause crosses to the parent as a plain string, not a PHP object built
   in the dying worker. New async_future_shared_state_reject_bailout stores a
   pestrdup'd message (system malloc, not bound by memory_limit) on the shared
   state; shared_state_trigger_cb builds the ThreadTransferException on the
   healthy parent side. So the real fatal message survives even when the worker
   can no longer allocate a PHP object under the exhausted allocator.

Tests thread_pool/067, 068 now assert the cause is delivered (was %ANULL).
…n_thread

A transferred closure's use-variable names were copied with zend_string_dup(),
which returns interned strings as-is. Those names are interned in the parent
thread, so the worker's snapshot stored pointers into the parent's interned-
string table. If the parent ended/aborted its request (zend_interned_strings_
deactivate frees the table) while a worker was still constructing the closure,
the worker read freed memory in async_thread_create_closure (ZSTR_VAL(key)) —
heap-use-after-free, reproducible under WSL2+ASAN, deterministic 6/6.

Copy each key into a private persistent string (zend_string_init persistent=1)
owned by the snapshot, independent of the parent's interned-string table.

Verified: tests/thread/050,052 now pass 5/5 under --asan + detect_leaks=1
(were failing 6/6); full thread suite 178 passed / 0 failed.
…bug assert)

The previous commit copied the captured-variable name into a persistent string
with zend_string_init(persistent=1) but did not flag it GC_PERSISTENT_LOCAL.
On the parent thread (where zend_rc_debug is enabled in --enable-rc-debug ASAN
builds) the GC_ADDREF inside zend_hash_add then asserts
"GC_PERSISTENT without GC_PERSISTENT_LOCAL" — failing every cross-thread test
in the CI OpCache ASAN job (remote_future/*, thread/030,031, task_group/040).

Mark the key PERSISTENT_LOCAL right after init, exactly as the dst->bound_vars
hash itself is marked one line above.

Verified under a full -DZEND_RC_DEBUG=1 + opcache.enable_cli=1 + ASAN/LSan
rebuild: remote_future/thread/task_group/thread_pool/thread_channel = 236
passed, 0 failed.
The cross-thread bailout cause does not belong on the generic
zend_future_shared_state_t (it is shared by RemoteFuture and every cross-thread
future, not just ThreadPool). Drop the bailout_cause field, reject_bailout, the
destination-side build branch, and the async_thread_create_transfer_exception
helper.

A task that bailed out now resolves its future the normal way: build a
ThreadTransferException in the worker and reject via async_future_shared_state_
reject. Coroutine-mode still reports the cause instead of a silent null
(pool_task_dispose detects "no exception + UNDEF result" and rejects).

Also trims the verbose comments added during this work to 1-2 lines.

Verified under -DZEND_RC_DEBUG=1 + opcache.enable_cli=1 + ASAN/LSan:
remote_future/thread/task_group/thread_pool/thread_channel = 236 passed, 0 failed.
@EdmondDantes EdmondDantes merged commit 9af3e2e into main Jun 8, 2026
9 checks passed
@EdmondDantes EdmondDantes deleted the windows-threadpool-snapshot-uaf-clean branch June 8, 2026 17:48
EdmondDantes added a commit that referenced this pull request Jun 8, 2026
…in CI

063-bootloader_exception: a bootloader exception reaches the awaiter either as
the real RuntimeException (submit landed before the worker closed the channel)
or as Async\ThreadTransferException carrying the same message (submit raced the
close). Both deliver the true cause, so assert on the message instead of the
timing-dependent class.

034-stdio_fopen_fwrite: concurrent fopen('a') appends to a single shared file
from two workers are not atomic on Windows and can lose a write (got lines=3).
Each task now writes its own file ('w'), keeping "worker I/O hits disk"
coverage without the cross-thread append contention.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant