Skip to content

fix(envd): kill command's whole process tree on signal#2933

Open
mishushakov wants to merge 3 commits into
mainfrom
mishushakov/envd-kill-process-tree
Open

fix(envd): kill command's whole process tree on signal#2933
mishushakov wants to merge 3 commits into
mainfrom
mishushakov/envd-kill-process-tree

Conversation

@mishushakov
Copy link
Copy Markdown
Member

envd's SendSignal only signaled the single process it manages (the command leader), so child processes the command spawned kept running — and consuming resources — after a kill() (see e2b-dev/E2B#1389, which worked around this SDK-side with a pgrep shell walk). Non-PTY commands now start in their own process group (Setpgid) and SendSignal delivers the signal to the whole group, terminating the leader together with its children; command timeouts tear down the group too, and PTY processes are left untouched. This is the same idiom already used for socat in internal/port/forward.go, and it's transparently backward-compatible — existing SDKs get whole-tree termination for free and can later drop their workaround. Adds a regression test (TestSendSignal_KillsProcessTree) and bumps the envd version to 0.6.2.

🤖 Generated with Claude Code

envd's SendSignal only signaled the single process it manages (the
command leader), so child processes the command spawned kept running —
and consuming resources — after a kill. Non-PTY commands now start in
their own process group (Setpgid) and SendSignal delivers the signal to
the whole group, terminating the leader together with its children;
command timeouts tear down the group too. PTY processes are left
untouched. Bumps the envd version.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 5, 2026

PR Summary

Medium Risk
Changes how processes are started and signaled in the sandbox; incorrect group signaling could affect unrelated processes, though PTY is excluded and reaping is guarded.

Overview
Non-PTY commands in envd now run in their own process group so SendSignal and command timeouts kill the leader and any child processes it spawned, instead of leaving orphans running after a kill. PTY sessions keep signaling only the managed process. A reaped guard blocks process-group signals after the leader is reaped so a recycled PID is not hit by mistake. A regression test covers whole-tree kill, and envd is bumped to 0.6.2.

Reviewed by Cursor Bugbot for commit a537272. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ensures that non-PTY commands run in their own process group so that signals and context cancellations terminate the entire process tree instead of just the leader process. A test is added to verify this behavior, and the package version is bumped to 0.6.2. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
2749 2 2747 5
View the full list of 2 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestEnvdAccessTokenAutoResumeViaProxy

Flake rate in main: 39.32% (Passed 909 times, Failed 589 times)

Stack Traces | 10.6s run time
=== RUN   TestEnvdAccessTokenAutoResumeViaProxy
=== PAUSE TestEnvdAccessTokenAutoResumeViaProxy
=== CONT  TestEnvdAccessTokenAutoResumeViaProxy
    traffic_access_token_test.go:357: 
        	Error Trace:	.../tests/proxies/traffic_access_token_test.go:357
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestEnvdAccessTokenAutoResumeViaProxy
--- FAIL: TestEnvdAccessTokenAutoResumeViaProxy (10.63s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxAutoResumeViaProxy

Flake rate in main: 39.77% (Passed 907 times, Failed 599 times)

Stack Traces | 17.2s run time
=== RUN   TestSandboxAutoResumeViaProxy
=== PAUSE TestSandboxAutoResumeViaProxy
=== CONT  TestSandboxAutoResumeViaProxy
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"ivzvc0q048au1tyism7vo","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:116: 
        	Error Trace:	.../tests/proxies/auto_resume_test.go:116
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxAutoResumeViaProxy
--- FAIL: TestSandboxAutoResumeViaProxy (17.17s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Comment thread packages/envd/internal/services/process/handler/handler.go
Comment thread packages/envd/internal/services/process/start_test.go
Address PR review: signalProcessGroup bypassed Go's ErrProcessDone
guard, so a stale signal arriving after the leader was reaped (but
before deregistration) could kill(-pid) an unrelated command that
recycled the pid as its process group leader. Bail out when
cmd.ProcessState is set, mirroring os.Process.Signal. Also make the
test's processAlive read /proc/<pid>/stat and treat zombies as not
alive, so it doesn't hang when orphans aren't reaped promptly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 847707bb7d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/envd/internal/services/process/handler/handler.go Outdated
Address PR review: reading cmd.ProcessState in SendSignal raced with
cmd.Wait setting it in the reaper goroutine. Replace the guard with a
handler-level atomic.Bool set immediately after cmd.Wait reaps the
leader, and consult it before sending a process-group signal. The
timeout-driven cmd.Cancel path no longer reads ProcessState and is safe
because exec stops invoking Cancel once the process exits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant