[https://nvbugs/6043291][fix] Add fatal error detection to prevent zombie worker pods#12718
chienchunhung wants to merge 8 commits into NVIDIA:main from
Conversation
f32efd0 to 1a7c2d4

/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #41528 [ run ] triggered by Bot. Commit:
PR_Github #41528 [ run ] completed with state
When the C++ executor crashes (e.g. CUDA OOM), the Python process previously stayed alive with health checks returning 200, creating zombie pods that accept connections but never produce tokens. This adds defense-in-depth:

1. GenerationExecutor.check_health() drains the error queue and checks the fatal error state instead of only checking doing_shutdown
2. GenerationExecutorProxy checks MPI worker future liveness and runs a background error monitor thread for auto-shutdown
3. PyExecutor detects CUDA fatal errors and consecutive error loops, initiating shutdown instead of silently failing each batch
4. The health endpoint raises SIGINT on fatal error to terminate the process (same pattern as the existing CppExecutorError handlers)

NVBug 6043291

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
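The health-endpoint behavior in item 4 can be sketched in isolation. This is an illustrative stand-in, not the PR's actual code: `HealthEndpoint`, the status-code return convention, and the `check_health()` wiring are all assumptions.

```python
import signal

class HealthEndpoint:
    """Minimal sketch of a /health handler over a generation executor."""

    def __init__(self, executor):
        self.executor = executor

    def health(self) -> int:
        """Return an HTTP-style status code for GET /health."""
        if getattr(self.executor, "_fatal_error", None) is not None:
            # Same pattern as the existing CppExecutorError handlers:
            # terminate the whole process rather than serve a dead engine.
            signal.raise_signal(signal.SIGINT)
        if not self.executor.check_health():
            return 503  # unhealthy, but no fatal error latched yet
        return 200
```

The point of the SIGINT branch is that a 503 alone keeps the pod alive; raising the signal lets the orchestrator's normal restart machinery replace the zombie.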
- Join _error_monitor_thread during proxy shutdown to prevent pytest-threadleak detection of the daemon thread
- Update test_health to patch check_health() instead of is_shutdown(), since BaseLLM._check_health() now delegates to executor.check_health()

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
c89534a to 6fb3968
…udget
- Fix self-join deadlock: skip _error_monitor_thread.join() when
shutdown() runs on the monitor thread itself
- Replace _handle_background_error() call in monitor loop with direct
queue drain to avoid re-entrancy (documented for main-thread use)
- Replace simple consecutive error counter with token-bucket scheme:
budget starts at 1.0, recovers at 0.1/s of error-free time, each
transient error costs 0.1 and each severe error costs 0.5
- Split error classification into three tiers:
- immediate_fatal: device-side assert, illegal address, launch failure
— crash on first occurrence
- severe: CUDA OOM, NCCL error — drains budget 5x faster
- transient: all other errors — tolerated until budget exhausted
- Use epsilon comparison (< 1e-9) to avoid IEEE 754 rounding issues
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
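The token-bucket scheme described in the commit above can be sketched as a standalone class. The class name, constants, and fake-clock injection here are illustrative assumptions, not the PR's actual code; only the numbers (start at 1.0, recover 0.1/s, transient costs 0.1, severe costs 0.5, epsilon threshold) come from the commit message.

```python
import time

_EPSILON = 1e-9        # near-zero threshold; avoids IEEE 754 rounding issues
_RECOVERY_RATE = 0.1   # budget regained per error-free second
_COSTS = {"transient": 0.1, "severe": 0.5}

class ErrorBudget:
    """Token bucket: tolerate transient errors, trip fast on severe ones."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._budget = 1.0
        self._last_error_time = self._now()

    def consume(self, severity: str) -> bool:
        """Charge one error; return True when the budget is exhausted."""
        now = self._now()
        # Recover budget for the error-free interval, capped at 1.0.
        elapsed = now - self._last_error_time
        self._budget = min(1.0, self._budget + elapsed * _RECOVERY_RATE)
        self._last_error_time = now
        self._budget -= _COSTS[severity]
        return self._budget < _EPSILON
```

With these constants, roughly ten back-to-back transient errors (or two severe ones) exhaust the budget, while a steady trickle of one transient error per second is sustained indefinitely.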
- Add Args/Returns docstrings to _classify_error, _is_fatal_error, _consume_error_budget, _handle_errors, check_health, _set_fatal_error
- Add type annotations to class attributes and return types
- Add an inline comment block explaining the token-bucket fields
- Move the time import to module level in proxy.py
- Parametrize tests: consolidate is_shutdown, check_health, proxy worker states, gRPC health, OpenAI health, and BaseLLM delegation into pytest.mark.parametrize for better coverage with fewer test methods
- Promote _make_request to a module-level helper to remove duplication

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
3e5a9b1 to 5442515

/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #41722 [ run ] triggered by Bot. Commit:
📝 Walkthrough

Adds deterministic error classification and a token-bucket error budget to PyExecutor, records the first fatal background error at the executor level, introduces a proxy monitor thread that surfaces fatal conditions, and propagates fatal-status checks through the gRPC/OpenAI/LLM health endpoints to trigger coordinated shutdown.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant OpenAI as "OpenAI Server"
    participant LLM as "BaseLLM"
    participant Executor as "GenerationExecutor"
    participant ProxyMonitor as "Proxy Monitor Thread"
    participant PyExec as "PyExecutor"
    Client->>OpenAI: GET /health
    OpenAI->>LLM: _check_health()
    LLM->>Executor: check_health()
    Executor->>PyExec: inspect error queue / fatal state
    ProxyMonitor->>Executor: periodic poll (mpi_futures / error_queue)
    alt fatal condition detected
        ProxyMonitor->>Executor: _set_fatal_error(err)
        ProxyMonitor->>Executor: pre_shutdown() / shutdown()
        Executor->>PyExec: enqueue_shutdown_request()
    end
    Executor-->>LLM: check_health() = False (if fatal)
    LLM-->>OpenAI: unhealthy
    alt fatal present
        OpenAI->>OpenAI: signal.raise_signal(SIGINT)
    else
        OpenAI-->>Client: 503 LLM unavailable
    end
```
```mermaid
sequenceDiagram
    participant PyExec as "PyExecutor"
    participant Classifier as "classify_error()"
    participant Budget as "Error Budget"
    participant ReqQueue as "Executor Request Queue"
    PyExec->>Classifier: _classify_error(error_msg)
    Classifier-->>PyExec: immediate_fatal / severe / transient
    alt immediate_fatal
        PyExec->>PyExec: set _fatal_error immediately
    else severe or transient
        PyExec->>Budget: _consume_error_budget(severity)
        Budget-->>PyExec: budget_exhausted? (after recovery/deduction)
    end
    alt budget exhausted or immediate fatal
        PyExec->>ReqQueue: enqueue_shutdown_request()
        PyExec->>PyExec: mark is_shutdown, fail/complete active requests
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
3496-3523: ⚠️ Potential issue | 🔴 Critical: Fatal errors need to stop local scheduling immediately.

After setting _fatal_error, Line 3522 only enqueues a shutdown request. This class's loop guards still key off self.is_shutdown, not _fatal_error (Lines 858-860), so requests already sitting in waiting_queue — or ahead of the shutdown sentinel in executor_request_queue — can still be activated on the next iteration and run after a fatal CUDA error. Please flip the local shutdown state and fail/drain pending queued work in the same fatal path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3496 - 3523, After setting self._fatal_error in the fatal-error path, immediately set the local shutdown flag (self.is_shutdown = True) and fail/drain any pending work in waiting_queue (and any requests still in executor_request_queue if applicable) similar to how active_requests are handled: mark each queued request state=LlmRequestState.GENERATION_COMPLETE, populate error_responses for their py_request_id (use the same error_msg and client id logic used for active_requests), call self._enqueue_responses on those items, and call self._terminate_request for each; only after draining waiting_queue and owner-level queues call executor_request_queue.enqueue_shutdown_request() so no queued requests get scheduled after a fatal CUDA error.
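The shape of the fix this prompt describes can be sketched as below. The class is a stub: method names follow the review text (is_shutdown, waiting_queue, enqueue_shutdown_request), but _fail_request stands in for the real mark-complete/enqueue-response/terminate sequence and everything else is an assumption.

```python
class FatalShutdownSketch:
    """Illustrative stand-in for the PyExecutor fatal path; not real code."""

    def __init__(self, executor_request_queue):
        self._fatal_error = None
        self.is_shutdown = False
        self.waiting_queue = []
        self.failed = []
        self.executor_request_queue = executor_request_queue

    def _fail_request(self, request, error_msg):
        # Stand-in for: mark GENERATION_COMPLETE, build the error
        # LlmResponse, _enqueue_responses, and _terminate_request.
        self.failed.append((request, error_msg))

    def handle_fatal(self, error_msg: str) -> None:
        self._fatal_error = RuntimeError(error_msg)
        self.is_shutdown = True               # stop the scheduling loop now
        for request in list(self.waiting_queue):
            self._fail_request(request, error_msg)
        self.waiting_queue.clear()            # nothing left to promote
        # Only after draining local queues, post the shutdown sentinel so
        # no queued request can be scheduled after the fatal CUDA error.
        self.executor_request_queue.enqueue_shutdown_request()
```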
3502-3519: ⚠️ Potential issue | 🔴 Critical: Copy active_requests before clearing them.

When requests is None, Line 3502 aliases failed_requests to self.active_requests. Line 3511 then clears that list before the termination loop, so Lines 3518-3519 never free resources for the failed requests. Broad failures from _forward_step(), _sample_async(), _update_requests(), etc. will therefore leak KV/sequence state.

Suggested fix

```diff
-        failed_requests = requests if requests is not None else self.active_requests
+        failed_requests = list(requests) if requests is not None else list(
+            self.active_requests
+        )
         for request in failed_requests:
             req_id = request.py_request_id
             request.state = LlmRequestState.GENERATION_COMPLETE
             error_responses[req_id] = LlmResponse(
                 request_id=req_id,
@@
         if requests is None:
             self.active_requests.clear()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3502 - 3519, The loop currently aliases failed_requests to self.active_requests when requests is None, then clears self.active_requests before calling _terminate_request, causing resource leaks; change the assignment so that failed_requests is a shallow copy (e.g., failed_requests = list(self.active_requests)) when requests is None, keep the existing branch that filters self.active_requests when requests is provided, and ensure the subsequent calls to _enqueue_responses and _terminate_request iterate over that copied failed_requests so _terminate_request(request) runs for each failed request (reference variables/methods: failed_requests, self.active_requests, _enqueue_responses, _terminate_request).
🧹 Nitpick comments (1)
tensorrt_llm/executor/executor.py (1)
318-320: Keep fatal crashes distinguishable from graceful shutdown.

Now that is_shutdown() returns True for both cases, callers like BaseLLM.generate_async() collapse fatal engine failures into the generic "LLM is shutting down" error. Consider exposing the latched fatal error separately so request paths can surface the real cause while still using is_shutdown() for control flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/executor/executor.py` around lines 318 - 320, is_shutdown() currently conflates graceful shutdown and fatal crashes by returning True whenever self._fatal_error is set; add a separate accessor (e.g., a read-only property or method named fatal_error or get_fatal_error) that returns the latched exception (self._fatal_error) while leaving is_shutdown() unchanged, and update call sites such as BaseLLM.generate_async to first check executor.fatal_error and surface/raise that exception if present before treating is_shutdown() as a generic shutdown case. Ensure the accessor returns None when there is no fatal error and do not change the existing boolean semantics of is_shutdown().
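A minimal sketch of the accessor this prompt asks for. The class and helper below are illustrative assumptions that mirror the names used in the comment, not the repository's actual code.

```python
class GenerationExecutorSketch:
    """Stub showing a read-only fatal_error accessor beside is_shutdown()."""

    def __init__(self):
        self._fatal_error = None
        self.doing_shutdown = False

    @property
    def fatal_error(self):
        """Return the first latched fatal error, or None."""
        return self._fatal_error

    def is_shutdown(self) -> bool:
        # Unchanged boolean semantics: True for graceful OR fatal shutdown.
        return self.doing_shutdown or self._fatal_error is not None

def raise_if_fatal(executor) -> None:
    # Call-site pattern for BaseLLM.generate_async(): surface the real
    # cause before falling back to the generic shutdown error.
    if executor.fatal_error is not None:
        raise executor.fatal_error
    if executor.is_shutdown():
        raise RuntimeError("LLM is shutting down")
```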
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3473-3495: The helper _handle_errors currently always calls
self._consume_error_budget and charges the process-wide fatal budget even for
request-local failures; change _handle_errors(signature) to accept an optional
flag (e.g., charge_process_budget: bool = True) and only call
self._consume_error_budget(error_msg) when charge_process_budget is True (leave
behavior unchanged by default). Update all call sites that represent
request-scoped/user-controlled failures (the validation error path that calls
_handle_errors from the validation code, the KV-transfer timeout caller, and the
guided-decoder failure caller) to pass charge_process_budget=False so those do
not decrement the shared budget; keep fatal-shutdown callers using the default
True. Ensure references to symbols: _handle_errors, _consume_error_budget, and
active_requests are preserved and that tests cover both charged and uncharged
failure paths.
- Around line 3377-3411: The _classify_error logic currently treats "illegal
memory access" as only a severe error because _IMMEDIATE_FATAL_PATTERNS lacks
that text, so add patterns like "illegal memory access" and "illegal address"
(or a single substring that matches the common CUDA/PyTorch phrasing) to
_IMMEDIATE_FATAL_PATTERNS and remove/avoid duplicating it in
_SEVERE_ERROR_PATTERNS; update the classification unit tests to assert that an
error message containing "CUDA error: an illegal memory access was encountered"
(and similar variants) returns "immediate_fatal" from _classify_error,
referencing the _classify_error method and the
_IMMEDIATE_FATAL_PATTERNS/_SEVERE_ERROR_PATTERNS lists.
In `@tensorrt_llm/executor/executor.py`:
- Around line 104-106: The TOCTOU race in Executor._set_fatal_error() must be
fixed by making the check-and-set atomic with a lock: add a threading.Lock
(e.g., self._fatal_error_lock) in the Executor.__init__ and wrap the body of
_set_fatal_error() with that lock so the "if self._fatal_error is None:
self._fatal_error = error" sequence is performed under the lock; this ensures
calls from _error_monitor_loop() and _handle_background_error() cannot overwrite
the first error and preserves the documented first-error-wins behavior.
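The atomic check-and-set that this prompt describes amounts to a few lines; the class wrapper here is a stub (attribute names follow the comment, the rest is an assumption):

```python
import threading

class FatalErrorLatch:
    """First-error-wins latch; safe under concurrent setters."""

    def __init__(self):
        self._fatal_error = None
        self._fatal_error_lock = threading.Lock()

    def set_fatal_error(self, error: BaseException) -> None:
        """Latch the first fatal error; later calls are no-ops."""
        with self._fatal_error_lock:
            if self._fatal_error is None:
                self._fatal_error = error
```

The lock makes the check and the assignment one critical section, so a second writer from _error_monitor_loop() or _handle_background_error() can no longer overwrite the first fatal cause.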
In `@tensorrt_llm/executor/proxy.py`:
- Around line 148-157: The health check path that iterates self.mpi_futures (in
check_health) must not call shutdown() inline when a future is done; instead,
when a done() future is detected capture its exception (using f.exception() if
not f.cancelled()), call self._set_fatal_error(error), and trigger the
non-blocking pre-shutdown path (call the pre_shutdown mechanism or schedule it)
without waiting on other futures or calling shutdown() synchronously; ensure
doing_shutdown is set/checked to avoid double-triggering and avoid letting the
queued worker exception be re-raised by the synchronous shutdown path so
check_health() can reliably return False.
In `@tensorrt_llm/grpc/grpc_request_manager.py`:
- Around line 176-180: The health-check branch in grpc_request_manager.py
currently returns self.llm._executor._fatal_error verbatim (which may include
full tracebacks via CppExecutorError.__str__), so change the return to a
sanitized summary: build a short string like "<ErrorType>: <short message>"
using type(self.llm._executor._fatal_error).__name__ and a trimmed message
(e.g., str(self.llm._executor._fatal_error).splitlines()[0]) and return that to
gRPC clients; keep logging the full error/traceback to your logger (e.g.,
process or executor logger) for diagnostics but do not include the stack in
HealthCheckResponse.status. Ensure this change is applied where the code checks
self.llm._executor.check_health() and references
self.llm._executor._fatal_error.
📒 Files selected for processing (8)

- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/grpc/grpc_request_manager.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/serve/openai_server.py
- tests/unittest/executor/test_fatal_error_health_check.py
- tests/unittest/llmapi/apps/_test_openai_metrics.py
Critical fixes:
- Fix aliased list bug: _handle_errors() iterated self.active_requests then called clear(), so _terminate_request never ran on the fatal path. Now copies the list before clearing.
- Set is_shutdown=True immediately on fatal error to prevent the executor loop from scheduling more requests on a corrupted context.

Robustness improvements:
- Monitor loop and base check_health() now skip per-request errors (RequestError, str) from the error queue instead of promoting them to fatal — a single bad request should not crash the server.
- Base check_health() drains the error queue directly instead of calling _handle_background_error() (avoids re-entrancy/thread issues since health checks run from the event loop thread, not main).

Testability:
- Extract classify_error() and pattern constants into a standalone error_classification.py module (no CUDA/C++ dependencies).
- Tests import and test the real classify_error() function directly.
- Fix pattern: "cudaerrorillegaladd" -> "cudaerrorillegaladdress".

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
- Add "illegal memory access" to the immediate-fatal patterns so "CUDA error: an illegal memory access" is not misclassified as severe (it matched "cuda error" first)
- Sanitize _fatal_error in the gRPC health response: return "TypeName: first line" instead of a full traceback to clients; log the full error to the server logger for diagnostics
- Fix the test mock ConcreteExecutor.check_health() to skip str errors (matching production behavior)
- Fix docstring: "below zero" -> "below a near-zero threshold"

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
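The ordering fix in the first bullet can be illustrated with a sketch of the classifier. The pattern lists below are abbreviated assumptions, not the module's actual contents; the point is that immediate-fatal substrings must be checked before the broader severe substrings like "cuda error".

```python
# Abbreviated, illustrative pattern lists (lowercase substring matching).
_IMMEDIATE_FATAL_PATTERNS = (
    "device-side assert",
    "illegal memory access",
    "cudaerrorillegaladdress",
    "launch failure",
)
_SEVERE_ERROR_PATTERNS = (
    "out of memory",
    "cuda error",
    "nccl error",
)

def classify_error(error_msg: str) -> str:
    """Three-tier classification: immediate_fatal > severe > transient."""
    msg = error_msg.lower()
    # Check immediate-fatal first so "CUDA error: an illegal memory access"
    # is not swallowed by the broader "cuda error" severe pattern.
    if any(p in msg for p in _IMMEDIATE_FATAL_PATTERNS):
        return "immediate_fatal"
    if any(p in msg for p in _SEVERE_ERROR_PATTERNS):
        return "severe"
    return "transient"
```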
@coderabbitai review

✅ Actions performed: Review triggered.

/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #41730 [ run ] triggered by Bot. Commit:
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
3458-3487: ⚠️ Potential issue | 🟠 Major: Fatal shutdown still leaves queued work runnable.

Line 3460 sets is_shutdown, but this path only clears active_requests. should_stop_processing at Lines 859-861 still waits for waiting_queue to empty, and enqueue_shutdown_request() in tensorrt_llm/_torch/pyexecutor/executor_request_queue.py, Lines 133-136, only appends a shutdown item. Already-buffered requests can therefore be promoted and scheduled after the fatal error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3458 - 3487, The fatal-shutdown path sets self._fatal_error and clears active_requests but leaves already-buffered work in the executor_request_queue.waiting_queue eligible for promotion; to fix this, when is_fatal is true (inside the fatal branch around self._fatal_error/self.is_shutdown) also drain or invalidate the executor_request_queue.waiting_queue so no pending requests can be promoted: iterate over executor_request_queue.waiting_queue, mark each as failed (create error LlmResponse entries and call _terminate_request as done for active_requests), then clear waiting_queue (or add a new executor_request_queue.clear_waiting_queue() and call it) before calling enqueue_shutdown_request(); alternatively modify enqueue_shutdown_request() to atomically flush/invalidate waiting_queue when a fatal flag is present (check self._fatal_error) so no buffered requests are scheduled after fatal.
♻️ Duplicate comments (3)
tensorrt_llm/executor/executor.py (1)
304-316: ⚠️ Potential issue | 🔴 Critical: Make _set_fatal_error() actually first-error-wins.

Lines 314-316 are still a check-then-set race. _error_monitor_loop() runs on a background thread, while _handle_background_error() and check_health() can call the same setter from request/health paths, so the second writer can overwrite the first fatal cause.

```bash
#!/bin/bash
set -euo pipefail

echo "Fatal setter call sites:"
rg -n -C2 '\b_set_fatal_error\s*\(' tensorrt_llm/executor
echo
echo "Threaded callers and thread creation:"
rg -n -C3 '_error_monitor_loop|threading\.Thread|check_health\(|_handle_background_error\(' tensorrt_llm/executor
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/executor/executor.py` around lines 304 - 316, The current _set_fatal_error(self, error) does a racey check-then-set; make it truly first-error-wins by protecting the check-and-assign with a mutex (e.g., add a threading.Lock instance on the Executor and acquire it inside _set_fatal_error) or by using an atomic compare-and-swap pattern; specifically, create a lock like self._fatal_lock in the Executor initializer, then in _set_fatal_error() do with self._fatal_lock: if self._fatal_error is None: self._fatal_error = error; logger.error(...). Update only _set_fatal_error (and the Executor init to add the lock) so calls from _error_monitor_loop, _handle_background_error, and check_health all respect the same mutual exclusion and the first writer wins.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

3435-3457: ⚠️ Potential issue | 🟠 Major: Don't charge request-local failures against the process-wide fatal budget.

Line 3456 still burns the shared bucket for every _handle_errors() call, but _fetch_and_activate_new_requests() (Line 2747), _handle_responses() (Lines 3648-3650), and _handle_guided_decoder_errors() (Line 3851) use this helper for request-scoped failures. A burst of bad requests can still flip the whole worker fatal.

Possible direction

```diff
     def _handle_errors(self,
                        error_msg: Optional[str] = None,
                        *,
-                       requests: Optional[List[LlmRequest]] = None) -> None:
+                       requests: Optional[List[LlmRequest]] = None,
+                       charge_process_budget: bool = True) -> None:
@@
-        is_fatal = self._consume_error_budget(error_msg)
+        is_fatal = (
+            self._consume_error_budget(error_msg)
+            if charge_process_budget else False
+        )
```

Then pass charge_process_budget=False from the request-local call sites.
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3435 - 3457, The _handle_errors helper currently always calls self._consume_error_budget(error_msg) which charges the shared process-wide fatal budget even for request-local failures; change _handle_errors to accept an optional flag (e.g., charge_process_budget: bool = True) and only call self._consume_error_budget(error_msg) when that flag is True, leaving request-local failures uncharged; then update the request-scoped call sites _fetch_and_activate_new_requests, _handle_responses, and _handle_guided_decoder_errors to pass charge_process_budget=False when invoking _handle_errors so only true process-level faults consume the shared budget.

tensorrt_llm/executor/proxy.py (1)

149-157: ⚠️ Potential issue | 🔴 Critical: Don't run full shutdown() inline from the worker-crash health paths.

Lines 154-156 and 188-189 call shutdown() as soon as one MPI future is done. But pre_shutdown() still only sends the quit sentinel when all futures are alive (Line 396), so surviving workers never get notified and shutdown() blocks on the remaining f.result() calls. This path can also re-raise out of shutdown() instead of reliably reporting unhealthy.

Safer direction

```diff
-        if not self.doing_shutdown:
-            self.shutdown()
+        if not self.doing_shutdown:
+            self.pre_shutdown()
         return False
 ...
-        if not self.doing_shutdown:
-            self.shutdown()
+        if not self.doing_shutdown:
+            self.pre_shutdown()
         return
```

```diff
-        if all(not f.done() for f in self.mpi_futures):
+        if any(not f.done() for f in self.mpi_futures):
             self.request_queue.put_noblock(None, retry=4)
```

```bash
#!/bin/bash
set -euo pipefail

echo "Health/monitor paths that call shutdown:"
rg -n -C2 'f\.done\(\)|self\.shutdown\(\)' tensorrt_llm/executor/proxy.py
echo
echo "pre_shutdown sentinel condition:"
sed -n '383,399p' tensorrt_llm/executor/proxy.py
echo
echo "shutdown wait loop:"
sed -n '399,421p' tensorrt_llm/executor/proxy.py
```

Also applies to: 178-190
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/executor/proxy.py` around lines 149 - 157, The health-check loop that inspects self.mpi_futures must not call self.shutdown() inline because shutdown() waits for remaining futures and blocks when survivors never receive the quit sentinel; instead, when a future is done set the fatal error via self._set_fatal_error(exc_or_runtimeerror), mark the intent to shutdown by toggling self.doing_shutdown = True (if not already), and return False so the higher-level monitor/cleanup path can perform an orderly shutdown (which will call pre_shutdown() and send sentinels correctly). Update both places that call self.shutdown() from the mpi_futures health checks (the loop using for f in self.mpi_futures and the other health-check block at lines ~178-190) to follow this pattern and avoid re-raising from within shutdown().
🧹 Nitpick comments (2)
tests/unittest/executor/test_fatal_error_health_check.py (2)
126-134: Consider catching a specific exception instead of bare Exception.

The get_nowait() call can only raise queue.Empty. While this mock mirrors production code patterns, using a specific exception makes the test mock's behavior clearer.

♻️ Suggested fix

```diff
+from queue import Queue, Empty
 ...
         try:
             e = self._error_queue.get_nowait()
             self._error_queue.task_done()
             if not isinstance(e, str):
                 self._set_fatal_error(e)
                 self.doing_shutdown = True
-        except Exception:
+        except Empty:
             pass
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/executor/test_fatal_error_health_check.py` around lines 126 - 134, Replace the bare except in the block that calls self._error_queue.get_nowait() with a specific catch for queue.Empty (or Empty if imported) so only the expected no-item case is swallowed; update imports if necessary to reference queue.Empty, leaving the logic that sets _set_fatal_error(e) and toggles doing_shutdown unchanged.
403-455: Thread timing in tests could be sensitive to CI load.

The monitor loop tests use fixed sleep times (time.sleep(0.05), time.sleep(0.2)) for thread synchronization. While these work in normal conditions, they could be flaky in slow CI environments. The 2-second join timeouts provide good safety margins.

If flakiness is observed, consider using synchronization primitives (e.g., threading.Event) instead of fixed sleeps for more deterministic behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/executor/test_fatal_error_health_check.py` around lines 403 - 455, Replace fragile time.sleep calls with threading.Event-based synchronization: create an Event (e.g., started = threading.Event()) and start the monitor thread with a wrapper target that sets started.set() then calls executor._error_monitor_loop(); replace time.sleep(0.05) with started.wait(timeout=1.0) to ensure the monitor is running; instead of time.sleep(0.2) wait up to a short timeout in a loop for the expected condition (e.g., wait for executor._fatal_error to become non-None or for executor.doing_shutdown/_shutdown_called to change) so tests like test_detects_worker_crash, test_detects_error_queue_item, and test_skips_per_request_string_errors use deterministic events and condition waits while still joining the thread afterwards.
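The Event-plus-poll pattern this prompt suggests can be sketched as two small helpers. These helper names are mine, not the test file's; the sketch just shows "wait for the thread to start, then wait for a condition" replacing fixed sleeps.

```python
import threading
import time

def wait_for(predicate, timeout: float = 2.0, interval: float = 0.01) -> bool:
    """Poll predicate until it returns True or the deadline passes."""
    end = time.monotonic() + timeout
    while time.monotonic() < end:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()

def start_monitor(loop_fn):
    """Start loop_fn on a daemon thread and block until it has started."""
    started = threading.Event()

    def target():
        started.set()     # signals "the monitor is definitely running"
        loop_fn()

    t = threading.Thread(target=target, daemon=True)
    t.start()
    assert started.wait(timeout=1.0)
    return t
```

A test would then call start_monitor(executor._error_monitor_loop) and wait_for(lambda: executor._fatal_error is not None) instead of sleeping fixed intervals.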
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/executor/executor.py`:
- Around line 296-300: When draining _error_queue, don't unconditionally mark
every dequeued item fatal; mirror the per-request filtering used elsewhere (see
the RequestError/str handling in the submit/init branch and check_health()). In
the block that processes items from self._error_queue in executor.py, inspect
each dequeued error and only call self._set_fatal_error(e) and self.shutdown()
for errors that are not instances of RequestError or str (treat those as
request-local and skip marking fatal), ensuring you still call task_done() for
each item.
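The filtering rule in this prompt can be sketched as a standalone drain helper. RequestError is stubbed locally because this snippet does not import the real tensorrt_llm class; the function name and callback shape are assumptions.

```python
import queue

class RequestError(Exception):
    """Local stand-in for tensorrt_llm's request-scoped error type."""

def drain_error_queue(error_queue, set_fatal_error):
    """Drain all queued errors; latch and return the first fatal one, if any."""
    fatal = None
    while True:
        try:
            e = error_queue.get_nowait()
        except queue.Empty:
            break
        error_queue.task_done()   # always acknowledge the dequeued item
        if isinstance(e, (RequestError, str)):
            continue              # request-local: do not mark the process fatal
        if fatal is None:
            fatal = e
            set_fatal_error(e)    # first-error-wins
    return fatal
```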
---
Outside diff comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3458-3487: The fatal-shutdown path sets self._fatal_error and
clears active_requests but leaves already-buffered work in the
executor_request_queue.waiting_queue eligible for promotion; to fix this, when
is_fatal is true (inside the fatal branch around
self._fatal_error/self.is_shutdown) also drain or invalidate the
executor_request_queue.waiting_queue so no pending requests can be promoted:
iterate over executor_request_queue.waiting_queue, mark each as failed (create
error LlmResponse entries and call _terminate_request as done for
active_requests), then clear waiting_queue (or add a new
executor_request_queue.clear_waiting_queue() and call it) before calling
enqueue_shutdown_request(); alternatively modify enqueue_shutdown_request() to
atomically flush/invalidate waiting_queue when a fatal flag is present (check
self._fatal_error) so no buffered requests are scheduled after fatal.
---
Duplicate comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3435-3457: The _handle_errors helper currently always calls
self._consume_error_budget(error_msg) which charges the shared process-wide
fatal budget even for request-local failures; change _handle_errors to accept an
optional flag (e.g., charge_process_budget: bool = True) and only call
self._consume_error_budget(error_msg) when that flag is True, leaving
request-local failures uncharged; then update the request-scoped call sites
_fetch_and_activate_new_requests, _handle_responses, and
_handle_guided_decoder_errors to pass charge_process_budget=False when invoking
_handle_errors so only true process-level faults consume the shared budget.
In `@tensorrt_llm/executor/executor.py`:
- Around line 304-316: The current _set_fatal_error(self, error) does a racey
check-then-set; make it truly first-error-wins by protecting the
check-and-assign with a mutex (e.g., add a threading.Lock instance on the
Executor and acquire it inside _set_fatal_error) or by using an atomic
compare-and-swap pattern; specifically, create a lock like self._fatal_lock in
the Executor initializer, then in _set_fatal_error() do with self._fatal_lock:
if self._fatal_error is None: self._fatal_error = error; logger.error(...).
Update only _set_fatal_error (and the Executor init to add the lock) so calls
from _error_monitor_loop, _handle_background_error, and check_health all respect
the same mutual exclusion and the first writer wins.
In `@tensorrt_llm/executor/proxy.py`:
- Around line 149-157: The health-check loop that inspects self.mpi_futures must
not call self.shutdown() inline because shutdown() waits for remaining futures
and blocks when survivors never receive the quit sentinel; instead, when a
future is done set the fatal error via
self._set_fatal_error(exc_or_runtimeerror), mark the intent to shutdown by
toggling self.doing_shutdown = True (if not already), and return False so the
higher-level monitor/cleanup path can perform an orderly shutdown (which will
call pre_shutdown() and send sentinels correctly). Update both places that call
self.shutdown() from the mpi_futures health checks (the loop using for f in
self.mpi_futures and the other health-check block at lines ~178-190) to follow
this pattern and avoid re-raising from within shutdown().
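The non-blocking pattern the comment describes (record the fatal error, mark shutdown intent, return `False`, never call a blocking `shutdown()` inline) might look like this sketch; the `Proxy` class and its fields are simplified stand-ins:

```python
import concurrent.futures as cf

class Proxy:
    """Sketch of a non-blocking MPI-futures health check (illustrative names)."""

    def __init__(self, futures):
        self.mpi_futures = futures
        self.doing_shutdown = False
        self._fatal_error = None

    def _set_fatal_error(self, error):
        if self._fatal_error is None:
            self._fatal_error = error

    def check_health(self) -> bool:
        for f in self.mpi_futures:
            if f.done():  # a worker future finishing early means it crashed
                exc = f.exception() or RuntimeError("worker exited unexpectedly")
                self._set_fatal_error(exc)
                self.doing_shutdown = True  # record intent; cleanup happens elsewhere
                return False                # do NOT call a blocking shutdown() here
        return True

def crash():
    raise RuntimeError("crash")

with cf.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(crash)
    cf.wait([fut])
    proxy = Proxy([fut])
    print(proxy.check_health())  # False
    print(proxy._fatal_error)    # the worker's RuntimeError
```

The higher-level monitor sees `doing_shutdown` and runs the orderly path (`pre_shutdown()` plus sentinels), so the health check never blocks waiting on surviving workers.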
---
Nitpick comments:
In `@tests/unittest/executor/test_fatal_error_health_check.py`:
- Around line 126-134: Replace the bare except in the block that calls
self._error_queue.get_nowait() with a specific catch for queue.Empty (or Empty
if imported) so only the expected no-item case is swallowed; update imports if
necessary to reference queue.Empty, leaving the logic that sets
_set_fatal_error(e) and toggles doing_shutdown unchanged.
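The narrow-catch shape the comment asks for, as a standalone sketch (only `queue.Empty` is swallowed; anything else propagates):

```python
from queue import Queue, Empty

error_queue = Queue()
fatal = None
try:
    e = error_queue.get_nowait()  # raises Empty when the queue has no items
    error_queue.task_done()
    fatal = e
except Empty:
    # Only the expected no-item case is swallowed; any other exception
    # (e.g. from downstream handling) propagates to the caller.
    pass
print(fatal)  # None
```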
- Around line 403-455: Replace fragile time.sleep calls with
threading.Event-based synchronization: create an Event (e.g., started =
threading.Event()) and start the monitor thread with a wrapper target that sets
started.set() then calls executor._error_monitor_loop(); replace
time.sleep(0.05) with started.wait(timeout=1.0) to ensure the monitor is
running; instead of time.sleep(0.2) wait up to a short timeout in a loop for the
expected condition (e.g., wait for executor._fatal_error to become non-None or
for executor.doing_shutdown/_shutdown_called to change) so tests like
test_detects_worker_crash, test_detects_error_queue_item, and
test_skips_per_request_string_errors use deterministic events and condition
waits while still joining the thread afterwards.
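The Event-plus-condition-wait pattern suggested above can be sketched in isolation. The monitor body here is a trivial stand-in for `executor._error_monitor_loop()`:

```python
import threading
import time

def wait_for(predicate, timeout=1.0, interval=0.01):
    """Poll a condition with a deadline instead of a fixed sleep."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()

started = threading.Event()
state = {"fatal": None}

def monitor_loop():
    started.set()  # signal the test that the monitor is running
    state["fatal"] = RuntimeError("worker crashed")  # stand-in for detection

t = threading.Thread(target=monitor_loop, daemon=True)
t.start()
assert started.wait(timeout=1.0)                     # replaces time.sleep(0.05)
assert wait_for(lambda: state["fatal"] is not None)  # replaces time.sleep(0.2)
t.join(timeout=1.0)
print("ok")
```

Both waits return as soon as the condition holds and fail loudly on timeout, so the tests are deterministic rather than tuned to sleep durations.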
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 996f84b2-399b-4653-8d3d-4755442fb2cf
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/error_classification.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/grpc/grpc_request_manager.py
- tests/unittest/executor/test_fatal_error_health_check.py
026c8de to
5d12cea
Compare
|
@coderabbitai review |
✅ Actions performed: Review triggered.
|
🧹 Nitpick comments (3)
tests/unittest/executor/test_fatal_error_health_check.py (1)

122-135: Mock's `check_health()` doesn't fully mirror production `RequestError` filtering.

The production `check_health()` in `executor.py` filters both `str` and `RequestError` as per-request errors (line 352: `if not isinstance(e, (str, RequestError))`), but this mock only checks for `str` (line 129). Consider adding `RequestError` filtering to match production behavior more closely, or document this simplification.

Suggested fix:

```diff
+# Import or define a mock RequestError for test accuracy
+class MockRequestError(Exception):
+    pass
+
 class ConcreteExecutor:
     ...
     def check_health(self) -> bool:
         if self.doing_shutdown or self._fatal_error is not None:
             return False
         if not self._error_queue.empty():
             try:
                 e = self._error_queue.get_nowait()
                 self._error_queue.task_done()
-                if not isinstance(e, str):
+                if not isinstance(e, (str, MockRequestError)):
                     self._set_fatal_error(e)
                     self.doing_shutdown = True
             except Empty:
                 pass
             return self._fatal_error is None and not self.doing_shutdown
         return True
```

🤖 Prompt for AI Agents:
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/executor/test_fatal_error_health_check.py` around lines 122-135: the mock's check_health method currently treats only str as per-request errors; update it to mirror production by treating both str and RequestError as non-fatal per-request errors: when pulling an item from _error_queue in check_health, check isinstance(e, (str, RequestError)) and only call _set_fatal_error and set doing_shutdown when the exception is not one of those; adjust imports or references so the RequestError type is available to the test/mock.

tensorrt_llm/executor/executor.py (1)

348-357: Narrow the exception handler in `check_health()` to avoid silently swallowing unexpected errors.

The `except Exception: pass` block is too broad. It's intended to catch `queue.Empty` from `get_nowait()`, but will also silently swallow any other unexpected exception (e.g., if `_set_fatal_error` or `shutdown` raises). Consider catching `Empty` specifically.

Suggested fix:

```diff
+from queue import Empty
 ...
         if not self._error_queue.empty():
             try:
                 e = self._error_queue.get_nowait()
                 self._error_queue.task_done()
                 if not isinstance(e, (str, RequestError)):
                     self._set_fatal_error(e)
                     self.shutdown()
-            except Exception:
+            except Empty:
                 pass
         return self._fatal_error is None and not self.doing_shutdown
```

🤖 Prompt for AI Agents:
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/executor/executor.py` around lines 348-357: in check_health(), narrow the broad "except Exception: pass" around self._error_queue.get_nowait() to only catch the queue.Empty exception (or queue.Empty and any specific expected exceptions) so you don't silently swallow errors from _set_fatal_error or shutdown; update the except clause to "except queue.Empty:" (and add the appropriate import/reference to queue.Empty) and let other exceptions propagate so they can be handled/logged by callers.

tensorrt_llm/executor/proxy.py (1)

198-215: Consider narrowing the inner exception handler to `Empty`.

The inner `except Exception: pass` at lines 210-211 is intended to catch `queue.Empty` from `get_nowait()`, but will also swallow unexpected errors. The outer handler at lines 214-215 is appropriate for monitor resilience.

Suggested fix:

```diff
         if not self._error_queue.empty():
             try:
                 e = self._error_queue.get_nowait()
                 self._error_queue.task_done()
                 if isinstance(e, (str, RequestError)):
                     continue
                 self._set_fatal_error(e)
                 logger.error(
                     "Error monitor: background error detected: "
                     f"{e!r}")
                 if not self.doing_shutdown:
                     self.pre_shutdown()
-            except Exception:
+            except Empty:
                 pass
```

🤖 Prompt for AI Agents:
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/executor/proxy.py` around lines 198-215: the inner exception handler around self._error_queue.get_nowait() is too broad and should only catch queue.Empty to avoid swallowing real errors; change the inner except Exception to except queue.Empty (or import Empty and use except Empty) so only the expected empty-queue case is ignored, while leaving the outer broad except Exception intact for monitor resilience; keep references to self._error_queue.get_nowait(), self._error_queue.task_done(), self._set_fatal_error(), self.pre_shutdown(), self._fatal_error and self.doing_shutdown when applying the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tensorrt_llm/executor/executor.py`:
- Around line 348-357: In check_health(), narrow the broad "except Exception:
pass" around self._error_queue.get_nowait() to only catch the queue.Empty
exception (or queue.Empty and any specific expected exceptions) so you don't
silently swallow errors from _set_fatal_error or shutdown; update the except
clause to "except queue.Empty:" (and add the appropriate import/reference to
queue.Empty) and let other exceptions propagate so they can be handled/logged by
callers.
In `@tensorrt_llm/executor/proxy.py`:
- Around line 198-215: The inner exception handler around
self._error_queue.get_nowait() is too broad and should only catch queue.Empty to
avoid swallowing real errors; change the inner except Exception to except
queue.Empty (or import Empty and use except Empty) so only the expected
empty-queue case is ignored, while leaving the outer broad except Exception
intact for monitor resilience; keep references to
self._error_queue.get_nowait(), self._error_queue.task_done(),
self._set_fatal_error(), self.pre_shutdown(), self._fatal_error and
self.doing_shutdown when applying the change.
In `@tests/unittest/executor/test_fatal_error_health_check.py`:
- Around line 122-135: The mock's check_health method currently treats only str
as per-request errors; update it to mirror production by treating both str and
RequestError as non-fatal per-request errors: when pulling an item from
_error_queue in check_health, check isinstance(e, (str, RequestError)) and only
call _set_fatal_error and set doing_shutdown when the exception is not one of
those; adjust imports or references so RequestError type is available to the
test/mock.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 69cb9bdd-c2c7-449a-9293-ea9a19f51a06
📒 Files selected for processing (3)
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/proxy.py
- tests/unittest/executor/test_fatal_error_health_check.py
- _handle_background_error() queue drain now filters RequestError/str (mirrors check_health/monitor behavior, prevents per-request errors from poisoning _fatal_error)
- Use pre_shutdown() instead of shutdown() in proxy check_health() and _error_monitor_loop() to avoid blocking on f.result() for surviving workers
- Fix pre_shutdown sentinel: all() -> any() so surviving workers get the quit signal even when one has already died
- Test mocks: except Exception -> except Empty, shutdown -> pre_shutdown

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
5d12cea to
97a81ef
Compare
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #41750 [ run ] triggered by Bot. Commit: |
|
PR_Github #41750 [ run ] completed with state
|
Description
Summary
- New `check_health()` method on `GenerationExecutor` that drains the error queue directly (skipping per-request errors) and checks fatal error state, replacing the previous `is_shutdown()`-only checks.
- Background error monitoring in `GenerationExecutorProxy` to detect MPI worker crashes even when no `generate()` calls or health checks are in flight.
When a worker encounters an unrecoverable error (e.g. CUDA OOM or NCCL crash), the existing code only fails the in-flight requests but keeps the server alive. Kubernetes health probes continue to return 200 because they only check
`is_shutdown()`, which remains `False`. The pod stays in `Running` state indefinitely, accepting new requests but never processing them.
- `PyExecutor` (engine-level):
  - New `error_classification.py` module (no CUDA/C++ dependencies) for classifying fatal errors.
  - On fatal error: set `is_shutdown = True` immediately, fail all active requests (via a list copy to avoid an aliased-list bug), and enqueue a shutdown request.
- `GenerationExecutor` (base class):
  - `_fatal_error` field + `_set_fatal_error()` (first-error-wins).
  - `check_health()` drains the error queue directly via `get_nowait()` (not via `_handle_background_error()`, which is documented for main-thread use). Per-request errors (`RequestError`, `str`) are skipped.
  - `_handle_background_error()` queue drain path also filters `RequestError`/`str` to avoid poisoning `_fatal_error` from per-request failures.
  - `is_shutdown()` also returns True when a fatal error is set.
- `GenerationExecutorProxy` (MPI multi-process):
  - `check_health()` inspects MPI worker futures for unexpected exits. Calls `pre_shutdown()` (not `shutdown()`, which would block on `f.result()` for surviving workers).
  - `_error_monitor_loop` daemon thread polls every ~5s, drains the error queue directly, skips per-request errors. Also uses `pre_shutdown()`.
  - `pre_shutdown()` sentinel condition fixed: `all()` → `any()` so surviving workers still get the quit signal when one has already died.
  - `shutdown()` skips `_error_monitor_thread.join()` when called from the monitor thread itself.
GrpcRequestManager.health_checkusescheck_health()and surfaces a sanitized fatal error summary ("TypeName: first line") to clients; full error logged server-side.OpenAIServer /healthraisesSIGINTon fatal error to terminate the process.BaseLLM._check_health()delegates toexecutor.check_health().test_healthupdated to patchcheck_health()instead ofis_shutdown().Test coverage
classify_error()error_classification.pyviaimportlibwithout loading C++ extensions)is_shutdownset on fatalGenerationExecutor_set_fatal_errorfirst-wins,is_shutdown(4 states),check_health(4 states incl. error queue drain with per-request skip)GenerationExecutorProxy_error_monitor_loop/healthBaseLLM._check_healthPR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.