Skip to content

fix(router): map gRPC errors to proper HTTP codes (#31)#68

Open
Mark Chmarny (mchmarny) wants to merge 1 commit into
agent-substrate:mainfrom
mchmarny:issue-31-router-error-messages
Open

fix(router): map gRPC errors to proper HTTP codes (#31)#68
Mark Chmarny (mchmarny) wants to merge 1 commit into
agent-substrate:mainfrom
mchmarny:issue-31-router-error-messages

Conversation

@mchmarny
Copy link
Copy Markdown

@mchmarny Mark Chmarny (mchmarny) commented May 23, 2026

Fixes #31.

The Envoy ext_proc handler wrapped resume errors with %w, so the gRPC encoding leaked into the response body, and every non-NotFound failure returned 500. The 404 path returned "not found" without indicating what was missing.

This change adds a small mapping layer that translates gRPC status codes to HTTP status codes with short, client-safe bodies:

gRPC code HTTP Body
NotFound 404 actor "<id>" not found
FailedPrecondition 503 actor "<id>" unavailable: <desc> *
Unavailable 503 actor "<id>" unavailable
DeadlineExceeded 504 actor "<id>" request timed out
PermissionDenied 403 actor "<id>" access denied
Unauthenticated 401 actor "<id>" authentication required
ResourceExhausted 429 actor "<id>" rate limited
anything else 500 error resuming actor "<id>"

* FailedPrecondition is the only code whose gRPC desc is preserved verbatim; it carries actionable client-facing context like "no free workers available" and is not security-sensitive. Other codes deliberately drop the desc to avoid leaking server-side detail.

The original error is preserved via Unwrap so existing slog.ErrorContext logging keeps full fidelity.

Two smaller follow-ons made in the same change:

  • ActorResumer no longer translates codes.NotFound into a router-layer sentinel. gRPC errors now propagate untouched so the HTTP boundary owns the mapping. The retry behavior is unchanged (the backoff already exits on any non-Aborted error).
  • The "bad worker IP" branch (returned actor with an unparseable IP) now responds with an opaque actor "<id>" routing failed instead of the previous body that echoed back the invalid IP. The full IP is still logged server-side. This is deliberate; the IP is internal infrastructure detail and shouldn't appear in a client response.

Before / after

Before:

HTTP/1.1 500 Internal Server Error
error resuming actor ctr6: rpc error: code = FailedPrecondition desc = no free workers available
HTTP/1.1 404 Not Found
not found

After:

HTTP/1.1 503 Service Unavailable
actor "ctr6" unavailable: no free workers available
HTTP/1.1 404 Not Found
invalid host "ctr6.example.com": invalid actor_id: must end with actors.resources.substrate.ate.dev, got "ctr6.example.com"

Test plan

  • go test -race ./cmd/atenet/internal/app/router/... — all pass
  • go vet ./... — clean
  • make verify-fmt — clean
  • hack/verify-all.sh (boilerplate, go-generate, gofmt, licenses, mod-tidy) — all pass
  • make build-atenet — succeeds
  • New tests: 9-case table for mapResumeError, plus constructor and contract tests in errors_test.go; existing extproc_test.go cases expanded to cover the new gRPC code mappings end-to-end through handleRequestHeaders.

cc Tim Hockin (@thockin), is this what you had in mind in #31? PTAL

mchmarny

This comment was marked as resolved.

The Envoy ext_proc handler previously wrapped resume errors with %w,
leaking the gRPC encoding ("rpc error: code = ... desc = ...") into the
response body, and routed every non-NotFound failure through 500. It
also returned a generic "not found" body that did not identify which
resource was missing.

This change introduces a small mapping layer that translates gRPC status
codes into appropriate HTTP status codes with short, client-safe bodies:

  NotFound           -> 404
  FailedPrecondition -> 503 (desc preserved, e.g. "no free workers
                             available", which is actionable and not
                             security-sensitive)
  Unavailable        -> 503
  DeadlineExceeded   -> 504
  PermissionDenied   -> 403
  Unauthenticated    -> 401
  ResourceExhausted  -> 429
  other              -> 500 (no detail leak)

The original error is preserved via Unwrap so existing slog.ErrorContext
logging keeps full fidelity.

Layering: ActorResumer no longer translates codes.NotFound into a
router-layer sentinel; gRPC errors propagate untouched so the HTTP
boundary owns the mapping.
@mchmarny Mark Chmarny (mchmarny) force-pushed the issue-31-router-error-messages branch from 018f791 to 0b8af6e Compare May 23, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Router error messages are not great

1 participant