
fix(cloud-tests): graceful manual-step fallback so auto-remediate never shows raw errors#2915

Merged
tofikwest merged 4 commits into main from tofik/auto-remediate-manual-fallback
May 22, 2026

@tofikwest tofikwest commented May 22, 2026

Summary

Customers were seeing a raw "Fix could not be applied — <cryptic error>" message in the Auto-Remediate dialog when the AI's refined plan was rejected by our pre-execution validator, or when AWS rejected a step the executor couldn't auto-repair. This PR converts every such failure path inside AWS executeRemediation into a graceful manual-steps fallback: the API returns real, customer-actionable instructions (AI-generated from the failure context), the trigger task carries them through, and the dialog renders the existing guided-steps UI instead of a red error banner.

Net effect: every fix attempt now ends in either "fix worked" or "here's a concrete checklist you can follow in AWS Console / CLI". No raw errors.

How it works (end-to-end)

┌─ executeRemediation (AWS) ─────────────────────────────────────────┐
│  read-step validation fails       →  manual-steps fallback         │
│  refined plan has no fix steps    →  manual-steps fallback         │
│  refined plan fails validation    →  try AI step-repair → revalidate
│                                   →  still invalid? manual fallback│
│  executor returns error           →  permission error? existing UX │
│                                   →  otherwise: manual fallback    │
└────────────────────────────────────────────────────────────────────┘

API returns { status: 'failed', guidedOnly: true, guidedSteps, error }
        ↓
classifyExecuteResult → { type: 'manual', reason, guidedSteps }
        ↓
remediate-single trigger task → progress.phase = 'manual' + guidedSteps
        ↓
RemediationDialog → switches preview into guidedOnly mode
        ↓
Customer sees: ordered numbered steps, NOT a raw error
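
The routing in the box above can be read as a pure decision function. The sketch below is only an illustration of that ordering, not the PR's code: `AttemptState`, `Outcome`, and `routeOutcome` are hypothetical names, and the real executeRemediation presumably performs these checks inline around async calls.

```typescript
// Hypothetical snapshot of what executeRemediation knows at each checkpoint.
type AttemptState = {
  readStepsValid: boolean;
  fixStepCount: number;
  fixStepsValid: boolean;
  // Result of the single AI repair pass plus re-validation.
  // Only consulted when fixStepsValid is false.
  validAfterRepair: boolean;
  // Left undefined when the executor succeeded.
  executorError?: { isPermissionError: boolean };
};

type Outcome = "success" | "manual-fallback" | "permission-ux";

function routeOutcome(s: AttemptState): Outcome {
  if (!s.readStepsValid) return "manual-fallback";
  if (s.fixStepCount === 0) return "manual-fallback";
  if (!s.fixStepsValid && !s.validAfterRepair) return "manual-fallback";
  if (s.executorError) {
    // Permission errors keep their existing fixScript UX.
    return s.executorError.isPermissionError ? "permission-ux" : "manual-fallback";
  }
  return "success";
}
```

Every branch ends in one of three customer-visible outcomes, which is the property the PR is after: there is no path that returns a raw error.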

Changes

apps/api/src/cloud-security/ai-remediation.service.ts

  • New generateManualSteps(...) — Sonnet-powered. Inputs: finding, failed plan, failure reason. Output: { guidedSteps: string[], reason: string }. Hard fallback to the adapter's remediation text if the AI call itself throws.
  • Exports FindingContext for the orchestration layer.
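
A minimal sketch of the hard-fallback behaviour described for generateManualSteps, under stated assumptions: `manualStepsWithFallback` and `draftSteps` are hypothetical names, the model call is replaced by a synchronous callback so the snippet runs standalone, and treating an empty answer like a failed call is an assumption rather than something the PR states.

```typescript
type ManualSteps = { guidedSteps: string[]; reason: string };

// `draftSteps` stands in for the model call (hypothetical signature).
// The real call would be awaited; it is synchronous here for brevity.
function manualStepsWithFallback(
  failureReason: string,
  adapterRemediationText: string,
  draftSteps: (failureReason: string) => string[],
): ManualSteps {
  try {
    const steps = draftSteps(failureReason)
      .map((s) => s.trim())
      .filter((s) => s.length > 0);
    // Assumption: an empty answer is handled like a failed call.
    if (steps.length > 0) return { guidedSteps: steps, reason: failureReason };
  } catch {
    // Swallowed on purpose: the caller must still receive usable steps.
  }
  // Hard fallback: the adapter's static remediation text as a single step.
  return { guidedSteps: [adapterRemediationText], reason: failureReason };
}
```

The point of the shape is that the function cannot throw, so the orchestration layer never needs its own error path around it.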

apps/api/src/cloud-security/aws-command-executor.ts

  • looksLikeValidationError now matches MissingParameter, "must contain the parameter", "missing parameter", "parameter is required", "must specify". The earlier regex missed EC2-style wording, so the AI step-repair never fired for those findings.
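
For illustration, the broadened matching could look roughly like the following. This is a sketch that covers only the five phrases named above: the patterns the function already matched are not shown in this description, and `looksLikeValidationErrorSketch` plus the case-insensitivity of the phrase patterns are assumptions.

```typescript
// Only the wording added by this PR; the pre-existing patterns are omitted.
const ADDED_VALIDATION_PATTERNS: RegExp[] = [
  /MissingParameter/,
  /must contain the parameter/i,
  /missing parameter/i,
  /parameter is required/i,
  /must specify/i,
];

function looksLikeValidationErrorSketch(message: string): boolean {
  // No `g` flag on the patterns, so `test` carries no state between calls.
  return ADDED_VALIDATION_PATTERNS.some((re) => re.test(message));
}
```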

apps/api/src/cloud-security/remediation.service.ts

  • repairInvalidSteps — parses step indices from validator errors and calls refineStepFromError per offending step before falling back. Closes the gap where the executor's own AI step-repair never got a chance because the plan never reached execution.
  • respondWithManualSteps — generates manual steps, persists the action as failed, returns the response shape the frontend already renders for canAutoFix: false plans.
  • Every throw in executeRemediation swapped for the appropriate fallback. Permission errors still flow through the existing catch (don't shadow the polished fixScript UX).
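
The pre-execution repair pass can be sketched as follows. The validator's actual error wording is not shown in this PR, so the `"Step <n>: <message>"` format with a 1-based index is an assumption, as are the `Step` type and the synchronous `refine` callback standing in for refineStepFromError.

```typescript
type Step = { action: string; params: Record<string, string> };

// Assumed validator error format: "Step <n>: <message>", n is 1-based.
function parseStepIndices(errors: string[]): Map<number, string> {
  const byIndex = new Map<number, string>();
  for (const err of errors) {
    const m = /^Step (\d+):\s*(.*)$/.exec(err);
    if (m) byIndex.set(Number(m[1]) - 1, m[2] ?? "");
  }
  return byIndex;
}

// `refine` is a hypothetical stand-in for refineStepFromError.
// It returns null when it cannot produce a repaired step.
function repairInvalidStepsSketch(
  steps: Step[],
  errors: string[],
  refine: (step: Step, error: string) => Step | null,
): Step[] {
  const failing = parseStepIndices(errors);
  return steps.map((step, i) => {
    const error = failing.get(i);
    if (error === undefined) return step; // untouched steps pass through
    return refine(step, error) ?? step; // keep the original if repair fails
  });
}
```

The caller would then re-validate the returned plan once and fall back to manual steps if it is still invalid, matching the "one repair pass" behaviour described in the commit message.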

apps/api/src/cloud-security/ai-remediation.service.ts (other change)

  • Broader ACTIONABLE_PREFIXES so security-group / IAM-style plans (Authorize/Revoke/Allow/Deny/Disable/Detach/Add/Remove/Register/Deregister/Tag/Untag) produce meaningful willChange diffs instead of {} {}.
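
As a rough illustration of the prefix check, assuming the list is matched against the start of the AWS operation name (the description does not say how the list is applied, and the pre-existing entries are not shown):

```typescript
// Only the prefixes this PR adds; the existing entries are omitted.
const ADDED_ACTIONABLE_PREFIXES = [
  "Authorize", "Revoke", "Allow", "Deny", "Disable", "Detach",
  "Add", "Remove", "Register", "Deregister", "Tag", "Untag",
] as const;

// Hypothetical helper: true when the operation name starts with an added prefix.
function hasAddedActionablePrefix(operation: string): boolean {
  return ADDED_ACTIONABLE_PREFIXES.some((p) => operation.startsWith(p));
}
```

Under that assumption, an operation such as `AuthorizeSecurityGroupIngress` would now count as actionable, while a read-only one such as `DescribeSecurityGroups` still would not.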

apps/app/src/trigger/tasks/cloud-security/execute-result.ts

  • New manual classification + defensive parsing of guidedSteps (strips non-strings, requires guidedOnly: true AND a non-empty list).
  • Permission-error classification still wins when both fields are present.
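
The classification rules above can be sketched like this. It is an illustration only: `classifySketch` and the `permissionError` flag are hypothetical (how the real classifier detects permission errors is not shown here); the `guidedOnly`, `guidedSteps`, `status`, and `error` fields come from the response shape in this PR.

```typescript
type Classification =
  | { type: "success" }
  | { type: "permission"; error: string }
  | { type: "manual"; reason: string; guidedSteps: string[] }
  | { type: "failed"; error: string };

function classifySketch(raw: unknown): Classification {
  if (typeof raw !== "object" || raw === null) {
    return { type: "failed", error: "Malformed response" };
  }
  const r = raw as {
    status?: unknown;
    error?: unknown;
    guidedOnly?: unknown;
    guidedSteps?: unknown;
    permissionError?: unknown; // hypothetical flag, for illustration only
  };
  if (r.status !== "failed") return { type: "success" };
  const error = typeof r.error === "string" ? r.error : "Unknown error";

  // Permission errors win even when guided steps are also present.
  if (r.permissionError === true) return { type: "permission", error };

  // Defensive parsing: keep only string entries.
  const guidedSteps = Array.isArray(r.guidedSteps)
    ? r.guidedSteps.filter((s): s is string => typeof s === "string")
    : [];

  // Manual requires guidedOnly: true AND a non-empty list.
  if (r.guidedOnly === true && guidedSteps.length > 0) {
    return { type: "manual", reason: error, guidedSteps };
  }
  return { type: "failed", error };
}
```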

apps/app/src/trigger/tasks/cloud-security/remediate-single.ts

  • New phase: 'manual' in progress + guidedSteps field.

apps/app/src/app/(app)/[orgId]/cloud-tests/components/RemediationDialog.tsx

  • On phase: 'manual', switch preview into guidedOnly: true rendering. Same UI the dialog already uses for canAutoFix: false plans.
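
How the new phase might reach the dialog, as a small sketch. The `Progress` and `Preview` shapes and the `applyProgress` helper are hypothetical simplifications; only `phase: 'manual'`, `guidedSteps`, and `guidedOnly` are taken from this PR.

```typescript
// Simplified progress payload published by the trigger task (assumed shape).
type Progress =
  | { phase: "executing" }
  | { phase: "success" }
  | { phase: "manual"; guidedSteps: string[] };

// Simplified preview state held by the dialog (assumed shape).
type Preview = { guidedOnly: boolean; guidedSteps: string[] };

function applyProgress(preview: Preview, progress: Progress): Preview {
  if (progress.phase === "manual") {
    // Reuse the rendering already used for canAutoFix: false plans.
    return { guidedOnly: true, guidedSteps: progress.guidedSteps };
  }
  return preview;
}
```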

Batch flows

  • cloud-tests/actions/batch-fix.ts + integrations/[slug]/actions/batch-fix.ts + remediate-batch-helpers.ts treat the manual classification as failed with the AI-generated reason. The per-finding guided steps remain available via the single-fix dialog.

Tests

  • apps/api: 16 tests on ai-remediation.service.spec.ts (+4 new for generateManualSteps). 267/267 cloud-security tests pass.
  • apps/app: 10 tests on execute-result.test.ts (+5 new for the manual classification). All trigger task tests pass.

What this PR is NOT

  • NOT a per-finding audit. We have ~100+ finding types across AWS adapters; verifying each individually requires real-tenant testing and is weeks of work. This PR makes the safety net strong enough that the customer never sees a raw error regardless of which finding it is.
  • NOT a GCP/Azure parity change. GCP and Azure remediation services have the same throw-on-validation patterns (gcp-remediation.service.ts lines 200, 205, 208, 239, 288, 315; azure-remediation.service.ts lines 136, 149, 252) and would benefit from the same treatment. Left for a follow-up PR per the requested scope ("for now just do only for AWS").
  • NOT a fix for every cryptic auto-remediate error. The pattern broadening covers the common AWS error wording we've seen in customer reports; the universal AI step-repair is gated to validation-class errors. Errors AWS classifies as non-validation (e.g., MethodNotAllowed, ResourceConflict) will still bypass AI repair but now end up in the manual-steps fallback instead of as raw errors.

Manual test plan

  • Trigger an auto-fix on a finding known to hit the empty-required-param bug (CloudTrail "No trails configured" was the customer-reported case). Confirm the dialog shows manual steps instead of a red error.
  • Trigger an auto-fix on a finding that succeeds today. Confirm the happy path still completes and the success animation still renders.
  • Trigger an auto-fix that fails with a permission error. Confirm the permission-error UX (fixScript card) still renders — manual fallback should NOT shadow it.
  • Trigger a batch fix that includes findings that fall back to manual. Confirm the batch UI shows them as failed with the AI-generated reason.

🤖 Generated with Claude Code


Summary by cubic

Adds a graceful manual-steps fallback to AWS auto-remediation so users never see cryptic errors. When a plan is invalid or execution fails (except permission errors), the API returns guided steps and the dialog switches to the guided-only UI.

  • New Features

    • Manual-steps fallback in executeRemediation: on read-step validation failure, empty fix steps, post-repair validation failure, or non-permission execution errors, return { guidedOnly: true, guidedSteps, error }.
    • generateManualSteps builds clear, ordered instructions from the finding, failed plan, and failure reason, with a safe fallback to the adapter’s remediation text.
    • Pre-execution repair: repairInvalidSteps parses validator errors, repairs offending steps with refineStepFromError, then re-validates before falling back.
    • End-to-end surfacing: classifyExecuteResult emits type: 'manual'; remediate-single publishes phase: 'manual' with guidedSteps; RemediationDialog renders guided-only steps; batch-fix marks as failed with the generated reason. Permission errors keep the existing fix-script UX.
  • Bug Fixes

    • Broader AWS validation-error detection (MissingParameter, “missing parameter”, “parameter is required”, “must specify”, etc.) so auto-repair paths trigger reliably.
    • Expanded actionable prefixes (Authorize, Revoke, Allow, Deny, Disable, Detach, Add, Remove, Register, Deregister, Tag, Untag) for more informative willChange diffs.

Written for commit f6c7d94. Summary will update on new commits.

tofikwest and others added 3 commits May 22, 2026 11:28
…coverage

Adds the building blocks for the manual-steps fallback shipped in the
next two commits, plus broadens the pattern matcher and actionable-
prefix list so more findings exercise the existing auto-repair paths
instead of bailing out:

1. New `AiRemediationService.generateManualSteps(...)`: takes the
   finding, the failed plan, and the concrete failure reason, and
   returns real customer-facing manual instructions via Sonnet (kept
   on the cheap model since this only fires on failure paths and is
   plain natural language). Hard fallback to the adapter remediation
   text if the AI call itself throws, so the customer never sees a
   raw error.

2. `looksLikeValidationError` now matches `MissingParameter`,
   "must contain the parameter", "missing parameter",
   "parameter is required", "must specify" — covers the EC2-style
   error wording that the previous regex missed.

3. `ACTIONABLE_PREFIXES` adds `Authorize`, `Revoke`, `Allow`, `Deny`,
   `Disable`, `Detach`, `Add`, `Remove`, `Register`, `Deregister`,
   `Tag`, `Untag`. Security-group / IAM-style fix plans now produce
   meaningful `willChange` diffs instead of `{}` `{}`.

4. Exports `FindingContext` so it can be reused by the orchestration
   service (next commit) when invoking the new fallback path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Customers were seeing raw "Fix could not be applied — <cryptic error>"
when the AI's refined plan failed pre-execution validation or AWS
rejected a step the executor couldn't auto-repair. The fix swaps every
throw inside executeRemediation for a graceful fallback that returns
real, AI-generated manual instructions in the existing `canAutoFix:false`
response shape — so the frontend renders them with the guided-steps UI
it already supports.

Concrete changes inside the AWS executeRemediation flow:

- Hoist `findingCtx` once at the top of the function so the refineFixPlan
  call, the per-step repair callback, and the new fallback path all see
  the same context.

- Read-step validation failures → fall back to manual instead of
  throwing. (Read steps rarely fail; skipping repair here keeps the
  flow simple.)

- "Refined plan has no fix steps" → fall back to manual instead of
  throwing. There's nothing to repair.

- Refined-plan fix-step validation failures → NEW: attempt one AI
  repair pass on the offending steps (`repairInvalidSteps` parses the
  step indices from the validator errors and calls `refineStepFromError`
  per step), then re-validate. If still invalid, fall back to manual.
  Closes the gap where the executor's own AI step-repair never got a
  chance because the plan never reached execution.

- Executor returned an unrecoverable error → fall back to manual,
  except for permission errors which still flow through the existing
  catch block (parseAwsPermissionError already has a polished
  fixScript payload — don't shadow it).

GCP and Azure remediation services have the same throw-on-validation
patterns and would benefit from the same treatment; left for a
follow-up PR per the original scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The API change in the previous commit returns
`{ guidedOnly: true, guidedSteps, error }` when auto-fix gives up. This
commit threads that response shape through the trigger-task progress
metadata and the Remediation dialog so customers actually see the
manual steps instead of a raw error.

- `classifyExecuteResult` recognizes the new shape and emits a
  `{ type: 'manual', reason, guidedSteps }` classification. Defensive
  parsing strips non-string entries and ignores `guidedOnly` without
  real steps. Permission errors keep their existing precedence.

- `remediateSingle` trigger task carries a new `phase: 'manual'` plus
  `guidedSteps` in its progress payload.

- `RemediationDialog` reacts to the new phase by switching its
  preview state into the existing guided-only rendering (same UI used
  for plans where the AI declared `canAutoFix: false` upfront).

- The two batch-fix paths (single-account + integrations) treat the
  manual classification as `failed` with the AI-generated reason — the
  batch UI doesn't render per-finding guided steps, but the
  user-facing message is now meaningful instead of cryptic. The
  per-finding manual steps remain available via the single-fix dialog.

8 new tests on `execute-result.test.ts` (10 total) cover the manual
classification, the precedence rules, and the defensive parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| app | Ready | Preview, Comment | May 22, 2026 3:38pm |
| comp-framework-editor | Ready | Preview, Comment | May 22, 2026 3:38pm |

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| portal | Skipped | | May 22, 2026 3:38pm |


@vercel vercel Bot temporarily deployed to Preview – portal May 22, 2026 15:35 Inactive

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 11 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.


@tofikwest tofikwest merged commit 35af953 into main May 22, 2026
11 checks passed
@tofikwest tofikwest deleted the tofik/auto-remediate-manual-fallback branch May 22, 2026 15:39
@claudfuen

🎉 This PR is included in version 3.63.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
