@rcholic rcholic commented Jan 15, 2026

Phase plan (P0/P1/P2) — next 2 weeks

Phase P0 (Days 1–5): v1 schema + state fields end-to-end (blocking)

Objective: add state fields to the canonical snapshot schema and make them queryable/assertable in both SDKs.

Deliverables

sentience-chrome (raw extraction)
  • Add raw DOM state signals to emitted element payload:
    • input_value (or value) for inputs/textarea/select (with redaction rules)
    • input_type (to support password redaction)
    • checked / aria_checked
    • disabled / aria_disabled
    • aria_expanded
    • name (best-effort): aria-label, aria-labelledby, associated <label for=...>, placeholder fallback
  • Privacy/safety
    • If input_type=password: omit value or set value_redacted=true
    • Clip value to a max length (e.g., 200 chars) to reduce PII risk + payload bloat
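The privacy rules above are small enough to pin down in a few lines. The extension itself is JavaScript; the sketch below uses Python only to make the intended behavior concrete. `redact_value`, `StatePayload`, and `MAX_VALUE_LEN` are hypothetical names, and the 200-character cap is just the example figure from the plan.

```python
from typing import Optional, TypedDict

MAX_VALUE_LEN = 200  # assumed cap, taken from the "e.g., 200 chars" suggestion


class StatePayload(TypedDict, total=False):
    input_type: str
    value: str
    value_redacted: bool


def redact_value(input_type: Optional[str], value: Optional[str]) -> StatePayload:
    """Apply the password and clipping rules to one raw input value."""
    payload: StatePayload = {}
    if input_type is not None:
        payload["input_type"] = input_type
    if value is None:
        return payload
    if input_type == "password":
        # Never emit the secret itself; only flag that a value existed.
        payload["value_redacted"] = True
        return payload
    # Clip everything else to limit PII exposure and payload size.
    payload["value"] = value[:MAX_VALUE_LEN]
    return payload
```

This variant takes the "set value_redacted=true" branch of the rule; omitting the field entirely would be an equally valid reading.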
gateway (canonical response schema)
  • Extend gateway/src/snapshot/types.rs:
    • Add the new fields into Attributes and/or RawElement
    • Extend SmartElement output schema to include:
      • name, value (redacted/clipped), input_type
      • aria_checked, aria_disabled, aria_expanded
      • optional normalized booleans: checked, disabled, expanded
  • Update gateway/src/snapshot/processing.rs mapping:
    • pass-through raw state fields into SmartElement
    • keep existing ranking/refinement unchanged
sdk-python
  • Update sentience/models.py::Element with optional fields matching gateway output.
  • Update query engine selector fields to support:
    • checked=true|false|mixed
    • disabled=true|false
    • expanded=true|false
    • value="...", value~"...", name~"..." (if exposed)
  • Add v1 verification helpers in sentience/verification.py (implemented as predicates over query(...)):
    • is_enabled(selector) / is_disabled(selector)
    • is_checked(selector) / is_unchecked(selector)
    • value_contains(selector, substr) / value_equals(selector, value)
    • is_expanded(selector) / is_collapsed(selector)
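To illustrate "predicates over query(...)", here is a minimal sketch. The `Element` fields, the `query` stand-in (role plus exact name instead of the real selector string), and the rule that a missing `disabled` signal counts as enabled are all assumptions for illustration, not the SDK's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    # Hypothetical subset of the Element model; every state field is optional.
    role: str
    name: Optional[str] = None
    value: Optional[str] = None
    checked: Optional[bool] = None
    disabled: Optional[bool] = None


Snapshot = List[Element]
Predicate = Callable[[Snapshot], bool]


def query(snapshot: Snapshot, role: str, name: Optional[str] = None) -> List[Element]:
    """Stand-in for the real query engine: filter by role and exact name."""
    return [e for e in snapshot if e.role == role and (name is None or e.name == name)]


def is_disabled(role: str, name: Optional[str] = None) -> Predicate:
    return lambda snap: any(e.disabled is True for e in query(snap, role, name))


def is_enabled(role: str, name: Optional[str] = None) -> Predicate:
    # Assumption: an element with no disabled signal is treated as enabled.
    return lambda snap: any(e.disabled is not True for e in query(snap, role, name))


def is_checked(role: str, name: Optional[str] = None) -> Predicate:
    return lambda snap: any(e.checked is True for e in query(snap, role, name))


def value_contains(role: str, substr: str, name: Optional[str] = None) -> Predicate:
    return lambda snap: any(
        e.value is not None and substr in e.value for e in query(snap, role, name)
    )
```

Because each helper returns a plain function of a snapshot, the predicate tests listed under Tests (P0) need no browser.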
sdk-ts
  • Update src/types.ts::Element with optional fields matching gateway output.
  • Update query engine selector fields to match python semantics.
  • Add v1 verification helpers in src/verification.ts mirroring python.
sentience-core (checkpoint)
  • Decision checkpoint: do we need sentience-core changes?
    • If any core algorithms (role inference / text extraction) should use new fields, update core traits + implementations.
    • Otherwise, keep core unchanged (preferred for scope control).

Tests (P0)

  • Gateway: add unit tests for JSON (de)serialization + passthrough mapping for new fields.
  • SDKs:
    • Add query engine tests validating selector parsing + filtering for each state field.
    • Add verification predicate tests (pure unit tests; no browser needed).
  • Extension:
    • Add a small JS test harness (or fixture) to validate emitted raw fields for a controlled HTML snippet.

Phase P1 (Days 6–10): v1 runtime ergonomics + failure intelligence

Objective: make assertions production-grade without requiring Studio.

Deliverables

Recommended API shape: AssertionHandle.eventually(...)

Adding assertEventually() / assertDoneEventually() creates a second “family” of runtime methods. A better UX (closer to Jest/Playwright/Cypress) is:

  • Keep existing single-shot assert_() / assert() behavior unchanged (returns bool, emits trace events).
  • Add a non-breaking builder that returns an AssertionHandle:
    • Python: runtime.check(predicate, label=..., required=False) → AssertionHandle
    • TS: runtime.check(predicate, label, { required }) → AssertionHandle
  • AssertionHandle supports:
    • .once() (single evaluation; delegates to existing assert_()/assert())
    • .eventually(...) (retry loop with fresh snapshots + backoff)
    • optional sugar for task completion (e.g. runtime.checkDone(...).eventually(...)), but keep the core retry mechanism shared

Note: In Python, assert is a keyword; keep assert_ naming in the DSL/predicates and runtime method names.
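A rough sketch of the handle shape, assuming an async snapshot source. The constructor arguments, the `backoff` and `max_poll_s` parameters, and the fact that `once()` evaluates the predicate directly (rather than delegating to the existing `assert_()`) are simplifications for illustration only.

```python
import asyncio
import time
from typing import Awaitable, Callable, List

Snapshot = List[dict]
Predicate = Callable[[Snapshot], bool]


class AssertionHandle:
    """Hypothetical handle returned by runtime.check(...)."""

    def __init__(
        self,
        take_snapshot: Callable[[], Awaitable[Snapshot]],
        predicate: Predicate,
        label: str,
    ) -> None:
        self._take_snapshot = take_snapshot
        self._predicate = predicate
        self.label = label

    async def once(self) -> bool:
        # Single evaluation against one fresh snapshot.
        return self._predicate(await self._take_snapshot())

    async def eventually(
        self,
        timeout_s: float = 10.0,
        poll_s: float = 0.25,
        backoff: float = 1.5,
        max_poll_s: float = 2.0,
    ) -> bool:
        # Retry with a fresh snapshot each time until success or deadline.
        deadline = time.monotonic() + timeout_s
        delay = poll_s
        while True:
            if await self.once():
                return True
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            await asyncio.sleep(min(delay, remaining))
            delay = min(delay * backoff, max_poll_s)
```

Keeping `once()` and `eventually()` on the same object means the retry loop is written once and any task-completion sugar can reuse it.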

sdk-python (AgentRuntime)
  • Add AssertionHandle + runtime.check(...) returning it.
  • Implement await handle.eventually(timeout_s=10, poll_s=0.25, min_confidence=0.7, max_retries=...).
  • Optional alias (only if needed): assert_eventually(...) can remain as a thin wrapper that internally calls runtime.check(...).eventually(...).
  • Add standardized failure reason codes into emitted verification event details:
    • no_snapshot, no_match, match_offscreen, match_occluded, state_mismatch
  • Add nearest-match suggestions (top N):
    • based on text similarity + bbox proximity to expected query (when possible)
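One possible shape for the reason codes and the nearest-match scoring, as a sketch. The enum values come from the list above; `Candidate`, the 0.7 text weight, and the 1000 px distance normalization are assumptions, and `difflib.SequenceMatcher` is just a convenient stand-in for whatever text similarity is chosen.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from enum import Enum
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # x, y, width, height


class FailureReason(str, Enum):
    NO_SNAPSHOT = "no_snapshot"
    NO_MATCH = "no_match"
    MATCH_OFFSCREEN = "match_offscreen"
    MATCH_OCCLUDED = "match_occluded"
    STATE_MISMATCH = "state_mismatch"


@dataclass
class Candidate:
    text: str
    bbox: BBox


def _center(bbox: BBox) -> Tuple[float, float]:
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)


def nearest_matches(
    expected_text: str,
    candidates: List[Candidate],
    expected_bbox: Optional[BBox] = None,
    top_n: int = 3,
    text_weight: float = 0.7,  # assumed weighting
    max_dist: float = 1000.0,  # assumed normalization distance in pixels
) -> List[Tuple[float, Candidate]]:
    """Rank candidates by text similarity, blended with bbox proximity if known."""
    scored = []
    for cand in candidates:
        text_score = SequenceMatcher(None, expected_text.lower(), cand.text.lower()).ratio()
        if expected_bbox is None:
            score = text_score
        else:
            dist = math.dist(_center(expected_bbox), _center(cand.bbox))
            proximity = max(0.0, 1.0 - dist / max_dist)
            score = text_weight * text_score + (1 - text_weight) * proximity
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Returning the score alongside each candidate keeps the failure payload explainable in CI logs.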
sdk-ts (AgentRuntime)
  • Add AssertionHandle + runtime.check(...).
  • Implement await handle.eventually({ timeoutMs: 10_000, pollMs: 250, minConfidence: 0.7, maxRetries }).
  • Optional alias: assertEventually(...) can be a thin wrapper over check(...).eventually(...) if desired for discoverability.
  • Same failure reason codes + nearest-match suggestions into trace events.
CLI-first artifacts (both SDKs)
  • Optional: on failure, save to a local path for debugging:
    • snapshot JSON (redacted fields)
    • screenshot (if available)

Tests (P1)

  • Unit tests for retry loop semantics using mocked snapshots (no browser required).
  • Unit tests for nearest-match scoring.
  • Add “golden” JSON fixtures to ensure failure payload is stable for CI logs.

Phase P2 (Days 11–14): v2 snapshot confidence/exhaustion + minimal vision fallback

Objective: stop agents failing silently on unstable pages; provide deterministic escalation.

P2.1 Snapshot confidence + exhaustion

sentience-chrome
  • Emit snapshot attempt metrics alongside raw elements:
    • document_ready_state
    • node_count
    • quiet_ms (MutationObserver-based)
    • optional layout_delta (if feasible without major overhead)
gateway
  • Extend snapshot response with diagnostics (instead of meta):
    • confidence (0..1)
    • reasons[]
    • metrics (raw metrics above, for debugging)
    • (optional) attempt, exhausted for retry loops
  • Compute confidence in gateway (cheap + explainable):
    • combine document_ready_state, quiet_ms, node_count, and a coarse signal such as the interactive element count
  • Define exhaustion semantics:
    • if confidence remains below threshold after N resnapshot attempts → snapshot_exhausted
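A "cheap + explainable" score could be as simple as subtracting fixed penalties and recording a reason for each. The gateway is Rust; this Python sketch only illustrates the logic. Every threshold, penalty, and reason string below is an assumed placeholder, as are the `Metrics`/`Diagnostics` shapes and the `interactive_count` field.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metrics:
    document_ready_state: str
    node_count: int
    quiet_ms: int
    interactive_count: int  # assumed name for the coarse "signal"


@dataclass
class Diagnostics:
    confidence: float
    reasons: List[str] = field(default_factory=list)


def compute_confidence(m: Metrics) -> Diagnostics:
    """Start at 1.0 and subtract a fixed penalty per failed check."""
    score = 1.0
    reasons: List[str] = []
    if m.document_ready_state != "complete":
        score -= 0.4
        reasons.append("document_not_complete")
    if m.quiet_ms < 500:
        score -= 0.3
        reasons.append("dom_still_mutating")
    if m.node_count < 20:
        score -= 0.2
        reasons.append("few_nodes")
    if m.interactive_count == 0:
        score -= 0.1
        reasons.append("no_interactive_elements")
    return Diagnostics(confidence=max(0.0, round(score, 2)), reasons=reasons)


def is_exhausted(confidences: List[float], threshold: float = 0.7, max_attempts: int = 3) -> bool:
    """True when the last max_attempts snapshots all stayed below threshold."""
    recent = confidences[-max_attempts:]
    return len(recent) >= max_attempts and all(c < threshold for c in recent)
```

With penalties like these, the `reasons` list explains exactly why a score dropped, which is the information `snapshot_exhausted` needs to carry.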
sdk-python + sdk-ts
  • Update snapshot types to include diagnostics (and keep it optional for backward compatibility).
  • Update .eventually() to:
    • respect min_confidence
    • stop retrying when exhausted
    • emit a structured failure event (snapshot_exhausted) with reasons/metrics

P2.2 Vision fallback (verifier-only, last resort)

sdk-python
  • Use existing LLMProvider.generate_with_image(...):
    • if supports_vision() is true
    • ask a narrow yes/no question: “Is condition X satisfied? Answer YES/NO and 1 sentence.”
  • Wire into .eventually() after snapshot exhaustion (with an explicit option/flag so callers can enable/disable vision fallback per assertion).
sdk-ts
  • Extend LLMProvider interface (backward compatible):
    • supportsVision(): boolean (default false in base class)
    • generateWithImage(systemPrompt, userPrompt, imageBase64, options?)
  • Implement for providers where feasible:
    • OpenAI: use vision-capable chat format
    • Anthropic: use image blocks if supported by SDK
    • Gemini: image parts
    • GLM: only if supported; otherwise supportsVision=false
  • Wire into .eventually() after exhaustion (with an explicit option/flag so callers can enable/disable vision fallback per assertion).
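The narrow YES/NO contract is what makes the fallback testable without a model. A sketch of the prompt and the reply parser follows; `build_vision_prompt` and `parse_vision_reply` are hypothetical helper names, and no provider call is shown because the exact Python signature of `generate_with_image` is not spelled out here.

```python
import re
from typing import Optional, Tuple


def build_vision_prompt(condition: str) -> str:
    """Phrase the assertion as a narrow yes/no question."""
    return (
        f'Is the following condition satisfied? "{condition}" '
        "Answer YES or NO and one sentence."
    )


def parse_vision_reply(reply: str) -> Tuple[Optional[bool], str]:
    """Return (verdict, explanation); verdict is None if the reply is unparseable."""
    match = re.match(
        r"\s*(YES|NO)\b[\s.,:;-]*(.*)", reply, flags=re.IGNORECASE | re.DOTALL
    )
    if match is None:
        # Treat anything that does not lead with YES/NO as "no verdict".
        return None, reply.strip()
    verdict = match.group(1).upper() == "YES"
    return verdict, match.group(2).strip()
```

Returning `None` for an unparseable reply lets the caller report a distinct failure instead of silently treating it as NO.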

Tests (P2)

  • Mocked tests for confidence/exhaustion logic (no browser).
  • Mocked tests for vision fallback invocation + result parsing (YES/NO).
  • Contract tests ensuring snapshot.diagnostics is optional and backward compatible.

Next 2–4 weeks (v2 hardening) — phased priorities

Phase P0 (Week 3): EscalationPolicy + structured failure events (runtime-level)

Deliverables

  • Introduce FailureKind + RecoveryAction enums and emit them in trace + return them to callers.
  • Add EscalationPolicy focused on assertion execution (not full agent orchestration):
    • bounded retries + resnapshot backoff
    • exhaustion thresholds
    • optional vision fallback
  • Provide a clean hook for the agent layer to decide:
    • “switch LLM model” (outside runtime)
    • “reset checkpoint / reload” (outside runtime)
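To make "deterministic policy decisions" concrete, here is a sketch of a pure decision function. The enum members, the `decide`/`backoff_s` method names, and the default numbers are all assumptions; the plan only names the `FailureKind`, `RecoveryAction`, and `EscalationPolicy` types.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    # Hypothetical members; the real set would be defined in the runtime.
    NO_MATCH = "no_match"
    STATE_MISMATCH = "state_mismatch"
    LOW_CONFIDENCE = "low_confidence"
    SNAPSHOT_EXHAUSTED = "snapshot_exhausted"


class RecoveryAction(Enum):
    RETRY_RESNAPSHOT = "retry_resnapshot"
    VISION_FALLBACK = "vision_fallback"
    ESCALATE_TO_AGENT = "escalate_to_agent"  # model switch / reload happen outside runtime


@dataclass(frozen=True)
class EscalationPolicy:
    max_retries: int = 3
    base_backoff_s: float = 0.25
    allow_vision: bool = False

    def decide(self, kind: FailureKind, attempt: int) -> RecoveryAction:
        """Pure function of (failure kind, attempt counter), so it is easy to test."""
        out_of_budget = attempt >= self.max_retries
        if kind is FailureKind.SNAPSHOT_EXHAUSTED or out_of_budget:
            if self.allow_vision:
                return RecoveryAction.VISION_FALLBACK
            return RecoveryAction.ESCALATE_TO_AGENT
        return RecoveryAction.RETRY_RESNAPSHOT

    def backoff_s(self, attempt: int) -> float:
        # Exponential resnapshot backoff: base, 2x base, 4x base, ...
        return self.base_backoff_s * (2 ** attempt)
```

`ESCALATE_TO_AGENT` is the hook point: the runtime only reports it, and the agent layer decides whether to switch models or reset.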

Tests

  • Deterministic policy decisions given counters + failure events.

Phase P1 (Week 3–4): State matcher completeness + normalization

Deliverables

  • Extend state coverage:
    • aria-pressed, aria-selected, role=switch, select/option value
  • Normalize semantics across DOM + ARIA (disabled attribute vs aria-disabled)
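As one example of the normalization, a sketch of merging a native boolean with its ARIA counterpart. The precedence chosen here (native `True` wins, otherwise a parseable ARIA value, otherwise the native value) is an assumption that still needs to be agreed on; `normalize_bool` is a hypothetical helper name.

```python
from typing import Optional


def normalize_bool(native: Optional[bool], aria: Optional[str]) -> Optional[bool]:
    """Merge a native DOM boolean (e.g. disabled) with its ARIA string (e.g. aria-disabled)."""
    if native is True:
        # A natively disabled control is disabled regardless of ARIA.
        return True
    if aria is not None:
        lowered = aria.strip().lower()
        if lowered == "true":
            return True
        if lowered == "false":
            return False
    # No usable ARIA signal: fall back to the native value (False or unknown).
    return native
```

Returning `None` when neither signal is present keeps "unknown" distinct from "false", which matters for the optional fields in the schema.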

Tests

  • Fixture-driven state conformance tests across both SDKs.

Phase P2 (Week 4): Diff-based assertions (action-effect verification)

Deliverables

  • Track previous_snapshot in AssertContext (both SDKs) and in AgentRuntime.snapshot().
  • Implement diff predicates:
    • diff.added, diff.removed, diff.modified, diff.count_added, etc.
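A sketch of how the diff object behind these predicates might be computed. It assumes elements are dicts carrying an `id` that is stable across consecutive snapshots, which the plan does not state; `SnapshotDiff` and `diff_snapshots` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Element = Dict[str, object]


@dataclass
class SnapshotDiff:
    added: List[Element] = field(default_factory=list)
    removed: List[Element] = field(default_factory=list)
    modified: List[Element] = field(default_factory=list)

    @property
    def count_added(self) -> int:
        return len(self.added)


def diff_snapshots(previous: List[Element], current: List[Element], key: str = "id") -> SnapshotDiff:
    """Compare two snapshots element by element, keyed on an assumed stable id."""
    prev_by_id = {e[key]: e for e in previous}
    curr_by_id = {e[key]: e for e in current}
    added = [e for k, e in curr_by_id.items() if k not in prev_by_id]
    removed = [e for k, e in prev_by_id.items() if k not in curr_by_id]
    # "Modified" means same id but any field differs; holds the current version.
    modified = [e for k, e in curr_by_id.items() if k in prev_by_id and prev_by_id[k] != e]
    return SnapshotDiff(added=added, removed=removed, modified=modified)
```

If ids turn out not to be stable between snapshots, the key would need to become a fingerprint (role, name, bbox bucket), and that choice would dominate the design.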

Tests

  • Unit tests on synthetic snapshots; optional integration tests in browser environments.

Phase P3 (Week 4): Vision fallback upgrade (optional)

Deliverables

  • If verifier-only is stable, add optional “candidate proposal” mode (bbox candidates) for a limited set of assertion types.
  • Keep invariants: assertions don’t change; fallback only changes verification/perception.

@rcholic rcholic merged commit 5a4fe92 into main Jan 15, 2026
3 checks passed
@rcholic rcholic deleted the assert_v2 branch January 15, 2026 05:55