Smart assert v1 + v2 #146

rcholic · 2026-01-15T04:07:42Z

Phase plan (P0/P1/P2) — next 2 weeks

Phase P0 (Days 1–5): v1 schema + state fields end-to-end (blocking)

Objective: add state fields to the canonical snapshot schema and make them queryable/assertable in both SDKs.

Deliverables

sentience-chrome (raw extraction)

Add raw DOM state signals to emitted element payload:
- input_value (or value) for inputs/textarea/select (with redaction rules)
- input_type (to support password redaction)
- checked / aria_checked
- disabled / aria_disabled
- aria_expanded
- name (best-effort): aria-label, aria-labelledby, associated <label for=...>, placeholder fallback
Privacy/safety
- If input_type=password: omit value or set value_redacted=true
- Clip value to a max length (e.g., 200 chars) to reduce PII risk + payload bloat
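
The redaction rules above can be sketched in a few lines. This is only an illustration of the rule, written in Python for brevity (the extension itself would apply it in its own code); the function name `redact_value` and the constant `MAX_VALUE_LEN` are hypothetical, and the 200-character limit is the example figure from the plan, not a settled value.

```python
from typing import Any, Dict, Optional

MAX_VALUE_LEN = 200  # assumed limit, taken from the "e.g., 200 chars" note


def redact_value(input_type: Optional[str], value: Optional[str]) -> Dict[str, Any]:
    """Return the value-related fields to emit for one element (hypothetical helper)."""
    if value is None:
        return {}
    if (input_type or "").lower() == "password":
        # Never emit password contents; flag the omission instead.
        return {"value_redacted": True}
    # Clip long values to limit PII exposure and payload size.
    return {"value": value[:MAX_VALUE_LEN]}
```

Returning a dict of fields rather than a bare string lets the caller merge the result into the element payload without special-casing the redacted branch.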

gateway (canonical response schema)

Extend gateway/src/snapshot/types.rs:
- Add the new fields into Attributes and/or RawElement
- Extend SmartElement output schema to include:
  - name, value (redacted/clipped), input_type
  - aria_checked, aria_disabled, aria_expanded
  - optional normalized booleans: checked, disabled, expanded
Update gateway/src/snapshot/processing.rs mapping:
- pass-through raw state fields into SmartElement
- keep existing ranking/refinement unchanged

sdk-python

Update sentience/models.py::Element with optional fields matching gateway output.
Update query engine selector fields to support:
- checked=true|false|mixed
- disabled=true|false
- expanded=true|false
- value="...", value~"...", name~"..." (if exposed)
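
As a rough sketch of how these selector terms might be parsed and applied, the snippet below assumes space-separated terms, `=` for exact match, and `~` for case-insensitive substring match. The real query engine grammar may differ; `parse_selector` and `matches` are hypothetical names used only for illustration.

```python
import re
from typing import Dict, List, Tuple

Term = Tuple[str, str, str]  # (field, operator, expected value)

# field, operator (= exact, ~ substring), then a quoted or bare value
_TERM = re.compile(r'(\w+)(=|~)(?:"([^"]*)"|(\S+))')


def parse_selector(selector: str) -> List[Term]:
    """Split a selector string into (field, op, value) terms."""
    terms: List[Term] = []
    for m in _TERM.finditer(selector):
        field, op, quoted, bare = m.groups()
        terms.append((field, op, quoted if quoted is not None else bare))
    return terms


def matches(element: Dict[str, object], terms: List[Term]) -> bool:
    """True if the element satisfies every term."""
    for field, op, expected in terms:
        actual = element.get(field)
        if actual is None:
            return False
        # Booleans are compared via their lowercase string form ("true"/"false").
        text = str(actual).lower() if isinstance(actual, bool) else str(actual)
        if op == "=" and text != expected:
            return False
        if op == "~" and expected.lower() not in text.lower():
            return False
    return True
```
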
Add v1 verification helpers in sentience/verification.py (implemented as predicates over query(...)):
- is_enabled(selector) / is_disabled(selector)
- is_checked(selector) / is_unchecked(selector)
- value_contains(selector, substr) / value_equals(selector, value)
- is_expanded(selector) / is_collapsed(selector)
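
A minimal sketch of how such helpers could be built as predicates over `query(...)` follows. The `query` function here is a hypothetical stand-in that matches on an exact `name`, and treating a missing state field as `False` is an assumption of this sketch, not a decision recorded in the plan.

```python
from typing import Callable, Dict, List

Element = Dict[str, object]
Snapshot = List[Element]
Predicate = Callable[[Snapshot], bool]


def query(snapshot: Snapshot, selector: str) -> List[Element]:
    """Hypothetical stand-in for the SDK query engine: exact match on 'name'."""
    return [el for el in snapshot if el.get("name") == selector]


def _state(field: str, expected: bool) -> Callable[[str], Predicate]:
    """Build a helper that checks one boolean state field on all matches."""
    def make(selector: str) -> Predicate:
        def predicate(snapshot: Snapshot) -> bool:
            found = query(snapshot, selector)
            # No match fails the predicate instead of passing vacuously.
            # A missing field counts as False (assumption); "mixed" matches neither.
            return bool(found) and all(
                (el.get(field) or False) == expected for el in found
            )
        return predicate
    return make


is_enabled = _state("disabled", False)
is_disabled = _state("disabled", True)
is_checked = _state("checked", True)
is_unchecked = _state("checked", False)
is_expanded = _state("expanded", True)
is_collapsed = _state("expanded", False)


def value_contains(selector: str, substr: str) -> Predicate:
    def predicate(snapshot: Snapshot) -> bool:
        return any(substr in str(el.get("value", "")) for el in query(snapshot, selector))
    return predicate
```

Because each helper returns a plain function of a snapshot, the predicates stay testable without a browser, which matches the "pure unit tests" goal in Tests (P0).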

sdk-ts

Update src/types.ts::Element with optional fields matching gateway output.
Update query engine selector fields to match python semantics.
Add v1 verification helpers in src/verification.ts mirroring python.

sentience-core (checkpoint)

Decision checkpoint: do we need sentience-core changes?
- If any core algorithms (role inference / text extraction) should use new fields, update core traits + implementations.
- Otherwise, keep core unchanged (preferred for scope control).

Tests (P0)

Gateway: add unit tests for JSON (de)serialization + passthrough mapping for new fields.
SDKs:
- Add query engine tests validating selector parsing + filtering for each state field.
- Add verification predicate tests (pure unit tests; no browser needed).
Extension:
- Add a small JS test harness (or fixture) to validate emitted raw fields for a controlled HTML snippet.

Phase P1 (Days 6–10): v1 runtime ergonomics + failure intelligence

Objective: make assertions production-grade without requiring Studio.

Deliverables

Recommended API shape: `AssertionHandle.eventually(...)`

Adding assertEventually() / assertDoneEventually() creates a second “family” of runtime methods. A better UX (closer to Jest/Playwright/Cypress) is:

Keep existing single-shot assert_() / assert() behavior unchanged (returns bool, emits trace events).
Add a non-breaking builder that returns an AssertionHandle:
- Python: runtime.check(predicate, label=..., required=False) → AssertionHandle
- TS: runtime.check(predicate, label, { required }) → AssertionHandle
AssertionHandle supports:
- .once() (single evaluation; delegates to existing assert_()/assert())
- .eventually(...) (retry loop with fresh snapshots + backoff)
- optional sugar for task completion (e.g. runtime.checkDone(...).eventually(...)), but keep the core retry mechanism shared

Note: In Python, assert is a keyword; keep assert_ naming in the DSL/predicates and runtime method names.
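
To make the proposed shape concrete, here is a small asyncio sketch of a handle with `.once()` and `.eventually(...)`. It is not the SDK implementation: the constructor arguments, the `take_snapshot` callable, and the `backoff` parameter are assumptions for illustration, and `.once()` evaluates the predicate directly rather than delegating to the existing `assert_()`.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List

Snapshot = List[Dict[str, object]]
Predicate = Callable[[Snapshot], bool]


class AssertionHandle:
    """Illustrative handle; field names and constructor are hypothetical."""

    def __init__(self, take_snapshot: Callable[[], Awaitable[Snapshot]],
                 predicate: Predicate, label: str, required: bool = False):
        self._take_snapshot = take_snapshot
        self._predicate = predicate
        self.label = label
        self.required = required

    async def once(self) -> bool:
        """Single evaluation against one fresh snapshot."""
        return self._predicate(await self._take_snapshot())

    async def eventually(self, timeout_s: float = 10.0, poll_s: float = 0.25,
                         backoff: float = 1.5) -> bool:
        """Retry with fresh snapshots until the predicate passes or time runs out."""
        deadline = time.monotonic() + timeout_s
        delay = poll_s
        while True:
            if await self.once():
                return True
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            # Sleep for the current delay, but never past the deadline.
            await asyncio.sleep(min(delay, remaining))
            delay *= backoff
```

Keeping the retry loop inside the handle means any task-completion sugar can reuse it instead of adding a second family of runtime methods.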

sdk-python (AgentRuntime)

Add AssertionHandle + runtime.check(...) returning it.
Implement await handle.eventually(timeout_s=10, poll_s=0.25, min_confidence=0.7, max_retries=...).
Optional alias (only if needed): assert_eventually(...) can remain as a thin wrapper that internally calls runtime.check(...).eventually(...).
Add standardized failure reason codes into emitted verification event details:
- no_snapshot, no_match, match_offscreen, match_occluded, state_mismatch
Add nearest-match suggestions (top N):
- based on text similarity + bbox proximity to expected query (when possible)
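
One possible way to score nearest-match suggestions is sketched below, using `difflib.SequenceMatcher` for text similarity and center-to-center distance for bbox proximity. The weighting, the 100-pixel distance scale, the `(x, y, width, height)` bbox layout, and the `text`/`bbox` field names are all assumptions of this sketch.

```python
from difflib import SequenceMatcher
from math import hypot
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: x, y, width, height


def _center(bbox: BBox) -> Tuple[float, float]:
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)


def nearest_matches(elements: List[Dict[str, object]], expected_text: str,
                    expected_bbox: Optional[BBox] = None, top_n: int = 3,
                    text_weight: float = 0.7) -> List[Dict[str, object]]:
    """Rank elements by text similarity, blended with proximity when a bbox is known."""
    scored = []
    for el in elements:
        text = str(el.get("text", "")).lower()
        score = SequenceMatcher(None, expected_text.lower(), text).ratio()
        if expected_bbox is not None and el.get("bbox") is not None:
            (ex, ey), (cx, cy) = _center(expected_bbox), _center(el["bbox"])
            # Proximity decays from 1.0 as the centers move apart.
            proximity = 1.0 / (1.0 + hypot(ex - cx, ey - cy) / 100.0)
            score = text_weight * score + (1.0 - text_weight) * proximity
        scored.append((score, el))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [el for _, el in scored[:top_n]]
```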

sdk-ts (AgentRuntime)

Add AssertionHandle + runtime.check(...).
Implement await handle.eventually({ timeoutMs: 10_000, pollMs: 250, minConfidence: 0.7, maxRetries }).
Optional alias: assertEventually(...) can be a thin wrapper over check(...).eventually(...) if desired for discoverability.
Same failure reason codes + nearest-match suggestions into trace events.

CLI-first artifacts (both SDKs)

Optional: on failure, save the following to a local path for debugging:
- snapshot JSON (with redacted fields)
- screenshot (if available)

Tests (P1)

Unit tests for retry loop semantics using mocked snapshots (no browser required).
Unit tests for nearest-match scoring.
Add “golden” JSON fixtures to ensure failure payload is stable for CI logs.

Phase P2 (Days 11–14): v2 snapshot confidence/exhaustion + minimal vision fallback

Objective: stop agents failing silently on unstable pages; provide deterministic escalation.

P2.1 Snapshot confidence + exhaustion

sentience-chrome

Emit snapshot attempt metrics alongside raw elements:
- document_ready_state
- node_count
- quiet_ms (MutationObserver-based)
- optional layout_delta (if feasible without major overhead)

gateway

Extend snapshot response with diagnostics (instead of meta):
- confidence (0..1)
- reasons[]
- metrics (raw metrics above, for debugging)
- (optional) attempt, exhausted for retry loops
Compute confidence in gateway (cheap + explainable):
- combine document_ready_state, quiet_ms, node_count, and a coarse signal such as the interactive element count
Define exhaustion semantics:
- if confidence remains below threshold after N resnapshot attempts → snapshot_exhausted
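
The confidence and exhaustion rules could look roughly like the following. The gateway is written in Rust, so this Python version only illustrates the logic; every threshold, penalty weight, and reason string here is an assumed placeholder, as is the `interactive_count` metric name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metrics:
    document_ready_state: str
    node_count: int
    quiet_ms: int
    interactive_count: int = 0  # hypothetical name for the coarse signal


@dataclass
class Diagnostics:
    confidence: float
    reasons: List[str] = field(default_factory=list)


def compute_confidence(m: Metrics) -> Diagnostics:
    """Start from 1.0 and subtract a penalty per weak signal (weights are assumptions)."""
    score = 1.0
    reasons: List[str] = []
    if m.document_ready_state != "complete":
        score -= 0.4
        reasons.append("document_not_complete")
    if m.quiet_ms < 500:
        score -= 0.3
        reasons.append("dom_still_mutating")
    if m.node_count < 50:
        score -= 0.2
        reasons.append("few_nodes")
    if m.interactive_count == 0:
        score -= 0.1
        reasons.append("no_interactive_elements")
    return Diagnostics(confidence=max(0.0, round(score, 2)), reasons=reasons)


def is_exhausted(confidences: List[float], threshold: float = 0.7,
                 max_attempts: int = 3) -> bool:
    """True once the last max_attempts snapshots all stayed below threshold."""
    recent = confidences[-max_attempts:]
    return len(recent) >= max_attempts and all(c < threshold for c in recent)
```

An additive penalty model keeps the score explainable: each entry in `reasons` corresponds to exactly one deduction.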

sdk-python + sdk-ts

Update snapshot types to include diagnostics (and keep it optional for backward compatibility).
Update .eventually() to:
- respect min_confidence
- stop retrying when exhausted
- emit a structured failure event (snapshot_exhausted) with reasons/metrics

P2.2 Vision fallback (verifier-only, last resort)

sdk-python

Use existing LLMProvider.generate_with_image(...):
- if supports_vision() is true
- ask a narrow yes/no question: “Is condition X satisfied? Answer YES/NO and 1 sentence.”
Wire into .eventually() after snapshot exhaustion (with an explicit option/flag so callers can enable/disable vision fallback per assertion).
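
The narrow yes/no contract makes the reply easy to parse defensively. The sketch below covers only prompt construction and verdict parsing; it does not call `LLMProvider.generate_with_image(...)`, whose exact signature is not shown here. `build_vision_prompt` and `parse_verdict` are hypothetical helper names.

```python
import re
from typing import Optional, Tuple


def build_vision_prompt(condition: str) -> str:
    """Phrase the question so the model answers YES/NO plus one sentence."""
    return (
        f'Is the following condition satisfied? "{condition}" '
        "Answer YES or NO and 1 sentence."
    )


def parse_verdict(reply: str) -> Tuple[Optional[bool], str]:
    """Return (verdict, explanation); verdict is None if the reply is not YES/NO."""
    # Skip leading punctuation or markdown, then require a whole-word yes/no.
    m = re.match(r"\W*(yes|no)\b\W*(.*)", reply, re.IGNORECASE | re.DOTALL)
    if m is None:
        return None, reply.strip()
    return m.group(1).lower() == "yes", m.group(2).strip()
```

Returning `None` for an unparseable reply lets the caller treat it as "no verdict" rather than silently mapping it to a pass or a fail.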

sdk-ts

Extend LLMProvider interface (backward compatible):
- supportsVision(): boolean (default false in base class)
- generateWithImage(systemPrompt, userPrompt, imageBase64, options?)
Implement for providers where feasible:
- OpenAI: use vision-capable chat format
- Anthropic: use image blocks if supported by SDK
- Gemini: image parts
- GLM: only if supported; otherwise supportsVision=false
Wire into .eventually() after exhaustion (with an explicit option/flag so callers can enable/disable vision fallback per assertion).

Tests (P2)

Mocked tests for confidence/exhaustion logic (no browser).
Mocked tests for vision fallback invocation + result parsing (YES/NO).
Contract tests ensuring snapshot.diagnostics is optional and backward compatible.

Next 2–4 weeks (v2 hardening) — phased priorities

Phase P0 (Week 3): EscalationPolicy + structured failure events (runtime-level)

Deliverables

Introduce FailureKind + RecoveryAction enums and emit them in trace + return them to callers.
Add EscalationPolicy focused on assertion execution (not full agent orchestration):
- bounded retries + resnapshot backoff
- exhaustion thresholds
- optional vision fallback
Provide a clean hook for agent layer to decide:
- “switch LLM model” (outside runtime)
- “reset checkpoint / reload” (outside runtime)
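
A small sketch of what a deterministic policy might look like is given below. The enum members and the `decide` method are assumptions chosen for illustration (the member values reuse reason codes mentioned earlier in this plan); the actual `FailureKind` and `RecoveryAction` sets may differ.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    NO_MATCH = "no_match"
    STATE_MISMATCH = "state_mismatch"
    SNAPSHOT_EXHAUSTED = "snapshot_exhausted"


class RecoveryAction(Enum):
    RESNAPSHOT = "resnapshot"
    VISION_FALLBACK = "vision_fallback"
    ESCALATE_TO_AGENT = "escalate_to_agent"  # hook for model switch / reload outside runtime


@dataclass(frozen=True)
class EscalationPolicy:
    max_retries: int = 3
    allow_vision: bool = False

    def decide(self, failure: FailureKind, retries_so_far: int) -> RecoveryAction:
        """Pure function of its inputs, so decisions are reproducible in tests."""
        if failure is FailureKind.SNAPSHOT_EXHAUSTED:
            # Retrying will not help once snapshots are exhausted.
            if self.allow_vision:
                return RecoveryAction.VISION_FALLBACK
            return RecoveryAction.ESCALATE_TO_AGENT
        if retries_so_far < self.max_retries:
            return RecoveryAction.RESNAPSHOT
        return RecoveryAction.ESCALATE_TO_AGENT
```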

Tests

Deterministic policy decisions given counters + failure events.

Phase P1 (Week 3–4): State matcher completeness + normalization

Deliverables

Extend state coverage:
- aria-pressed, aria-selected, role=switch, select/option value
Normalize semantics across DOM + ARIA (disabled attribute vs aria-disabled)
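
One possible normalization rule is sketched here: an element counts as disabled if either the native attribute or `aria-disabled="true"` says so, and `aria-checked="mixed"` is preserved as a third state. These precedence choices and the function names are assumptions for illustration, not the agreed semantics.

```python
from typing import Optional, Union


def normalize_disabled(disabled: Optional[bool], aria_disabled: Optional[str]) -> bool:
    """Disabled if the native attribute is set or aria-disabled is "true"."""
    return bool(disabled) or (aria_disabled or "").lower() == "true"


def normalize_checked(checked: Optional[bool],
                      aria_checked: Optional[str]) -> Optional[Union[bool, str]]:
    """Return True, False, "mixed", or None when no state is exposed."""
    aria = (aria_checked or "").lower()
    if aria == "mixed":
        return "mixed"
    if checked is not None:
        # Native state takes precedence over aria-checked true/false (assumption).
        return checked
    if aria in ("true", "false"):
        return aria == "true"
    return None
```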

Tests

Fixture-driven state conformance tests across both SDKs.

Phase P2 (Week 4): Diff-based assertions (action-effect verification)

Deliverables

Track previous_snapshot in AssertContext (both SDKs) and in AgentRuntime.snapshot().
Implement diff predicates:
- diff.added, diff.removed, diff.modified, diff.count_added, etc.
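
A snapshot diff of this kind could be computed as below, assuming each element carries a stable identifier. The `id` key, the `SnapshotDiff` class, and `diff_snapshots` are hypothetical; how elements are actually matched across snapshots is not specified in this plan.

```python
from dataclasses import dataclass
from typing import Dict, List

Element = Dict[str, object]


@dataclass
class SnapshotDiff:
    added: List[Element]
    removed: List[Element]
    modified: List[Element]

    @property
    def count_added(self) -> int:
        return len(self.added)


def diff_snapshots(previous: List[Element], current: List[Element],
                   key: str = "id") -> SnapshotDiff:
    """Compare two snapshots by a stable element key (assumed to exist)."""
    prev = {el[key]: el for el in previous}
    curr = {el[key]: el for el in current}
    return SnapshotDiff(
        added=[curr[k] for k in curr if k not in prev],
        removed=[prev[k] for k in prev if k not in curr],
        # "modified" holds the new version of elements whose fields changed.
        modified=[curr[k] for k in curr if k in prev and curr[k] != prev[k]],
    )
```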

Tests

Unit tests on synthetic snapshots; optional integration tests in browser environments.

Phase P3 (Week 4): Vision fallback upgrade (optional)

Deliverables

If verifier-only is stable, add optional “candidate proposal” mode (bbox candidates) for a limited set of assertion types.
Keep invariants: assertions don’t change; fallback only changes verification/perception.

SentienceDEV added 6 commits January 14, 2026 20:03

P0 done

85a332d

P1 done

f0b1907

remove assertTrue

d1c2bf5

P2 with mutation observer

a421c68

updated readme

f4600c3

updated readme

b5cf83c

rcholic merged commit 5a4fe92 into main Jan 15, 2026
3 checks passed

rcholic deleted the assert_v2 branch January 15, 2026 05:55

2 participants
