[Enhancement] Robust Snapshot Architecture: Dual-Source Accessibility, Defensive Href Extraction & Download Link Detection #675

@JaviMaligno

Related Issues

Problem Statement

While working on browser automation for AI agents, we identified several reliability gaps in snapshot/extraction that affect real-world usage:

  1. Href extraction edge cases - Some dynamically rendered links, especially in SPAs, don't expose href reliably through the accessibility tree alone
  2. No proactive download identification - Agents must parse snapshot text manually to find downloadable files
  3. Single-source accessibility - Relying solely on Puppeteer's snapshot can miss semantics (as discussed in #363, "Accessibility tree/element(s) snapshot - exposing semantics, roles, states, ARIA,...")
  4. Snapshot fragility - Individual element failures can break the entire snapshot

Proposed Improvements

We've implemented solutions for each of these and deployed them in production on Azure:

1. Dual-Fallback Href Extraction

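// Function declaration passed to Runtime.callFunctionOn; `this` is the link element
function () {
  // Prefer the resolved href property, then fall back to the raw attribute
  return this.href || this.getAttribute('href') || '';
}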

This handles edge cases where the standard property read fails (related to Puppeteer #6311 discussion).
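
To make the fallback concrete, here is a minimal sketch of how the function could be packaged as parameters for CDP's Runtime.callFunctionOn. The names LinkLike, hrefWithFallback and buildHrefCall are hypothetical and not taken from the deployed implementation. It assumes the caller already holds the remote objectId of the link element.

```typescript
// Minimal shape of the element the function runs against (hypothetical type).
interface LinkLike {
  href?: string;
  getAttribute(name: string): string | null;
}

// Runs in the page with `this` bound to the link element.
// Tries the resolved href property first, then the raw attribute, then ''.
function hrefWithFallback(this: LinkLike): string {
  return this.href || this.getAttribute('href') || '';
}

// Builds the parameter object for CDP Runtime.callFunctionOn.
// `objectId` is the remote object id of the link element (assumed to be known).
function buildHrefCall(objectId: string) {
  return {
    objectId,
    functionDeclaration: hrefWithFallback.toString(),
    returnByValue: true,
  };
}
```

With a Puppeteer CDPSession, these parameters would presumably be sent as `client.send('Runtime.callFunctionOn', buildHrefCall(objectId))`, and the extracted string read from `result.value` in the response.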

2. Explicit downloadLinks Field

interface SnapshotResult {
  // ... existing fields
  downloadLinks: Array<{
    url: string;
    filename: string;
    extension: string;
  }>;
}

Automatically identifies downloadable files by extension (.csv, .xlsx, .zip, .pdf, .json, etc.). Agents no longer need to parse text manually.
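
As an illustration of the extension-based detection, the sketch below derives downloadLinks entries from a list of absolute hrefs. The function name findDownloadLinks and the exact extension list are assumptions. The list contains only the extensions named above, and the real implementation may cover more.

```typescript
interface DownloadLink {
  url: string;
  filename: string;
  extension: string;
}

// Assumed list: only the extensions named in this issue.
const DOWNLOAD_EXTENSIONS: readonly string[] = ['csv', 'xlsx', 'zip', 'pdf', 'json'];

// Returns one entry per href whose path ends in a known download extension.
// Hrefs that are not valid absolute URLs are skipped rather than throwing.
function findDownloadLinks(hrefs: string[]): DownloadLink[] {
  const links: DownloadLink[] = [];
  for (const href of hrefs) {
    let filename: string;
    try {
      // Use only the path so query strings and fragments are ignored.
      const pathname = new URL(href).pathname;
      filename = decodeURIComponent(pathname.split('/').pop() ?? '');
    } catch (e) {
      continue; // Unparseable URL or malformed percent-encoding
    }
    const dot = filename.lastIndexOf('.');
    if (dot <= 0) continue; // No extension, or a dotfile such as ".env"
    const extension = filename.slice(dot + 1).toLowerCase();
    if (DOWNLOAD_EXTENSIONS.indexOf(extension) === -1) continue;
    links.push({ url: href, filename, extension });
  }
  return links;
}
```

Matching on the URL path alone will miss downloads served from extensionless URLs (for example via a Content-Disposition header), so this covers only the by-extension case described above.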

3. Dual-Source Accessibility Tree

Source                                    Purpose
Puppeteer page.accessibility.snapshot()   Semantic structure
CDP backendNodeId                         Precise DOM element mapping

This addresses the gaps @BogdanCerovac identified in #363 - combining semantic accessibility with precise DOM mapping.
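
The issue does not spell out how the two sources are joined. One possible shape, sketched below, assumes the semantic data is available as nodes shaped like CDP's Accessibility.AXNode (as returned by Accessibility.getFullAXTree), which carry a backendDOMNodeId. This substitutes a CDP accessibility tree for Puppeteer's snapshot purely for illustration. MappedNode and mapByBackendNodeId are hypothetical names.

```typescript
// Subset of the CDP Accessibility.AXNode shape used by this sketch.
interface AXValueLike {
  value?: unknown;
}
interface AXNodeLike {
  nodeId: string;
  ignored: boolean;
  role?: AXValueLike;
  name?: AXValueLike;
  backendDOMNodeId?: number;
}

// Hypothetical merged record: semantics plus a handle back to the DOM node.
interface MappedNode {
  backendNodeId: number;
  role: string;
  name: string;
}

// Indexes accessibility nodes by backend DOM node id.
// Ignored nodes and nodes without a DOM counterpart are left out.
function mapByBackendNodeId(nodes: AXNodeLike[]): Record<number, MappedNode> {
  const byId: Record<number, MappedNode> = {};
  for (const node of nodes) {
    if (node.ignored || node.backendDOMNodeId === undefined) continue;
    byId[node.backendDOMNodeId] = {
      backendNodeId: node.backendDOMNodeId,
      role: String(node.role?.value ?? ''),
      name: String(node.name?.value ?? ''),
    };
  }
  return byId;
}
```

Keying on the backend node id lets an agent go from a semantic entry such as a link with a given name to the exact DOM element without relying on selectors.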

4. Resilient Error Handling

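// Continue past individual element failures instead of failing the entire snapshot
for (const node of nodes) {
  try {
    await extractNodeData(node);
  } catch (e) {
    // A thrown value is not guaranteed to be an Error, so normalize it before logging
    const reason = e instanceof Error ? e.message : String(e);
    console.warn(`Skipping node: ${reason}`);
  }
}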

Implementation

We have a working implementation deployed in production. Happy to:

  • Submit a PR with these improvements
  • Provide more technical details on any specific aspect
  • Discuss alternative approaches

Questions for Maintainers

  1. Would you prefer these as separate PRs or one consolidated change?
  2. For downloadLinks - should this be opt-in via a parameter or always included?
  3. Any concerns about the dual-source approach adding complexity?

/cc @OrKoN
