[Enhancement] Robust Snapshot Architecture: Dual-Source Accessibility, Defensive Href Extraction & Download Link Detection #675

## Related Issues
- #363 - Accessibility tree/element(s) snapshot
- #284 - Download folder configuration for automation
- Puppeteer #6311 - URL attribute for links

## Problem Statement

While working on browser automation for AI agents, we identified several reliability gaps in snapshot/extraction that affect real-world usage:

1. **Href extraction edge cases** - Some dynamically-rendered links or SPAs don't expose `href` reliably through the accessibility tree alone
2. **No proactive download identification** - Agents must parse snapshot text manually to find downloadable files
3. **Single-source accessibility** - Relying solely on Puppeteer's snapshot can miss semantics (as discussed in #363)
4. **Snapshot fragility** - Individual element failures can break the entire snapshot

## Proposed Improvements

We have implemented a solution for each of these gaps and deployed them in production on Azure:

### 1. Dual-Fallback Href Extraction
```javascript
// Function declaration passed to Runtime.callFunctionOn; `this` is the link element
function getHref() {
  return this.href || this.getAttribute('href') || '';
}
```
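This handles edge cases where reading the `href` property yields an empty value: the raw `href` attribute is used instead, and an empty string is returned if neither is set (related to the discussion in Puppeteer #6311).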
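
The same fallback can be sketched as a typed helper that is testable without a browser. This is a minimal illustration, not the deployed code: `LinkLike` and `extractHref` are hypothetical names, and the `typeof` check is a slightly stricter variant of the `||` chain above, so that non-string `href` values (for example on SVG `<a>` elements, where `href` is an `SVGAnimatedString` object) also fall through to the attribute.

```typescript
// Hypothetical structural type: just enough of an element to read a link target.
interface LinkLike {
  href?: unknown;
  getAttribute(name: string): string | null;
}

// Returns the link target, preferring the resolved property over the raw attribute.
function extractHref(el: LinkLike): string {
  // The property normally holds the resolved, absolute URL.
  if (typeof el.href === "string" && el.href !== "") {
    return el.href;
  }
  // Fall back to the raw attribute; return an empty string when nothing is set.
  return el.getAttribute("href") ?? "";
}
```

Note that the attribute fallback returns the value as written in the markup, which may be a relative URL, so callers might still need to resolve it against the page URL.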

### 2. Explicit `downloadLinks` Field
```typescript
interface SnapshotResult {
  // ... existing fields
  downloadLinks: Array<{
    url: string;
    filename: string;
    extension: string;
  }>;
}
```
The snapshot identifies downloadable files by file extension (`.csv`, `.xlsx`, `.zip`, `.pdf`, `.json`, etc.) and lists them in this field, so agents no longer need to parse the snapshot text themselves.
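
A minimal sketch of how such detection could work, assuming the link URLs have already been extracted. The function name `detectDownloadLinks`, the `DownloadLink` type name, and the exact extension set are illustrative assumptions; only the five extensions named above are included here.

```typescript
interface DownloadLink {
  url: string;
  filename: string;
  extension: string;
}

// Assumed extension list: the ones named in this issue; the real list may be longer.
const DOWNLOAD_EXTENSIONS = new Set([".csv", ".xlsx", ".zip", ".pdf", ".json"]);

// Hypothetical helper: filters hrefs down to those whose path ends in a known extension.
function detectDownloadLinks(hrefs: string[], baseUrl: string): DownloadLink[] {
  const links: DownloadLink[] = [];
  for (const href of hrefs) {
    let parsed: URL;
    try {
      // Resolve relative links against the page URL.
      parsed = new URL(href, baseUrl);
    } catch {
      continue; // Skip malformed URLs instead of failing the whole scan.
    }
    // Only the path is inspected, so query strings and fragments are ignored.
    const filename = parsed.pathname.split("/").pop() ?? "";
    const dot = filename.lastIndexOf(".");
    if (dot <= 0) continue; // No extension, or a dotfile such as ".pdf".
    const extension = filename.slice(dot).toLowerCase();
    if (DOWNLOAD_EXTENSIONS.has(extension)) {
      links.push({ url: parsed.href, filename, extension });
    }
  }
  return links;
}
```

Extension matching is a heuristic: it will miss downloads served from extensionless URLs (for example via a `Content-Disposition` header) and may flag pages that merely end in `.json`.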

### 3. Dual-Source Accessibility Tree
| Source | Purpose |
|--------|---------|
| Puppeteer `page.accessibility.snapshot()` | Semantic structure |
| CDP `backendNodeId` | Precise DOM element mapping |

This addresses the gaps @BogdanCerovac identified in #363 - combining semantic accessibility with precise DOM mapping.
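
One way the two sources could be joined is a lookup keyed by `backendNodeId`. The sketch below assumes each semantic node has already been associated with a `backendNodeId`; how that association is obtained is not shown. `SemanticNode`, `DomInfo`, `MergedNode`, and `mergeSources` are hypothetical names, not Puppeteer or CDP types.

```typescript
// Hypothetical shapes for illustration; neither is an official Puppeteer or CDP type.
interface SemanticNode {
  role: string;
  name: string;
  backendNodeId?: number; // Absent when no DOM node could be associated.
}

interface DomInfo {
  backendNodeId: number;
  tagName: string;
  href?: string;
}

interface MergedNode extends SemanticNode {
  dom?: DomInfo;
}

// Attaches DOM details to each semantic node that has a matching backendNodeId.
function mergeSources(semantic: SemanticNode[], dom: DomInfo[]): MergedNode[] {
  const byId = new Map<number, DomInfo>();
  for (const info of dom) {
    byId.set(info.backendNodeId, info);
  }
  return semantic.map((node) => {
    const match =
      node.backendNodeId !== undefined ? byId.get(node.backendNodeId) : undefined;
    // Nodes without a match are kept as-is so the semantic tree stays complete.
    return match ? { ...node, dom: match } : { ...node };
  });
}
```

Keeping unmatched nodes means a failed DOM lookup degrades the result rather than dropping semantic information.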

### 4. Resilient Error Handling
```typescript
// Continue on individual element failures
for (const node of nodes) {
  try {
    await extractNodeData(node);
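  } catch (e) {
    // `e` is `unknown` under strict TypeScript, so narrow it before reading `.message`
    const reason = e instanceof Error ? e.message : String(e);
    console.warn(`Skipping node: ${reason}`); // Log and move on; don't fail the entire snapshot
  }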
}
```

## Implementation

We have a working implementation deployed in production. Happy to:
- [ ] Submit a PR with these improvements
- [ ] Provide more technical details on any specific aspect
- [ ] Discuss alternative approaches

## Questions for Maintainers

1. Would you prefer these as separate PRs or one consolidated change?
2. For `downloadLinks` - should this be opt-in via a parameter or always included?
3. Any concerns about the dual-source approach adding complexity?

/cc @OrKoN
