Phase 3: Make script rewriters fragment-safe for full streaming #584

@aram356

Description

Parent: #563

Context

lol_html splits text nodes into fragments at input chunk boundaries when processing HTML incrementally. Script rewriters (NextJsNextDataRewriter, GoogleTagManagerIntegration) currently expect complete text content: if a domain string like "googletagmanager.com" is split across chunks, no single fragment contains the full string, so the rewrite silently fails.
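
To make the failure mode concrete, here is a minimal sketch in plain Rust with no lol_html dependency. The function names and the replacement host `proxy.example.com` are hypothetical and only stand in for what a non-fragment-safe rewriter does when it sees each fragment in isolation.

```rust
/// Hypothetical per-fragment rewrite: swaps the GTM domain inside one fragment.
/// `proxy.example.com` is a placeholder, not a value from this project.
fn rewrite_fragment(fragment: &str) -> String {
    fragment.replace("googletagmanager.com", "proxy.example.com")
}

/// Applies the rewrite to every fragment independently and concatenates the
/// results, which is what happens when a rewriter assumes complete text.
fn rewrite_per_fragment(fragments: &[&str]) -> String {
    fragments.iter().map(|f| rewrite_fragment(f)).collect()
}
```

When the whole script text arrives as one fragment the domain is replaced; when the boundary falls inside the domain, neither half matches and the original text passes through unchanged.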

Phase 1 works around this with a dual-mode HtmlRewriterAdapter: streaming mode when no script rewriters are registered, buffered mode when they are. This means streaming only benefits configs without GTM/NextJS script rewriters.

Phase 3 makes the rewriters themselves fragment-safe, enabling streaming for ALL configurations.

Approach

Each script rewriter accumulates text fragments internally, uses is_last_in_text_node to detect the final fragment, and then operates on the complete text. Key considerations:

  • Intermediate fragments must return Replace("") (not Keep) to suppress their output, since the full accumulated text is emitted on the final fragment
  • When the rewriter returns Keep for the full text but earlier fragments were suppressed, it must emit Replace(full_text) instead to restore the content
  • When the text is NOT fragmented (single fragment), return Keep as before, with no unnecessary replacement
  • Multiple rewriters on the same selector (e.g., NextJsNextDataRewriter on script#__NEXT_DATA__ + NextJsRscPlaceholderRewriter on script) each accumulate independently; the last text.replace() call wins, same as the current behavior
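
The rules above can be sketched as a small accumulator. This is an illustration under stated assumptions, not the project's implementation: `Action`, `FragmentAccumulator`, and `on_fragment` are hypothetical names, `Action` only mirrors the Keep/Replace vocabulary used in this issue, and the `is_last` flag stands in for the is_last_in_text_node signal.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the action a script rewriter returns.
#[derive(Debug, PartialEq)]
enum Action {
    Keep,
    Replace(String),
}

/// Buffers text fragments and defers the rewrite until the last fragment.
struct FragmentAccumulator {
    buffer: Mutex<String>,
}

impl FragmentAccumulator {
    fn new() -> Self {
        Self { buffer: Mutex::new(String::new()) }
    }

    /// `rewrite` returns `Some(new_text)` when it changes the text and `None`
    /// when the text should be kept as-is.
    fn on_fragment<F>(&self, fragment: &str, is_last: bool, rewrite: F) -> Action
    where
        F: FnOnce(&str) -> Option<String>,
    {
        let mut buf = self.buffer.lock().unwrap();
        if !is_last {
            // Intermediate fragment: remember it and suppress its output.
            buf.push_str(fragment);
            return Action::Replace(String::new());
        }
        if buf.is_empty() {
            // Nothing was suppressed earlier, so Keep is still safe.
            return match rewrite(fragment) {
                Some(text) => Action::Replace(text),
                None => Action::Keep,
            };
        }
        // Final fragment of a split text node: rewrite the accumulated text
        // and reset the buffer for the next text node.
        buf.push_str(fragment);
        let full = std::mem::take(&mut *buf);
        let rewritten = rewrite(&full);
        match rewritten {
            Some(text) => Action::Replace(text),
            // The rewriter wants Keep, but earlier fragments were emptied,
            // so the full text has to be re-emitted.
            None => Action::Replace(full),
        }
    }
}
```

The buffer sits behind a `Mutex` so the handler can mutate it through `&self`, which matches the `Mutex<String>` approach listed in the tasks. Taking the buffer on the final fragment leaves the accumulator empty for the next text node.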

Tasks

  • Add Mutex<String> accumulation to NextJsNextDataRewriter
  • Add Mutex<String> accumulation to GoogleTagManagerIntegration
  • Remove new_buffered() from HtmlRewriterAdapter — always stream
  • Remove has_script_rewriters gate from create_html_processor
  • Add small-chunk-size regression tests:
    • __NEXT_DATA__ rewrite with text split across chunk boundaries
    • GTM inline script rewrite with domain split across chunk boundaries
  • Full verification
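
One possible shape for the small-chunk regression tests is to compare the streamed result against the unchunked result for every chunk size. The sketch below is an assumption-laden stand-in: `split_into_chunks`, `rewrite_streamed`, and `holds_for_all_chunk_sizes` are hypothetical helpers, and `rewrite_streamed` only simulates a fragment-safe rewriter. The real tests would feed the chunks through HtmlRewriterAdapter instead.

```rust
/// Splits `text` into chunks of about `chunk_size` bytes. A chunk is extended
/// forward to the next UTF-8 character boundary so every slice is valid.
fn split_into_chunks(text: &str, chunk_size: usize) -> Vec<&str> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + chunk_size).min(text.len());
        while !text.is_char_boundary(end) {
            end += 1;
        }
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}

/// Simulates a fragment-safe rewriter: buffers every chunk and applies the
/// rewrite once, when the last chunk arrives.
fn rewrite_streamed(chunks: &[&str], rewrite: impl Fn(&str) -> String) -> String {
    let mut buffer = String::new();
    let mut output = String::new();
    for (index, chunk) in chunks.iter().enumerate() {
        buffer.push_str(chunk);
        if index + 1 == chunks.len() {
            output.push_str(&rewrite(&buffer));
        }
    }
    output
}

/// Returns true when the streamed output equals the unchunked output for
/// every chunk size from 1 byte up to the full input length.
fn holds_for_all_chunk_sizes(input: &str, rewrite: impl Fn(&str) -> String) -> bool {
    let expected = rewrite(input);
    (1..=input.len()).all(|size| {
        let chunks = split_into_chunks(input, size);
        rewrite_streamed(&chunks, &rewrite) == expected
    })
}
```

Sweeping all chunk sizes guarantees that at least one boundary lands inside the domain string, which is the case the per-fragment approach gets wrong.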

Acceptance Criteria

  • All script rewriters produce correct output regardless of chunk boundaries
  • HtmlRewriterAdapter always streams (no buffered mode)
  • Streaming benefits all configurations, not just those without script rewriters
  • All existing tests pass
