[codex] HTML API: Add set_inner_html to the processor #69

Draft
sirreal wants to merge 7 commits into trunk from set-inner-html
sirreal commented Jun 14, 2026

What changed

Adds WP_HTML_Processor::set_inner_html() for non-atomic tag openers. The method replaces the target element's raw inner HTML only when the replacement can be parsed without changing the tree outside the target; otherwise it returns false and leaves the source unchanged.

Also adds focused PHPUnit coverage and a deterministic standalone fuzzer for the outside-tree invariant, including BODY/HTML attribute-hoisting cases and safe template/foreign-content exceptions.

Validation

  • WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api tests/phpunit/tests/html-api/wpHtmlProcessorSetInnerHtml.php
  • WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api
  • php -l tools/html-api-fuzz/set-inner-html.php
  • ./vendor/bin/phpcs --standard=phpcs.xml.dist tools/html-api-fuzz/set-inner-html.php
  • php tools/html-api-fuzz/set-inner-html.php --iterations 0 --output-dir /tmp/set-inner-html-fuzz-corpus --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 100 --output-dir /tmp/set-inner-html-fuzz-smoke --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 5000 --output-dir /tmp/set-inner-html-fuzz-5000 --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 50000 --output-dir /tmp/set-inner-html-fuzz-50000 --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 50000 --start-seed 50001 --output-dir /tmp/set-inner-html-fuzz-50001-100000 --stop-on-failure

sirreal commented Jun 14, 2026

set_inner_html() algorithm notes

The current implementation is intentionally conservative: it validates the proposed replacement by parsing in full document/fragment context rather than trying to reason about the replacement string in isolation.

High-level flow:

  1. Reject unsupported call sites: virtual tokens, non-matched states, tag closers, atomic/non-closer elements, integration-node tokens, or tokens without a source bookmark.
  2. Flush pending lexical updates so validation runs against the same source that will receive the replacement.
  3. Locate the target opener by source span and find the raw inner-HTML byte range by reparsing until the target element is popped/closed. This handles explicit closers, implicit closers, EOF virtual pops, and special full-document BODY/HTML end behavior.
  4. Build candidate source by splicing the proposed inner HTML into the original source.
  5. Reparse the original and candidate in the same public parsing mode and compute an outside-tree signature. The signature records tokens outside the target, including token type/name/namespace, closer state, breadcrumbs, and serialized token. Tokens inside the target are skipped for comparison, but still processed because they can affect parser state and where the target closes.
  6. Compare active formatting element state at target entry/exit to catch parser-state leaks such as reconstruction outside the target.
  7. Track non-visitable parser events for BODY/HTML start/end tags that are consumed without normal visitable stack events. Attribute-bearing <body ...> / <html ...> tokens inside the target/replacement range are rejected because they may hoist attributes onto the real body/html element rather than remain target-local. Safe cases such as template content and foreign-content HTML-looking tags are allowed.
  8. Queue one raw lexical replacement only if the original and candidate outside signatures match and no unsupported/parser-error condition is encountered. Otherwise return false and leave the source unchanged.

In other words: the safety property is enforced by full in-context reparsing plus strict outside-token comparison, with extra bookkeeping for parser effects that are not visible as normal visited tokens.
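
The splice-and-compare core of steps 4, 5, and 8 can be illustrated with a toy model. This is only a sketch under loud assumptions: `toy_outside_signature()` and `toy_set_inner_html()` are hypothetical names, the tokenizer is a regex that understands only lowercase attribute-free tags, and none of it uses the HTML API parser or models implicit closers, active formatting elements, or BODY/HTML hoisting. It returns the candidate string instead of queuing a lexical update.

```php
<?php
/**
 * Toy outside-tree signature. This is NOT the HTML API parser: it only
 * understands lowercase, attribute-free tags and text, and a closer simply
 * pops back to the nearest matching opener.
 *
 * @param string $html          Source to scan.
 * @param int    $target_offset Byte offset of the target opener.
 * @return string[] One "breadcrumbs|token" entry per token outside the target.
 */
function toy_outside_signature( string $html, int $target_offset ): array {
	preg_match_all( '~</?[a-z]+>|[^<]+~', $html, $matches, PREG_OFFSET_CAPTURE );

	$stack        = array(); // Open element names, outermost first.
	$target_index = null;    // Stack index of the target while it is open.
	$signature    = array();

	foreach ( $matches[0] as list( $token, $offset ) ) {
		$was_inside = null !== $target_index;
		$is_tag     = '<' === $token[0];

		if ( $is_tag && '/' === $token[1] ) {
			// Closer: pop to the nearest matching opener; stray closers pop nothing.
			$name  = substr( $token, 2, -1 );
			$index = array_search( $name, array_reverse( $stack, true ), true );
			if ( false !== $index ) {
				$stack = array_slice( $stack, 0, $index );
				if ( $was_inside && $index <= $target_index ) {
					$target_index = null; // The target itself was popped.
				}
			}
		} elseif ( $is_tag ) {
			$stack[] = substr( $token, 1, -1 );
			if ( $offset === $target_offset ) {
				$target_index = count( $stack ) - 1;
			}
		}

		// Tokens inside the target still updated the stack above, but are not recorded.
		if ( ! $was_inside ) {
			$signature[] = implode( '>', $stack ) . '|' . $token;
		}
	}

	return $signature;
}

/**
 * Splice the replacement into the inner byte range and accept it only when
 * the toy outside signature is unchanged. Returns the candidate or false.
 */
function toy_set_inner_html( string $html, int $target_offset, int $inner_start, int $inner_end, string $replacement ) {
	$candidate = substr_replace( $html, $replacement, $inner_start, $inner_end - $inner_start );

	return toy_outside_signature( $html, $target_offset ) === toy_outside_signature( $candidate, $target_offset )
		? $candidate
		: false;
}

// Target <p> opens at byte 5; its inner HTML "old" spans bytes 8 to 11.
$html     = '<div><p>old</p></div>';
$accepted = toy_set_inner_html( $html, 5, 8, 11, '<b>new</b>' ); // '<div><p><b>new</b></p></div>'
$rejected = toy_set_inner_html( $html, 5, 8, 11, '</p><b>' );    // false
```

The second call is rejected because the injected `</p>` closes the target early, so the `<b>` opener and the original `</p>` show up as new tokens outside the target. Note that the inside tokens are still walked: skipping them entirely would miss exactly this case.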

Performance notes / possible optimizations

The current version prioritizes correctness and simplicity at the cost of redundant work. A single call can run up to three full parses: one over the original to find the target end, a second over the original for the outside signature, and one over the candidate for its outside signature.

Potential follow-ups:

  1. Merge original passes. The end-finding pass and original outside-signature pass could likely be combined into one original parse that finds inner_end, records the outside signature, and detects original-side BODY/HTML hoist events.
  2. Stop at target close with a complete parser-state signature. Instead of parsing through EOF, compare original/candidate state immediately after the target closes. This could avoid reparsing an unchanged suffix, but only if the state signature is complete: open elements, active formatting elements, insertion mode/template mode, namespace/integration state, form/head/frameset state, and hoist events. This is the highest-risk optimization because omitting one state field could make the check unsound.
  3. Fast path for text-only replacements. If the replacement contains no markup introducers, common cases could skip candidate reparsing after inner_end is known, because plain text cannot introduce closers, active formatting changes, table repairs, or body/html hoists.
  4. Cheap pre-scan for obvious rejects. A lexical scan for high-risk constructs like target closers, <body, <html, nested non-nestable tags, or unclosed active formatting elements could reject many invalid replacements before constructing/parsing the full candidate. This would be an optimization only; the full parser validation should remain authoritative.
  5. Cache target metadata. If repeated attempts are made at the same processor position, the computed original target end/signature metadata could be reused.
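
Follow-ups 3 and 4 are both lexical checks on the replacement string, so they could share one classifier. A minimal sketch, assuming `<` is the only markup introducer that matters and using a hypothetical helper name `toy_classify_replacement()`; it is not part of this PR. Substring matching over-matches (for example `</div>` inside a comment, or a `<bodyguard>` tag), so a `suspicious` result is a hint and the full parser validation stays authoritative. Whether the target's own context, such as table-related insertion modes, needs an extra check before taking the text-only path is left open here.

```php
<?php
/**
 * Hypothetical lexical classifier for a proposed replacement.
 *
 * @param string $replacement Proposed inner HTML.
 * @param string $target_name Lowercase tag name of the target element.
 * @return string 'text-only', 'suspicious', or 'needs-parse'.
 */
function toy_classify_replacement( string $replacement, string $target_name ): string {
	// Assumption: without "<" the replacement cannot start a tag, closer, or comment.
	if ( false === strpos( $replacement, '<' ) ) {
		return 'text-only';
	}

	// High-risk constructs named above: a closer for the target, or BODY/HTML tags.
	foreach ( array( '</' . $target_name, '<body', '<html' ) as $needle ) {
		if ( false !== stripos( $replacement, $needle ) ) {
			return 'suspicious';
		}
	}

	return 'needs-parse';
}

$plain  = toy_classify_replacement( 'Hello & welcome', 'div' );  // 'text-only'
$markup = toy_classify_replacement( '<em>hi</em>', 'div' );      // 'needs-parse'
$closer = toy_classify_replacement( '</DIV><p>', 'div' );        // 'suspicious'
$body   = toy_classify_replacement( '<body class="x">', 'div' ); // 'suspicious'
```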

The safest near-term optimization is merging the original passes. The largest potential win is stopping at target close with a complete parser-state comparison, but that requires careful proof that the state snapshot fully determines parsing of the unchanged suffix.
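
To make the soundness risk concrete, here is a small sketch of a snapshot comparison. The snapshot shape, the helper names `toy_snapshot()` and `toy_states_match()`, and the values are all made up for illustration; a real snapshot would need every field listed in follow-up 2. The scenario assumes a replacement that leaves an unclosed `<b>` behind: the open elements look identical after the target closes, but the active formatting list does not.

```php
<?php
/**
 * Hypothetical parser-state snapshot taken right after the target closes.
 * Only three of the fields listed above are modeled here.
 */
function toy_snapshot( array $open_elements, array $active_formatting, string $insertion_mode ): array {
	return array(
		'open_elements'     => $open_elements,
		'active_formatting' => $active_formatting,
		'insertion_mode'    => $insertion_mode,
	);
}

/**
 * Compare two snapshots on the given fields only.
 */
function toy_states_match( array $a, array $b, array $fields ): bool {
	foreach ( $fields as $field ) {
		if ( $a[ $field ] !== $b[ $field ] ) {
			return false;
		}
	}
	return true;
}

$original  = toy_snapshot( array( 'html', 'body' ), array(), 'in body' );
$candidate = toy_snapshot( array( 'html', 'body' ), array( 'b' ), 'in body' );

// Complete comparison sees the leaked formatting element.
$complete = toy_states_match( $original, $candidate, array( 'open_elements', 'active_formatting', 'insertion_mode' ) ); // false

// Omitting one field reports a match even though the suffix would parse differently.
$incomplete = toy_states_match( $original, $candidate, array( 'open_elements', 'insertion_mode' ) ); // true
```

The second comparison is the failure mode to guard against: any state field left out of the snapshot becomes a way for a replacement to change the parse of the unchanged suffix without being detected.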
