[codex] HTML API: Add set_inner_html to the processor by sirreal · Pull Request #69 · sirreal/wordpress-develop

sirreal · 2026-06-14T09:26:49Z

What changed

Adds WP_HTML_Processor::set_inner_html() for non-atomic tag openers. The method replaces the target element's raw inner HTML only when the replacement can be parsed without changing the tree outside the target; otherwise it returns false and leaves the source unchanged.

Also adds focused PHPUnit coverage and a deterministic standalone fuzzer for the outside-tree invariant, including BODY/HTML attribute-hoisting cases and safe template/foreign-content exceptions.

Validation

WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api tests/phpunit/tests/html-api/wpHtmlProcessorSetInnerHtml.php
WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api
php -l tools/html-api-fuzz/set-inner-html.php
./vendor/bin/phpcs --standard=phpcs.xml.dist tools/html-api-fuzz/set-inner-html.php
php tools/html-api-fuzz/set-inner-html.php --iterations 0 --output-dir /tmp/set-inner-html-fuzz-corpus --stop-on-failure
php tools/html-api-fuzz/set-inner-html.php --iterations 100 --output-dir /tmp/set-inner-html-fuzz-smoke --stop-on-failure
php tools/html-api-fuzz/set-inner-html.php --iterations 5000 --output-dir /tmp/set-inner-html-fuzz-5000 --stop-on-failure
php tools/html-api-fuzz/set-inner-html.php --iterations 50000 --output-dir /tmp/set-inner-html-fuzz-50000 --stop-on-failure
php tools/html-api-fuzz/set-inner-html.php --iterations 50000 --start-seed 50001 --output-dir /tmp/set-inner-html-fuzz-50001-100000 --stop-on-failure

sirreal · 2026-06-14T18:17:51Z

`set_inner_html()` algorithm notes

The current implementation is intentionally conservative: it validates the proposed replacement by parsing in full document/fragment context rather than trying to reason about the replacement string in isolation.

High-level flow:

1. Reject unsupported call sites: virtual tokens, non-matched states, tag closers, atomic/non-closer elements, integration-node tokens, or tokens without a source bookmark.
2. Flush pending lexical updates so validation runs against the same source that will receive the replacement.
3. Locate the target opener by source span and find the raw inner-HTML byte range by reparsing until the target element is popped/closed. This handles explicit closers, implicit closers, EOF virtual pops, and special full-document BODY/HTML end behavior.
4. Build candidate source by splicing the proposed inner HTML into the original source.
5. Reparse the original and candidate in the same public parsing mode and compute an outside-tree signature. The signature records tokens outside the target, including token type/name/namespace, closer state, breadcrumbs, and serialized token. Tokens inside the target are skipped for comparison, but still processed because they can affect parser state and where the target closes.
6. Compare active formatting element state at target entry/exit to catch parser-state leaks such as reconstruction outside the target.
7. Track non-visitable parser events for BODY/HTML start/end tags that are consumed without normal visitable stack events. Attribute-bearing <body ...> / <html ...> tokens inside the target/replacement range are rejected because they may hoist attributes onto the real body/html element rather than remain target-local. Safe cases such as template content and foreign-content HTML-looking tags are allowed.
8. Queue one raw lexical replacement only if the original and candidate outside signatures match and no unsupported/parser-error condition is encountered. Otherwise return false and leave the source unchanged.

In other words: the safety property is enforced by full in-context reparsing plus strict outside-token comparison, with extra bookkeeping for parser effects that are not visible as normal visited tokens.
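
To make the splice-and-compare idea concrete, here is a minimal toy model in Python. It is not the PHP implementation and not an HTML5 parser: the regex tokenizer, the `walk()` and `try_set_inner_html()` names, and the stack rule ("a closer pops through the nearest matching open element") are all assumptions made for illustration. It covers only the core of the flow above: find the inner range, splice, reparse, and compare the tokens seen outside the target. Active formatting state and BODY/HTML hoisting are not modeled.

```python
import re

# Toy tokenizer: a tag, a run of text, or a stray "<". Not an HTML5 tokenizer.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+|<")


def walk(html, target_start):
    """Return (outside_signature, (inner_start, inner_end)) for the opener at target_start."""
    stack = []      # open elements as (name, opener_offset)
    signature = []  # tokens seen while the target is not open
    inner_start = inner_end = None
    for m in TOKEN.finditer(html):
        name = (m.group(2) or "").lower()
        inside = any(offset == target_start for _, offset in stack)
        if not name:
            kind = "text"
        elif m.group(1):
            kind = "closer"
            names = [n for n, _ in stack]
            if name in names:
                # Pop through the nearest matching element. This can close the
                # target implicitly when an ancestor's closer arrives first.
                index = len(names) - 1 - names[::-1].index(name)
                popped = stack[index:]
                del stack[index:]
                if any(offset == target_start for _, offset in popped):
                    inner_end = m.start()
                    inside = False  # the token that closes the target is outside it
        else:
            kind = "opener"
            stack.append((name, m.start()))
            if m.start() == target_start:
                inner_start = m.end()
        if not inside:
            breadcrumbs = tuple(n for n, _ in stack)
            signature.append((kind, name, breadcrumbs, m.group(0)))
    if inner_start is None:
        raise ValueError("no opener at target_start")
    if inner_end is None:
        inner_end = len(html)  # target still open at EOF
    return signature, (inner_start, inner_end)


def try_set_inner_html(html, target_start, replacement):
    """Return the spliced source, or None if the outside signature would change."""
    original_sig, (start, end) = walk(html, target_start)
    candidate = html[:start] + replacement + html[end:]
    candidate_sig, _ = walk(candidate, target_start)
    return candidate if candidate_sig == original_sig else None


html = "<div><p>old</p><span>tail</span></div>"
target = html.index("<p>")
print(try_set_inner_html(html, target, "<b>new</b>"))
print(try_set_inner_html(html, target, "x</p><i>leak"))
```

In this model the first call returns the spliced document, while the second returns `None` because the stray `</p>` closes the target early and the `<i>` opener then shows up outside it. Tokens inside the target are still walked, since they decide where the target ends, but they are left out of the signature.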

Performance notes / possible optimizations

The current version prioritizes correctness and simplicity at the cost of extra work. A single call can run up to three parses: one over the original to find the target end, a second over the original for the outside signature, and one over the candidate for its outside signature.

Potential follow-ups:

- Merge original passes. The end-finding pass and original outside-signature pass could likely be combined into one original parse that finds inner_end, records the outside signature, and detects original-side BODY/HTML hoist events.
- Stop at target close with a complete parser-state signature. Instead of parsing through EOF, compare original/candidate state immediately after the target closes. This could avoid reparsing an unchanged suffix, but only if the state signature is complete: open elements, active formatting elements, insertion mode/template mode, namespace/integration state, form/head/frameset state, and hoist events. This is the highest-risk optimization because omitting one state field could make the check unsound.
- Fast path for text-only replacements. If the replacement contains no markup introducers, common cases could skip candidate reparsing after inner_end is known, because plain text cannot introduce closers, active formatting changes, table repairs, or body/html hoists.
- Cheap pre-scan for obvious rejects. A lexical scan for high-risk constructs like target closers, <body, <html, nested non-nestable tags, or unclosed active formatting elements could reject many invalid replacements before constructing/parsing the full candidate. This would be an optimization only; the full parser validation should remain authoritative.
- Cache target metadata. If repeated attempts are made at the same processor position, the computed original target end/signature metadata could be reused.
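
A rough sketch of how the text-only fast path and the lexical pre-scan might classify a replacement string, written in Python for illustration only. The `Prescan` enum and `prescan()` function are hypothetical names, and only three of the listed constructs are checked (target closers, `<body`, `<html`). Two caveats apply. First, the text-only class is safe only where the target accepts character data in place: in table or head contexts, for example, non-whitespace text can be foster-parented or can force the element closed, so a real fast path would also need to check the target's context. Second, `HIGH_RISK` has false positives (an attribute-free `<body>` or one inside template content may be fine), which is why it is a hint and not a verdict.

```python
import re
from enum import Enum


class Prescan(Enum):
    TEXT_ONLY = "text-only"      # no "<" at all: candidate for the fast path
    HIGH_RISK = "high-risk"      # contains a construct worth flagging early
    NEEDS_PARSE = "needs-parse"  # markup present, nothing obviously risky


def prescan(replacement: str, target_tag: str) -> Prescan:
    """Classify a replacement lexically; the full reparse stays authoritative."""
    if "<" not in replacement:
        return Prescan.TEXT_ONLY
    # A tag name ends at whitespace, "/", ">", or end of input, so "<bodyguard>"
    # and "</divider>" do not count as "<body" or "</div".
    risky = re.compile(
        r"(?:</" + re.escape(target_tag) + r"|<body|<html)(?=[\s/>]|\Z)",
        re.IGNORECASE,
    )
    if risky.search(replacement):
        return Prescan.HIGH_RISK
    return Prescan.NEEDS_PARSE


print(prescan("hello & goodbye", "div"))
print(prescan("x</DIV>y", "div"))
print(prescan("<bodyguard>", "div"))
```

Under these assumptions the three calls classify as `TEXT_ONLY`, `HIGH_RISK`, and `NEEDS_PARSE` respectively.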

The safest near-term optimization is merging the original passes. The largest potential win is stopping at target close with a complete parser-state comparison, but that requires careful proof that the state snapshot fully determines parsing of the unchanged suffix.
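
For the stop-at-target-close idea, the comparison could be modeled as equality between two immutable snapshots taken right after the target closes. The Python sketch below is an assumption-laden illustration: `ParserStateSnapshot`, its field names, and `may_skip_suffix()` are hypothetical and do not correspond to properties of the PHP processor. The fields mirror the state list given in the follow-ups; whether that list is complete is exactly the open question.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ParserStateSnapshot:
    """Hypothetical snapshot of parser state taken just after the target closes."""
    open_elements: Tuple[Tuple[str, str], ...]    # (namespace, tag name), outermost first
    active_formatting: Tuple[Optional[str], ...]  # tag names; None models a marker
    insertion_mode: str
    template_modes: Tuple[str, ...]
    form_pointer_set: bool
    head_pointer_set: bool
    frameset_ok: bool
    hoist_events: Tuple[str, ...]                 # BODY/HTML attribute hoists seen so far


def may_skip_suffix(original: ParserStateSnapshot, candidate: ParserStateSnapshot) -> bool:
    """Skipping the unchanged suffix is only considered when every field matches."""
    return original == candidate


before = ParserStateSnapshot(
    open_elements=(("html", "html"), ("html", "body"), ("html", "div")),
    active_formatting=(),
    insertion_mode="in body",
    template_modes=(),
    form_pointer_set=False,
    head_pointer_set=True,
    frameset_ok=False,
    hoist_events=(),
)
leaky = replace(before, active_formatting=("b",))  # an unclosed <b> escaped the target

print(may_skip_suffix(before, before))
print(may_skip_suffix(before, leaky))
```

A frozen dataclass gives field-by-field equality for free, so adding a state field later automatically tightens the check. It does nothing to show the field list is sufficient, which is the proof obligation described above.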

sirreal added 7 commits June 13, 2026 23:39

HTML API: Add set_inner_html to the processor

7affe9f

HTML API: Add set_inner_html fuzzing coverage

a18bb5d

HTML API: Expand set_inner_html fuzzer coverage

f177666

HTML API: Add Lexbor oracle for set_inner_html fuzzing

ac30d95

HTML API: Reject active formatting leaks in set_inner_html

9ba941a

HTML API: Expand set_inner_html fuzzer target coverage

f936ca5

HTML API: Add set_inner_html fuzzer coverage inventory

89e6370

sirreal wants to merge 7 commits into trunk from set-inner-html
