Html support 2 #22

Draft
sirreal wants to merge 17 commits into trunk from html-support-2

sirreal commented Feb 3, 2026

Trac ticket:


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

…_Elements

Add `get_at()`, `replace_at()`, and `index_of()` methods to support the
reconstruct active formatting elements algorithm. These methods enable
index-based traversal needed for the REWIND and ADVANCE phases of the
algorithm per the HTML5 specification.

- `get_at(int $index)`: Returns the entry at a specific index
- `replace_at(int $index, WP_HTML_Token $token)`: Replaces entry at index
- `index_of(WP_HTML_Token $token)`: Finds index of a token by bookmark name
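
A minimal sketch of what index-based access on such a list could look like. `Sketch_Token` and `Sketch_Formatting_List` are hypothetical stand-ins, not the real `WP_HTML_Token` or `WP_HTML_Active_Formatting_Elements` classes, and the return types and out-of-range behavior are assumptions.

```php
<?php
// Hypothetical stand-in for WP_HTML_Token: only the bookmark name matters here.
final class Sketch_Token {
	public $bookmark_name;
	public $node_name;

	public function __construct( string $bookmark_name, string $node_name ) {
		$this->bookmark_name = $bookmark_name;
		$this->node_name     = $node_name;
	}
}

// Hypothetical stand-in for the active formatting elements list.
final class Sketch_Formatting_List {
	/** @var Sketch_Token[] */
	private $stack = array();

	public function push( Sketch_Token $token ): void {
		$this->stack[] = $token;
	}

	// Returns the entry at an index, or null when out of range (assumed behavior).
	public function get_at( int $index ): ?Sketch_Token {
		return $this->stack[ $index ] ?? null;
	}

	// Replaces the entry at an index; reports whether the index was valid.
	public function replace_at( int $index, Sketch_Token $token ): bool {
		if ( $index < 0 || $index >= count( $this->stack ) ) {
			return false;
		}
		$this->stack[ $index ] = $token;
		return true;
	}

	// Finds an entry by bookmark name rather than by object identity.
	public function index_of( Sketch_Token $token ): ?int {
		foreach ( $this->stack as $index => $item ) {
			if ( $item->bookmark_name === $token->bookmark_name ) {
				return $index;
			}
		}
		return null;
	}
}
```

Matching by bookmark name means a replaced entry is no longer found under its old token, which is what the ADVANCE phase relies on after swapping in a new element.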

Implements the full "reconstruct the active formatting elements"
algorithm per the HTML5 specification. This algorithm is called
when the parser needs to reopen formatting elements that were
opened in the current body, cell, or caption but haven't been
explicitly closed.

The implementation has two phases:
- REWIND: Walk backwards through the active formatting elements list
  to find where reconstruction should start (stopping at markers or
  elements already in the stack of open elements)
- ADVANCE: Walk forwards creating new virtual elements and updating
  the list
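
The two phases can be sketched on plain arrays. This is an illustration under simplifying assumptions, not the code in `WP_HTML_Processor`: entries are string ids, `SKETCH_MARKER` and `sketch_reconstruct()` are hypothetical names, and "creating a new element" is reduced to appending `-clone` to the id.

```php
<?php
const SKETCH_MARKER = 'marker';

/**
 * @param array $formatting Active formatting entries: SKETCH_MARKER or ids such as 'b#1'.
 * @param array $open       Ids currently on the stack of open elements.
 * @return array Updated array( $formatting, $open ).
 */
function sketch_reconstruct( array $formatting, array $open ): array {
	$last = count( $formatting ) - 1;

	// An entry needs no work if it is a marker or is already open.
	$is_settled = static function ( $entry ) use ( $open ) {
		return SKETCH_MARKER === $entry || in_array( $entry, $open, true );
	};

	// Nothing to do for an empty list or a settled last entry.
	if ( $last < 0 || $is_settled( $formatting[ $last ] ) ) {
		return array( $formatting, $open );
	}

	// REWIND: step back until the previous entry is settled or the list starts.
	$index = $last;
	while ( $index > 0 && ! $is_settled( $formatting[ $index - 1 ] ) ) {
		--$index;
	}

	// ADVANCE: create a new element for each entry and replace the entry with it.
	for ( ; $index <= $last; ++$index ) {
		$clone                = $formatting[ $index ] . '-clone';
		$open[]               = $clone;
		$formatting[ $index ] = $clone;
	}

	return array( $formatting, $open );
}
```

With `array( 'a#1', SKETCH_MARKER, 'b#2' )` as the list, only `b#2` is reconstructed because the rewind stops at the marker.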

A new helper method create_element_for_formatting_token() creates
virtual element tokens following the pattern used by insert_virtual_node().

Known limitations:
- Attribute cloning is not yet implemented; elements with attributes
  will bail with a specific message rather than produce incorrect output
- Noah's Ark clause (limiting duplicate formatting elements) is a
  separate unimplemented feature; one test added to skip list

Test improvements:
- 18 new html5lib tests now pass (was 1087, now 1105 assertions)
- 19 fewer skipped tests (was 421, now 402)
- Updated unit tests to verify reconstruction works rather than
  testing for bail behavior

…rithm

Add comprehensive unit tests for the reconstruct active formatting
elements algorithm implemented in WP_HTML_Processor. These tests
cover:

- Single formatting element reconstruction across paragraph boundaries
- Multiple formatting elements reconstruction in correct order
- Deeply nested formatting elements
- Elements persisting after scope closes (button marker behavior)
- No-op when entry already in stack of open elements
- Reconstruction across multiple paragraph boundaries
- Closed formatting elements not being reconstructed
- Attribute limitation causing bail/unsupported error
- Reconstruction triggered by text nodes
- Interleaved block and formatting elements
- Empty active formatting elements list (no-op)
- Breadcrumb correctness during stepping

See #62357.

Add new public property to store attributes for formatting elements.
This enables the active formatting elements list to store attributes
as they were when elements were created, supporting reconstruction
and Noah's Ark duplicate detection per the HTML5 specification.

Keys are lowercase attribute names, values are decoded strings
or `true` for boolean attributes.

Add private method get_current_token_attributes() that captures all
attributes from the current token as an associative array. Returns
lowercase attribute names as keys and decoded values (or true for
boolean attributes) as values.
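
A rough sketch of the captured shape, assuming the raw attributes arrive as `[name, raw value or null]` pairs. `sketch_capture_attributes()` is a hypothetical helper; `html_entity_decode()` only approximates the HTML API's own decoder, and keeping the first of any duplicated name follows general HTML parsing rules rather than anything stated here.

```php
<?php
/**
 * @param array $raw_attributes List of array( name, raw value or null for boolean ).
 * @return array Lowercase name => decoded string, or true for boolean attributes.
 */
function sketch_capture_attributes( array $raw_attributes ): array {
	$captured = array();

	foreach ( $raw_attributes as list( $name, $raw_value ) ) {
		$key = strtolower( $name );

		// Assumption: a repeated attribute name keeps its first occurrence.
		if ( array_key_exists( $key, $captured ) ) {
			continue;
		}

		$captured[ $key ] = null === $raw_value
			? true
			: html_entity_decode( $raw_value, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	}

	return $captured;
}
```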

This helper will be used when pushing formatting elements to capture
their attributes for later reconstruction with correct attribute values.

Store current token attributes before pushing to the active
formatting elements list. This enables reconstruction to later
access original attributes and supports Noah's Ark duplicate
detection by attribute comparison.

Adds attribute capture at all three push locations:
- <a> tags
- b, big, code, em, font, i, s, small, strike, strong, tt, u tags
- <nobr> tags

When reconstructing active formatting elements, clone the stored
attributes from the original entry to the newly created token.
This ensures reconstructed elements have the same attributes as
their originals.

Removes the bail check that prevented reconstruction of elements
with attributes - that limitation is no longer needed since we
now properly capture and clone attributes.

Updates test to verify attributes are preserved through reconstruction.

…ents

Modify get_attribute() to check for stored attributes on the current
element's token before falling through to parent implementation. This
enables reconstructed formatting elements to expose their original
attributes via the standard API.

Key implementation details:
- Check current_element->token->attributes (stack event's token)
- Use case-insensitive lookup via strtolower()
- Return null for non-existent attributes (no parent fallthrough)
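
The lookup order described above might be sketched as follows. `sketch_get_attribute()` is a hypothetical free function, `$stored` stands in for `current_element->token->attributes`, and `$parent_lookup` stands in for the parent implementation.

```php
<?php
/**
 * @param array|null $stored        Stored attributes, or null when the token has none stored.
 * @param string     $name          Attribute name in any letter case.
 * @param callable   $parent_lookup Fallback used only when nothing is stored.
 * @return string|true|null
 */
function sketch_get_attribute( ?array $stored, string $name, callable $parent_lookup ) {
	if ( null === $stored ) {
		return $parent_lookup( $name );
	}

	// Stored keys are lowercase, so lowercase the requested name before the lookup.
	$key = strtolower( $name );

	// A missing key returns null and never reaches the parent lookup.
	return $stored[ $key ] ?? null;
}
```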

Added unit tests verifying:
- Single attribute access on reconstructed elements
- Multiple attribute access on reconstructed elements

…ed formatting elements

Modify get_attribute_names_with_prefix() to check for stored attributes on the
current element's token before falling through to parent implementation. This
enables reconstructed formatting elements to list their original attribute
names via the standard API.

Key implementation details:
- Check current_element->token->attributes (stack event's token)
- Use case-insensitive prefix matching via strtolower()
- Return empty array for virtual elements with no matching attributes
- Return null for tag closers
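
A small sketch of the prefix matching over stored attributes. `sketch_names_with_prefix()` is a hypothetical helper, and the tag-closer case that returns null is left out.

```php
<?php
/**
 * @param array  $stored Stored attributes keyed by lowercase name.
 * @param string $prefix Prefix in any letter case; '' matches every name.
 * @return string[] Matching attribute names, in stored order.
 */
function sketch_names_with_prefix( array $stored, string $prefix ): array {
	$prefix  = strtolower( $prefix );
	$length  = strlen( $prefix );
	$matches = array();

	foreach ( array_keys( $stored ) as $name ) {
		// Numeric-looking names become integer keys in PHP, so cast back to string.
		$name = (string) $name;

		if ( 0 === strncmp( $name, $prefix, $length ) ) {
			$matches[] = $name;
		}
	}

	return $matches;
}
```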

Added unit test verifying:
- All attributes returned with empty prefix
- Prefix filtering works correctly
- Non-matching prefix returns empty array

Add two private static methods to WP_HTML_Active_Formatting_Elements that
will be used by the Noah's Ark clause to detect duplicate formatting elements:

- elements_have_same_identity(): Compares two tokens by tag name, namespace,
  and attributes to determine if they represent the same formatting element.

- attributes_are_equal(): Order-independent attribute comparison using
  lowercase keys (already normalized during capture) and exact value matching.

These helpers follow the HTML5 spec requirement that two elements match when
they have identical tag name, namespace, and attributes (where attribute
comparison is by name and value, order-independent).
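
For illustration, the two comparisons could look roughly like this when elements are modeled as plain arrays. The `sketch_` function names and the `node_name`, `namespace`, and `attributes` keys are assumptions made for this example, not the private methods themselves.

```php
<?php
// Order-independent comparison: same number of attributes, same names, same exact values.
function sketch_attributes_are_equal( array $a, array $b ): bool {
	if ( count( $a ) !== count( $b ) ) {
		return false;
	}

	foreach ( $a as $name => $value ) {
		// Strict comparison keeps boolean `true` distinct from any string value.
		if ( ! array_key_exists( $name, $b ) || $b[ $name ] !== $value ) {
			return false;
		}
	}

	return true;
}

// Two elements match when tag name, namespace, and attributes all match.
function sketch_same_identity( array $a, array $b ): bool {
	return $a['node_name'] === $b['node_name']
		&& $a['namespace'] === $b['namespace']
		&& sketch_attributes_are_equal( $a['attributes'], $b['attributes'] );
}
```

PHP's loose array comparison (`==`) is avoided here because it would treat `true` and a non-empty string as equal.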

Props dmsnell.
See #62857.

…lements

Implement the Noah's Ark clause in WP_HTML_Active_Formatting_Elements::push()
which limits identical formatting elements to 3 in the active formatting
elements list.

When pushing a new element, the implementation:
1. Walks backwards through the stack counting elements that match the new
   token (same tag name, namespace, and attributes)
2. Stops at markers (which reset the duplicate count per spec)
3. If 3 or more identical elements exist, removes the earliest match
4. Adds the new element to the end of the list
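
The four steps above can be sketched on a plain array. This is a simplified illustration, not the body of `WP_HTML_Active_Formatting_Elements::push()`: `SKETCH_ARK_MARKER`, `sketch_identity()`, and `sketch_ark_push()` are hypothetical names, and identity is reduced to a serialized key.

```php
<?php
const SKETCH_ARK_MARKER = 'marker';

// Builds a comparable key from namespace, tag name, and name-sorted attributes.
function sketch_identity( array $element ): string {
	$attributes = $element['attributes'];
	ksort( $attributes );
	return serialize( array( $element['namespace'], $element['node_name'], $attributes ) );
}

/**
 * @param array $list    Entries: SKETCH_ARK_MARKER or element arrays.
 * @param array $element Element to push.
 * @return array The updated list.
 */
function sketch_ark_push( array $list, array $element ): array {
	$identity = sketch_identity( $element );
	$earliest = null;
	$count    = 0;

	// Walk backwards, counting matches until a marker or the start of the list.
	for ( $i = count( $list ) - 1; $i >= 0; --$i ) {
		if ( SKETCH_ARK_MARKER === $list[ $i ] ) {
			break;
		}
		if ( sketch_identity( $list[ $i ] ) === $identity ) {
			++$count;
			$earliest = $i;
		}
	}

	// With three matches already present, drop the earliest one.
	if ( $count >= 3 ) {
		array_splice( $list, $earliest, 1 );
	}

	$list[] = $element;
	return $list;
}
```

Because the walk stops at a marker, matches before the marker are never counted or removed.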

This prevents unbounded accumulation of nested identical formatting elements
like `<b><b><b><b>...` - only 3 will be reconstructed when crossing implicit
paragraph boundaries.

Uses helper methods elements_have_same_identity() and attributes_are_equal()
added in the previous commit for element comparison.

Add five unit tests covering Noah's Ark behavior:
- test_noahs_ark_limits_identical_elements_to_three: Verifies that
  more than 3 identical formatting elements are limited to 3
- test_noahs_ark_different_attributes_are_different_elements: Verifies
  that elements with different attributes are not considered identical
- test_noahs_ark_respects_markers: Documents marker behavior when
  a scoped element (like BUTTON) closes
- test_noahs_ark_attribute_order_independent: Verifies that attribute
  order does not affect identity comparison
- test_noahs_ark_different_attribute_values_are_different_elements:
  Verifies that different attribute values make elements non-identical

Also removes the Noah's Ark skip from the html5lib test suite now that
the implementation is complete and the test passes.

…formatting elements

Override get_qualified_attribute_name() in WP_HTML_Processor to handle
virtual/reconstructed elements with stored attributes. For these elements,
the method returns the stored (lowercase) attribute name, applying foreign
attribute adjustments for SVG and MathML namespaces as needed.
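
As a hedged illustration of the case adjustment, the sketch below maps a stored lowercase name back to its adjusted form. `sketch_qualified_attribute_name()` is hypothetical, the namespace strings `'svg'` and `'math'` are assumed, and the lookup tables hold only a few entries from the HTML specification's adjustment tables rather than the full lists.

```php
<?php
/**
 * @param string $stored_name Lowercase attribute name as stored on the token.
 * @param string $namespace   Assumed values: 'html', 'svg', or 'math'.
 * @return string Name to display for the attribute.
 */
function sketch_qualified_attribute_name( string $stored_name, string $namespace ): string {
	// Partial excerpt of the SVG attribute adjustments.
	static $svg_adjustments = array(
		'attributename'       => 'attributeName',
		'preserveaspectratio' => 'preserveAspectRatio',
		'viewbox'             => 'viewBox',
	);

	// The MathML adjustment for definitionURL.
	static $math_adjustments = array(
		'definitionurl' => 'definitionURL',
	);

	if ( 'svg' === $namespace ) {
		return $svg_adjustments[ $stored_name ] ?? $stored_name;
	}

	if ( 'math' === $namespace ) {
		return $math_adjustments[ $stored_name ] ?? $stored_name;
	}

	// HTML elements report the stored lowercase name unchanged.
	return $stored_name;
}
```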

This enables proper attribute name display in tree representations of
parsed HTML, where reconstructed formatting elements need to report their
original attribute names.

Also adds a unit test verifying the behavior for reconstructed elements.
sirreal force-pushed the html-support-2 branch 3 times, most recently from 20558ec to 97b7277, February 3, 2026 14:53