…_Elements

Add `get_at()`, `replace_at()`, and `index_of()` methods to support the reconstruct active formatting elements algorithm. These methods enable the index-based traversal needed for the REWIND and ADVANCE phases of the algorithm per the HTML5 specification.

- `get_at(int $index)`: Returns the entry at a specific index
- `replace_at(int $index, WP_HTML_Token $token)`: Replaces the entry at an index
- `index_of(WP_HTML_Token $token)`: Finds the index of a token by bookmark name
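As a rough illustration of the index-based access described above, here is a minimal sketch. `Sketch_Formatting_List` and its array-shaped entries are hypothetical stand-ins, not the WordPress class: the real methods take `WP_HTML_Token` objects, whereas this sketch looks entries up by a plain bookmark string and assumes out-of-range lookups return `null`.

```php
<?php
// Hypothetical stand-in for the active formatting elements list.
// Entries are plain arrays here; the real class stores WP_HTML_Token objects.
final class Sketch_Formatting_List {
    /** @var array<int, array{bookmark: string, tag: string}> */
    private $entries = array();

    public function push( array $entry ): void {
        $this->entries[] = $entry;
    }

    // Returns the entry at $index, or null when the index is out of range (assumed behavior).
    public function get_at( int $index ): ?array {
        return $this->entries[ $index ] ?? null;
    }

    // Replaces the entry at $index; reports whether the index existed.
    public function replace_at( int $index, array $entry ): bool {
        if ( ! isset( $this->entries[ $index ] ) ) {
            return false;
        }
        $this->entries[ $index ] = $entry;
        return true;
    }

    // Finds an entry by bookmark name, mirroring the lookup described in the commit.
    public function index_of( string $bookmark ): ?int {
        foreach ( $this->entries as $index => $entry ) {
            if ( $entry['bookmark'] === $bookmark ) {
                return $index;
            }
        }
        return null;
    }
}
```

Returning `null` rather than throwing keeps the REWIND loop simple, since a caller can probe one position before the start of the list without a separate bounds check.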
Implements the full "reconstruct the active formatting elements" algorithm per the HTML5 specification. This algorithm is called when the parser needs to reopen formatting elements that were opened in the current body, cell, or caption but haven't been explicitly closed.

The implementation has two phases:
- REWIND: Walk backwards through the active formatting elements list to find where reconstruction should start (stopping at markers or elements already in the stack of open elements)
- ADVANCE: Walk forwards creating new virtual elements and updating the list

A new helper method create_element_for_formatting_token() creates virtual element tokens following the pattern used by insert_virtual_node().

Known limitations:
- Attribute cloning is not yet implemented; elements with attributes will bail with a specific message rather than produce incorrect output
- Noah's Ark clause (limiting duplicate formatting elements) is a separate unimplemented feature; one test added to skip list

Test improvements:
- 18 new html5lib tests now pass (was 1087, now 1105 assertions)
- 19 fewer skipped tests (was 421, now 402)
- Updated unit tests to verify reconstruction works rather than testing for bail behavior
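The two phases can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in WP_HTML_Processor: `Sketch_Element`, `SKETCH_MARKER`, `sketch_is_settled()` and `sketch_reconstruct()` are hypothetical names, both lists are plain PHP arrays, and "creating a virtual element" is reduced to cloning the entry and pushing it onto the stack of open elements.

```php
<?php
// Hypothetical stand-ins; these are not the WordPress classes.
const SKETCH_MARKER = 'marker';

final class Sketch_Element {
    public $tag;
    public $attributes;

    public function __construct( string $tag, array $attributes = array() ) {
        $this->tag        = $tag;
        $this->attributes = $attributes;
    }
}

// An entry needs no reconstruction if it is a marker or is already open.
function sketch_is_settled( $entry, array $open_stack ): bool {
    return SKETCH_MARKER === $entry || in_array( $entry, $open_stack, true );
}

function sketch_reconstruct( array &$formatting, array &$open_stack ): void {
    $last = count( $formatting ) - 1;

    // Nothing to do for an empty list or when the last entry is settled.
    if ( $last < 0 || sketch_is_settled( $formatting[ $last ], $open_stack ) ) {
        return;
    }

    // REWIND: step back until the previous entry is settled or the list starts.
    $index = $last;
    while ( $index > 0 && ! sketch_is_settled( $formatting[ $index - 1 ], $open_stack ) ) {
        --$index;
    }

    // ADVANCE: recreate each remaining entry and swap it into the list.
    for ( ; $index <= $last; $index++ ) {
        $element              = clone $formatting[ $index ];
        $open_stack[]         = $element;
        $formatting[ $index ] = $element;
    }
}
```

For example, with a formatting list of `B`, `I` and an open stack holding only a new `P`, the call appends fresh `B` and `I` elements to the stack in that order and replaces both list entries with the new objects.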
…rithm

Add comprehensive unit tests for the reconstruct active formatting elements algorithm implemented in WP_HTML_Processor.

These tests cover:
- Single formatting element reconstruction across paragraph boundaries
- Multiple formatting elements reconstruction in correct order
- Deeply nested formatting elements
- Elements persisting after scope closes (button marker behavior)
- No-op when entry already in stack of open elements
- Reconstruction across multiple paragraph boundaries
- Closed formatting elements not being reconstructed
- Attribute limitation causing bail/unsupported error
- Reconstruction triggered by text nodes
- Interleaved block and formatting elements
- Empty active formatting elements list (no-op)
- Breadcrumb correctness during stepping

See #62357.
Add new public property to store attributes for formatting elements. This enables the active formatting elements list to store attributes as they were when elements were created, supporting reconstruction and Noah's Ark duplicate detection per the HTML5 specification. Keys are lowercase attribute names, values are decoded strings or `true` for boolean attributes.
Add private method get_current_token_attributes() that captures all attributes from the current token as an associative array. Returns lowercase attribute names as keys and decoded values (or true for boolean attributes) as values. This helper will be used when pushing formatting elements to capture their attributes for later reconstruction with correct attribute values.
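A minimal sketch of the attribute map this describes, assuming the attributes are already available as raw name/value pairs. `sketch_capture_attributes()` is a hypothetical helper; the real method reads from the current token, and PHP's `html_entity_decode()` stands in here for whatever decoder the HTML API actually uses.

```php
<?php
// Hypothetical helper: builds the stored-attribute map from raw pairs.
// Each pair is [ name, raw value ], with null meaning a boolean attribute.
function sketch_capture_attributes( array $raw_pairs ): array {
    $attributes = array();

    foreach ( $raw_pairs as $pair ) {
        list( $name, $raw_value ) = $pair;
        $name = strtolower( $name );

        // HTML ignores repeated attributes on a tag, so the first one wins.
        if ( array_key_exists( $name, $attributes ) ) {
            continue;
        }

        // Boolean attributes are stored as true; others as decoded strings.
        $attributes[ $name ] = null === $raw_value
            ? true
            : html_entity_decode( $raw_value, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
    }

    return $attributes;
}
```

Normalizing the keys once at capture time means later comparisons and lookups can work on the map directly without lowercasing again.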
Store current token attributes before pushing to the active formatting elements list. This enables reconstruction to later access original attributes and supports Noah's Ark duplicate detection by attribute comparison.

Adds attribute capture at all three push locations:
- <a> tags
- b, big, code, em, font, i, s, small, strike, strong, tt, u tags
- <nobr> tags
When reconstructing active formatting elements, clone the stored attributes from the original entry to the newly created token. This ensures reconstructed elements have the same attributes as their originals. Removes the bail check that prevented reconstruction of elements with attributes - that limitation is no longer needed since we now properly capture and clone attributes. Updates test to verify attributes are preserved through reconstruction.
…ents

Modify get_attribute() to check for stored attributes on the current element's token before falling through to parent implementation. This enables reconstructed formatting elements to expose their original attributes via the standard API.

Key implementation details:
- Check current_element->token->attributes (stack event's token)
- Use case-insensitive lookup via strtolower()
- Return null for non-existent attributes (no parent fallthrough)

Added unit tests verifying:
- Single attribute access on reconstructed elements
- Multiple attribute access on reconstructed elements
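The lookup order can be sketched as below. `sketch_get_attribute()` is a hypothetical free function, with `$stored` standing in for the token's attribute map (`null` when the token has none) and `$parent_lookup` standing in for the parent implementation.

```php
<?php
// Hypothetical sketch of the lookup order; not the WP_HTML_Processor method.
function sketch_get_attribute( ?array $stored, string $name, callable $parent_lookup ) {
    // No stored attributes: defer to the parent implementation.
    if ( null === $stored ) {
        return $parent_lookup( $name );
    }

    // Stored keys are lowercase, so lowercase the requested name too.
    $key = strtolower( $name );

    // A missing attribute yields null without consulting the parent.
    return array_key_exists( $key, $stored ) ? $stored[ $key ] : null;
}
```

Skipping the parent fallthrough matters for virtual elements: there is no tag in the source document at the current position whose attributes would be meaningful to read.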
…ed formatting elements

Modify get_attribute_names_with_prefix() to check for stored attributes on the current element's token before falling through to parent implementation. This enables reconstructed formatting elements to list their original attribute names via the standard API.

Key implementation details:
- Check current_element->token->attributes (stack event's token)
- Use case-insensitive prefix matching via strtolower()
- Return empty array for virtual elements with no matching attributes
- Return null for tag closers

Added unit test verifying:
- All attributes returned with empty prefix
- Prefix filtering works correctly
- Non-matching prefix returns empty array
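A small sketch of the prefix filtering, under the same assumptions as the commit text: stored keys are already lowercase and tag closers report `null`. `sketch_attribute_names_with_prefix()` and its `$is_tag_closer` flag are hypothetical.

```php
<?php
// Hypothetical sketch of prefix filtering over a stored-attribute map.
function sketch_attribute_names_with_prefix( array $stored, string $prefix, bool $is_tag_closer = false ): ?array {
    // Tag closers carry no attributes.
    if ( $is_tag_closer ) {
        return null;
    }

    $prefix  = strtolower( $prefix );
    $matches = array();

    foreach ( array_keys( $stored ) as $name ) {
        // PHP turns numeric-looking string keys into integers; cast back.
        $name = (string) $name;

        // An empty prefix has length 0, so every name matches.
        if ( 0 === strncmp( $name, $prefix, strlen( $prefix ) ) ) {
            $matches[] = $name;
        }
    }

    return $matches;
}
```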
Add two private static methods to WP_HTML_Active_Formatting_Elements that will be used by the Noah's Ark clause to detect duplicate formatting elements:

- elements_have_same_identity(): Compares two tokens by tag name, namespace, and attributes to determine if they represent the same formatting element.
- attributes_are_equal(): Order-independent attribute comparison using lowercase keys (already normalized during capture) and exact value matching.

These helpers follow the HTML5 spec requirement that two elements match when they have identical tag name, namespace, and attributes (where attribute comparison is by name and value, order-independent).

Props dmsnell. See #62857.
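The comparison could look roughly like this. The function names and the array-shaped elements (`tag`, `namespace`, `attributes` keys) are assumptions for illustration; the real helpers are private static methods operating on tokens.

```php
<?php
// Hypothetical sketch of order-independent attribute comparison.
function sketch_attributes_are_equal( array $a, array $b ): bool {
    if ( count( $a ) !== count( $b ) ) {
        return false;
    }

    // Same size plus every key of $a present in $b with an identical value
    // implies the maps are equal regardless of insertion order.
    foreach ( $a as $name => $value ) {
        if ( ! array_key_exists( $name, $b ) || $b[ $name ] !== $value ) {
            return false;
        }
    }

    return true;
}

// Two elements share an identity when tag name, namespace, and attributes match.
function sketch_have_same_identity( array $a, array $b ): bool {
    return $a['tag'] === $b['tag']
        && $a['namespace'] === $b['namespace']
        && sketch_attributes_are_equal( $a['attributes'], $b['attributes'] );
}
```

An explicit loop with `!==` is used instead of PHP's `==` on arrays because loose comparison would treat a boolean attribute stored as `true` as equal to any non-empty string value.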
…lements

Implement the Noah's Ark clause in WP_HTML_Active_Formatting_Elements::push() which limits identical formatting elements to 3 in the active formatting elements list.

When pushing a new element, the implementation:
1. Walks backwards through the stack counting elements that match the new token (same tag name, namespace, and attributes)
2. Stops at markers (which reset the duplicate count per spec)
3. If 3 or more identical elements exist, removes the earliest match
4. Adds the new element to the end of the list

This prevents unbounded accumulation of nested identical formatting elements like `<b><b><b><b>...` - only 3 will be reconstructed when crossing implicit paragraph boundaries.

Uses helper methods elements_have_same_identity() and attributes_are_equal() added in the previous commit for element comparison.
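The four steps can be sketched as follows. This is a simplified model with hypothetical names (`SKETCH_ARK_MARKER`, `sketch_ark_matches()`, `sketch_ark_push()`): the list is a plain array, entries are arrays with `tag` and `attributes` keys, and the namespace check is left out for brevity.

```php
<?php
// Hypothetical stand-ins; not the WP_HTML_Active_Formatting_Elements code.
const SKETCH_ARK_MARKER = 'marker';

// Same tag and same attributes, ignoring attribute order.
function sketch_ark_matches( array $a, array $b ): bool {
    $a_attributes = $a['attributes'];
    $b_attributes = $b['attributes'];
    ksort( $a_attributes );
    ksort( $b_attributes );

    return $a['tag'] === $b['tag'] && $a_attributes === $b_attributes;
}

function sketch_ark_push( array &$list, array $new ): void {
    $count    = 0;
    $earliest = null;

    // Steps 1 and 2: walk backwards, counting matches, stopping at a marker.
    for ( $i = count( $list ) - 1; $i >= 0; $i-- ) {
        $entry = $list[ $i ];
        if ( SKETCH_ARK_MARKER === $entry ) {
            break;
        }
        if ( sketch_ark_matches( $entry, $new ) ) {
            $count++;
            $earliest = $i; // Walking backwards, the latest hit is the earliest entry.
        }
    }

    // Step 3: with three matches already present, drop the earliest one.
    if ( $count >= 3 ) {
        array_splice( $list, $earliest, 1 );
    }

    // Step 4: append the new element.
    $list[] = $new;
}
```

Pushing the same `B` entry four times in a row leaves three entries in the list, while a marker between the third and fourth push leaves all four in place because the count restarts after the marker.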
Add five unit tests covering Noah's Ark behavior:

- test_noahs_ark_limits_identical_elements_to_three: Verifies that more than 3 identical formatting elements are limited to 3
- test_noahs_ark_different_attributes_are_different_elements: Verifies that elements with different attributes are not considered identical
- test_noahs_ark_respects_markers: Documents marker behavior when a scoped element (like BUTTON) closes
- test_noahs_ark_attribute_order_independent: Verifies that attribute order does not affect identity comparison
- test_noahs_ark_different_attribute_values_are_different_elements: Verifies that different attribute values make elements non-identical

Also removes the Noah's Ark skip from the html5lib test suite now that the implementation is complete and the test passes.
…formatting elements Override get_qualified_attribute_name() in WP_HTML_Processor to handle virtual/reconstructed elements with stored attributes. For these elements, the method returns the stored (lowercase) attribute name, applying foreign attribute adjustments for SVG and MathML namespaces as needed. This enables proper attribute name display in tree representations of parsed HTML, where reconstructed formatting elements need to report their original attribute names. Also adds a unit test verifying the behavior for reconstructed elements.
Trac ticket:
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.