Add HtmlSerializer (Xerces-free) and comprehensive HtmlSerializerTest #1
Copilot wants to merge 2 commits
- Create HtmlSerializer.java: pure-Java replacement for ASHTMLSerializer with no Apache Xerces dependency
- Fix double-'>' serializer bug: parentState.empty was not reset after writing '>' in serializeElement
- Skip whitespace-only text nodes when indenting=true (mirrors Xerces IndentPrinter's buffering behaviour)
- Add trailing newline after root-level elements when indenting=true
- Create HtmlSerializerTest.java with 110 tests covering: basic elements, void/self-closing elements, attributes (regular, boolean, URI, empty), text escaping, entity encoding, style/script unescaped content, comments, indentation, preserve-space elements, round-trip DOM scanner tests, data-driven parameterised tests (50+ HTML constructs), edge cases and regression tests (issues nahsra#30, nahsra#453, nahsra#484)

Agent-Logs-Url: https://github.com/HtmlUnit/antisamy/sessions/5e0402e0-bb48-4455-978e-86fb388a0ca3
Co-authored-by: rbri <2544132+rbri@users.noreply.github.com>
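The double-'>' fix described in the commit message can be illustrated with a minimal sketch. This is not the PR's HtmlSerializer: the class `DeferredCloseSketch`, its `ElementState` type, and the `demo` helper are hypothetical names, and text escaping, attributes, and indentation are deliberately left out. It only shows the idea of deferring the start tag's '>' until the first child arrives and then resetting the `empty` flag so the '>' is written exactly once.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class DeferredCloseSketch {

    /** Hypothetical per-element state: 'empty' means the start tag's '>' is still pending. */
    private static final class ElementState {
        boolean empty = true;
    }

    static String serialize(Node node) {
        StringBuilder out = new StringBuilder();
        write(node, null, out);
        return out.toString();
    }

    private static void write(Node node, ElementState parent, StringBuilder out) {
        // The first piece of content closes the parent's pending start tag.
        if (parent != null && parent.empty) {
            out.append('>');
            parent.empty = false; // without this reset, every later sibling would emit another '>'
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()); // no escaping in this sketch
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments and other node types are ignored here
        }
        ElementState state = new ElementState();
        out.append('<').append(node.getNodeName());
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            write(child, state, out);
        }
        if (state.empty) {
            out.append('>'); // element had no children, so close the start tag now
        }
        out.append("</").append(node.getNodeName()).append('>');
    }

    /** Builds <p><a>x</a>tail</p> in memory and serializes it. */
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            Element a = doc.createElement("a");
            a.appendChild(doc.createTextNode("x"));
            p.appendChild(a);
            p.appendChild(doc.createTextNode("tail"));
            return serialize(p);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <p><a>x</a>tail</p>
    }
}
```

If the `parent.empty = false` line is removed, the text following the `<a>` child triggers a second '>' and the output becomes `<p><a>x</a>>tail</p>`, which matches the symptom the commit message describes.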
I have changed the test suite to use the existing serializer like this:

    private String serializeOld(DocumentFragment frag, InternalPolicy pol) throws Exception { }

A bunch of tests are failing with the old one. Is my setup correct, or do we have a problem with the test suite?
The setup has a bug: the […] The corrected setup (using […]):

    private String serializeOld(DocumentFragment frag, InternalPolicy pol) throws Exception {
        StringWriter out = new StringWriter();
        OutputFormat format = new OutputFormat();
        format.setOmitXMLDeclaration(pol.isOmitXmlDeclaration()); // pol, not policy
        format.setOmitDocumentType(pol.isOmitDoctypeDeclaration());
        format.setPreserveEmptyAttributes(true);
        format.setPreserveSpace(pol.isPreserveSpace());
        if (pol.isFormatOutput()) { // pol, not policy
            format.setLineWidth(80);
            format.setIndenting(true);
            format.setIndent(2);
        }
        org.apache.xml.serialize.HTMLSerializer ser = new ASHTMLSerializer(out, format, pol); // pol, not policy
        ser.serialize(frag);
        return out.getBuffer().toString();
    }

With the corrected setup, I ran 21 test cases (noFormat, format=true, entityEncodeIntlChars=true) and all produce byte-for-byte identical output from both serializers, so the test suite is correct.
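A byte-for-byte comparison like the one described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the harness that was actually run: `SerializerParitySketch` and `mismatches` are hypothetical names, and the two serializers are passed in as plain functions so the example does not depend on Xerces or AntiSamy classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class SerializerParitySketch {

    /**
     * Runs both serializers over every input and returns the inputs
     * whose UTF-8 encoded outputs differ in any byte.
     */
    static <T> List<T> mismatches(List<T> inputs,
                                  Function<T, String> oldSerializer,
                                  Function<T, String> newSerializer) {
        List<T> differing = new ArrayList<>();
        for (T input : inputs) {
            byte[] oldBytes = oldSerializer.apply(input).getBytes(StandardCharsets.UTF_8);
            byte[] newBytes = newSerializer.apply(input).getBytes(StandardCharsets.UTF_8);
            if (!Arrays.equals(oldBytes, newBytes)) {
                differing.add(input);
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Stand-in serializers: the second one differs only for the input "b".
        Function<String, String> oldSer = s -> s;
        Function<String, String> newSer = s -> s.equals("b") ? "B" : s;
        System.out.println(mismatches(Arrays.asList("a", "b", "c"), oldSer, newSer)); // prints [b]
    }
}
```

In a real test, the two functions would wrap the old and new serializer calls for a given fragment and policy; an empty result list then corresponds to identical output on every case.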
@copilot great, now we are closer, but I still have two failing tests:

java.lang.AssertionError: […]
java.lang.AssertionError: […]
Those 2 failures are expected and the test suite is correct. They reveal a genuine bug in […] Confirmed with a direct comparison: […] Both […] With […]
@copilot great, create a different PR that fixes this bug in the ASHTMLSerializer and also includes test cases proving the bug is fixed
Summary
This PR is the first step toward removing the Apache Xerces dependency from AntiSamy, as described in the issue. It replaces the deprecated Xerces `HTMLSerializer`/`ASHTMLSerializer` stack with a pure-Java `HtmlSerializer`, and ships a comprehensive test suite to ensure nothing is broken.

Changes

`HtmlSerializer.java` (new)

Pure-Java HTML serializer that replaces `ASHTMLSerializer` (which extends the Xerces `HTMLSerializer`). It:

- Serializes a `DocumentFragment` to HTML with no Apache Xerces dependency
- Uses `HTMLdtd` metadata (void elements, boolean attributes, URI attributes, preserve-space elements, entity table)
- Supports `formatOutput` (indentation), `entityEncodeIntlChars`, `preserveComments`, `omitXmlDeclaration`, `omitDoctypeDeclaration`, policy-based `allowedEmptyTags` and `requiresClosingTags`

`AbstractAntiSamyScanner.java` + `AntiSamyDOMScanner.java`

Updated to use `HtmlSerializer` instead of `ASHTMLSerializer`.

Bug fixes discovered and fixed

- Double-`>` bug: `parentState.empty` was not reset to `false` after writing `>` in `serializeElement`, causing a second `>` to be emitted for any content following an `<a>` or `<td>` child element
- Skip whitespace-only text nodes when `indenting=true` (mirrors what Xerces `IndentPrinter` does via buffering); fixes indentation for GitHub issue nahsra/antisamy#453
- Add a trailing `\n` after each root-level element when `indenting=true`; fixes GitHub issues nahsra/antisamy#30 and nahsra/antisamy#484

`HtmlSerializerTest.java` (new, 110 tests)

Comprehensive test suite covering:

- Void/self-closing elements (`br`, `hr`, `img`, `input`, `meta`, `col`, `param`)
- Attributes: URI (`href`, `src`), boolean (`selected`, `checked`, `disabled`, `multiple`, `readonly`, `nowrap`), empty, special chars
- Text escaping and entity encoding (`<`, `&`, `>`, `"`, ` `, `©`, `€`, `–`, `—`, `“`, `é`, …)
- Indentation and `formatOutput` behaviour
- Unescaped-content and preserve-space elements (`style`, `script`, `textarea`, `pre`)

Test results

(83 original + 110 new `HtmlSerializerTest` + 39 other existing tests)
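The text-escaping and `entityEncodeIntlChars` behaviour that the tests exercise can be sketched as below. This is an illustration only: `EscapeSketch` and `escapeText` are hypothetical names, the exact set of characters the PR's `HtmlSerializer` escapes is not stated here, and the sketch emits numeric character references for non-ASCII characters as a simplification, whereas the PR says the real serializer draws on the `HTMLdtd` entity table.

```java
final class EscapeSketch {

    /**
     * Escapes the four markup-significant characters and, when requested,
     * replaces every non-ASCII code point with a decimal character reference.
     */
    static String escapeText(String text, boolean encodeNonAscii) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        text.codePoints().forEach(cp -> {
            switch (cp) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (encodeNonAscii && cp > 0x7F) {
                        // Assumption: numeric reference instead of a named entity lookup.
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("a<b & \"c\" \u00e9", true));
        // prints a&lt;b &amp; &quot;c&quot; &#233;
    }
}
```

Iterating over code points rather than `char` values keeps characters outside the Basic Multilingual Plane intact as a single reference instead of two surrogate halves.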