
HTMLParser.unknown_decl receives corrupted data ('CDATA[') when parsing an empty CDATA section #140878

@T90REAL

Bug report

Bug description:

When parsing the input <![CDATA[]]>, the unknown_decl hook is incorrectly called with the corrupted, partial string 'CDATA['.

A correct parser has only two acceptable behaviors:

  • If CDATA is supported: Call handle_cdata('').
  • If the declaration is "unrecognized," the unknown_decl hook must receive the entire content inside <!...>, which would be '[CDATA[]]'.

The actual result ('CDATA[') matches neither of those. The reproduction uses the private _set_support_cdata(True) method because, as far as I can tell, it is the only available way to activate this specific code path and expose the bug.
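To make the three strings concrete, here is a small sketch that derives each one from the input by plain slicing. It does not call html.parser at all, and the variable names are only illustrative. The last slice is one reading that happens to reproduce the reported value; it is not taken from the parser's source.

```python
# Illustrative slicing only; this does not call html.parser.
raw = "<![CDATA[]]>"

# Everything between "<!" and ">": the whole declaration body.
whole_declaration = raw[len("<!"):-len(">")]

# Everything between "<![CDATA[" and "]]>": the CDATA payload.
cdata_payload = raw[len("<![CDATA["):-len("]]>")]

# Everything between "<![" and "]]>": same text as the reported value.
observed = raw[len("<!["):-len("]]>")]

print(repr(whole_declaration))  # '[CDATA[]]'
print(repr(cdata_payload))      # ''
print(repr(observed))           # 'CDATA['
```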

from html.parser import HTMLParser

class CdataBugParser(HTMLParser):
    """Records every string passed to the unknown_decl hook."""

    def __init__(self):
        super().__init__()
        self.unknown_decls = []

    def unknown_decl(self, data):
        self.unknown_decls.append(data)

html_input = "<![CDATA[]]>"
parser = CdataBugParser()
parser._set_support_cdata(True)  # private method; enables the CDATA code path
parser.feed(html_input)
print(parser.unknown_decls)
# Output: ['CDATA[']
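For comparison across interpreter versions, a guarded variant of the reproduction may help. This is a sketch under assumptions: collect_unknown_decls is a hypothetical helper, and because _set_support_cdata is private it may be missing on some versions, so the call is skipped when the attribute is absent. No expected output is given, since it can differ between versions.

```python
from html.parser import HTMLParser


class DeclRecorder(HTMLParser):
    """Records every string passed to the unknown_decl hook."""

    def __init__(self):
        super().__init__()
        self.unknown_decls = []

    def unknown_decl(self, data):
        self.unknown_decls.append(data)


def collect_unknown_decls(html_input):
    """Hypothetical helper: feed html_input and return the recorded strings."""
    parser = DeclRecorder()
    # _set_support_cdata is private and may not exist on every version,
    # so only call it when it is present.
    setter = getattr(parser, "_set_support_cdata", None)
    if callable(setter):
        setter(True)
    parser.feed(html_input)
    parser.close()
    return parser.unknown_decls


# Compare an empty CDATA section with a non-empty one.
for sample in ("<![CDATA[]]>", "<![CDATA[x]]>"):
    print(sample, "->", collect_unknown_decls(sample))
```

Running both inputs side by side shows whether the 'CDATA[' prefix is specific to the empty section or appears for every CDATA section.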

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Labels:

3.13, 3.14, 3.15, stdlib, type-bug
