-
-
Notifications
You must be signed in to change notification settings - Fork 33.8k
gh-54873: Add support for namespaces prefixes to xml.sax.expatreader
#118317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ader from PyXml (0.8.4)
picnixz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but as this is a new feature:
- Better tests must be written. Why are we using a new file?
- A What's New entry must be added in addition to the NEWS entry.
- Documentation must be added.
- PyXML is no longer maintained. What about other XML parsers?
Lib/test/test_xml_expatreader.py
Outdated
| self.parser.setContentHandler(h) | ||
| self.parser.feed("<Q:E xmlns:Q='http://example.org/testuri'/>") | ||
| self.parser.close() | ||
| print("self.assertEqual") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A shamelessly forgot debug print.
I removed it
Lib/test/test_xml_expatreader.py
Outdated
| self.parser.feed("<Q:E xmlns:Q='http://example.org/testuri'/>") | ||
| self.parser.close() | ||
| print("self.assertEqual") | ||
| self.assertFalse(h.qname is None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not sufficient. We should match against the exact qname.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Lib/test/test_xml_expatreader.py
Outdated
| @@ -0,0 +1,24 @@ | |||
| import unittest | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we having a new test file? there should already be a testfile for expat reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed all xml unit tests were in the test_xml_* files. Looks like there is a test_sax.py.
Moved test there.
| @@ -0,0 +1,2 @@ | |||
| Backported namespaces prefixes support for xml.sax.expatreader from PyXml | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A What's New entry should be written; this is a new feature. Also, I don't think we should indicate PyXML either. Please look at other changelog entries involving pyexpat to have an idea of how to write them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
reason why I added the "from PyXml" is that I didn't write the code from scratch.
I backported / copied it from the existing PyXml implementation.
Not sure how to correctly credit the original library.
| elif name == feature_namespace_prefixes: | ||
| if state: | ||
| raise SAXNotSupportedException( | ||
| "expat does not report namespace prefixes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it a limitation from the C extension module? was it a Expat version limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick git blame, and looks like this specific check was added in this commit: 18476a3 (from 2002).
and in this commit Expat version used is 1.95 (which should support namespace prefixes), so I would say the limitation was probably just in the python wrapper.
| pair = name.split() | ||
| if len(pair) == 1: | ||
| # no namespace | ||
| elem_qname = name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea whether this is correct or not here. Is there some specs that we could follow for the implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you're referring specificaly to elem_qname, there is the spec: https://www.w3.org/TR/xml-names/#ns-qualnames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added another test to test the "else" branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@picnixz these lines are 1:1 from PyXML 0.8.4:
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be poked with soft cushions! |
|
|
||
| from xml.sax import make_parser, ContentHandler, \ | ||
| SAXException, SAXReaderNotAvailable, SAXParseException | ||
| SAXException, SAXReaderNotAvailable, SAXParseException, handler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why import handler when ContentHandler is already imported? why not use the latter directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the handler.ContentHandler, with ContentHandler.
but I still need to keep the handler import for the namespace/namespace features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?:
| SAXException, SAXReaderNotAvailable, SAXParseException, handler | |
| SAXException, SAXReaderNotAvailable, SAXParseException | |
| from xml.sax.handler import feature_namespaces, feature_namespace_prefixes |
| parser.feed("<E xmlns='http://example.org/testuri'/>") | ||
| parser.close() | ||
| self.assertEqual(h.qname, "E") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't leave 3 blank lines, only 2 is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Lib/test/test_sax.py
Outdated
| parser.feed("<E xmlns='http://example.org/testuri'/>") | ||
| parser.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use parser.parse(..., True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but I didn't add the True, it seems that the parse method only accept one argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. Wait, @hartwork should we do feed() + close() or parse()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@picnixz hi!
My understanding is that:
create_parserisxml.sax.expatreader.create_parserand it creates anxml.sax.xmlreader.XMLReader, first sentence in the docsxml.sax.xmlreader.XMLReaderprovides.parsethat parses in one single go, one single parameter, no booleanisFinalxml.sax.xmlreader.XMLReaderdoes not provide.feedor.closebecause that needsIncrementalParser, not a plainXMLReader.- (
xml.sax.expatreader.create_parsercreatesxml.sax.expatreader.ExpatParserthat is an instance of bothIncrementalParserandXMLReaderfor me in practice.) - I therefore consider
parser.parse("<E xmlns='http://example.org/testuri'/>")to be better suited here because availability of.feedand.closeis not guaranteed on interface level. - (If you want to stick to
.feedhere or elsewhere maybe addself.assertIsInstance(parser, IncrementalParser)prior to a call to communicate that expectation and have the test fail meaningfully for regressions in the future.)
What do you think?
Lib/xml/sax/expatreader.py
Outdated
| self._entity_stack = [] | ||
| self._external_ges = 0 | ||
| self._interning = None | ||
| self._namespace_prefixes = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it 1 by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to use a new namespacePrefixesHandling argument.
| pair = name.split() | ||
| if len(pair) == 1: | ||
| # no namespace | ||
| elem_qname = name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Doc/whatsnew/3.15.rst
Outdated
|
|
||
| .. _billion laughs: https://en.wikipedia.org/wiki/Billion_laughs_attack | ||
|
|
||
| * Add support for namespace prefixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be also documented in the expat docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean I should add a link to the C libexpat doc ?
EDIT: I think you probably meant to document this here: https://docs.python.org/3/library/pyexpat.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, to the specs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
Sorry, I meant: keep the NEWS entry and add the What's New entry (IOW, both of them must be added). |
|
cc @hartwork as the maintainer of Expat |
xml.sax.expatreader
hartwork
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I understand the ideas behind this pull request now. I have compared with what PyXML 0.8.4 did and will also attach my local play code here (that is derived from the test case in this pull request):
- Download: sax_debug.py
I am optimistic that this pull request will be in mergable shape soon. Here is what I found:
| * Add support for `namespace prefixes <https://www.w3.org/TR/xml-names/#dt-prefix>`_. | ||
| (Contributed by Yassir Karroum in :gh:`118317`.) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is adding to section xml.parsers.expat while the changes in here are patching code elsewhere. (I'm unsure when/if news files should be written to directly. I have only seen things go through news files myself.)
|
|
||
| from xml.sax import make_parser, ContentHandler, \ | ||
| SAXException, SAXReaderNotAvailable, SAXParseException | ||
| SAXException, SAXReaderNotAvailable, SAXParseException, handler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?:
| SAXException, SAXReaderNotAvailable, SAXParseException, handler | |
| SAXException, SAXReaderNotAvailable, SAXParseException | |
| from xml.sax.handler import feature_namespaces, feature_namespace_prefixes |
| def startElementNS(self, name, qname, attrs): | ||
| self.qname = qname | ||
|
|
||
| for xml_s, expected_qname in zip(["<Q:E xmlns:Q='http://example.org/testuri'/>", "<E xmlns='http://example.org/testuri'/>", "<E />"], ["Q:E", "E", "E"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider this variant much more readable:
| for xml_s, expected_qname in zip(["<Q:E xmlns:Q='http://example.org/testuri'/>", "<E xmlns='http://example.org/testuri'/>", "<E />"], ["Q:E", "E", "E"]): | |
| for xml_s, expected_qname in ( | |
| ("<Q:E xmlns:Q='http://example.org/testuri'/>", "Q:E"), | |
| ("<E xmlns='http://example.org/testuri'/>", "E"), | |
| ("<E />", "E"), | |
| ): |
| """SAX driver for the pyexpat C module.""" | ||
|
|
||
| def __init__(self, namespaceHandling=0, bufsize=2**16-20): | ||
| def __init__(self, namespaceHandling=0, namespacePrefixesHandling=0, bufsize=2**16-20): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking aloud: if setFeature can toggle this after initialization and PyXML 0.8.4 did not have this, maybe we do not need this direct channel and can just initialize with = 0 below. So my vote for simplification.
| elif name == feature_namespace_prefixes: | ||
| self._namespace_prefixes = state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My vote for moving this one further down — right after elif name == feature_external_pes: — to match the left side of the diff.
| elem_qname = name | ||
| pair = (None, name) | ||
| elif len(pair) == 3: | ||
| elem_qname = "%s:%s" % (pair[2], pair[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: This is the key value provider in this pull request. pair[2] was previously not exposed.
| pair = name.split() | ||
| if len(pair) == 1: | ||
| # no namespace | ||
| elem_qname = name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@picnixz these lines are 1:1 from PyXML 0.8.4:
| @@ -0,0 +1,2 @@ | |||
| Backported namespaces prefixes support for xml.sax.expatreader from PyXml | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Backported namespaces prefixes support for xml.sax.expatreader from PyXml | |
| Backported namespaces prefixes support for xml.sax.expatreader from PyXML |
| qnames[apair] = qname | ||
|
|
||
| self._cont_handler.startElementNS(pair, None, | ||
| self._cont_handler.startElementNS(pair, elem_qname, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will never pass None even with support for namespace prefixes disabled.
The following patch is arguably a bit silly, but it keeps things backwards compatible (and demos the idea):
| self._cont_handler.startElementNS(pair, elem_qname, | |
| if not self._namespace_prefixes: | |
| elem_qname = None | |
| self._cont_handler.startElementNS(pair, elem_qname, |
| for xml_s, expected_qname in zip(["<Q:E xmlns:Q='http://example.org/testuri'/>", "<E xmlns='http://example.org/testuri'/>", "<E />"], ["Q:E", "E", "E"]): | ||
| parser = create_parser() | ||
| parser.setFeature(handler.feature_namespaces, 1) | ||
| parser.setFeature(handler.feature_namespace_prefixes, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there should also be a test for the behavior with this disabled where all QNames returned are None as previously.
Uh oh!
There was an error while loading. Please reload this page.