1 file changed: +4 -11 lines changed

HTMLement
---------

- Why another Python HTML Parser? There is no "HTML Parser" in the "Python" Standard Library.
- Actually, there is the `html.parser.HTMLParser`_ that simply "traverses the DOM tree" and allows me to be notified as
- each tag is being parsed. Usually, when "parsing HTML" I want to query its elements and extract data from it.
+ HTMLement is a pure Python HTML Parser.

- There are a few third party "HTML parsers" available like "lxml", "html5lib" and "beautifulsoup".
- * "lxml" is the best "parser" available, fast and reliable but since it requires "C libraries", it's not always possible to install.
- * "html5lib" is a "pure-python library" and is designed to conform to the "WHATWG HTML" specification. But it is very slow at parsing HTML.
- * "beautifulsoup" is also a "pure-python library" but is considered by most to be "very slow".
-
- The "Object" of this project is to be a "pure-python HTML parser" which is also "faster" than "beautifulsoup".
+ The object of this project is to be a "pure-python HTML parser" which is also "faster" than "beautifulsoup".
And like "beautifulsoup", will also parse invalid html.
- The most simple way to do this is to use `XPath expressions`__.
+
+ The most simple way to do this is to use ElementTree `XPath expressions`__.
Python does support a simple (read limited) XPath engine inside its "ElementTree" module.
A benefit of using "ElementTree" is that it can use a "C implementation" whenever available.
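
As an illustration for readers of this diff (not part of the changed file), here is a minimal sketch of the limited XPath subset that the standard library's ``xml.etree.ElementTree`` understands. The sample markup and variable names are made up for the example, and it uses ``ET.fromstring``, which is a strict XML parser, so the input here is deliberately well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ET.fromstring rejects invalid input).
DOC = """
<html>
  <body>
    <div class="menu">
      <a href="/home">Home</a>
      <a href="/about">About</a>
    </div>
    <div class="footer">
      <a href="/contact">Contact</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# Descendant search combined with an attribute predicate.
menu_links = [a.get("href") for a in root.findall(".//div[@class='menu']/a")]

# Plain descendant search; iterfind() yields matches lazily in document order.
all_links = [a.text for a in root.iterfind(".//a")]

# Positional predicate: the first <a> child of its parent.
first_link = root.find(".//a[1]")

print(menu_links)       # ['/home', '/about']
print(all_links)        # ['Home', 'About', 'Contact']
print(first_link.text)  # Home
```

The subset covers paths, attribute tests and positions; richer XPath features such as ``contains()`` are not among the expressions the ElementTree documentation lists as supported, which is the "read limited" caveat above.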

This "HTML Parser" extends `html.parser.HTMLParser`_ to build a tree of `ElementTree.Element`_ instances.
- The returned "root element" natively supports the ElementTree API.
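
The approach described above can be sketched roughly as follows. This is an illustration only, not HTMLement's actual implementation: the names ``TreeParser``, ``parse_html`` and ``VOID_TAGS`` are hypothetical, and the recovery rules (ignore stray end tags, auto-close whatever is still open) are simplifying assumptions rather than the project's real behaviour.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (assumed subset of the HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeParser(HTMLParser):
    """Hypothetical sketch: feed HTMLParser events into an ElementTree TreeBuilder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports bare attributes (e.g. "disabled") with value None.
        attrib = {name: (value if value is not None else "") for name, value in attrs}
        self._builder.start(tag, attrib)
        if tag in VOID_TAGS:
            self._builder.end(tag)  # void elements are closed immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag: ignore it
        # Close any unclosed children first, then the requested tag.
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text that appears outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:  # auto-close everything left open at end of input
            self._builder.end(self._open.pop())
        return self._builder.close()  # the root Element


def parse_html(text):
    parser = TreeParser()
    parser.feed(text)
    return parser.close()


# Invalid HTML: unclosed <p> and <b>, missing </html>.
root = parse_html('<html><body><p class="intro">Hello<br>world<p>Second <b>bold</p></body>')
print(root.find(".//p[@class='intro']").text)  # Hello
print(root.find(".//b").text)                  # bold
```

Because the result is a plain ``Element``, the usual ``find``/``findall``/``iterfind`` calls work on it directly. Unlike a browser, this sketch nests the second ``<p>`` inside the first instead of closing the first implicitly; a real parser would need per-tag rules for that.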

Install
-------