This repository was archived by the owner on Jan 21, 2026. It is now read-only.
Based on the discussion in #295. Both Sigil and GitHub need to make small local modifications to the parse tree before reserializing it. This is currently very difficult because of the number of pointers that must be kept in sync, the possibility of introducing memory leaks by not updating them, and the need to pass a GumboParser around for the allocator.
Concrete proposal
Remove the ability to set custom allocators on GumboOptions. Use the system malloc for all memory.
Expose create_node, destroy_node, get_attribute, set_attribute, set_attribute_value, and the vector modification functions (add, remove, remove_at, insert_at) to the public API.
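For anyone who wants a picture of what the vector modification functions (add, remove, remove_at, insert_at) would have to do once everything goes through the system allocator, here is a minimal sketch. SketchVector and the sketch_vector_* names are hypothetical stand-ins, not Gumbo's actual GumboVector or its internal helpers, and the signatures are only a guess at what a public API might look like.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array of pointers; NOT Gumbo's GumboVector. */
typedef struct {
  void** data;
  size_t length;
  size_t capacity;
} SketchVector;

static void sketch_vector_init(SketchVector* v) {
  v->data = NULL;
  v->length = 0;
  v->capacity = 0;
}

/* Make room for one more element. Returns 0 if realloc fails. */
static int sketch_vector_reserve_one(SketchVector* v) {
  if (v->length < v->capacity) return 1;
  size_t new_capacity = v->capacity ? v->capacity * 2 : 4;
  void** grown = realloc(v->data, new_capacity * sizeof(void*));
  if (!grown) return 0;
  v->data = grown;
  v->capacity = new_capacity;
  return 1;
}

/* Insert elem before position index, shifting later elements up by one. */
static int sketch_vector_insert_at(SketchVector* v, void* elem, size_t index) {
  assert(index <= v->length);
  if (!sketch_vector_reserve_one(v)) return 0;
  memmove(&v->data[index + 1], &v->data[index],
          (v->length - index) * sizeof(void*));
  v->data[index] = elem;
  v->length++;
  return 1;
}

/* Append elem at the end. */
static int sketch_vector_add(SketchVector* v, void* elem) {
  return sketch_vector_insert_at(v, elem, v->length);
}

/* Remove and return the element at index; the caller owns the result. */
static void* sketch_vector_remove_at(SketchVector* v, size_t index) {
  assert(index < v->length);
  void* removed = v->data[index];
  memmove(&v->data[index], &v->data[index + 1],
          (v->length - index - 1) * sizeof(void*));
  v->length--;
  return removed;
}

/* Remove the first occurrence of elem. Returns 1 if it was found. */
static int sketch_vector_remove(SketchVector* v, const void* elem) {
  for (size_t i = 0; i < v->length; ++i) {
    if (v->data[i] == elem) {
      sketch_vector_remove_at(v, i);
      return 1;
    }
  }
  return 0;
}

/* Free the backing array only; the elements are not owned by the vector. */
static void sketch_vector_destroy(SketchVector* v) {
  free(v->data);
  sketch_vector_init(v);
}
```

Note that none of these functions take a parser or options argument, which is the simplification the proposal is after. The sketch deliberately leaves element ownership to the caller, since removing a node from a child list and destroying it are separate decisions.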
Current workaround
We currently recommend that people who want mutation wrap the whole parse tree in an API of their choice, mutate that, and then serialize it out. Gumbo's API is simple enough that a tree-walker can be written in a page or so of code, and tree traversal time is negligible compared to parse time (~1%). Several outside bindings have DOM APIs already, e.g. lua-gumbo, gumbo-libxml, and the html5lib and BeautifulSoup adaptors that come with the main distribution.
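To give a rough idea of the size of such a tree-walker, here is a sketch that deep-copies a read-only tree into a mutable one with parent and sibling links. SrcNode is a hypothetical stand-in for the parser's output node (it is not GumboNode, and a real walker would also need to handle text nodes, attributes, and so on), and MutNode is just one possible shape for a wrapper DOM.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical read-only source node, standing in for the parser output. */
typedef struct SrcNode {
  const char* name;
  const struct SrcNode* const* children;
  size_t child_count;
} SrcNode;

/* Hypothetical mutable node owned by the wrapper API. */
typedef struct MutNode {
  char* name;
  struct MutNode* parent;
  struct MutNode* first_child;
  struct MutNode* last_child;
  struct MutNode* next_sibling;
} MutNode;

/* Free a mutable node and everything below it. */
static void mut_free(MutNode* node) {
  if (!node) return;
  MutNode* child = node->first_child;
  while (child) {
    MutNode* next = child->next_sibling;
    mut_free(child);
    child = next;
  }
  free(node->name);
  free(node);
}

/* Recursively copy src into a new mutable subtree under parent.
 * Returns NULL if any allocation fails. Recursion depth equals tree depth,
 * so a very deep document would want an explicit stack instead. */
static MutNode* mut_copy(const SrcNode* src, MutNode* parent) {
  MutNode* node = calloc(1, sizeof *node);
  if (!node) return NULL;

  size_t name_size = strlen(src->name) + 1;
  node->name = malloc(name_size);
  if (!node->name) {
    free(node);
    return NULL;
  }
  memcpy(node->name, src->name, name_size);
  node->parent = parent;

  for (size_t i = 0; i < src->child_count; ++i) {
    MutNode* child = mut_copy(src->children[i], node);
    if (!child) {
      mut_free(node);
      return NULL;
    }
    /* Append the child to the end of the sibling list. */
    if (node->last_child) {
      node->last_child->next_sibling = child;
    } else {
      node->first_child = child;
    }
    node->last_child = child;
  }
  return node;
}
```

After the copy, the original parse output can be destroyed and all mutation happens on the MutNode tree, which is then serialized by the wrapper's own code.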
Benefits
If this is useful to you, you'll probably know it immediately. :-) But enumerating them:
No need to use & learn an outside library just to do mutation.
Mutation can work in terms of the GumboNodes you already have; if you're doing querying or traversal already, there's no need to adapt that code to work on a different DOM representation.
Well-suited to small local mutations, where it feels like overkill to have to reserialize the whole parse tree just to change one node.
Possibly marginally faster, since there's no need to traverse & allocate for a new parse tree, although empirically this effect has been negligible.
Simplified API in some cases, since some functions that previously needed a GumboParser/GumboOptions argument no longer do (notably gumbo_destroy_output).
There is a partial branch demonstrating some of these changes at vmg/development.
Drawbacks
Incompatible with the existing allocator machinery.
Backwards-incompatible; at a minimum, this change results in signature changes for GumboOptions and gumbo_destroy_output, and exposes a half dozen or so new functions.
Possibly more API surface for third-party bindings to wrap. External bindings are under no obligation to offer the full feature set of Gumbo, but if this goes in, there will likely be pressure from users to expand their feature sets.
More API surface for new users of the library to learn.
Many of the existing helpers that would be exposed by this proposal are not designed for efficiency or for this usage. gumbo_get_attribute, for example, takes linear time, and gumbo_create_node wouldn't know where to insert the node in the list of next/prev pointers.
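To make the linear-time point in the last item concrete, an attribute lookup over an unsorted array looks roughly like the sketch below. SketchAttr and sketch_get_attribute are hypothetical names rather than Gumbo's own types or functions, and the case-insensitive name comparison is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical attribute record; NOT Gumbo's GumboAttribute. */
typedef struct {
  const char* name;
  const char* value;
} SketchAttr;

/* ASCII case-insensitive string equality. */
static int names_equal_ci(const char* a, const char* b) {
  while (*a && *b) {
    if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
    ++a;
    ++b;
  }
  return *a == *b; /* equal only if both strings ended together */
}

/* Linear scan: O(count) per lookup, returning the first match or NULL. */
static const SketchAttr* sketch_get_attribute(const SketchAttr* attrs,
                                              size_t count,
                                              const char* name) {
  assert(attrs != NULL || count == 0);
  for (size_t i = 0; i < count; ++i) {
    if (names_equal_ci(attrs[i].name, name)) return &attrs[i];
  }
  return NULL;
}
```

This is fine for the handful of attributes on a typical element, but a set_attribute built on top of it pays for a scan on every call, so code that sets many attributes in a loop would be quadratic in the attribute count.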
Compromise solutions
Replace the custom allocators in GumboOptions with global gumbo_set_allocator/gumbo_set_deallocator functions. This restores custom allocators, but it still prevents different instances of gumbo_parse (e.g. in a multithreaded program) from running with separate heaps, so a per-parse arena, for example, would require locking that destroys many of the speed benefits of an arena.
Have functions take an optional first parameter, perhaps a GumboOptions with the allocator/deallocator functions or a GumboArena, and use that if provided. If NULL, it falls back to the system malloc. This gets all the functionality and doesn't compromise any design options, but it still results in an ugly API, and it's very easy to make a mistake and forget what allocator you used.
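As a sketch of the second compromise, the optional-parameter approach could look something like this. SketchAllocator, sketch_alloc, and sketch_free are hypothetical names made up for illustration; the callback shape with a userdata pointer is an assumption, not a proposed signature.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator bundle passed as an optional first parameter. */
typedef struct {
  void* (*allocate)(void* userdata, size_t size);
  void (*deallocate)(void* userdata, void* ptr);
  void* userdata;
} SketchAllocator;

/* Use the caller's allocator if one was provided, else the system malloc. */
static void* sketch_alloc(const SketchAllocator* a, size_t size) {
  if (a && a->allocate) return a->allocate(a->userdata, size);
  return malloc(size);
}

/* Must be called with the same allocator that produced ptr. */
static void sketch_free(const SketchAllocator* a, void* ptr) {
  if (a && a->deallocate) {
    a->deallocate(a->userdata, ptr);
    return;
  }
  free(ptr);
}

/* Example custom allocator that counts live allocations. */
typedef struct {
  int live;
} CountingState;

static void* counting_allocate(void* userdata, size_t size) {
  CountingState* state = userdata;
  void* ptr = malloc(size);
  if (ptr) state->live++;
  return ptr;
}

static void counting_deallocate(void* userdata, void* ptr) {
  CountingState* state = userdata;
  assert(state->live > 0);
  state->live--;
  free(ptr);
}
```

The hazard mentioned above shows up directly here: nothing stops a caller from allocating with a custom allocator and later calling sketch_free(NULL, ptr). With this counting allocator that only skews the count, but with a real arena it would hand an arena pointer to free().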
Comment with a +1 or -1, or any additional comments or considerations.