Skip to content

Commit f0ae78d

Browse files
mxaminclaude
andcommitted
Decouple lxml from xmlsec in template.add_reference (#356)
python-xmlsec hands lxml's raw libxml2 node pointers straight to xmlsec1. That only works when lxml and xmlsec link the same libxml2 at runtime; when they differ (e.g. lxml's bundled libxml2 vs a system/homebrew one), mixing the two libraries' nodes corrupts memory and segfaults. Rework add_reference to exchange serialized XML instead of pointers: lxml serializes the template with its own libxml2, xmlsec re-parses and mutates a private copy, and the resulting <Reference> is grafted back into the lxml tree via lxml. Only bytes cross the boundary, never raw node pointers, so the two libxml2 builds never touch each other's nodes. Details: - add PyXmlSec_LxmlElementToBytes/FromBytes bridge helpers (lxml.c/.h) that round-trip through lxml's own etree.tostring/fromstring. - unlink the new node before xmlReconciliateNs so a standalone dump keeps the inherited dsig namespace (reconcile is a no-op while still attached). - capture and reapply the trailing whitespace text node xmlsec emits after the reference, keeping the canonicalized (signed) bytes byte-identical. - gate the libxml2 version-mismatch import guard behind PYXMLSEC_SKIP_VERSION_CHECK so the decoupled path can be exercised under a mismatch (still unsafe for the not-yet-converted node-passing paths). Scope is template.add_reference only; the rest of template.c, ds.c, enc.c and tree.c still pass raw nodes. See developer.md / CLAUDE.md for the approach, gotchas, build/validation recipe and rollout checklist. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent ab5b099 commit f0ae78d

6 files changed

Lines changed: 677 additions & 4 deletions

File tree

CLAUDE.md

Lines changed: 189 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,189 @@
1+
# CLAUDE.md — developer notes
2+
3+
This file is for anyone (human or AI) working on the C internals of
4+
`python-xmlsec`. User-facing docs live in `doc/source/`. The focus here is the
5+
libxml2 ABI work tracked in
6+
[issue #356](https://github.com/xmlsec/python-xmlsec/issues/356).
7+
8+
For the design narrative — the before/after of the approach with code — see
9+
[developer.md](developer.md). This file is the operational companion: build
10+
commands, gotchas to remember, and the rollout checklist.
11+
12+
## What this project is
13+
14+
A CPython C extension (`src/*.c`) that bridges two libraries that both build on
15+
**libxml2**:
16+
17+
- **lxml** — provides the XML tree the user manipulates in Python (`_Element`).
18+
- **xmlsec1** (`libxmlsec1`) — the C library that signs/encrypts XML.
19+
20+
The extension takes lxml `_Element` objects, reaches into them for the raw
21+
`xmlNodePtr` (`node->_c_node`) / `xmlDocPtr` (`node->_doc->_c_doc`), and hands
22+
those pointers to `xmlsec1`.
23+
24+
## The ABI problem (#356)
25+
26+
That pointer-passing is only safe when **lxml and xmlsec1 link the *same*
27+
libxml2 at runtime**. They often don't:
28+
29+
- lxml wheels bundle their own static libxml2.
30+
- xmlsec is built against a system / homebrew / static libxml2.
31+
32+
When the two libxml2 versions differ, passing a node allocated by one to the
33+
other mixes incompatible struct layouts and allocators → memory corruption,
34+
double-frees, bogus signatures, and segfaults.
35+
36+
Historically the only mitigation was a hard refusal to import on mismatch (the
37+
`"lxml & xmlsec libxml2 library version mismatch"` guard, issue #283, in
38+
`PyXmlSec_InitLxmlModule` in [src/lxml.c](src/lxml.c)).
39+
40+
## The fix strategy: serialize across the boundary
41+
42+
Instead of passing raw pointers, round-trip through **serialized XML bytes**.
43+
Bytes have no ABI; each library only ever touches nodes its *own* libxml2
44+
allocated:
45+
46+
1. **lxml → bytes**: serialize the input element with lxml's own libxml2
47+
(`etree.tostring`). Input is never mutated by xmlsec.
48+
2. **bytes → xmlsec node**: re-parse with xmlsec's libxml2 (`xmlReadMemory`).
49+
3. Run the xmlsec operation on that fresh, xmlsec-owned node.
50+
4. **xmlsec node → bytes → lxml**: serialize the result with xmlsec's libxml2
51+
(`xmlNodeDump`), re-parse with lxml (`etree.fromstring`), and graft it into
52+
the original lxml tree.
53+
54+
Only bytes cross between the two libxml2 worlds — never a pointer.
55+
56+
> **This decouples lxml from xmlsec, but it assumes the extension and
57+
> `libxmlsec1` share one libxml2.** If *those two* disagree (e.g. the extension
58+
> links system libxml2 while libxmlsec1 links homebrew's), step 3 still mixes
59+
> allocators and crashes. See "Building & validating under a mismatch" below.
60+
61+
### Bridge helpers
62+
63+
Two small helpers in [src/lxml.c](src/lxml.c) (declared in
64+
[src/lxml.h](src/lxml.h)) are the only sanctioned crossing points. They go
65+
through lxml's Python API so the tree is always walked by lxml's libxml2:
66+
67+
- `PyXmlSec_LxmlElementToBytes(element)``etree.tostring(element, with_tail=False)`
68+
- `PyXmlSec_LxmlElementFromBytes(data)``etree.fromstring(data)`
69+
70+
## Reference implementation: `template.add_reference`
71+
72+
`PyXmlSec_TemplateAddReference` in [src/template.c](src/template.c) is the first
73+
function converted and the **template for converting the rest**. Read it
74+
alongside this section. The flow:
75+
76+
1. `PyXmlSec_LxmlElementToBytes(node)` — serialize the `<Signature>` element.
77+
2. `xmlReadMemory(...)` — parse into a throwaway xmlsec-owned `xmlDocPtr`.
78+
3. `xmlSecTmplSignatureAddReference(xmlDocGetRootElement(doc), ...)` — xmlsec
79+
adds the `<Reference>` to the copy and returns it (`res`).
80+
4. Reflect back: dump `res`, `PyXmlSec_LxmlElementFromBytes(...)`, then
81+
`find` the `SignedInfo` of the *original* node and `append` the new element.
82+
5. Return the grafted lxml element (a live node in the user's tree, so the
83+
incremental builder — `add_transform(ref, ...)` — keeps working).
84+
85+
### Two non-obvious gotchas (the reason this took iterations)
86+
87+
These will recur in every function we convert, so they're worth understanding:
88+
89+
**(a) Namespaces are lost on a naive node dump.** `xmlNodeDump` of a subtree does
90+
*not* emit namespace declarations that live on ancestors. The `<Reference>`
91+
uses the dsig namespace declared up on `<Signature>`, so dumping it alone yields
92+
`<Reference>` with no `xmlns` → re-parses into the *wrong* (empty) namespace.
93+
Fix: **`xmlUnlinkNode(res)` first, then `xmlReconciliateNs(doc, res)`.** Order
94+
matters — while the node is still attached, the namespace is reachable via its
95+
ancestors, so reconcile thinks nothing is wrong and does nothing. Only after
96+
unlinking does reconcile redeclare the namespace onto the node itself.
97+
98+
**(b) Whitespace changes the signature.** xmlsec pretty-prints by inserting
99+
newline text nodes between children. The newline it puts *after* the
100+
`<Reference>` element is a *sibling* tail node, not part of the dumped subtree,
101+
so the round-trip drops it.
102+
That single missing `\n` changes the canonicalized `SignedInfo` bytes, which
103+
changes the computed `SignatureValue` and breaks byte-exact signature fixtures.
104+
Fix: capture `res->next`'s text (the tail) *before* unlinking, and reapply it as
105+
the grafted element's `.tail`. Any converted function that relies on xmlsec's
106+
formatting must preserve these tail text nodes to stay byte-compatible.
107+
108+
### Memory / ref-count notes
109+
110+
- `res` is unlinked from `doc`, so it is **not** freed by `xmlFreeDoc(doc)` — it
111+
must be `xmlFreeNode`'d separately.
112+
- The captured `tail` is `xmlStrdup`'d → `xmlFree` it (NULL-safe on the success
113+
path).
114+
- Every `PyObject_CallMethod`/`PyUnicode_*` result is a new reference and is
115+
`Py_DECREF`'d, including the `None` returned by `.append(...)`.
116+
- The pure-C libxml2 work runs inside `Py_BEGIN_ALLOW_THREADS`; all the lxml
117+
Python-API calls run with the GIL held, outside it.
118+
119+
## Version guard / opt-in escape hatch
120+
121+
`PyXmlSec_InitLxmlModule` in [src/lxml.c](src/lxml.c) still blocks import on a
122+
libxml2 mismatch **by default**. Setting `PYXMLSEC_SKIP_VERSION_CHECK` bypasses
123+
the guard. It's needed to exercise the decoupled paths under a mismatch, but it
124+
is **unsafe for any operation still on the raw-node path** — keep it off in
125+
normal use, on only for developing/testing #356.
126+
127+
## Building & validating under a mismatch (macOS / homebrew)
128+
129+
Standard in-place build (dynamic, via pkg-config):
130+
131+
```sh
132+
python -m pip install pkgconfig
133+
python setup.py build_ext --inplace --force # copies the .so into src/
134+
PYTHONPATH=src python -m pytest tests/
135+
```
136+
137+
To actually reproduce #356 you need lxml and xmlsec on **different** libxml2.
138+
The trap on a homebrew Mac is a *three-way* split:
139+
140+
- lxml: bundled libxml2 (static, e.g. 2.14.x)
141+
- the extension: links `/usr/lib/libxml2.2.dylib` (old system libxml2)
142+
- `libxmlsec1.dylib`: links homebrew `libxml2` (e.g. 2.15.x)
143+
144+
The extension and libxmlsec1 disagreeing crashes regardless of the serialize
145+
work. Force them onto the **same** libxml2 (homebrew's, matching libxmlsec1),
146+
leaving only lxml different — the real #356 scenario:
147+
148+
```sh
149+
# build & compile against homebrew libxml2 headers/libs
150+
rm -rf build/ src/xmlsec.cpython-*-darwin.so
151+
PKG_CONFIG_PATH=/opt/homebrew/opt/libxml2/lib/pkgconfig \
152+
python setup.py build_ext --inplace --force
153+
154+
# the linker still prefers the SDK stub, so rewrite the runtime dep
155+
install_name_tool -change /usr/lib/libxml2.2.dylib \
156+
/opt/homebrew/opt/libxml2/lib/libxml2.16.dylib \
157+
src/xmlsec.cpython-*-darwin.so
158+
159+
# verify: lxml and xmlsec now report different libxml2
160+
PYXMLSEC_SKIP_VERSION_CHECK=1 PYTHONPATH=src python -c \
161+
"import xmlsec; from lxml import etree; \
162+
print('lxml', etree.LIBXML_VERSION, 'xmlsec', xmlsec.get_libxml_version())"
163+
164+
# run the suite under the mismatch
165+
PYXMLSEC_SKIP_VERSION_CHECK=1 PYXMLSEC_TEST_ITERATIONS=0 \
166+
PYTHONPATH=src python -m pytest tests/
167+
```
168+
169+
A `PYXMLSEC_STATIC_DEPS=true` build statically links one libxml2 into the
170+
extension+xmlsec and avoids the whole dance (this is what CI/wheels do).
171+
172+
Useful env vars: `PYXMLSEC_ENABLE_DEBUG=1` (debug build + trace),
173+
`PYXMLSEC_TEST_ITERATIONS=N` (per-test leak-detection reruns in `tests/base.py`;
174+
note `ru_maxrss` is **bytes** on macOS, kB on Linux).
175+
176+
## Status & rollout
177+
178+
-`template.add_reference` — converted and validated (full suite green +
179+
10k-iteration loop, no crash/leak, under a real 2.14↔2.15 mismatch).
180+
- ⬜ Everything else still passes raw lxml nodes to xmlsec and is unsafe on a
181+
mismatch: the rest of `src/template.c`, `src/ds.c` (sign/verify),
182+
`src/enc.c` (encrypt/decrypt), `src/tree.c`.
183+
184+
When converting another function, reuse the `add_reference` pattern and watch
185+
for the same two gotchas (namespace reconciliation after unlink; preserving
186+
xmlsec's formatting tail nodes). Each function differs in *where* the result
187+
grafts back (e.g. `add_reference``SignedInfo`; `ensure_key_info`
188+
`Signature`). The default version guard can only be relaxed once **all**
189+
node-passing paths are converted.

developer.md

Lines changed: 176 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,176 @@
1+
# Developer guide: decoupling lxml and xmlsec across libxml2 (#356)
2+
3+
This document explains the approach change introduced for
4+
[issue #356](https://github.com/xmlsec/python-xmlsec/issues/356): how
5+
`python-xmlsec` used to hand XML nodes to `xmlsec1`, why that is unsafe, and the
6+
serialize-based approach that replaces it. The first function converted is
7+
`template.add_reference`; it is the blueprint for the rest.
8+
9+
For day-to-day build/validation commands and the ongoing rollout checklist, see
10+
[CLAUDE.md](CLAUDE.md). This file is the *why* and the *before/after*.
11+
12+
---
13+
14+
## Background
15+
16+
`python-xmlsec` is a C extension that glues together two libraries, **both built
17+
on libxml2**:
18+
19+
| Library | Role | Its libxml2 |
20+
|------------|----------------------------------------------|--------------------------------|
21+
| `lxml` | the XML tree the user edits in Python | usually bundled in the wheel |
22+
| `xmlsec1` | signs / encrypts XML | system / homebrew / static |
23+
24+
An lxml `_Element` is a Python object that wraps a raw libxml2 `xmlNode`:
25+
26+
```c
27+
struct LxmlElement {
28+
PyObject_HEAD
29+
struct LxmlDocument* _doc; // _doc->_c_doc is the xmlDocPtr
30+
xmlNode* _c_node; // the raw libxml2 node
31+
...
32+
};
33+
```
34+
35+
The extension reaches in for `_c_node` / `_c_doc` and passes those pointers
36+
straight to `xmlsec1`.
37+
38+
## The problem
39+
40+
That only works if **lxml and xmlsec1 use the same libxml2 at runtime**. They
41+
frequently don't — lxml wheels bundle their own static libxml2, while xmlsec is
42+
built against whatever libxml2 is on the system. When the versions differ:
43+
44+
- the `xmlNode` / `xmlDoc` struct layouts and allocators don't match,
45+
- a node allocated by one library and freed/mutated by the other corrupts the
46+
heap,
47+
48+
…producing segfaults, double-frees, and silently wrong signatures. Until now the
49+
only protection was to refuse to import on a version mismatch (issue #283).
50+
51+
## Before: pass the raw lxml node to xmlsec
52+
53+
The original `add_reference` handed `node->_c_node` directly to xmlsec and
54+
wrapped the node xmlsec created (inside lxml's document) back as an `_Element`:
55+
56+
```c
57+
// BEFORE — src/template.c
58+
Py_BEGIN_ALLOW_THREADS;
59+
res = xmlSecTmplSignatureAddReference(node->_c_node, digest->id,
60+
XSTR(id), XSTR(uri), XSTR(type));
61+
Py_END_ALLOW_THREADS;
62+
if (res == NULL) {
63+
PyXmlSec_SetLastError("cannot add reference.");
64+
goto ON_FAIL;
65+
}
66+
return (PyObject*)PyXmlSec_elementFactory(node->_doc, res);
67+
```
68+
69+
`xmlSecTmplSignatureAddReference` (xmlsec's libxml2) walks `node->_c_node`
70+
(lxml's libxml2) and allocates new nodes into lxml's document. Two libxml2
71+
builds touching the same tree → the #356 crash.
72+
73+
## After: exchange serialized XML, never pointers
74+
75+
The new approach keeps each library on its own libxml2 and lets only **bytes**
76+
cross between them:
77+
78+
```
79+
lxml's libxml2 xmlsec's libxml2 lxml's libxml2
80+
node ──► etree.tostring ──► bytes ──► xmlReadMemory ──► xmlSecTmpl... ──► xmlNodeDump ──► bytes ──► etree.fromstring ──► graft into node's SignedInfo
81+
(input, never mutated) (throwaway copy xmlsec owns) (new node lxml owns)
82+
```
83+
84+
```c
85+
// AFTER — src/template.c (abridged; full version has the comments + cleanup)
86+
87+
// 1. lxml serializes the template with its OWN libxml2
88+
serialized = PyXmlSec_LxmlElementToBytes((PyObject*)node);
89+
PyBytes_AsStringAndSize(serialized, &xml_data, &xml_size);
90+
91+
Py_BEGIN_ALLOW_THREADS;
92+
// 2. xmlsec re-parses the bytes into a private copy it owns
93+
doc = xmlReadMemory(xml_data, (int)xml_size, NULL, NULL, XML_PARSE_NONET);
94+
if (doc != NULL) {
95+
// 3. run the xmlsec op on the copy
96+
res = xmlSecTmplSignatureAddReference(xmlDocGetRootElement(doc),
97+
digest->id, XSTR(id), XSTR(uri), XSTR(type));
98+
if (res != NULL) {
99+
tail = capture res->next text; // gotcha (b)
100+
xmlUnlinkNode(res); // gotcha (a): unlink BEFORE
101+
xmlReconciliateNs(doc, res); // reconcile
102+
buffer = xmlBufferCreate();
103+
xmlNodeDump(buffer, doc, res, 0, 0); // serialize just the new node
104+
}
105+
}
106+
Py_END_ALLOW_THREADS;
107+
108+
// 4. lxml re-parses the new node and grafts it back into the user's tree
109+
ref_bytes = PyBytes_FromStringAndSize(xmlBufferContent(buffer), xmlBufferLength(buffer));
110+
new_ref = PyXmlSec_LxmlElementFromBytes(ref_bytes);
111+
signed_info = node.find("{dsig}SignedInfo");
112+
signed_info.append(new_ref);
113+
new_ref.tail = tail; // gotcha (b)
114+
return new_ref;
115+
```
116+
117+
The input `node` is never touched by xmlsec. The returned `new_ref` is a live
118+
lxml node in the user's document, so the incremental builder still works:
119+
120+
```python
121+
ref = xmlsec.template.add_reference(sign, consts.TransformSha1, uri="#data")
122+
trans = xmlsec.template.add_transform(ref, consts.TransformEnveloped) # ref is real
123+
```
124+
125+
### The two bridge helpers
126+
127+
Added in [src/lxml.c](src/lxml.c) / [src/lxml.h](src/lxml.h) — the only
128+
sanctioned crossing points, both going through lxml's Python API so lxml's
129+
libxml2 always does the work:
130+
131+
```c
132+
PyObject* PyXmlSec_LxmlElementToBytes(PyObject* element); // etree.tostring(el, with_tail=False)
133+
PyObject* PyXmlSec_LxmlElementFromBytes(PyObject* data); // etree.fromstring(data)
134+
```
135+
136+
## The two gotchas (these recur in every function we convert)
137+
138+
**(a) Namespaces vanish from a naive node dump.** `xmlNodeDump` does not emit
139+
namespace declarations inherited from *ancestors*. The `<Reference>` lives in
140+
the dsig namespace declared up on `<Signature>`, so dumping it alone gives
141+
`<Reference>` with no `xmlns` and it re-parses into the empty namespace.
142+
Fix: **unlink first, then reconcile** — `xmlUnlinkNode(res)` then
143+
`xmlReconciliateNs(doc, res)`. While the node is still attached the namespace is
144+
reachable via ancestors, so reconcile is a no-op; only after unlinking does it
145+
redeclare the namespace onto the node.
146+
147+
**(b) Whitespace is load-bearing.** xmlsec pretty-prints by inserting `"\n"` text
148+
nodes between siblings. The newline *after* `<Reference>` is a sibling tail node,
149+
not part of the dumped subtree, so the round-trip loses it. That one missing
150+
`\n` changes the canonicalized `SignedInfo`, which changes the computed
151+
`SignatureValue` and breaks byte-exact signature fixtures (`tests/test_ds.py`).
152+
Fix: capture `res->next`'s text before unlinking and reapply it as
153+
`new_ref.tail`.
154+
155+
## Files changed
156+
157+
| File | Change |
158+
|-------------------------------------|------------------------------------------------------------------------|
159+
| [src/template.c](src/template.c) | `PyXmlSec_TemplateAddReference` rewritten to the serialize round-trip |
160+
| [src/lxml.c](src/lxml.c) | added `…ToBytes`/`…FromBytes`; version guard gained an opt-in bypass |
161+
| [src/lxml.h](src/lxml.h) | declarations for the two helpers |
162+
163+
## Version guard / opt-in
164+
165+
`PyXmlSec_InitLxmlModule` ([src/lxml.c](src/lxml.c)) still **refuses to import on
166+
a libxml2 mismatch by default**. `PYXMLSEC_SKIP_VERSION_CHECK` opts out — needed
167+
to exercise the decoupled paths under a mismatch, but **unsafe for any operation
168+
still on the raw-node path**. The guard can only be removed once every
169+
node-passing function is converted.
170+
171+
## Validation
172+
173+
Under a real mismatch (lxml 2.14.6 ↔ xmlsec 2.15.3): full suite **285 passed,
174+
6 skipped**, plus a 10,000-iteration `add_reference` + `add_transform` loop with
175+
no crash and no leak. See [CLAUDE.md](CLAUDE.md) for the exact build/relink
176+
recipe (note the homebrew three-way libxml2 trap) and the rollout checklist.

0 commit comments

Comments
 (0)