| 1 | +# CLAUDE.md — developer notes |
| 2 | + |
| 3 | +This file is for anyone (human or AI) working on the C internals of |
| 4 | +`python-xmlsec`. User-facing docs live in `doc/source/`. The focus here is the |
| 5 | +libxml2 ABI work tracked in |
| 6 | +[issue #356](https://github.com/xmlsec/python-xmlsec/issues/356). |
| 7 | + |
| 8 | +For the design narrative — the before/after of the approach with code — see |
| 9 | +[developer.md](developer.md). This file is the operational companion: build |
| 10 | +commands, gotchas to remember, and the rollout checklist. |
| 11 | + |
| 12 | +## What this project is |
| 13 | + |
| 14 | +A CPython C extension (`src/*.c`) that bridges two libraries that both build on |
| 15 | +**libxml2**: |
| 16 | + |
| 17 | +- **lxml** — provides the XML tree the user manipulates in Python (`_Element`). |
| 18 | +- **xmlsec1** (`libxmlsec1`) — the C library that signs/encrypts XML. |
| 19 | + |
| 20 | +The extension takes lxml `_Element` objects, reaches into them for the raw |
| 21 | +`xmlNodePtr` (`node->_c_node`) / `xmlDocPtr` (`node->_doc->_c_doc`), and hands |
| 22 | +those pointers to `xmlsec1`. |
| 23 | + |
| 24 | +## The ABI problem (#356) |
| 25 | + |
| 26 | +That pointer-passing is only safe when **lxml and xmlsec1 link the *same* |
| 27 | +libxml2 at runtime**. They often don't: |
| 28 | + |
| 29 | +- lxml wheels bundle their own static libxml2. |
| 30 | +- xmlsec is built against a system / homebrew / static libxml2. |
| 31 | + |
| 32 | +When the two libxml2 versions differ, passing a node allocated by one to the |
| 33 | +other mixes incompatible struct layouts and allocators → memory corruption, |
| 34 | +double-frees, bogus signatures, and segfaults. |
| 35 | + |
| 36 | +Historically the only mitigation was a hard refusal to import on mismatch (the |
| 37 | +`"lxml & xmlsec libxml2 library version mismatch"` guard, issue #283, in |
| 38 | +`PyXmlSec_InitLxmlModule` in [src/lxml.c](src/lxml.c)). |
| 39 | + |
| 40 | +## The fix strategy: serialize across the boundary |
| 41 | + |
| 42 | +Instead of passing raw pointers, round-trip through **serialized XML bytes**. |
| 43 | +Bytes have no ABI; each library only ever touches nodes its *own* libxml2 |
| 44 | +allocated: |
| 45 | + |
| 46 | +1. **lxml → bytes**: serialize the input element with lxml's own libxml2 |
| 47 | + (`etree.tostring`). Input is never mutated by xmlsec. |
| 48 | +2. **bytes → xmlsec node**: re-parse with xmlsec's libxml2 (`xmlReadMemory`). |
| 49 | +3. Run the xmlsec operation on that fresh, xmlsec-owned node. |
| 50 | +4. **xmlsec node → bytes → lxml**: serialize the result with xmlsec's libxml2 |
| 51 | + (`xmlNodeDump`), re-parse with lxml (`etree.fromstring`), and graft it into |
| 52 | + the original lxml tree. |
| 53 | + |
| 54 | +Only bytes cross between the two libxml2 worlds — never a pointer. |
| 55 | + |
| 56 | +> **This decouples lxml from xmlsec, but it assumes the extension and |
| 57 | +> `libxmlsec1` share one libxml2.** If *those two* disagree (e.g. the extension |
| 58 | +> links system libxml2 while libxmlsec1 links homebrew's), step 3 still mixes |
| 59 | +> allocators and crashes. See "Building & validating under a mismatch" below. |
| 60 | +
|
| 61 | +### Bridge helpers |
| 62 | + |
| 63 | +Two small helpers in [src/lxml.c](src/lxml.c) (declared in |
| 64 | +[src/lxml.h](src/lxml.h)) are the only sanctioned crossing points. They go |
| 65 | +through lxml's Python API so the tree is always walked by lxml's libxml2: |
| 66 | + |
| 67 | +- `PyXmlSec_LxmlElementToBytes(element)` → `etree.tostring(element, with_tail=False)` |
| 68 | +- `PyXmlSec_LxmlElementFromBytes(data)` → `etree.fromstring(data)` |
| 69 | + |
| 70 | +## Reference implementation: `template.add_reference` |
| 71 | + |
| 72 | +`PyXmlSec_TemplateAddReference` in [src/template.c](src/template.c) is the first |
| 73 | +function converted and the **template for converting the rest**. Read it |
| 74 | +alongside this section. The flow: |
| 75 | + |
| 76 | +1. `PyXmlSec_LxmlElementToBytes(node)` — serialize the `<Signature>` element. |
| 77 | +2. `xmlReadMemory(...)` — parse into a throwaway xmlsec-owned `xmlDocPtr`. |
| 78 | +3. `xmlSecTmplSignatureAddReference(xmlDocGetRootElement(doc), ...)` — xmlsec |
| 79 | + adds the `<Reference>` to the copy and returns it (`res`). |
| 80 | +4. Reflect back: dump `res`, `PyXmlSec_LxmlElementFromBytes(...)`, then |
| 81 | + `find` the `SignedInfo` of the *original* node and `append` the new element. |
| 82 | +5. Return the grafted lxml element (a live node in the user's tree, so the |
| 83 | + incremental builder — `add_transform(ref, ...)` — keeps working). |
| 84 | + |
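The flow above can be sketched at the Python level. This is only an analogy, not the real C code in `src/template.c`: it assumes the standard library's `xml.etree.ElementTree` as a stand-in for both libxml2 worlds, uses `ET.SubElement` in place of `xmlSecTmplSignatureAddReference`, and the function name `add_reference_roundtrip` is hypothetical.

```python
import xml.etree.ElementTree as ET

DSIG = "http://www.w3.org/2000/09/xmldsig#"
SIGNED_INFO = f"{{{DSIG}}}SignedInfo"


def add_reference_roundtrip(signature):
    """Hypothetical stand-in for the add_reference flow (not the real C code)."""
    # 1. Serialize the caller's <Signature> on the "lxml side".
    data = ET.tostring(signature)
    # 2. Re-parse into a throwaway tree owned by the "xmlsec side".
    copy_root = ET.fromstring(data)
    # 3. Stand-in for xmlSecTmplSignatureAddReference: mutate only the copy.
    res = ET.SubElement(copy_root.find(SIGNED_INFO), f"{{{DSIG}}}Reference")
    # 4. Dump the result, re-parse it, and graft it into the ORIGINAL tree.
    grafted = ET.fromstring(ET.tostring(res))
    signature.find(SIGNED_INFO).append(grafted)
    # 5. Return the live node so callers can keep building on it.
    return grafted


signature = ET.fromstring(f'<Signature xmlns="{DSIG}"><SignedInfo/></Signature>')
ref = add_reference_roundtrip(signature)
# Incremental building on the returned node is visible in the caller's tree.
ET.SubElement(ref, f"{{{DSIG}}}Transforms")
```

The point of returning `grafted` rather than `res` is that only the grafted element lives in the caller's tree; `res` belongs to the throwaway copy and is discarded with it.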
| 85 | +### Two non-obvious gotchas (the reason this took iterations) |
| 86 | + |
| 87 | +These will recur in every function we convert, so they're worth understanding: |
| 88 | + |
| 89 | +**(a) Namespaces are lost on a naive node dump.** `xmlNodeDump` of a subtree does |
| 90 | +*not* emit namespace declarations that live on ancestors. The `<Reference>` |
| 91 | +uses the dsig namespace declared up on `<Signature>`, so dumping it alone yields |
| 92 | +`<Reference>` with no `xmlns` → re-parses into the *wrong* (empty) namespace. |
| 93 | +Fix: **`xmlUnlinkNode(res)` first, then `xmlReconciliateNs(doc, res)`.** Order |
| 94 | +matters — while the node is still attached, the namespace is reachable via its |
| 95 | +ancestors, so reconcile thinks nothing is wrong and does nothing. Only after |
| 96 | +unlinking does reconcile redeclare the namespace onto the node itself. |
| 97 | + |
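Why the missing declaration matters can be seen with a small parse check. The sketch below assumes `xml.etree.ElementTree` from the standard library instead of libxml2 or lxml, purely to stay dependency-free. It illustrates only the re-parse consequence, not the behavior of `xmlNodeDump` or `xmlReconciliateNs` themselves.

```python
import xml.etree.ElementTree as ET

DSIG = "http://www.w3.org/2000/09/xmldsig#"

# Shape of a naive subtree dump: the declaration stayed behind on the ancestor.
naive = ET.fromstring("<Reference/>")

# Shape of the dump once the namespace is redeclared on the node itself.
reconciled = ET.fromstring(f'<Reference xmlns="{DSIG}"/>')

# The first element lands in the empty namespace, the second in dsig.
print(naive.tag)
print(reconciled.tag)
```

A namespace-qualified lookup for `Reference` would match only the second element, which is why the grafted node has to carry its own declaration.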
| 98 | +**(b) Whitespace changes the signature.** xmlsec pretty-prints by inserting |
| 99 | +newline text nodes between children. The newline it puts *after* the |
| 100 | +`<Reference>` element is a *sibling* tail node, not part of the dumped subtree, |
| 101 | +so the round-trip drops it. |
| 102 | +That single missing `\n` changes the canonicalized `SignedInfo` bytes, which |
| 103 | +changes the computed `SignatureValue` and breaks byte-exact signature fixtures. |
| 104 | +Fix: capture `res->next`'s text (the tail) *before* unlinking, and reapply it as |
| 105 | +the grafted element's `.tail`. Any converted function that relies on xmlsec's |
| 106 | +formatting must preserve these tail text nodes to stay byte-compatible. |
| 107 | + |
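The tail problem can be reproduced in miniature. This is a hedged sketch that assumes `xml.etree.ElementTree` as a stand-in for both sides of the boundary; `dump_without_tail` is a hypothetical helper that mimics a tail-less subtree dump and is not part of the extension.

```python
import copy
import xml.etree.ElementTree as ET


def dump_without_tail(element):
    """Serialize an element alone, dropping its tail text."""
    clone = copy.copy(element)  # shallow copy; the original keeps its tail
    clone.tail = None
    return ET.tostring(clone)


# Stand-in for xmlsec's pretty-printed output: "\n" follows <Reference/>.
xmlsec_side = ET.fromstring("<SignedInfo><Reference/>\n</SignedInfo>")
res = xmlsec_side[0]
captured_tail = res.tail  # capture BEFORE the node leaves its tree

# Naive graft: the tail never crossed the boundary.
naive = ET.fromstring("<SignedInfo/>")
naive.append(ET.fromstring(dump_without_tail(res)))

# Fixed graft: reapply the captured tail to the new element.
fixed = ET.fromstring("<SignedInfo/>")
grafted = ET.fromstring(dump_without_tail(res))
grafted.tail = captured_tail
fixed.append(grafted)

print(ET.tostring(naive) == ET.tostring(xmlsec_side))
print(ET.tostring(fixed) == ET.tostring(xmlsec_side))
```

Only the fixed graft serializes to the same bytes as the pretty-printed original, which is the property the byte-exact signature fixtures depend on.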
| 108 | +### Memory / ref-count notes |
| 109 | + |
| 110 | +- `res` is unlinked from `doc`, so it is **not** freed by `xmlFreeDoc(doc)` — it |
| 111 | + must be `xmlFreeNode`'d separately. |
| 112 | +- The captured `tail` is `xmlStrdup`'d → `xmlFree` it (NULL-safe on the success |
| 113 | + path). |
| 114 | +- Every `PyObject_CallMethod`/`PyUnicode_*` result is a new reference and is |
| 115 | + `Py_DECREF`'d, including the `None` returned by `.append(...)`. |
| 116 | +- The pure-C libxml2 work runs inside `Py_BEGIN_ALLOW_THREADS`; all the lxml |
| 117 | + Python-API calls run with the GIL held, outside it. |
| 118 | + |
| 119 | +## Version guard / opt-in escape hatch |
| 120 | + |
| 121 | +`PyXmlSec_InitLxmlModule` in [src/lxml.c](src/lxml.c) still blocks import on a |
| 122 | +libxml2 mismatch **by default**. Setting `PYXMLSEC_SKIP_VERSION_CHECK` bypasses |
| 123 | +the guard. It's needed to exercise the decoupled paths under a mismatch, but it |
| 124 | +is **unsafe for any operation still on the raw-node path** — keep it off in |
| 125 | +normal use, on only for developing/testing #356. |
| 126 | + |
| 127 | +## Building & validating under a mismatch (macOS / homebrew) |
| 128 | + |
| 129 | +Standard in-place build (dynamic, via pkg-config): |
| 130 | + |
| 131 | +```sh |
| 132 | +python -m pip install pkgconfig |
| 133 | +python setup.py build_ext --inplace --force # copies the .so into src/ |
| 134 | +PYTHONPATH=src python -m pytest tests/ |
| 135 | +``` |
| 136 | + |
| 137 | +To actually reproduce #356 you need lxml and xmlsec on **different** libxml2. |
| 138 | +The trap on a homebrew Mac is a *three-way* split: |
| 139 | + |
| 140 | +- lxml: bundled libxml2 (static, e.g. 2.14.x) |
| 141 | +- the extension: links `/usr/lib/libxml2.2.dylib` (old system libxml2) |
| 142 | +- `libxmlsec1.dylib`: links homebrew `libxml2` (e.g. 2.15.x) |
| 143 | + |
| 144 | +The extension and libxmlsec1 disagreeing crashes regardless of the serialize |
| 145 | +work. Force them onto the **same** libxml2 (homebrew's, matching libxmlsec1), |
| 146 | +leaving only lxml different — the real #356 scenario: |
| 147 | + |
| 148 | +```sh |
| 149 | +# build & compile against homebrew libxml2 headers/libs |
| 150 | +rm -rf build/ src/xmlsec.cpython-*-darwin.so |
| 151 | +PKG_CONFIG_PATH=/opt/homebrew/opt/libxml2/lib/pkgconfig \ |
| 152 | + python setup.py build_ext --inplace --force |
| 153 | + |
| 154 | +# the linker still prefers the SDK stub, so rewrite the runtime dep |
| 155 | +install_name_tool -change /usr/lib/libxml2.2.dylib \ |
| 156 | + /opt/homebrew/opt/libxml2/lib/libxml2.16.dylib \ |
| 157 | + src/xmlsec.cpython-*-darwin.so |
| 158 | + |
| 159 | +# verify: lxml and xmlsec now report different libxml2 |
| 160 | +PYXMLSEC_SKIP_VERSION_CHECK=1 PYTHONPATH=src python -c \ |
| 161 | + "import xmlsec; from lxml import etree; \ |
| 162 | + print('lxml', etree.LIBXML_VERSION, 'xmlsec', xmlsec.get_libxml_version())" |
| 163 | + |
| 164 | +# run the suite under the mismatch |
| 165 | +PYXMLSEC_SKIP_VERSION_CHECK=1 PYXMLSEC_TEST_ITERATIONS=0 \ |
| 166 | + PYTHONPATH=src python -m pytest tests/ |
| 167 | +``` |
| 168 | + |
| 169 | +A `PYXMLSEC_STATIC_DEPS=true` build statically links one libxml2 into the |
| 170 | +extension+xmlsec and avoids the whole dance (this is what CI/wheels do). |
| 171 | + |
| 172 | +Useful env vars: `PYXMLSEC_ENABLE_DEBUG=1` (debug build + trace), |
| 173 | +`PYXMLSEC_TEST_ITERATIONS=N` (per-test leak-detection reruns in `tests/base.py`; |
| 174 | +note `ru_maxrss` is **bytes** on macOS, kB on Linux). |
| 175 | + |
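Because of that unit difference, leak thresholds compared across platforms need normalizing. A minimal sketch, assuming a hypothetical helper `maxrss_bytes` (not taken from `tests/base.py`) and assuming every non-macOS platform reports kilobytes:

```python
import sys


def maxrss_bytes(ru_maxrss, platform=sys.platform):
    """Normalize a getrusage() ru_maxrss reading to bytes."""
    # macOS reports bytes; Linux reports kilobytes (1024-byte units).
    # Treating all other platforms like Linux is an assumption of this sketch.
    if platform == "darwin":
        return ru_maxrss
    return ru_maxrss * 1024


if __name__ == "__main__":
    import resource  # Unix-only module, so import it only when actually run

    usage = resource.getrusage(resource.RUSAGE_SELF)
    print(maxrss_bytes(usage.ru_maxrss))
```

Passing `platform` explicitly keeps the conversion testable without running on both operating systems.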
| 176 | +## Status & rollout |
| 177 | + |
| 178 | +- ✅ `template.add_reference` — converted and validated (full suite green + |
| 179 | + 10k-iteration loop, no crash/leak, under a real 2.14↔2.15 mismatch). |
| 180 | +- ⬜ Everything else still passes raw lxml nodes to xmlsec and is unsafe on a |
| 181 | + mismatch: the rest of `src/template.c`, `src/ds.c` (sign/verify), |
| 182 | + `src/enc.c` (encrypt/decrypt), `src/tree.c`. |
| 183 | + |
| 184 | +When converting another function, reuse the `add_reference` pattern and watch |
| 185 | +for the same two gotchas (namespace reconciliation after unlink; preserving |
| 186 | +xmlsec's formatting tail nodes). Each function differs in *where* the result |
| 187 | +grafts back (e.g. `add_reference` → `SignedInfo`; `ensure_key_info` → |
| 188 | +`Signature`). The default version guard can only be relaxed once **all** |
| 189 | +node-passing paths are converted. |