Skip to content

Commit 0c8d811

Browse files
mxaminclaude
andcommitted
Extract reusable round-trip helper for the #356 decoupling
The serialize/mutate/graft round-trip that decouples lxml from xmlsec (issue #356) was inlined in template.add_reference and would have been copy-pasted into every remaining node-passing function. Pull it into PyXmlSec_LxmlAddChildViaXmlSec in src/lxml.c: it serializes the element, runs a caller-supplied `op` on a throwaway xmlsec-owned copy, dumps the whole mutated document, re-parses it with lxml, locates the new node by its element-index path, and grafts it into the live tree. Dumping the whole document (not the lone subtree) is what keeps the result byte-identical without manual namespace/tail fix-up. Converting a function is now just writing an `op`; add_reference shrinks to arg-parse + one helper call. The helper covers the xmlSecTmpl*Add* family (one new node appended under an existing parent); find-or-create and intermediate-ancestor shapes are documented for later. Validated under a real 2.14<->2.15 libxml2 mismatch: full suite green, plus a 10k-iteration leak loop with no crash/leak and byte-identical signature fixtures. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent f0ae78d commit 0c8d811

4 files changed

Lines changed: 308 additions & 210 deletions

File tree

CLAUDE.md

Lines changed: 106 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -47,9 +47,10 @@ allocated:
4747
(`etree.tostring`). Input is never mutated by xmlsec.
4848
2. **bytes → xmlsec node**: re-parse with xmlsec's libxml2 (`xmlReadMemory`).
4949
3. Run the xmlsec operation on that fresh, xmlsec-owned node.
50-
4. **xmlsec node → bytes → lxml**: serialize the result with xmlsec's libxml2
51-
(`xmlNodeDump`), re-parse with lxml (`etree.fromstring`), and graft it into
52-
the original lxml tree.
50+
4. **xmlsec doc → bytes → lxml**: serialize the whole mutated document with
51+
xmlsec's libxml2 (`xmlDocDumpMemory`), re-parse with lxml
52+
(`etree.fromstring`), locate the new node, and graft it into the original
53+
lxml tree.
5354

5455
Only bytes cross between the two libxml2 worlds — never a pointer.
5556

@@ -60,61 +61,108 @@ Only bytes cross between the two libxml2 worlds — never a pointer.
6061
6162
### Bridge helpers
6263

63-
Two small helpers in [src/lxml.c](src/lxml.c) (declared in
64-
[src/lxml.h](src/lxml.h)) are the only sanctioned crossing points. They go
65-
through lxml's Python API so the tree is always walked by lxml's libxml2:
64+
Three helpers in [src/lxml.c](src/lxml.c) (declared in [src/lxml.h](src/lxml.h))
65+
are the only sanctioned crossing points. The first two go through lxml's Python
66+
API so the tree is always walked by lxml's libxml2:
6667

6768
- `PyXmlSec_LxmlElementToBytes(element)``etree.tostring(element, with_tail=False)`
6869
- `PyXmlSec_LxmlElementFromBytes(data)``etree.fromstring(data)`
6970

71+
- **`PyXmlSec_LxmlAddChildViaXmlSec(element, op, ctx, error)`** — the reusable
72+
round-trip. It is the ABI-safe drop-in for the old "call an `xmlSecTmpl*Add*`
73+
function on `element->_c_node`, then `elementFactory` the result" one-liner.
74+
It owns the entire serialize → mutate → reflect dance (below); a converted
75+
function supplies only the `op` callback (the `xmlSecTmpl*` call + its args via
76+
`ctx`) and an error string. **Convert new functions by writing an `op`, not by
77+
re-implementing the round-trip.**
78+
7079
## Reference implementation: `template.add_reference`
7180

7281
`PyXmlSec_TemplateAddReference` in [src/template.c](src/template.c) is the first
73-
function converted and the **template for converting the rest**. Read it
74-
alongside this section. The flow:
82+
function converted and the **template for converting the rest**. After the
83+
refactor it is tiny: parse args, build a `ctx`, and call the helper. The
84+
xmlsec-specific part is the `op`:
85+
86+
```c
87+
static xmlNodePtr PyXmlSec_TemplateAddReferenceOp(xmlNodePtr root, void* ctx) {
88+
struct ... * c = ctx;
89+
return xmlSecTmplSignatureAddReference(root, c->digest, XSTR(c->id), XSTR(c->uri), XSTR(c->type));
90+
}
91+
```
92+
93+
`op` receives `root` = the copy's root element, i.e. the equivalent of the
94+
original `element->_c_node`, and returns the node it created (same contract as
95+
the old direct call). Everything else is the helper's job.
96+
97+
### What the helper does (and the key decision)
7598
76-
1. `PyXmlSec_LxmlElementToBytes(node)` — serialize the `<Signature>` element.
99+
1. `PyXmlSec_LxmlElementToBytes(element)` — serialize the subtree.
77100
2. `xmlReadMemory(...)` — parse into a throwaway xmlsec-owned `xmlDocPtr`.
78-
3. `xmlSecTmplSignatureAddReference(xmlDocGetRootElement(doc), ...)` — xmlsec
79-
adds the `<Reference>` to the copy and returns it (`res`).
80-
4. Reflect back: dump `res`, `PyXmlSec_LxmlElementFromBytes(...)`, then
81-
`find` the `SignedInfo` of the *original* node and `append` the new element.
82-
5. Return the grafted lxml element (a live node in the user's tree, so the
101+
3. `res = op(xmlDocGetRootElement(doc), ctx)` — run the caller's xmlsec
102+
mutation on the copy. `res` stays attached to `doc`.
103+
4. Record `res`'s **element-index path** from the root (so it can be re-found
104+
after a round-trip), then dump the **whole** mutated document with
105+
`xmlDocDumpMemory` — *not* the lone `res` subtree.
106+
5. `PyXmlSec_LxmlElementFromBytes(...)` the dump, walk the recorded path to the
107+
new node, walk the same path (minus the last step) in the *original*
108+
`element` to reach the parent, and `append` the new node there.
109+
6. Return the grafted lxml node (a live node in the user's tree, so the
83110
incremental builder — `add_transform(ref, ...)` — keeps working).
84111
85-
### Two non-obvious gotchas (the reason this took iterations)
86-
87-
These will recur in every function we convert, so they're worth understanding:
88-
89-
**(a) Namespaces are lost on a naive node dump.** `xmlNodeDump` of a subtree does
90-
*not* emit namespace declarations that live on ancestors. The `<Reference>`
91-
uses the dsig namespace declared up on `<Signature>`, so dumping it alone yields
92-
`<Reference>` with no `xmlns` → re-parses into the *wrong* (empty) namespace.
93-
Fix: **`xmlUnlinkNode(res)` first, then `xmlReconciliateNs(doc, res)`.** Order
94-
matters — while the node is still attached, the namespace is reachable via its
95-
ancestors, so reconcile thinks nothing is wrong and does nothing. Only after
96-
unlinking does reconcile redeclare the namespace onto the node itself.
97-
98-
**(b) Whitespace changes the signature.** xmlsec pretty-prints by inserting
99-
newline text nodes between children. The newline it puts *after* the
100-
`<Reference>` element is a *sibling* tail node, not part of the dumped subtree,
101-
so the round-trip drops it.
102-
That single missing `\n` changes the canonicalized `SignedInfo` bytes, which
103-
changes the computed `SignatureValue` and breaks byte-exact signature fixtures.
104-
Fix: capture `res->next`'s text (the tail) *before* unlinking, and reapply it as
105-
the grafted element's `.tail`. Any converted function that relies on xmlsec's
106-
formatting must preserve these tail text nodes to stay byte-compatible.
107-
108-
### Memory / ref-count notes
109-
110-
- `res` is unlinked from `doc`, so it is **not** freed by `xmlFreeDoc(doc)` — it
111-
must be `xmlFreeNode`'d separately.
112-
- The captured `tail` is `xmlStrdup`'d → `xmlFree` it (NULL-safe on the success
113-
path).
114-
- Every `PyObject_CallMethod`/`PyUnicode_*` result is a new reference and is
115-
`Py_DECREF`'d, including the `None` returned by `.append(...)`.
116-
- The pure-C libxml2 work runs inside `Py_BEGIN_ALLOW_THREADS`; all the lxml
117-
Python-API calls run with the GIL held, outside it.
112+
Dumping the **entire** document in step 4 (rather than just `res`) is the load-
113+
bearing decision: it keeps the round-trip byte-identical to the old raw-pointer
114+
code *without* any manual fix-up, because two things that would otherwise
115+
diverge only exist in the surrounding document:
116+
117+
**(a) Namespaces.** `xmlNodeDump` of a *subtree* does not emit namespace
118+
declarations that live on ancestors. `<Reference>` uses the dsig namespace
119+
declared up on `<Signature>`; dump it alone and you get `<Reference>` with no
120+
`xmlns` → it re-parses into the *wrong* (empty) namespace, and xmlsec no longer
121+
recognizes it. Dumping the whole document keeps the `<Signature>` declaration in
122+
the bytes, so lxml re-parses the reference into the correct dsig namespace.
123+
124+
**(b) Whitespace.** xmlsec pretty-prints by inserting newline text nodes between
125+
children. The `\n` it puts *after* `<Reference>` is a *sibling* tail node, not
126+
part of the `<Reference>` subtree. Drop it and the canonicalized `<SignedInfo>`
127+
bytes change → the computed `SignatureValue` changes → byte-exact signature
128+
fixtures break. In a whole-document dump that `\n` is just bytes between two
129+
elements; lxml re-parses it as the new element's `.tail`, and `append` carries
130+
the tail along.
131+
132+
> An earlier (pre-helper) version dumped only the `res` subtree and hit both as
133+
> bugs, patching them by hand (`xmlUnlinkNode` + `xmlReconciliateNs` for the
134+
> namespace; `xmlStrdup`-ing `res->next`'s text and reapplying it as `.tail` for
135+
> the whitespace). The whole-document dump deletes all of that. The cost is that
136+
> the new node must be *located* after the round-trip instead of handed back as
137+
> `res`; the helper does this generically via the element-index path.
138+
139+
### Helper assumptions & when *not* to use it
140+
141+
The path-based reflect assumes `op` **appends exactly one new node beneath a
142+
parent that already exists in `element`**, and that the templates contain no
143+
comment/PI siblings (so a libxml2 element index equals an lxml child index).
144+
This holds for the `xmlSecTmpl*Add*` family. Two shapes need more than the
145+
current helper:
146+
147+
- **find-or-create** (`xmlSecTmpl*Ensure*`, e.g. `ensure_key_info`): `op` may
148+
return an *existing* node without adding anything → appending it would
149+
duplicate. Needs a "did the tree actually grow?" check before grafting.
150+
- **intermediate ancestors** (e.g. `add_transform` may create a `<Transforms>`
151+
wrapper *and* the `<Transform>`): the topmost new node, not `res`, is what
152+
must be grafted.
153+
154+
Extend the helper (or branch inside it) when you reach those; don't force-fit.
155+
156+
### Memory / ref-count notes (inside the helper)
157+
158+
- `res` stays attached to `doc`, so `xmlFreeDoc(doc)` reclaims it — it is **not**
159+
freed separately.
160+
- `xmlDocDumpMemory`'s output buffer is `xmlFree`'d (NULL-safe on failure).
161+
- Every `PySequence_GetItem` / `PyObject_CallMethod` result is a new reference
162+
and is `Py_DECREF`'d, including the `None` returned by `.append(...)` and the
163+
reparsed whole-document tree (dropped once the new node is grafted out of it).
164+
- `op` runs inside `Py_BEGIN_ALLOW_THREADS` (pure libxml2/xmlsec, no Python);
165+
all the lxml Python-API calls run with the GIL held, outside it.
118166
119167
## Version guard / opt-in escape hatch
120168
@@ -176,14 +224,18 @@ note `ru_maxrss` is **bytes** on macOS, kB on Linux).
176224
## Status & rollout
177225

178226
-`template.add_reference` — converted and validated (full suite green +
179-
10k-iteration loop, no crash/leak, under a real 2.14↔2.15 mismatch).
227+
10k-iteration loop, no crash/leak, under a real 2.14↔2.15 mismatch). The
228+
reusable round-trip lives in `PyXmlSec_LxmlAddChildViaXmlSec`.
180229
- ⬜ Everything else still passes raw lxml nodes to xmlsec and is unsafe on a
181230
mismatch: the rest of `src/template.c`, `src/ds.c` (sign/verify),
182231
`src/enc.c` (encrypt/decrypt), `src/tree.c`.
183232

184-
When converting another function, reuse the `add_reference` pattern and watch
185-
for the same two gotchas (namespace reconciliation after unlink; preserving
186-
xmlsec's formatting tail nodes). Each function differs in *where* the result
187-
grafts back (e.g. `add_reference``SignedInfo`; `ensure_key_info`
188-
`Signature`). The default version guard can only be relaxed once **all**
189-
node-passing paths are converted.
233+
When converting another function, **write an `op` and call
234+
`PyXmlSec_LxmlAddChildViaXmlSec`** rather than re-implementing the round-trip;
235+
the helper handles serialization, the whole-document dump, and grafting (which
236+
sidesteps the namespace/whitespace divergences for free). For the simple
237+
`xmlSecTmpl*Add*` family that is the whole change. Watch for the two shapes the
238+
current helper does not yet cover — find-or-create (`ensure_key_info`) and
239+
intermediate-ancestor creation (`add_transform`) — and extend the helper when
240+
you reach them (see "Helper assumptions & when *not* to use it"). The default
241+
version guard can only be relaxed once **all** node-passing paths are converted.

src/lxml.c

Lines changed: 153 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -137,6 +137,12 @@ int PyXmlSec_LxmlElementConverter(PyObject* o, PyXmlSec_LxmlElementPtr* p) {
137137
}
138138

139139
PyObject* PyXmlSec_LxmlElementToBytes(PyObject* element) {
140+
// Calls lxml.etree.tostring(element, with_tail=False).
141+
// Input: `element` - an lxml _Element (borrowed reference).
142+
// Output: a new reference to a Python bytes object with the element's
143+
// serialized XML, or NULL with an exception set on failure.
144+
// Example: an lxml element <Reference URI="#foo"/> returns the bytes
145+
// b'<Reference URI="#foo"/>'.
140146
// Done entirely through lxml's Python API so that the tree is walked by the
141147
// same libxml2 that allocated it. `with_tail=False` keeps the output to the
142148
// element itself, which xmlsec's libxml2 then re-parses from bytes.
@@ -171,3 +177,150 @@ PyObject* PyXmlSec_LxmlElementFromBytes(PyObject* data) {
171177
Py_DECREF(etree);
172178
return result;
173179
}
180+
181+
// dsig templates never nest anywhere near this deep; the bound just keeps the
182+
// path buffer on the stack and guards against a pathological tree.
183+
#define PYXMLSEC_MAX_TEMPLATE_DEPTH 64
184+
185+
PyObject* PyXmlSec_LxmlAddChildViaXmlSec(
186+
PyXmlSec_LxmlElementPtr element, PyXmlSec_LxmlXmlSecOp op, void* ctx, const char* error) {
187+
188+
PyObject* serialized = NULL;
189+
PyObject* result_bytes = NULL;
190+
PyObject* new_tree = NULL;
191+
PyObject* new_node = NULL;
192+
PyObject* parent = NULL;
193+
PyObject* appended = NULL;
194+
195+
xmlDocPtr doc = NULL;
196+
xmlNodePtr res = NULL;
197+
xmlChar* dump = NULL;
198+
int dump_size = 0;
199+
200+
char* xml_data = NULL;
201+
Py_ssize_t xml_size = 0;
202+
203+
// Element-only child indices from the copy's root down to the node `op`
204+
// produced, stored root-first; `depth` is its length (>= 1 on success).
205+
int path[PYXMLSEC_MAX_TEMPLATE_DEPTH];
206+
int depth = 0;
207+
int overflow = 0;
208+
int i;
209+
210+
serialized = PyXmlSec_LxmlElementToBytes((PyObject*)element);
211+
if (serialized == NULL || PyBytes_AsStringAndSize(serialized, &xml_data, &xml_size) < 0) {
212+
goto ON_FAIL;
213+
}
214+
215+
Py_BEGIN_ALLOW_THREADS;
216+
doc = xmlReadMemory(xml_data, (int)xml_size, NULL, NULL, XML_PARSE_NONET);
217+
if (doc != NULL) {
218+
res = op(xmlDocGetRootElement(doc), ctx);
219+
if (res != NULL) {
220+
// Record res's element-index path up to the root: first measure the
221+
// depth, then walk again filling `path` from the back so it ends up
222+
// ordered root-first.
223+
xmlNodePtr cur = res;
224+
int d = 0;
225+
while (cur->parent != NULL && cur->parent->type == XML_ELEMENT_NODE) {
226+
++d;
227+
cur = cur->parent;
228+
}
229+
if (d < 1 || d > PYXMLSEC_MAX_TEMPLATE_DEPTH) {
230+
overflow = 1;
231+
} else {
232+
depth = d;
233+
for (cur = res; d > 0; cur = cur->parent) {
234+
int idx = 0;
235+
xmlNodePtr s;
236+
for (s = cur->prev; s != NULL; s = s->prev) {
237+
if (s->type == XML_ELEMENT_NODE) { ++idx; }
238+
}
239+
path[--d] = idx;
240+
}
241+
// Dump the whole mutated document, not just res: the surrounding
242+
// context carries the new node's ancestor-declared namespace and
243+
// the formatting whitespace xmlsec appends after it, so lxml
244+
// reconstructs both on re-parse — no manual reconcile/tail fix-up.
245+
xmlDocDumpMemory(doc, &dump, &dump_size);
246+
}
247+
}
248+
}
249+
Py_END_ALLOW_THREADS;
250+
251+
if (doc == NULL) {
252+
PyErr_SetString(PyXmlSec_InternalError, "cannot parse serialized template.");
253+
goto ON_FAIL;
254+
}
255+
if (res == NULL) {
256+
PyXmlSec_SetLastError(error);
257+
goto ON_FAIL;
258+
}
259+
if (overflow) {
260+
PyErr_SetString(PyXmlSec_InternalError, "unexpected template structure.");
261+
goto ON_FAIL;
262+
}
263+
if (dump == NULL || dump_size <= 0) {
264+
PyErr_SetString(PyXmlSec_InternalError, "cannot serialize result.");
265+
goto ON_FAIL;
266+
}
267+
268+
// Re-parse the whole result with lxml so the nodes are lxml-owned.
269+
result_bytes = PyBytes_FromStringAndSize((const char*)dump, (Py_ssize_t)dump_size);
270+
if (result_bytes == NULL) {
271+
goto ON_FAIL;
272+
}
273+
new_tree = PyXmlSec_LxmlElementFromBytes(result_bytes);
274+
if (new_tree == NULL) {
275+
goto ON_FAIL;
276+
}
277+
278+
// Locate the produced node in the re-parsed tree by walking the recorded path.
279+
new_node = new_tree;
280+
Py_INCREF(new_node);
281+
for (i = 0; i < depth; ++i) {
282+
PyObject* child = PySequence_GetItem(new_node, path[i]);
283+
Py_DECREF(new_node);
284+
new_node = child;
285+
if (new_node == NULL) {
286+
goto ON_FAIL;
287+
}
288+
}
289+
290+
// Walk the same path (minus the final step) in the *original* tree to reach
291+
// the parent the new node belongs under, then graft it onto the live tree.
292+
parent = (PyObject*)element;
293+
Py_INCREF(parent);
294+
for (i = 0; i < depth - 1; ++i) {
295+
PyObject* child = PySequence_GetItem(parent, path[i]);
296+
Py_DECREF(parent);
297+
parent = child;
298+
if (parent == NULL) {
299+
goto ON_FAIL;
300+
}
301+
}
302+
appended = PyObject_CallMethod(parent, "append", "O", new_node);
303+
if (appended == NULL) {
304+
goto ON_FAIL;
305+
}
306+
307+
xmlFreeDoc(doc);
308+
xmlFree(dump);
309+
Py_DECREF(serialized);
310+
Py_DECREF(result_bytes);
311+
Py_DECREF(new_tree);
312+
Py_DECREF(parent);
313+
Py_DECREF(appended);
314+
return new_node;
315+
316+
ON_FAIL:
317+
if (doc != NULL) { xmlFreeDoc(doc); }
318+
if (dump != NULL) { xmlFree(dump); }
319+
Py_XDECREF(serialized);
320+
Py_XDECREF(result_bytes);
321+
Py_XDECREF(new_tree);
322+
Py_XDECREF(new_node);
323+
Py_XDECREF(parent);
324+
Py_XDECREF(appended);
325+
return NULL;
326+
}

src/lxml.h

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,28 @@ PyObject* PyXmlSec_LxmlElementToBytes(PyObject* element);
4040
// (with an exception set) on failure.
4141
PyObject* PyXmlSec_LxmlElementFromBytes(PyObject* data);
4242

43+
// Performs the actual xmlsec mutation on a private, xmlsec-owned copy of an
44+
// element. `root` is the copy's root element (the equivalent of the original
45+
// `element->_c_node`); `ctx` carries the call's extra arguments. Must return the
46+
// node it created, or NULL on failure (with the xmlsec error already recorded).
47+
typedef xmlNodePtr (*PyXmlSec_LxmlXmlSecOp)(xmlNodePtr root, void* ctx);
48+
49+
// ABI-safe replacement for the "call an xmlSecTmpl*Add* function on `element`'s
50+
// raw node, then wrap the returned node with elementFactory" pattern (see
51+
// https://github.com/xmlsec/python-xmlsec/issues/356). Serializes `element`,
52+
// runs `op` on a throwaway xmlsec-owned copy, then reflects the node `op`
53+
// produced back into `element`'s live lxml tree and returns it as a new lxml
54+
// _Element. Only bytes cross the lxml/xmlsec boundary, never raw pointers.
55+
// Returns a new reference, or NULL with an exception set; `error` is the message
56+
// raised when `op` returns NULL.
57+
//
58+
// Assumes `op` appends exactly one new node beneath a parent that already exists
59+
// in `element` (true for the xmlSecTmpl*Add* family). Find-or-create ops
60+
// (xmlSecTmpl*Ensure*) and ops that also create intermediate ancestors need
61+
// extra handling — see CLAUDE.md.
62+
PyObject* PyXmlSec_LxmlAddChildViaXmlSec(
63+
PyXmlSec_LxmlElementPtr element, PyXmlSec_LxmlXmlSecOp op, void* ctx, const char* error);
64+
4365
// get version numbers for libxml2 both compiled and loaded
4466
long PyXmlSec_GetLibXmlVersionMajor();
4567
long PyXmlSec_GetLibXmlVersionMinor();

0 commit comments

Comments
 (0)