@@ -47,9 +47,10 @@ allocated:
4747 (` etree.tostring ` ). Input is never mutated by xmlsec.
48482 . ** bytes → xmlsec node** : re-parse with xmlsec's libxml2 (` xmlReadMemory ` ).
49493 . Run the xmlsec operation on that fresh, xmlsec-owned node.
50- 4 . ** xmlsec node → bytes → lxml** : serialize the result with xmlsec's libxml2
51- (` xmlNodeDump ` ), re-parse with lxml (` etree.fromstring ` ), and graft it into
52- the original lxml tree.
50+ 4 . ** xmlsec doc → bytes → lxml** : serialize the whole mutated document with
51+ xmlsec's libxml2 (` xmlDocDumpMemory ` ), re-parse with lxml
52+ (` etree.fromstring ` ), locate the new node, and graft it into the original
53+ lxml tree.
5354
5455Only bytes cross between the two libxml2 worlds — never a pointer.
5556
@@ -60,61 +61,108 @@ Only bytes cross between the two libxml2 worlds — never a pointer.
6061
6162### Bridge helpers
6263
63- Two small helpers in [ src/lxml.c] ( src/lxml.c ) (declared in
64- [ src/lxml.h ] ( src/lxml.h ) ) are the only sanctioned crossing points. They go
65- through lxml's Python API so the tree is always walked by lxml's libxml2:
64+ Three helpers in [ src/lxml.c] ( src/lxml.c ) (declared in [ src/lxml.h ] ( src/lxml.h ) )
65+ are the only sanctioned crossing points. The first two go through lxml's Python
66+ API so the tree is always walked by lxml's libxml2:
6667
6768- ` PyXmlSec_LxmlElementToBytes(element) ` → ` etree.tostring(element, with_tail=False) `
6869- ` PyXmlSec_LxmlElementFromBytes(data) ` → ` etree.fromstring(data) `
6970
71+ - ** ` PyXmlSec_LxmlAddChildViaXmlSec(element, op, ctx, error) ` ** — the reusable
72+ round-trip. It is the ABI-safe drop-in for the old "call an ` xmlSecTmpl*Add* `
73+ function on ` element->_c_node ` , then ` elementFactory ` the result" one-liner.
74+ It owns the entire serialize → mutate → reflect dance (below); a converted
75+ function supplies only the ` op ` callback (the ` xmlSecTmpl* ` call + its args via
76+ ` ctx ` ) and an error string. ** Convert new functions by writing an ` op ` , not by
77+ re-implementing the round-trip.**
78+
7079## Reference implementation: ` template.add_reference `
7180
7281` PyXmlSec_TemplateAddReference ` in [ src/template.c] ( src/template.c ) is the first
73- function converted and the ** template for converting the rest** . Read it
74- alongside this section. The flow:
82+ function converted and the ** template for converting the rest** . After the
83+ refactor it is tiny: parse args, build a ` ctx ` , and call the helper. The
84+ xmlsec-specific part is the ` op ` :
85+
86+ ``` c
87+ static xmlNodePtr PyXmlSec_TemplateAddReferenceOp (xmlNodePtr root, void* ctx) {
88+ struct ... * c = ctx;
89+ return xmlSecTmplSignatureAddReference(root, c->digest, XSTR(c->id), XSTR(c->uri), XSTR(c->type));
90+ }
91+ ```
92+
93+ `op` receives `root` = the copy's root element, i.e. the equivalent of the
94+ original `element->_c_node`, and returns the node it created (same contract as
95+ the old direct call). Everything else is the helper's job.
96+
97+ ### What the helper does (and the key decision)
7598
76- 1 . ` PyXmlSec_LxmlElementToBytes(node ) ` — serialize the ` <Signature> ` element .
99+ 1. `PyXmlSec_LxmlElementToBytes(element )` — serialize the subtree .
771002. `xmlReadMemory(...)` — parse into a throwaway xmlsec-owned `xmlDocPtr`.
78- 3 . ` xmlSecTmplSignatureAddReference(xmlDocGetRootElement(doc), ...) ` — xmlsec
79- adds the ` <Reference> ` to the copy and returns it (` res ` ).
80- 4 . Reflect back: dump ` res ` , ` PyXmlSec_LxmlElementFromBytes(...) ` , then
81- ` find ` the ` SignedInfo ` of the * original* node and ` append ` the new element.
82- 5 . Return the grafted lxml element (a live node in the user's tree, so the
101+ 3. `res = op(xmlDocGetRootElement(doc), ctx)` — run the caller's xmlsec
102+ mutation on the copy. `res` stays attached to `doc`.
103+ 4. Record `res`'s **element-index path** from the root (so it can be re-found
104+ after a round-trip), then dump the **whole** mutated document with
105+ `xmlDocDumpMemory` — *not* the lone `res` subtree.
106+ 5. `PyXmlSec_LxmlElementFromBytes(...)` the dump, walk the recorded path to the
107+ new node, walk the same path (minus the last step) in the *original*
108+ `element` to reach the parent, and `append` the new node there.
109+ 6. Return the grafted lxml node (a live node in the user's tree, so the
83110 incremental builder — `add_transform(ref, ...)` — keeps working).
84111
85- ### Two non-obvious gotchas (the reason this took iterations)
86-
87- These will recur in every function we convert, so they're worth understanding:
88-
89- ** (a) Namespaces are lost on a naive node dump.** ` xmlNodeDump ` of a subtree does
90- * not* emit namespace declarations that live on ancestors. The ` <Reference> `
91- uses the dsig namespace declared up on ` <Signature> ` , so dumping it alone yields
92- ` <Reference> ` with no ` xmlns ` → re-parses into the * wrong* (empty) namespace.
93- Fix: ** ` xmlUnlinkNode(res) ` first, then ` xmlReconciliateNs(doc, res) ` .** Order
94- matters — while the node is still attached, the namespace is reachable via its
95- ancestors, so reconcile thinks nothing is wrong and does nothing. Only after
96- unlinking does reconcile redeclare the namespace onto the node itself.
97-
98- ** (b) Whitespace changes the signature.** xmlsec pretty-prints by inserting
99- newline text nodes between children. The newline it puts * after* the
100- ` <Reference> ` element is a * sibling* tail node, not part of the dumped subtree,
101- so the round-trip drops it.
102- That single missing ` \n ` changes the canonicalized ` SignedInfo ` bytes, which
103- changes the computed ` SignatureValue ` and breaks byte-exact signature fixtures.
104- Fix: capture ` res->next ` 's text (the tail) * before* unlinking, and reapply it as
105- the grafted element's ` .tail ` . Any converted function that relies on xmlsec's
106- formatting must preserve these tail text nodes to stay byte-compatible.
107-
108- ### Memory / ref-count notes
109-
110- - ` res ` is unlinked from ` doc ` , so it is ** not** freed by ` xmlFreeDoc(doc) ` — it
111- must be ` xmlFreeNode ` 'd separately.
112- - The captured ` tail ` is ` xmlStrdup ` 'd → ` xmlFree ` it (NULL-safe on the success
113- path).
114- - Every ` PyObject_CallMethod ` /` PyUnicode_* ` result is a new reference and is
115- ` Py_DECREF ` 'd, including the ` None ` returned by ` .append(...) ` .
116- - The pure-C libxml2 work runs inside ` Py_BEGIN_ALLOW_THREADS ` ; all the lxml
117- Python-API calls run with the GIL held, outside it.
112+ Dumping the **entire** document in step 4 (rather than just `res`) is the load-
113+ bearing decision: it keeps the round-trip byte-identical to the old raw-pointer
114+ code *without* any manual fix-up, because two things that would otherwise
115+ diverge only exist in the surrounding document:
116+
117+ **(a) Namespaces.** `xmlNodeDump` of a *subtree* does not emit namespace
118+ declarations that live on ancestors. `<Reference>` uses the dsig namespace
119+ declared up on `<Signature>`; dump it alone and you get `<Reference>` with no
120+ `xmlns` → it re-parses into the *wrong* (empty) namespace, and xmlsec no longer
121+ recognizes it. Dumping the whole document keeps the `<Signature>` declaration in
122+ the bytes, so lxml re-parses the reference into the correct dsig namespace.
123+
124+ **(b) Whitespace.** xmlsec pretty-prints by inserting newline text nodes between
125+ children. The `\n` it puts *after* `<Reference>` is a *sibling* tail node, not
126+ part of the `<Reference>` subtree. Drop it and the canonicalized `<SignedInfo>`
127+ bytes change → the computed `SignatureValue` changes → byte-exact signature
128+ fixtures break. In a whole-document dump that `\n` is just bytes between two
129+ elements; lxml re-parses it as the new element's `.tail`, and `append` carries
130+ the tail along.
131+
132+ > An earlier (pre-helper) version dumped only the `res` subtree and hit both as
133+ > bugs, patching them by hand (`xmlUnlinkNode` + `xmlReconciliateNs` for the
134+ > namespace; `xmlStrdup`-ing `res->next`'s text and reapplying it as `.tail` for
135+ > the whitespace). The whole-document dump deletes all of that. The cost is that
136+ > the new node must be *located* after the round-trip instead of handed back as
137+ > `res`; the helper does this generically via the element-index path.
138+
139+ ### Helper assumptions & when *not* to use it
140+
141+ The path-based reflect assumes `op` **appends exactly one new node beneath a
142+ parent that already exists in `element`**, and that the templates contain no
143+ comment/PI siblings (so a libxml2 element index equals an lxml child index).
144+ This holds for the `xmlSecTmpl*Add*` family. Two shapes need more than the
145+ current helper:
146+
147+ - **find-or-create** (`xmlSecTmpl*Ensure*`, e.g. `ensure_key_info`): `op` may
148+ return an *existing* node without adding anything → appending it would
149+ duplicate. Needs a "did the tree actually grow?" check before grafting.
150+ - **intermediate ancestors** (e.g. `add_transform` may create a `<Transforms>`
151+ wrapper *and* the `<Transform>`): the topmost new node, not `res`, is what
152+ must be grafted.
153+
154+ Extend the helper (or branch inside it) when you reach those; don't force-fit.
155+
156+ ### Memory / ref-count notes (inside the helper)
157+
158+ - `res` stays attached to `doc`, so `xmlFreeDoc(doc)` reclaims it — it is **not**
159+ freed separately.
160+ - `xmlDocDumpMemory`'s output buffer is `xmlFree`'d (NULL-safe on failure).
161+ - Every `PySequence_GetItem` / `PyObject_CallMethod` result is a new reference
162+ and is `Py_DECREF`'d, including the `None` returned by `.append(...)` and the
163+ reparsed whole-document tree (dropped once the new node is grafted out of it).
164+ - `op` runs inside `Py_BEGIN_ALLOW_THREADS` (pure libxml2/xmlsec, no Python);
165+ all the lxml Python-API calls run with the GIL held, outside it.
118166
119167## Version guard / opt-in escape hatch
120168
@@ -176,14 +224,18 @@ note `ru_maxrss` is **bytes** on macOS, kB on Linux).
176224## Status & rollout
177225
178226- ✅ ` template.add_reference ` — converted and validated (full suite green +
179- 10k-iteration loop, no crash/leak, under a real 2.14↔2.15 mismatch).
227+ 10k-iteration loop, no crash/leak, under a real 2.14↔2.15 mismatch). The
228+ reusable round-trip lives in ` PyXmlSec_LxmlAddChildViaXmlSec ` .
180229- ⬜ Everything else still passes raw lxml nodes to xmlsec and is unsafe on a
181230 mismatch: the rest of ` src/template.c ` , ` src/ds.c ` (sign/verify),
182231 ` src/enc.c ` (encrypt/decrypt), ` src/tree.c ` .
183232
184- When converting another function, reuse the ` add_reference ` pattern and watch
185- for the same two gotchas (namespace reconciliation after unlink; preserving
186- xmlsec's formatting tail nodes). Each function differs in * where* the result
187- grafts back (e.g. ` add_reference ` → ` SignedInfo ` ; ` ensure_key_info ` →
188- ` Signature ` ). The default version guard can only be relaxed once ** all**
189- node-passing paths are converted.
233+ When converting another function, ** write an ` op ` and call
234+ ` PyXmlSec_LxmlAddChildViaXmlSec ` ** rather than re-implementing the round-trip;
235+ the helper handles serialization, the whole-document dump, and grafting (which
236+ sidesteps the namespace/whitespace divergences for free). For the simple
237+ ` xmlSecTmpl*Add* ` family that is the whole change. Watch for the two shapes the
238+ current helper does not yet cover — find-or-create (` ensure_key_info ` ) and
239+ intermediate-ancestor creation (` add_transform ` ) — and extend the helper when
240+ you reach them (see "Helper assumptions & when * not* to use it"). The default
241+ version guard can only be relaxed once ** all** node-passing paths are converted.
0 commit comments