Commit edaf55c

Merge pull request #7 from bluedynamics/feature/direct-json-writer
Direct JSON writer + class pickle cache (R3-R4)
2 parents 4dabad5 + d6b0e01 commit edaf55c

12 files changed

Lines changed: 2972 additions & 129 deletions

BENCHMARKS.md

Lines changed: 92 additions & 92 deletions
@@ -3,9 +3,9 @@
 Comparison of `zodb-json-codec` (Rust + PyO3) vs CPython's `pickle` module
 for ZODB record encoding/decoding.
 
-Measured on: 2026-02-24
+Measured on: 2026-02-25
 Python: 3.13.9, PyO3: 0.28, 5000 iterations, 100 warmup
-Build: `maturin develop --release` (optimized, LTO + codegen-units=1 + PGO)
+Build: `maturin develop --release` + PGO (LTO + codegen-units=1)
 
 **Important:** Always benchmark with `maturin develop --release`. Debug builds
 are 3-8x slower due to missing optimizations and inlining.
@@ -20,7 +20,8 @@ The codec does fundamentally more work than `pickle.loads`/`pickle.dumps`:
 
 The codec's value is not raw speed but **JSONB queryability** — enabling SQL
 queries on ZODB object attributes in PostgreSQL. Despite the extra work, the
-release build beats CPython pickle on most operations.
+release build beats CPython pickle on encode and roundtrip across all
+categories, and on decode for all but the largest string-dominated payloads.
 
 ---
 
@@ -30,64 +31,66 @@ release build beats CPython pickle on most operations.
 
 | Category | Python | Codec | Ratio |
 |---|---|---|---|
-| simple_flat_dict (120 B) | 1.9 us | 1.1 us | **1.8x faster** |
-| nested_dict (187 B) | 2.9 us | 1.8 us | **1.6x faster** |
-| large_flat_dict (2.5 KB) | 22.8 us | 19.7 us | **1.2x faster** |
-| bytes_in_state (1 KB) | 1.8 us | 1.9 us | 1.1x slower |
-| special_types (314 B) | 6.8 us | 4.7 us | **1.5x faster** |
-| btree_small (112 B) | 1.9 us | 1.8 us | 1.1x faster |
-| btree_length (44 B) | 1.0 us | 0.5 us | **2.0x faster** |
-| scalar_string (72 B) | 1.1 us | 0.5 us | **2.1x faster** |
-| wide_dict (27 KB) | 264 us | 279 us | 1.1x slower |
-| deep_nesting (379 B) | 7.2 us | 7.3 us | 1.0x |
+| simple_flat_dict (120 B) | 1.9 us | 1.0 us | **1.9x faster** |
+| nested_dict (187 B) | 2.7 us | 1.6 us | **1.3x faster** |
+| large_flat_dict (2.5 KB) | 22.6 us | 18.0 us | **1.3x faster** |
+| bytes_in_state (1 KB) | 1.6 us | 1.4 us | **1.1x faster** |
+| special_types (314 B) | 6.8 us | 3.8 us | **1.8x faster** |
+| btree_small (112 B) | 1.7 us | 1.5 us | **1.2x faster** |
+| btree_length (44 B) | 1.0 us | 0.4 us | **2.3x faster** |
+| scalar_string (72 B) | 1.1 us | 0.5 us | **2.2x faster** |
+| wide_dict (27 KB) | 250 us | 244.5 us | **1.0x faster** |
+| deep_nesting (379 B) | 6.9 us | 6.4 us | 1.0x slower |
 
 ### Decode to JSON string (pickle bytes -> JSON, all in Rust)
 
-The direct path for PG storage — serializes to a JSON string entirely in Rust
-with the GIL released. Compared against the dict path + `json.dumps()`.
+The direct path for PG storage — writes JSON tokens directly to a `String`
+buffer from the PickleValue AST, entirely in Rust with the GIL released.
+No intermediate `serde_json::Value` allocations. Compared against the dict
+path + `json.dumps()`.
 
 | Category | Dict+dumps | JSON str | Speedup |
 |---|---|---|---|
-| simple_flat_dict | 2.7 us | 1.3 us | **2.2x faster** |
-| nested_dict | 4.3 us | 2.5 us | **1.7x faster** |
-| large_flat_dict | 35.4 us | 25.6 us | **1.4x faster** |
-| bytes_in_state | 5.7 us | 2.7 us | **2.1x faster** |
-| special_types | 7.1 us | 4.7 us | **1.5x faster** |
-| btree_small | 3.8 us | 2.1 us | **1.8x faster** |
-| btree_length | 1.5 us | 0.8 us | **1.9x faster** |
-| scalar_string | 0.9 us | 0.7 us | **1.3x faster** |
-| wide_dict | 273.7 us | 307.6 us | 1.1x slower |
-| deep_nesting | 13.3 us | 8.6 us | **1.5x faster** |
+| simple_flat_dict | 2.7 us | 1.1 us | **2.5x faster** |
+| nested_dict | 4.3 us | 1.9 us | **2.3x faster** |
+| large_flat_dict | 33.7 us | 17.1 us | **2.0x faster** |
+| bytes_in_state | 5.2 us | 1.6 us | **3.3x faster** |
+| special_types | 7.5 us | 4.0 us | **1.9x faster** |
+| btree_small | 3.6 us | 1.6 us | **2.3x faster** |
+| btree_length | 1.4 us | 0.5 us | **2.8x faster** |
+| scalar_string | 0.8 us | 0.6 us | **1.3x faster** |
+| wide_dict | 290.5 us | 161.6 us | **1.8x faster** |
+| deep_nesting | 14.2 us | 5.7 us | **2.5x faster** |
 
 ### Encode (Python dict -> pickle bytes)
 
 | Category | Python | Codec | Ratio |
 |---|---|---|---|
-| simple_flat_dict | 1.3 us | 0.2 us | **6.5x faster** |
-| nested_dict | 1.5 us | 0.3 us | **4.8x faster** |
-| large_flat_dict | 5.3 us | 1.5 us | **3.5x faster** |
-| bytes_in_state | 1.2 us | 0.7 us | **1.7x faster** |
-| special_types | 4.7 us | 0.5 us | **9.8x faster** |
-| btree_small | 1.3 us | 0.2 us | **6.0x faster** |
-| btree_length | 1.1 us | 0.1 us | **8.8x faster** |
-| scalar_string | 1.2 us | 0.1 us | **8.3x faster** |
-| wide_dict | 56.4 us | 13.9 us | **4.0x faster** |
-| deep_nesting | 2.8 us | 1.0 us | **2.8x faster** |
+| simple_flat_dict | 1.3 us | 0.2 us | **6.7x faster** |
+| nested_dict | 1.6 us | 0.3 us | **6.4x faster** |
+| large_flat_dict | 5.7 us | 1.6 us | **3.9x faster** |
+| bytes_in_state | 1.3 us | 0.8 us | **1.7x faster** |
+| special_types | 4.6 us | 0.5 us | **9.2x faster** |
+| btree_small | 1.3 us | 0.2 us | **6.6x faster** |
+| btree_length | 1.0 us | 0.1 us | **8.0x faster** |
+| scalar_string | 1.0 us | 0.1 us | **7.9x faster** |
+| wide_dict | 56.9 us | 13.7 us | **4.1x faster** |
+| deep_nesting | 2.6 us | 1.0 us | **2.6x faster** |
 
 ### Full roundtrip (decode + encode)
 
 | Category | Python | Codec | Ratio |
 |---|---|---|---|
-| simple_flat_dict | 3.2 us | 1.4 us | **2.4x faster** |
-| nested_dict | 4.5 us | 2.1 us | **2.2x faster** |
-| large_flat_dict | 29.7 us | 19.1 us | **1.6x faster** |
-| bytes_in_state | 3.3 us | 2.4 us | **1.4x faster** |
-| special_types | 11.7 us | 4.4 us | **2.7x faster** |
-| btree_small | 5.8 us | 1.8 us | **3.3x faster** |
-| btree_length | 2.1 us | 0.6 us | **3.6x faster** |
-| scalar_string | 2.3 us | 0.6 us | **3.6x faster** |
-| wide_dict | 316 us | 260 us | **1.2x faster** |
-| deep_nesting | 10.3 us | 7.3 us | **1.4x faster** |
+| simple_flat_dict | 3.2 us | 1.3 us | **2.6x faster** |
+| nested_dict | 4.4 us | 2.1 us | **2.1x faster** |
+| large_flat_dict | 28.7 us | 19.8 us | **1.5x faster** |
+| bytes_in_state | 3.1 us | 2.3 us | **1.4x faster** |
+| special_types | 11.5 us | 4.9 us | **2.4x faster** |
+| btree_small | 3.1 us | 1.8 us | **1.7x faster** |
+| btree_length | 2.0 us | 0.6 us | **3.4x faster** |
+| scalar_string | 2.1 us | 0.6 us | **3.5x faster** |
+| wide_dict | 318 us | 258.8 us | **1.3x faster** |
+| deep_nesting | 10.0 us | 7.8 us | **1.3x faster** |
 
 ### Output size (pickle bytes vs JSON)
 
@@ -122,18 +125,18 @@ plus OOBTree containers, group summaries, and edge-case objects.
 
 | Metric | Codec | Python | Speedup |
 |---|---|---|---|
-| Decode mean | 26.9 us | 22.2 us | 1.2x slower |
-| Decode median | 23.2 us | 21.6 us | 1.1x slower |
-| Decode P95 | 39.7 us | 31.7 us | 1.3x slower |
-| Encode mean | 4.7 us | 18.0 us | **3.8x faster** |
-| Encode median | 3.9 us | 19.7 us | **5.1x faster** |
-| Encode P95 | 9.6 us | 29.1 us | **3.0x faster** |
+| Decode mean | 27.2 us | 22.7 us | 1.2x slower |
+| Decode median | 23.6 us | 22.2 us | 1.1x slower |
+| Decode P95 | 40.5 us | 33.1 us | 1.2x slower |
+| Encode mean | 4.8 us | 18.2 us | **3.8x faster** |
+| Encode median | 4.0 us | 19.9 us | **5.0x faster** |
+| Encode P95 | 9.9 us | 30.0 us | **3.0x faster** |
 | Total pickle | 5.1 MB |||
 | Total JSON | 7.2 MB || 1.41x |
 
 Decode is slightly slower (1.1x median) due to the two-pass conversion plus
 type-aware transformation. The gap narrows on metadata-heavy records.
-Encode is consistently **3.0-5.1x faster** because the Rust encoder writes
+Encode is consistently **3.0-5.0x faster** because the Rust encoder writes
 pickle opcodes directly from Python objects, bypassing intermediate allocations.
 
 ### Record type distribution
@@ -154,26 +157,27 @@ pickle opcodes directly from Python objects, bypassing intermediate allocations.
 The zodb-pgjsonb storage path has two decode functions. The dict path
 (`decode_zodb_record_for_pg`) returns a Python dict that must then be
 serialized via `json.dumps()`. The JSON string path
-(`decode_zodb_record_for_pg_json`) does everything in Rust with the GIL
-released. See the synthetic comparison above.
+(`decode_zodb_record_for_pg_json`) writes JSON tokens directly from the
+PickleValue AST to a `String` buffer, entirely in Rust with the GIL released.
 
 ```
 Dict path: pickle bytes → Rust AST → Python dict (GIL held) → json.dumps() → PG
-JSON path: pickle bytes → Rust AST → serde_json → JSON string (all Rust, GIL released) → PG
+JSON path: pickle bytes → Rust AST → JSON string (direct write, GIL released) → PG
 ```
 
 ### 1,692 records
 
 | Metric | Dict+dumps | JSON str | Speedup |
 |---|---|---|---|
-| Mean | 41.3 us | 31.5 us | **1.3x faster** |
-| Median | 35.9 us | 26.9 us | **1.3x faster** |
-| P95 | 64.2 us | 47.7 us | **1.3x faster** |
+| Mean | 40.4 us | 28.3 us | **1.4x faster** |
+| Median | 34.7 us | 24.4 us | **1.4x faster** |
+| P95 | 62.0 us | 51.9 us | **1.2x faster** |
 
-The JSON string path is **1.3x faster** across real-world data because
-it eliminates the Python dict allocation + `json.dumps()` serialization.
-The entire pipeline runs in Rust with the GIL released, improving
-multi-threaded throughput in Zope/Plone deployments.
+The JSON string path is **1.4x faster** across real-world data because
+it eliminates both the Python dict allocation + `json.dumps()` serialization
+and all intermediate `serde_json::Value` heap allocations. The entire pipeline
+runs in Rust with the GIL released, improving multi-threaded throughput in
+Zope/Plone deployments.
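
As an illustration of the direct-write approach this hunk describes (not part of the commit itself), the idea can be sketched in Python. The real writer is Rust code in `json_writer.rs` and is not shown in this diff, so everything below is an assumption made for illustration: the `Str`, `Int`, `Seq` and `Map` node classes stand in for the PickleValue AST, and `write_json` and `to_json` are hypothetical names. The point is only that tokens are appended to one output buffer while walking the tree, with no intermediate dict or JSON value tree built first.

```python
import json
from dataclasses import dataclass


# Hypothetical miniature AST standing in for the codec's PickleValue.
@dataclass
class Str:
    value: str


@dataclass
class Int:
    value: int


@dataclass
class Seq:
    items: list


@dataclass
class Map:
    pairs: list  # list of (str key, node) tuples


def write_json(node, out):
    """Append JSON tokens for `node` to `out`, a list of string fragments."""
    if isinstance(node, Str):
        out.append(json.dumps(node.value))  # reuse stdlib string escaping
    elif isinstance(node, Int):
        out.append(str(node.value))
    elif isinstance(node, Seq):
        out.append("[")
        for i, item in enumerate(node.items):
            if i:
                out.append(",")
            write_json(item, out)
        out.append("]")
    elif isinstance(node, Map):
        out.append("{")
        for i, (key, value) in enumerate(node.pairs):
            if i:
                out.append(",")
            out.append(json.dumps(key))
            out.append(":")
            write_json(value, out)
        out.append("}")
    else:
        raise TypeError(f"unsupported node: {node!r}")


def to_json(node):
    """Serialize a node tree by writing tokens straight into one buffer."""
    out = []
    write_json(node, out)
    return "".join(out)


record = Map([("title", Str("Front page")), ("tags", Seq([Str("a"), Int(2)]))])
text = to_json(record)
```

Here `text` parses back with `json.loads` to the same structure a dict-then-`json.dumps()` path would produce; the difference is that no intermediate container is allocated per node.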
 
 ---
 
@@ -182,9 +186,9 @@ multi-threaded throughput in Zope/Plone deployments.
 The sweet spot is typical ZODB objects (5-50 keys, mixed types, datetime
 fields, persistent refs):
 
-- **Decode:** 1.5-2.0x faster on synthetic, near parity on real-world data
-- **Encode:** 2-10x faster on synthetic, 3-5x faster on real-world data
-- **PG path:** 1.3x faster end-to-end with GIL-free throughput
+- **Decode:** 1.1-2.3x faster on synthetic, near parity on real-world data
+- **Encode:** 1.7-9.2x faster on synthetic, 3-5x faster on real-world data
+- **PG path:** 1.3-3.3x faster end-to-end with GIL-free throughput
 
 Decode overhead comes from the two-pass conversion plus type transformation.
 On string-dominated payloads this matters more; on metadata-rich records with
@@ -215,49 +219,33 @@ mixed types (the typical ZODB case) the codec is competitive or faster.
 - Thread-local buffer reuse (retains capacity across encode calls)
 - `reserve()` calls before multi-part writes (eliminates mid-write reallocations)
 - Direct i64 LONG1 encoding (eliminates BigInt heap allocation)
+- Thread-local class pickle cache per (module, name) pair (single memcpy
+replaces 7 opcode writes for ~99.6% of records)
 - `#[inline]` on `write_u8`, `write_bytes`, `encode_int`
 
 **Both paths:**
 - Interned marker strings (`pyo3::intern!` for `@t`, `@cls`, `@s`, etc.)
 - Pre-collected PyList (`PyList::new` vs append loop)
 - Thin LTO + single codegen unit (free 6-9% improvement)
 - Profile-guided optimization (PGO) with real FileStorage + synthetic data
-- Direct pickle → JSON string path for PG storage (GIL released)
+- Direct PickleValue → JSON string writer (`json_writer.rs`) for PG storage,
+eliminating all `serde_json::Value` intermediate allocations (GIL released)
+- Thread-local JSON writer buffer reuse (retains capacity across decode calls)
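
To make the class pickle cache item in the list above more concrete (this note is not part of the commit), here is a minimal Python sketch of the caching pattern. It is an assumption-laden illustration, not the Rust implementation: the function names are hypothetical, and it emits only the single `GLOBAL` opcode for a class reference rather than the seven opcodes the changelog entry mentions. What it shows is the shape of the optimization: build the byte sequence once per (module, name) pair per thread, then append the cached bytes in one bulk copy.

```python
import pickle
import threading

_local = threading.local()  # one cache per thread, so no locking is needed


def _build_class_bytes(module, name):
    """Slow path: assemble the GLOBAL opcode for a class reference."""
    return pickle.GLOBAL + module.encode() + b"\n" + name.encode() + b"\n"


def class_bytes(module, name):
    """Return cached bytes for (module, name), building them on first use."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    key = (module, name)
    encoded = cache.get(key)
    if encoded is None:
        encoded = cache[key] = _build_class_bytes(module, name)
    return encoded


def write_class_ref(buf, module, name):
    """Append the class reference to the bytearray `buf` in one bulk copy."""
    buf.extend(class_bytes(module, name))


out = bytearray()
write_class_ref(out, "collections", "OrderedDict")
out += pickle.STOP  # terminate the stream so it is a loadable pickle
```

Because most records in a database share a small set of classes, nearly every call after the first hits the cache, which is consistent with the ~99.6% figure quoted in the list.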
 
 ---
 
 ## Running benchmarks
 
+All numbers in this document are from PGO builds. Always use PGO for
+benchmarking — it adds 5-15% and reflects production performance.
+
 ```bash
 cd sources/zodb-json-codec
 
-# Build release first (important!)
-maturin develop --release
-
-# Synthetic micro-benchmarks
-python benchmarks/bench.py synthetic --iterations 1000
-
-# Generate a reproducible benchmark FileStorage (requires ZODB + BTrees)
-python benchmarks/bench.py generate
-
-# Scan the generated (or any) FileStorage
-python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
-
-# PG decode path comparison (dict vs JSON string)
-python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs
-
-# Both synthetic + filestorage, with JSON export
-python benchmarks/bench.py all --filestorage benchmarks/bench_data/Data.fs --output results.json
-```
+# 0. Decompress benchmark data (once — Data.fs is gitignored, only .gz is tracked)
+gunzip -k benchmarks/bench_data/Data.fs.gz
 
-## PGO build (optional, adds 5-15%)
-
-Profile-guided optimization uses real workload data to optimize branch
-prediction and code layout. The release CI builds include PGO for
-Linux x86_64 wheels.
-
-```bash
-# 1. Install LLVM tools
+# 1. Install LLVM tools (once)
 rustup component add llvm-tools
 
 # 2. Instrumented build
@@ -266,11 +254,23 @@ RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" maturin develop --release
 # 3. Generate profiles — use BOTH real data and synthetic for best coverage
 python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
 python benchmarks/bench.py synthetic --iterations 2000
+python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs --iterations 500
 
 # 4. Merge profiles
 LLVM_PROFDATA=$(find ~/.rustup -name llvm-profdata | head -1)
 $LLVM_PROFDATA merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/*.profraw
 
 # 5. Optimized build
 RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" maturin develop --release
+
+# 6. Run benchmarks
+python benchmarks/bench.py synthetic --iterations 5000
+python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
+python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs
+
+# Generate a reproducible benchmark FileStorage (requires ZODB + BTrees)
+python benchmarks/bench.py generate
+
+# Both synthetic + filestorage, with JSON export
+python benchmarks/bench.py all --filestorage benchmarks/bench_data/Data.fs --output results.json
 ```
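
For readers who want to see how figures like the mean, median, P95 and "Nx faster" columns in BENCHMARKS.md could be derived (this note is not part of the commit), here is a small Python sketch. It is not the code in `benchmarks/bench.py`, which this diff does not show; `measure`, `summarize` and `ratio_label` are hypothetical names, the percentile uses a simple index-based rule chosen for brevity, and the defaults only mirror the "5000 iterations, 100 warmup" header.

```python
import pickle
import statistics
import time


def measure(func, arg, iterations=5000, warmup=100):
    """Return per-call timings in microseconds after a warmup phase."""
    for _ in range(warmup):
        func(arg)
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        func(arg)
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    return samples


def summarize(samples):
    """Mean, median and 95th percentile of a list of timings."""
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }


def ratio_label(baseline_us, candidate_us):
    """Format a ratio in the style of the tables, e.g. '1.9x faster'."""
    if candidate_us <= baseline_us:
        return f"{baseline_us / candidate_us:.1f}x faster"
    return f"{candidate_us / baseline_us:.1f}x slower"


payload = pickle.dumps({"title": "Front page", "count": 3})
stats = summarize(measure(pickle.loads, payload, iterations=200, warmup=10))
```

Timing each call individually keeps the median and P95 meaningful; timing the whole loop once would only yield a mean.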

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -23,3 +23,4 @@ serde_json = "1"
 base64 = "0.22"
 hex = "0.4"
 num-bigint = "0.4"
+ryu = "1"
