Commit aa3308c

aepfli and claude authored
docs: add standardized benchmark matrix for cross-language comparison (#67)
## Summary

- Adds `BENCHMARKS.md` defining a consistent benchmark specification across Rust, Java, and Python
- Covers evaluation scenarios (E1-E11), custom operators (O1-O6), state management (S1-S5), concurrency (C1-C6), and old-vs-new comparison (X1-X3)
- Standardizes context shapes and flag definitions for direct cross-language comparison

## Context

The benchmark PRs (#64, #65, #66) implement subsets of this matrix. This document serves as the reference spec so all three languages converge on the same scenarios and can be compared meaningfully.

## Test plan

- [ ] Review benchmark IDs and scenarios for completeness
- [ ] Verify context/flag definitions match what implementations use

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 57e3855 commit aa3308c

3 files changed

Lines changed: 2287 additions & 2 deletions

.gitignore

Lines changed: 2 additions & 2 deletions
@@ -3,8 +3,8 @@
 /dist/
 /python/tests/__pycache__
 
-# Cargo lock file (library)
-Cargo.lock
+# Cargo.lock is tracked to ensure reproducible WASM builds.
+# The WASM binary's import names include hashes that must match Java host functions.
 
 # IDE and editor files
 .idea/

BENCHMARKS.md

Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
# Benchmark Standard

This document defines the standardized benchmark matrix for flagd-evaluator across all language implementations (Rust, Java, Python). All benchmarks should follow this matrix to enable direct cross-language performance comparison.

## Evaluation Scenarios

Every language implementation should benchmark the following scenarios. The combination of **targeting complexity** and **context size** isolates where time is spent (serialization vs rule evaluation).

### Core Evaluation Matrix

| ID | Scenario | Targeting | Context Size | What it measures |
|----|----------|-----------|--------------|------------------|
| E1 | Simple flag, empty context | None (STATIC) | 0 attrs | Baseline: flag lookup + result serialization |
| E2 | Simple flag, small context | None (STATIC) | 5 attrs | Serialization overhead for typical call |
| E3 | Simple flag, large context | None (STATIC) | 100+ attrs | Serialization cost dominance |
| E4 | Simple targeting, small context | Single `==` condition | 5 attrs | Minimal rule evaluation cost |
| E5 | Simple targeting, large context | Single `==` condition | 100+ attrs | Serialization + simple rule |
| E6 | Complex targeting, small context | Nested `and`/`or`, 3+ conditions | 5 attrs | Rule evaluation cost dominance |
| E7 | Complex targeting, large context | Nested `and`/`or`, 3+ conditions | 100+ attrs | Worst case: heavy serialization + complex rules |
| E8 | Targeting match | Rule that matches | 5 attrs | Match code path |
| E9 | Targeting no-match | Rule that doesn't match (default) | 5 attrs | Default/fallback code path |
| E10 | Disabled flag | `state: DISABLED` | 0 attrs | Early exit performance |
| E11 | Missing flag | Non-existent key | 0 attrs | Error path performance |

### Custom Operator Benchmarks

| ID | Scenario | What it measures |
|----|----------|------------------|
| O1 | Fractional (2 buckets) | Typical A/B test bucketing |
| O2 | Fractional (8 buckets) | Multi-variant experiment |
| O3 | Semver equality (`=`) | Version string parsing + comparison |
| O4 | Semver range (`^`, `~`) | Range matching logic |
| O5 | `starts_with` | String prefix matching |
| O6 | `ends_with` | String suffix matching |

### State Management Benchmarks

| ID | Scenario | What it measures |
|----|----------|------------------|
| S1 | Update state (5 flags) | Small config parse + validate |
| S2 | Update state (50 flags) | Medium config scaling |
| S3 | Update state (200 flags) | Large config scaling |
| S4 | Update state (no change) | Change detection overhead |
| S5 | Update state (1 flag changed in 100) | Incremental update efficiency |
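
The S1-S5 payloads could be generated rather than hand-written, so that every language feeds the same configs to its update call. The following is a minimal Python sketch, not part of the committed spec. It assumes flags sit under a top-level `"flags"` object keyed by flag name; the `flag-<i>` key scheme and the helper names `make_config` and `with_one_flag_changed` are hypothetical.

```python
import copy
import json

# Flag shape copied from the "Simple Boolean Flag" definition in this document.
SIMPLE_FLAG = {
    "state": "ENABLED",
    "defaultVariant": "on",
    "variants": {"on": True, "off": False},
}


def make_config(n_flags: int) -> dict:
    """Build a config holding n_flags identical simple flags.

    Assumption: flags live under a top-level "flags" object, and the
    "flag-<i>" key naming is made up for this sketch.
    """
    return {"flags": {f"flag-{i}": copy.deepcopy(SIMPLE_FLAG) for i in range(n_flags)}}


def with_one_flag_changed(config: dict, key: str) -> dict:
    """Return a copy of config with one flag's default variant flipped (S5)."""
    changed = copy.deepcopy(config)
    flag = changed["flags"][key]
    flag["defaultVariant"] = "off" if flag["defaultVariant"] == "on" else "on"
    return changed


# S1-S3: serialize once up front so the benchmark loop measures the update
# call itself rather than JSON encoding inside the harness.
PAYLOADS = {n: json.dumps(make_config(n)) for n in (5, 50, 200)}

# S4 would re-send an unchanged payload; S5 sends one that differs in one flag.
BASE_100 = make_config(100)
S5_UPDATE = with_one_flag_changed(BASE_100, "flag-42")
```

Serializing the payloads before the timed region keeps S1-S3 focused on parse and validate cost, which is what the table says they measure.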

### Concurrency Benchmarks

| ID | Scenario | Threads | What it measures |
|----|----------|---------|------------------|
| C1 | Simple flag, single thread | 1 | Baseline (no contention) |
| C2 | Simple flag, 4 threads | 4 | Standard concurrent load |
| C3 | Simple flag, 8 threads | 8 | High contention |
| C4 | Targeting flag, 4 threads | 4 | Concurrent rule evaluation |
| C5 | Mixed workload, 4 threads | 4 | Realistic production mix |
| C6 | Read/write contention | 4 | `evaluate` concurrent with `update_state` |
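
To illustrate the shape of C1-C3, here is a rough Python sketch of a fixed-thread-count harness. It is an assumption-laden illustration, not the harness used by any of the implementations: `run_concurrent` is a hypothetical helper, and `fake_evaluate` is a stand-in for whatever evaluation call a real binding exposes.

```python
import threading
import time


def run_concurrent(evaluate, n_threads: int, iterations: int) -> dict:
    """Call evaluate() `iterations` times on each of `n_threads` threads."""
    # The extra barrier slot is for the main thread, so timing starts only
    # once every worker is ready to run.
    start_gate = threading.Barrier(n_threads + 1)
    completed = [0] * n_threads  # one counter per thread, no shared writes

    def worker(slot: int) -> None:
        start_gate.wait()
        for _ in range(iterations):
            evaluate()
            completed[slot] += 1

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for thread in workers:
        thread.start()
    start_gate.wait()  # release all workers at the same moment
    started = time.perf_counter()
    for thread in workers:
        thread.join()
    elapsed = time.perf_counter() - started

    total_ops = sum(completed)
    ops_per_sec = total_ops / elapsed if elapsed > 0 else float("inf")
    return {"threads": n_threads, "ops": total_ops, "ops_per_sec": ops_per_sec}


def fake_evaluate() -> bool:
    """Stand-in workload; a real run would call the evaluator here."""
    return True


if __name__ == "__main__":
    for threads in (1, 4, 8):  # thread counts of C1, C2, C3
        print(run_concurrent(fake_evaluate, threads, 10_000))
```

Releasing all workers through a barrier keeps thread start-up cost out of the measured window, which matters most for the short-running simple-flag scenarios.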

### Comparison Benchmarks (language-specific)

| ID | Scenario | What it measures |
|----|----------|------------------|
| X1 | Old resolver vs new evaluator (simple) | Baseline improvement |
| X2 | Old resolver vs new evaluator (targeting) | Rule evaluation improvement |
| X3 | Old vs new under concurrency (4 threads) | Thread scaling improvement |

**Java**: Old = `json-logic-java` via `MinimalInProcessResolver`; New = WASM via Chicory
**Python**: Old = `json-logic-utils` (pure Python); New = PyO3 native bindings
**Rust**: N/A (Rust *is* the engine; compare `datalogic-rs` direct vs through evaluator)

## Context Definitions

To ensure comparability, use these standard context shapes:

### Empty Context
```json
{}
```

### Small Context (5 attributes)
```json
{
  "targetingKey": "user-123",
  "tier": "premium",
  "role": "admin",
  "region": "us-east",
  "score": 85
}
```

### Large Context (100+ attributes)
```json
{
  "targetingKey": "user-123",
  "tier": "premium",
  "role": "admin",
  "region": "us-east",
  "score": 85,
  "attr_0": "value-0",
  "attr_1": 42,
  "attr_2": true,
  ...
  "attr_99": "value-99"
}
```

Use deterministic generation (seeded random) so results are reproducible.
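
One way to read "deterministic generation (seeded random)" is sketched below in Python. The string/integer/boolean rotation is inferred from the three sample attributes shown above and is an assumption, as are the seed value, the value ranges, and the helper name `make_large_context`. The exact values in the JSON example (such as `42`) are not guaranteed to be reproduced.

```python
import random

# The five attributes of the small context defined in this document.
BASE_CONTEXT = {
    "targetingKey": "user-123",
    "tier": "premium",
    "role": "admin",
    "region": "us-east",
    "score": 85,
}


def make_large_context(extra_attrs: int = 100, seed: int = 42) -> dict:
    """Small context plus `extra_attrs` generated attributes.

    Assumption: attribute types rotate string, integer, boolean, matching
    attr_0, attr_1, attr_2 in the example. Seed and ranges are arbitrary.
    """
    rng = random.Random(seed)  # private RNG, leaves global random state alone
    context = dict(BASE_CONTEXT)
    for i in range(extra_attrs):
        kind = i % 3
        if kind == 0:
            context[f"attr_{i}"] = f"value-{i}"
        elif kind == 1:
            context[f"attr_{i}"] = rng.randint(0, 100)
        else:
            context[f"attr_{i}"] = rng.random() < 0.5
    return context
```

Using a private `random.Random(seed)` instance rather than the module-level functions means other code that touches the global generator cannot change the produced context between runs.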

## Flag Definitions

### Simple Boolean Flag (no targeting)
```json
{
  "state": "ENABLED",
  "defaultVariant": "on",
  "variants": { "on": true, "off": false }
}
```

### Simple Targeting Flag
```json
{
  "state": "ENABLED",
  "defaultVariant": "off",
  "variants": { "on": true, "off": false },
  "targeting": {
    "if": [{ "==": [{ "var": "tier" }, "premium"] }, "on", "off"]
  }
}
```

### Complex Targeting Flag
```json
{
  "state": "ENABLED",
  "defaultVariant": "basic",
  "variants": { "premium": "premium-tier", "standard": "standard-tier", "basic": "basic-tier" },
  "targeting": {
    "if": [
      { "and": [
        { "==": [{ "var": "tier" }, "premium"] },
        { ">": [{ "var": "score" }, 90] }
      ]},
      "premium",
      { "if": [
        { "or": [
          { "==": [{ "var": "tier" }, "standard"] },
          { ">": [{ "var": "score" }, 50] }
        ]},
        "standard",
        "basic"
      ]}
    ]
  }
}
```
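
To check which variant each standard context should produce, the targeting rules above can be traced with a tiny interpreter. The Python sketch below is for illustration only: it is not the flagd evaluator or `datalogic-rs`, it covers just the operators used in this document (`var`, `if`, `and`, `or`, `==`, `>`), and it simplifies JSON Logic semantics (strict Python equality, boolean results from `and`/`or`, three-argument `if` only). The function name `evaluate_rule` is hypothetical.

```python
def evaluate_rule(rule, context):
    """Evaluate the small JSON Logic subset used by the flags in this document."""
    if not isinstance(rule, dict):
        return rule  # literal: variant name, number, or boolean
    (op, args), = rule.items()  # each rule node carries exactly one operator
    if op == "var":
        return context.get(args)  # missing attributes resolve to None
    if op == "if":
        condition, then_branch, else_branch = args  # three-argument form only
        chosen = then_branch if evaluate_rule(condition, context) else else_branch
        return evaluate_rule(chosen, context)
    if op == "and":
        return all(evaluate_rule(arg, context) for arg in args)
    if op == "or":
        return any(evaluate_rule(arg, context) for arg in args)
    left, right = (evaluate_rule(arg, context) for arg in args)
    if op == "==":
        return left == right
    if op == ">":
        return left is not None and left > right  # missing value never matches
    raise ValueError(f"operator not covered by this sketch: {op}")


# Targeting rule of the Complex Targeting Flag, copied from above.
COMPLEX_TARGETING = {
    "if": [
        {"and": [{"==": [{"var": "tier"}, "premium"]}, {">": [{"var": "score"}, 90]}]},
        "premium",
        {"if": [
            {"or": [{"==": [{"var": "tier"}, "standard"]}, {">": [{"var": "score"}, 50]}]},
            "standard",
            "basic",
        ]},
    ]
}

SMALL_CONTEXT = {"targetingKey": "user-123", "tier": "premium", "role": "admin",
                 "region": "us-east", "score": 85}

print(evaluate_rule(COMPLEX_TARGETING, SMALL_CONTEXT))  # prints "standard"
```

With the standard small context the first branch does not match, because `score` is 85 and the rule requires a value above 90, so the flag resolves to `standard`. A context with a `score` above 90 would be needed to exercise the `premium` branch.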

## Running Benchmarks

### Rust
```bash
cargo bench                         # all suites
cargo bench --bench evaluation      # evaluation only
cargo bench -- --quick              # quick run
# HTML reports: target/criterion/
```

### Java
```bash
cd java
./mvnw clean package
java -jar target/benchmarks.jar                                    # all benchmarks
java -jar target/benchmarks.jar ConcurrentFlagEvaluatorBenchmark   # concurrent only
java -jar target/benchmarks.jar -prof gc                           # with GC profiling
```

### Python
```bash
cd python
uv sync --group dev && maturin develop
pytest benchmarks/ --benchmark-only -v                              # all benchmarks
pytest benchmarks/ --benchmark-only --benchmark-json=results.json   # export
```

## Reporting Results

When reporting benchmark results, always include:

1. **Hardware**: CPU model, core count, RAM
2. **OS**: Distribution and kernel version
3. **Runtime versions**: `rustc --version`, `java --version`, `python --version`
4. **Metrics per scenario**:
   - Throughput (ops/sec)
   - Latency (mean, p50, p99)
   - Allocation rate (if available)
5. **Comparison table** when measuring old vs new

Results should be committed to language-specific README files, not to this document.
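
The per-scenario metrics listed above can be derived from raw latency samples in a few lines. The Python sketch below is one possible reading, not a prescribed method: it uses the nearest-rank percentile definition, and it derives throughput from the sum of the samples, which only holds for single-threaded, back-to-back calls. The helper names `percentile` and `summarize` are hypothetical.

```python
import statistics


def percentile(sorted_samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, (p * len(sorted_samples) + 99) // 100)
    return sorted_samples[rank - 1]


def summarize(latencies_ns) -> dict:
    """Summarize per-call latencies given in nanoseconds.

    Assumption: samples come from one thread issuing calls back to back,
    so total elapsed time is approximated by the sum of the samples.
    """
    ordered = sorted(latencies_ns)
    total_seconds = sum(ordered) / 1e9
    return {
        "ops_per_sec": len(ordered) / total_seconds,
        "mean_ns": statistics.fmean(ordered),
        "p50_ns": percentile(ordered, 50),
        "p99_ns": percentile(ordered, 99),
    }


if __name__ == "__main__":
    print(summarize(range(1, 101)))
```

Benchmark frameworks often compute percentiles with interpolation instead of nearest rank, so whichever definition is used should be stated alongside the reported numbers.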
