| 1 | +# Benchmark Standard |
| 2 | + |
| 3 | +This document defines the standardized benchmark matrix for flagd-evaluator across all language implementations (Rust, Java, Python). All benchmarks should follow this matrix to enable direct cross-language performance comparison. |
| 4 | + |
| 5 | +## Evaluation Scenarios |
| 6 | + |
| 7 | +Every language implementation should benchmark the following scenarios. The combination of **targeting complexity** and **context size** isolates where time is spent (serialization vs rule evaluation). |
| 8 | + |
| 9 | +### Core Evaluation Matrix |
| 10 | + |
| 11 | +| ID | Scenario | Targeting | Context Size | What it measures | |
| 12 | +|----|----------|-----------|--------------|------------------| |
| 13 | +| E1 | Simple flag, empty context | None (STATIC) | 0 attrs | Baseline: flag lookup + result serialization | |
| 14 | +| E2 | Simple flag, small context | None (STATIC) | 5 attrs | Serialization overhead for typical call | |
| 15 | +| E3 | Simple flag, large context | None (STATIC) | 100+ attrs | Cost dominated by serialization |
| 16 | +| E4 | Simple targeting, small context | Single `==` condition | 5 attrs | Minimal rule evaluation cost |
| 17 | +| E5 | Simple targeting, large context | Single `==` condition | 100+ attrs | Serialization + simple rule |
| 18 | +| E6 | Complex targeting, small context | Nested `and`/`or`, 3+ conditions | 5 attrs | Cost dominated by rule evaluation |
| 19 | +| E7 | Complex targeting, large context | Nested `and`/`or`, 3+ conditions | 100+ attrs | Worst case: heavy serialization + complex rules |
| 20 | +| E8 | Targeting match | Rule that matches | 5 attrs | Match code path | |
| 21 | +| E9 | Targeting no-match | Rule that doesn't match (default) | 5 attrs | Default/fallback code path | |
| 22 | +| E10 | Disabled flag | `state: DISABLED` | 0 attrs | Early exit performance | |
| 23 | +| E11 | Missing flag | Non-existent key | 0 attrs | Error path performance | |
| 24 | + |
| 25 | +### Custom Operator Benchmarks |
| 26 | + |
| 27 | +| ID | Scenario | What it measures | |
| 28 | +|----|----------|------------------| |
| 29 | +| O1 | Fractional (2 buckets) | Typical A/B test bucketing | |
| 30 | +| O2 | Fractional (8 buckets) | Multi-variant experiment | |
| 31 | +| O3 | Semver equality (`=`) | Version string parsing + comparison | |
| 32 | +| O4 | Semver range (`^`, `~`) | Range matching logic | |
| 33 | +| O5 | `starts_with` | String prefix matching | |
| 34 | +| O6 | `ends_with` | String suffix matching | |
| 35 | + |
| 36 | +### State Management Benchmarks |
| 37 | + |
| 38 | +| ID | Scenario | What it measures | |
| 39 | +|----|----------|------------------| |
| 40 | +| S1 | Update state (5 flags) | Small config parse + validate | |
| 41 | +| S2 | Update state (50 flags) | Medium config scaling | |
| 42 | +| S3 | Update state (200 flags) | Large config scaling | |
| 43 | +| S4 | Update state (no change) | Change detection overhead | |
| 44 | +| S5 | Update state (1 flag changed in 100) | Incremental update efficiency | |
| 45 | + |
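The S1 to S5 inputs can be produced by a small generator. The sketch below is one way to do it in Python, assuming the evaluator's `update_state` accepts a JSON document with a top-level `flags` object and reusing the simple boolean flag from the Flag Definitions section. The helper names (`build_config`, `with_one_flag_changed`, `to_payload`) and the `flag-{i}` key scheme are hypothetical.

```python
import copy
import json

# Simple boolean flag, copied from the "Flag Definitions" section.
SIMPLE_FLAG = {
    "state": "ENABLED",
    "defaultVariant": "on",
    "variants": {"on": True, "off": False},
}


def build_config(flag_count):
    """Return a config with `flag_count` identical simple flags (S1-S3)."""
    return {
        "flags": {f"flag-{i}": copy.deepcopy(SIMPLE_FLAG) for i in range(flag_count)}
    }


def with_one_flag_changed(config, index=0):
    """Return a copy of `config` with one flag's default variant flipped (S5)."""
    changed = copy.deepcopy(config)
    flag = changed["flags"][f"flag-{index}"]
    flag["defaultVariant"] = "off" if flag["defaultVariant"] == "on" else "on"
    return changed


def to_payload(config):
    """Serialize with sorted keys so the same config yields the same bytes (S4)."""
    return json.dumps(config, sort_keys=True)
```

For S4, passing the same `to_payload(...)` string twice exercises the no-change path; for S5, `with_one_flag_changed(build_config(100))` gives the second payload.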
| 46 | +### Concurrency Benchmarks |
| 47 | + |
| 48 | +| ID | Scenario | Threads | What it measures | |
| 49 | +|----|----------|---------|------------------| |
| 50 | +| C1 | Simple flag, single thread | 1 | Baseline (no contention) | |
| 51 | +| C2 | Simple flag, 4 threads | 4 | Standard concurrent load | |
| 52 | +| C3 | Simple flag, 8 threads | 8 | High contention | |
| 53 | +| C4 | Targeting flag, 4 threads | 4 | Concurrent rule evaluation | |
| 54 | +| C5 | Mixed workload, 4 threads | 4 | Realistic production mix | |
| 55 | +| C6 | Read/write contention | 4 | `evaluate` concurrent with `update_state` | |
| 56 | + |
| 57 | +### Comparison Benchmarks (language-specific) |
| 58 | + |
| 59 | +| ID | Scenario | What it measures | |
| 60 | +|----|----------|------------------| |
| 61 | +| X1 | Old resolver vs new evaluator (simple) | Baseline improvement | |
| 62 | +| X2 | Old resolver vs new evaluator (targeting) | Rule evaluation improvement | |
| 63 | +| X3 | Old vs new under concurrency (4 threads) | Thread scaling improvement | |
| 64 | + |
| 65 | +- **Java**: Old = `json-logic-java` via `MinimalInProcessResolver`; New = WASM via Chicory |
| 66 | +- **Python**: Old = `json-logic-utils` (pure Python); New = PyO3 native bindings |
| 67 | +- **Rust**: N/A (Rust *is* the engine); instead compare direct `datalogic-rs` calls against calls through the evaluator |
| 68 | + |
| 69 | +## Context Definitions |
| 70 | + |
| 71 | +To ensure comparability, use these standard context shapes: |
| 72 | + |
| 73 | +### Empty Context |
| 74 | +```json |
| 75 | +{} |
| 76 | +``` |
| 77 | + |
| 78 | +### Small Context (5 attributes) |
| 79 | +```json |
| 80 | +{ |
| 81 | + "targetingKey": "user-123", |
| 82 | + "tier": "premium", |
| 83 | + "role": "admin", |
| 84 | + "region": "us-east", |
| 85 | + "score": 85 |
| 86 | +} |
| 87 | +``` |
| 88 | + |
| 89 | +### Large Context (100+ attributes) |
| 90 | +```json |
| 91 | +{ |
| 92 | + "targetingKey": "user-123", |
| 93 | + "tier": "premium", |
| 94 | + "role": "admin", |
| 95 | + "region": "us-east", |
| 96 | + "score": 85, |
| 97 | + "attr_0": "value-0", |
| 98 | + "attr_1": 42, |
| 99 | + "attr_2": true, |
| 100 | + ... |
| 101 | + "attr_99": "value-99" |
| 102 | +} |
| 103 | +``` |
| 104 | + |
| 105 | +Generate the `attr_*` filler attributes deterministically (for example, with a fixed random seed) so results are reproducible across runs. |
| 106 | + |
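A minimal Python sketch of such a generator follows. The function name `build_large_context`, the default seed, the integer range, and the string/int/bool cycle by index are assumptions of this sketch; the document only fixes the five base attributes and the `attr_0` to `attr_99` naming.

```python
import random

# Base attributes copied from the "Small Context" definition.
SMALL_CONTEXT = {
    "targetingKey": "user-123",
    "tier": "premium",
    "role": "admin",
    "region": "us-east",
    "score": 85,
}


def build_large_context(extra_attrs=100, seed=42):
    """Return the small context plus `extra_attrs` filler attributes.

    Value types cycle string -> int -> bool by index. Only the int and
    bool values are drawn from the seeded generator.
    """
    rng = random.Random(seed)  # local generator; global random state is untouched
    context = dict(SMALL_CONTEXT)
    for i in range(extra_attrs):
        kind = i % 3
        if kind == 0:
            context[f"attr_{i}"] = f"value-{i}"
        elif kind == 1:
            context[f"attr_{i}"] = rng.randint(0, 1000)
        else:
            context[f"attr_{i}"] = rng.random() < 0.5
    return context
```

A fixed seed only reproduces values within one language's random number generator. If the Rust, Java, and Python benchmarks need byte-identical contexts, one option is to generate the context once and check it in as a shared JSON fixture.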
| 107 | +## Flag Definitions |
| 108 | + |
| 109 | +### Simple Boolean Flag (no targeting) |
| 110 | +```json |
| 111 | +{ |
| 112 | + "state": "ENABLED", |
| 113 | + "defaultVariant": "on", |
| 114 | + "variants": { "on": true, "off": false } |
| 115 | +} |
| 116 | +``` |
| 117 | + |
| 118 | +### Simple Targeting Flag |
| 119 | +```json |
| 120 | +{ |
| 121 | + "state": "ENABLED", |
| 122 | + "defaultVariant": "off", |
| 123 | + "variants": { "on": true, "off": false }, |
| 124 | + "targeting": { |
| 125 | + "if": [{ "==": [{ "var": "tier" }, "premium"] }, "on", "off"] |
| 126 | + } |
| 127 | +} |
| 128 | +``` |
| 129 | + |
| 130 | +### Complex Targeting Flag |
| 131 | +```json |
| 132 | +{ |
| 133 | + "state": "ENABLED", |
| 134 | + "defaultVariant": "basic", |
| 135 | + "variants": { "premium": "premium-tier", "standard": "standard-tier", "basic": "basic-tier" }, |
| 136 | + "targeting": { |
| 137 | + "if": [ |
| 138 | + { "and": [ |
| 139 | + { "==": [{ "var": "tier" }, "premium"] }, |
| 140 | + { ">": [{ "var": "score" }, 90] } |
| 141 | + ]}, |
| 142 | + "premium", |
| 143 | + { "if": [ |
| 144 | + { "or": [ |
| 145 | + { "==": [{ "var": "tier" }, "standard"] }, |
| 146 | + { ">": [{ "var": "score" }, 50] } |
| 147 | + ]}, |
| 148 | + "standard", |
| 149 | + "basic" |
| 150 | + ]} |
| 151 | + ] |
| 152 | + } |
| 153 | +} |
| 154 | +``` |
| 155 | + |
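It helps to know which branch each standard context takes before benchmarking. The sketch below is a toy interpreter for only the operators used in the flags above (`if`, `and`, `or`, `==`, `>`, `var`); it is not the flagd-evaluator or `datalogic-rs`, and it uses Python's `==` rather than JSON Logic's loose equality. The function names are hypothetical.

```python
def evaluate_rule(rule, context):
    """Evaluate the small JSON Logic subset used by the benchmark flags."""
    if not isinstance(rule, dict):
        return rule  # literal: string, number, or bool
    (op, args), = rule.items()  # each rule node has exactly one operator key
    if op == "var":
        return context.get(args)  # only the plain string form of "var"
    if op == "if":
        # Arguments are [cond, then, cond, then, ..., else].
        i = 0
        while i + 1 < len(args):
            if evaluate_rule(args[i], context):
                return evaluate_rule(args[i + 1], context)
            i += 2
        return evaluate_rule(args[i], context) if i < len(args) else None
    values = [evaluate_rule(arg, context) for arg in args]  # eager, no short-circuit
    if op == "==":
        return values[0] == values[1]
    if op == ">":
        left, right = values
        return left is not None and right is not None and left > right
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    raise ValueError(f"unsupported operator: {op}")


def resolve(flag, context):
    """Return (variant name, variant value), falling back to defaultVariant."""
    targeting = flag.get("targeting")
    variant = evaluate_rule(targeting, context) if targeting else None
    if variant is None:
        variant = flag["defaultVariant"]
    return variant, flag["variants"][variant]


COMPLEX_FLAG = {
    "state": "ENABLED",
    "defaultVariant": "basic",
    "variants": {"premium": "premium-tier", "standard": "standard-tier", "basic": "basic-tier"},
    "targeting": {"if": [
        {"and": [{"==": [{"var": "tier"}, "premium"]}, {">": [{"var": "score"}, 90]}]},
        "premium",
        {"if": [
            {"or": [{"==": [{"var": "tier"}, "standard"]}, {">": [{"var": "score"}, 50]}]},
            "standard",
            "basic",
        ]},
    ]},
}

SMALL_CONTEXT = {"targetingKey": "user-123", "tier": "premium", "role": "admin",
                 "region": "us-east", "score": 85}

print(resolve(COMPLEX_FLAG, SMALL_CONTEXT))  # ('standard', 'standard-tier')
```

With the standard small context, the complex flag resolves to `standard`: `tier` is `premium`, but `score` is 85, which fails the `> 90` check, so evaluation falls through to the nested `or`. E6 and E7 therefore exercise both levels of the rule.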
| 156 | +## Running Benchmarks |
| 157 | + |
| 158 | +### Rust |
| 159 | +```bash |
| 160 | +cargo bench # all suites |
| 161 | +cargo bench --bench evaluation # evaluation only |
| 162 | +cargo bench -- --quick # quick run |
| 163 | +# HTML reports: target/criterion/ |
| 164 | +``` |
| 165 | + |
| 166 | +### Java |
| 167 | +```bash |
| 168 | +cd java |
| 169 | +./mvnw clean package |
| 170 | +java -jar target/benchmarks.jar # all benchmarks |
| 171 | +java -jar target/benchmarks.jar ConcurrentFlagEvaluatorBenchmark # concurrent only |
| 172 | +java -jar target/benchmarks.jar -prof gc # with GC profiling |
| 173 | +``` |
| 174 | + |
| 175 | +### Python |
| 176 | +```bash |
| 177 | +cd python |
| 178 | +uv sync --group dev && maturin develop |
| 179 | +pytest benchmarks/ --benchmark-only -v # all benchmarks |
| 180 | +pytest benchmarks/ --benchmark-only --benchmark-json=results.json # export |
| 181 | +``` |
| 182 | + |
| 183 | +## Reporting Results |
| 184 | + |
| 185 | +When reporting benchmark results, always include: |
| 186 | + |
| 187 | +1. **Hardware**: CPU model, core count, RAM |
| 188 | +2. **OS**: Distribution and kernel version |
| 189 | +3. **Runtime versions**: `rustc --version`, `java --version`, `python --version` |
| 190 | +4. **Metrics per scenario**: |
| 191 | + - Throughput (ops/sec) |
| 192 | + - Latency (mean, p50, p99) |
| 193 | + - Allocation rate (if available) |
| 194 | +5. **Comparison table** when measuring old vs new |
| 195 | + |
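When a harness exposes raw per-call timings, the throughput and latency figures listed above can be derived as in the following Python sketch. It assumes single-threaded samples in nanoseconds and uses the nearest-rank percentile definition; `summarize` and `percentile` are hypothetical helper names, and Criterion, JMH, and pytest-benchmark each compute their own statistics, which may differ slightly.

```python
import statistics


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an ascending list; `pct` is an integer 1-100."""
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, (pct * len(sorted_samples) + 99) // 100)
    return sorted_samples[rank - 1]


def summarize(latencies_ns):
    """Summarize per-call latencies (nanoseconds) from one single-threaded run."""
    if not latencies_ns:
        raise ValueError("no samples")
    ordered = sorted(latencies_ns)
    total_seconds = sum(ordered) / 1e9
    return {
        "ops_per_sec": len(ordered) / total_seconds,
        "mean_ns": statistics.fmean(ordered),
        "p50_ns": percentile(ordered, 50),
        "p99_ns": percentile(ordered, 99),
    }
```

For the concurrency scenarios, throughput should be taken from wall-clock time for the whole run rather than from the sum of per-call latencies, since calls overlap across threads.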
| 196 | +Results should be committed to language-specific README files, not to this document. |