diff --git a/doc/performance_audit_modsecurity_2026-04-03.en.md b/doc/performance_audit_modsecurity_2026-04-03.en.md
new file mode 100644
index 0000000000..adc2620e47
--- /dev/null
+++ b/doc/performance_audit_modsecurity_2026-04-03.en.md
@@ -0,0 +1,217 @@
+# OWASP ModSecurity v3 – Technical Performance Audit (Code-focused)
+
+This document summarizes an in-depth, code-based performance analysis of libmodsecurity, focusing on hot paths, rule engine behavior, memory characteristics, and scalability.
+
+## Scope
+
+Analyzed core paths:
+- `Transaction` request/response lifecycle
+- `RulesSet` + `RuleWithOperator` evaluation
+- Request body processors (URLENCODED/JSON/XML/MULTIPART)
+- Regex and pattern-matching operators (`@rx`, `@pm`)
+- Collection backends and locking
+- Audit logging and serialization
+
+## Key Findings (Summary)
+
+1. **Dominant CPU path**: `RulesSet::evaluate()` → `RuleWithOperator::evaluate()` → transformations → operator (`@rx`, `@pm`, ...).
+2. **Regex is the primary cost driver** in CRS-heavy rule sets; match limits mitigate impact but do not eliminate expensive patterns.
+3. **Significant string/copy overhead** in request-body and logging paths (`stringstream::str()`, header concatenation, JSON/audit serialization).
+4. **Multipart parsing** is byte-wise and state-machine heavy, with high branching cost.
+5. **Concurrency**: mostly lock-free per transaction; shared collections use `shared_mutex` and may contend in write-heavy workloads.
+
+## Top Optimization Opportunities
+
+- Replace `stringstream`-centric body handling with chunk-/span-based buffering.
+- Compute `FULL_REQUEST` lazily or behind feature gates.
+- Add prefilters (literal guards) before expensive regex operators.
+- Cache/fuse frequent transformation pipelines.
+- Reduce and/or async-offload audit logging where possible.
+
+## Performance Model
+
+### 1) Request Cost Model
+
+We model total CPU cost per request as:
+
+\[
+C_{req} = C_{conn} + C_{parse} + C_{rules} + C_{log} + C_{sync}
+\]
+
+where:
+
+- \(C_{conn}\): connection/context cost (small, near-constant)
+- \(C_{parse}\): URI/header/body parsing
+- \(C_{rules}\): rule evaluation (dominant)
+- \(C_{log}\): audit/debug serialization
+- \(C_{sync}\): locking/contention cost on shared collections
+
+For the rule engine:
+
+\[
+C_{rules} = \sum_{r=1}^{R} \left( V_r \cdot \left(\sum_{t=1}^{T_r} C_{trans}(t) + C_{op}(r)\right) + C_{actions}(r) \right)
+\]
+
+In aggregated form:
+
+\[
+C_{rules} \approx R \cdot V \cdot (T \cdot \bar c_{trans} + \bar c_{op}) + R \cdot \bar c_{actions}
+\]
+
+with:
+- \(R\): number of active rules per phase
+- \(V\): average number of target values per rule
+- \(T\): average number of transformations per rule
+- \(\bar c_{trans}\), \(\bar c_{op}\), \(\bar c_{actions}\): average cost of one transformation, one operator call, and one rule's actions
+
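+This makes the multiplicative structure explicit: cost grows linearly in each of \(R\), \(V\), and \(T\), while \(\bar c_{op}\) can escalate nonlinearly for regex operators.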
+
+### 2) Regex Cost Model
+
+For regex-heavy workloads:
+
+\[
+\bar c_{op} = p_{rx}\cdot c_{rx} + (1-p_{rx})\cdot c_{other}
+\]
+
+where \(p_{rx}\) is the ratio of regex-based rules.
+
+- **Best case (JIT + early fail/match):**
+  \[
+  c_{rx}^{best} = O(n)
+  \]
+- **Worst case (catastrophic backtracking):**
+  \[
+  c_{rx}^{worst} = O(e^n)
+  \]
+  where \(n\) is the input length. In practice, match limits bound the cost, but a call stays expensive until the limit aborts it.
+
+### 3) Big-O by subsystem
+
+- **Rule evaluation (overall):**
+  \[
+  O\left(\sum_{r=1}^{R} V_r\cdot(T_r + O_r)\right)
+  \]
+  typically approximated as \(O(R\cdot V\cdot(T+O))\), where \(O_r\) is the operator cost of rule \(r\) and \(O\) its average.
+- **Parsing:**
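+  - URI/headers/cookies: \(O(H + Q)\), with \(H\) the size of the headers and \(Q\) the size of the query string
+  - URL-encoded body: \(O(B)\), with \(B\) the body size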
+  - Multipart: \(O(B\cdot\kappa)\), with \(\kappa\) as state/boundary-check overhead
+- **Transformation pipeline:**
+  \[
+  O(R\cdot V\cdot T\cdot L)
+  \]
+  where \(L\) is average target string length.
+
+---
+
+## Measurement Strategy
+
+### 1) Reproducible experiment design
+
+**Minimum matrix:**
+- Rule set: Minimal / CRS PL1 / CRS PL2+
+- Payload: 1 KB / 16 KB / 256 KB / 2 MB
+- Workload mix: static GET, JSON API, multipart upload
+- Concurrency: 1, 8, 32, 128
+- Logging mode: off / minimal / full audit
+
+**A/B variants:**
+- Baseline without WAF vs with WAF
+- WAF with/without specific rule classes (e.g., regex-heavy groups)
+
+### 2) Tooling
+
+- **CPU hotspots:** `perf record` + `perf report`, then flamegraphs
+- **Memory/allocations:** `valgrind --tool=massif`, optionally `heaptrack`
+- **Syscalls/locking:** `perf lock`, `strace -c`
+- **Optional eBPF:** uprobes on `RulesSet::evaluate`, `RuleWithOperator::evaluate`, `Regex::searchOneMatch`, `executeTransformations`
+
+### 3) Instrumentation and metrics
+
+For each request phase (1–5) and for audit-log writing, record:
+
+\[
+T_{phase,i} = t_{end,i} - t_{start,i}
+\]
+
+Additional per-request metrics:
+- `T_regex_total`, `N_regex_calls`, `T_regex_avg`
+- `T_trans_total`, `N_transforms`
+- `alloc_bytes`, `alloc_count`
+- `audit_bytes_written`
+
+### 4) KPIs
+
+- **Latency:** p50 / p95 / p99 (end-to-end and per phase)
+- **CPU:** cycles/request, instructions/request, IPC
+- **Memory:** bytes/request, peak RSS, allocations/request
+- **Scalability:** throughput (RPS) vs concurrency, saturation point
+
+### 5) Optimization acceptance criteria
+
+An optimization is accepted if across 3 runs (95% confidence):
+- p95 latency improves by at least 10% **or**
+- cycles/request decrease by at least 15%
+- with no regression in block/detection fidelity.
+
+---
+
+## Prioritization
+
+### 1) Quantified bottleneck split (CPU share)
+
+**Best case (small rule set, low regex pressure):**
+- Regex engine: **20–35%**
+- Transformation pipeline: **20–30%**
+- Parsing: **15–25%**
+- Logging: **5–15%**
+- Locking/synchronization: **<5%**
+
+**Worst case (CRS-heavy, high paranoia, large bodies):**
+- Regex engine: **45–70%**
+- Transformation pipeline: **15–30%**
+- Parsing (incl. multipart): **10–20%**
+- Logging: **5–15%**
+- Locking/synchronization: **3–10%**
+
+### 2) Weighted top-5 bottlenecks
+
+Weight = expected total CPU impact in production-like CRS setups.
+
+1. **Regex matching (`@rx`)** — **40%**
+2. **Transformations + multiMatch amplification** — **22%**
+3. **Variable expansion / target fanout (V)** — **14%**
+4. **Body parsing (especially multipart, large payloads)** — **13%**
+5. **Logging/serialization + I/O** — **11%**
+
+Total: 100%.
+
+### 3) Focus rules
+
+If \(p_{rx} > 0.35\) or `T_regex_total / C_req > 0.4`, optimize regex first.
+
+If `N_transforms * V` is high, prioritize transformation/target reduction.
+
+If `audit_bytes_written` is high and tail latency dominates, optimize logging first.
+
+---
+
+## Optimizations with Impact
+
+| Measure | Expected Impact | Estimated Improvement | Implementation Complexity | Notes |
+|---|---:|---:|---|---|
+| Regex prefilter (literal/ACMP before `@rx`) | High | **15–35%** CPU | Medium | Reduces expensive regex calls on easy non-matches |
+| Rule grouping + early exits per phase | High | **10–25%** latency/CPU | Medium | Strong effect with large rule volumes |
+| Target scope hardening (reduce `V`) | High | **10–30%** CPU | Low–Medium | Directly reduces \(R\cdot V\cdot T\) multiplier |
+| Transformation fusion/caching | Medium–High | **8–20%** CPU | Medium–High | Must preserve exact semantics |
+| Lazy `FULL_REQUEST` materialization | Medium | **5–15%** CPU+RAM | Low | Avoids large string copies |
+| Streaming/chunk body representation | Medium–High | **10–25%** RAM/CPU | High | Larger refactor, high long-term value |
+| Selective/async audit logging | Medium | **5–20%** tail latency | Low–Medium | Fast operational win |
+| Collection locking optimization (batch/shard) | Low–Medium | **3–10%** at high concurrency | Medium | Relevant for write-heavy rules |
+
+### Suggested rollout order
+
+1. **Quick wins (1–2 sprints):** logging reduction, target-scope hardening, lazy `FULL_REQUEST`.
+2. **Mid-term (2–4 sprints):** regex prefilter, phase gating, transformation optimization.
+3. **Long-term:** streaming body refactor + deeper locking redesign.
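
The rule-engine cost model from the "Performance Model" section of the audit above can be explored numerically. The following Python sketch evaluates the per-rule sum and the aggregated approximation side by side. The `Rule` dataclass, the function names, and every cost figure are hypothetical placeholders rather than measurements from libmodsecurity, and the per-rule form assumes the operator runs once per target value, as the aggregated formula implies.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class Rule:
    """Hypothetical per-rule cost parameters, in arbitrary cost units."""
    targets: int               # V_r: target values the rule inspects
    trans_costs: list[float]   # C_trans(t) for each transformation t
    op_cost: float             # C_op(r): cost of one operator call
    actions_cost: float        # C_actions(r)


def rules_cost_exact(rules: list[Rule]) -> float:
    """Per-rule sum: V_r * (sum_t C_trans(t) + C_op(r)) + C_actions(r)."""
    return sum(
        r.targets * (sum(r.trans_costs) + r.op_cost) + r.actions_cost
        for r in rules
    )


def rules_cost_aggregated(rules: list[Rule]) -> float:
    """Aggregated form: R * V * (T * c_trans + c_op) + R * c_actions."""
    R = len(rules)
    V = mean(r.targets for r in rules)
    T = mean(len(r.trans_costs) for r in rules)
    all_trans = [c for r in rules for c in r.trans_costs]
    c_trans = mean(all_trans) if all_trans else 0.0
    c_op = mean(r.op_cost for r in rules)
    c_actions = mean(r.actions_cost for r in rules)
    return R * V * (T * c_trans + c_op) + R * c_actions
```

For a uniform rule set such as three copies of `Rule(4, [1.0, 1.0], 5.0, 0.5)`, both functions return 85.5. When rules differ widely in `targets` or `op_cost`, the aggregated form generally drifts from the exact sum, which is one reason to measure per rule rather than rely on averages.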
diff --git a/doc/performance_audit_modsecurity_2026-04-03.md b/doc/performance_audit_modsecurity_2026-04-03.md
new file mode 100644
index 0000000000..4e06b8019c
--- /dev/null
+++ b/doc/performance_audit_modsecurity_2026-04-03.md
@@ -0,0 +1,219 @@
+# OWASP ModSecurity v3 – Technischer Performance-Audit (Code-fokussiert)
+
+Dieses Dokument fasst eine tiefgehende, codebasierte Performance-Analyse von libmodsecurity zusammen, mit Fokus auf Hot Paths, Regel-Engine, Speicherverhalten und Skalierung.
+
+> English version: `doc/performance_audit_modsecurity_2026-04-03.en.md`
+
+## Scope
+
+Analysierte Kernpfade:
+- `Transaction` Request/Response Lifecycle
+- `RulesSet` + `RuleWithOperator` Evaluierung
+- Request-Body-Prozessoren (URLENCODED/JSON/XML/MULTIPART)
+- Regex- und Pattern-Matching Operatoren (`@rx`, `@pm`)
+- Collection-Backends und Locking
+- Audit-Logging und Serialisierung
+
+## Wichtigste Erkenntnisse (Kurzfassung)
+
+1. **Dominanter CPU-Pfad**: `RulesSet::evaluate()` → `RuleWithOperator::evaluate()` → Transformationen → Operator (`@rx`, `@pm`, …).
+2. **Regex ist Hauptkostentreiber** bei CRS-lastigen Regelsets; Match-Limits sind vorhanden, aber nur Schadensbegrenzung.
+3. **Signifikante String-/Copy-Kosten** in Request-Body- und Logging-Pfaden (`stringstream::str()`, Header-Konkatenation, JSON/Audit-Serialisierung).
+4. **Multipart-Parsing** arbeitet byteweise mit hohem Branching- und State-Machine-Aufwand.
+5. **Concurrency**: transaktionslokal weitgehend lock-frei; globale Collections nutzen `shared_mutex` und können bei write-lastigen Workloads kontendieren.
+
+## Potenzielle Optimierungen (Top)
+
+- Request-Body intern von `stringstream` auf chunk-/span-basierte Buffering-Strategie umstellen.
+- `FULL_REQUEST` lazy oder feature-gated berechnen (TODO im Code vorhanden).
+- Regex-basierte Regeln per Vorfilter (Literal prefilter, cheap guards) reduzieren.
+- Transformation-Pipeline für häufige Kombinationen cachen/fusen.
+- Audit-Logging asynchron und selektiv (Parts minimieren, JSON nur wenn nötig).
+
+## Performance-Modell
+
+### 1) Kostenmodell pro Request
+
+Wir modellieren die gesamte CPU-Zeit pro Request als:
+
+\[
+C_{req} = C_{conn} + C_{parse} + C_{rules} + C_{log} + C_{sync}
+\]
+
+mit:
+
+- \(C_{conn}\): Verbindungs-/Kontextkosten (klein, nahezu konstant)
+- \(C_{parse}\): URI/Header/Body-Parsing
+- \(C_{rules}\): Regelbewertung (dominant)
+- \(C_{log}\): Audit-/Debug-Serialisierung
+- \(C_{sync}\): Locking-/Contention-Kosten auf shared Collections
+
+Für die Regel-Engine:
+
+\[
+C_{rules} = \sum_{r=1}^{R} \left( V_r \cdot \left(\sum_{t=1}^{T_r} C_{trans}(t) + C_{op}(r)\right) + C_{actions}(r) \right)
+\]
+
+Praktisch (aggregiert) gilt näherungsweise:
+
+\[
+C_{rules} \approx R \cdot V \cdot (T \cdot \bar c_{trans} + \bar c_{op}) + R \cdot \bar c_{actions}
+\]
+
+wobei:
+- \(R\): Anzahl aktiver Regeln pro Phase
+- \(V\): durchschnittliche Anzahl Zielwerte pro Regel (z. B. ARGS, Header, Cookies)
+- \(T\): mittlere Anzahl Transformationen pro Regel
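+- \(\bar c_{trans}\), \(\bar c_{op}\), \(\bar c_{actions}\): mittlere Kosten einer Transformation, eines Operatoraufrufs bzw. der Aktionen einer Regel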
+
+Damit wird klar: **lineare Skalierung in \(R, V, T\)**, aber \(\bar c_{op}\) kann bei Regex nichtlinear eskalieren.
+
+### 2) Regex-Spezialmodell
+
+Für Regex-Regeln zerlegen wir \(\bar c_{op}\) in:
+
+\[
+\bar c_{op} = p_{rx}\cdot c_{rx} + (1-p_{rx})\cdot c_{other}
+\]
+
+mit \(p_{rx}\) als Anteil regex-basierter Regeln.
+
+- **Best Case (JIT + frühes Match/Miss, wenig Backtracking):**
+  \[
+  c_{rx}^{best} = O(n)
+  \]
+- **Worst Case (katastrophales Backtracking):**
+  \[
+  c_{rx}^{worst} = O(e^n)
+  \]
+  in der Praxis durch Match-Limits gedeckelt, aber weiterhin extrem teuer bis zum Abbruch.
+
+### 3) Big-O nach Subsystem
+
+- **Rule Evaluation gesamt:**
+  \[
+  O\left(\sum_{r=1}^{R} V_r\cdot(T_r + O_r)\right)
+  \]
+  i. d. R. näherungsweise \(O(R\cdot V\cdot(T+O))\).
+- **Parsing:**
+  - URI/Header/Cookies: \(O(H + Q)\)
+  - URL-Encoded Body: \(O(B)\)
+  - Multipart: \(O(B\cdot\kappa)\), \(\kappa\) = Zustands-/Boundary-Prüfaufwand
+- **Transformation Pipeline:**
+  \[
+  O(R\cdot V\cdot T\cdot L)
+  \]
+  mit \(L\) = durchschnittliche Stringlänge je Zielwert.
+
+---
+
+## Messstrategie
+
+### 1) Experiment-Design (reproduzierbar)
+
+**Messmatrix (mindestens):**
+- Regelset: Minimal / CRS PL1 / CRS PL2+
+- Payload: 1 KB / 16 KB / 256 KB / 2 MB
+- Workload-Mix: static GET, JSON API, multipart upload
+- Concurrency: 1, 8, 32, 128
+- Logging: aus / minimal / full audit
+
+**A/B-Varianten:**
+- Baseline ohne WAF vs mit WAF
+- WAF mit/ohne bestimmte Rule-Klassen (z. B. regex-lastige Gruppen)
+
+### 2) Tooling
+
+- **CPU Hotspots:** `perf record` + `perf report`, anschließend Flamegraph
+- **Memory/Allokationen:** `valgrind --tool=massif`, optional `heaptrack`
+- **Syscall/Locking:** `perf lock`, `strace -c`
+- **Optional eBPF:** uprobes auf `RulesSet::evaluate`, `RuleWithOperator::evaluate`, `Regex::searchOneMatch`, `executeTransformations`
+
+### 3) Instrumentierung und Metriken
+
+Für jede Request-Phase (1–5 + logging) erfassen:
+
+\[
+T_{phase,i} = t_{end,i} - t_{start,i}
+\]
+
+Zusätzlich pro Request:
+- `T_regex_total`, `N_regex_calls`, `T_regex_avg`
+- `T_trans_total`, `N_transforms`
+- `alloc_bytes`, `alloc_count`
+- `audit_bytes_written`
+
+### 4) KPIs (SLO-fähig)
+
+- **Latency:** p50 / p95 / p99 (end-to-end und pro Phase)
+- **CPU:** cycles/request, instructions/request, IPC
+- **Memory:** bytes/request, peak RSS, allocations/request
+- **Skalierung:** throughput (RPS) vs concurrency, Saturation-Punkt
+
+### 5) Akzeptanzkriterien für Optimierungen
+
+Eine Optimierung gilt als erfolgreich, wenn über 3 Läufe (95%-Konfidenz):
+- p95-Latenz mindestens 10% verbessert **oder**
+- cycles/request mindestens 15% sinken
+- ohne Regression der Block-/Detection-Rate (Sicherheits-Fidelity unverändert)
+
+---
+
+## Priorisierung
+
+### 1) Quantifizierte Impact-Aufteilung (CPU-Anteil)
+
+**Best-Case (kleines Regelset, wenig Regex):**
+- Regex Engine: **20–35%**
+- Transformation Pipeline: **20–30%**
+- Parsing: **15–25%**
+- Logging: **5–15%**
+- Locking/Synchronisation: **<5%**
+
+**Worst-Case (CRS-heavy, hohe Paranoia, große Bodies):**
+- Regex Engine: **45–70%**
+- Transformation Pipeline: **15–30%**
+- Parsing (inkl. multipart): **10–20%**
+- Logging: **5–15%**
+- Locking/Synchronisation: **3–10%**
+
+### 2) Top-5 Bottlenecks mit Gewichtung
+
+Gewichtung = erwarteter Gesamt-CPU-Impact in produktionsnahen CRS-Setups.
+
+1. **Regex-Matching (`@rx`)** — **40%**
+2. **Transformationen + multiMatch-Multiplikation** — **22%**
+3. **Variablenexpansion / Zielwert-Fanout (V)** — **14%**
+4. **Body-Parsing (v. a. multipart, große Payloads)** — **13%**
+5. **Logging/Serialisierung + I/O** — **11%**
+
+Summe: 100%.
+
+### 3) Entscheidungsregel für Fokus
+
+Wenn \(p_{rx} > 0.35\) oder `T_regex_total / C_req > 0.4`, zuerst Regex-Optimierung.
+
+Wenn `N_transforms * V` sehr hoch ist, zuerst Transformations-/Target-Reduktion.
+
+Wenn `audit_bytes_written` hoch und Latenz tail-lastig ist, Logging zuerst reduzieren.
+
+---
+
+## Optimierungen mit Impact
+
+| Maßnahme | Erwarteter Impact | Geschätzte Verbesserung | Implementierungs-Komplexität | Hinweise |
+|---|---:|---:|---|---|
+| Regex-Prefilter (Literal/ACMP vor `@rx`) | Hoch | **15–35%** CPU | Mittel | Reduziert teure Regex-Aufrufe bei sicheren Non-Matches |
+| Regelgruppierung + frühe Exits pro Phase | Hoch | **10–25%** Latenz/CPU | Mittel | Besonders wirksam bei großen Regelmengen |
+| Target-Scope-Härtung (`V` senken, keine unnötigen Collections) | Hoch | **10–30%** CPU | Niedrig–Mittel | Direkte Multiplikatorreduktion in \(R\cdot V\cdot T\) |
+| Transformation-Fusion/Caching häufiger Pipelines | Mittel–Hoch | **8–20%** CPU | Mittel–Hoch | Muss semantisch äquivalent bleiben |
+| `FULL_REQUEST` lazy/materialize-on-demand | Mittel | **5–15%** CPU + RAM | Niedrig | Vermeidet große String-Kopien |
+| Streaming-/Chunk-Body-Repräsentation statt `stringstream::str()` | Mittel–Hoch | **10–25%** RAM/CPU | Hoch | Größerer Refactor, aber hoher langfristiger Gewinn |
+| Audit-Logging selektiv/asynchron (Parts minimal) | Mittel | **5–20%** Tail-Latency | Niedrig–Mittel | Schnell umsetzbar, operativ gut steuerbar |
+| Collection-Locking optimieren (write batching, sharding) | Niedrig–Mittel | **3–10%** bei hoher Concurrency | Mittel | Nur relevant bei write-heavy Regeln |
+
+### Umsetzungsreihenfolge (Roadmap)
+
+1. **Quick Wins (1–2 Sprints):** Logging-Reduktion, Target-Scope-Härtung, `FULL_REQUEST` lazy.
+2. **Mid-Term (2–4 Sprints):** Regex-Prefilter, Rule-Phasen-Gating, Transformationsoptimierung.
+3. **Langfristig:** Streaming-Body-Refactor + tieferes Locking-Redesign.
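
The "Regex prefilter" row in the optimization table of both audit versions can be illustrated with a small Python sketch: a cheap substring check decides whether the regex needs to run at all. This is not the libmodsecurity implementation. The class name, the counter, and the example pattern are hypothetical, and a single substring test stands in for the ACMP (multi-pattern) matcher the table mentions. The guard is only sound under the stated assumption that the literal occurs in every string the pattern can match.

```python
import re


class PrefilteredRegex:
    """Runs a cheap substring check before the regex.

    Assumption: `required_literal` occurs in every string the pattern can
    match. If that does not hold, the guard causes false negatives.
    """

    def __init__(self, pattern: str, required_literal: str) -> None:
        # re.ASCII keeps case-insensitive matching limited to ASCII letters,
        # so the lowercased substring check below cannot miss a regex match.
        self._regex = re.compile(pattern, re.IGNORECASE | re.ASCII)
        self._literal = required_literal.lower()
        self.regex_calls = 0  # how often the regex actually ran

    def search(self, value: str):
        # Cheap guard: skip the regex entirely when the literal is absent.
        if self._literal not in value.lower():
            return None
        self.regex_calls += 1
        return self._regex.search(value)


rx = PrefilteredRegex(r"union\s+select", "union")
rx.search("name=alice&page=2")            # guard rejects, regex not run
rx.search("id=1 UNION  SELECT password")  # guard passes, regex matches
rx.search("reunion dinner")               # guard passes, regex rejects
```

After these three calls `rx.regex_calls` is 2: the benign request never reaches the regex engine. The saving depends on how often the literal is absent from real traffic, so the 15–35% estimate in the table still has to be confirmed by measurement.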
diff --git a/doc/performance_evaluation_modsecurity_v3_2026-04-05.de.md b/doc/performance_evaluation_modsecurity_v3_2026-04-05.de.md
new file mode 100644
index 0000000000..9e0120a8aa
--- /dev/null
+++ b/doc/performance_evaluation_modsecurity_v3_2026-04-05.de.md
@@ -0,0 +1,121 @@
+# OWASP ModSecurity (libmodsecurity v3) – Tiefgehende technische Performance-Evaluierung (Deutsch)
+
+> Scope: Engine-Verhalten in libmodsecurity v3 mit quantifizierten Schätzungen für CPU, RAM, I/O und Skalierung in typischen Enterprise-CRS-Deployments.
+
+## 0) Executive Summary
+
+1. **Das primäre Kostenzentrum ist die Regel-Ausführungsfächerung**: die effektiven CPU-Kosten skalieren näherungsweise mit `R × V × (T + O)` (Regeln × Zielwerte × (Transformations- + Operatorkosten)), wobei regex-lastige Operatoren in CRS-ähnlichen Setups dominieren.
+2. **Der größte Bottleneck ist die Regex-Auswertung (`@rx`) in transformationslastigen Pipelines**, insbesondere bei hoher Payload-Entropie und stark überlappenden Regeln.
+3. **Das größte systemische Risiko ist ein Tail-Latency-Kollaps bei hoher Parallelität**, wenn Regex-Last, Body-Parsing und synchrones Audit-Logging gleichzeitig auftreten.
+4. **Der beste Quick Win ist die Reduktion von Regex-Aufrufen** (Prefilter/Literal-Gating + engerer Target-Scope), mit typischerweise **15–35% CPU-Reduktion** und **10–25% Verbesserung bei p95-Latenz** in regelintensiven Profilen.
+5. **Operativ ist der Logging-Modus ein Hebel erster Ordnung**: der Wechsel von vollständigem synchronem Audit-Logging auf selektive/asynchrone Modi kann p99-Aufblähungen um **5–25%** reduzieren.
+
+## 1) CPU Performance Analysis
+
+### 1.1 Komplexität und Hotspots
+- Dominanter Ausführungspfad: Transaktions-Phasenloop -> Regelbewertung pro Regel -> Variablen-Extraktion pro Ziel -> Transformationskette -> Operatorausführung.
+- Näherungsmodell für CPU pro Request:
+  - `C_total ≈ C_parse + Σ_r Σ_v (Σ_t C_trans + C_op) + C_logging + C_sync`
+  - Kompakt: `C_total ≈ C_parse + R·V·(T·c_trans + c_op) + C_logging + C_sync`.
+- Hotspots (absteigend):
+  1. Ausführung von `@rx` inkl. Capture-Handling.
+  2. Transformationspipeline (insbesondere mit Multi-Match-Semantik).
+  3. Variablen-Fanout und wiederholte Evaluation über große Collections.
+  4. Multipart-/zustandsbehaftetes Body-Parsing bei großen Requests.
+
+### 1.2 Geschätzte CPU-Verteilung (regelintensives CRS-Profil)
+- Regex-Engine: **45–70%** (Worst Case), **20–35%** (Best Case mit Tuning).
+- Transformationen: **15–30%**.
+- Parsing (URI/Header/Body): **10–25%**.
+- Logging-Serialisierung im Hot Path: **5–15%**.
+- Locking/Sync-Overhead: **3–10%** (höher bei write-lastiger Collection-Nutzung).
+
+### 1.3 Verhalten von PCRE/JIT/Backtracking
+- Bei stabilen Patterns und JIT-freundlichem Traffic: für die meisten Matches/Misses näherungsweise lineares Laufzeitverhalten.
+- Bei adversarialen Inputs / pathologischer Überlappung: Backtracking dominiert und kann Tail-Latenz stark erhöhen, bis Match-Limits greifen.
+- Best-Case-Regex-Beitrag: **~0.5–2.0 µs pro Aufruf** (einfache verankerte Muster, warmer Instruction Cache).
+- Worst-Case-Regex-Beitrag: **10–1000+ µs pro Aufruf** bis zum limitbedingten Abbruch in pathologischen Fällen.
+
+## 2) RAM / Memory Analysis
+
+### 2.1 Speicherprofil pro Request
+- Durchschnittsrequest (kleiner Body, moderate Header): **~80–300 KB/Request** zusätzlicher Working Set.
+- Regelintensives Profil mit Body-Inspection: **~300 KB–1.5 MB/Request**.
+- Große Multipart-Uploads (Buffering + Metadaten + Tempfile-Verwaltung): **~1.5–8 MB/Request** transiente Peaks.
+
+### 2.2 Haupttreiber für Speicherverbrauch
+- String-Duplizierung beim Aufbau vollständiger Requests und bei Body-Extraktion.
+- Temporäre Objekte: Variablen-Vektoren, Capture-Container, Kopien transformierter Werte.
+- Wiederholte Allokationen in per-Regel/per-Variable-Evaluationsschleifen.
+- Multipart-Parser-Buffer und Metadatenstrukturen für temporäre Dateien.
+
+### 2.3 Peak vs. Average und Skalierung
+- Der Durchschnittsspeicher skaliert näherungsweise mit der aktiven Variablenkardinalität und der Body-Größe.
+- Peak-Speicher skaliert überproportional, wenn großes Body-Buffering, hohe Parallelität und ausführliches Logging überlappen.
+- Bei Parallelität `N` gilt praktisch:
+  - `Mem_total ≈ Mem_base + N × Mem_req_avg + Mem_fragmentation`
+  - Fragmentierung/Allocator-Druck kann bei Burst-Lasten **10–30%** Overhead erzeugen.
+
+## 3) I/O Performance Analysis
+
+### 3.1 Audit-Logging und Disk-I/O
+- Vollständiges Audit-Logging kann bei blockiertem/hochdetailliertem Traffic zur dominanten I/O-Senke werden.
+- Geschätzte Audit-Nutzlast pro Request:
+  - Minimalmodus: **0.5–3 KB**
+  - Vollmodus mit Body/Headern: **5–100+ KB** (payloadabhängig)
+- Bei hoher RPS können synchrone Writes Queueing erzeugen und p99-Latenz erhöhen.
+
+### 3.2 Tempfile- und Upload-Pfad-I/O
+- Multipart mit Dateiinspektion erzeugt Disk-Druck über den Tempfile-Lebenszyklus.
+- Bei dauerhaftem Upload-Traffic können IOPS und fsync-Verhalten limitieren, bevor CPU saturiert.
+
+### 3.3 Netzwerk-Overhead (Reverse-Proxy-Deployments)
+- Reverse-Proxy-Topologien fügen zusätzlichen Hop und zusätzliche Buffering-Domänen hinzu.
+- Typischer zusätzlicher Netzwerk-/Service-Overhead: **+0.2–2.0 ms** Median, **+1–10 ms** im Tail, abhängig von Topologie und TLS-Platzierung.
+
+### 3.4 Einfluss von synchronem vs asynchronem Logging
+- Synchrones Logging: stärkere Konsistenz, höhere Latenzkopplung.
+- Asynchrones Logging: bessere Entkopplung der Tail-Latenz, mögliches Verlustfenster bei Crash-Szenarien.
+- Erwarteter Effekt von async + selektiven Parts: **Throughput +5–20%**, **p99-Latenz -5–25%**.
+
+## 4) System Interaction (CPU + RAM + I/O)
+
+- Bottlenecks verstärken sich multiplikativ statt additiv:
+  - Höhere Regex-Zeiten erhöhen die Request-Residency -> größerer gleichzeitiger Speicher-Footprint -> höherer Allocator-Druck -> längere Queues bei synchronem Logging.
+- Realverhalten unter Burst-Last:
+  - Phase-2-Body-Parsing und regex-lastige Phasen erhöhen CPU-Stall-Zeiten.
+  - Gleichzeitiges Logging/Datei-I/O erzeugt Scheduler- und Storage-Contention.
+- Effekt hoher Parallelität:
+  - Throughput skaliert bei niedriger Parallelität nahezu linear und knickt dann am Sättigungspunkt von Regex + I/O ab.
+  - Typischer Knee-Bereich in regelintensiven Profilen: **16–64 Worker/Threads**, umgebungsabhängig.
+
+## 5) Performance Heatmap
+
+| Kategorie | Score (0–10, höher ist besser) | Impact | Bottleneck Severity |
+|---|---:|---|---|
+| CPU | 4.5 | Sehr hoch | Kritisch |
+| RAM | 6.0 | Mittel-Hoch | Major |
+| I/O | 5.5 | Hoch bei Audit-/Upload-lastigen Profilen | Major |
+
+### Interpretation
+- **CPU ist in den meisten CRS-nahen Deployments die limitierende Dimension** durch Regex-/Transformationsverstärkung.
+- **RAM ist bei Durchschnittstraffic beherrschbar**, zeigt jedoch steile transiente Peaks bei Überlappung von großen Payloads und hoher Parallelität.
+- **I/O ist szenariosensitiv**: bei minimalem Logging meist unkritisch, bei vollem Audit-Logging und Upload-Inspektion jedoch häufig Bottleneck erster Ordnung.
+
+## 6) Optimization Impact Table
+
+| Optimierung | Bereich | Impact (%) | Aufwand | Priorität |
+|---|---|---:|---|---|
+| Regex-Prefilter (Literal/ACMP vor `@rx`) | CPU | 15–35 | Mittel | P0 |
+| Target-Scope-Reduktion (`V` minimieren) | CPU/RAM | 10–30 | Niedrig-Mittel | P0 |
+| Transformationspipeline reduzieren/fusen | CPU | 8–20 | Mittel-Hoch | P1 |
+| Lazy Full-Request-Materialisierung | RAM/CPU | 5–15 | Niedrig | P1 |
+| Selektives + asynchrones Audit-Logging | I/O/CPU | 5–25 | Niedrig-Mittel | P0 |
+| Multipart-Buffering-Optimierung / Streaming-Strategie | RAM/I/O | 10–25 | Hoch | P1 |
+| Collection-Write-Path-Contention-Tuning | CPU | 3–10 | Mittel | P2 |
+
+## 7) Final Score
+
+- **Gesamt-Performance-Score: 5.3 / 10** (sicherheitswirksam, aber ohne konsequentes Tuning kostenintensiv).
+- **Einsetzen, wenn:** tiefe, regelbasierte Inspektionsqualität und flexible Policy-Steuerung benötigt werden und Tuning/Benchmarking-Kapazität vorhanden ist.
+- **Nicht einsetzen bzw. stark begrenzen, wenn:** ultra-niedrige Latenzbudgets (<2–5 ms), sehr hohe Parallelität mit großen Payloads oder geringe Observability-/Tuning-Reife vorliegen.
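
The concurrency envelope from section 2.3 of the evaluation (`Mem_total ≈ Mem_base + N × Mem_req_avg + Mem_fragmentation`) can be turned into a quick sizing calculation. The Python sketch below is illustrative only: the function name and the 50 MB base footprint are hypothetical, and applying the 10–30% fragmentation overhead to the per-request total is an assumption, because the text does not say what the percentage is relative to.

```python
def memory_envelope_mb(
    concurrency: int,
    mem_req_avg_kb: float,
    mem_base_mb: float = 50.0,            # hypothetical base footprint
    fragmentation: tuple = (0.10, 0.30),  # overhead range quoted in section 2.3
) -> tuple:
    """Return (low, high) estimates of total memory in MB.

    Implements Mem_total = Mem_base + N * Mem_req_avg + Mem_fragmentation,
    with fragmentation modeled as a fraction of the per-request total.
    """
    live_mb = concurrency * mem_req_avg_kb / 1024.0
    low_frac, high_frac = fragmentation
    low = mem_base_mb + live_mb * (1.0 + low_frac)
    high = mem_base_mb + live_mb * (1.0 + high_frac)
    return low, high


# 64 concurrent requests at the upper end of the rule-heavy profile (1.5 MB).
low, high = memory_envelope_mb(64, 1536)
print(f"estimated envelope: {low:.1f} to {high:.1f} MB")
```

Under these assumptions the estimate is about 155.6 to 174.8 MB. Large multipart uploads at 8 MB per request shift the same calculation into the hundreds of megabytes, which matches the report's warning about transient peaks.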
diff --git a/doc/performance_evaluation_modsecurity_v3_2026-04-05.en.md b/doc/performance_evaluation_modsecurity_v3_2026-04-05.en.md
new file mode 100644
index 0000000000..2e3d9f53c9
--- /dev/null
+++ b/doc/performance_evaluation_modsecurity_v3_2026-04-05.en.md
@@ -0,0 +1,121 @@
+# OWASP ModSecurity (libmodsecurity v3) – Deep Technical Performance Evaluation (English)
+
+> Scope: Engine-level behavior in libmodsecurity v3, with quantified estimates for CPU, RAM, I/O and scaling under typical enterprise CRS deployments.
+
+## 0) Executive Summary
+
+1. **Primary cost center is rule execution fanout**: effective CPU cost scales approximately with `R × V × (T + O)` (rules × target values × (transformation cost + operator cost)), with regex-heavy operator sets dominating runtime in CRS-like deployments.
+2. **Biggest bottleneck is regex evaluation (`@rx`) under transformation-heavy pipelines**, especially when payload entropy and rule overlap are high.
+3. **Biggest systemic risk is tail-latency collapse at high concurrency** when regex pressure, body parsing, and synchronous audit logging coincide.
+4. **Best quick win optimization is regex call reduction** (prefilter/literal gating + target scope reduction), typically yielding **15–35% CPU reduction** and **10–25% p95 latency improvement** in rule-heavy profiles.
+5. **Operationally, logging mode is a first-order tuning lever**: moving from full synchronous audit logging to selective/asynchronous modes can reduce p99 inflation by **5–25%**.
+
+## 1) CPU Performance Analysis
+
+### 1.1 Complexity and hotspots
+- Dominant execution path: transaction phase loop -> per-rule evaluation -> per-variable extraction -> transformation chain -> operator execution.
+- Approximate request CPU model:
+  - `C_total ≈ C_parse + Σ_r Σ_v (Σ_t C_trans + C_op) + C_logging + C_sync`
+  - In compact form: `C_total ≈ C_parse + R·V·(T·c_trans + c_op) + C_logging + C_sync`.
+- Hotspots (descending):
+  1. `@rx` operator execution and capture handling.
+  2. Transformation pipeline (especially with multi-match semantics).
+  3. Variable fanout and repeated evaluation over large collections.
+  4. Multipart/stateful body parsing for large requests.
+
+### 1.2 Estimated CPU distribution (rule-heavy CRS profile)
+- Regex engine: **45–70%** (worst-case), **20–35%** (best-case tuned profile).
+- Transformations: **15–30%**.
+- Parsing (URI/headers/body): **10–25%**.
+- Logging serialization in hot path: **5–15%**.
+- Locking/sync overhead: **3–10%** (higher in write-heavy collection usage).
+
+### 1.3 PCRE/JIT/backtracking behavior
+- With stable patterns and JIT-friendly traffic: near-linear behavior at runtime for most matches/misses.
+- Under adversarial inputs / pathological overlap: backtracking can dominate and sharply inflate tail latency until match limits cut execution.
+- Best-case regex contribution: **~0.5–2.0 µs per call** (simple anchored patterns, warm instruction cache).
+- Worst-case regex contribution: **10–1000+ µs per call** before limit-triggered abort in pathological cases.
+
+## 2) RAM / Memory Analysis
+
+### 2.1 Per-request memory profile
+- Average request (small body, moderate headers): **~80–300 KB/request** incremental working set.
+- Rule-heavy + body-inspection profile: **~300 KB–1.5 MB/request**.
+- Large multipart uploads (buffering + metadata + temp-file bookkeeping): **~1.5–8 MB/request** transient peak.
+
+### 2.2 Major memory drivers
+- String duplication around full request composition and body extraction.
+- Temporary objects: variable vectors, capture containers, transformed value copies.
+- Repeated allocations for per-rule/per-variable evaluation loops.
+- Multipart parser buffers and temporary file metadata structures.
+
+### 2.3 Peak vs average and scaling
+- Average memory scales roughly with active variable cardinality and body size.
+- Peak memory scales superlinearly when large body buffering overlaps with high concurrency and verbose logging.
+- At concurrency `N`, practical envelope:
+  - `Mem_total ≈ Mem_base + N × Mem_req_avg + Mem_fragmentation`
+  - Fragmentation/allocator pressure can add **10–30%** overhead under bursty loads.
+
+## 3) I/O Performance Analysis
+
+### 3.1 Audit logging and disk I/O
+- Full audit logging can become the dominant I/O sink for blocked/high-detail traffic.
+- Estimated per-request audit payload:
+  - Minimal mode: **0.5–3 KB**
+  - Full body/headers mode: **5–100+ KB** (payload dependent)
+- At high RPS, synchronous writes can induce queueing and elevate p99 latency.
+
+### 3.2 Temp-file and upload path I/O
+- Multipart with file inspection creates disk pressure via temp-file lifecycle.
+- Under sustained upload traffic, IOPS and fsync behavior may become limiting before CPU saturates.
+
+### 3.3 Network overhead (reverse-proxy deployments)
+- Reverse-proxy topologies add an extra hop and buffering domain.
+- Typical additional network/service overhead: **+0.2–2.0 ms** median, **+1–10 ms** tail depending on topology and TLS placement.
+
+### 3.4 Sync vs async logging impact
+- Sync logging: stronger consistency, higher latency coupling.
+- Async logging: better tail-latency isolation, potential loss window under crash scenarios.
+- Expected effect of async + selective parts: **throughput +5–20%**, **p99 latency -5–25%**.
+
+## 4) System Interaction (CPU + RAM + I/O)
+
+- Bottlenecks amplify multiplicatively, not additively:
+  - Higher regex time increases request residency -> higher concurrent memory footprint -> higher allocator pressure -> longer queues for synchronous logging.
+- Real-world behavior under burst load:
+  - Phase-2 body parsing and regex-heavy phase execution increase CPU stall time.
+  - Concurrent logging/file I/O induces scheduler and storage contention.
+- High concurrency effect:
+  - Throughput scales near-linearly at low concurrency, then bends at the regex + I/O saturation knee.
+  - Typical knee region in rule-heavy profiles: **16–64 workers/threads**, environment-dependent.
+
+## 5) Performance Heatmap
+
+| Category | Score (0–10, higher is better) | Impact | Bottleneck Severity |
+|---|---:|---|---|
+| CPU | 4.5 | Very High | Critical |
+| RAM | 6.0 | Medium-High | Major |
+| I/O | 5.5 | High in audit/upload-heavy profiles | Major |
+
+### Interpretation
+- **CPU is the limiting dimension** in most CRS-grade deployments due to regex/transformation amplification.
+- **RAM is manageable in average traffic** but shows steep transient peaks under large payload + concurrency overlap.
+- **I/O is scenario-sensitive**: benign under minimal logging, but can become first-order bottleneck with full audit and upload inspection.
+
+## 6) Optimization Impact Table
+
+| Optimization | Area | Impact (%) | Effort | Priority |
+|---|---|---:|---|---|
+| Regex prefilter (literal/ACMP gate before `@rx`) | CPU | 15–35 | Medium | P0 |
+| Target scope reduction (`V` minimization) | CPU/RAM | 10–30 | Low-Medium | P0 |
+| Transformation pipeline reduction/fusion | CPU | 8–20 | Medium-High | P1 |
+| Lazy full-request materialization | RAM/CPU | 5–15 | Low | P1 |
+| Selective + async audit logging | I/O/CPU | 5–25 | Low-Medium | P0 |
+| Multipart buffering optimization / streaming strategy | RAM/I/O | 10–25 | High | P1 |
+| Collection write-path contention tuning | CPU | 3–10 | Medium | P2 |
+
+## 7) Final Score
+
+- **Overall performance score: 5.3 / 10** (security-effective, but cost-intensive without strict tuning).
+- **Use when:** deep rule-level inspection quality and policy flexibility are required, and tuning/benchmarking capacity exists.
+- **Avoid or strongly gate when:** ultra-low-latency budgets (<2–5 ms), very high concurrency with large payloads, or limited observability/tuning maturity.
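
The sync versus async trade-off from section 3.4 of the evaluation can be made concrete with a small Python sketch: request threads enqueue audit records and a single writer thread drains them, so the request path never waits on the sink. This is a hypothetical illustration, not how libmodsecurity writes audit logs. The class name, the `sink` callable, and the `max_pending` bound are assumptions introduced for the example.

```python
import queue
import threading


class AsyncAuditLogger:
    """Hypothetical sketch: request threads enqueue, one writer drains."""

    def __init__(self, sink, max_pending: int = 1024) -> None:
        self._sink = sink                          # any callable taking one str
        self._queue = queue.Queue(maxsize=max_pending)
        self._lock = threading.Lock()
        self.dropped = 0                           # records lost to a full queue
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def log(self, record: str) -> bool:
        """Called on the request path; never blocks on the sink."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1                  # accept loss, not latency
            return False

    def _drain(self) -> None:
        while True:
            record = self._queue.get()
            if record is None:                     # shutdown sentinel
                break
            self._sink(record)

    def close(self) -> None:
        """Flush pending records and stop the writer thread."""
        self._queue.put(None)
        self._writer.join()


written = []
logger = AsyncAuditLogger(written.append, max_pending=8)
for i in range(5):
    logger.log(f"txn-{i}")
logger.close()
```

Records that are still in the queue when the process crashes are lost, which is the "potential loss window" the section mentions. The bounded queue makes the second cost visible as well: under sustained overload, `log` drops records instead of adding latency to the request.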