diff --git a/doc/performance_audit_modsecurity_2026-04-03.en.md b/doc/performance_audit_modsecurity_2026-04-03.en.md new file mode 100644 index 0000000000..adc2620e47 --- /dev/null +++ b/doc/performance_audit_modsecurity_2026-04-03.en.md @@ -0,0 +1,217 @@ +# OWASP ModSecurity v3 – Technical Performance Audit (Code-focused) + +This document summarizes an in-depth, code-based performance analysis of libmodsecurity, focusing on hot paths, rule engine behavior, memory characteristics, and scalability. + +## Scope + +Analyzed core paths: +- `Transaction` request/response lifecycle +- `RulesSet` + `RuleWithOperator` evaluation +- Request body processors (URLENCODED/JSON/XML/MULTIPART) +- Regex and pattern-matching operators (`@rx`, `@pm`) +- Collection backends and locking +- Audit logging and serialization + +## Key Findings (Summary) + +1. **Dominant CPU path**: `RulesSet::evaluate()` → `RuleWithOperator::evaluate()` → transformations → operator (`@rx`, `@pm`, ...). +2. **Regex is the primary cost driver** in CRS-heavy rule sets; match limits mitigate impact but do not eliminate expensive patterns. +3. **Significant string/copy overhead** in request-body and logging paths (`stringstream::str()`, header concatenation, JSON/audit serialization). +4. **Multipart parsing** is byte-wise and state-machine heavy, with high branching cost. +5. **Concurrency**: mostly lock-free per transaction; shared collections use `shared_mutex` and may contend in write-heavy workloads. + +## Top Optimization Opportunities + +- Replace `stringstream`-centric body handling with chunk-/span-based buffering. +- Compute `FULL_REQUEST` lazily or behind feature gates. +- Add prefilters (literal guards) before expensive regex operators. +- Cache/fuse frequent transformation pipelines. +- Reduce and/or async-offload audit logging where possible. 
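The prefilter idea from the list above (a literal guard in front of an expensive regex operator) can be illustrated with a minimal Python sketch. This is not libmodsecurity code: `GuardedRule` and `required_literal` are hypothetical names, and the sketch assumes the chosen literal is a substring that every possible match must contain, since otherwise the guard would change detection results.

```python
import re
from dataclasses import dataclass


@dataclass
class GuardedRule:
    """Hypothetical rule: a regex plus a literal that every match must contain."""
    pattern: re.Pattern
    required_literal: str
    regex_calls: int = 0  # counts how often the expensive path actually ran

    def matches(self, value: str) -> bool:
        # Cheap substring check first: if the literal is absent, the regex
        # cannot match, so the regex engine is never invoked.
        if self.required_literal not in value:
            return False
        self.regex_calls += 1
        return self.pattern.search(value) is not None


# Assumption: inputs are lowercased first, so the lowercase literal is a valid guard.
rule = GuardedRule(re.compile(r"union\s+select"), "select")
inputs = ["id=42", "q=hello world", "q=1 UNION SELECT password"]
results = [rule.matches(value.lower()) for value in inputs]
```

Only the third input reaches the regex engine; the other two are rejected by the substring check alone.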
+ +## Performance Model + +### 1) Request Cost Model + +We model total CPU cost per request as: + +\[ +C_{req} = C_{conn} + C_{parse} + C_{rules} + C_{log} + C_{sync} +\] + +where: + +- \(C_{conn}\): connection/context cost (small, near-constant) +- \(C_{parse}\): URI/header/body parsing +- \(C_{rules}\): rule evaluation (dominant) +- \(C_{log}\): audit/debug serialization +- \(C_{sync}\): locking/contention cost on shared collections + +For the rule engine: + +\[ +C_{rules} = \sum_{r=1}^{R} \left( V_r \cdot \left(\sum_{t=1}^{T_r} C_{trans}(t) + C_{op}(r)\right) + C_{actions}(r) \right) +\] + +In aggregated form: + +\[ +C_{rules} \approx R \cdot V \cdot (T \cdot \bar c_{trans} + \bar c_{op}) + R \cdot \bar c_{actions} +\] + +with: +- \(R\): number of active rules per phase +- \(V\): average number of target values per rule +- \(T\): average number of transformations per rule +- \(\bar c_{trans}\): average cost of one transformation +- \(\bar c_{op}\): average operator cost +- \(\bar c_{actions}\): average action cost per rule + +This makes the multiplicative interaction of \(R\), \(V\), and \(T\) explicit: cost is linear in each factor, while \(\bar c_{op}\) can grow nonlinearly for regex operators. + +### 2) Regex Cost Model + +For regex-heavy workloads: + +\[ +\bar c_{op} = p_{rx}\cdot c_{rx} + (1-p_{rx})\cdot c_{other} +\] + +where \(p_{rx}\) is the fraction of rules that use a regex operator and \(n\) below is the length of the inspected input. + +- **Best case (JIT + early fail/match):** + \[ + c_{rx}^{best} = O(n) + \] +- **Worst case (catastrophic backtracking):** + \[ + c_{rx}^{worst} = O(e^n) + \] + In practice, match limits bound this, but a pathological match remains expensive until the limit aborts it. + +### 3) Big-O by subsystem + +- **Rule evaluation (overall):** + \[ + O\left(\sum_{r=1}^{R} V_r\cdot(T_r + O_r)\right) + \] + where \(O_r\) is the operator cost of rule \(r\); typically approximated as \(O(R\cdot V\cdot(T+O))\). +- **Parsing:** + - URI/Header/Cookies: \(O(H + Q)\) + - URL-encoded body: \(O(B)\) + - Multipart: \(O(B\cdot\kappa)\), with \(\kappa\) as state/boundary-check overhead +- **Transformation pipeline:** + \[ + O(R\cdot V\cdot T\cdot L) + \] + where \(L\) is average target string length. 
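To make the aggregated model concrete, the following Python sketch evaluates \(C_{rules}\) and \(\bar c_{op}\) for one set of inputs. All numbers are hypothetical placeholders chosen for illustration (costs in microseconds), not measurements of libmodsecurity; the function names are likewise invented for this example.

```python
def mean_operator_cost(p_rx: float, c_rx: float, c_other: float) -> float:
    """Average operator cost: c_op = p_rx * c_rx + (1 - p_rx) * c_other."""
    return p_rx * c_rx + (1 - p_rx) * c_other


def rules_cost(R: float, V: float, T: float,
               c_trans: float, c_op: float, c_actions: float) -> float:
    """Aggregated rule-engine cost: R * V * (T * c_trans + c_op) + R * c_actions."""
    return R * V * (T * c_trans + c_op) + R * c_actions


# Hypothetical inputs: 200 rules, 5 target values and 3 transformations per rule.
c_op = mean_operator_cost(p_rx=0.5, c_rx=2.0, c_other=0.4)
baseline = rules_cost(R=200, V=5, T=3, c_trans=0.2, c_op=c_op, c_actions=0.1)

# Halving the target fanout V roughly halves the rule-engine cost.
reduced_fanout = rules_cost(R=200, V=2.5, T=3, c_trans=0.2, c_op=c_op, c_actions=0.1)

print(f"c_op = {c_op:.2f} us, baseline = {baseline:.0f} us, "
      f"reduced fanout = {reduced_fanout:.0f} us")
```

With these placeholder values the baseline comes to roughly 1.8 ms of rule evaluation per request, and reducing \(V\) from 5 to 2.5 cuts it to roughly half, because only the small \(R \cdot \bar c_{actions}\) term is unaffected by \(V\).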
+ +--- + +## Measurement Strategy + +### 1) Reproducible experiment design + +**Minimum matrix:** +- Rule set: Minimal / CRS PL1 / CRS PL2+ +- Payload: 1 KB / 16 KB / 256 KB / 2 MB +- Workload mix: static GET, JSON API, multipart upload +- Concurrency: 1, 8, 32, 128 +- Logging mode: off / minimal / full audit + +**A/B variants:** +- Baseline without WAF vs with WAF +- WAF with/without specific rule classes (e.g., regex-heavy groups) + +### 2) Tooling + +- **CPU hotspots:** `perf record` + `perf report`, then flamegraphs +- **Memory/allocations:** `valgrind --tool=massif`, optionally `heaptrack` +- **Syscalls/locking:** `perf lock`, `strace -c` +- **Optional eBPF:** uprobes on `RulesSet::evaluate`, `RuleWithOperator::evaluate`, `Regex::searchOneMatch`, `executeTransformations` + +### 3) Instrumentation and metrics + +For each request phase (1–5 + logging): + +\[ +T_{phase,i} = t_{end,i} - t_{start,i} +\] + +Additional per-request metrics: +- `T_regex_total`, `N_regex_calls`, `T_regex_avg` +- `T_trans_total`, `N_transforms` +- `alloc_bytes`, `alloc_count` +- `audit_bytes_written` + +### 4) KPIs + +- **Latency:** p50 / p95 / p99 (end-to-end and per phase) +- **CPU:** cycles/request, instructions/request, IPC +- **Memory:** bytes/request, peak RSS, allocations/request +- **Scalability:** throughput (RPS) vs concurrency, saturation point + +### 5) Optimization acceptance criteria + +An optimization is accepted if across 3 runs (95% confidence): +- p95 latency improves by at least 10% **or** +- cycles/request decrease by at least 15% +- with no regression in block/detection fidelity. 
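The acceptance criteria above can be expressed as a small check. The sketch below is one possible reading, under stated assumptions: latency percentiles use the nearest-rank method, "no regression in block/detection fidelity" is interpreted as the set of blocked request IDs being unchanged, and the function and parameter names are hypothetical. It does not address how to establish the 95% confidence across runs.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty sample list, q in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def is_accepted(base_latencies, cand_latencies,
                base_cycles, cand_cycles,
                base_blocked, cand_blocked):
    """Accept if p95 improves by >= 10% or cycles/request drop by >= 15%,
    provided the set of blocked requests is unchanged."""
    base_p95 = percentile(base_latencies, 95)
    cand_p95 = percentile(cand_latencies, 95)
    latency_gain = (base_p95 - cand_p95) / base_p95
    cycle_gain = (base_cycles - cand_cycles) / base_cycles
    same_detections = set(base_blocked) == set(cand_blocked)
    return same_detections and (latency_gain >= 0.10 or cycle_gain >= 0.15)


# Hypothetical data: latencies in ms, cycles per request, blocked request IDs.
base = [1.0, 1.2, 1.1, 4.0, 1.3]
cand = [0.9, 1.0, 1.0, 3.2, 1.1]
accepted = is_accepted(base, cand, 1_000_000, 950_000, {"r7", "r9"}, {"r7", "r9"})
rejected = is_accepted(base, cand, 1_000_000, 950_000, {"r7", "r9"}, {"r7"})
```

In this example the candidate improves p95 from 4.0 ms to 3.2 ms (20%), so it is accepted when detections are unchanged and rejected when one previously blocked request is no longer blocked.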
+ +--- + +## Prioritization + +### 1) Quantified bottleneck split (CPU share) + +**Best case (small rule set, low regex pressure):** +- Regex engine: **20–35%** +- Transformation pipeline: **20–30%** +- Parsing: **15–25%** +- Logging: **5–15%** +- Locking/synchronization: **<5%** + +**Worst case (CRS-heavy, high paranoia, large bodies):** +- Regex engine: **45–70%** +- Transformation pipeline: **15–30%** +- Parsing (incl. multipart): **10–20%** +- Logging: **5–15%** +- Locking/synchronization: **3–10%** + +### 2) Weighted top-5 bottlenecks + +Weight = expected total CPU impact in production-like CRS setups. + +1. **Regex matching (`@rx`)** — **40%** +2. **Transformations + multiMatch amplification** — **22%** +3. **Variable expansion / target fanout (V)** — **14%** +4. **Body parsing (especially multipart, large payloads)** — **13%** +5. **Logging/serialization + I/O** — **11%** + +Total: 100%. + +### 3) Focus rules + +If \(p_{rx} > 0.35\) or `T_regex_total / C_req > 0.4`, optimize regex first. + +If `N_transforms * V` is high, prioritize transformation/target reduction. + +If `audit_bytes_written` is high and tail latency dominates, optimize logging first. 
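The focus rules above can be read as a simple decision function. In the sketch below, only the regex thresholds (0.35 and 0.4) come from the text; the limits for "high" fanout, "high" audit volume, and "tail latency dominates" are not specified in the document, so `fanout_limit`, `audit_limit`, and `tail_limit` are hypothetical defaults that would need calibration per deployment.

```python
def choose_focus(p_rx, regex_share, fanout, audit_bytes, tail_ratio,
                 fanout_limit=50, audit_limit=64 * 1024, tail_limit=3.0):
    """Return optimization areas to prioritize, in the order the rules are stated.

    p_rx         fraction of regex-based rules
    regex_share  T_regex_total divided by total request time
    fanout       N_transforms * V
    audit_bytes  audit_bytes_written per request
    tail_ratio   p99 / p50 latency, used here as a proxy for tail dominance
    """
    focus = []
    if p_rx > 0.35 or regex_share > 0.4:
        focus.append("regex")
    if fanout > fanout_limit:  # assumed threshold
        focus.append("transformations/targets")
    if audit_bytes > audit_limit and tail_ratio > tail_limit:  # assumed thresholds
        focus.append("logging")
    return focus


regex_heavy = choose_focus(p_rx=0.5, regex_share=0.3, fanout=20,
                           audit_bytes=2048, tail_ratio=1.5)
log_heavy = choose_focus(p_rx=0.2, regex_share=0.1, fanout=120,
                         audit_bytes=200_000, tail_ratio=5.0)
```

Returning a list rather than a single label reflects that more than one rule can fire for the same profile.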
+ +--- + +## Optimizations with Impact + +| Measure | Expected Impact | Estimated Improvement | Implementation Complexity | Notes | +|---|---:|---:|---|---| +| Regex prefilter (literal/ACMP before `@rx`) | High | **15–35%** CPU | Medium | Reduces expensive regex calls on easy non-matches | +| Rule grouping + early exits per phase | High | **10–25%** latency/CPU | Medium | Strong effect with large rule volumes | +| Target scope hardening (reduce `V`) | High | **10–30%** CPU | Low–Medium | Directly reduces \(R\cdot V\cdot T\) multiplier | +| Transformation fusion/caching | Medium–High | **8–20%** CPU | Medium–High | Must preserve exact semantics | +| Lazy `FULL_REQUEST` materialization | Medium | **5–15%** CPU+RAM | Low | Avoids large string copies | +| Streaming/chunk body representation | Medium–High | **10–25%** RAM/CPU | High | Larger refactor, high long-term value | +| Selective/async audit logging | Medium | **5–20%** tail latency | Low–Medium | Fast operational win | +| Collection locking optimization (batch/shard) | Low–Medium | **3–10%** at high concurrency | Medium | Relevant for write-heavy rules | + +### Suggested rollout order + +1. **Quick wins (1–2 sprints):** logging reduction, target-scope hardening, lazy `FULL_REQUEST`. +2. **Mid-term (2–4 sprints):** regex prefilter, phase gating, transformation optimization. +3. **Long-term:** streaming body refactor + deeper locking redesign. diff --git a/doc/performance_audit_modsecurity_2026-04-03.md b/doc/performance_audit_modsecurity_2026-04-03.md new file mode 100644 index 0000000000..4e06b8019c --- /dev/null +++ b/doc/performance_audit_modsecurity_2026-04-03.md @@ -0,0 +1,219 @@ +# OWASP ModSecurity v3 – Technischer Performance-Audit (Code-fokussiert) + +Dieses Dokument fasst eine tiefgehende, codebasierte Performance-Analyse von libmodsecurity zusammen, mit Fokus auf Hot Paths, Regel-Engine, Speicherverhalten und Skalierung. 
+ +> English version: `doc/performance_audit_modsecurity_2026-04-03.en.md` + +## Scope + +Analysierte Kernpfade: +- `Transaction` Request/Response Lifecycle +- `RulesSet` + `RuleWithOperator` Evaluierung +- Request-Body-Prozessoren (URLENCODED/JSON/XML/MULTIPART) +- Regex- und Pattern-Matching Operatoren (`@rx`, `@pm`) +- Collection-Backends und Locking +- Audit-Logging und Serialisierung + +## Wichtigste Erkenntnisse (Kurzfassung) + +1. **Dominanter CPU-Pfad**: `RulesSet::evaluate()` → `RuleWithOperator::evaluate()` → Transformationen → Operator (`@rx`, `@pm`, …). +2. **Regex ist Hauptkostentreiber** bei CRS-lastigen Regelsets; Match-Limits sind vorhanden, aber nur Schadensbegrenzung. +3. **Signifikante String-/Copy-Kosten** in Request-Body- und Logging-Pfaden (`stringstream::str()`, Header-Konkatenation, JSON/Audit-Serialisierung). +4. **Multipart-Parsing** arbeitet byteweise mit hohem Branching- und State-Machine-Aufwand. +5. **Concurrency**: transaktionslokal weitgehend lock-frei; globale Collections nutzen `shared_mutex` und können bei write-lastigen Workloads kontendieren. + +## Potenzielle Optimierungen (Top) + +- Request-Body intern von `stringstream` auf chunk-/span-basierte Buffering-Strategie umstellen. +- `FULL_REQUEST` lazy oder feature-gated berechnen (TODO im Code vorhanden). +- Regex-basierte Regeln per Vorfilter (Literal prefilter, cheap guards) reduzieren. +- Transformation-Pipeline für häufige Kombinationen cachen/fusen. +- Audit-Logging asynchron und selektiv (Parts minimieren, JSON nur wenn nötig). 
+ +## Performance-Modell + +### 1) Kostenmodell pro Request + +Wir modellieren die gesamte CPU-Zeit pro Request als: + +\[ +C_{req} = C_{conn} + C_{parse} + C_{rules} + C_{log} + C_{sync} +\] + +mit: + +- \(C_{conn}\): Verbindungs-/Kontextkosten (klein, nahezu konstant) +- \(C_{parse}\): URI/Header/Body-Parsing +- \(C_{rules}\): Regelbewertung (dominant) +- \(C_{log}\): Audit-/Debug-Serialisierung +- \(C_{sync}\): Locking-/Contention-Kosten auf shared Collections + +Für die Regel-Engine: + +\[ +C_{rules} = \sum_{r=1}^{R} \left( V_r \cdot \left(\sum_{t=1}^{T_r} C_{trans}(t) + C_{op}(r)\right) + C_{actions}(r) \right) +\] + +Praktisch (aggregiert) gilt näherungsweise: + +\[ +C_{rules} \approx R \cdot V \cdot (T \cdot \bar c_{trans} + \bar c_{op}) + R \cdot \bar c_{actions} +\] + +wobei: +- \(R\): Anzahl aktiver Regeln pro Phase +- \(V\): durchschnittliche Anzahl Zielwerte pro Regel (z. B. ARGS, Header, Cookies) +- \(T\): mittlere Anzahl Transformationen pro Regel +- \(\bar c_{op}\): mittlere Operatorkosten + +Damit wird klar: **lineare Skalierung in \(R, V, T\)**, aber \(\bar c_{op}\) kann bei Regex nichtlinear eskalieren. + +### 2) Regex-Spezialmodell + +Für Regex-Regeln zerlegen wir \(\bar c_{op}\) in: + +\[ +\bar c_{op} = p_{rx}\cdot c_{rx} + (1-p_{rx})\cdot c_{other} +\] + +mit \(p_{rx}\) als Anteil regex-basierter Regeln. + +- **Best Case (JIT + frühes Match/Miss, wenig Backtracking):** + \[ + c_{rx}^{best} = O(n) + \] +- **Worst Case (katastrophales Backtracking):** + \[ + c_{rx}^{worst} = O(e^n) + \] + in der Praxis durch Match-Limits gedeckelt, aber weiterhin extrem teuer bis zum Abbruch. + +### 3) Big-O nach Subsystem + +- **Rule Evaluation gesamt:** + \[ + O\left(\sum_{r=1}^{R} V_r\cdot(T_r + O_r)\right) + \] + i. d. R. näherungsweise \(O(R\cdot V\cdot(T+O))\). 
+- **Parsing:** + - URI/Header/Cookies: \(O(H + Q)\) + - URL-Encoded Body: \(O(B)\) + - Multipart: \(O(B\cdot\kappa)\), \(\kappa\) = Zustands-/Boundary-Prüfaufwand +- **Transformation Pipeline:** + \[ + O(R\cdot V\cdot T\cdot L) + \] + mit \(L\) = durchschnittliche Stringlänge je Zielwert. + +--- + +## Messstrategie + +### 1) Experiment-Design (reproduzierbar) + +**Messmatrix (mindestens):** +- Regelset: Minimal / CRS PL1 / CRS PL2+ +- Payload: 1 KB / 16 KB / 256 KB / 2 MB +- Workload-Mix: static GET, JSON API, multipart upload +- Concurrency: 1, 8, 32, 128 +- Logging: aus / minimal / full audit + +**A/B-Varianten:** +- Baseline ohne WAF vs mit WAF +- WAF mit/ohne bestimmte Rule-Klassen (z. B. regex-lastige Gruppen) + +### 2) Tooling + +- **CPU Hotspots:** `perf record` + `perf report`, anschließend Flamegraph +- **Memory/Allokationen:** `valgrind --tool=massif`, optional `heaptrack` +- **Syscall/Locking:** `perf lock`, `strace -c` +- **Optional eBPF:** uprobes auf `RulesSet::evaluate`, `RuleWithOperator::evaluate`, `Regex::searchOneMatch`, `executeTransformations` + +### 3) Instrumentierung und Metriken + +Für jede Request-Phase (1–5 + logging) erfassen: + +\[ +T_{phase,i} = t_{end,i} - t_{start,i} +\] + +Zusätzlich pro Request: +- `T_regex_total`, `N_regex_calls`, `T_regex_avg` +- `T_trans_total`, `N_transforms` +- `alloc_bytes`, `alloc_count` +- `audit_bytes_written` + +### 4) KPIs (SLO-fähig) + +- **Latency:** p50 / p95 / p99 (end-to-end und pro Phase) +- **CPU:** cycles/request, instructions/request, IPC +- **Memory:** bytes/request, peak RSS, allocations/request +- **Skalierung:** throughput (RPS) vs concurrency, Saturation-Punkt + +### 5) Akzeptanzkriterien für Optimierungen + +Eine Optimierung gilt als erfolgreich, wenn über 3 Läufe (95%-Konfidenz): +- p95-Latenz mindestens 10% verbessert **oder** +- cycles/request mindestens 15% sinken +- ohne Regression der Block-/Detection-Rate (Sicherheits-Fidelity unverändert) + +--- + +## Priorisierung + +### 1) 
Quantifizierte Impact-Aufteilung (CPU-Anteil) + +**Best-Case (kleines Regelset, wenig Regex):** +- Regex Engine: **20–35%** +- Transformation Pipeline: **20–30%** +- Parsing: **15–25%** +- Logging: **5–15%** +- Locking/Synchronisation: **<5%** + +**Worst-Case (CRS-heavy, hohe Paranoia, große Bodies):** +- Regex Engine: **45–70%** +- Transformation Pipeline: **15–30%** +- Parsing (inkl. multipart): **10–20%** +- Logging: **5–15%** +- Locking/Synchronisation: **3–10%** + +### 2) Top-5 Bottlenecks mit Gewichtung + +Gewichtung = erwarteter Gesamt-CPU-Impact in produktionsnahen CRS-Setups. + +1. **Regex-Matching (`@rx`)** — **40%** +2. **Transformationen + multiMatch-Multiplikation** — **22%** +3. **Variablenexpansion / Zielwert-Fanout (V)** — **14%** +4. **Body-Parsing (v. a. multipart, große Payloads)** — **13%** +5. **Logging/Serialisierung + I/O** — **11%** + +Summe: 100%. + +### 3) Entscheidungsregel für Fokus + +Wenn \(p_{rx} > 0.35\) oder `T_regex_total / C_req > 0.4`, zuerst Regex-Optimierung. + +Wenn `N_transforms * V` sehr hoch ist, zuerst Transformations-/Target-Reduktion. + +Wenn `audit_bytes_written` hoch und Latenz tail-lastig ist, Logging zuerst reduzieren. 
+ +--- + +## Optimierungen mit Impact + +| Maßnahme | Erwarteter Impact | Geschätzte Verbesserung | Implementierungs-Komplexität | Hinweise | +|---|---:|---:|---|---| +| Regex-Prefilter (Literal/ACMP vor `@rx`) | Hoch | **15–35%** CPU | Mittel | Reduziert teure Regex-Aufrufe bei sicheren Non-Matches | +| Regelgruppierung + frühe Exits pro Phase | Hoch | **10–25%** Latenz/CPU | Mittel | Besonders wirksam bei großen Regelmengen | +| Target-Scope-Härtung (`V` senken, keine unnötigen Collections) | Hoch | **10–30%** CPU | Niedrig–Mittel | Direkte Multiplikatorreduktion in \(R\cdot V\cdot T\) | +| Transformation-Fusion/Caching häufiger Pipelines | Mittel–Hoch | **8–20%** CPU | Mittel–Hoch | Muss semantisch äquivalent bleiben | +| `FULL_REQUEST` lazy/materialize-on-demand | Mittel | **5–15%** CPU + RAM | Niedrig | Vermeidet große String-Kopien | +| Streaming-/Chunk-Body-Repräsentation statt `stringstream::str()` | Mittel–Hoch | **10–25%** RAM/CPU | Hoch | Größerer Refactor, aber hoher langfristiger Gewinn | +| Audit-Logging selektiv/asynchron (Parts minimal) | Mittel | **5–20%** Tail-Latency | Niedrig–Mittel | Schnell umsetzbar, operativ gut steuerbar | +| Collection-Locking optimieren (write batching, sharding) | Niedrig–Mittel | **3–10%** bei hoher Concurrency | Mittel | Nur relevant bei write-heavy Regeln | + +### Umsetzungsreihenfolge (Roadmap) + +1. **Quick Wins (1–2 Sprints):** Logging-Reduktion, Target-Scope-Härtung, `FULL_REQUEST` lazy. +2. **Mid-Term (2–4 Sprints):** Regex-Prefilter, Rule-Phasen-Gating, Transformationsoptimierung. +3. **Langfristig:** Streaming-Body-Refactor + tieferes Locking-Redesign. 
diff --git a/doc/performance_evaluation_modsecurity_v3_2026-04-05.de.md b/doc/performance_evaluation_modsecurity_v3_2026-04-05.de.md new file mode 100644 index 0000000000..9e0120a8aa --- /dev/null +++ b/doc/performance_evaluation_modsecurity_v3_2026-04-05.de.md @@ -0,0 +1,121 @@ +# OWASP ModSecurity (libmodsecurity v3) – Tiefgehende technische Performance-Evaluierung (Deutsch) + +> Scope: Engine-Verhalten in libmodsecurity v3 mit quantifizierten Schätzungen für CPU, RAM, I/O und Skalierung in typischen Enterprise-CRS-Deployments. + +## 0) Executive Summary + +1. **Das primäre Kostenzentrum ist die Regel-Ausführungsfächerung**: die effektiven CPU-Kosten skalieren näherungsweise mit `R × V × (T + O)` (Regeln × Zielwerte × (Transformationen + Operatorkosten)), wobei regex-lastige Operatoren in CRS-ähnlichen Setups dominieren. +2. **Der größte Bottleneck ist die Regex-Auswertung (`@rx`) in transformationslastigen Pipelines**, insbesondere bei hoher Payload-Entropie und stark überlappenden Regeln. +3. **Das größte systemische Risiko ist ein Tail-Latency-Kollaps bei hoher Parallelität**, wenn Regex-Last, Body-Parsing und synchrones Audit-Logging gleichzeitig auftreten. +4. **Der beste Quick Win ist die Reduktion von Regex-Aufrufen** (Prefilter/Literal-Gating + engerer Target-Scope), mit typischerweise **15–35% CPU-Reduktion** und **10–25% Verbesserung bei p95-Latenz** in regelintensiven Profilen. +5. **Operativ ist der Logging-Modus ein Hebel erster Ordnung**: der Wechsel von vollständigem synchronem Audit-Logging auf selektive/asynchrone Modi kann p99-Aufblähungen um **5–20%** reduzieren. + +## 1) CPU Performance Analysis + +### 1.1 Komplexität und Hotspots +- Dominanter Ausführungspfad: Transaktions-Phasenloop -> Regelbewertung pro Regel -> Variablen-Extraktion pro Ziel -> Transformationskette -> Operatorausführung. 
+- Näherungsmodell für CPU pro Request: + - `C_total ≈ C_parse + Σ_r Σ_v (Σ_t C_trans + C_op) + C_logging + C_sync` + - Kompakt: `C_total ≈ C_parse + R·V·(T·c_trans + c_op) + C_logging + C_sync`. +- Hotspots (absteigend): + 1. Ausführung von `@rx` inkl. Capture-Handling. + 2. Transformationspipeline (insbesondere mit Multi-Match-Semantik). + 3. Variablen-Fanout und wiederholte Evaluation über große Collections. + 4. Multipart-/zustandsbehaftetes Body-Parsing bei großen Requests. + +### 1.2 Geschätzte CPU-Verteilung (regelintensives CRS-Profil) +- Regex-Engine: **45–70%** (Worst Case), **20–35%** (Best Case mit Tuning). +- Transformationen: **15–30%**. +- Parsing (URI/Header/Body): **10–25%**. +- Logging-Serialisierung im Hot Path: **5–15%**. +- Locking/Sync-Overhead: **3–10%** (höher bei write-lastiger Collection-Nutzung). + +### 1.3 Verhalten von PCRE/JIT/Backtracking +- Bei stabilen Patterns und JIT-freundlichem Traffic: für die meisten Matches/Misses näherungsweise lineares Laufzeitverhalten. +- Bei adversarialen Inputs / pathologischer Überlappung: Backtracking dominiert und kann Tail-Latenz stark erhöhen, bis Match-Limits greifen. +- Best-Case-Regex-Beitrag: **~0.5–2.0 µs pro Aufruf** (einfache verankerte Muster, warmer Instruction Cache). +- Worst-Case-Regex-Beitrag: **10–1000+ µs pro Aufruf** bis zum limitbedingten Abbruch in pathologischen Fällen. + +## 2) RAM / Memory Analysis + +### 2.1 Speicherprofil pro Request +- Durchschnittsrequest (kleiner Body, moderate Header): **~80–300 KB/Request** zusätzlicher Working Set. +- Regelintensives Profil mit Body-Inspection: **~300 KB–1.5 MB/Request**. +- Große Multipart-Uploads (Buffering + Metadaten + Tempfile-Verwaltung): **~1.5–8 MB/Request** transiente Peaks. + +### 2.2 Haupttreiber für Speicherverbrauch +- String-Duplizierung beim Aufbau vollständiger Requests und bei Body-Extraktion. +- Temporäre Objekte: Variablen-Vektoren, Capture-Container, Kopien transformierter Werte. 
+- Wiederholte Allokationen in per-Regel/per-Variable-Evaluationsschleifen. +- Multipart-Parser-Buffer und Metadatenstrukturen für temporäre Dateien. + +### 2.3 Peak vs. Average und Skalierung +- Der Durchschnittsspeicher skaliert näherungsweise mit der aktiven Variablenkardinalität und der Body-Größe. +- Peak-Speicher skaliert überproportional, wenn großes Body-Buffering, hohe Parallelität und ausführliches Logging überlappen. +- Bei Parallelität `N` gilt praktisch: + - `Mem_total ≈ Mem_base + N × Mem_req_avg + Mem_fragmentation` + - Fragmentierung/Allocator-Druck kann bei Burst-Lasten **10–30%** Overhead erzeugen. + +## 3) I/O Performance Analysis + +### 3.1 Audit-Logging und Disk-I/O +- Vollständiges Audit-Logging kann bei blockiertem/hochdetailliertem Traffic zur dominanten I/O-Senke werden. +- Geschätzte Audit-Nutzlast pro Request: + - Minimalmodus: **0.5–3 KB** + - Vollmodus mit Body/Headern: **5–100+ KB** (payloadabhängig) +- Bei hoher RPS können synchrone Writes Queueing erzeugen und p99-Latenz erhöhen. + +### 3.2 Tempfile- und Upload-Pfad-I/O +- Multipart mit Dateiinspektion erzeugt Disk-Druck über den Tempfile-Lebenszyklus. +- Bei dauerhaftem Upload-Traffic können IOPS und fsync-Verhalten limitieren, bevor CPU saturiert. + +### 3.3 Netzwerk-Overhead (Reverse-Proxy-Deployments) +- Reverse-Proxy-Topologien fügen zusätzlichen Hop und zusätzliche Buffering-Domänen hinzu. +- Typischer zusätzlicher Netzwerk-/Service-Overhead: **+0.2–2.0 ms** Median, **+1–10 ms** im Tail, abhängig von Topologie und TLS-Platzierung. + +### 3.4 Einfluss von synchronem vs asynchronem Logging +- Synchrones Logging: stärkere Konsistenz, höhere Latenzkopplung. +- Asynchrones Logging: bessere Entkopplung der Tail-Latenz, mögliches Verlustfenster bei Crash-Szenarien. +- Erwarteter Effekt von async + selektiven Parts: **Throughput +5–20%**, **p99-Latenz -5–25%**. 
+ +## 4) System Interaction (CPU + RAM + I/O) + +- Bottlenecks verstärken sich multiplikativ statt additiv: + - Höhere Regex-Zeiten erhöhen die Request-Residency -> größerer gleichzeitiger Speicher-Footprint -> höherer Allocator-Druck -> längere Queues bei synchronem Logging. +- Realverhalten unter Burst-Last: + - Phase-2-Body-Parsing und regex-lastige Phasen erhöhen CPU-Stall-Zeiten. + - Gleichzeitiges Logging/Datei-I/O erzeugt Scheduler- und Storage-Contention. +- Effekt hoher Parallelität: + - Throughput skaliert bei niedriger Parallelität nahezu linear und knickt dann am Sättigungspunkt von Regex + I/O ab. + - Typischer Knee-Bereich in regelintensiven Profilen: **16–64 Worker/Threads**, umgebungsabhängig. + +## 5) Performance Heatmap + +| Kategorie | Score (0-10) | Impact | Bottleneck Severity | +|---|---:|---|---| +| CPU | 4.5 | Sehr hoch | Kritisch | +| RAM | 6.0 | Mittel-Hoch | Major | +| I/O | 5.5 | Hoch bei Audit-/Upload-lastigen Profilen | Major | + +### Interpretation +- **CPU ist in den meisten CRS-nahen Deployments die limitierende Dimension** durch Regex-/Transformationsverstärkung. +- **RAM ist bei Durchschnittstraffic beherrschbar**, zeigt jedoch steile transiente Peaks bei Überlappung von großen Payloads und hoher Parallelität. +- **I/O ist szenariosensitiv**: bei minimalem Logging meist unkritisch, bei vollem Audit-Logging und Upload-Inspektion jedoch häufig Bottleneck erster Ordnung. 
+ +## 6) Optimization Impact Table + +| Optimierung | Bereich | Impact (%) | Aufwand | Priorität | +|---|---|---:|---|---| +| Regex-Prefilter (Literal/ACMP vor `@rx`) | CPU | 15–35 | Mittel | P0 | +| Target-Scope-Reduktion (`V` minimieren) | CPU/RAM | 10–30 | Niedrig-Mittel | P0 | +| Transformationspipeline reduzieren/fusen | CPU | 8–20 | Mittel-Hoch | P1 | +| Lazy Full-Request-Materialisierung | RAM/CPU | 5–15 | Niedrig | P1 | +| Selektives + asynchrones Audit-Logging | I/O/CPU | 5–25 | Niedrig-Mittel | P0 | +| Multipart-Buffering-Optimierung / Streaming-Strategie | RAM/I/O | 10–25 | Hoch | P1 | +| Collection-Write-Path-Contention-Tuning | CPU | 3–10 | Mittel | P2 | + +## 7) Final Score + +- **Gesamt-Performance-Score: 5.3 / 10** (sicherheitswirksam, aber ohne konsequentes Tuning kostenintensiv). +- **Einsetzen, wenn:** tiefe, regelbasierte Inspektionsqualität und flexible Policy-Steuerung benötigt werden und Tuning/Benchmarking-Kapazität vorhanden ist. +- **Nicht einsetzen bzw. stark begrenzen, wenn:** ultra-niedrige Latenzbudgets (<2–5 ms), sehr hohe Parallelität mit großen Payloads oder geringe Observability-/Tuning-Reife vorliegen. diff --git a/doc/performance_evaluation_modsecurity_v3_2026-04-05.en.md b/doc/performance_evaluation_modsecurity_v3_2026-04-05.en.md new file mode 100644 index 0000000000..2e3d9f53c9 --- /dev/null +++ b/doc/performance_evaluation_modsecurity_v3_2026-04-05.en.md @@ -0,0 +1,121 @@ +# OWASP ModSecurity (libmodsecurity v3) – Deep Technical Performance Evaluation (English) + +> Scope: Engine-level behavior in libmodsecurity v3, with quantified estimates for CPU, RAM, I/O and scaling under typical enterprise CRS deployments. + +## 0) Executive Summary + +1. **The primary cost center is rule execution fanout**: effective CPU cost scales approximately with `R × V × (T + O)` (rules × target values × (transformations + operator cost)), with regex-heavy operator sets dominating runtime in CRS-like deployments. +2. 
**Biggest bottleneck is regex evaluation (`@rx`) under transformation-heavy pipelines**, especially when payload entropy and rule overlap are high. +3. **Biggest systemic risk is tail-latency collapse at high concurrency** when regex pressure, body parsing, and synchronous audit logging coincide. +4. **Best quick win optimization is regex call reduction** (prefilter/literal gating + target scope reduction), typically yielding **15–35% CPU reduction** and **10–25% p95 latency improvement** in rule-heavy profiles. +5. **Operationally, logging mode is a first-order tuning lever**: moving from full synchronous audit logging to selective/asynchronous modes can reduce p99 inflation by **5–20%**. + +## 1) CPU Performance Analysis + +### 1.1 Complexity and hotspots +- Dominant execution path: transaction phase loop -> per-rule evaluation -> per-variable extraction -> transformation chain -> operator execution. +- Approximate request CPU model: + - `C_total ≈ C_parse + Σ_r Σ_v (Σ_t C_trans + C_op) + C_logging + C_sync` + - In compact form: `C_total ≈ C_parse + R·V·(T·c_trans + c_op) + C_logging + C_sync`. +- Hotspots (descending): + 1. `@rx` operator execution and capture handling. + 2. Transformation pipeline (especially with multi-match semantics). + 3. Variable fanout and repeated evaluation over large collections. + 4. Multipart/stateful body parsing for large requests. + +### 1.2 Estimated CPU distribution (rule-heavy CRS profile) +- Regex engine: **45–70%** (worst-case), **20–35%** (best-case tuned profile). +- Transformations: **15–30%**. +- Parsing (URI/headers/body): **10–25%**. +- Logging serialization in hot path: **5–15%**. +- Locking/sync overhead: **3–10%** (higher in write-heavy collection usage). + +### 1.3 PCRE/JIT/backtracking behavior +- With stable patterns and JIT-friendly traffic: near-linear behavior at runtime for most matches/misses. 
+- Under adversarial inputs / pathological overlap: backtracking can dominate and explode tail latency until match limits cut execution. +- Best-case regex contribution: **~0.5–2.0 µs per call** (simple anchored patterns, warm instruction cache). +- Worst-case regex contribution: **10–1000+ µs per call** before limit-triggered abort in pathological cases. + +## 2) RAM / Memory Analysis + +### 2.1 Per-request memory profile +- Average request (small body, moderate headers): **~80–300 KB/request** incremental working set. +- Rule-heavy + body-inspection profile: **~300 KB–1.5 MB/request**. +- Large multipart uploads (buffering + metadata + temp-file bookkeeping): **~1.5–8 MB/request** transient peak. + +### 2.2 Major memory drivers +- String duplication around full request composition and body extraction. +- Temporary objects: variable vectors, capture containers, transformed value copies. +- Repeated allocations for per-rule/per-variable evaluation loops. +- Multipart parser buffers and temporary file metadata structures. + +### 2.3 Peak vs average and scaling +- Average memory scales roughly with active variable cardinality and body size. +- Peak memory scales superlinearly when large body buffering overlaps with high concurrency and verbose logging. +- At concurrency `N`, practical envelope: + - `Mem_total ≈ Mem_base + N × Mem_req_avg + Mem_fragmentation` + - Fragmentation/allocator pressure can add **10–30%** overhead under bursty loads. + +## 3) I/O Performance Analysis + +### 3.1 Audit logging and disk I/O +- Full audit logging can become the dominant I/O sink for blocked/high-detail traffic. +- Estimated per-request audit payload: + - Minimal mode: **0.5–3 KB** + - Full body/headers mode: **5–100+ KB** (payload dependent) +- At high RPS, synchronous writes can induce queueing and elevate p99 latency. + +### 3.2 Temp-file and upload path I/O +- Multipart with file inspection creates disk pressure via temp-file lifecycle. 
+- Under sustained upload traffic, IOPS and fsync behavior may become limiting before CPU saturates. + +### 3.3 Network overhead (reverse-proxy deployments) +- Reverse-proxy topologies add an extra hop and buffering domain. +- Typical additional network/service overhead: **+0.2–2.0 ms** median, **+1–10 ms** tail depending on topology and TLS placement. + +### 3.4 Sync vs async logging impact +- Sync logging: stronger consistency, higher latency coupling. +- Async logging: better tail-latency isolation, potential loss window under crash scenarios. +- Expected effect of async + selective parts: **throughput +5–20%**, **p99 latency -5–25%**. + +## 4) System Interaction (CPU + RAM + I/O) + +- Bottlenecks amplify multiplicatively, not additively: + - Higher regex time increases request residency -> higher concurrent memory footprint -> higher allocator pressure -> longer queues for synchronous logging. +- Real-world behavior under burst load: + - Phase-2 body parsing and regex-heavy phase execution increase CPU stall time. + - Concurrent logging/file I/O induces scheduler and storage contention. +- High concurrency effect: + - Throughput scales near-linearly at low concurrency, then flattens at the knee where regex and I/O saturate. + - Typical knee region in rule-heavy profiles: **16–64 workers/threads**, environment-dependent. + +## 5) Performance Heatmap + +| Category | Score (0-10) | Impact | Bottleneck Severity | +|---|---:|---|---| +| CPU | 4.5 | Very High | Critical | +| RAM | 6.0 | Medium-High | Major | +| I/O | 5.5 | High in audit/upload-heavy profiles | Major | + +### Interpretation +- **CPU is the limiting dimension** in most CRS-grade deployments due to regex/transformation amplification. +- **RAM is manageable in average traffic** but shows steep transient peaks when large payloads overlap with high concurrency. +- **I/O is scenario-sensitive**: benign under minimal logging, but can become a first-order bottleneck with full audit and upload inspection. 
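The sync vs. async logging trade-off from section 3.4 can be sketched with a bounded queue and a single writer thread. This is a generic illustration in Python, not how libmodsecurity implements audit logging: `AsyncAuditLog`, `sink`, and `max_pending` are hypothetical names, and dropping records when the queue is full is one possible policy (it models the "loss window" mentioned above; blocking the caller would be the alternative).

```python
import queue
import threading


class AsyncAuditLog:
    """Hypothetical async logger: request threads enqueue, one worker writes."""

    def __init__(self, sink, max_pending=1024):
        self._queue = queue.Queue(maxsize=max_pending)
        self._sink = sink          # callable that performs the slow write
        self.dropped = 0           # records lost because the queue was full
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record):
        # Never block the request path: if the writer falls behind,
        # the record is counted as dropped instead of delaying the caller.
        # The counter is not synchronized; this sketch assumes one producer.
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is None:     # sentinel: stop the worker
                break
            self._sink(record)

    def close(self):
        # Blocking put so the sentinel is always delivered, then wait
        # until all previously queued records have been written.
        self._queue.put(None)
        self._worker.join()


written = []
audit = AsyncAuditLog(written.append)
for i in range(3):
    audit.log(f"tx-{i}")
audit.close()
```

The request path only pays for an enqueue, which is what decouples tail latency from disk writes; the cost is that records still in the queue are lost if the process crashes before `close()` completes.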
+ +## 6) Optimization Impact Table + +| Optimization | Area | Impact (%) | Effort | Priority | +|---|---|---:|---|---| +| Regex prefilter (literal/ACMP gate before `@rx`) | CPU | 15–35 | Medium | P0 | +| Target scope reduction (`V` minimization) | CPU/RAM | 10–30 | Low-Medium | P0 | +| Transformation pipeline reduction/fusion | CPU | 8–20 | Medium-High | P1 | +| Lazy full-request materialization | RAM/CPU | 5–15 | Low | P1 | +| Selective + async audit logging | I/O/CPU | 5–25 | Low-Medium | P0 | +| Multipart buffering optimization / streaming strategy | RAM/I/O | 10–25 | High | P1 | +| Collection write-path contention tuning | CPU | 3–10 | Medium | P2 | + +## 7) Final Score + +- **Overall performance score: 5.3 / 10** (security-effective, but cost-intensive without strict tuning). +- **Use when:** deep rule-level inspection quality and policy flexibility are required, and tuning/benchmarking capacity exists. +- **Avoid or strongly gate when:** ultra-low-latency budgets (<2–5 ms), very high concurrency with large payloads, or limited observability/tuning maturity.