| 1 | +--- |
| 2 | +title: "Pyrl" |
| 3 | +weight: 1 |
| 4 | +bookCollapseSection: true |
| 5 | +--- |
| 6 | + |
| 7 | +# Pyrl |
| 8 | + |
| 9 | +Pyrl (pronounced "Pearl") is the **first automated detection tool** for Python class pollution vulnerabilities. It uses a novel static analysis technique called *operational taint analysis* implemented on top of CodeQL. |
| 10 | + |
| 11 | +## What Pyrl Does |
| 12 | + |
| 13 | +Pyrl tracks attacker-controlled inputs through "get" and "set" primitives using fine-grained semantic taint labels that capture: |
| 14 | +- **T_INPUT** — Direct attacker input |
| 15 | +- **T_ENUM** — Enumerable value from split operations |
| 16 | +- **T_KEY** — Potential key value from enumeration |
| 17 | +- **T_OBJ** — Object resolved through a tainted key |
| 18 | +- **G_ATTR** / **G_ITEM** — Access type annotations (attribute vs. item) |
| 19 | + |
| 20 | +## Key Features |
| 21 | + |
| 22 | +- Detects all **6 vulnerability types** in the taxonomy |
| 23 | +- Handles both first-order and second-order get operations |
| 24 | +- Performs **exploitability checking** (verifies that both assignments in a Dual-Set finding sit in mutually exclusive branches) |
| 25 | +- Uses **barrier node analysis** to reduce false positives (key sanitization, type checks) |
| 26 | +- Scales to large codebases (analysis cost grows linearly with the number of AST nodes) |
| 27 | + |
| 28 | +## Performance |
| 29 | + |
| 30 | +- **868** total alerts across 671K+ Python projects |
| 31 | +- **47** confirmed true positive zero-day vulnerabilities |
| 32 | +- **38%** false positive rate (compared with 78-97% for baseline approaches) |
| 33 | +- Analysis time: typically under 2 minutes per package |
| 34 | + |
| 35 | +## Architecture |
| 36 | + |
| 37 | +``` |
| 38 | +┌──────────────────────────────────────────────────┐ |
| 39 | +│ Pyrl Pipeline │ |
| 40 | +├──────────────────────────────────────────────────┤ |
| 41 | +│ │ |
| 42 | +│ 1. Package Download & Database Setup │ |
| 43 | +│ └─ CodeQL database creation │ |
| 44 | +│ │ |
| 45 | +│ 2. Operational Taint Analysis │ |
| 46 | +│ ├─ Taint Initialization (INPUT rule) │ |
| 47 | +│ ├─ Taint Propagation (SPLIT, ENUMERATE, │ |
| 48 | +│ │ GETITEM, GETATTR, BRANCH rules) │ |
| 49 | +│ └─ Taint Merging (at control-flow joins) │ |
| 50 | +│ │ |
| 51 | +│ 3. Vulnerability Detection │ |
| 52 | +│ ├─ Sink identification (assignment tuples) │ |
| 53 | +│ ├─ Label condition checking (Table 5) │ |
| 54 | +│ └─ Type classification (6 types) │ |
| 55 | +│ │ |
| 56 | +│ 4. Exploitability Checking │ |
| 57 | +│ ├─ Mutual exclusion verification │ |
| 58 | +│ └─ Barrier node / dominator analysis │ |
| 59 | +│ │ |
| 60 | +│ 5. Result Processing │ |
| 61 | +│ └─ Report generation with taint flow paths │ |
| 62 | +│ │ |
| 63 | +└──────────────────────────────────────────────────┘ |
| 64 | +``` |
| 65 | + |
| 66 | +## Implementation |
| 67 | + |
| 68 | +- Written in **CodeQL** (QL language) — 3,509 lines of new code |
| 69 | +- Runs on CodeQL v2.21.3 with Python language support v4.0.5 |
| 70 | +- Extends the CodeQL standard library with support for: |
| 71 | + - Collection data structures (`namedtuple`, `reduce`, etc.) |
| 72 | + - Object attribute definition resolution |
| 73 | + - Data flow through higher-order functions |
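
For readers unfamiliar with class pollution, the taint labels listed under "What Pyrl Does" can be pictured on a small Python example. This is an illustrative sketch only, not code from Pyrl or from any analyzed package: the `User` class, `set_by_path`, and `set_by_path_checked` are hypothetical names, and the label annotations in the comments are one plausible reading of how the labels would apply to this pattern.

```python
class User:
    """Hypothetical application class, used only for this illustration."""

    def __init__(self, name):
        self.name = name


def set_by_path(root, dotted_path, value):
    """Walk a dotted path from `root` and assign `value` at the end.

    Label comments are an assumed mapping, not Pyrl output.
    """
    keys = dotted_path.split(".")      # T_INPUT is split -> T_ENUM
    obj = root
    for key in keys[:-1]:              # each element -> T_KEY
        if isinstance(obj, dict):
            obj = obj[key]             # T_OBJ via item access (G_ITEM)
        else:
            obj = getattr(obj, key)    # T_OBJ via attribute access (G_ATTR)
    last = keys[-1]
    # The two assignments below are in mutually exclusive branches.
    if isinstance(obj, dict):
        obj[last] = value              # "set" primitive on an item
    else:
        setattr(obj, last, value)      # "set" primitive on an attribute


def set_by_path_checked(root, dotted_path, value):
    """Same walk, guarded by a simple key check (a sanitizer-style barrier)."""
    if any(key.startswith("_") for key in dotted_path.split(".")):
        raise ValueError("private or dunder keys are not allowed")
    set_by_path(root, dotted_path, value)


if __name__ == "__main__":
    victim = User("guest")
    # The attacker-chosen path climbs from the instance to its class.
    set_by_path(victim, "__class__.is_admin", True)
    # Every User now carries the polluted class attribute.
    print(User("someone else").is_admin)  # prints True
```

The unchecked function lets a path such as `__class__.is_admin` escape the instance and modify the class itself, which is the kind of flow the get and set primitives describe. The checked variant shows the sort of key sanitization that a barrier node analysis would presumably treat as blocking the flow.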