Commit 33b4eff

Add a CLAUDE.md
1 parent 99490e8 commit 33b4eff

1 file changed

Lines changed: 195 additions & 0 deletions

CLAUDE.md

@@ -0,0 +1,195 @@
# PolyFile Development Guide

## Project Overview

PolyFile is a file analysis utility that identifies and maps the semantic and syntactic structure of files, including polyglots, chimeras, and "schizophrenic" files that are simultaneously valid as multiple file types.

**Key capabilities:**
- Pure-Python libmagic implementation (263+ MIME types)
- Recursive embedded file detection (like binwalk)
- Parsers for PDF, ZIP, JPEG, iNES, and 188 Kaitai Struct formats
- Interactive HTML hex viewer with structure mapping
- Drop-in replacement for the Unix `file` command

Part of the [ALAN Parsers Project](https://github.com/trailofbits/polyfile#the-alan-parsers-project) alongside PolyTracker.

## Architecture

### Core Pattern: Matchers + Parsers

**Matchers** classify file types. There are two kinds:
- libmagic DSL matchers (defined in `polyfile/magic_defs/`)
- Python matchers (classes extending `Matcher`)

**Parsers** create AST representations of file structure. They are registered via a decorator:
```python
from polyfile import register_parser

@register_parser("application/pdf")
def parse_pdf(file_stream, match):
    # Return parsed structure
    ...
```

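The decorator-based registration described above can be illustrated with a small standalone sketch. It does not reproduce PolyFile's internals: `PARSERS`, `register_parser_sketch`, `parse_example`, and `parse` are hypothetical names, and the parsers here take raw bytes rather than a file stream and match object.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical stand-in for a parser registry; names are illustrative only.
ParserFunction = Callable[[bytes], Iterator[str]]
PARSERS: Dict[str, List[ParserFunction]] = {}


def register_parser_sketch(*mime_types: str) -> Callable[[ParserFunction], ParserFunction]:
    """Return a decorator that files the function under each given MIME type."""
    def decorator(func: ParserFunction) -> ParserFunction:
        for mime_type in mime_types:
            PARSERS.setdefault(mime_type, []).append(func)
        return func  # the decorated function itself is left unchanged
    return decorator


@register_parser_sketch("application/x-example")
def parse_example(data: bytes) -> Iterator[str]:
    # Yield a label for each structural element found in the data.
    yield f"header ({min(len(data), 4)} bytes)"


def parse(mime_type: str, data: bytes) -> List[str]:
    """Run every parser registered for mime_type and collect the results."""
    results: List[str] = []
    for parser in PARSERS.get(mime_type, []):
        results.extend(parser(data))
    return results
```

Keying the registry by MIME type means the matching stage only needs to produce a type string; any number of parsers can then attach to that type without the matcher knowing about them.
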
### Key Modules

| Module | Purpose | Size |
|--------|---------|------|
| `polyfile.py` | Core engine: `Match`, `Parser`, `Submatch` classes | 15 KB |
| `magic.py` | Pure-Python libmagic DSL implementation | 117 KB |
| `pdf.py` | PDF parser with embedded file detection | 48 KB |
| `debugger.py` | Interactive GDB-style debugger | 42 KB |
| `kaitaimatcher.py` | Kaitai Struct format bridge | |

### Directory Structure

```
polyfile/
├── polyfile/                  # Main package
│   ├── magic_defs/            # 354 libmagic definition files
│   ├── kaitai/parsers/        # 188 auto-generated Kaitai parsers (excluded from lint)
│   └── templates/             # HTML output templates
├── polymerge/                 # Companion merge tool
├── tests/                     # Test suite
├── docs/                      # Extension guide, JSON format spec
└── kaitai_struct_formats/     # Git submodule with KSY definitions
```

## Development Commands

### Setup
```bash
# Install from source (requires Java for Kaitai compiler)
# The quotes keep shells such as zsh from globbing the brackets
pip install -e ".[dev]"

# Install from PyPI
pip install polyfile
```

### Linting
```bash
# Run flake8 (excludes auto-generated kaitai parsers)
flake8 polyfile polymerge --max-complexity=10 --max-line-length=127 \
    --exclude=polyfile/kaitai/parsers
```

### Testing
```bash
# Run all tests
pytest tests

# Run specific test file
pytest tests/test_magic.py
pytest tests/test_pdf.py
pytest tests/test_corkami.py  # Polyglot corpus
```

### Security Audit
```bash
pip-audit
```

## Code Navigation

### Finding Matchers
```bash
# Find libmagic definitions by MIME type
rg "application/pdf" polyfile/magic_defs/

# Find Python matchers
ast-grep --pattern 'class $NAME(Matcher): $$$' --lang py polyfile/
```

### Finding Parsers
```bash
# Find registered parsers
rg "@register_parser" polyfile/

# Find parser for specific MIME type
rg 'register_parser.*application/zip' polyfile/
```

### Key Entry Points
- CLI: `polyfile/__main__.py`
- Core analysis: `polyfile/polyfile.py:PolyFile.struc()`
- Magic matching: `polyfile/magic.py:MagicMatcher.match()`

## Testing

### Test Structure
```
tests/
├── test_magic.py       # libmagic implementation vs corpus
├── test_pdf.py         # PDF parsing
├── test_corkami.py     # Polyglot/chimera edge cases
├── test_kaitai.py      # Kaitai format tests
└── unit/
    ├── test_ast.py     # AST utilities
    └── test_http.py    # HTTP protocol parsing
```

### Test Conventions
- Uses real file corpus including libmagic's official test suite
- Tests polyglot files to verify multi-type detection
- Parser tests validate structure extraction

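A test in the spirit of the multi-type detection convention above might look like the following sketch. It deliberately avoids PolyFile's own API: `build_pdf_zip_polyglot` and `detect_types` are hypothetical helpers built on the standard library, and the toy detector only checks a PDF header and a ZIP end-of-central-directory record.

```python
import io
import zipfile
from typing import Set


def build_pdf_zip_polyglot() -> bytes:
    """Build a tiny in-memory file that starts like a PDF and ends as a ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("hello.txt", "hi")
    # ZIP readers locate the archive from the end, so prepended bytes are tolerated.
    return b"%PDF-1.4\n" + buffer.getvalue()


def detect_types(data: bytes) -> Set[str]:
    """Toy detector (not PolyFile's matcher): report every format whose signature fits."""
    found: Set[str] = set()
    if b"%PDF-" in data[:1024]:
        found.add("application/pdf")
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.add("application/zip")
    return found


def test_polyglot_is_detected_as_both_types() -> None:
    detected = detect_types(build_pdf_zip_polyglot())
    assert detected == {"application/pdf", "application/zip"}
```

The important property is that the assertion checks the full set of detected types, so a detector that stops at the first match would fail the test.
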
## Extending PolyFile

### Adding a Custom Matcher (Python)
```python
from polyfile import Matcher, Match

class MyMatcher(Matcher):
    def match(self, data: bytes) -> Match | None:
        if data.startswith(b'MAGIC'):
            return Match(
                mime_type="application/x-myformat",
                name="My Format",
                offset=0,
                length=len(data)
            )
        return None
```

### Adding a Custom Parser
```python
from typing import Iterator

from polyfile import register_parser, Parser, Submatch

@register_parser("application/x-myformat")
class MyParser(Parser):
    def parse(self, file_stream, match) -> Iterator[Submatch]:
        # Yield Submatch objects representing structure
        yield Submatch(
            name="header",
            start=0,
            length=8,
            value=file_stream.read(8)
        )
```

### Adding a Kaitai Struct Format
1. Add the `.ksy` file to `kaitai_struct_formats/`
2. Map its MIME type in `polyfile/kaitai/parsers/__init__.py`
3. Rebuild: `python compile_kaitai_parsers.py`

See `docs/extending_polyfile.md` for a detailed guide.

## Internal API Patterns

### File I/O
- Use the `FileStream` abstraction for seeking/reading
- Use `PathOrStdin`/`PathOrStdout` for CLI flexibility

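The idea behind a path-or-stdin helper can be sketched as a context manager that always hands the caller a real filesystem path. This is an illustration under assumptions, not the library's `PathOrStdin`: the name `path_or_stdin_sketch`, the `"-"` convention, and the injectable `stdin` parameter are all hypothetical.

```python
import os
import sys
import tempfile
from contextlib import contextmanager
from typing import BinaryIO, Iterator, Optional


@contextmanager
def path_or_stdin_sketch(path: str, stdin: Optional[BinaryIO] = None) -> Iterator[str]:
    """Yield a real filesystem path for `path`, where "-" means standard input."""
    if path != "-":
        yield path  # already a regular path; nothing to clean up
        return
    source = stdin if stdin is not None else sys.stdin.buffer
    # Buffer the stream into a temporary file so callers can seek freely.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(source.read())
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)
```

Buffering standard input to a temporary file costs a copy, but it lets the rest of the code treat every input as seekable.
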
### Match Hierarchy
- `Match` → top-level file type match
- `Submatch` → nested structure within a match
- Build trees for embedded files (ZIP contents, PDF streams)

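The tree shape described above can be pictured with a minimal standalone sketch. `Node` is a hypothetical stand-in and does not mirror the real `Match` or `Submatch` constructors; it only shows how offsets relative to a parent resolve to file-level positions when embedded files nest.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a match: a named byte range with children."""
    name: str
    offset: int  # relative to the parent node's start
    length: int
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, name: str, offset: int, length: int) -> "Node":
        """Attach a nested range and return it so calls can be chained."""
        child = Node(name, offset, length, parent=self)
        self.children.append(child)
        return child

    @property
    def absolute_offset(self) -> int:
        # Sum relative offsets up the chain to get the file-level position.
        base = self.parent.absolute_offset if self.parent else 0
        return base + self.offset

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# A PDF embedded inside a ZIP entry, 130 bytes into the outer file.
archive = Node("application/zip", offset=0, length=1024)
entry = archive.add("LocalFileHeader", offset=100, length=300)
embedded = entry.add("application/pdf", offset=30, length=270)
```

Storing offsets relative to the parent keeps each parser unaware of where its input sits in the outer file; the absolute position is recovered only when needed, for example when rendering a hex view.
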
### Error Handling
- Raise `InvalidMatch` when a parser cannot process data
- Matchers return `None` for non-matching data

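The split between the two failure signals can be sketched as follows. Everything here is hypothetical: `InvalidMatchSketch` stands in for `InvalidMatch`, and the choice in `analyze` to keep the type match while dropping the structure is an assumption for illustration, not documented PolyFile behavior.

```python
from typing import List, Optional


class InvalidMatchSketch(ValueError):
    """Hypothetical stand-in for the library's InvalidMatch exception."""


def match_magic(data: bytes) -> Optional[str]:
    # A matcher signals "not my format" by returning None, not by raising.
    return "application/x-myformat" if data.startswith(b"MAGIC") else None


def parse_header(data: bytes) -> List[str]:
    # A parser raises when the type matched but the contents are malformed.
    if len(data) < 8:
        raise InvalidMatchSketch("truncated header: need 8 bytes")
    return [f"header={data[:8]!r}"]


def analyze(data: bytes) -> List[str]:
    """Combine both stages, tolerating parser failures."""
    mime_type = match_magic(data)
    if mime_type is None:
        return []  # no match at all
    try:
        return [mime_type] + parse_header(data)
    except InvalidMatchSketch:
        return [mime_type]  # keep the type match, drop the structure
```

Returning `None` keeps the common "no match" path cheap, while an exception carries a reason for the rarer case where data looked right but could not be parsed.
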
### Gotchas
- `polyfile/kaitai/parsers/` is auto-generated; never edit it manually
- Java is required at install time for Kaitai compilation
- The libmagic DSL has quirks; see the [blog post](https://blog.trailofbits.com/2022/07/01/libmagic-the-blathering/)
