Commit 78ef8b7

Merge branch 'master' into 3420-support-flip-endianness
2 parents ce54c00 + 1ad8585 commit 78ef8b7

17 files changed

Lines changed: 586 additions & 91 deletions

.github/workflows/add_to_project.yml

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ jobs:
     name: Add to project
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/add-to-project@v0.5.0
+      - uses: actions/add-to-project@v1.0.2
         with:
           project-url: https://github.com/orgs/trailofbits/projects/12
           github-token: ${{ secrets.ADD_TO_PROJECT_PAT }}

.github/workflows/pythonpublish.yml

Lines changed: 2 additions & 2 deletions
@@ -13,11 +13,11 @@ jobs:
     runs-on: ubuntu-latest

     steps:
-    - uses: actions/checkout@v4.1.1
+    - uses: actions/checkout@v6.0.2
       with:
         submodules: recursive
     - name: Set up Python
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v6
       with:
         python-version: '3.x'
     - name: Install dependencies

.github/workflows/tests.yml

Lines changed: 4 additions & 4 deletions
@@ -15,16 +15,16 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest] # windows-latest, macos-latest,
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]

     runs-on: ${{ matrix.os }}

     steps:
-    - uses: actions/checkout@v4.1.1
+    - uses: actions/checkout@v6.0.2
       with:
         submodules: recursive
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v6
       with:
         python-version: ${{ matrix.python-version }}
     - name: Install Python Dependencies
@@ -33,7 +33,7 @@ jobs:
         pip install setuptools
         pip install .[dev]
     - name: Scan with pip-audit
-      uses: trailofbits/gh-action-pip-audit@v1.0.8
+      uses: trailofbits/gh-action-pip-audit@v1.1.0
     - name: Lint with flake8
       run: |
         # stop the build if there are Python syntax errors or undefined names

CLAUDE.md

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
+# PolyFile Development Guide
+
+## Project Overview
+
+PolyFile is a file analysis utility that identifies and maps the semantic and syntactic structure of files—including polyglots, chimeras, and "schizophrenic" files that are validly multiple types simultaneously.
+
+**Key capabilities:**
+- Pure-Python libmagic implementation (263+ MIME types)
+- Recursive embedded file detection (like binwalk)
+- Parsers for PDF, ZIP, JPEG, iNES, and 188 Kaitai Struct formats
+- Interactive HTML hex viewer with structure mapping
+- Drop-in replacement for Unix `file` command
+
+Part of the [ALAN Parsers Project](https://github.com/trailofbits/polyfile#the-alan-parsers-project) alongside PolyTracker.
+
+## Architecture
+
+### Core Pattern: Matchers + Parsers
+
+**Matchers** classify file types. Two types:
+- libmagic DSL matchers (defined in `polyfile/magic_defs/`)
+- Python matchers (classes extending `Matcher`)
+
+**Parsers** create AST representations of file structure. Registered via decorator:
+```python
+from polyfile import register_parser
+
+@register_parser("application/pdf")
+def parse_pdf(file_stream, match):
+    # Return parsed structure
+    ...
+```
+
+### Key Modules
+
+| Module | Purpose | Size |
+|--------|---------|------|
+| `polyfile.py` | Core engine—Match, Parser, Submatch classes | 15 KB |
+| `magic.py` | Pure-Python libmagic DSL implementation | 117 KB |
+| `pdf.py` | PDF parser with embedded file detection | 48 KB |
+| `debugger.py` | Interactive GDB-style debugger | 42 KB |
+| `kaitaimatcher.py` | Kaitai Struct format bridge ||
+
+### Directory Structure
+
+```
+polyfile/
+├── polyfile/                # Main package
+│   ├── magic_defs/          # 354 libmagic definition files
+│   ├── kaitai/parsers/      # 188 auto-generated Kaitai parsers (excluded from lint)
+│   └── templates/           # HTML output templates
+├── polymerge/               # Companion merge tool
+├── tests/                   # Test suite
+├── docs/                    # Extension guide, JSON format spec
+└── kaitai_struct_formats/   # Git submodule with KSY definitions
+```
+
+## Development Commands
+
+### Setup
+```bash
+# Install from source (requires Java for Kaitai compiler)
+pip install -e .[dev]
+
+# Install from PyPI
+pip install polyfile
+```
+
+### Linting
+```bash
+# Run flake8 (excludes auto-generated kaitai parsers)
+flake8 polyfile polymerge --max-complexity=10 --max-line-length=127 \
+    --exclude=polyfile/kaitai/parsers
+```
+
+### Testing
+```bash
+# Run all tests
+pytest tests
+
+# Run specific test file
+pytest tests/test_magic.py
+pytest tests/test_pdf.py
+pytest tests/test_corkami.py # Polyglot corpus
+```
+
+### Security Audit
+```bash
+pip-audit
+```
+
+### Pre-Commit Checklist
+Run all checks before committing changes:
+```bash
+# Lint
+flake8 polyfile polymerge --max-complexity=10 --max-line-length=127 \
+    --exclude=polyfile/kaitai/parsers
+
+# Security audit (checks for vulnerable dependencies)
+pip-audit
+
+# Tests
+pytest tests
+```
+
+## Code Navigation
+
+### Finding Matchers
+```bash
+# Find libmagic definitions by MIME type
+rg "application/pdf" polyfile/magic_defs/
+
+# Find Python matchers
+ast-grep --pattern 'class $NAME(Matcher): $$$' --lang py polyfile/
+```
+
+### Finding Parsers
+```bash
+# Find registered parsers
+rg "@register_parser" polyfile/
+
+# Find parser for specific MIME type
+rg 'register_parser.*application/zip' polyfile/
+```
+
+### Key Entry Points
+- CLI: `polyfile/__main__.py`
+- Core analysis: `polyfile/polyfile.py:PolyFile.struc()`
+- Magic matching: `polyfile/magic.py:MagicMatcher.match()`
+
+## Testing
+
+### Test Structure
+```
+tests/
+├── test_magic.py      # libmagic implementation vs corpus
+├── test_pdf.py        # PDF parsing
+├── test_corkami.py    # Polyglot/chimera edge cases
+├── test_kaitai.py     # Kaitai format tests
+└── unit/
+    ├── test_ast.py    # AST utilities
+    └── test_http.py   # HTTP protocol parsing
+```
+
+### Test Conventions
+- Uses real file corpus including libmagic's official test suite
+- Tests polyglot files to verify multi-type detection
+- Parser tests validate structure extraction
+
+## Extending PolyFile
+
+### Adding a Custom Matcher (Python)
+```python
+from polyfile import Matcher, Match
+
+class MyMatcher(Matcher):
+    def match(self, data: bytes) -> Match | None:
+        if data.startswith(b'MAGIC'):
+            return Match(
+                mime_type="application/x-myformat",
+                name="My Format",
+                offset=0,
+                length=len(data)
+            )
+        return None
+```
+
+### Adding a Custom Parser
+```python
+from polyfile import register_parser, Parser, Submatch
+
+@register_parser("application/x-myformat")
+class MyParser(Parser):
+    def parse(self, file_stream, match) -> Iterator[Submatch]:
+        # Yield Submatch objects representing structure
+        yield Submatch(
+            name="header",
+            start=0,
+            length=8,
+            value=file_stream.read(8)
+        )
+```
+
+### Adding Kaitai Struct Format
+1. Add `.ksy` file to `kaitai_struct_formats/`
+2. Map MIME type in `polyfile/kaitai/parsers/__init__.py`
+3. Rebuild: `python compile_kaitai_parsers.py`
+
+See `docs/extending_polyfile.md` for detailed guide.
+
+## Internal API Patterns
+
+### File I/O
+- Use `FileStream` abstraction for seeking/reading
+- `PathOrStdin`/`PathOrStdout` for CLI flexibility
+
+### Match Hierarchy
+- `Match` → top-level file type match
+- `Submatch` → nested structure within a match
+- Build trees for embedded files (ZIP contents, PDF streams)
+
+### Error Handling
+- Raise `InvalidMatch` when parser cannot process data
+- Matchers return `None` for non-matching data
+
+### Gotchas
+- `polyfile/kaitai/parsers/` is auto-generated—never edit manually
+- Java required at install time for Kaitai compilation
+- libmagic DSL has quirks—see [blog post](https://blog.trailofbits.com/2022/07/01/libmagic-the-blathering/)
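
Note on the "Registered via decorator" pattern that CLAUDE.md describes: the sketch below shows one way a decorator-based registry keyed by MIME type could work. It is not PolyFile's actual implementation. `PARSERS`, `register_parser`, and the parser signature are simplified stand-ins assumed only from the names visible in this commit, and `run_parsers` is a hypothetical helper added for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterator, List, Set

# Hypothetical stand-in for a MIME type -> set-of-parsers registry.
PARSERS: DefaultDict[str, Set[Callable]] = defaultdict(set)


def register_parser(*mime_types: str):
    """Return a decorator that files the parser under each MIME type."""
    def decorator(parser: Callable) -> Callable:
        for mime_type in mime_types:
            PARSERS[mime_type].add(parser)
        return parser  # hand the callable back unchanged
    return decorator


@register_parser("application/x-myformat")
def parse_myformat(data: bytes, match) -> Iterator[dict]:
    # Yield one record describing the first eight bytes.
    yield {"name": "header", "start": 0, "length": 8, "value": data[:8]}


def run_parsers(mime_type: str, data: bytes) -> List[dict]:
    """Hypothetical helper: run every parser registered for mime_type."""
    results: List[dict] = []
    for parser in PARSERS.get(mime_type, ()):
        results.extend(parser(data, None))
    return results
```

Because the decorator returns the original callable, the decorated function stays directly callable and importable; registration is only a side effect of module import.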

polyfile/__init__.py

Lines changed: 17 additions & 1 deletion
@@ -1,6 +1,5 @@
 from . import (
     nes,
-    pdf,
     jpeg,
     zipmatcher,
     nitf,
@@ -13,3 +12,20 @@

 from .__main__ import main
 from .polyfile import __version__, InvalidMatch, Match, Matcher, Parser, PARSERS, register_parser, Submatch
+
+
+# Lazy PDF parser registration
+# This registers immediately but defers importing pdf.py (and pdfminer) until first use
+class _LazyPDFParser(Parser):
+    """Lazy wrapper that imports the actual PDF parser on first use."""
+
+    _actual_parser = None
+
+    def parse(self, stream, match):
+        if _LazyPDFParser._actual_parser is None:
+            from . import pdf
+            _LazyPDFParser._actual_parser = pdf.pdf_parser
+        yield from _LazyPDFParser._actual_parser(stream, match)
+
+
+PARSERS["application/pdf"].add(_LazyPDFParser())
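
Note on the lazy wrapper in `polyfile/__init__.py`: the following is a minimal, self-contained sketch of the same idea, with the deferred `from . import pdf` replaced by a hypothetical `_load_real_parser` function so it can run without PolyFile installed. All names here are illustrative assumptions. One detail worth seeing in isolation: because `parse` is a generator function, the expensive load does not happen when `parse()` is called, only when the result is first iterated.

```python
load_count = 0  # counts how often the "expensive import" runs


def _load_real_parser():
    """Hypothetical stand-in for importing a heavy parser module."""
    global load_count
    load_count += 1

    def real_parser(stream, match):
        yield f"parsed {len(stream)} bytes"

    return real_parser


class LazyParser:
    """Wrapper that resolves the real parser on first iteration."""

    _actual_parser = None  # cached on the class, shared by all instances

    def parse(self, stream, match):
        if LazyParser._actual_parser is None:
            LazyParser._actual_parser = _load_real_parser()
        yield from LazyParser._actual_parser(stream, match)


lazy = LazyParser()
gen = lazy.parse(b"abc", None)        # generator created, nothing loaded yet
loads_before_iteration = load_count
first_result = list(gen)              # iteration triggers the load
second_result = list(lazy.parse(b"hello", None))  # reuses the cached parser
```

Caching on the class rather than the instance means a second `LazyParser()` would also skip the load, which matches how the diff stores `_actual_parser` on `_LazyPDFParser` itself.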

polyfile/debugger.py

Lines changed: 2 additions & 2 deletions
@@ -147,7 +147,7 @@ def should_break(
         parent_match: Optional[TestResult],
         result: Optional[TestResult]
     ) -> bool:
-        return self.pattern.is_contained_in(test.mimetypes())
+        return self.pattern.is_contained_in(test.mimetypes)

     @classmethod
     def parse(cls: Type[B], command: str) -> Optional[B]:
@@ -183,7 +183,7 @@ def should_break(
         parent_match: Optional[TestResult],
         result: Optional[TestResult]
     ) -> bool:
-        return self.ext in test.all_extensions()
+        return self.ext in test.all_extensions

     @classmethod
     def parse(cls: Type[B], command: str) -> Optional[B]:
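
Note on the `debugger.py` change: dropping the call parentheses suggests that `mimetypes` and `all_extensions` are exposed as properties rather than methods, although the diff does not show their definitions. Under that assumption, the sketch below uses a hypothetical `FakeMagicTest` class to show why the old call syntax would fail against a property.

```python
class FakeMagicTest:
    """Hypothetical stand-in for a magic test exposing a property."""

    def __init__(self, mimetypes):
        self._mimetypes = frozenset(mimetypes)

    @property
    def mimetypes(self):
        # Accessed as an attribute: test.mimetypes, not test.mimetypes()
        return self._mimetypes


test = FakeMagicTest(["application/pdf"])

# Attribute-style access works and returns the set directly.
has_pdf = "application/pdf" in test.mimetypes

# Call-style access first evaluates the property, then tries to call
# the resulting frozenset, which is not callable.
try:
    test.mimetypes()
    call_error = None
except TypeError as exc:
    call_error = type(exc).__name__
```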

polyfile/jpeg.py

Lines changed: 6 additions & 1 deletion
@@ -4,11 +4,16 @@
 from .fileutils import FileStream, Tempfile
 from .polyfile import Match, register_parser, Submatch

-from PIL import Image
+
+def _get_pil_image():
+    """Lazy import PIL.Image only when needed (for JPEG2000 parsing)."""
+    from PIL import Image
+    return Image


 @register_parser("image/jp2")
 def parse_jpeg2000(file_stream: FileStream, parent: Match):
+    Image = _get_pil_image()
     with Tempfile(file_stream.read(parent.length)) as input_bytes:
         img = Image.open(input_bytes)
         with BytesIO() as img_data:

polyfile/kaitaimatcher.py

Lines changed: 30 additions & 23 deletions
@@ -5,9 +5,6 @@
 from kaitaistruct import KaitaiStruct, KaitaiStructError

 from .kaitai.parser import ASTNode, KaitaiParser, RootNode
-from .kaitai.parsers.gif import Gif
-from .kaitai.parsers.jpeg import Jpeg
-from .kaitai.parsers.png import Png
 from .logger import getStatusLogger
 from .polyfile import register_parser, InvalidMatch, Match, Parser, Submatch

@@ -83,31 +80,41 @@ def ast_to_matches(ast: RootNode, parent: Match) -> Iterator[Submatch]:
         stack.extend(reversed([(new_node, c) for c in node.children]))


-for mimetype, kaitai_path in KAITAI_MIME_MAPPING.items():
-    class parse_:
-        kaitai_parser = KaitaiParser.load(kaitai_path)
-
-        def __call__(self, stream, match):
-            try:
-                ast = self.kaitai_parser.parse(stream).ast
-            except KaitaiStructError as e:
-                log.warning(f"Error parsing {stream.name} using {self.kaitai_parser}: {e!s}")
-                raise InvalidMatch()
-            except Exception as e:
-                log.error(f"Unexpected exception parsing {stream.name} using {self.kaitai_parser}: {e!s}")
-                raise InvalidMatch()
-            yield from ast_to_matches(ast, parent=match)
+class LazyKaitaiParser:
+    """Parser that lazily loads the Kaitai struct parser on first use."""

-    func_name = mimetype.replace("/", "_").replace("-", "_")
+    def __init__(self, kaitai_path: str, mimetype: str):
+        self.kaitai_path = kaitai_path
+        self.mimetype = mimetype
+        self._kaitai_parser = None
+
+    @property
+    def kaitai_parser(self):
+        if self._kaitai_parser is None:
+            self._kaitai_parser = KaitaiParser.load(self.kaitai_path)
+            MIME_BY_PARSER[self._kaitai_parser.struct_type] = self.mimetype
+        return self._kaitai_parser

-    parse_.__name__ = f"{parse_.__name__}{func_name}"
-    parse_.__qualname__ = f"{parse_.__qualname__}{func_name}"
+    def __call__(self, stream, match):
+        try:
+            ast = self.kaitai_parser.parse(stream).ast
+        except KaitaiStructError as e:
+            log.warning(f"Error parsing {stream.name} using {self.kaitai_parser}: {e!s}")
+            raise InvalidMatch()
+        except Exception as e:
+            log.error(f"Unexpected exception parsing {stream.name} using {self.kaitai_parser}: {e!s}")
+            raise InvalidMatch()
+        yield from ast_to_matches(ast, parent=match)

-    register_parser(mimetype)(parse_())

-    MIME_BY_PARSER[parse_.kaitai_parser.struct_type] = mimetype
+for mimetype, kaitai_path in KAITAI_MIME_MAPPING.items():
+    func_name = mimetype.replace("/", "_").replace("-", "_")
+    parser = LazyKaitaiParser(kaitai_path, mimetype)
+    parser.__name__ = f"parse_{func_name}"
+    parser.__qualname__ = f"parse_{func_name}"
+    register_parser(mimetype)(parser)

 del func_name
 del kaitai_path
 del mimetype
-del parse_
+del parser
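
Note on `LazyKaitaiParser`: the sketch below isolates the cached-property pattern used in the diff, with `KaitaiParser.load` replaced by a hypothetical `fake_load` function and `MIME_BY_PARSER` by a local dictionary. The names and the string return value are assumptions for illustration only. It also makes one consequence visible that appears to follow from the diff as shown: the reverse mapping is filled in when a parser is first loaded, not at import time, so it stays empty for formats that have not been parsed yet.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the module-level reverse mapping.
MIME_BY_PARSER: Dict[str, str] = {}

load_calls: List[str] = []


def fake_load(path: str) -> str:
    """Hypothetical loader; records each call and returns a fake handle."""
    load_calls.append(path)
    return f"struct:{path}"


class LazyLoader:
    """Loads its parser on first access and caches the result."""

    def __init__(self, path: str, mimetype: str, load: Callable[[str], str]):
        self.path = path
        self.mimetype = mimetype
        self._load = load
        self._parser: Optional[str] = None

    @property
    def parser(self) -> str:
        if self._parser is None:
            self._parser = self._load(self.path)
            # Side effect: register the reverse mapping on first load only.
            MIME_BY_PARSER[self._parser] = self.mimetype
        return self._parser


loader = LazyLoader("image/png.ksy", "image/png", fake_load)
registered_before = dict(MIME_BY_PARSER)  # snapshot before any access
first = loader.parser                     # triggers the load
second = loader.parser                    # served from the cache
```

Any code that reads the reverse mapping before a parser has run would need to tolerate missing entries under this design.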
