pdfscalpel/
├── analyze/ # Non-destructive PDF analysis
├── extract/ # Data extraction from PDFs
├── mutate/ # PDF modification operations
├── solve/ # CTF and forensic solving tools
├── generate/ # PDF and challenge generation
├── core/ # Shared utilities and base classes
├── cli/ # Command-line interface
└── plugins/ # Plugin system
Purpose: Non-destructive analysis and intelligence gathering
Components:
- structure.py - PDF structure analysis, object tree parsing, anomaly detection
- metadata.py - Metadata extraction (Info dict, XMP), tool fingerprinting
- encryption.py - Encryption parameter analysis, crackability assessment
- malware.py - Malware and exploit detection (20+ CVEs, JavaScript analysis, YARA integration)
- signatures.py - Digital signature validation, certificate chain analysis, attack detection (USF, SWA, ISA)
- form_security.py - Form vulnerability analysis (XXE, SSRF, JavaScript injection)
- anti_forensics.py - Sanitization tool fingerprinting (ExifTool, MAT2, QPDF, Ghostscript)
- advanced_stego.py - Advanced steganography detection (stream operators, object ordering, whitespace)
- watermark.py - Watermark detection and classification
- graph.py - Object graph visualization, DOT generation
- intelligence.py - Intelligence synthesis, recommendations, rendering analysis
Output: Analysis reports, visualizations, recommendations
Purpose: Extract data from PDFs without modification
Components:
- text.py - Text extraction with layout preservation
- images.py - Image extraction, format detection
- javascript.py - JavaScript extraction and deobfuscation
- attachments.py - Embedded file extraction
- forms.py - AcroForm/XFA form data extraction
- streams.py - Object stream extraction and decompression
- objects.py - Raw PDF object dumping
- hidden.py - Hidden content detection (invisible text, whitespace)
- revisions.py - Incremental update extraction, timeline reconstruction
Output: Extracted files, text, data structures
Purpose: Modify PDFs (destructive operations)
Components:
- watermark.py - Watermark addition and removal (15+ techniques)
- encryption.py - Add/remove passwords, set permissions
- pages.py - Merge, split, extract, reorder pages
- bookmarks.py - Add, remove, auto-generate bookmarks
- redaction.py - Redact sensitive content
- optimize.py - Compress, linearize, remove unused objects
Output: Modified PDF files
Purpose: CTF challenge solving and forensic recovery (ethical use only)
Components:
- password.py - Password cracking (dictionary, brute force, mask attacks)
- flag_hunter.py - Flag detection across all PDF layers
- stego_solver.py - Steganography detection and extraction
- auto_solver.py - Automated challenge solving orchestration
- repair.py - PDF damage assessment and repair (header reconstruction, xref rebuilding, stream recovery)
- ctf_mode.py - CTF mode enforcement, audit trail generation
Output: Cracked passwords, extracted flags, repaired PDFs, audit logs
Note: Password cracking requires --ctf-mode and --challenge-id for ethical use enforcement
Purpose: Create PDFs and CTF challenges
Components:
- challenges.py - CTF challenge generation (password, stego, multi-stage)
- corrupted.py - Intentional PDF corruption for recovery challenges
- polyglot.py - Polyglot file creation (PDF+ZIP, PDF+HTML)
- steganography.py - Steganography embedding (LSB, metadata, whitespace)
- watermark.py - Watermark template generation
Output: Challenge PDFs, solution metadata
Purpose: Shared utilities and infrastructure
Components:
- pdf_base.py - PDFDocument wrapper class (pikepdf abstraction)
- config.py - Configuration management (TOML support)
- constants.py - Shared constants, patterns, thresholds
- dependencies.py - External tool detection, installation guidance
- exceptions.py - Custom exception hierarchy
- logging.py - Logging and audit infrastructure
- crypto.py - Cryptographic utilities (hashing, encryption)
- image_utils.py - Image processing (inpainting, frequency analysis)
- patterns.py - Regex patterns (flags, hashes, encoding detection)
Output: Shared services for all modules
Purpose: Command-line interface implementation
Components:
- main.py - Click-based CLI, command routing
- ui.py - Rich-based UI components (progress bars, tables)
- validators.py - Input validation, path checking
Output: User-facing command-line interface
Purpose: Extensible plugin system
Components:
- base.py - Plugin base classes (AnalyzerPlugin, ExtractorPlugin, GeneratorPlugin)
- loader.py - Plugin discovery, registration, lifecycle management
- examples/ - Example plugins for reference
Output: Plugin framework for third-party extensions
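To make the plugin contract more concrete, here is a minimal sketch of how a base class and a registry could fit together. Only the class name AnalyzerPlugin comes from the component list above; the `name` attribute, the `on_load` hook, the `analyze` method, and the `PluginRegistry` class are assumptions made for illustration and may not match the real base.py or loader.py.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class AnalyzerPlugin(ABC):
    """Base class name taken from the document; the interface is assumed."""

    name: str = "unnamed"  # hypothetical identifier used by the registry

    def on_load(self) -> None:
        """Hypothetical lifecycle hook, called once after registration."""

    @abstractmethod
    def analyze(self, pdf_path: Path) -> dict[str, Any]:
        """Hypothetical entry point returning a dictionary of findings."""


class PluginRegistry:
    """Hypothetical stand-in for the bookkeeping loader.py might do."""

    def __init__(self) -> None:
        self._plugins: dict[str, AnalyzerPlugin] = {}

    def register(self, plugin_cls: type[AnalyzerPlugin]) -> None:
        # Instantiate, fire the lifecycle hook, then index by plugin name.
        plugin = plugin_cls()
        plugin.on_load()
        self._plugins[plugin.name] = plugin

    def run_all(self, pdf_path: Path) -> dict[str, dict[str, Any]]:
        # Run every registered plugin and collect results per plugin name.
        return {name: p.analyze(pdf_path) for name, p in self._plugins.items()}


class HeaderCheck(AnalyzerPlugin):
    """Toy plugin: reports the file size and whether it starts with %PDF-."""

    name = "header_check"

    def analyze(self, pdf_path: Path) -> dict[str, Any]:
        data = pdf_path.read_bytes()
        return {"has_pdf_header": data.startswith(b"%PDF-"), "size_bytes": len(data)}
```

A third-party plugin would subclass the base class and be registered once; the registry then runs it alongside every other plugin with `registry.run_all(Path("sample.pdf"))`.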
User Command
↓
CLI Parser (main.py)
↓
Validator (validators.py)
↓
PDFDocument (pdf_base.py)
↓
Analyzer Module (analyze/*)
↓
Intelligence Layer (intelligence.py)
↓
Results Formatter (ui.py)
↓
Output (JSON/Text/HTML)
User Command (--ctf-mode --challenge-id)
↓
CTF Mode Enforcement (ctf_mode.py)
↓
Auto Solver Orchestration (auto_solver.py)
↓
┌─────────────┬─────────────┬─────────────┐
│ Password │ Flag Hunter │ Stego │
│ Cracker │ │ Solver │
└─────────────┴─────────────┴─────────────┘
↓
Results Aggregation
↓
Audit Log Generation (ctf_mode.py)
↓
Output (Report + Provenance)
User Command (mutate watermark --remove)
↓
Watermark Analysis (analyze/watermark.py)
↓
Type Classification
↓
Strategy Selection
↓
┌────────────┬─────────────┬──────────────┐
│ Content │ OCG │ XObject │
│ Stream │ Removal │ Removal │
│ Editing │ │ │
└────────────┴─────────────┴──────────────┘
↓
Quality Assessment
↓
Output (Clean PDF)
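The strategy selection step in the flow above can be pictured as a lookup from the classified watermark type to a removal routine. The sketch below is illustrative only: the `WatermarkType` enum, the three placeholder functions, and `select_strategy` are hypothetical names, and the placeholders return a description instead of editing a PDF.

```python
from enum import Enum
from typing import Callable


class WatermarkType(Enum):
    """Hypothetical classification result from the analysis step."""

    CONTENT_STREAM = "content_stream"
    OCG = "ocg"
    XOBJECT = "xobject"


# Placeholder strategies: each only describes what a real routine would do.
def edit_content_stream(pdf: str) -> str:
    return f"{pdf}: content stream edited"


def remove_ocg_layer(pdf: str) -> str:
    return f"{pdf}: OCG layer removed"


def remove_xobject(pdf: str) -> str:
    return f"{pdf}: XObject removed"


STRATEGIES: dict[WatermarkType, Callable[[str], str]] = {
    WatermarkType.CONTENT_STREAM: edit_content_stream,
    WatermarkType.OCG: remove_ocg_layer,
    WatermarkType.XOBJECT: remove_xobject,
}


def select_strategy(kind: WatermarkType) -> Callable[[str], str]:
    """Return the removal routine registered for a watermark type."""
    try:
        return STRATEGIES[kind]
    except KeyError:
        raise ValueError(f"No removal strategy for {kind!r}") from None
```

Keeping the mapping in one table means a new watermark type needs only a new enum member and one new entry, without touching the dispatch logic.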
- pikepdf - PDF parsing and manipulation
- pdfplumber - Text extraction with layout
- click - CLI framework
- rich - Terminal UI components
- tomli - TOML configuration (Python <3.11)
- Pillow - Image processing
- numpy - Numerical operations
- pycryptodome - Cryptographic operations
- ocrmypdf - OCR functionality
- graphviz - Graph visualization
- python-magic - File type detection
- pypdf - Additional PDF operations
- Ghostscript - PDF rendering, watermark operations
- QPDF - PDF structure manipulation, repair
- John the Ripper - Password cracking
- Hashcat - GPU-accelerated password cracking
- Tesseract - OCR engine
- ImageMagick - Advanced image processing
Each module has a single responsibility:
- Analyze - Read-only analysis
- Extract - Data extraction
- Mutate - Modification
- Solve - Problem solving
- Generate - Creation
Missing external tools don't break core functionality:
- Detect tool availability
- Provide installation instructions
- Fall back to Python implementations
- Continue with reduced functionality
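The four steps above can be sketched with the standard library's `shutil.which`. This is a minimal illustration, not the project's dependencies.py: `ToolStatus`, `check_tool`, and `run_with_fallback` are hypothetical names, and executable names differ by platform (Ghostscript, for example, is not called the same thing on every system).

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass
class ToolStatus:
    """Result of probing PATH for one external executable (hypothetical type)."""

    name: str
    path: Optional[str]
    hint: str

    @property
    def available(self) -> bool:
        return self.path is not None


def check_tool(name: str, hint: str = "") -> ToolStatus:
    # shutil.which returns the full path, or None if the tool is not on PATH.
    return ToolStatus(name=name, path=shutil.which(name), hint=hint)


def run_with_fallback(
    status: ToolStatus,
    external: Callable[[str], T],
    fallback: Callable[[], T],
) -> T:
    """Use the external tool when present, otherwise the Python fallback."""
    if status.path is not None:
        return external(status.path)
    # Tell the user what is missing and how to get it, then keep going.
    print(f"{status.name} not found; using the Python fallback. {status.hint}")
    return fallback()
```

A caller would probe once at startup, for example `check_tool("qpdf", "Install QPDF to enable faster repair.")`, and pass the result to each operation that can use the tool.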
Sensitive operations require explicit authorization:
- CTF mode enforcement (--ctf-mode)
- Challenge ID requirement (--challenge-id)
- Audit trail generation
- Signed provenance files
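A rough sketch of how such a gate and a signed audit record might look is shown below. The document does not say how provenance files are signed, so HMAC-SHA256 is an assumption chosen for illustration; `AuthorizationError`, `require_ctf_authorization`, `signed_audit_record`, and `verify_audit_record` are hypothetical names, not the contents of ctf_mode.py.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from typing import Optional


class AuthorizationError(Exception):
    """Raised when a sensitive operation lacks explicit authorization."""


def require_ctf_authorization(ctf_mode: bool, challenge_id: Optional[str]) -> str:
    """Refuse to proceed unless both --ctf-mode and --challenge-id were given."""
    if not ctf_mode:
        raise AuthorizationError("Password cracking requires --ctf-mode")
    if not challenge_id or not challenge_id.strip():
        raise AuthorizationError("Password cracking requires --challenge-id")
    return challenge_id.strip()


def _signature(record: dict, key: bytes) -> str:
    # Canonical JSON (sorted keys) so the same record always signs the same way.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def signed_audit_record(challenge_id: str, operation: str, key: bytes) -> dict:
    """Build an audit entry and attach an HMAC-SHA256 signature (assumed scheme)."""
    record = {
        "challenge_id": challenge_id,
        "operation": operation,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    record["signature"] = _signature(record, key)
    return record


def verify_audit_record(record: dict, key: bytes) -> bool:
    """Recompute the signature over every field except the signature itself."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    return hmac.compare_digest(_signature(unsigned, key), record.get("signature", ""))
```

Checking authorization before any work starts, and signing the record afterwards, keeps the audit trail tamper-evident: changing any field invalidates the signature.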
Plugin system allows third-party extensions:
- Well-defined base classes
- Auto-discovery from multiple directories
- Lifecycle hooks for integration
- Isolated execution environments
Optimized for large PDFs:
- Streaming parsers for large files
- Parallel processing where applicable
- Caching for expensive operations
- Memory-mapped file I/O
Production-ready code:
- >90% test coverage
- Type hints (mypy compatible)
- Comprehensive error handling
- Detailed logging
- Clear error messages with guidance
- Command-line options (highest priority)
- Project config file (./pdfscalpel.toml)
- User config file (~/.pdfscalpel.toml)
- System config file (~/.config/pdfscalpel/config.toml)
- Default values (lowest priority)
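The priority order above amounts to merging configuration layers from lowest to highest priority, with later layers overriding earlier ones key by key. The sketch below works on plain dictionaries so it runs without any files; `deep_merge` and `resolve_config` are hypothetical helper names, and the `language` and `dpi` keys are invented for the example.

```python
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict where override wins, merging nested tables recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers_low_to_high: Mapping[str, Any]) -> dict[str, Any]:
    """Fold layers in order: defaults first, command-line options last."""
    result: dict[str, Any] = {}
    for layer in layers_low_to_high:
        result = deep_merge(result, layer)
    return result


# Example layers (keys are illustrative, not the tool's real settings).
defaults = {"ocr": {"language": "eng", "dpi": 300}}
system_cfg: dict[str, Any] = {}
user_cfg = {"ocr": {"dpi": 150}}
project_cfg = {"ocr": {"language": "deu"}}
cli_options = {"ocr": {"dpi": 600}}

config = resolve_config(defaults, system_cfg, user_cfg, project_cfg, cli_options)
```

In a real loader each file layer would come from parsing TOML (the standard `tomllib` module on Python 3.11+, `tomli` before that), with missing files contributing an empty layer.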
[ocr] # OCR settings
[watermark] # Watermark defaults
[password] # Password cracking settings
[plugins] # Plugin directories
- Test individual functions and classes
- Mock external dependencies
- High coverage (>90%)
- Test complete workflows
- Test CLI commands
- Test plugin system
- Benchmark critical paths
- Test with large PDFs
- Memory profiling
- Auto-generate test PDFs
- Various encryption types
- Different watermark types
- Corrupted PDFs
PDFScalpelError (base)
├── PDFOpenError (cannot open file)
├── PDFEncryptedError (password required)
├── PDFCorruptedError (malformed PDF)
├── PDFNotFoundError (file not found)
├── DependencyMissingError (external tool missing)
└── ConfigurationError (invalid config)
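In Python, the hierarchy above maps directly onto subclasses of a common base, which lets callers catch either one specific failure or every tool-specific error at once. The class names come from the tree; the `check_header` helper is a hypothetical example, and its header test is a simplification (it only looks at the first five bytes).

```python
from pathlib import Path


class PDFScalpelError(Exception):
    """Base class for all tool-specific errors."""


class PDFOpenError(PDFScalpelError):
    """The file could not be opened."""


class PDFEncryptedError(PDFScalpelError):
    """A password is required."""


class PDFCorruptedError(PDFScalpelError):
    """The PDF is malformed."""


class PDFNotFoundError(PDFScalpelError):
    """The file does not exist."""


class DependencyMissingError(PDFScalpelError):
    """A required external tool is missing."""


class ConfigurationError(PDFScalpelError):
    """The configuration is invalid."""


def check_header(path: Path) -> None:
    """Hypothetical helper: raise a specific error for common failure modes."""
    if not path.is_file():
        raise PDFNotFoundError(f"No such file: {path}")
    with path.open("rb") as fh:
        # Simplified check: only the first five bytes are inspected.
        if fh.read(5) != b"%PDF-":
            raise PDFCorruptedError(f"Missing %PDF- header: {path}")
```

A command handler can then wrap its work in `except PDFScalpelError` to report any of these uniformly while letting unrelated exceptions surface as bugs.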
Provide actionable guidance:
- What went wrong
- Why it likely failed
- What to try next
- Installation instructions for missing tools
Example:
Error: Encrypted PDF requires password
Encryption: AES-256
Crackability: High (estimated 6-8 char password)
Try:
1. pdfscalpel solve password INPUT --ctf-mode --challenge-id ID
2. pdfscalpel analyze encryption INPUT --check-exploits
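One way to produce messages in this shape is to keep the problem, the supporting details, and the suggested next steps as structured data and render them at the end. The `Guidance` dataclass below is a hypothetical sketch, not the tool's actual formatter; the message text is the example shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Guidance:
    """Structured error message: what went wrong, context, and next steps."""

    problem: str
    details: dict[str, str] = field(default_factory=dict)
    suggestions: list[str] = field(default_factory=list)

    def render(self) -> str:
        lines = [f"Error: {self.problem}"]
        # Context lines explain why the operation likely failed.
        lines += [f"{key}: {value}" for key, value in self.details.items()]
        if self.suggestions:
            lines.append("Try:")
            lines += [f"{i}. {step}" for i, step in enumerate(self.suggestions, 1)]
        return "\n".join(lines)


message = Guidance(
    problem="Encrypted PDF requires password",
    details={
        "Encryption": "AES-256",
        "Crackability": "High (estimated 6-8 char password)",
    },
    suggestions=[
        "pdfscalpel solve password INPUT --ctf-mode --challenge-id ID",
        "pdfscalpel analyze encryption INPUT --check-exploits",
    ],
).render()
```

Keeping the parts separate until rendering makes it straightforward to emit the same guidance as JSON or as a rich terminal panel.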
- Lazy Loading - Defer object dereferencing until needed
- Streaming - Process large files incrementally
- Caching - Cache expensive operations (graph traversal, entropy)
- Parallel Processing - Use multiprocessing for page-level operations
- External Tools - Delegate to optimized C tools (QPDF, Ghostscript)
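As an illustration of the caching strategy, the sketch below memoizes a Shannon entropy calculation with `functools.lru_cache`, so repeated requests for the same stream bytes are served from the cache. The function name and cache size are assumptions; the project's own caching layer may work differently.

```python
import math
from collections import Counter
from functools import lru_cache


@lru_cache(maxsize=256)
def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte; results are cached by the bytes value."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over all byte values that occur in the data.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


uniform = shannon_entropy(bytes(range(256)))   # every byte value once
constant = shannon_entropy(b"aaaa")            # a single repeated value
repeat = shannon_entropy(bytes(range(256)))    # served from the cache
```

Because the cache key is the bytes object itself, the cache keeps those buffers alive; for very large streams, caching by a content hash or object ID would bound memory better.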
Target performance (Windows 11, i7-12700K):
- Structure analysis: <3s for 1000-page PDF
- Password cracking: >50,000 passwords/sec (RC4-40)
- Object graph: <2s for 5000 objects
- Image extraction: <1s for 100 images
- GUI (Electron or web-based)
- REST API for integration
- Distributed cracking (cluster support)
- Machine learning for malware detection
- Advanced polyglot detection
- Certificate-based encryption handling
- Machine learning watermark detection
- Automated exploit generation
- PDF parser fuzzing
- Reader behavior profiling