Complete feature reference for Dir2md.
Quick Start: Wiki.md | Main: README | CLI: CLI_REFERENCE.md | Help: TROUBLESHOOTING.md
Dir2md intelligently samples large files to fit within token budgets while preserving context.
Sampling Modes:
- off: No sampling, include full content
- ref: Generate file references only (tree structure)
- inline: Head/tail sampling with configurable limits
How it works:
- Analyzes file sizes and token counts
- Prioritizes important sections (imports, exports, main functions)
- Maintains code structure visibility
- Preserves context with head/tail sampling
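As a rough illustration of head/tail sampling, the sketch below keeps the first and last lines of a file and replaces the middle with a marker. The function name, the default line counts, and the marker text are assumptions made for this example, not Dir2md's actual implementation.

```python
def sample_head_tail(text: str, head_lines: int = 20, tail_lines: int = 10) -> str:
    """Keep the first and last lines of a file and mark what was dropped.

    Hypothetical helper: the name, defaults, and marker format are illustrative only.
    """
    lines = text.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return text  # small enough: include the file whole
    omitted = len(lines) - head_lines - tail_lines
    head = lines[:head_lines]
    tail = lines[-tail_lines:] if tail_lines else []
    return "\n".join(head + [f"... [{omitted} lines omitted] ..."] + tail)
```

Keeping the head usually preserves imports and top-level declarations, while the tail often holds entry points such as a main function.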
Automatic detection and masking of sensitive information.
Built-in Pattern Categories:
- PEM Blocks: RSA keys, certificates, private keys
- API Keys: AWS, Stripe, SendGrid, Twilio
- Tokens: GitHub, GitLab, Bearer tokens
- Credentials: Database URLs, passwords, connection strings
- JWTs: JSON Web Tokens
- OAuth: Client secrets, refresh tokens
Masking Levels:
- off: No masking
- basic: Common patterns (API keys, tokens, PEM blocks)
- advanced: Extended patterns including database URLs, JWTs
Large File Handling (v1.2.2):
- Inputs over 1,000,000 chars are masked in overlapping chunks (no bypass).
- Chunked masking preserves length so offsets and line counts remain stable.
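One way to picture length-preserving chunked masking is the sketch below. It scans overlapping windows and overwrites each match with `*` characters of the same length, leaving newlines alone so offsets and line counts do not shift. The function name, the overlap size, and the `*` replacement character are assumptions for illustration; only the 1,000,000-character threshold comes from the description above.

```python
import re

def mask_chunked(text, pattern, chunk_size=1_000_000, overlap=1_000):
    """Mask regex matches window by window, keeping the text length unchanged.

    Hypothetical sketch: overlap size and '*' replacement are assumed values.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    out = list(text)
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        # Always search the original text, so earlier masking never hides a match.
        window = text[start:start + chunk_size]
        for m in pattern.finditer(window):
            for i in range(start + m.start(), start + m.end()):
                if out[i] != "\n":  # keep line breaks so line counts stay stable
                    out[i] = "*"
    return "".join(out)
```

In this sketch the overlap has to be at least as long as the longest secret you expect; a longer secret that straddles a window boundary could be only partially masked.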
Custom Patterns:
# Single pattern
dir2md . --masking basic --mask-pattern "(?i)custom_secret_\w+"
# Pattern file
dir2md . --masking basic --mask-pattern-file file:///path/to/patterns.json
SimHash-based content deduplication prevents redundant information in output.
Benefits:
- Reduces token usage
- Eliminates repeated boilerplate
- Maintains unique code blocks
- Configurable similarity threshold
How it works:
- Generates content fingerprints
- Compares across files
- Marks duplicates with reference links
- Preserves one canonical version
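The fingerprint-and-compare steps above can be sketched with a minimal 64-bit SimHash. This is a textbook version written for illustration; the whitespace tokenizer, the SHA-256 token hash, and the `max_distance` threshold are assumptions, and Dir2md's own fingerprinting may differ.

```python
import hashlib

BITS = 64  # fingerprint width used in this sketch

def simhash(text: str) -> int:
    """Build a 64-bit SimHash fingerprint from whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the tokens voted "1" more often than "0".
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as duplicates when their fingerprints are close."""
    return hamming(simhash(a), simhash(b)) <= max_distance
```

Similar documents share most tokens, so their fingerprints differ in only a few bits; the distance threshold plays the role of the configurable similarity threshold.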
Fine-grained control over output size for LLM context windows.
Budget Controls:
- --budget-tokens: Total budget for all files
- --max-file-tokens: Per-file limit
- Automatic distribution across files
- Priority-based allocation
Smart Allocation:
- Important files get larger share
- Query-matched files prioritized
- Small files included whole
- Large files sampled efficiently
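A simple greedy scheme conveys the idea of priority-based allocation under a total budget and a per-file cap. The function, the tuple layout, and the priority values below are hypothetical; the real allocator may weigh files differently.

```python
def allocate_budget(files, budget_tokens, max_file_tokens=None):
    """Greedy allocation sketch.

    files: list of (path, tokens, priority) tuples (hypothetical layout).
    Returns {path: allotted tokens}.
    """
    allotted = {}
    remaining = budget_tokens
    # Highest priority first; ties go to the smaller file so it can be included whole.
    for path, tokens, _priority in sorted(files, key=lambda f: (-f[2], f[1])):
        cap = tokens if max_file_tokens is None else min(tokens, max_file_tokens)
        share = min(cap, remaining)
        allotted[path] = share
        remaining -= share
    return allotted
```

A file whose allotment is smaller than its token count would then be sampled rather than included whole.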
When --follow-symlinks is enabled, Dir2md only traverses symlinked directories
that resolve inside the project root. Files resolving outside the root are
skipped to prevent accidental data leakage.
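The containment rule can be expressed as a small check with `pathlib`: resolve both paths and confirm the target is the root itself or sits below it. The function name is an assumption for this sketch.

```python
from pathlib import Path

def resolves_inside_root(path: Path, root: Path) -> bool:
    """Return True when `path`, after following symlinks, stays within `root`."""
    root = root.resolve()
    target = path.resolve()
    return target == root or root in target.parents
```

Comparing resolved paths, rather than the raw strings, is what catches a symlink that looks local but points elsewhere on disk.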
Human-readable blueprint with structure and samples.
Contents:
- Directory tree visualization
- File metadata (size, tokens, line count)
- Code samples with syntax highlighting
- Risk findings (if spicy enabled)
- Masked secrets indicators
Line-delimited JSON for AI/LLM ingestion.
Structure:
{"type": "file", "path": "src/main.py", "content": "...", "meta": {...}}
{"type": "file", "path": "tests/test.py", "content": "...", "meta": {...}}
Benefits:
- Streaming-friendly
- Easy to parse
- Preserves metadata
- LLM-optimized
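Because each line is an independent JSON object, a consumer can process the output one record at a time. The reader below is a minimal sketch; the `type`, `path`, `content`, and `meta` fields follow the structure shown above, while the function name is illustrative.

```python
import json

def iter_records(lines):
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Passing an open file handle as `lines` streams the records without loading the whole output into memory.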
Structured JSON with full metadata.
Structure:
{
"tree": {...},
"files": [...],
"manifest": {...},
"spicy": {...}
}
Metadata-only output without file content.
Contents:
- File statistics
- Token counts
- Directory structure
- Risk assessment summary
- Generation metadata
Preset configuration optimized for LLM context windows.
Enabled with: --ai-mode
Includes:
- 6000 token budget (Claude/GPT optimized)
- Automatic JSONL generation
- Query ranking enabled
- Smart sampling
- Security masking
Rank and filter files by relevance to a natural language query.
Usage:
dir2md . --ai-mode --query "authentication flow"
dir2md . --ai-mode --query "database migration logic"
How it works:
- Analyzes file content semantically
- Scores relevance to query
- Reorders files by match quality
- Includes top matches within budget
- Shows relevance scores in output
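To make the score-and-reorder steps concrete, the sketch below ranks files by plain keyword overlap with the query. This is a deliberately simple stand-in, not the scoring Dir2md uses; the function names and the scoring formula are assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_files(files, query):
    """Rank files by the fraction of query terms found in their path or content.

    files: {path: content}. Returns [(path, score)], best match first.
    """
    terms = tokenize(query)
    scored = []
    for path, content in files.items():
        words = tokenize(path) | tokenize(content)
        score = len(terms & words) / len(terms) if terms else 0.0
        scored.append((path, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

The ranked list could then be passed to the budget step, so that the best matches are included first.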
Security-focused code analysis with severity-based findings.
- ok: No issues detected
- warn: Minor concerns (TODO, deprecated patterns)
- risk: Moderate risks (weak crypto, eval usage)
- high: Serious vulnerabilities (hardcoded secrets visible)
- critical: Severe security issues (exposed credentials)
- Secrets: Hardcoded API keys, passwords, tokens
- Crypto: Weak algorithms, insecure configurations
- Injection: SQL, command, code injection risks
- Access Control: Insecure permissions, authentication gaps
- Data Exposure: Sensitive data in logs, debug output
Fail build/CI on high/critical findings.
dir2md . --spicy-strict
# Exit code 2 if high/critical found
# Exit code 0 if only warn/risk
- Markdown: Dedicated spicy section with findings
- JSON/JSONL: spicy.score, spicy.counts, spicy.findings[]
- Manifest: Summary statistics only
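The strict-mode rule (exit code 2 when high or critical findings exist, 0 otherwise) can be mirrored in a CI script that reads severity counts from the JSON output. The helper below is a hypothetical sketch; it assumes the counts form a mapping from severity name to number of findings, which this document does not confirm.

```python
def strict_exit_code(counts):
    """Map severity counts to the strict-mode exit code.

    counts: mapping of severity name -> number of findings (assumed shape).
    """
    blocking = counts.get("high", 0) + counts.get("critical", 0)
    return 2 if blocking else 0
```

A pipeline step could call `sys.exit(strict_exit_code(counts))` to fail the build in the same way.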
Raw Preset
[tool.dir2md.preset.raw]
budget_tokens = null # Unlimited
sample_mode = "inline"
emit_manifest = false
Pro Preset
[tool.dir2md.preset.pro]
budget_tokens = null # User-defined
sample_mode = null # User-defined
emit_manifest = true
AI Preset
[tool.dir2md.preset.ai]
budget_tokens = 6000
sample_mode = "ref"
emit_manifest = true
masking = "basic"
Fast Preset
[tool.dir2md.preset.fast]
budget_tokens = 0
sample_mode = "off"
emit_manifest = true
pyproject.toml Example:
[tool.dir2md]
preset = "pro"
budget_tokens = 8000
max_file_tokens = 500
include_glob = ["src/**/*.py", "tests/**/*.py"]
exclude_glob = ["**/__pycache__/**"]
emit_manifest = true
ai_mode = false
[tool.dir2md.masking]
level = "basic"
patterns = [
"(?i)custom_secret_\\w+",
"(?i)internal_token\\s*=\\s*['\"]?\\w+"
]
pattern_files = ["file://./.dir2md/patterns.txt"]
[tool.dir2md.spicy]
enabled = true
strict = false
- Deterministic output (--no-timestamp)
- Exit codes for automation
- Dry-run mode for validation
- Configurable output paths
Pre-built container for consistent execution across environments.
FROM python:3.9-slim
COPY . /app
RUN pip install -e /app
ENTRYPOINT ["python", "-m", "src.dir2md.cli"]
- Respects .gitignore patterns
- Detects repository metadata
- Branch and commit info in manifest
- Concurrent file reading
- Parallel content analysis
- Async I/O operations
- Streaming file processing
- Lazy content loading
- Incremental token counting
- File hash caching
- Pattern compilation cache
- Metadata persistence
Certification Details:
- ID: SIDRCE-DIR2MD-20251203-ARCHON
- Scores:
  - Integrity: 98/100
  - Resonance: 95/100
  - Stability: 95/100
  - Overall: 96/100 (Certified)
Architecture Principles:
- Distributed responsibilities (walker, selector, renderer, orchestrator)
- No god-object anti-patterns
- Clear separation of concerns
- Comprehensive test coverage
Quality Standards:
- Deterministic pytest behavior
- Cross-platform compatibility
- Production-ready error handling
- Security-first design