| 1 | +# Heavy File Ingestion |
| 2 | + |
| 3 | +> Convert heavyweight files into agent-friendly markdown, CSV, and a lightweight index before analysis. |
| 4 | + |
| 5 | +## What It Does |
| 6 | + |
| 7 | +Heavy File Ingestion stops agents from wasting expensive context on raw PDFs, slide decks, spreadsheets, and other bulky files. It routes each file through a deterministic conversion step first, writes a reusable artifact to disk, and creates a small index so the main agent can decide whether the file even needs deeper analysis. |
| 8 | + |
| 9 | +## Supported Clients |
| 10 | + |
| 11 | +- Claude Code |
| 12 | +- Codex |
| 13 | +- Claude Desktop |
| 14 | +- Cursor |
| 15 | +- Any AI client that supports reusable skills, rules, or custom instructions and can run local scripts |
| 16 | + |
| 17 | +## Prerequisites |
| 18 | + |
| 19 | +- Python 3.10+ |
| 20 | +- `uv` or `pip` for optional converter dependencies |
| 21 | +- AI client that can load a reusable skill file and run local commands |
| 22 | +- Working Open Brain setup if you want to pair this with Open Brain capture or retrieval flows ([guide](../../docs/01-getting-started.md)) |
| 23 | + |
| 24 | +## Installation |
| 25 | + |
| 26 | +1. Copy the entire [`heavy-file-ingestion`](./) folder, not just `SKILL.md`, into a place your AI client can access. The skill expects the bundled `scripts/` and `references/` folders to stay next to it. |
| 27 | +1. For Claude Code, place the folder at `~/.claude/skills/heavy-file-ingestion/`. |
| 28 | +1. For Codex or Cursor, keep the folder in your workspace or copy the contents into that client's skills or rules location. |
| 29 | +1. Restart or reload the client so it picks up [`SKILL.md`](./SKILL.md). |
| 30 | +1. When you want the deterministic converters available, run the skill script through `uv`, which supplies the optional converter packages for that run: |
| 31 | + |
| 32 | +```bash |
| 33 | +uv run \ |
| 34 | + --with pdfplumber \ |
| 35 | + --with python-docx \ |
| 36 | + --with python-pptx \ |
| 37 | + --with openpyxl \ |
| 38 | + python skills/heavy-file-ingestion/scripts/convert_heavy_file.py /absolute/path/to/file.pdf |
| 39 | +``` |
| 40 | + |
| 41 | +1. If you already have `markitdown` installed and want to prefer it for rich document conversion, add `--prefer markitdown` to the script arguments. |
| 42 | + |
| 43 | +## Downloadable Variants |
| 44 | + |
| 45 | +If you want packaged client-specific downloads instead of the raw source folder, use: |
| 46 | + |
| 47 | +- Claude Code: [../../resources/heavy-file-ingestion-claude-code.zip](../../resources/heavy-file-ingestion-claude-code.zip) |
| 48 | +- Codex: [../../resources/heavy-file-ingestion-codex.zip](../../resources/heavy-file-ingestion-codex.zip) |
| 49 | +- Claude Desktop: [../../resources/heavy-file-ingestion-claude-desktop.skill](../../resources/heavy-file-ingestion-claude-desktop.skill) |
| 50 | + |
| 51 | +The Claude Code and Codex downloads include the bundled `scripts/` and `references/` directories. The Claude Desktop `.skill` is intentionally lighter because Claude Desktop is better treated as a policy layer than a local conversion runtime. |
| 52 | + |
| 53 | +## Trigger Conditions |
| 54 | + |
| 55 | +- The user asks the agent to "read," "analyze," "summarize," or "extract from" a PDF, DOCX, PPTX, XLSX, CSV, TSV, or other large file |
| 56 | +- The file is big enough or structured enough that raw ingestion would burn unnecessary tokens |
| 57 | +- The user wants a reusable markdown version, a CSV normalization step, or a quick structural index before analysis |
| 58 | +- The agent needs to know whether the file can be handled deterministically or should escalate to a cheap model fallback |
| 59 | + |
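The trigger conditions above can be approximated with a small pre-check. This is a minimal sketch, not the skill's actual logic: the function name `should_convert_first` and the 1 MB threshold are assumptions made for illustration, since the README does not publish a size cutoff.

```python
from pathlib import Path

# Document formats named in the trigger conditions; treated here as
# "structured enough" to convert regardless of size (an assumption).
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}

# Hypothetical cutoff for "big enough"; tune it for your own context budget.
SIZE_THRESHOLD_BYTES = 1_000_000


def should_convert_first(path, size_bytes, threshold=SIZE_THRESHOLD_BYTES):
    """Return True when a file looks like a candidate for conversion before reading."""
    suffix = Path(path).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return True
    # CSV, TSV, and any other file type only qualify once they are large.
    return size_bytes >= threshold
```

A caller would pass the file size from `Path(path).stat().st_size` and route the file to the conversion script only when the function returns `True`.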
| 60 | +## Expected Outcome |
| 61 | + |
| 62 | +When the skill is working correctly, it should: |
| 63 | + |
| 64 | +- Detect that the file should be converted before the main model reads it |
| 65 | +- Write an extracted artifact to disk instead of pushing raw content into context |
| 66 | +- Create an `index.md` and `index.json` summary with counts, structure hints, preview lines, and quality flags |
| 67 | +- Recommend the cheapest safe next step: use the deterministic artifact, escalate to a small model, or retry with a stronger converter |
| 68 | + |
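As a rough illustration of the last point, an agent could map the quality flags in `index.json` to one of the three next steps. The sketch below makes several assumptions: the key name `quality_flags`, the return labels, and the flag-to-step mapping are hypothetical, and only the two flag names come from the Troubleshooting section of this README. Check the real `index.json` before relying on any of it.

```python
import json

# Flag names taken from the Troubleshooting section of this README.
WEAK_EXTRACTION_FLAGS = {"scanned_pdf_suspected", "low_text_density"}


def recommend_next_step(index):
    """Map an index summary (a dict) to one of the three next steps."""
    # "quality_flags" is an assumed key name, not a documented schema.
    flags = set(index.get("quality_flags", []))
    if flags & WEAK_EXTRACTION_FLAGS:
        # Extraction itself looks unreliable, so a better converter comes first.
        return "retry_with_stronger_converter"
    if flags:
        # Some other concern was raised; let a cheap model inspect the artifact.
        return "escalate_to_small_model"
    return "use_deterministic_artifact"


sample = json.loads('{"quality_flags": ["low_text_density"]}')
print(recommend_next_step(sample))  # prints "retry_with_stronger_converter"
```

Keeping the decision in a plain function like this means the main agent only needs the small index, never the raw file, to pick its next move.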
| 69 | +## Open Source Stack |
| 70 | + |
| 71 | +This skill was shaped around a small set of open-source projects with permissive licensing: |
| 72 | + |
| 73 | +- [Microsoft MarkItDown](https://github.com/microsoft/markitdown) for broad document-to-markdown conversion |
| 74 | +- [Docling](https://github.com/docling-project/docling) as the heavy-duty fallback for ugly or scanned documents |
| 75 | +- [xlsx2csv](https://github.com/dilshod/xlsx2csv) as the reference pattern for spreadsheet normalization |
| 76 | +- [pdfplumber](https://github.com/jsvine/pdfplumber) as the reference pattern for cheap PDF indexing and page-level extraction |
| 77 | + |
| 78 | +More detail lives in [`references/open-source-stack.md`](./references/open-source-stack.md). |
| 79 | + |
| 80 | +## Troubleshooting |
| 81 | + |
| 82 | +**Issue:** The extracted PDF markdown is sparse or missing whole pages. |
| 83 | +Solution: Check the generated `index.md`. If it flags `scanned_pdf_suspected` or `low_text_density`, rerun with a stronger converter, such as Docling from the stack listed above, or use a cheap model only on the extracted artifact, not on the original PDF. |
| 84 | + |
| 85 | +**Issue:** The script says a dependency is missing. |
| 86 | +Solution: Use the `uv run --with ...` command from this README or install the named package with `pip`. |
| 87 | + |
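To see which optional converter packages are missing before running the script, a quick standalone check can help. This is a sketch that is not part of the bundled scripts; the helper name `missing_packages` is hypothetical. The package-to-module mapping reflects the import names those packages install (`python-docx` imports as `docx`, `python-pptx` as `pptx`).

```python
import importlib.util

# pip package name -> module name it installs
CONVERTER_MODULES = {
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "openpyxl": "openpyxl",
}


def missing_packages(modules=CONVERTER_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return sorted(
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All optional converter packages are importable.")
```

Run it with the same Python interpreter you plan to use for the conversion script, since each environment has its own set of installed packages.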
| 88 | +**Issue:** The client can load the skill text but cannot run scripts. |
| 89 | +Solution: Use the skill as policy guidance only and run the bundled script manually from Terminal. The skill is still useful because it tells the main agent when not to read a raw heavyweight file. |
| 90 | + |
| 91 | +## Notes for Other Clients |
| 92 | + |
| 93 | +If your client only supports a single prompt file, paste the contents of [`SKILL.md`](./SKILL.md) into that client and keep the script path nearby. The reusable behavior is the policy: convert first, inspect the index second, and only spend model tokens on the compressed artifact. |