Commit c6dc177

[skills] Add heavy file ingestion skill packages
1 parent 9d5daa7 commit c6dc177

14 files changed

Lines changed: 1101 additions & 0 deletions

resources/README.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,3 +5,6 @@ Official companion files and packaged exports for Open Brain. Community-contribu
 | Resource | What It Is |
 | -------- | ---------- |
 | [Open Brain Companion](open-brain-companion.skill) | Claude Skill file for AI-assisted Open Brain help |
+| [Heavy File Ingestion for Claude Code](heavy-file-ingestion-claude-code.zip) | Downloadable Claude Code skill bundle with conversion script and references |
+| [Heavy File Ingestion for Codex](heavy-file-ingestion-codex.zip) | Downloadable Codex skill bundle with conversion script and references |
+| [Heavy File Ingestion for Claude Desktop](heavy-file-ingestion-claude-desktop.skill) | Claude Desktop skill package for conversion-first handling of bulky files |
8.28 KB
Binary file not shown.
1.17 KB
Binary file not shown.
8.21 KB
Binary file not shown.

skills/README.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ Reusable AI client skills and prompt packs for Open Brain workflows. These are t
 | [Deal Memo Drafting Skill Pack](deal-memo-drafting/) | Turns existing diligence materials into structured deal, IC, or partnership memos | [@NateBJones](https://github.com/NateBJones) |
 | [Research Synthesis Skill Pack](research-synthesis/) | Synthesizes source sets into findings, contradictions, confidence markers, and next questions | [@NateBJones](https://github.com/NateBJones) |
 | [Meeting Synthesis Skill Pack](meeting-synthesis/) | Converts meeting notes or transcripts into decisions, action items, risks, and follow-up artifacts | [@NateBJones](https://github.com/NateBJones) |
+| [Heavy File Ingestion Skill Pack](heavy-file-ingestion/) | Converts PDFs, decks, spreadsheets, and other bulky files into markdown, CSV, and a cheap structural index before analysis | [@NateBJones](https://github.com/NateBJones) |
 | [Panning for Gold Skill Pack](panning-for-gold/) | Turns brain dumps and transcripts into evaluated idea inventories | [@jaredirish](https://github.com/jaredirish) |
 | [Claudeception Skill Pack](claudeception/) | Extracts reusable lessons from work sessions into new skills | [@jaredirish](https://github.com/jaredirish) |

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Heavy File Ingestion

> Convert heavyweight files into agent-friendly markdown, CSV, and a lightweight index before analysis.

## What It Does

Heavy File Ingestion stops agents from wasting expensive context on raw PDFs, slide decks, spreadsheets, and other bulky files. It routes each file through a deterministic conversion step first, writes a reusable artifact to disk, and creates a small index so the main agent can decide whether it even needs deeper analysis.

## Supported Clients

- Claude Code
- Codex
- Claude Desktop
- Cursor
- Any AI client that supports reusable skills, rules, or custom instructions and can run local scripts

## Prerequisites

- Python 3.10+
- `uv` or `pip` for optional converter dependencies
- AI client that can load a reusable skill file and run local commands
- Working Open Brain setup if you want to pair this with Open Brain capture or retrieval flows ([guide](../../docs/01-getting-started.md))

## Installation

1. Copy the entire [`heavy-file-ingestion`](./) folder into a place your AI client can access, not just `SKILL.md`. The skill expects the bundled `scripts/` and `references/` folders to stay next to it.
2. For Claude Code, place the folder at `~/.claude/skills/heavy-file-ingestion/`.
3. For Codex or Cursor, keep the folder in your workspace or copy the contents into that client's skills or rules location.
4. Restart or reload the client so it picks up [`SKILL.md`](./SKILL.md).
5. When you want the deterministic converters available, run the skill script with:

```bash
uv run \
  --with pdfplumber \
  --with python-docx \
  --with python-pptx \
  --with openpyxl \
  python skills/heavy-file-ingestion/scripts/convert_heavy_file.py /absolute/path/to/file.pdf
```

6. If you already have `markitdown` installed and want to prefer it for rich document conversion, add `--prefer markitdown`.

## Downloadable Variants

If you want packaged client-specific downloads instead of the raw source folder, use:

- Claude Code: [../../resources/heavy-file-ingestion-claude-code.zip](../../resources/heavy-file-ingestion-claude-code.zip)
- Codex: [../../resources/heavy-file-ingestion-codex.zip](../../resources/heavy-file-ingestion-codex.zip)
- Claude Desktop: [../../resources/heavy-file-ingestion-claude-desktop.skill](../../resources/heavy-file-ingestion-claude-desktop.skill)

The Claude Code and Codex downloads include the bundled `scripts/` and `references/` directories. The Claude Desktop `.skill` is intentionally lighter because Claude Desktop is better treated as a policy layer than a local conversion runtime.

## Trigger Conditions

- The user asks the agent to "read," "analyze," "summarize," or "extract from" a PDF, DOCX, PPTX, XLSX, CSV, TSV, or other large file
- The file is big enough or structured enough that raw ingestion would burn unnecessary tokens
- The user wants a reusable markdown version, a CSV normalization step, or a quick structural index before analysis
- The agent needs to know whether the file can be handled deterministically or should escalate to a cheap model fallback

## Expected Outcome

When the skill is working correctly, it should:

- Detect that the file should be converted before the main model reads it
- Write an extracted artifact to disk instead of pushing raw content into context
- Create an `index.md` and `index.json` summary with counts, structure hints, preview lines, and quality flags
- Recommend the cheapest safe next step: use the deterministic artifact, escalate to a small model, or retry with a stronger converter
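
As a rough illustration of the `index.json` and `index.md` outputs listed above, the sketch below builds a small index from text that has already been extracted. The helper names (`build_index`, `write_index`), the field names (`line_count`, `word_count`, `headings`, `preview`, `flags`), and the 200-character threshold are assumptions made for this example; the schema the bundled script actually writes is not shown in this README.

```python
import json
from pathlib import Path


def build_index(text: str, preview_lines: int = 5, min_chars: int = 200) -> dict:
    """Summarize extracted text into a small index (hypothetical schema)."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Treat markdown-style "#" lines as structure hints.
    headings = [line.lstrip("# ").strip() for line in lines if line.startswith("#")]
    flags = []
    if len(text.strip()) < min_chars:
        # Flag name comes from the Troubleshooting section; the threshold is assumed.
        flags.append("low_text_density")
    return {
        "line_count": len(lines),
        "word_count": len(text.split()),
        "headings": headings,
        "preview": lines[:preview_lines],
        "flags": flags,
    }


def write_index(index: dict, out_dir: Path) -> None:
    """Write the index as JSON plus a short markdown summary."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "index.json").write_text(json.dumps(index, indent=2), encoding="utf-8")
    md = [
        "# Index",
        f"- Lines: {index['line_count']}",
        f"- Words: {index['word_count']}",
        f"- Flags: {', '.join(index['flags']) or 'none'}",
    ]
    (out_dir / "index.md").write_text("\n".join(md) + "\n", encoding="utf-8")
```

The point of keeping the index this small is that an agent can read it for a few dozen tokens and decide whether the full artifact is worth opening.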

## Open Source Stack

This skill was shaped around a small set of open-source projects with permissive licensing:

- [Microsoft MarkItDown](https://github.com/microsoft/markitdown) for broad document-to-markdown conversion
- [Docling](https://github.com/docling-project/docling) as the heavy-duty fallback for ugly or scanned documents
- [xlsx2csv](https://github.com/dilshod/xlsx2csv) as the reference pattern for spreadsheet normalization
- [pdfplumber](https://github.com/jsvine/pdfplumber) as the reference pattern for cheap PDF indexing and page-level extraction

More detail lives in [`references/open-source-stack.md`](./references/open-source-stack.md).

## Troubleshooting

**Issue:** The extracted PDF markdown is sparse or missing whole pages.
Solution: Check the generated `index.md`. If it flags `scanned_pdf_suspected` or `low_text_density`, rerun with a stronger converter or use a cheap model only on the extracted artifact, not on the original PDF.

**Issue:** The script says a dependency is missing.
Solution: Use the `uv run --with ...` command from this README or install the named package with `pip`.

**Issue:** The client can load the skill text but cannot run scripts.
Solution: Use the skill as policy guidance only and run the bundled script manually from Terminal. The skill is still useful because it tells the main agent when not to read a raw heavyweight file.
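
For the missing-dependency case above, a quick pre-flight check can tell you which optional converters are importable before you run the script. This is a minimal sketch, not part of the bundled script: `missing_packages` and `OPTIONAL_PACKAGES` are hypothetical names, and the mapping only covers the four packages named in the install command.

```python
import importlib.util

# Maps the pip package names from the install command to their import names.
OPTIONAL_PACKAGES = {
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "openpyxl": "openpyxl",
}


def missing_packages(packages: dict[str, str] = OPTIONAL_PACKAGES) -> list[str]:
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing optional converters:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All optional converters are importable.")
```

Note that two of the packages install under a different import name (`python-docx` as `docx`, `python-pptx` as `pptx`), which is a common source of confusion when reading a missing-module error.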

## Notes for Other Clients

If your client only supports a single prompt file, paste the contents of [`SKILL.md`](./SKILL.md) into that client and keep the script path nearby. The reusable behavior is the policy: convert first, inspect the index second, and only spend model tokens on the compressed artifact.
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
---
name: heavy-file-ingestion
description: Use when a user asks to read, analyze, summarize, or extract from a heavyweight file such as PDF, DOCX, PPTX, XLSX, CSV, or TSV. Convert the file into markdown or CSV first, generate a lightweight index, and only spend model tokens on the compressed artifact. Trigger on requests like "read this PDF", "look through this spreadsheet", "summarize this deck", or any time raw file ingestion would waste tokens.
author: Nate B. Jones
version: 1.0.0
---

# Heavy File Ingestion

## Problem

Agents waste money and context when they read heavyweight files raw. This skill turns bulky documents into cheaper working artifacts first, then tells the main agent how much reasoning power the file actually deserves.

## Trigger Conditions

- The user asks to read or analyze a PDF, slide deck, spreadsheet, or word-processing file
- The file is large, structured, or expensive enough that raw ingestion is a bad trade
- The user wants a markdown working copy, CSV extraction, or a quick map of the file before analysis
- The agent needs a deterministic first pass before choosing whether a model fallback is worth the cost

## Core Policy

1. **Convert before reading.** Do not dump raw heavyweight files into model context if a deterministic converter can create a cheaper artifact.
2. **Index before reasoning.** Read the generated `index.md` or `index.json` first. It should tell you what is in the file, how clean the extraction was, and whether escalation is justified.
3. **Match the converter to the file type.**
   - PDFs and documents: markdown artifact
   - Presentations: markdown slide outline
   - Spreadsheets: CSV per sheet plus a markdown manifest
4. **Escalate by cost tier, not instinct.**
   - Tier 1: deterministic converter plus index
   - Tier 2: cheap model on the extracted artifact only if quality flags say the deterministic pass lost structure
   - Tier 3: expensive model only after the file has already been compressed into markdown, CSV, or a sampled subset
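
The tier rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not the skill's own code: `choose_tier` is a hypothetical helper, `flags` stands for whatever quality flags the index reports, and `critical_ambiguity` is a judgment the agent supplies after the cheaper passes.

```python
def choose_tier(flags: list[str], critical_ambiguity: bool = False) -> int:
    """Pick the cheapest tier that the index's quality flags justify.

    Tier 1: deterministic converter plus index
    Tier 2: cheap model on the extracted artifact
    Tier 3: expensive model on the already-compressed artifact
    """
    if not flags:
        # A clean deterministic pass never justifies spending model tokens on repair.
        return 1
    if critical_ambiguity:
        # Only reached when quality flags exist and cheaper passes did not resolve them.
        return 3
    return 2
```

Making the clean-extraction check come first encodes "cost tier, not instinct": no amount of perceived difficulty moves a file past Tier 1 unless the index actually reports a problem.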

## Process

1. Identify the file path, extension, and rough size.
2. Run the converter script instead of reading the original file directly:

```bash
uv run \
  --with pdfplumber \
  --with python-docx \
  --with python-pptx \
  --with openpyxl \
  python skills/heavy-file-ingestion/scripts/convert_heavy_file.py /absolute/path/to/file.ext
```

3. If you already have `markitdown` installed and want to prefer it for PDF or DOCX conversion, rerun with:

```bash
python skills/heavy-file-ingestion/scripts/convert_heavy_file.py /absolute/path/to/file.ext --prefer markitdown
```

4. Read the generated `index.md` first.
5. Only read the extracted markdown or CSV outputs that the index says are worth reading.
6. If the index flags weak extraction, use a cheap fallback:
   - Try an alternate deterministic converter
   - Use a small model to rebuild only the structure or outline from the extracted artifact
   - Escalate to a stronger model only when the cheaper passes still leave critical ambiguity
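
If an agent or wrapper script needs to launch the commands from steps 2 and 3 itself, the argument list can be assembled in Python. The sketch below only reproduces the two invocations shown above; `build_command`, `SCRIPT`, and `CONVERTER_PACKAGES` are hypothetical names introduced for this example.

```python
import shlex

SCRIPT = "skills/heavy-file-ingestion/scripts/convert_heavy_file.py"
CONVERTER_PACKAGES = ["pdfplumber", "python-docx", "python-pptx", "openpyxl"]


def build_command(file_path: str, prefer_markitdown: bool = False) -> list[str]:
    """Build the argument list for the converter invocations in steps 2 and 3."""
    if prefer_markitdown:
        # Step 3: markitdown is assumed to be installed already, so call python directly.
        return ["python", SCRIPT, file_path, "--prefer", "markitdown"]
    cmd = ["uv", "run"]
    for package in CONVERTER_PACKAGES:
        cmd += ["--with", package]
    return cmd + ["python", SCRIPT, file_path]


# Print the shell-quoted form for inspection before running anything.
print(shlex.join(build_command("/absolute/path/to/file.pdf")))
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell-quoting problems when the file path contains spaces.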

## Output

The skill should leave behind:

- A deterministic artifact the agent can work from
- `index.md` with file counts, structure hints, preview lines, and a recommended next step
- `index.json` with the same information in machine-friendly form
- Warnings when the deterministic pass is not trustworthy enough for direct reasoning

## Notes

- Prefer the bundled script over rewriting ad hoc conversion code each time.
- Do not treat "sub-agent" as the default answer to messy files. A cheap deterministic pass beats a cheap model when the task is conversion, counting, routing, or indexing.
- For scanned PDFs, image-heavy decks, or bizarre layouts, the deterministic pass is still useful because it tells you that a fallback is needed before you waste a stronger model on the original file.
- Use [`references/open-source-stack.md`](./references/open-source-stack.md) when you need to choose a better extractor or explain why one was picked.
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
{
  "name": "Heavy File Ingestion",
  "description": "Converts heavyweight files such as PDFs, slide decks, spreadsheets, and documents into markdown, CSV, and lightweight indexes before an agent spends model tokens on them.",
  "category": "skills",
  "author": {
    "name": "Nate B. Jones",
    "github": "NateBJones"
  },
  "version": "1.0.0",
  "requires": {
    "open_brain": true,
    "services": [],
    "tools": ["Python 3.10+", "uv or pip", "AI client with reusable skills/prompts"]
  },
  "tags": ["skill", "file-conversion", "markdown", "pdf", "spreadsheet", "token-efficiency"],
  "difficulty": "intermediate",
  "estimated_time": "10 minutes",
  "created": "2026-03-31",
  "updated": "2026-03-31"
}
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Open Source Stack Notes

This skill uses a deterministic-first policy and keeps the tool stack small on purpose. The goal is not perfect document fidelity. The goal is to create an agent-friendly artifact cheaply enough that the main model only sees the compressed version.

## Recommended Roles

### 1. MarkItDown

- Repo: <https://github.com/microsoft/markitdown>
- License: MIT
- Role in this skill: Best general-purpose document-to-markdown converter for PDFs, DOCX, PPTX, and mixed office-style documents when you want broad coverage fast.
- Why it fits: It is explicitly designed to make documents easier for LLM workflows rather than chasing layout-perfect export.
- Why it is not the only tool here: It can pull in bigger dependency trees than we want for every single file, especially when a sheet or deck can be normalized more cheaply with a tiny native extractor.

### 2. Docling

- Repo: <https://github.com/docling-project/docling>
- License: MIT
- Role in this skill: Heavy-duty fallback for ugly PDFs, OCR-heavy files, layout-sensitive extraction, and advanced document recovery.
- Why it fits: Strong PDF understanding, OCR support, and multi-format export, including markdown.
- Why it is not the default: It is overkill for cheap first-pass routing and raises the operational footprint.

### 3. xlsx2csv

- Repo: <https://github.com/dilshod/xlsx2csv>
- License: MIT
- Role in this skill: Reference pattern for spreadsheet normalization.
- Why it fits: The right mental model for spreadsheets is usually "convert each sheet into a plain tabular artifact" rather than forcing the main model to inspect workbook internals.
- How this skill uses the idea: Native spreadsheet handling creates one CSV per sheet plus a markdown manifest with sheet counts, headers, and row estimates.
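
The "one CSV per sheet plus a markdown manifest" idea can be sketched with the standard library alone. This example assumes the sheets have already been loaded into memory as lists of rows (for instance with `openpyxl`); `write_sheets`, the `manifest.md` file name, and the file-name sanitization rule are assumptions for illustration, not the bundled script's behavior.

```python
import csv
from pathlib import Path


def write_sheets(sheets: dict[str, list[list[object]]], out_dir: Path) -> Path:
    """Write one CSV per sheet plus a markdown manifest; return the manifest path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = ["# Workbook Manifest", "", f"Sheets: {len(sheets)}", ""]
    for name, rows in sheets.items():
        # Keep file names filesystem-safe; this sanitization rule is an assumption.
        safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
        csv_path = out_dir / f"{safe}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(rows)
        # The first row is treated as the header row.
        header = rows[0] if rows else []
        data_rows = max(len(rows) - 1, 0)
        manifest += [
            f"## {name}",
            f"- File: {csv_path.name}",
            f"- Headers: {', '.join(str(h) for h in header)}",
            f"- Rows: {data_rows}",
            "",
        ]
    manifest_path = out_dir / "manifest.md"
    manifest_path.write_text("\n".join(manifest), encoding="utf-8")
    return manifest_path
```

The manifest is what the main model reads first; the CSVs stay on disk until a specific sheet is known to matter.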

### 4. pdfplumber

- Repo: <https://github.com/jsvine/pdfplumber>
- License: MIT
- Role in this skill: Cheap PDF indexing and page-level extraction.
- Why it fits: Good for counts, per-page text, page density checks, and detecting when a PDF is likely scanned or image-heavy.
- Why it matters: Even when the markdown extraction is weak, a cheap page-level index still tells the main agent whether escalation is worth the money.
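
A page density check of the kind described above might look like the following sketch. It works on a plain list of per-page strings so it runs without any PDF library; the flag names match those mentioned in the skill's README, while `page_density_flags` and both thresholds are assumptions chosen for illustration.

```python
def page_density_flags(
    page_texts: list[str],
    min_chars_per_page: int = 50,
    scanned_ratio: float = 0.8,
) -> list[str]:
    """Derive quality flags from per-page extracted text (thresholds are assumed)."""
    if not page_texts:
        # No pages yielded any text at all.
        return ["low_text_density"]
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    ratio = sparse / len(page_texts)
    flags = []
    if ratio >= scanned_ratio:
        # Nearly every page is empty, so the PDF is probably images of text.
        flags.append("scanned_pdf_suspected")
    elif ratio > 0:
        # Some pages extracted poorly, but not enough to suspect a full scan.
        flags.append("low_text_density")
    return flags
```

With pdfplumber, the input could be produced by `[page.extract_text() or "" for page in pdf.pages]` inside a `pdfplumber.open(path)` context.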

## Architecture Decision

Use the smallest tool that preserves the structure the agent actually needs:

1. Tabular files: native CSV normalization first
2. Slide decks: native slide-outline extraction first
3. PDFs and rich documents: native extraction or MarkItDown
4. Scanned or degraded files: Docling or a cheap model only after the deterministic pass proves it is necessary
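
The four-step order above amounts to a routing table keyed on file extension, with the scanned-file fallback applied only after a first pass has reported a problem. The sketch below is one way to express that; the strategy labels, `ROUTES`, and `route` are hypothetical names, not identifiers from the bundled script.

```python
from pathlib import Path

# Extension-to-strategy table following the four-step order above (labels are assumed).
ROUTES = {
    ".csv": "native_csv",
    ".tsv": "native_csv",
    ".xlsx": "native_csv",
    ".pptx": "native_slide_outline",
    ".pdf": "native_or_markitdown",
    ".docx": "native_or_markitdown",
}


def route(file_path: str, flags: frozenset[str] = frozenset()) -> str:
    """Pick a conversion strategy; escalate only when a prior pass raised a flag."""
    suffix = Path(file_path).suffix.lower()
    strategy = ROUTES.get(suffix, "unsupported")
    if strategy == "native_or_markitdown" and "scanned_pdf_suspected" in flags:
        # Step 4: reached only after the deterministic pass proved it necessary.
        return "docling_or_cheap_model"
    return strategy
```

Because `flags` defaults to empty, a first call can never land on the heavy fallback, which keeps the deterministic pass as the mandatory entry point.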

## What We Are Avoiding

- Reading raw binary-heavy files directly in the main model
- Defaulting to expensive models for pure conversion work
- Forcing a single converter to own every file type
- Building a pipeline so "smart" that a solo operator cannot debug it six months later
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""Build the downloadable Heavy File Ingestion bundles into resources/."""
from __future__ import annotations

import shutil
import zipfile
from pathlib import Path


# ROOT is the skill folder (the parent of this script's directory).
ROOT = Path(__file__).resolve().parents[1]
REPO_ROOT = ROOT.parents[1]
RESOURCES_DIR = REPO_ROOT / "resources"
VARIANTS_DIR = ROOT / "variants"


def reset_dir(path: Path) -> None:
    """Delete path if it exists, then recreate it empty."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)


def copy_tree(src: Path, dst: Path) -> None:
    """Copy the contents of src into dst, merging with anything already there."""
    dst.mkdir(parents=True, exist_ok=True)
    for item in src.iterdir():
        target = dst / item.name
        if item.is_dir():
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)


def build_zip_from_dir(source_dir: Path, archive_path: Path) -> None:
    """Zip source_dir so that its folder name is the top-level archive entry."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source_dir.parent))


def build_exports() -> list[Path]:
    """Assemble each client variant and return the paths of the created archives."""
    build_root = ROOT / ".build-exports"
    reset_dir(build_root)
    RESOURCES_DIR.mkdir(parents=True, exist_ok=True)

    created: list[Path] = []

    # Claude Code bundle: variant files plus the converter script and references.
    code_bundle = build_root / "heavy-file-ingestion-claude-code"
    copy_tree(VARIANTS_DIR / "claude-code", code_bundle)
    (code_bundle / "scripts").mkdir(parents=True, exist_ok=True)
    shutil.copy2(ROOT / "scripts" / "convert_heavy_file.py", code_bundle / "scripts" / "convert_heavy_file.py")
    copy_tree(ROOT / "references", code_bundle / "references")
    claude_code_zip = RESOURCES_DIR / "heavy-file-ingestion-claude-code.zip"
    build_zip_from_dir(code_bundle, claude_code_zip)
    created.append(claude_code_zip)

    # Codex bundle: same layout as the Claude Code bundle.
    codex_bundle = build_root / "heavy-file-ingestion-codex"
    copy_tree(VARIANTS_DIR / "codex", codex_bundle)
    (codex_bundle / "scripts").mkdir(parents=True, exist_ok=True)
    shutil.copy2(ROOT / "scripts" / "convert_heavy_file.py", codex_bundle / "scripts" / "convert_heavy_file.py")
    copy_tree(ROOT / "references", codex_bundle / "references")
    codex_zip = RESOURCES_DIR / "heavy-file-ingestion-codex.zip"
    build_zip_from_dir(codex_bundle, codex_zip)
    created.append(codex_zip)

    # Claude Desktop bundle: variant files only, zipped with a .skill extension.
    desktop_bundle = build_root / "heavy-file-ingestion-claude-desktop"
    copy_tree(VARIANTS_DIR / "claude-desktop", desktop_bundle)
    desktop_skill = RESOURCES_DIR / "heavy-file-ingestion-claude-desktop.skill"
    build_zip_from_dir(desktop_bundle, desktop_skill)
    created.append(desktop_skill)

    return created


def main() -> int:
    created = build_exports()
    for path in created:
        print(path)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
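
One detail of `build_zip_from_dir` is easy to miss: entries are stored relative to `source_dir.parent`, so every archive keeps the bundle folder name as its top-level directory. The sketch below copies that function and exercises it on a throwaway directory to show the resulting entry names; `archive_names` and the `demo-bundle` folder are hypothetical and exist only for this demonstration.

```python
import tempfile
import zipfile
from pathlib import Path


def build_zip_from_dir(source_dir: Path, archive_path: Path) -> None:
    # Same logic as the script above: entry names are relative to the parent folder.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source_dir.parent))


def archive_names(bundle_name: str, files: list[str]) -> list[str]:
    """Build a throwaway bundle and return the entry names stored in its zip."""
    with tempfile.TemporaryDirectory() as tmp:
        bundle = Path(tmp) / bundle_name
        for rel in files:
            target = bundle / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("placeholder", encoding="utf-8")
        # The archive lives next to the bundle, not inside it, so it is never zipped.
        archive = Path(tmp) / f"{bundle_name}.zip"
        build_zip_from_dir(bundle, archive)
        with zipfile.ZipFile(archive) as zf:
            return sorted(zf.namelist())


print(archive_names("demo-bundle", ["SKILL.md", "scripts/convert_heavy_file.py"]))
```

Unzipping such an archive therefore yields a single self-contained folder rather than loose files in the extraction directory.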
