
Capsules

A capsule is an optional per-paper enrichment layer that extracts and indexes structured sub-content from a paper: figures with captions, a parsed reference list, code snippets and script URLs, and supplementary information (SI) files. Capsules make it possible to answer questions that require reasoning about a specific figure, a specific referenced GitHub repository, or a specific supplementary table, rather than only the main text.


What a capsule contains

data/capsules/<paper_id>/
    metadata.json           # DOI, title, content_type, build timestamp
    figures/
        fig_1.png           # extracted figure image
        fig_1_caption.txt   # figure caption text
        fig_2.png
        fig_2_caption.txt
        ...
    references/
        references.json     # parsed reference list [{title, doi, year, authors}, ...]
    code/
        urls.json           # GitHub / Zenodo / Crossref resource URLs mined from the text
        snippets/           # fetched code files (when fetch_paper_resources was run)
    supplementary/
        files/              # SI files downloaded from PMC OA S3, Springer ESM, ACS
        manifest.json       # {filename, source, size, sha256} per SI file

Not all sections are present for every paper. A paper served as content_type: "abstract" will have very sparse capsule content, typically just the reference list parsed from Crossref metadata. Structured-content papers (PMC JATS, arXiv HTML) tend to produce the richest capsules.
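Because sections are optional, it can help to check what a given capsule actually contains before relying on it. The following is a minimal sketch, assuming only the directory layout shown above; the function name capsule_sections and the section keys are hypothetical and not part of the tool.

```python
from pathlib import Path

# Paths relative to data/capsules/<paper_id>/, taken from the layout above.
# The dictionary keys are illustrative names, not identifiers used by the tool.
SECTION_MARKERS = {
    "figures": "figures",
    "references": "references/references.json",
    "code_urls": "code/urls.json",
    "code_snippets": "code/snippets",
    "supplementary": "supplementary/manifest.json",
}


def capsule_sections(capsule_dir):
    """Return {section: bool} saying which capsule sections exist on disk."""
    root = Path(capsule_dir)
    present = {}
    for name, rel in SECTION_MARKERS.items():
        target = root / rel
        if target.is_dir():
            # A directory counts as present only if it holds at least one entry.
            present[name] = any(target.iterdir())
        else:
            present[name] = target.is_file()
    return present
```

For an abstract-only paper, one would expect most values to be False, with references possibly True.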


Building capsules

For a single paper

perspicacite -c config.yml build-capsule --paper-id "10.1038/s41586-023-06924-6" --kb my-kb

The --paper-id argument accepts DOIs, PMIDs, or the internal UUID assigned at ingest time.
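To illustrate the three identifier forms, here is a rough sketch of how a caller might tell them apart before invoking the command. The DOI pattern and the function name classify_paper_id are assumptions made for illustration; the tool's own resolution logic is not documented here and may differ.

```python
import re
import uuid

# Loose DOI shape: "10.", a registrant code, a slash, then a suffix.
# This is an illustrative pattern, not the tool's own validation.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_paper_id(value):
    """Guess whether value is a 'doi', 'pmid', or 'uuid'; else 'unknown'."""
    value = value.strip()
    if DOI_RE.match(value):
        return "doi"
    if value.isascii() and value.isdigit():
        # PMIDs are plain decimal integers.
        return "pmid"
    try:
        uuid.UUID(value)
    except ValueError:
        return "unknown"
    return "uuid"
```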

For all papers in a KB (idempotent)

perspicacite -c config.yml build-capsules --kb my-kb

Papers that already have a capsule are skipped unless --force is passed, so the command is safe to re-run after adding new papers.
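The skip rule can be pictured with a short sketch. It assumes that a capsule counts as built when its metadata.json file exists, and that each paper maps to one directory name under the capsule root; both are assumptions for illustration, as is the function name capsules_to_build.

```python
from pathlib import Path


def capsules_to_build(capsule_root, paper_dirs, force=False):
    """Return the directory names that still need a capsule build.

    Assumption: a capsule is considered built if metadata.json is present.
    """
    root = Path(capsule_root)
    if force:
        # Mirrors the described effect of --force: rebuild everything.
        return list(paper_dirs)
    return [d for d in paper_dirs if not (root / d / "metadata.json").is_file()]
```

Running the selection twice without new papers yields the same result, which is what makes a repeated run harmless.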

Via MCP

# Build for a single paper
await build_capsule(paper_id="10.1038/s41586-023-06924-6", kb_name="my-kb")

# Build for all papers in a KB
await build_capsules_for_kb(kb_name="my-kb")

Fetching external resources

The fetch-resources command mines external URLs from a paper's text (GitHub repository links, Zenodo records, Crossref-linked datasets) and optionally downloads the referenced files:

perspicacite -c config.yml fetch-resources \
  --paper-id "10.1038/s41586-023-06924-6" \
  --kb my-kb

Resource URLs are written to data/capsules/<paper_id>/code/urls.json. Text files (.py, .R, .md, .yml, etc.) within size limits are fetched and stored under code/snippets/. Binary files and files exceeding external_resources.zenodo_max_bytes_per_file are skipped.
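The fetch rule described above amounts to a suffix check plus a size check. The sketch below shows that shape only: the suffix set lists just the extensions named in the text, and the byte limit is a placeholder, since the real value comes from external_resources.zenodo_max_bytes_per_file in the configuration. The function name should_fetch is hypothetical.

```python
from pathlib import PurePosixPath

# Only the extensions named in the text; the real list is longer ("etc.").
TEXT_SUFFIXES = {".py", ".r", ".md", ".yml"}

# Placeholder value; the actual limit is read from the config key
# external_resources.zenodo_max_bytes_per_file.
MAX_BYTES_PER_FILE = 1_000_000


def should_fetch(filename, size_bytes, max_bytes=MAX_BYTES_PER_FILE):
    """Return True for text-like files that do not exceed the size limit."""
    suffix = PurePosixPath(filename).suffix.lower()
    return suffix in TEXT_SUFFIXES and size_bytes <= max_bytes
```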

Supplementary information files can be fetched separately:

# Via MCP
await fetch_supplementary(paper_id="10.1038/s41586-023-06924-6", kb_name="my-kb")

SI sources are tried in order: PMC OA S3 → Springer ESM → ACS.
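Once SI files are downloaded, the manifest makes it possible to check them for corruption. The following sketch assumes that manifest.json is a JSON array of objects with the filename, size, and sha256 keys listed in the layout, that size is in bytes, and that sha256 is a hex digest; the function name verify_supplementary is hypothetical.

```python
import hashlib
import json
from pathlib import Path


def verify_supplementary(capsule_dir):
    """Return filenames whose size or SHA-256 disagrees with manifest.json."""
    si_dir = Path(capsule_dir) / "supplementary"
    entries = json.loads((si_dir / "manifest.json").read_text(encoding="utf-8"))
    mismatched = []
    for entry in entries:
        path = si_dir / "files" / entry["filename"]
        if not path.is_file():
            # A file listed in the manifest but missing on disk is a mismatch.
            mismatched.append(entry["filename"])
            continue
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if len(data) != entry["size"] or digest != entry["sha256"].lower():
            mismatched.append(entry["filename"])
    return mismatched
```

An empty list means every file listed in the manifest is present and matches its recorded size and digest.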


Capsule-aware retrieval

When capsules are built, the RAG engine can include figure captions and code snippets in the embedding index alongside the main text chunks. This allows queries like:

  • "What does Figure 3 show in the paper about X?"
  • "Which GitHub repositories are referenced in papers about Y?"
  • "What supplementary tables are available for paper Z?"

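Capsule-aware retrieval is enabled by default when capsules exist. In the configuration, the capsule section controls whether a capsule is built automatically when a paper is added, and the multimodal section controls whether figure images are attached to answers: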

capsule:
  build_on_add: false         # auto-build capsule when adding a paper (off by default)

multimodal:
  mode: "auto"                # auto | force | off
  # "auto": attach figure images when retrieved chunks reference a figure
  # "force": also pull top-N figures by caption relevance
  # "off": never attach figure images
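The three modes can be read as a small selection rule. The sketch below is only an illustration of that rule, not the engine's implementation: the figure-reference regex, the word-overlap relevance score, the default top_n, and the function name figures_to_attach are all assumptions.

```python
import re

# Matches "Figure 3", "Fig. 3", "fig 3"; an illustrative pattern only.
FIGURE_REF = re.compile(r"\bfig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)


def _words(text):
    return set(re.findall(r"\w+", text.lower()))


def figures_to_attach(mode, query, chunks, captions, top_n=2):
    """Pick figure numbers to attach; captions maps figure number -> caption."""
    if mode not in {"auto", "force", "off"}:
        raise ValueError(f"unknown multimodal mode: {mode!r}")
    if mode == "off":
        return []
    # "auto" and "force": figures explicitly referenced by retrieved chunks.
    selected = []
    for chunk in chunks:
        for match in FIGURE_REF.finditer(chunk):
            num = int(match.group(1))
            if num in captions and num not in selected:
                selected.append(num)
    if mode == "force":
        # Stand-in relevance score: count of words shared with the query.
        query_words = _words(query)
        ranked = sorted(
            captions,
            key=lambda n: (-len(query_words & _words(captions[n])), n),
        )
        for num in ranked[:top_n]:
            if num not in selected:
                selected.append(num)
    return selected
```

A real relevance ranking would more plausibly use the caption embeddings already in the index; word overlap is used here only to keep the sketch self-contained.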

Zotero attachment push with capsule SI

When pushing papers to Zotero, capsule supplementary files can be attached alongside the cached PDF:

await push_to_zotero(
    dois=["10.1038/s41586-023-06924-6"],
    attach_pdf=True,
    attach_supplementary=True,   # uploads data/capsules/<paper_id>/supplementary/files/*
)

This uses Zotero's 4-step Web API file-upload protocol and requires cloud API access (api.zotero.org). See guides/zotero-integration.md.


Related topics