Edison Scientific deep-research integration via edison-client SDK by realmarcin · Pull Request #33 · CultureBotAI/CultureMech

realmarcin · 2026-05-26T04:52:04Z

Summary

Wires a programmatic path to FutureHouse / Edison Scientific deep research over CultureMech media records.

The existing scripts/research_media.py wraps deep-research-client, but DRC 0.2.4 registers only openai and cyberian as providers — Edison / PaperQA isn't accessible through it. The edison-client SDK is already in pyproject deps; this PR invokes it directly.

New file: `scripts/research_media_edison.py`

Single recipe: python scripts/research_media_edison.py --target <slug|id|path>
Batch: python scripts/research_media_edison.py --batch data/import_tracking/reports/edison_batch.json --limit 5
--job literature (paperqa3, default) | literature-high | precedent | phoenix. Aliases paperqa, paperqa-high.
--dry-run renders the query and prints the plan without an API call (no credits).
--start skips into a batch list; --limit caps how many entries run.
Auth: EDISON_PLATFORM_API_KEY (SDK-native) or legacy EDISON_API_KEY; auto-loads repo-root .env via python-dotenv.
Reuses research_media.py's template_vars / load_media / resolve_media_file so the rendered query matches the existing DRC workflow.
Outputs: research/media/{slug}-edison-{job}.md + sibling -meta.yaml (task_id, total_cost, status, template_vars).
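
The key lookup described in the Auth item can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `resolve_api_key` is hypothetical, and the precedence (SDK-native name first, legacy name second) is an assumption based on the order the two variables are listed in. Loading `.env` via python-dotenv is left out so the sketch needs only the standard library.

```python
from __future__ import annotations

from typing import Mapping, Optional

# Variable names come from the PR description; the lookup order is an assumption.
KEY_VARS = ("EDISON_PLATFORM_API_KEY", "EDISON_API_KEY")


def resolve_api_key(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty key found in env, or None if neither is set.

    Hypothetical helper: the real script may validate or report errors differently.
    """
    for name in KEY_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

In a script this would typically be called as `resolve_api_key(os.environ)` after any `.env` file has been loaded, with a `None` result turned into a clear error before any job is submitted.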

Batch resolution fix

The 100-recipe edison_batch.json priority list was failing to resolve because research_media.py's resolve_media_file:

treats string file_paths as CWD-relative (not data/normalized_yaml/-relative), and
raises on multi-match slugs (e.g. dehalospirillum_medium exists in 5 importer-flavor variants).

The Edison batch resolver tries data/normalized_yaml/<file_path> verbatim first — unambiguous — then falls back to slug matching. Resolvability is 100% (5/5) in the smoke dry-run.
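
A rough sketch of that resolution order, under stated assumptions: the function name `resolve_batch_entry` is hypothetical, and slug matching is modeled here as "a file named `<slug>.yaml` anywhere under the YAML root", which may not be how the real helper matches slugs. Only the ordering (exact `file_path` first, slug fallback second) is taken from the description above.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


def resolve_batch_entry(entry: dict, yaml_root: Path) -> Optional[Path]:
    """Resolve one batch entry to a YAML file, preferring its file_path.

    Hypothetical helper. `entry` is assumed to carry `recipe_name` and,
    optionally, `file_path` relative to `yaml_root`.
    """
    file_path = entry.get("file_path")
    if file_path:
        candidate = yaml_root / file_path
        if candidate.is_file():
            return candidate  # exact path, so no ambiguity is possible

    slug = entry.get("recipe_name")
    if slug:
        # Assumed slug rule: the file stem equals the slug.
        matches = sorted(yaml_root.rglob(f"{slug}.yaml"))
        if len(matches) == 1:
            return matches[0]
        if len(matches) > 1:
            raise ValueError(f"{slug!r} matches {len(matches)} files; supply file_path")
    return None
```

Trying the exact path first means a slug shared by several importer variants never reaches the ambiguous branch as long as the batch entry carries a valid `file_path`.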

Justfile targets

just research-media-edison <target> [*args]
just research-media-edison-batch <batch.json> [*args] (always pass --limit N on first runs)

Also

.env.example added (gitignored — .env already is).
data/import_tracking/reports/edison_batch.json regenerated against current corpus via analyze_media_quality.py (the March-vintage file pre-dated the snake_case + orphan-page cleanups; few entries still resolved).

Out of scope here

Live API smoke test deferred until user confirms .env has the new LBL key (2.6k credits).

Test plan

Dry-run single recipe (--target luria_bertani_lb_medium)
Dry-run batch (--limit 5) — 5/5 resolve cleanly
just --list shows both new targets in the Research group
Live test deferred (key rotation)
just validate-strict clean (no schema/data touched)

🤖 Generated with Claude Code

Adds a programmatic path to FutureHouse / Edison Scientific deep research over CultureMech media records. Companion to (not replacement for) scripts/research_media.py, which wraps deep-research-client: DRC 0.2.4 registers only `openai` and `cyberian` as providers, so Edison / PaperQA isn't accessible through it. The edison-client SDK (already in pyproject deps) is invoked directly here.

New: scripts/research_media_edison.py

Single recipe:
    python scripts/research_media_edison.py --target <slug|id|path>

Batch (priority list):
    python scripts/research_media_edison.py \
        --batch data/import_tracking/reports/edison_batch.json --limit 5

--job: literature (paperqa3, default) | literature-high | precedent | phoenix. Aliases: paperqa, paperqa-high.
--dry-run: render the query and print the plan; no API call, no credits spent.
--start / --limit: cap or skip into a batch list.

Auth picks up EDISON_PLATFORM_API_KEY (SDK-native) or EDISON_API_KEY (legacy alias used by research_media.py). A repo-root .env is auto-loaded via python-dotenv. .env.example added.

Reuses research_media.py's template_vars / load_media / resolve helpers so the rendered query matches the existing DRC workflow.

Outputs land under research/media/{slug}-edison-{job}.md plus a sibling -meta.yaml capturing task_id, total_cost, status, the rendered template variables, and the prompt size for audit.

Batch resolution gotcha (and fix): edison_batch.json carries `recipe_name` (slug derived from the YAML name field) AND `file_path` (relative to data/normalized_yaml/). research_media.py's resolve_media_file:
  (a) treats string paths as relative to CWD (not normalized_yaml/), so file_path entries miss; and
  (b) raises ValueError when a slug matches multiple files (e.g., "dehalospirillum_medium" appears in 5 importer-flavor variants).
The Edison batch resolver now first tries `data/normalized_yaml/<file_path>` verbatim (unambiguous) before falling back to slug matching. Resolvability against the freshly regenerated 100-recipe batch is 100% (5/5 in --limit 5 smoke).

Justfile targets:
    research-media-edison target *args=""          # single
    research-media-edison-batch batch *args=""     # batch (pass --limit!)

Out of scope here: live API smoke test deferred until user confirms .env has the new LBL key.
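
The --job names and aliases listed in this commit message could be resolved along these lines. This is a sketch under assumptions: the `Job` enum is a hypothetical stand-in (the real script presumably maps onto identifiers from the edison-client SDK, which are not shown here), and the alias table only encodes the two aliases the message names.

```python
from enum import Enum


class Job(Enum):
    """Hypothetical stand-in for the SDK's job identifiers."""

    LITERATURE = "literature"
    LITERATURE_HIGH = "literature-high"
    PRECEDENT = "precedent"
    PHOENIX = "phoenix"


# Aliases named in the PR; each maps onto a canonical CLI name.
ALIASES = {"paperqa": "literature", "paperqa-high": "literature-high"}


def resolve_job(name: str) -> Job:
    """Map a CLI --job value (or alias) to a Job, exiting on unknown names."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    try:
        return Job(key)
    except ValueError:
        valid = sorted({j.value for j in Job} | set(ALIASES))
        raise SystemExit(f"unknown --job {name!r}; choose from: {', '.join(valid)}")
```

Exiting with `SystemExit` rather than returning a default keeps a typo in `--job` from silently spending credits on the wrong job type.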

Copilot

Pull request overview

Adds a new Edison Scientific / FutureHouse “deep research” entrypoint for CultureMech media records by calling the edison-client SDK directly (bypassing deep-research-client provider limitations), plus convenience just targets and updated reporting/docs artifacts.

Changes:

Introduces scripts/research_media_edison.py to submit Edison jobs for single targets or batches, writing Markdown + YAML meta outputs.
Adds just research-media-edison and just research-media-edison-batch targets to run the Edison script via uv.
Updates the committed quality analysis report summary numbers and adds a .env.example template for Edison key setup.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| scripts/research_media_edison.py | New Edison SDK-backed research script (single + batch modes, output + meta writing). |
| project.justfile | Adds `just` targets to invoke the Edison research script. |
| data/import_tracking/reports/quality_analysis.md | Updates report metrics/formatting for the latest corpus analysis. |
| .env.example | Adds an example env file documenting Edison API key usage. |

Eight findings, all fixed:

1. research_media_edison.py: Path.relative_to(REPO_ROOT) crashed --dry-run for --out-dir outside the repo. New _display_path() helper falls back to absolute string when the path isn't under REPO_ROOT.
2. Module docstring claimed --dry-run "prints the rendered query" but only printed paths + query_chars. --dry-run now writes the full meta yaml (including the rendered prompt) alongside the would-be md path; docstring updated to match.
3. Meta dict didn't actually contain the query referenced in the "prompt that was sent" doc claim. Added `query`, `media_id`, and `template_path` fields. Live runs gain the same fields.
4. slug_for() used the CURIE local part (009674) which diverged from research_media.py's stem-based naming. Switched to media_path.stem (e.g. luria_bertani_lb_medium) so research outputs are sortable / findable by recipe name. CURIE id captured in meta.media_id instead.
5. Filename suffix used job.name.lower() which produced "literature_high" while the CLI alias is --job literature-high. New _short_job() helper normalizes _ -> - for consistency.
6. edison-client + python-dotenv were only transitive deps via deep-research-client. Declared both explicitly under the dev extra in pyproject.toml so fresh `uv run --extra dev ...` won't break if the transitive ever drops out. Lockfile refreshed.
7. analyze_media_quality.py wrote the developer's absolute CWD into the committed report header. Now writes a repo-relative `Source dir: data/normalized_yaml` line; output_file used as the anchor for the relative-path computation.
8. tests/test_research_media_edison.py covers:
   - load_batch_targets returns recipe_name + file_path candidates in fall-through order
   - load_batch_targets rejects non-list JSON with SystemExit
   - _short_job emits hyphens (literature-high)
   - slug_for uses the YAML stem
   - _display_path doesn't crash on paths outside REPO_ROOT
   - resolve_job recognizes literature / paperqa / literature-high / paperqa-high aliases and SystemExits on unknown jobs

7/7 passing. just validate-strict: 0 ERROR rows / 15,827 records (no schema or recipe-data touched).
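
Findings 1 and 5 describe two small helpers. A minimal sketch of what they might look like follows; the names mirror the commit message without the leading underscore, but the bodies are assumptions rather than the merged code. The `REPO_ROOT` value is a placeholder and the `Job` enum is a hypothetical stand-in for whatever enum the script actually uses.

```python
from enum import Enum
from pathlib import Path

# Placeholder root for illustration; the real script derives this from its own location.
REPO_ROOT = Path("/srv/CultureMech")


def display_path(path: Path, root: Path = REPO_ROOT) -> str:
    """Return path relative to root when possible, else its absolute form."""
    try:
        return str(path.relative_to(root))
    except ValueError:
        # relative_to() raises ValueError when path is not under root,
        # which is the crash described in finding 1.
        return str(path.absolute())


class Job(Enum):
    """Hypothetical job enum; only the member name matters here."""

    LITERATURE_HIGH = "hypothetical-id"


def short_job(job: Enum) -> str:
    """Filename-safe job label that matches the hyphenated CLI spelling."""
    return job.name.lower().replace("_", "-")
```

With these, an output directory outside the repository is printed as an absolute path instead of raising, and `Job.LITERATURE_HIGH` yields the suffix `literature-high`, matching the `--job literature-high` spelling.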

Copilot AI review requested due to automatic review settings May 26, 2026 04:52

Copilot started reviewing on behalf of realmarcin May 26, 2026 04:52

Copilot AI reviewed May 26, 2026

realmarcin merged commit 5e13eff into main May 26, 2026
1 check passed

realmarcin deleted the edison-integration branch May 26, 2026 05:06

realmarcin merged 2 commits into main from edison-integration
