
Edison Scientific deep-research integration via edison-client SDK #33

Merged: realmarcin merged 2 commits into main from edison-integration on May 26, 2026

@realmarcin
Contributor

Summary

Wires a programmatic path to FutureHouse / Edison Scientific deep research over CultureMech media records.

The existing scripts/research_media.py wraps deep-research-client, but DRC 0.2.4 registers only openai and cyberian as providers — Edison / PaperQA isn't accessible through it. The edison-client SDK is already in pyproject deps; this PR invokes it directly.

New file: scripts/research_media_edison.py

  • Single recipe: python scripts/research_media_edison.py --target <slug|id|path>
  • Batch: python scripts/research_media_edison.py --batch data/import_tracking/reports/edison_batch.json --limit 5
  • --job literature (paperqa3, default) | literature-high | precedent | phoenix. Aliases paperqa, paperqa-high.
  • --dry-run renders the query and prints the plan without an API call (no credits).
  • --start / --limit skip into or cap a batch list.
  • Auth: EDISON_PLATFORM_API_KEY (SDK-native) or legacy EDISON_API_KEY; auto-loads repo-root .env via python-dotenv.
  • Reuses research_media.py's template_vars / load_media / resolve_media_file so the rendered query matches the existing DRC workflow.
  • Outputs: research/media/{slug}-edison-{job}.md + sibling -meta.yaml (task_id, total_cost, status, template_vars).
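
The --job aliases and the two key variables listed above could be handled roughly as follows. This is a minimal sketch, not the script's actual code: `normalize_job`, `find_api_key`, `JOBS`, and `ALIASES` are hypothetical names, the alias targets (paperqa to literature, paperqa-high to literature-high) are inferred from the naming, and preferring EDISON_PLATFORM_API_KEY over EDISON_API_KEY is an assumption.

```python
from typing import Mapping, Optional

# Canonical --job values named in this PR.
JOBS = ("literature", "literature-high", "precedent", "phoenix")

# Assumed alias targets; the PR lists the aliases but not their mapping.
ALIASES = {"paperqa": "literature", "paperqa-high": "literature-high"}


def normalize_job(name: str = "literature") -> str:
    """Map a --job value or alias to its canonical name; exit on unknown."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if key not in JOBS:
        raise SystemExit(f"unknown --job {name!r}; choose from {', '.join(JOBS)}")
    return key


def find_api_key(env: Mapping[str, str]) -> Optional[str]:
    """Return the SDK-native key if set, else the legacy one (assumed order)."""
    return env.get("EDISON_PLATFORM_API_KEY") or env.get("EDISON_API_KEY")
```

Passing `os.environ` as `env` after the repo-root .env has been loaded would give the behavior described in the Auth bullet; taking a mapping rather than reading the environment directly keeps the lookup easy to test.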

Batch resolution fix

The 100-recipe edison_batch.json priority list was failing to resolve because research_media.py's resolve_media_file:

  • treats string file_paths as CWD-relative (not data/normalized_yaml/-relative), and
  • raises on multi-match slugs (e.g. dehalospirillum_medium exists in 5 importer-flavor variants).

The Edison batch resolver tries data/normalized_yaml/<file_path> verbatim first — unambiguous — then falls back to slug matching. Resolvability is 100% (5/5) in the smoke dry-run.
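
The two-step resolution described above can be sketched as below. This is an illustration under stated assumptions, not the code in scripts/research_media_edison.py: `resolve_batch_entry` is a hypothetical name, and slug matching is approximated here as a comparison against YAML file stems, which may differ from what resolve_media_file actually does.

```python
from pathlib import Path
from typing import Optional


def resolve_batch_entry(
    normalized_root: Path,
    file_path: Optional[str],
    recipe_name: Optional[str],
) -> Path:
    """Resolve one batch entry: exact file_path first, then slug matching."""
    # 1. file_path is relative to the normalized YAML root, so an existing
    #    file there is an unambiguous hit.
    if file_path:
        candidate = normalized_root / file_path
        if candidate.is_file():
            return candidate

    # 2. Fall back to matching the slug against YAML file stems (assumed rule).
    if recipe_name:
        matches = sorted(
            p for p in normalized_root.rglob("*.yaml") if p.stem == recipe_name
        )
        if len(matches) == 1:
            return matches[0]
        if len(matches) > 1:
            raise ValueError(f"{recipe_name!r} matches {len(matches)} files")

    raise FileNotFoundError(f"could not resolve {file_path or recipe_name!r}")
```

Trying the path first means a slug that exists in several variants only becomes an error when the batch entry carries no usable file_path.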

Justfile targets

  • just research-media-edison <target> [*args]
  • just research-media-edison-batch <batch.json> [*args] (always pass --limit N on first runs)

Also

  • .env.example added (.env itself is already gitignored).
  • data/import_tracking/reports/edison_batch.json regenerated against current corpus via analyze_media_quality.py (the March-vintage file pre-dated the snake_case + orphan-page cleanups; few entries still resolved).

Out of scope here

Live API smoke test deferred until user confirms .env has the new LBL key (2.6k credits).

Test plan

  • Dry-run single recipe (--target luria_bertani_lb_medium)
  • Dry-run batch (--limit 5) — 5/5 resolve cleanly
  • just --list shows both new targets in the Research group
  • Live test deferred (key rotation)
  • just validate-strict clean (no schema/data touched)

🤖 Generated with Claude Code

Adds a programmatic path to FutureHouse / Edison Scientific deep
research over CultureMech media records. Companion to (not replacement
for) scripts/research_media.py, which wraps deep-research-client — DRC
0.2.4 registers only `openai` and `cyberian` as providers, so Edison /
PaperQA isn't accessible through it. The edison-client SDK (already
in pyproject deps) is invoked directly here.

New: scripts/research_media_edison.py
  Single recipe:
    python scripts/research_media_edison.py --target <slug|id|path>
  Batch (priority list):
    python scripts/research_media_edison.py \
        --batch data/import_tracking/reports/edison_batch.json --limit 5

  --job: literature (paperqa3, default) | literature-high | precedent
         | phoenix. Aliases: paperqa, paperqa-high.
  --dry-run: render the query and print the plan; no API call, no
             credits spent.
  --start / --limit: skip into or cap a batch list.

  Auth picks up EDISON_PLATFORM_API_KEY (SDK-native) or EDISON_API_KEY
  (legacy alias used by research_media.py). A repo-root .env is
  auto-loaded via python-dotenv. .env.example added.

  Reuses research_media.py's template_vars / load_media / resolve
  helpers so the rendered query matches the existing DRC workflow.
  Outputs land under research/media/{slug}-edison-{job}.md plus a
  sibling -meta.yaml capturing task_id, total_cost, status, the
  rendered template variables, and the prompt size for audit.
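
The output naming described above could be computed along these lines. A minimal sketch: `output_paths` is a hypothetical helper, and the exact metadata filename ({slug}-edison-{job}-meta.yaml) is an assumption read from "sibling -meta.yaml".

```python
from pathlib import Path
from typing import Tuple


def output_paths(out_dir: Path, slug: str, job: str) -> Tuple[Path, Path]:
    """Return (markdown report path, metadata path) for one run."""
    base = f"{slug}-edison-{job}"
    # Assumed layout: the meta file sits next to the report, same base name.
    return out_dir / f"{base}.md", out_dir / f"{base}-meta.yaml"


md_path, meta_path = output_paths(
    Path("research/media"), "luria_bertani_lb_medium", "literature"
)
```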

Batch resolution gotcha (and fix):
  edison_batch.json carries `recipe_name` (slug derived from the YAML
  name field) AND `file_path` (relative to data/normalized_yaml/).
  research_media.py's resolve_media_file:
    (a) treats string paths as relative to CWD (not normalized_yaml/),
        so file_path entries miss; and
    (b) raises ValueError when a slug matches multiple files (e.g.,
        "dehalospirillum_medium" appears in 5 importer-flavor variants).
  The Edison batch resolver now first tries
  `data/normalized_yaml/<file_path>` verbatim — unambiguous — before
  falling back to slug matching. Resolvability against the freshly-
  regenerated 100-recipe batch is 100% (5/5 in --limit 5 smoke).
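
For illustration, reading the batch file and applying --start / --limit might look like the sketch below. `read_batch` is a hypothetical name (not the script's loader), and treating --start as a zero-based offset is an assumption.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


def read_batch(
    path: Path, start: int = 0, limit: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Load a batch JSON list and return the slice selected by start/limit."""
    entries = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise SystemExit(f"{path}: expected a JSON list of batch entries")
    # start skips into the list (assumed zero-based); limit caps the count.
    end = None if limit is None else start + limit
    return entries[start:end]
```

Each returned entry would then carry the `recipe_name` and `file_path` fields that the resolver consumes.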

Justfile targets:
  research-media-edison target *args=""           # single
  research-media-edison-batch batch *args=""      # batch (pass --limit!)

Out of scope here: live API smoke test deferred until user confirms
.env has the new LBL key.
Copilot AI review requested due to automatic review settings May 26, 2026 04:52

Copilot AI left a comment


Pull request overview

Adds a new Edison Scientific / FutureHouse “deep research” entrypoint for CultureMech media records by calling the edison-client SDK directly (bypassing deep-research-client provider limitations), plus convenience just targets and updated reporting/docs artifacts.

Changes:

  • Introduces scripts/research_media_edison.py to submit Edison jobs for single targets or batches, writing Markdown + YAML meta outputs.
  • Adds just research-media-edison and just research-media-edison-batch targets to run the Edison script via uv.
  • Updates the committed quality analysis report summary numbers and adds a .env.example template for Edison key setup.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| scripts/research_media_edison.py | New Edison SDK-backed research script (single + batch modes, output + meta writing). |
| project.justfile | Adds just targets to invoke the Edison research script. |
| data/import_tracking/reports/quality_analysis.md | Updates report metrics/formatting for the latest corpus analysis. |
| .env.example | Adds an example env file documenting Edison API key usage. |


Comment thread scripts/research_media_edison.py Outdated
Comment thread scripts/research_media_edison.py Outdated
Comment thread scripts/research_media_edison.py
Comment thread scripts/research_media_edison.py Outdated
Comment thread scripts/research_media_edison.py Outdated
Comment thread scripts/research_media_edison.py
Comment thread data/import_tracking/reports/quality_analysis.md Outdated
Comment thread scripts/research_media_edison.py
Eight findings, all fixed:

1. research_media_edison.py — Path.relative_to(REPO_ROOT) crashed
   --dry-run for --out-dir outside the repo. New _display_path()
   helper falls back to absolute string when the path isn't under
   REPO_ROOT.

2. Module docstring claimed --dry-run "prints the rendered query" but
   only printed paths + query_chars. --dry-run now writes the full
   meta yaml (including the rendered prompt) alongside the would-be
   md path; docstring updated to match.

3. Meta dict didn't actually contain the query referenced in the
   "prompt that was sent" doc claim. Added `query`, `media_id`, and
   `template_path` fields. Live runs gain the same fields.

4. slug_for() used the CURIE local part (009674) which diverged from
   research_media.py's stem-based naming. Switched to media_path.stem
   (e.g. luria_bertani_lb_medium) so research outputs are sortable /
   findable by recipe name. CURIE id captured in meta.media_id
   instead.

5. Filename suffix used job.name.lower() which produced
   "literature_high" while the CLI alias is --job literature-high.
   New _short_job() helper normalizes _ -> - for consistency.

6. edison-client + python-dotenv were only transitive deps via
   deep-research-client. Declared both explicitly under the dev
   extra in pyproject.toml so fresh `uv run --extra dev ...` won't
   break if the transitive ever drops out. Lockfile refreshed.

7. analyze_media_quality.py wrote the developer's absolute CWD into
   the committed report header. Now writes a repo-relative
   `Source dir: data/normalized_yaml` line; output_file used as the
   anchor for the relative-path computation.

8. tests/test_research_media_edison.py covers:
   - load_batch_targets returns recipe_name + file_path candidates
     in fall-through order
   - load_batch_targets rejects non-list JSON with SystemExit
   - _short_job emits hyphens (literature-high)
   - slug_for uses the YAML stem
   - _display_path doesn't crash on paths outside REPO_ROOT
   - resolve_job recognizes literature / paperqa / literature-high /
     paperqa-high aliases and SystemExits on unknown jobs

   7/7 passing.

just validate-strict: 0 ERROR rows / 15,827 records (no schema or
recipe-data touched).
@realmarcin realmarcin merged commit 5e13eff into main May 26, 2026
1 check passed
@realmarcin realmarcin deleted the edison-integration branch May 26, 2026 05:06
