issue-to-eval: import benchmark evals for issues #147, #148, #149 by elong0527 · Pull Request #150 · RConsortium/pharma-skills

elong0527 · 2026-06-02T13:07:03Z

Summary

Ran the issue-to-eval skill against the live GitHub issues labeled benchmark (35 total). Three new evaluation files were produced, 31 issues parse identically to what is already on disk, and one issue (#108) was skipped because its template is unfilled.
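One way a sync run could decide that an issue is already up to date is to serialize the parsed eval and compare it with the file on disk. The sketch below illustrates that idea only. `save_if_changed`, the JSON formatting options, and the "Created" and "Updated" status strings are hypothetical and are not taken from the repository's save_to_evals.

```python
import json
from pathlib import Path


def save_if_changed(path, record):
    """Write record as JSON only when the on-disk text differs.

    Hypothetical helper: the serialization settings are assumptions,
    chosen so that repeated runs produce byte-identical files.
    """
    path = Path(path)
    new_text = json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False) + "\n"

    if path.exists():
        if path.read_text(encoding="utf-8") == new_text:
            return "Skipped (up to date)"
        status = "Updated"
    else:
        status = "Created"

    # Create the target directory on first write, then persist the new text.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(new_text, encoding="utf-8")
    return status
```

Under these assumptions, a second call with an unchanged record touches nothing on disk, which is the behavior the summary describes for the issues that were already imported.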

Issues imported (new)

Issue	Skill	Title
#147	admiral-bds	ADVS derivation from pharmaversesdtm — BDS Findings (vital signs)
#148	admiral-adae	ADAE derivation from pharmaversesdtm — safety analysis
#149	admiral-bds	ADLB derivation from pharmaversesdtm — BDS Findings (laboratory values)

Skipped / no change

Already up to date (31 issues):
[benchmark][group-sequential-design] Reproduce clinical trial numbers from SAP NCT05638204 #2
[benchmark][group-sequential-design] Reproduce clinical trial numbers from SAP NCT01471522 #3
[benchmark][group_sequential_design] eval 1 #21
[benchmark][group-sequential-design] co-primary PFS and OS #22
[benchmark][group-sequential-design] multi-endpoint + multi-populations #23
[benchmark][group-sequential-design] single endpoint with non-constant hazard and non-constant HR #24
[benchmark][group-sequential-design] Group Sequential Design of PFS Endpoint #27
[benchmark][group-sequential-design] Adaptive enrichment, skill must refuse standard GSD and escalate to correct framework #36
[benchmark][group-sequential-design] NPH self-detection, immunotherapy context with no explicit delayed-effect signal #37
[benchmark][group-sequential-design] Front-loaded enrollment, information fraction vs calendar time decoupling #38
[benchmark][group-sequential-design] Competing risks CVOT, CV death endpoint with substantial non-CV death competing event #39
[benchmark][group-sequential-design] Subgroup futility interim, binding vs non-binding accounting and subgroup-specific information fraction #40
[benchmark][group-sequential-design] dry-run #60
[benchmark][skill-name] alpha split between two populations with co-primary endpoints #69
[benchmark][group-sequential-design] co-primary with multiple IAs and complicated timeline requirement #74
[benchmark][skill-name] Fixed sequence design #91
[benchmark][group-sequential-design] Multi-region trial with staggered site activation schedule #103
[benchmark][clinical-trial-simulation] Phase 3 onco simulation with NPH delayed effect and SOC sensitivity #104
[benchmark][group-sequential-design] two biomarker subpopulation nested design #107
[benchmark][group-sequential-design] Short summary of the query #109
[benchmark][group-sequential-design] ctDNA-guided randomized trial where randomized accrual is delayed #113
[benchmark][clinical-trial-simulation] Phase 3 NSCLC with binding ORR gate, PFS GSD, OS final, and adaptive duration #118
[benchmark][admiral-adsl] basic-two-arm #124
[benchmark][admiral-adsl] ADSL derivation from pharmaversesdtm — basic two-arm #126
[benchmark][r2rtf] create baseline characteristic table #128
[benchmark][admiral] simple test for derive_vars_cat #132
[benchmark][admiral] Derive Heart rate using compute_rr #133
[benchmark][admiral-adae] basic-teae: standard TEAE derivation with complete pharmaverse data #137
[benchmark][group-sequential-design] Four arm design to test superiority, dose and individual component contribution #139
[benchmark][admiral-adsl] Create ADVS ADaM and derive ADSL baseline vital sign variables #141
[benchmark][admiral-adsl] Create ADVS ADaM and derive ADSL baseline vital sign variables #142

Skipped (1 issue): [benchmark][skill-name] Short summary of the query #108. The issue body has empty Skills/Query/Expected Output/Assertions sections (template not filled out).

Empty-rubric warnings (6 issues). Each issue body has an empty ## Rubric Criteria (Assertions) section; the evals are preserved as-is, since they reflect upstream issue content:
[benchmark][group-sequential-design] Reproduce clinical trial numbers from SAP NCT01471522 #3
[benchmark][admiral-adsl] ADSL derivation from pharmaversesdtm — basic two-arm #126
[benchmark][group-sequential-design] Four arm design to test superiority, dose and individual component contribution #139
[benchmark][admiral-bds] ADVS derivation from pharmaversesdtm — BDS Findings (vital signs) #147
[benchmark][admiral-adae] ADAE derivation from pharmaversesdtm — safety analysis #148
[benchmark][admiral-bds] ADLB derivation from pharmaversesdtm — BDS Findings (laboratory values) #149
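The skip and warning conditions above both come down to finding sections of the issue body that have a heading but no content. A minimal sketch of that check follows. It assumes every template section is a level-two `## ` heading (only `## Rubric Criteria (Assertions)` is confirmed by this PR), and `split_sections` and `empty_sections` are hypothetical names, not the functions in import_issue_eval.

```python
import re

# Matches a level-two markdown heading such as "## Rubric Criteria (Assertions)".
HEADING = re.compile(r"^##\s+(.*\S)\s*$")


def split_sections(body):
    """Map each '## ' heading in an issue body to its stripped text."""
    sections = {}
    current = None
    for line in body.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def empty_sections(body):
    """Return the headings whose sections contain only whitespace."""
    return [name for name, text in split_sections(body).items() if not text]


example = (
    "## Skills\nadmiral-bds\n\n"
    "## Query\nDerive ADVS from pharmaversesdtm.\n\n"
    "## Rubric Criteria (Assertions)\n\n"
)
print(empty_sections(example))
```

With a check like this, an importer could warn when only the rubric section is empty and skip the issue when the required sections are all empty, which matches the two outcomes reported here.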

Notes

The gh CLI is unavailable in this environment, so the sync was run by fetching issue bodies through the GitHub MCP and passing them directly to parse_issue_markdown and save_to_evals from _automation/issue-to-eval/scripts/import_issue_eval. The MCP returns HTML-encoded text (for example, the entity forms of ', ", and <), so each body is normalized with html.unescape before parsing. This keeps the on-disk evals byte-identical with prior gh-based runs.
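The normalization step can be sketched with the standard library alone. `normalize_issue_body` is a hypothetical wrapper name, and the sample string is made up for illustration; only the use of `html.unescape` comes from the note above.

```python
import html


def normalize_issue_body(body):
    """Decode HTML entities so the text matches the raw markdown of the issue."""
    return html.unescape(body)


# Made-up sample of an entity-encoded issue body fragment.
encoded = "Flag records where ASTDT &gt;= TRTSDT and AESER == &quot;Y&quot; (it&#39;s a 30-day window)"
print(normalize_issue_body(encoded))
```

Text that contains no entities passes through unchanged, so the same call is safe to apply to bodies that were never encoded.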

Test plan

Re-run python3 _automation/issue-to-eval/scripts/sync_benchmarks.py after this PR merges and confirm every parseable issue reports Skipped (up to date).
_automation/evals/github-issue-147.json carries target_skills: ["admiral-bds"], language: "R", and the SYSBP/DIABP/PULSE/WEIGHT/HEIGHT/BMI parameter mapping prompt.
_automation/evals/github-issue-148.json carries target_skills: ["admiral-adae"], language: "R", and the 30-day TRTEMFL window prompt.
_automation/evals/github-issue-149.json carries target_skills: ["admiral-bds"], language: "R", and the LB-domain ADLB derivation prompt with spec-driven PARAMCD/PARAM lookup.
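The three file checks in the test plan could be scripted roughly as follows. This is a sketch under one assumption: that target_skills and language are top-level keys of each JSON file. `check_eval` and `check_files` are hypothetical helpers, and the prompt text is left for manual review because the key that holds it is not stated here.

```python
import json
from pathlib import Path

# Expected target_skills per issue, copied from the test plan.
EXPECTED_SKILLS = {
    147: ["admiral-bds"],
    148: ["admiral-adae"],
    149: ["admiral-bds"],
}


def check_eval(data, expected_skills, language="R"):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if data.get("target_skills") != expected_skills:
        problems.append(
            f"target_skills is {data.get('target_skills')!r}, expected {expected_skills!r}"
        )
    if data.get("language") != language:
        problems.append(f"language is {data.get('language')!r}, expected {language!r}")
    return problems


def check_files(evals_dir="_automation/evals"):
    """Check the three imported eval files; returns {issue_number: problems}."""
    results = {}
    for number, skills in EXPECTED_SKILLS.items():
        path = Path(evals_dir) / f"github-issue-{number}.json"
        if not path.exists():
            results[number] = [f"{path} is missing"]
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        results[number] = check_eval(data, skills)
    return results
```

Calling `check_files()` from the repository root should return an empty problem list for each of the three issues if the files match the test plan.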

Generated by Claude Code

Synced 3 new benchmark issues into _automation/evals/:
- #147: admiral-bds ADVS derivation from pharmaversesdtm (R)
- #148: admiral-adae ADAE derivation from pharmaversesdtm (R)
- #149: admiral-bds ADLB derivation from pharmaversesdtm (R)
All 31 other parseable benchmark issues already up to date; #108 skipped (unfilled template).

elong0527 wants to merge 1 commit into main from claude/funny-planck-oVtNE
