
Commit a265c38 (parent: 8831de9)

Revise LLM-assisted extraction project details

Updated the project description for LLM-assisted data extraction; added detail on methodology and expected outcomes.

1 file changed: src/pages/gsoc_ideas.mdx (+16, −9 lines)
```diff
@@ -177,13 +177,19 @@ Medium (175hr) or Large (350 hr) depending on number of deliverables
 Medium
 
 ---
-### 5. LLM-Assisted Extraction of Agronomic Experiments into BETYdb{#llm-betydb}
 
-Manual extraction of agronomic and ecological experiments from scientific literature into BETYdb is slow, error-prone, and labor-intensive. Researchers must interpret complex experimental designs, reconstruct management timelines, identify treatments and controls, handle factorial structures, and link outcomes with correct covariates and uncertainty estimates—tasks that require scientific judgment beyond simple text extraction. Current manual workflows can take hours per paper and introduce inconsistencies that compromise downstream data quality and meta-analyses.
+### 5. LLM-Assisted Extraction of Agronomic and Ecological Experiments into Structured Data {#llm-betydb}
 
-This project proposes a human-supervised, LLM-based system to accelerate BETYdb data entry while preserving scientific rigor and traceability. The system will ingest PDFs of scientific papers and produce upload-ready BETYdb entries (sites, treatments, management time series, traits, and yields) with every field labeled as extracted, inferred, or unresolved and linked to provenance evidence in the source document. The system leverages existing labeled training data (scientific papers with ground-truth BETYdb entries).
+Manual extraction of agronomic and ecological experiments from scientific literature into a structured format that can be used to calibrate and validate models is slow, error-prone, and labor-intensive. Researchers must interpret complex experimental designs, reconstruct management timelines, identify treatments and controls, handle factorial structures, and link outcomes with correct covariates and uncertainty estimates. Data are often reported as summary statistics (for example, mean and standard error) in text, tables, or figures and require additional context from disturbance or management time series. These tasks require scientific judgment beyond simple text extraction.
+
+Current manual workflows can take hours per paper and introduce inconsistencies that compromise downstream data quality and meta-analyses.
 
-The architecture follows a two-layer design: (1) a schema-validated intermediate representation (IR) preserving evidence links, confidence scores, and flagged conflicts, and (2) a BETYdb materialization layer that enforces BETYdb semantics, validation rules, and generates upload-ready CSVs or API payloads with full audit trails. Implementation is flexible—ranging from agentic LLM workflows to fine-tuned specialist models to an adaptive hybrid—and should be informed by empirical evaluation during the project.
+This project proposes a human-supervised, LLM-based system to accelerate data extraction while preserving scientific rigor and traceability. It will leverage existing labeled training data (scientific papers with ground-truth entries), including aligned PDF-to-structured-data records from BETYdb and ForC, which represent expert-curated, production-quality datasets. Combined, these resources include over 80,000 plant and ecosystem observations from more than 1,000 sources and provide high-quality supervision for extraction from text, tables, and figures. Evaluation should include held-out, out-of-sample papers. The system will ingest PDFs of scientific papers and produce tables compatible with the [spreadsheet used to upload data to BETYdb](https://docs.google.com/spreadsheets/d/e/2PACX-1vSAa7jBHSaas-bH0ARxQjVLKhz3Iq03t97wrxMZrgVVi98L5bYQi5ZUC0b57xIZBlHEkPH9qYf22xQS/pubhtml) (sites, treatments, management time series, traits+yields bulk upload table) with every field labeled as extracted, inferred, or unresolved and linked to provenance evidence in the source document.
+
+The architecture follows a two-layer design: (1) a schema-validated intermediate representation (IR) preserving evidence links, confidence scores, and flagged conflicts, and (2) a BETYdb materialization layer that enforces BETYdb semantics and validation rules and generates upload-ready CSVs or API payloads with full audit trails.
+
+Implementation is flexible—ranging from agentic LLM workflows to fine-tuned specialist models to an adaptive hybrid—and should be informed by empirical evaluation during the project.
 
 **Expected outcomes:**
 
```
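The added paragraphs specify an intermediate representation in which every field carries a label (extracted, inferred, or unresolved), a confidence score, and evidence links back to the source document. A minimal Python sketch of what one such IR record might look like; all class and attribute names here are illustrative assumptions, not part of the project:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FieldStatus(Enum):
    """Provenance label for each value (terms from the project description)."""
    EXTRACTED = "extracted"    # stated verbatim in the source document
    INFERRED = "inferred"      # derived by the model from context
    UNRESOLVED = "unresolved"  # could not be determined; needs human review


@dataclass
class Evidence:
    """Link back to the supporting span in the source PDF."""
    page: int
    quote: str


@dataclass
class IRField:
    """One value in the intermediate representation, with audit metadata."""
    name: str
    value: Optional[str]
    status: FieldStatus
    confidence: float  # model-assigned score in [0, 1]
    evidence: list = field(default_factory=list)  # list of Evidence

    def needs_review(self, threshold: float = 0.8) -> bool:
        """Flag low-confidence or unresolved fields for the human reviewer."""
        return self.status is FieldStatus.UNRESOLVED or self.confidence < threshold


# Example: a yield value extracted from a paper (values are made up)
yield_field = IRField(
    name="yield_Mg_ha",
    value="12.4",
    status=FieldStatus.EXTRACTED,
    confidence=0.93,
    evidence=[Evidence(page=4, quote="mean yield of 12.4 Mg ha-1")],
)
print(yield_field.needs_review())  # prints False
```

A materialization layer could then walk records like these, drop or escalate anything where `needs_review()` is true, and emit the remaining rows as upload-ready CSVs while keeping the `evidence` list as the audit trail.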
```diff
@@ -194,15 +200,15 @@ A successful project would complete the following tasks:
 * Independent validators for BETYdb semantics, unit consistency, temporal logic, and required fields
 * BETYdb export module producing upload-ready management CSVs and bulk trait upload formats with full provenance preservation
 * Scientist-in-the-loop review interface for approving, correcting, or rejecting extracted entries with inline evidence and confidence scores
-* Evaluation harness with automated metrics for extraction accuracy, inference quality, coverage, and time savings on held-out test papers
+* Evaluation harness with automated metrics for extraction accuracy, inference quality, coverage, and time savings relative to manual curation on held-out test papers
 * Documentation covering IR schema specification, developer guidance for adding new extraction components, and user guidance for the review interface
 
 **Prerequisites:**
 
-- Required: R Shiny, Python (familiarity with scientific literature and experimental design concepts)
-- Helpful: experience with LLM APIs (Anthropic, OpenAI) or fine-tuning frameworks, knowledge of BETYdb schema and workflows, familiarity with agronomic or ecological experimental designs
+- Required: Python; familiarity with natural language processing, information extraction, and machine learning
+- Helpful: experience with LLM APIs and fine-tuning frameworks, knowledge of BETYdb schema and workflows, familiarity with scientific writing and agronomic or ecological experimental design/analysis
 
-**Contact person:**
+**Contact persons:**
 
 Nihar Sanda (@koolgax99), David LeBauer (@dlebauer)
 
```
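The deliverables above include independent validators for required fields and unit consistency. A minimal sketch of two such checks, assuming a record is a plain dict; the field names and unit table are hypothetical, and a real system would likely use a dedicated units library:

```python
# Required fields every materialized record must carry (names are hypothetical)
REQUIRED_FIELDS = {"site", "treatment", "date", "variable", "mean"}

# Plausible units per variable; illustrative only
KNOWN_UNITS = {"yield": {"Mg/ha", "kg/ha"}, "height": {"m", "cm"}}


def validate_required(record: dict) -> list:
    """Return one error message per missing required field."""
    return [f"missing required field: {name}"
            for name in sorted(REQUIRED_FIELDS - record.keys())]


def validate_units(record: dict) -> list:
    """Check that the reported unit is plausible for the variable."""
    variable, unit = record.get("variable"), record.get("units")
    allowed = KNOWN_UNITS.get(variable)
    if allowed is not None and unit not in allowed:
        return [f"unexpected unit {unit!r} for variable {variable!r}"]
    return []


record = {"site": "EF-Maize", "treatment": "control", "date": "2019-08-01",
          "variable": "yield", "mean": 12.4, "units": "Mg/ha"}
errors = validate_required(record) + validate_units(record)
print(errors)  # prints []
```

Because each validator returns an independent list of errors, new checks (temporal logic, BETYdb semantics) can be added without touching existing ones, and the combined error list can feed the scientist-in-the-loop review interface.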
```diff
@@ -212,7 +218,8 @@ Large (350 hr)
 
 **Difficulty:**
 
-Medium to High
+High
+
 
 <!--
 
```