Skip to content

Commit cc91139

Browse files
authored
minor changes
Removed redundant text and clarify.
1 parent a265c38 commit cc91139

File tree

1 file changed

+1
-3
lines changed

1 file changed

+1
-3
lines changed

src/pages/gsoc_ideas.mdx

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -183,12 +183,10 @@ Medium
183183
Manual extraction of agronomic and ecological experiments from scientific literature into a structured format that can be used to calibrate and validate models is slow, error-prone, and labor-intensive. Researchers must interpret complex experimental designs, reconstruct management timelines, identify treatments and controls, handle factorial structures, and link outcomes with correct covariates and uncertainty estimates. Data are often reported as summary statistics (for example mean and standard error) in text, tables, or figures and require additional context from disturbance or management time series. These tasks require scientific judgment beyond simple text extraction.
184184
Current manual workflows can take hours per paper and introduce inconsistencies that compromise downstream data quality and meta-analyses.
185185

186-
This project proposes a human-supervised, LLM-based system to accelerate data extraction while preserving scientific rigor and traceability. It will leverage existing labeled training data (scientific papers with ground‑truth entries), including aligned PDF‑to‑structured‑data records from BETYdb and ForC, which represent expert‑curated, production‑quality datasets. Combined, these resources include over 80,000 plant and ecosystem observations from more than 1,000 sources and provide high-quality supervision for extraction from text, tables, and figures. Evaluation should include held-out, out-of-sample papers. The system will ingest PDFs of scientific papers and produce tables compatible with the [spreadsheet used to upload data to BETYdb](https://docs.google.com/spreadsheets/d/e/2PACX-1vSAa7jBHSaas-bH0ARxQjVLKhz3Iq03t97wrxMZrgVVi98L5bYQi5ZUC0b57xIZBlHEkPH9qYf22xQS/pubhtml) (sites, treatments, management time series, traits+yields bulk upload table) with every field labeled as extracted, inferred, or unresolved and linked to provenance evidence in the source document.
186+
This project proposes a human-supervised, LLM-based system to accelerate data extraction while preserving scientific rigor and traceability. It will leverage existing labeled training data (scientific papers with ground‑truth entries), including aligned PDF‑to‑structured‑data records from BETYdb and ForC, which represent expert‑curated, production‑quality datasets. Combined, these resources include over 80,000 plant and ecosystem observations from more than 1,000 sources andCombined, these resources include over 80,000 observations from more than 1,000 sources, providing high-quality supervision for extraction from text, tables, and figures. Evaluation should include held-out, out-of-sample papers. The system will ingest PDFs of scientific papers and produce tables compatible with the [spreadsheet used to upload data to BETYdb](https://docs.google.com/spreadsheets/d/e/2PACX-1vSAa7jBHSaas-bH0ARxQjVLKhz3Iq03t97wrxMZrgVVi98L5bYQi5ZUC0b57xIZBlHEkPH9qYf22xQS/pubhtml) (sites, treatments, management time series, traits+yields bulk upload table) with every field labeled as extracted, inferred, or unresolved and linked to provenance evidence in the source document.
187187

188188
The architecture follows a two-layer design: (1) a schema-validated intermediate representation (IR) preserving evidence links, confidence scores, and flagged conflicts, and (2) a materialization layer that enforces semantics, validation rules, and generates upload-ready CSVs or API payloads with full audit trails. Implementation is flexible—ranging from agentic LLM workflows to fine-tuned specialist models to an adaptive hybrid—and should be informed by empirical evaluation during the project.
189189

190-
The architecture follows a two‑layer design: (1) a schema‑validated intermediate representation (IR) preserving evidence links, confidence scores, and flagged conflicts, and (2) a BETYdb materialization layer that enforces BETYdb semantics and validation rules and generates upload‑ready CSVs or API payloads with full audit trails.
191-
192190
Implementation is flexible—ranging from agentic LLM workflows to fine‑tuned specialist models to an adaptive hybrid—and should be informed by empirical evaluation during the project.
193191

194192
**Expected outcomes:**

0 commit comments

Comments
 (0)