Commit 8831de9

Merge pull request #150 from koolgax99/betydb-llm-gsoc-idea
update gsoc_ideas.mdx with betydb llm idea
2 parents b267779 + 108c131 commit 8831de9

1 file changed: +158 -0 lines changed

src/pages/gsoc_ideas.mdx

Lines changed: 158 additions & 0 deletions
@@ -16,6 +16,7 @@ Below is a list of project ideas. Feel free to contact the listed mentors on Sla
2. [Benchmarking and Validation Framework](#validation)
3. [Increase PEcAn modularity](#module)
4. [Standardizing Model Couplers Across Models](#couplertools)
5. [LLM-Assisted Extraction of Agronomic Experiments into BETYdb](#llm-betydb)

---

@@ -174,3 +175,160 @@ Medium (175hr) or Large (350 hr) depending on number of deliverables
**Difficulty:**
Medium

---
### 5. LLM-Assisted Extraction of Agronomic Experiments into BETYdb{#llm-betydb}

Manual extraction of agronomic and ecological experiments from scientific literature into BETYdb is slow, error-prone, and labor-intensive. Researchers must interpret complex experimental designs, reconstruct management timelines, identify treatments and controls, handle factorial structures, and link outcomes with the correct covariates and uncertainty estimates. These tasks require scientific judgment beyond simple text extraction. Current manual workflows can take hours per paper and introduce inconsistencies that compromise downstream data quality and meta-analyses.

This project proposes a human-supervised, LLM-based system to accelerate BETYdb data entry while preserving scientific rigor and traceability. The system will ingest PDFs of scientific papers and produce upload-ready BETYdb entries (sites, treatments, management time series, traits, and yields), with every field labeled as extracted, inferred, or unresolved and linked to provenance evidence in the source document. It will leverage existing labeled training data: scientific papers with ground-truth BETYdb entries.
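
One way to picture the field-level labeling described above is a small record type that carries a value together with its status and supporting evidence. The sketch below is illustrative only: `FieldStatus`, `Evidence`, and `IRField` are hypothetical names, not part of BETYdb or any existing PEcAn package, and the example values are invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FieldStatus(Enum):
    EXTRACTED = "extracted"    # stated explicitly in the paper
    INFERRED = "inferred"      # derived from context, needs closer review
    UNRESOLVED = "unresolved"  # could not be determined from the paper


@dataclass(frozen=True)
class Evidence:
    page: int      # 1-based page number in the source PDF
    snippet: str   # verbatim text supporting the value


@dataclass(frozen=True)
class IRField:
    name: str
    value: Optional[str]
    status: FieldStatus
    evidence: Optional[Evidence] = None
    confidence: Optional[float] = None

    def __post_init__(self):
        # Unresolved fields carry no value; every other field must cite evidence.
        if self.status is FieldStatus.UNRESOLVED:
            if self.value is not None:
                raise ValueError(f"{self.name}: unresolved field must not have a value")
        elif self.evidence is None:
            raise ValueError(f"{self.name}: {self.status.value} field needs evidence")


# Invented example values, for illustration only.
planting_date = IRField(
    name="planting_date",
    value="2015-05-12",
    status=FieldStatus.EXTRACTED,
    evidence=Evidence(page=3, snippet="Maize was planted on 12 May 2015."),
    confidence=0.97,
)
cultivar = IRField(name="cultivar", value=None, status=FieldStatus.UNRESOLVED)
```

Enforcing the evidence rule at construction time means a reviewer would never see an extracted or inferred value without a pointer back to the source text.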

The architecture follows a two-layer design: (1) a schema-validated intermediate representation (IR) that preserves evidence links, confidence scores, and flagged conflicts, and (2) a BETYdb materialization layer that enforces BETYdb semantics and validation rules and generates upload-ready CSVs or API payloads with full audit trails. The implementation approach is flexible, ranging from agentic LLM workflows to fine-tuned specialist models to an adaptive hybrid, and should be informed by empirical evaluation during the project.
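
As a rough sketch of the second layer, the snippet below turns already-validated IR records into CSV text and sends incomplete records to a review queue instead of exporting them. The column names, the `materialize` function, and the sample records are all assumptions made for illustration; the real BETYdb upload templates define their own columns and rules.

```python
import csv
import io

# Hypothetical column set; the real BETYdb upload templates define their own.
COLUMNS = ["site", "treatment", "date", "variable", "value", "units"]


def materialize(records):
    """Return CSV text for complete records plus a review queue for the rest."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    needs_review = []
    for record in records:
        missing = [c for c in COLUMNS if record.get(c) in (None, "")]
        if missing:
            # Incomplete records go back to a human reviewer, never into the CSV.
            needs_review.append((record, missing))
        else:
            writer.writerow({c: record[c] for c in COLUMNS})
    return buffer.getvalue(), needs_review


# Invented sample records, for illustration only.
records = [
    {"site": "Site A", "treatment": "control", "date": "2015-10-01",
     "variable": "yield", "value": 9.8, "units": "Mg/ha"},
    {"site": "Site A", "treatment": "high N", "date": None,
     "variable": "yield", "value": 11.2, "units": "Mg/ha"},
]
csv_text, needs_review = materialize(records)
```

Keeping the export step separate from extraction lets the same IR feed either CSV files or API payloads without re-running the LLM.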

**Expected outcomes:**

A successful project would complete the following tasks:

* IR schema definition with validation rules and documented field semantics covering sites, treatments, managements, and traits/yields
* Modular extraction pipeline for document parsing, information extraction, and IR generation with clear separation between extraction and validation logic
* Independent validators for BETYdb semantics, unit consistency, temporal logic, and required fields
* BETYdb export module producing upload-ready management CSVs and bulk trait upload formats with full provenance preservation
* Scientist-in-the-loop review interface for approving, correcting, or rejecting extracted entries with inline evidence and confidence scores
* Evaluation harness with automated metrics for extraction accuracy, inference quality, coverage, and time savings on held-out test papers
* Documentation covering IR schema specification, developer guidance for adding new extraction components, and user guidance for the review interface
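
The independent validators listed above could take the shape of small functions that each inspect one concern and return a list of issues. This is a minimal sketch under stated assumptions: the required-field list, the unit whitelist, and all function names are hypothetical, and real rules would have to come from the BETYdb schema.

```python
from datetime import date

# Hypothetical rules; real ones would be derived from the BETYdb schema.
ALLOWED_UNITS = {"yield": {"Mg/ha", "kg/ha"}}
REQUIRED = ("site", "treatment", "variable", "value", "units")


def check_required(record):
    return [f"missing required field: {k}" for k in REQUIRED if record.get(k) in (None, "")]


def check_units(record):
    allowed = ALLOWED_UNITS.get(record.get("variable"))
    if allowed is not None and record.get("units") not in allowed:
        return [f"unexpected units {record.get('units')!r} for {record['variable']}"]
    return []


def check_temporal(record):
    planted, harvested = record.get("planting_date"), record.get("harvest_date")
    if planted and harvested and harvested < planted:
        return ["harvest_date precedes planting_date"]
    return []


VALIDATORS = (check_required, check_units, check_temporal)


def validate(record):
    """Run every validator and collect all issues instead of stopping at the first."""
    return [issue for check in VALIDATORS for issue in check(record)]


# Invented record with two deliberate problems, for illustration only.
record = {
    "site": "Site A", "treatment": "control", "variable": "yield",
    "value": 9.8, "units": "bushels",
    "planting_date": date(2015, 5, 12), "harvest_date": date(2015, 4, 1),
}
issues = validate(record)
```

Because each validator is independent, new checks can be added to `VALIDATORS` without touching the extraction code, and the review interface can show every issue for a record at once.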

**Prerequisites:**

- Required: R Shiny, Python, and familiarity with scientific literature and experimental design concepts
- Helpful: experience with LLM APIs (Anthropic, OpenAI) or fine-tuning frameworks, knowledge of the BETYdb schema and workflows, familiarity with agronomic or ecological experimental designs

**Contact person:**

Nihar Sanda (@koolgax99), David LeBauer (@dlebauer)

**Duration:**

Large (350 hr)

**Difficulty:**

Medium to High

<!--

# This comment section is for ideas that may be viable in the future (with revision)

---

### 4. Development of Notebook-based PEcAn Workflows{#notebook}

The PEcAn workflow is currently run using a web-based user interface, an API, or custom R scripts. The web-based user interface is the easiest to use but has limited functionality, whereas the custom R scripts and API are more flexible but require more experience.

This project will focus on building Quarto notebooks that provide an interface to PEcAn that is both welcoming to new users and flexible enough to be a starting point for more advanced users. It will build on the existing [Pull Request 1733](https://github.com/PecanProject/pecan/pull/1733).

**Expected outcomes:**

- Two or more template workflows for running the PEcAn workflow.
- Written vignette and video tutorial introducing their use.

**Prerequisites:**

- Familiarity with R.
- Familiarity with RStudio and Quarto or R Markdown is a plus.

**Contact person:**
David LeBauer @dlebauer, Nihar Sanda @koolgax99

**Duration:**
Medium (175hr)

**Difficulty:**
Medium

#### BETYdb R data package

BETYdb's web front end is built on a version of Ruby on Rails that is functional but no longer supported. A key feature of BETYdb is that the data is open and accessible.

Building an R data package would make the Trait and Yield data currently in BETYdb more accessible to users beyond the PEcAn community.

**Expected outcomes:**

A successful project would complete a subset of the following tasks:

- An R package containing the data currently hosted in BETYdb.
- Documentation and examples of use.
- Updates to BETYdb documentation.

**Prerequisites:**

- Required: R
- Helpful: R package development; familiarity with relational databases and SQL.

**Contact person:**

David LeBauer (@dlebauer)

**Duration:**

Medium (175hr) to Large (350hr) depending on scope of proposal.

**Difficulty:**

Medium

---

#### [Optimize PEcAn for freestanding use of single packages [R package development]](#freestanding)

PEcAn was designed as a system of independent modules, each implemented as its own R package that was intended to be usable either standalone or as part of the full PEcAn system. Subsequent development focused on the most common cross-module workflows has led to tighter coupling between modules than was originally intended, so that in practice many of the modules are now challenging to use, test, or develop without a full understanding of their interdependencies. Further, some packages expect inputs and outputs in data structures that are only generated by other PEcAn packages but might be more easily provided in standard well-known formats. We seek proposals to re-loosen these couplings by revisiting the design and interface of PEcAn packages through one or more of:

1. Refactoring code to remove unneeded dependencies, simplify package interfaces, and exchange data in standard formats
2. Identifying exported functions that are not core to the functionality of the package, and removing them or making them internal
3. Writing tests and examples that demonstrate freestanding use of the package
4. Developing methods for tracking the dependencies between packages that cannot be eliminated, including how these change between package versions

Proposals for this project should choose a subset of these approaches and apply them to a specified subset of the PEcAn packages. Strong proposals will clearly show why each chosen package should be a priority, how it will become more independent at the completion of the project, and what interface changes will be needed to achieve this.

**Expected outcome:**

- One or more PEcAn packages can be installed, used, and/or tested without the user needing to know [something previously important] about [another package].

**Prerequisites:**

- Familiarity with R, especially how it manages dependencies between packages, and with concepts of software package development. Helpful resources: [rOpenSci packages](https://devguide.ropensci.org/index.html) and [R packages](https://r-pkgs.org). Experience with multi-package code bases will be very helpful.

**Contact person:**
Chris Black @infotroph, Shashank Singh @moki1202

**Duration:**
Flexible to work as either a Medium (175 hr) or Large (350 hr) project

**Difficulty:**
Medium, Large

---

#### [PEcAn model coupling and development [Data Science]](#coupling)

PEcAn can interface with multiple ecological models. The goal of this project is to improve the coupling of existing models to PEcAn (specifically FATES) and to add new models (specifically a simple vegetation model that is under development). It is also possible to contribute to the development of the simple vegetation model, which is written in Fortran.

**Expected outcome:**

- New or improved PEcAn model packages.

**Prerequisites:**

- R; Fortran is an advantage.

**Contact person:**
Hui Tang @Hui Tang, Istem Fer @istfer

**Duration:**
Flexible to work as either a Medium (175 hr) or Large (350 hr) project

**Difficulty:**
Medium

---
-->
