Week 1: Document Understanding Layer by Abhishek-Kumar-Rai5 · Pull Request #2 · PecanProject/sage

Abhishek-Kumar-Rai5 · 2026-06-17T16:01:29Z

Week 1: Document Understanding Layer

Overview

This pull request contains the work for the first milestone of the BETYdb Document Understanding Layer. This phase establishes the structural foundation of the extraction pipeline and does not yet implement scientific information extraction. The goal is a reliable, deterministic, and maintainable document representation that all later stages of the project can depend on.

Marker Evaluation

The work began with an empirical evaluation of Marker using multiple scientific papers with different layouts and formatting styles. Instead of designing the architecture around assumptions, the implementation was driven by observations from real Marker outputs.

This analysis documented how Marker represents sections, tables, figures, captions, equations, references, footnotes, reading order, and page structure, while also identifying several structural inconsistencies that the downstream pipeline must handle. These findings formed the basis for all subsequent architectural decisions.

Raw Marker Model

Using the observations from the evaluation, a complete Raw Marker Model was implemented using Pydantic v2.

The purpose of this layer is to provide a lossless and immutable representation of Marker output. Every field produced by Marker is preserved exactly as received, without normalization or interpretation. Keeping this layer separate from later processing stages provides a stable interface between Marker and the remainder of the extraction pipeline while simplifying future compatibility and regression testing.
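As a rough illustration of what "lossless and immutable" can look like in Pydantic v2, here is a minimal sketch. The class name `RawMarkerBlock`, its fields, and the sample payload are hypothetical and are not taken from this pull request or from Marker's actual schema; only `ConfigDict(frozen=True, extra="allow")`, `model_validate`, and `model_dump` are real Pydantic v2 features.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class RawMarkerBlock(BaseModel):
    """Hypothetical raw block; the field names are illustrative only."""

    # frozen=True rejects attribute reassignment after construction.
    # extra="allow" keeps fields the model does not declare, so nothing is dropped.
    model_config = ConfigDict(frozen=True, extra="allow")

    block_type: str
    html: str = ""


# Made-up payload standing in for one block of Marker output.
payload = {
    "block_type": "Table",
    "html": "<table></table>",
    "polygon": [[0, 0], [10, 0], [10, 5], [0, 5]],  # not declared on the model
}

raw = RawMarkerBlock.model_validate(payload)

# Lossless: dumping the model gives back every field that came in.
round_tripped = raw.model_dump()

# Immutable: reassigning a field raises instead of mutating the model.
try:
    raw.block_type = "Figure"
    mutated = True
except ValidationError:
    mutated = False
```

Note that `frozen=True` only blocks reassignment of attributes; nested lists such as `polygon` above are not deep-frozen, so a stricter design might convert them to tuples.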

Document Schema Specification

Before implementation, a detailed Document Schema Specification was developed to serve as the engineering contract for the Document layer.

The specification defines every structural object in the document hierarchy (documents, pages, sections, paragraphs, tables, figures, equations, references, captions, provenance, metadata, and statistics). For each object it also defines relationships, validation rules, serialization behaviour, the deterministic identifier strategy, and structural invariants.

The schema intentionally models only document structure and avoids introducing scientific semantics, leaving biological concepts and BETYdb-specific entities to later stages of the pipeline.
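The description does not say how deterministic identifiers are derived. One common approach, shown here purely as an assumption, is to hash a canonical string built from stable structural inputs. The function name `make_element_id` and its parameters (`doc_key`, `kind`, `path`) are hypothetical.

```python
import hashlib


def make_element_id(doc_key: str, kind: str, path: tuple[int, ...]) -> str:
    """Derive a stable identifier from structural position (assumed scheme).

    doc_key: a stable key for the source document, e.g. a content hash.
    kind:    the element type, e.g. "table" or "paragraph".
    path:    the element's index path within the hierarchy.
    """
    # Build one canonical string so the same inputs always hash the same way.
    canonical = "/".join([doc_key, kind, *(str(index) for index in path)])
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # A short prefix keeps identifiers readable; 16 hex characters is an arbitrary choice.
    return f"{kind}-{digest[:16]}"


first = make_element_id("doc-abc", "table", (2, 0))
again = make_element_id("doc-abc", "table", (2, 0))
other = make_element_id("doc-abc", "table", (2, 1))
```

Because the identifier depends only on its inputs, repeated runs over the same document yield the same values, which is what makes serialized output comparable across runs.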

Document Object Implementation

Following the finalized specification, the complete Document Object hierarchy was implemented using immutable Pydantic models.

The implementation provides strongly typed models for all structural document elements while preserving reading order, provenance information, recursive document hierarchy, deterministic identifiers, and reproducible serialization. Particular care was taken to ensure that every structural object can be traced back to the original Marker output without introducing any scientific interpretation at this stage.
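A minimal sketch of how an immutable, recursive hierarchy with provenance might be expressed in Pydantic v2. All class and field names (`Frozen`, `Provenance`, `Paragraph`, `Section`, `raw_block_id`, and so on) are hypothetical and are not the models in this pull request. The traversal also assumes that a section's own paragraphs precede its subsections in reading order.

```python
from pydantic import BaseModel, ConfigDict


class Frozen(BaseModel):
    """Shared base so every model below is immutable."""

    model_config = ConfigDict(frozen=True)


class Provenance(Frozen):
    page_index: int
    raw_block_id: str  # hypothetical pointer back to the raw layer


class Paragraph(Frozen):
    id: str
    text: str
    provenance: Provenance


class Section(Frozen):
    id: str
    title: str
    # Tuples instead of lists so the containers themselves cannot be mutated.
    paragraphs: tuple[Paragraph, ...] = ()
    subsections: tuple["Section", ...] = ()


def iter_paragraphs(section):
    """Yield paragraphs depth first: own paragraphs, then each subsection."""
    yield from section.paragraphs
    for child in section.subsections:
        yield from iter_paragraphs(child)


# Made-up content for illustration.
methods = Section(
    id="sec-2",
    title="Methods",
    paragraphs=(
        Paragraph(id="p-2", text="Plots were sampled.",
                  provenance=Provenance(page_index=1, raw_block_id="raw-7")),
    ),
)
root = Section(
    id="sec-1",
    title="Introduction",
    paragraphs=(
        Paragraph(id="p-1", text="Background.",
                  provenance=Provenance(page_index=0, raw_block_id="raw-3")),
    ),
    subsections=(methods,),
)

reading_order = [paragraph.id for paragraph in iter_paragraphs(root)]
restored = Section.model_validate_json(root.model_dump_json())
```

Here `restored` compares equal to `root`, which is the kind of round-trip property a reproducible serialization is expected to satisfy.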

Testing

A comprehensive unit test suite accompanies the implementation.

The tests validate the Raw Marker Model as well as the Document Object implementation, covering model construction, validation, serialization, deterministic identifier generation, provenance handling, recursive structures, document invariants, and round-trip consistency against real Marker-generated outputs.
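The specific invariants tested are not listed in the description. As one assumed example, a test could check that identifiers are unique across a serialized document. The helper names `collect_ids` and `duplicate_ids`, the `"id"` key, and the sample document are all hypothetical.

```python
from collections import Counter


def collect_ids(node):
    """Return every "id" value found in a nested dict/list structure."""
    found = []
    if isinstance(node, dict):
        if "id" in node:
            found.append(node["id"])
        for value in node.values():
            found.extend(collect_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_ids(item))
    return found


def duplicate_ids(document):
    """Return the sorted identifiers that occur more than once."""
    counts = Counter(collect_ids(document))
    return sorted(key for key, count in counts.items() if count > 1)


def test_identifiers_are_unique():
    # Made-up serialized document; the real suite uses Marker-generated outputs.
    document = {
        "id": "doc-1",
        "sections": [
            {"id": "sec-1", "paragraphs": [{"id": "p-1"}, {"id": "p-2"}]},
            {"id": "sec-2", "paragraphs": [{"id": "p-3"}]},
        ],
    }
    assert duplicate_ids(document) == []
```

A function named with the `test_` prefix like this one is picked up by pytest automatically, though the description does not state which test runner the project uses.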

Current Scope

This pull request completes the structural foundation of the Document Understanding Layer and includes:

Empirical evaluation of Marker outputs
Raw Marker Model implementation
Document Schema Specification
Complete Document Object implementation
Unit tests for both layers

The following components are intentionally outside the scope of this milestone and will be implemented in subsequent phases:

Normalizer
Retrieval API
Scientific Extraction
Intermediate Representation (IR)
Validation
Review
Export

Next Steps

With the Raw Marker Model and the Document Object now established, the next milestone will focus on implementing the Normalizer. This component will be responsible for transforming the raw Marker representation into the canonical Document Object while preserving provenance, reading order, and structural fidelity. The completed Document layer will then serve as the foundation for retrieval, extraction, validation, and export in the later stages of the project.

0c810f3

Abhishek-Kumar-Rai5 wants to merge 1 commit into PecanProject:main from Abhishek-Kumar-Rai5:gsoc/week1-document-layer
