Week 1: Document Understanding Layer #2
Open
Abhishek-Kumar-Rai5 wants to merge 1 commit into
Week 1: Document Understanding Layer
Overview
This pull request contains the work completed for the first milestone of the BETYdb Document Understanding Layer. The focus of this phase was to establish the structural foundation of the extraction pipeline rather than implementing scientific information extraction. The goal was to build a reliable, deterministic, and maintainable document representation that all later stages of the project can depend on.
Marker Evaluation
The work began with an empirical evaluation of Marker using multiple scientific papers with different layouts and formatting styles. Instead of designing the architecture around assumptions, the implementation was driven by observations from real Marker outputs.
This analysis documented how Marker represents sections, tables, figures, captions, equations, references, footnotes, reading order, and page structure, while also identifying several structural inconsistencies that the downstream pipeline must handle. These findings formed the basis for all subsequent architectural decisions.
Raw Marker Model
Using the observations from the evaluation, a complete Raw Marker Model was implemented using Pydantic v2.
The purpose of this layer is to provide a lossless and immutable representation of Marker output. Every field produced by Marker is preserved exactly as received, without normalization or interpretation. Keeping this layer separate from later processing stages provides a stable interface between Marker and the remainder of the extraction pipeline while simplifying future compatibility and regression testing.
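As a rough illustration of the idea, a lossless and immutable raw layer could be sketched in Pydantic v2 as follows. The class names (`RawBlock`, `RawMarkerDocument`) and field names (`block_type`, `html`, `unknown_field`) are hypothetical and chosen only for this example; the models in this pull request and the fields Marker actually emits may differ.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class RawBlock(BaseModel):
    """Hypothetical raw block: frozen, and unknown fields are kept, not dropped."""

    model_config = ConfigDict(frozen=True, extra="allow")

    block_type: str
    html: str = ""


class RawMarkerDocument(BaseModel):
    """Hypothetical top-level container for raw blocks."""

    model_config = ConfigDict(frozen=True, extra="allow")

    blocks: tuple[RawBlock, ...] = ()


# Illustrative payload, not real Marker output.
payload = {
    "blocks": [
        {"block_type": "SectionHeader", "html": "<h1>Methods</h1>", "unknown_field": 7}
    ]
}
raw = RawMarkerDocument.model_validate(payload)

# Fields the model does not declare are still available after validation.
preserved = raw.blocks[0].model_extra

# Assigning to a field of a frozen model raises a ValidationError.
try:
    raw.blocks[0].html = "<h1>Changed</h1>"
    mutation_blocked = False
except ValidationError:
    mutation_blocked = True
```

With `extra="allow"`, a field that a newer Marker version adds is carried through instead of being silently discarded, which is one way to keep the layer lossless.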
Document Schema Specification
Before implementation, a detailed Document Schema Specification was developed to serve as the engineering contract for the Document layer.
The specification defines every structural object in the document hierarchy—including documents, pages, sections, paragraphs, tables, figures, equations, references, captions, provenance, metadata, and statistics—along with their relationships, validation rules, serialization behaviour, deterministic identifier strategy, and structural invariants.
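The identifier strategy itself is not spelled out in this description. One common approach, assumed here purely for illustration, derives each identifier from stable inputs such as the element kind and its position in the hierarchy, so that re-running the pipeline on the same input yields the same identifiers. The helper name `make_element_id` and its parameters are hypothetical.

```python
import hashlib
import json


def make_element_id(doc_key: str, kind: str, path: tuple[int, ...]) -> str:
    """Hypothetical helper: identical inputs always produce the identical id."""
    # Canonical JSON (sorted keys, fixed separators) keeps the hashed bytes stable.
    canonical = json.dumps(
        {"doc": doc_key, "kind": kind, "path": list(path)},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{kind}-{digest[:12]}"


first = make_element_id("paper-a", "table", (2, 0))
again = make_element_id("paper-a", "table", (2, 0))
other = make_element_id("paper-a", "table", (2, 1))
```

Here `first` and `again` are equal, while `other` differs because its path differs. Nothing in the helper depends on time, randomness, or memory addresses.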
The schema intentionally models only document structure and avoids introducing scientific semantics, leaving biological concepts and BETYdb-specific entities to later stages of the pipeline.
Document Object Implementation
Following the finalized specification, the complete Document Object hierarchy was implemented using immutable Pydantic models.
The implementation provides strongly typed models for all structural document elements while preserving reading order, provenance information, recursive document hierarchy, deterministic identifiers, and reproducible serialization. Particular care was taken to ensure that every structural object can be traced back to the original Marker output without introducing any scientific interpretation at this stage.
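A minimal sketch of what an immutable, recursive hierarchy with provenance might look like is given below. All class and field names (`Provenance`, `Paragraph`, `Section`, `raw_block_id`, and so on) are assumptions for the example rather than the names used in this pull request, and the depth-first traversal is only one plausible definition of reading order.

```python
from pydantic import BaseModel, ConfigDict


class Provenance(BaseModel):
    """Hypothetical pointer back to the raw Marker block an element came from."""

    model_config = ConfigDict(frozen=True)

    page_index: int
    raw_block_id: str


class Paragraph(BaseModel):
    model_config = ConfigDict(frozen=True)

    id: str
    text: str
    provenance: Provenance


class Section(BaseModel):
    model_config = ConfigDict(frozen=True)

    id: str
    title: str
    provenance: Provenance
    # Tuples keep their order and cannot be mutated in place.
    paragraphs: tuple[Paragraph, ...] = ()
    subsections: tuple["Section", ...] = ()


def iter_paragraphs(section: Section):
    """Yield paragraphs depth-first: a section's own paragraphs, then its subsections."""
    yield from section.paragraphs
    for child in section.subsections:
        yield from iter_paragraphs(child)


site = Section(
    id="sec-1.1",
    title="Site",
    provenance=Provenance(page_index=1, raw_block_id="raw-3"),
    paragraphs=(
        Paragraph(
            id="par-2",
            text="The site is flat.",
            provenance=Provenance(page_index=1, raw_block_id="raw-4"),
        ),
    ),
)
methods = Section(
    id="sec-1",
    title="Methods",
    provenance=Provenance(page_index=0, raw_block_id="raw-1"),
    paragraphs=(
        Paragraph(
            id="par-1",
            text="Plots were sampled.",
            provenance=Provenance(page_index=0, raw_block_id="raw-2"),
        ),
    ),
    subsections=(site,),
)
ordered_ids = [p.id for p in iter_paragraphs(methods)]
```

Every element carries its own `Provenance`, so any paragraph reached by the traversal can be traced back to a raw block without consulting its parent.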
Testing
A comprehensive unit test suite accompanies the implementation.
The tests validate the Raw Marker Model as well as the Document Object implementation, covering model construction, validation, serialization, deterministic identifier generation, provenance handling, recursive structures, document invariants, and round-trip consistency against real Marker-generated outputs.
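A round-trip check of the kind described could look roughly like the sketch below. The `Caption` model and the `check_round_trip` helper are hypothetical stand-ins, and the payload is invented for the example rather than taken from real Marker output.

```python
from pydantic import BaseModel, ConfigDict


class Caption(BaseModel):
    """Hypothetical stand-in for one of the Document models."""

    model_config = ConfigDict(frozen=True)

    id: str
    text: str
    page_index: int


def check_round_trip(model: BaseModel) -> None:
    """Serialize to JSON, parse back, and require equality and identical output."""
    encoded = model.model_dump_json()
    restored = type(model).model_validate_json(encoded)
    assert restored == model
    # Serializing the restored object again must reproduce the same JSON string.
    assert restored.model_dump_json() == encoded


caption = Caption(id="cap-1", text="Figure 1. Site map.", page_index=3)
check_round_trip(caption)
```

The second assertion is the stricter one: it fails if serialization depends on anything other than the model's contents.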
Current Scope
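This pull request completes the structural foundation of the Document Understanding Layer and includes the Marker evaluation, the Raw Marker Model, the Document Schema Specification, the Document Object implementation, and the accompanying unit test suite.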
The Normalizer and the later retrieval, extraction, validation, and export stages are intentionally outside the scope of this milestone and will be implemented in subsequent phases.
Next Steps
With the Raw Marker Model and the Document Object now established, the next milestone will focus on implementing the Normalizer. This component will be responsible for transforming the raw Marker representation into the canonical Document Object while preserving provenance, reading order, and structural fidelity. The completed Document layer will then serve as the foundation for retrieval, extraction, validation, and export in the later stages of the project.