Week 1: Document Understanding Layer #2

Open
Abhishek-Kumar-Rai5 wants to merge 1 commit into PecanProject:main from Abhishek-Kumar-Rai5:gsoc/week1-document-layer

@Abhishek-Kumar-Rai5 (Collaborator)

Week 1: Document Understanding Layer

Overview

This pull request contains the work completed for the first milestone of the BETYdb Document Understanding Layer. The focus of this phase was to establish the structural foundation of the extraction pipeline rather than implementing scientific information extraction. The goal was to build a reliable, deterministic, and maintainable document representation that all later stages of the project can depend on.


Marker Evaluation

The work began with an empirical evaluation of Marker using multiple scientific papers with different layouts and formatting styles. Instead of designing the architecture around assumptions, the implementation was driven by observations from real Marker outputs.

This analysis documented how Marker represents sections, tables, figures, captions, equations, references, footnotes, reading order, and page structure, while also identifying several structural inconsistencies that the downstream pipeline must handle. These findings formed the basis for all subsequent architectural decisions.


Raw Marker Model

Using the observations from the evaluation, a complete Raw Marker Model was implemented using Pydantic v2.

The purpose of this layer is to provide a lossless and immutable representation of Marker output. Every field produced by Marker is preserved exactly as received, without normalization or interpretation. Keeping this layer separate from later processing stages provides a stable interface between Marker and the remainder of the extraction pipeline while simplifying future compatibility and regression testing.
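A minimal sketch of what a lossless, immutable raw model could look like in Pydantic v2. The class name `RawMarkerBlock` and its fields are hypothetical and chosen only for illustration; they are not the models in this pull request and do not claim to match Marker's actual output schema. The sketch assumes `frozen=True` for immutability and `extra="allow"` so that undeclared fields are kept rather than dropped.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class RawMarkerBlock(BaseModel):
    # Hypothetical model: field names are illustrative, not Marker's actual schema.
    # frozen=True rejects attribute assignment after construction.
    # extra="allow" keeps keys the model does not declare instead of discarding them.
    model_config = ConfigDict(frozen=True, extra="allow")

    block_type: str
    html: str = ""


# Illustrative payload; "polygon" is deliberately not declared on the model.
payload = {
    "block_type": "SectionHeader",
    "html": "<h2>Methods</h2>",
    "polygon": [[0, 0], [10, 0]],
}

block = RawMarkerBlock.model_validate(payload)

# The undeclared key survives validation and reappears in the dump.
round_tripped = block.model_dump()

# Assigning to a field on a frozen model raises a ValidationError.
try:
    block.html = "changed"
    mutated = True
except ValidationError:
    mutated = False
```

Note that `frozen=True` only blocks reassignment of fields; a mutable value such as the `polygon` list can still be changed in place, so a stricter model would likely store sequences as tuples.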


Document Schema Specification

Before implementation, a detailed Document Schema Specification was developed to serve as the engineering contract for the Document layer.

The specification defines every structural object in the document hierarchy (documents, pages, sections, paragraphs, tables, figures, equations, references, captions, provenance, metadata, and statistics), along with their relationships, validation rules, serialization behaviour, deterministic identifier strategy, and structural invariants.

The schema intentionally models only document structure and avoids introducing scientific semantics, leaving biological concepts and BETYdb-specific entities to later stages of the pipeline.
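The deterministic identifier strategy mentioned in the specification is not spelled out in this description. One possible scheme, shown here purely as an assumption, derives an identifier by hashing the element kind, a document key, and the element's position path, so the same input always yields the same identifier. The function name `deterministic_id` and its parameters are hypothetical.

```python
import hashlib


def deterministic_id(kind: str, document_key: str, path: tuple[int, ...]) -> str:
    """Hypothetical ID scheme: hash the element kind, a document key, and a position path."""
    # Build one canonical string so identical inputs always hash identically.
    canonical = "|".join([kind, document_key, ".".join(str(i) for i in path)])
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Keep a short, readable prefix plus a truncated digest.
    return f"{kind}-{digest[:16]}"


# Same inputs give the same identifier; a different position gives a different one.
first = deterministic_id("paragraph", "paper-001", (2, 0, 3))
second = deterministic_id("paragraph", "paper-001", (2, 0, 3))
sibling = deterministic_id("paragraph", "paper-001", (2, 0, 4))
```

Because nothing here depends on timestamps, random values, or object addresses, re-running the pipeline on the same document would reproduce the same identifiers, which is what makes regression comparisons meaningful.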


Document Object Implementation

Following the finalized specification, the complete Document Object hierarchy was implemented using immutable Pydantic models.

The implementation provides strongly typed models for all structural document elements while preserving reading order, provenance information, recursive document hierarchy, deterministic identifiers, and reproducible serialization. Particular care was taken to ensure that every structural object can be traced back to the original Marker output without introducing any scientific interpretation at this stage.
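As a rough illustration of a recursive, immutable hierarchy that keeps reading order and provenance, the following sketch uses two hypothetical models, `Section` and `Provenance`. The field names and the `raw_block_id` values are assumptions made for the example and are not taken from the implementation in this pull request.

```python
from pydantic import BaseModel, ConfigDict


class Provenance(BaseModel):
    # Hypothetical fields pointing back at the raw Marker output.
    model_config = ConfigDict(frozen=True)

    page_index: int
    raw_block_id: str


class Section(BaseModel):
    # Hypothetical structural model; it carries no scientific semantics.
    model_config = ConfigDict(frozen=True)

    id: str
    title: str
    provenance: Provenance
    # Tuples keep reading order and, unlike lists, cannot be mutated in place.
    subsections: tuple["Section", ...] = ()


def titles_in_reading_order(section: Section) -> list[str]:
    """Depth-first walk: a section's title, then its subsections in order."""
    titles = [section.title]
    for child in section.subsections:
        titles.extend(titles_in_reading_order(child))
    return titles


methods = Section(
    id="sec-2",
    title="Methods",
    provenance=Provenance(page_index=1, raw_block_id="raw-block-10"),
    subsections=(
        Section(
            id="sec-2-1",
            title="Study site",
            provenance=Provenance(page_index=1, raw_block_id="raw-block-11"),
        ),
        Section(
            id="sec-2-2",
            title="Sampling",
            provenance=Provenance(page_index=2, raw_block_id="raw-block-17"),
        ),
    ),
)

order = titles_in_reading_order(methods)
```

Every node carries its own `provenance`, so any section reached during the walk can be traced back to a raw block without consulting its parent.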


Testing

A comprehensive unit test suite accompanies the implementation.

The tests validate the Raw Marker Model as well as the Document Object implementation, covering model construction, validation, serialization, deterministic identifier generation, provenance handling, recursive structures, document invariants, and round-trip consistency against real Marker-generated outputs.
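A round-trip consistency test of the kind described above might look like the following sketch. The `RawPage` model and the inline `FIXTURE` string are hypothetical stand-ins: the actual suite runs against real Marker-generated outputs, which are not reproduced here, and may use a different test framework.

```python
import json
import unittest

from pydantic import BaseModel, ConfigDict


class RawPage(BaseModel):
    # Hypothetical stand-in for a Raw Marker Model class.
    model_config = ConfigDict(frozen=True, extra="allow")

    page_id: int
    block_types: tuple[str, ...] = ()


# Inline stand-in for a fixture file; "bbox" is not declared on the model.
FIXTURE = (
    '{"page_id": 0, "block_types": ["SectionHeader", "Text", "Table"], '
    '"bbox": [0, 0, 612, 792]}'
)


class RoundTripTest(unittest.TestCase):
    def test_round_trip_preserves_every_field(self):
        page = RawPage.model_validate_json(FIXTURE)
        # Compare parsed JSON so key order and whitespace do not matter.
        self.assertEqual(json.loads(page.model_dump_json()), json.loads(FIXTURE))

    def test_serialization_is_reproducible(self):
        first = RawPage.model_validate_json(FIXTURE).model_dump_json()
        second = RawPage.model_validate_json(FIXTURE).model_dump_json()
        self.assertEqual(first, second)


# Run the two tests programmatically so the sketch works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RoundTripTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Comparing parsed JSON rather than raw strings keeps the first test focused on content: it fails if a field is lost or altered, but not if only formatting differs.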


Current Scope

This pull request completes the structural foundation of the Document Understanding Layer and includes:

  • Empirical evaluation of Marker outputs
  • Raw Marker Model implementation
  • Document Schema Specification
  • Complete Document Object implementation
  • Unit tests for both layers

The following components are intentionally outside the scope of this milestone and will be implemented in subsequent phases:

  • Normalizer
  • Retrieval API
  • Scientific Extraction
  • Intermediate Representation (IR)
  • Validation
  • Review
  • Export

Next Steps

With the Raw Marker Model and the Document Object now established, the next milestone will focus on implementing the Normalizer. This component will be responsible for transforming the raw Marker representation into the canonical Document Object while preserving provenance, reading order, and structural fidelity. The completed Document layer will then serve as the foundation for retrieval, extraction, validation, and export in the later stages of the project.
