poster2json

Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.

Documentation · Changelog · Report Bug · Request Feature

Description

poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.

The pipeline uses:

Llama-3.1-8B-Instruct (a verbatim mirror of Meta's release; swap with any HuggingFace instruct model via --model) for JSON structuring
Qwen2-VL-7B for vision-based OCR of image posters
pdfplumber for layout-aware PDF text extraction
lingua-language-detector for ISO 639-1 language detection on body text (overrides any value the model emits — body text beats metadata-fragment guessing)
ROR (https://api.ror.org) for affiliation canonicalisation; matched names get a ROR identifier attached
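The ROR step above can be pictured with a minimal sketch. It is not the package's own code: the function name `enrich_affiliation` is hypothetical, and the response shape (an `items` list whose entries carry a `chosen` flag and an `organization` object with an `id`) is an assumption based on ROR's affiliation-matching endpoint. The response is passed in as a plain dict so the sketch needs no network access.

```python
def enrich_affiliation(name, ror_response):
    """Attach a ROR identifier to an affiliation name when ROR marks a match as chosen.

    `ror_response` is assumed to be the decoded JSON from ROR's affiliation
    matching endpoint. Both the shape and this helper are illustrative.
    """
    for item in ror_response.get("items", []):
        if item.get("chosen"):
            return {
                "name": name,  # the original string is kept as-is
                "affiliationIdentifier": item["organization"]["id"],
                "affiliationIdentifierScheme": "ROR",
                "schemeURI": "https://ror.org/",
            }
    # No confident match: pass the name through without an identifier.
    return {"name": name}


# Hand-written stand-in for an API response (not real API output).
fake_response = {
    "items": [
        {"chosen": False, "organization": {"id": "https://ror.org/example-a"}},
        {"chosen": True, "organization": {"id": "https://ror.org/00f54p054"}},
    ]
}
print(enrich_affiliation("Stanford University", fake_response))
print(enrich_affiliation("Unknown Institute", {"items": []}))
```

Taking only the entry flagged `chosen` mirrors the "high-confidence match or pass through" behaviour described in this README; any additional scoring the real matcher applies is not shown here.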

Quick Start

Installation

pip install poster2json

CLI Usage

# Extract metadata from a poster (default: Llama-3.1-8B-Instruct @ 4bit)
poster2json extract poster.pdf -o result.json

# Use a different instruct model (any HuggingFace repo id works)
poster2json extract poster.pdf --model google/gemma-2-9b-it --quantization 4bit

# Trade VRAM for quality
poster2json extract poster.pdf --quantization 8bit
poster2json extract poster.pdf --quantization fp16

# Validate extracted JSON
poster2json validate result.json

# Process multiple posters
poster2json batch ./posters/ -o ./output/

Python API

from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate the result
is_valid = validate_poster(result)

Output Format

Output conforms to the poster-json-schema (DataCite 4.7):

{
  "$schema": "https://posters.science/schema/v0.2/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": [
        {
          "name": "Stanford University",
          "affiliationIdentifier": "https://ror.org/00f54p054",
          "affiliationIdentifierScheme": "ROR",
          "schemeURI": "https://ror.org/"
        }
      ]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "publicationYear": null,
  "language": "en",
  "researchField": "Health Sciences",
  "subjects": [
    { "subject": "Machine Learning" },
    { "subject": "Diabetic Retinopathy" }
  ],
  "descriptions": [
    { "description": "We present a deep learning model...", "descriptionType": "Other" }
  ],
  "publisher": null,
  "content": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "id": "fig1", "caption": "Figure 1. ROC curves showing..." }],
  "tableCaptions": [{ "id": "table1", "caption": "Table 1. Performance metrics" }],
  "formats": ["application/pdf"]
}
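As an illustration of consuming this structure, the sketch below walks the creators array and collects each affiliation's ROR identifier. The helper name `affiliation_rors` is hypothetical and not part of poster2json. It relies only on the field names shown in the example above, and it assumes an unmatched affiliation may appear either as a plain string or as an object without an identifier.

```python
def affiliation_rors(record):
    """Return (creator name, affiliation name, ROR id or None) tuples."""
    rows = []
    for creator in record.get("creators") or []:
        for aff in creator.get("affiliation") or []:
            if isinstance(aff, str):
                # Assumption: an unmatched affiliation may be a bare string.
                rows.append((creator.get("name"), aff, None))
                continue
            ror = None
            if aff.get("affiliationIdentifierScheme") == "ROR":
                ror = aff.get("affiliationIdentifier")
            rows.append((creator.get("name"), aff.get("name"), ror))
    return rows


# Trimmed copy of the example record above.
sample = {
    "creators": [
        {
            "name": "Garcia, Sofia",
            "affiliation": [
                {
                    "name": "Stanford University",
                    "affiliationIdentifier": "https://ror.org/00f54p054",
                    "affiliationIdentifierScheme": "ROR",
                    "schemeURI": "https://ror.org/",
                }
            ],
        }
    ]
}

for row in affiliation_rors(sample):
    print(row)
```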

Notes on the auto-populated fields:

language is detected from the raw body text (lingua heuristic). Returns null when text is too short (<200 chars / <50 non-ASCII codepoints) or the detector is unsure.
researchField must be one of the four OpenAlex top-level domains: Health Sciences, Life Sciences, Physical Sciences, Social Sciences. Null when the model can't pick one confidently.
affiliation gets ROR enrichment when the matcher returns a high-confidence chosen result. Strings without a confident match pass through unchanged. Set POSTER2JSON_ROR=0 to disable.
publisher and publicationYear are always emitted as null. They are platform-owned and set when the poster is published, not by extraction.
formats is derived from the input file's extension, not the model.
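The notes above can be turned into a small sanity check on an extracted record. This is a sketch, not part of poster2json: `check_auto_fields` is a hypothetical helper. It assumes language codes are emitted as two lowercase ASCII letters, and it uses Python's standard `mimetypes` module as a stand-in for whatever extension-to-format mapping the tool applies.

```python
import mimetypes

# The four OpenAlex top-level domains listed in the notes above.
OPENALEX_DOMAINS = {
    "Health Sciences",
    "Life Sciences",
    "Physical Sciences",
    "Social Sciences",
}


def check_auto_fields(record, source_path=None):
    """Return a list of problems; an empty list means the notes above hold."""
    problems = []

    lang = record.get("language")
    if lang is not None:
        # Assumption: ISO 639-1 codes appear as two lowercase ASCII letters.
        ok = isinstance(lang, str) and len(lang) == 2 and lang.isascii() and lang.isalpha() and lang.islower()
        if not ok:
            problems.append(f"language is not a two-letter code: {lang!r}")

    field = record.get("researchField")
    if field is not None and field not in OPENALEX_DOMAINS:
        problems.append(f"researchField is not an OpenAlex domain: {field!r}")

    for key in ("publisher", "publicationYear"):
        if record.get(key) is not None:
            problems.append(f"{key} should be null after extraction")

    if source_path is not None:
        # Stand-in mapping; the tool's own extension table may differ.
        expected, _ = mimetypes.guess_type(source_path)
        if expected and expected not in (record.get("formats") or []):
            problems.append(f"formats does not include {expected}")

    return problems


record = {
    "language": "en",
    "researchField": "Health Sciences",
    "publisher": None,
    "publicationYear": None,
    "formats": ["application/pdf"],
}
print(check_auto_fields(record, "poster.pdf"))
```

Returning a list of messages rather than a boolean makes it easier to see which note was violated when checking a batch of records.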

System Requirements

Requirement	Specification
GPU	NVIDIA CUDA-capable, ≥8GB VRAM (default 4bit); ≥16GB for `--quantization fp16` or image/OCR posters
RAM	≥32GB recommended
Python	3.10+
OS	Linux, macOS, Windows (via WSL2)

Performance

Validated on 20 manually annotated scientific posters (19 PDF via pdfplumber, 1 image via vision OCR):

Metric	Score	Threshold
Word Capture	0.92	≥0.75
ROUGE-L	0.85	≥0.75
Number Capture	0.97	≥0.75
Field Proportion	0.88	0.50–1.50

Pass Rate: 19/20 (95%). The single failure is a dense table/flowchart poster whose reference annotation splits one visual region into many fine-grained sections.
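To make the thresholds in the table concrete, here is a minimal sketch that applies them to a set of scores. The key names and the `failed_metrics` helper are hypothetical. Treating a poster as passing only when every metric meets its threshold is an assumption, since the exact pass rule is not spelled out here, and the sketch does not compute the metrics themselves.

```python
# (lower bound, upper bound or None), copied from the table above.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 1.50),
}


def failed_metrics(scores):
    """Return the names of metrics that fall outside their threshold."""
    failed = []
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            failed.append(name)
    return failed


# Aggregate scores from the table above.
aggregate = {
    "word_capture": 0.92,
    "rouge_l": 0.85,
    "number_capture": 0.97,
    "field_proportion": 0.88,
}
print(failed_metrics(aggregate))  # prints []
```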

Documentation

Document	Description
Architecture	Technical details & methodology
Evaluation	Validation metrics & results

Development Setup

# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Run tests
poe test

# Format code
poe format

If you are on Windows and have multiple Python versions installed, you can use the following commands:

py -0p # list all python versions

py -3.12 -m venv .venv

License

MIT License - see LICENSE for details.

Citation

@software{poster2json2026,
  title = {poster2json: Scientific Poster to JSON Metadata Extraction},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  version = {0.8.0},
  url = {https://github.com/fairdataihub/poster2json},
  doi = {10.5281/zenodo.18320010}
}

Funding

This project is funded by The Navigation Fund (10.71707/rk36-9x79).

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
