poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.
The pipeline uses:
- Llama-3.1-8B-Instruct (a verbatim mirror of Meta's release; swap with any HuggingFace instruct model via --model) for JSON structuring
- Qwen2-VL-7B for vision-based OCR of image posters
- pdfalto for layout-aware PDF text extraction
- lingua-language-detector for ISO 639-1 language detection on body text (this overrides any value the model emits, because detection on the full body text is more reliable than guessing from metadata fragments)
- ROR (https://api.ror.org) for affiliation and publisher canonicalisation; matched names get a ROR identifier attached
- SPDX matching (with integer-exact version handling) for license normalisation in rightsList
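As a rough illustration of what "integer-exact version handling" could look like, the sketch below parses the version out of a license string into a tuple of integers and only accepts an exact table hit. It is not the project's actual matcher: `SPDX_TABLE`, `match_spdx`, and the family patterns are hypothetical names chosen for this example, and the table covers only a few Creative Commons entries.

```python
import re

# Tiny illustrative table keyed by (family, (major, minor)).
# Assumption: the real project uses a much fuller SPDX table.
SPDX_TABLE = {
    ("cc-by", (4, 0)): "CC-BY-4.0",
    ("cc-by", (3, 0)): "CC-BY-3.0",
    ("cc-by-sa", (4, 0)): "CC-BY-SA-4.0",
}

_VERSION = re.compile(r"(\d+)\.(\d+)")

# Deliberately simplistic family detection, most specific pattern first.
# A real matcher would also need NC/ND variants and other license families.
_FAMILIES = [
    ("cc-by-sa", re.compile(r"cc[\s-]*by[\s-]*sa|attribution[\s-]*sharealike", re.I)),
    ("cc-by", re.compile(r"cc[\s-]*by|creative commons attribution", re.I)),
]


def match_spdx(text):
    """Return an SPDX identifier, or None when family or version is not an exact hit."""
    version = _VERSION.search(text)
    if version is None:
        return None  # no version stated: refuse to guess one
    # Compare versions as integers, so "4.1" and "4.10" stay distinct from "4.0".
    key_version = (int(version.group(1)), int(version.group(2)))
    for family, pattern in _FAMILIES:
        if pattern.search(text):
            return SPDX_TABLE.get((family, key_version))
    return None
```

With this table, "Creative Commons Attribution 4.0 International" resolves to CC-BY-4.0, while "CC-BY-4.1" yields no match because no entry exists for version (4, 1).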
pip install poster2json

# Extract metadata from a poster (default: Llama-3.1-8B-Instruct @ 4bit)
poster2json extract poster.pdf -o result.json
# Use a different instruct model (any HuggingFace repo id works)
poster2json extract poster.pdf --model google/gemma-2-9b-it --quantization 4bit
# Trade VRAM for quality
poster2json extract poster.pdf --quantization 8bit
poster2json extract poster.pdf --quantization fp16
# Validate extracted JSON
poster2json validate result.json
# Process multiple posters
poster2json batch ./posters/ -o ./output/

from poster2json import extract_poster, validate_poster
# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])
# Validate the result
is_valid = validate_poster(result)

Output conforms to the poster-json-schema (DataCite 4.7):
{
"$schema": "https://posters.science/schema/v0.2/poster_schema.json",
"creators": [
{
"name": "Garcia, Sofia",
"givenName": "Sofia",
"familyName": "Garcia",
"affiliation": [
{
"name": "Stanford University",
"affiliationIdentifier": "https://ror.org/00f54p054",
"affiliationIdentifierScheme": "ROR",
"schemeUri": "https://ror.org/"
}
]
}
],
"titles": [
{ "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
],
"publicationYear": 2025,
"language": "en",
"researchField": "Health Sciences",
"subjects": [
{ "subject": "Machine Learning" },
{ "subject": "Diabetic Retinopathy" }
],
"descriptions": [
{ "description": "We present a deep learning model...", "descriptionType": "Abstract" }
],
"publisher": { "name": "Zenodo" },
"rightsList": [
{
"rights": "Creative Commons Attribution 4.0 International",
"rightsIdentifier": "CC-BY-4.0",
"rightsIdentifierScheme": "SPDX",
"schemeUri": "https://spdx.org/licenses/",
"rightsUri": "https://creativecommons.org/licenses/by/4.0/"
}
],
"content": {
"sections": [
{ "sectionTitle": "Abstract", "sectionContent": "..." },
{ "sectionTitle": "Methods", "sectionContent": "..." },
{ "sectionTitle": "Results", "sectionContent": "..." }
]
},
"imageCaptions": [{ "id": "fig1", "caption": "Figure 1. ROC curves showing..." }],
"tableCaptions": [{ "id": "table1", "caption": "Table 1. Performance metrics" }]
}

Notes on the auto-populated fields:
- language is detected from the raw body text (lingua heuristic). Returns null when text is too short (<200 chars / <50 non-ASCII codepoints) or the detector is unsure.
- researchField must be one of the four OpenAlex top-level domains: Health Sciences, Life Sciences, Physical Sciences, Social Sciences. Null when the model can't pick one confidently.
- affiliation and publisher get ROR enrichment when the matcher returns a high-confidence chosen result. Strings without a confident match pass through unchanged. Set POSTER2JSON_ROR=0 to disable.
- rightsList entries are matched against an SPDX table; the matcher is conservative on version numbers (e.g. CC-BY-4.0 and CC-BY-4.1 are never confused).
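The researchField and ROR rules above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the package's internals: `normalize_research_field`, `ror_enabled`, and `enrich_affiliation` are hypothetical helper names, the case-insensitive comparison is an assumption, and the response shape (an `items` list whose entries carry a `chosen` flag and an `organization` with an `id`) follows the ROR affiliation-matching endpoint as commonly documented. No network call is made here; the response is passed in as a dict.

```python
import os

# The four OpenAlex top-level domains listed in the notes above.
RESEARCH_FIELDS = (
    "Health Sciences",
    "Life Sciences",
    "Physical Sciences",
    "Social Sciences",
)


def normalize_research_field(value):
    """Return the canonical domain name, or None for anything else."""
    if not isinstance(value, str):
        return None
    cleaned = value.strip().casefold()  # assumption: compare case-insensitively
    for field in RESEARCH_FIELDS:
        if field.casefold() == cleaned:
            return field
    return None


def ror_enabled(env=None):
    """ROR enrichment is on unless POSTER2JSON_ROR is set to 0."""
    env = os.environ if env is None else env
    return env.get("POSTER2JSON_ROR", "1") != "0"


def enrich_affiliation(affiliation, ror_response, env=None):
    """Attach a ROR id when the response has a chosen item; otherwise pass through."""
    if not ror_enabled(env):
        return affiliation
    for item in ror_response.get("items", []):
        if item.get("chosen"):
            return {
                **affiliation,
                "affiliationIdentifier": item["organization"]["id"],
                "affiliationIdentifierScheme": "ROR",
                "schemeUri": "https://ror.org/",
            }
    return affiliation  # no confident match: leave the entry unchanged
```

Returning the input unchanged when nothing is chosen mirrors the pass-through behaviour described above, so a failed lookup never removes information that was already extracted.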
| Requirement | Specification |
|---|---|
| GPU | NVIDIA CUDA-capable, ≥8GB VRAM (default 4bit); ≥16GB for --quantization fp16 or image/OCR posters |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via WSL2) |
Validated on 10 manually annotated scientific posters:
| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
Pass Rate: 10/10 (100%)
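For readers who want to apply the same thresholds to their own runs, the check can be sketched as below. The metric key names and the `passes` helper are hypothetical, and the assumption that a poster must clear all four thresholds to pass is ours; the Evaluation document is the authority on how each metric is computed and how the pass rate is derived.

```python
# Thresholds from the table above: (lower bound, upper bound or None).
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def passes(scores):
    """Return True when every metric falls inside its threshold range."""
    for metric, (low, high) in THRESHOLDS.items():
        value = scores[metric]
        if value < low or (high is not None and value > high):
            return False
    return True


# Scores reported in the table above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
print(passes(reported))
```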
| Document | Description |
|---|---|
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows
# Install poetry
pip install poetry
# Install dependencies
poetry install
# Run tests
poe test
# Format code
poe format

If you are on Windows and have multiple Python versions installed, you can use the following commands:
py -0p # list all python versions
py -3.12 -m venv .venv

MIT License - see LICENSE for details.
@software{poster2json2026,
title = {poster2json: Scientific Poster to JSON Metadata Extraction},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
version = {0.4.3},
url = {https://github.com/fairdataihub/poster2json},
doi = {10.5281/zenodo.18320010}
}

This project is funded by The Navigation Fund (10.71707/rk36-9x79).
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
