PosterSentry

Lightweight multimodal classifier for scientific poster quality control in open repositories.

License: MIT Python 3.10+ HuggingFace

Part of the quality control pipeline for posters.science, a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

Developed by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI2).

The Problem

Open repositories like Zenodo and Figshare host tens of thousands of records labeled as scientific posters. However, approximately 20% of these records are mislabeled — containing multi-page papers, conference proceedings, abstract booklets, slide decks, or other non-poster documents. This label noise is a significant barrier to automated poster processing at scale.

Architecture

PosterSentry classifies PDFs using three complementary feature channels concatenated into a 542-dimensional vector:

| Channel | Features | Dimensions | Signal |
| --- | --- | --- | --- |
| Text | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| Visual | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| Structural | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

A StandardScaler normalizes all features (preventing the 512-d text embedding from drowning out structural/visual signal), then a LogisticRegression classifier produces the final prediction.

The classifier head is a single linear layer stored as a numpy .npz file (10 KB). Inference is pure numpy — no GPU or deep learning framework required.
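
As a rough illustration of what numpy-only inference over the 542-dimensional vector could look like, here is a minimal sketch. The array names (`mean`, `scale`, `coef`, `intercept`) and the random weights are assumptions made for this example; the actual keys and values in the released .npz file may differ.

```python
import io

import numpy as np


def predict_poster_probability(text_emb, visual, structural, head):
    """Concatenate the three channels, standardize, and apply a linear head."""
    x = np.concatenate([text_emb, visual, structural])    # shape (542,)
    z = (x - head["mean"]) / head["scale"]                # same transform a StandardScaler applies
    logit = float(z @ head["coef"] + head["intercept"])   # linear decision function
    return float(1.0 / (1.0 + np.exp(-logit)))            # sigmoid -> probability


# Stand-in head with random weights, round-tripped through an in-memory .npz
# (hypothetical key names; not the real model file).
rng = np.random.default_rng(0)
buf = io.BytesIO()
np.savez(
    buf,
    mean=np.zeros(542),
    scale=np.ones(542),
    coef=rng.normal(size=542) * 0.05,
    intercept=np.array(0.0),
)
buf.seek(0)
head = np.load(buf)

# Random stand-ins for the 512-d text, 15-d visual, and 15-d structural features.
p = predict_poster_probability(
    rng.normal(size=512), rng.normal(size=15), rng.normal(size=15), head
)
print(f"P(poster) = {p:.3f}")
```

Because the head is just a dot product followed by a sigmoid, the per-document cost is dominated by feature extraction rather than by the classifier itself.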

Performance

Validated on 3,606 real scientific documents (zero synthetic data):

| Metric | Value |
| --- | --- |
| Accuracy | 87.3% |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | < 1 sec/PDF (CPU) |

Applied to 30,205 PDFs from Zenodo and Figshare, PosterSentry classified 80.2% as true posters and 19.8% as non-posters, with a mean confidence of 0.799.
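
The reported figures can be cross-checked with a little arithmetic. The sketch below recomputes the poster F1 from the rounded precision and recall, and estimates the document counts implied by the 80.2% / 19.8% split. Because the inputs are already rounded, the results are approximate and the last digit may differ from the reported values.

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.882
recall = 0.859
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 (poster) from rounded P and R: {f1:.3f}")

# Approximate counts implied by the reported percentages.
total_pdfs = 30_205
poster_share = 0.802
approx_posters = round(total_pdfs * poster_share)
approx_non_posters = total_pdfs - approx_posters
print(f"~{approx_posters} posters, ~{approx_non_posters} non-posters")
```

The recomputed F1 lands within about 0.1 percentage point of the reported 87.1%, which is the gap one would expect from rounding the inputs.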

Top Discriminative Features

| Feature | Coefficient | Signal |
| --- | --- | --- |
| size_per_page_kb | +7.65 | Posters are dense, high-res single pages |
| page_count | -5.49 | More pages = not a poster |
| file_size_kb | -5.44 | Multi-page docs are bigger overall |
| is_landscape | +0.98 | Some posters are landscape |
| color_diversity | +0.95 | Posters are visually rich |
| edge_density | +0.79 | More visual edges in posters |
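
To see how these coefficients act on a single document, the sketch below multiplies each one by a standardized feature value and sums the results. The coefficients come from the table above; the feature values are hypothetical, and the sum is only a partial logit, since it leaves out the intercept and the other 536 features.

```python
# Coefficients as reported in the table (assumed to apply to standardized features).
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
    "is_landscape": 0.98,
    "color_diversity": 0.95,
    "edge_density": 0.79,
}

# Hypothetical standardized values for a multi-page document (illustration only).
example = {
    "size_per_page_kb": 0.5,
    "page_count": 1.2,
    "file_size_kb": 0.4,
    "is_landscape": -0.3,
    "color_diversity": 0.0,
    "edge_density": 0.1,
}


def contributions(coefs, z):
    """Per-feature contribution to the logit: coefficient times standardized value."""
    return {name: coefs[name] * z[name] for name in coefs}


contrib = contributions(coefficients, example)
for name, value in sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:18s} {value:+.2f}")

partial_logit = sum(contrib.values())
print(f"partial logit: {partial_logit:+.3f}")
```

In this made-up case the above-average page count outweighs the positive size-per-page term, pushing the partial logit toward the non-poster side.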

Quick Start

Installation

pip install poster-sentry

CLI Usage

# Classify a single PDF
poster-sentry classify document.pdf

# Classify multiple PDFs
poster-sentry classify *.pdf --output results.tsv

# Print model info
poster-sentry info

Python API

from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])

# Text-only classification (no PDF needed)
result = sentry.classify_text("Title: My Poster\nAuthors: ...")
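
Batch results can be summarized in the same terms as the repository-wide figures (poster share and mean confidence). This is a minimal sketch that assumes `classify_batch` returns a list of dicts with the same `is_poster` and `confidence` keys as `classify`; the helper `summarize` and the sample values are hypothetical.

```python
from statistics import mean


def summarize(results):
    """Return the poster share and mean confidence for a list of result dicts."""
    if not results:
        return {"n": 0, "poster_share": None, "mean_confidence": None}
    n_posters = sum(1 for r in results if r["is_poster"])
    return {
        "n": len(results),
        "poster_share": n_posters / len(results),
        "mean_confidence": mean(r["confidence"] for r in results),
    }


# Stand-in results shaped like the assumed classify_batch output.
sample = [
    {"is_poster": True, "confidence": 0.93},
    {"is_poster": False, "confidence": 0.81},
    {"is_poster": True, "confidence": 0.66},
]
summary = summarize(sample)
print(summary)
```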

Pipeline Position

PosterSentry sits at the front of the posters.science pipeline — it screens incoming PDFs before expensive LLM-based extraction:

PDF Input
   |
   v
PosterSentry          -->  poster2json                     -->  FAIR output
(classify: poster?)        (Llama 3.1 8B structured extraction)  (poster-json-schema)
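
The gating step in the diagram can be sketched as a small function that only calls the expensive extractor on documents the classifier accepts. The function name, the `min_confidence` threshold, and the stub callables are assumptions for illustration. In practice `classify` could be `sentry.classify`, while `extract` stands in for whatever entry point poster2json exposes.

```python
def screen_then_extract(paths, classify, extract, min_confidence=0.5):
    """Run extraction only on PDFs classified as posters with enough confidence."""
    accepted, rejected = {}, []
    for path in paths:
        result = classify(path)
        if result["is_poster"] and result["confidence"] >= min_confidence:
            accepted[path] = extract(path)   # expensive step, reached only for posters
        else:
            rejected.append(path)            # skipped: no extraction cost incurred
    return accepted, rejected


# Stub classifier and extractor so the sketch runs without any models installed.
fake_results = {
    "poster1.pdf": {"is_poster": True, "confidence": 0.95},
    "paper.pdf": {"is_poster": False, "confidence": 0.88},
}
extract_calls = []


def fake_classify(path):
    return fake_results[path]


def fake_extract(path):
    extract_calls.append(path)
    return {"source": path}


accepted, rejected = screen_then_extract(
    ["poster1.pdf", "paper.pdf"], fake_classify, fake_extract
)
print(accepted, rejected)
```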

System Requirements

| Requirement | Value |
| --- | --- |
| CPU | Any modern CPU (no GPU needed) |
| RAM | 4 GB+ |
| Python | 3.10+ |
| Model size | 10 KB head + ~60 MB embeddings (downloaded once) |

Related Resources

| Resource | Description |
| --- | --- |
| poster-sentry (HuggingFace) | Model weights and config |
| poster-sentry-training-data (HuggingFace) | Training dataset (3,606 samples) |
| poster-sentry-training (GitHub) | Training code and replication |
| poster2json | Poster to structured JSON extraction |
| posters.science | Platform |

Development

git clone https://github.com/fairdataihub/poster-sentry.git
cd poster-sentry
pip install -e ".[dev]"
pytest

Citation

@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://github.com/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative at FAIR Data Innovations Hub}
}

License

MIT License. See LICENSE for details.
