Metadata Processing

The Posters.science platform uses AI-based processing to automatically extract, validate, and enrich metadata from scientific posters. This supports FAIR compliance and improves discoverability.

AI-Powered Extraction

Large Language Model

The extraction tool uses Llama 3.1 8B to extract structured information from poster PDFs. The model runs on local GPU hardware with 4-bit quantization to reduce inference cost, with a target processing time of under 60 seconds per poster.

Model Performance

In a systematic evaluation on 400 posters from Zenodo and Figshare, Llama 3.1 8B significantly outperformed a traditional NLP method (Grobid) on every field.

Extraction Strategy

The system uses adaptive prompting, with a specialized template for each extraction task. Each few-shot prompt includes 3 to 5 example extractions selected by similarity. Key strengths include:

Handling irregular layouts
Understanding scientific terminology
Robustness to typos
Extracting information from context
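The similarity-based example selection described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the similarity measure (Jaccard overlap of word tokens), the example pool layout, and the names `select_examples` and `build_prompt` are all assumptions made for the sketch. A production system would more likely compare embeddings.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_examples(poster_text: str, example_pool: list[dict], k: int = 3) -> list[dict]:
    """Return the k pool entries whose text is most similar to the poster.

    Each pool entry is assumed to be {"text": ..., "metadata": ...}.
    """
    query = tokenize(poster_text)
    ranked = sorted(
        example_pool,
        key=lambda ex: jaccard(query, tokenize(ex["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(poster_text: str, examples: list[dict]) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = ["Extract the poster metadata as JSON."]
    for ex in examples:
        parts.append(f"Poster text:\n{ex['text']}\nMetadata:\n{ex['metadata']}")
    parts.append(f"Poster text:\n{poster_text}\nMetadata:")
    return "\n\n".join(parts)
```

With `k` set between 3 and 5, `build_prompt(text, select_examples(text, pool, k))` yields a prompt whose worked examples resemble the incoming poster.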

Content Analysis Pipeline

graph TD
    A[PDF Upload] --> B[OCR Processing]
    B --> C[Text Extraction]
    C --> D[Structure Recognition]
    D --> E[Entity Extraction]
    E --> F[Metadata Generation]
    F --> G[Validation]
    G --> H[User Review]

    I[Abstract Text] --> J[Natural Language Processing]
    J --> K[Keyword Extraction]
    K --> L[Topic Classification]
    L --> F

Features

High-Accuracy Text Recognition: Advanced OCR for poster text extraction
Multi-language Support: Recognition of various languages and scripts
Layout Preservation: Maintains document structure and formatting
Image Processing: Handles charts, graphs, and visual elements

Natural Language Processing

Entity Recognition: Automatic identification of authors, institutions, and dates
Topic Classification: AI-powered subject categorization
Keyword Extraction: Automatic keyword and phrase identification
Abstract Summarization: Content summarization and enhancement

Processing Workflow

1. Initial Extraction

File Analysis: PDF structure and content analysis
Text Recognition: OCR processing and text extraction
Layout Detection: Section and element identification
Quality Assessment: Content completeness evaluation

2. Entity Recognition

Author Extraction: Name and affiliation identification
Institution Detection: Organization and department recognition
Date Parsing: Publication and event date extraction
Location Identification: Geographic and venue information

3. Content Analysis

Topic Classification: Subject area determination
Keyword Generation: Automatic keyword extraction
Abstract Enhancement: Content summarization and improvement
Reference Parsing: Citation and bibliography analysis

4. Validation and Enrichment

Data Validation: Format and completeness checking
External Lookup: ORCID, DOI, and database integration
Standardization: Controlled vocabulary mapping
Quality Scoring: Metadata completeness assessment
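One simple way to express the quality-scoring step is the share of required fields that are actually filled. The sketch below assumes a hypothetical list of required field names; the real schema and any weighting the platform applies are not specified here.

```python
# Hypothetical required fields; the platform's actual schema may differ.
REQUIRED_FIELDS = ("title", "authors", "institutions", "date", "keywords")


def is_filled(value: object) -> bool:
    """Treat None, blank strings, and empty collections as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) > 0
    return True


def completeness(metadata: dict, required: tuple[str, ...] = REQUIRED_FIELDS) -> float:
    """Percentage (0-100) of required fields that carry a value."""
    filled = sum(1 for name in required if is_filled(metadata.get(name)))
    return 100.0 * filled / len(required)
```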

Confidence Scoring

Each extracted field receives a confidence score (0-100%) based on:

Format Compliance: Adherence to expected data formats
Database Validation: Verification against external databases
Context Analysis: Consistency with surrounding information

Fields below 70% confidence trigger user review during the submission process.
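As an illustration of how the three signals might combine into one score, the sketch below uses a weighted average and applies the 70% review threshold from the text. The weights and the names `FieldSignals`, `confidence`, and `fields_needing_review` are assumptions; the platform's actual formula is not documented here.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 70.0  # percent, as stated in the text

# Hypothetical weights; they only need to sum to 1.0 for a 0-100 result.
WEIGHTS = {
    "format_compliance": 0.3,
    "database_validation": 0.4,
    "context_analysis": 0.3,
}


@dataclass
class FieldSignals:
    """Per-field signals, each normalized to the range 0.0-1.0."""
    format_compliance: float
    database_validation: float
    context_analysis: float


def confidence(signals: FieldSignals) -> float:
    """Weighted average of the three signals, as a percentage."""
    score = sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())
    return round(100.0 * score, 1)


def fields_needing_review(fields: dict[str, FieldSignals]) -> list[str]:
    """Names of fields whose confidence falls below the review threshold."""
    return [name for name, s in fields.items() if confidence(s) < REVIEW_THRESHOLD]
```

For example, a field that passes the format and context checks but fails database validation scores 60% under these weights and would be flagged for review.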

External Database Integration

The system validates and enriches metadata through integration with authoritative databases:

ORCID: Author identification and profile linking
ROR (Research Organization Registry): Institution validation and standardization
Crossref Funder Registry: Funding agency validation

For U.S. federal funding, the system cross-references:

NIH Reporter: National Institutes of Health grant information
NSF Award Search: National Science Foundation award data
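Before any external lookup, an ORCID iD can be checked locally: its final character is a check digit computed with ISO 7064 MOD 11-2. The sketch below shows that check only. The function names are illustrative, and whether the platform performs this pre-check is an assumption.

```python
import re

ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True when the iD has the expected shape and a matching check digit."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]
```

A failed check catches typos and OCR errors cheaply, so only well-formed identifiers need to be sent to the ORCID registry.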

Content Structure

Poster content is stored in a posterContent object with the following structure:

posterTitle: The main title text extracted from the poster itself
sections: Array of content sections, each with:
- sectionTitle: Heading as it appears on the poster (e.g., "Introduction", "Methods", "Results & Discussion", "Conclusions")
- sectionContent: Text content of that section

This flexible structure accommodates the diverse, unstructured nature of scientific posters without forcing content into predefined rigid sections.
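A small sketch of this structure, using the field names given above. The sample values and the `is_poster_content` helper are illustrative assumptions, not part of the platform.

```python
import json
from typing import TypedDict


class Section(TypedDict):
    sectionTitle: str
    sectionContent: str


class PosterContent(TypedDict):
    posterTitle: str
    sections: list[Section]


def is_poster_content(obj: object) -> bool:
    """Shape check: a title string plus a list of title/content string pairs."""
    if not isinstance(obj, dict) or not isinstance(obj.get("posterTitle"), str):
        return False
    sections = obj.get("sections")
    if not isinstance(sections, list):
        return False
    return all(
        isinstance(s, dict)
        and isinstance(s.get("sectionTitle"), str)
        and isinstance(s.get("sectionContent"), str)
        for s in sections
    )


# Made-up sample; section titles are kept exactly as they appear on the poster.
example: PosterContent = {
    "posterTitle": "Example Poster Title",
    "sections": [
        {"sectionTitle": "Introduction", "sectionContent": "Background text."},
        {"sectionTitle": "Results & Discussion", "sectionContent": "Findings text."},
    ],
}

print(json.dumps(example, indent=2))
```

Because `sections` is an open list, a poster with unusual headings needs no schema change.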

Image and Table Captions

Captions are stored in separate arrays:

imageCaption: Array of objects containing multi-line image captions (caption1, caption2, etc.)
tableCaption: Array of objects containing multi-line table captions (caption1, caption2, etc.)
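The sketch below assumes one reading of this layout: each array element is an object whose `caption1`, `caption2`, ... keys hold the successive lines of a single caption. Under that assumption, a hypothetical helper `join_caption` rebuilds the full caption text.

```python
import re


def join_caption(caption_obj: dict[str, str]) -> str:
    """Join captionN lines in numeric order into one caption string."""
    numbered = []
    for key, text in caption_obj.items():
        match = re.fullmatch(r"caption(\d+)", key)
        if match:  # ignore any keys that are not captionN
            numbered.append((int(match.group(1)), text.strip()))
    numbered.sort(key=lambda pair: pair[0])
    return " ".join(text for _, text in numbered)


# Made-up sample data for illustration.
image_caption = [
    {"caption1": "Figure 1. Mean rainfall", "caption2": "across three sites."},
]
print([join_caption(c) for c in image_caption])
```

Sorting on the numeric suffix rather than the key string keeps `caption10` after `caption2`.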

Technical Implementation

The metadata extraction tool runs in a Docker container on Python 3.10+ and exposes RESTful API endpoints. Key features include:

Redis-based job queuing: Asynchronous processing of extraction requests
Rate limiting: Prevention of abuse and resource exhaustion
Horizontal scaling: Support for multiple GPU instances
API endpoints: RESTful interface for submission integration
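Rate limiting of the kind listed above is often implemented as a token bucket. The following is an in-memory sketch of that general technique, not the platform's implementation: the class name, its parameters, and the choice of algorithm are assumptions. A multi-instance deployment would keep the bucket state in a shared store such as Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock  # injectable so the limiter can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may run."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Keeping one bucket per API key or client address limits each caller independently, so a single heavy user cannot exhaust the GPU workers.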
