AIMS Hackathon - Synapse DataForge

Nji Ruth Mbikang
Franck Justin Etape Etape
Tecla Kyalo
Uriel Nguefack Yefou

1. Problem Statement

The global fight against modern slavery is a critical and complex challenge. In response, countries like Australia, the UK, and Canada have enacted legislation that requires companies to publish Modern Slavery Statements, detailing their practices and supply chain dynamics. However, these vital documents pose a significant barrier to analysis. They are often unstructured, in formats such as scanned PDFs, and contain a mix of content including plain text, infographics, and tables. Moreover, these reports can be in different languages, such as French for Canadian companies, making them difficult for automated systems to process.

Our challenge is to overcome these data-related obstacles. Our goal is to develop innovative solutions for Data Mining, Processing & Enrichment to transform these disparate, complex documents into a usable, structured dataset. By tackling issues like enhancing OCR accuracy for scanned PDFs, enabling the precise extraction of information from figures and tables, and developing smarter methods for multilingual document understanding and data enrichment, we will be able to unlock the full potential of the data. Our solutions will be the foundation for a powerful Knowledge Repository (Pillar 1), allowing researchers and civil society organizations to analyze company compliance, identify red flags, and ultimately strengthen the fight against modern slavery.

2. Objective

Overarching Objective

To search for and download Modern Slavery Statements, build company profiles, and transform these complex, unstructured documents into a clean, structured, and usable dataset that accelerates research, analysis, and accountability in the fight against modern slavery.

Pillar 1: Dashboards & Knowledge Repositories

Objective 1:

Mine and Consolidate Data: Design and execute a strategy to systematically collect Modern Slavery Statements from official government registries (Australia, UK, and Canada) and company websites.

Objective 2:

Enrich Company Profiles: Create a process to enrich the collected data by integrating information from external datasets (e.g., GDELT, Open Corporates, Datanyze) to build comprehensive company profiles with details like revenue, directors, and other relevant metrics.

Objective 3:

Build an Accessible Knowledge Repository: Develop a dashboard or knowledge repository that allows users to ask and answer complex analytical questions, such as identifying companies reporting to multiple laws or assessing the visibility of their statements online.

Pillar 2: Processing & Enrichment

Objective 4:

Enhance OCR and Document Understanding: Develop and implement methods to significantly improve the accuracy of Optical Character Recognition (OCR) for scanned PDF documents, including those in languages like French.

Objective 5:

Extract Structured Data from Unstructured Formats: Create solutions to accurately identify and extract meaningful content from non-textual elements, such as tables, infographics, and figures embedded within documents.

Objective 6:

Structure Mixed Content: Build a data pipeline or model that can intelligently structure mixed content (text, visuals, and tables) from corporate reports into a cohesive, machine-readable format.

Social Impact Objective

Objective 7:

Enable Social Impact: Provide a working prototype or a robust data processing framework that empowers civil society organizations and researchers with the tools needed to monitor, analyze, and report on corporate compliance, thereby contributing directly to efforts to combat modern slavery.

3. Solution/Data use case description

Our proposed solution is a data platform that transforms unstructured Modern Slavery Statements into a structured, searchable knowledge base. The core of the system is a document extraction pipeline that combines several methods to pull text, tables, and infographics from PDF documents, including those with multilingual content or complex formatting. The extracted data populates a knowledge repository and analytics dashboard. The platform lets users run comparative analyses: checking for multi-jurisdictional reporting, assessing statement visibility on company websites, and viewing enriched company profiles with external data such as revenues, directors, and sanctions. The result is a centralized hub that gives researchers and stakeholders a clear view of corporate transparency. Key functionalities include:

Dashboards & Knowledge Repositories (Data Collection, Enrichment & Analytics)

Automated Statement Collection & Monitoring

Systematic Data Ingestion: Our solution includes a sophisticated module for systematically collecting Modern Slavery Statements. This involves:

Registry Harvesting: Automated search and download of statements from official government registries in Australia, the UK, and Canada.
Agentic Website Browsing: Implementing advanced techniques, such as tree-structured HTML extraction and agentic browsing, to search and locate statements directly on company websites. This allows us to track not only if a statement is present but also its visibility and exact location (e.g., on the first page, within a specific CSR section) to assess corporate transparency.
Language Identification: Identifying and downloading documents in specific languages, such as French documents from Canadian companies, to ensure comprehensive coverage.
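
As an illustration of the tree-structured HTML extraction described above, the following minimal sketch walks a page's HTML, keeps track of the open-tag path, and records links that mention modern slavery together with where they sit in the page. The class name, keyword list, and output fields are hypothetical and are not taken from the project code; only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list; a real deployment would likely tune this.
KEYWORDS = ("modern slavery", "modern-slavery", "esclavage moderne")
# Void elements never receive a closing tag, so they must not enter the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StatementLinkFinder(HTMLParser):
    """Collect <a> links that mention a keyword, with their position in the tree."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stack = []      # currently open tags, outermost first
        self.current = None  # the <a> element being read, if any
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            self.current = {
                "href": dict(attrs).get("href") or "",
                "text": [],
                "path": "/".join(self.stack),
                "in_footer": "footer" in self.stack,
            }

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self.current is not None:
            text = " ".join("".join(self.current["text"]).split())
            haystack = (text + " " + self.current["href"]).lower()
            if any(keyword in haystack for keyword in KEYWORDS):
                self.matches.append({
                    "url": urljoin(self.base_url, self.current["href"]),
                    "text": text,
                    "path": self.current["path"],
                    "in_footer": self.current["in_footer"],
                })
            self.current = None
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def find_statement_links(html, base_url):
    """Return candidate statement links found in one HTML page."""
    finder = StatementLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.matches
```

The `path` and `in_footer` fields are one possible way to express visibility: a link buried in a footer list is arguably less visible than one in the main content of the home page.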

Enriched Company Profiles

Data Fusion Engine: The collected and extracted data is significantly enriched by integrating information from a variety of external, authoritative datasets. This process creates holistic company profiles, which include:

Financials & Corporate Structure: Leveraging data from sources like Open Corporates and Datanyze to pull in revenues, director information, and organizational hierarchies.
Adverse Media & Risk Indicators: Integrating GDELT for adverse media mentions and Open Sanctions for screening against sanction lists, providing crucial red flags.
Compliance & Labor Standards: Incorporating data from the International Labour Organisation (ILO) for labor standards context and ABN Lookup (for Australian entities) for official business registration details.
Sustainability & ESG Ratings: Utilizing platforms like Wikirate to gather crowd-sourced and expert-assessed data points related to human rights and modern slavery performance.
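
The data fusion step can be pictured as a keyed merge of registry statements with attributes from each external source. The sketch below is an assumption about how such a merge might look, not the project's implementation: the function name, the `company_id` and `statement_url` fields, and the "first source wins" conflict rule are all hypothetical.

```python
def build_profiles(statements, enrichment_sources):
    """Merge statement rows and external attributes into one profile per company.

    statements: list of dicts with "company_id", "name", "statement_url".
    enrichment_sources: dict mapping a source name to {company_id: attributes}.
    """
    profiles = {}
    for row in statements:
        company_id = row["company_id"]
        profile = profiles.setdefault(company_id, {
            "company_id": company_id,
            "name": row.get("name"),
            "statements": [],
            "sources": [],
        })
        profile["statements"].append(row["statement_url"])

    for source_name, records in enrichment_sources.items():
        for company_id, attributes in records.items():
            if company_id not in profiles:
                continue  # ignore external records with no matching statement
            for key, value in attributes.items():
                # Keep the value from the first source that supplied the key.
                profiles[company_id].setdefault(key, value)
            profiles[company_id]["sources"].append(source_name)
    return profiles
```

Recording the contributing sources in each profile keeps the provenance of every enriched attribute traceable, which matters when a profile is later used to raise a red flag.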

Interactive Analytics & Knowledge Repository Dashboard

Advanced Querying: Lets government and civil society organizations drill down from aggregate views to the statements submitted by individual companies and organizations. Supported analyses include:

Multi-Jurisdictional Reporting: Identifying and visualizing how many companies are reporting under multiple modern slavery laws (e.g., UK, Australian, Canadian).
Geographical and Sectoral Analysis: Breaking down the number of statements by country, year, and sector to highlight trends and areas of focus.
Statement Visibility & Accessibility: Providing insights into where statements are found on company websites, contributing to an assessment of corporate diligence.
Red Flag Identification: Correlating internal statement data with external adverse media or sanction data to proactively identify potential risks.
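
Two of the analyses listed above reduce to simple grouping operations over the consolidated statement rows. The following sketch assumes each row carries hypothetical `company_id`, `jurisdiction`, and `year` fields; the function names are illustrative only.

```python
from collections import Counter, defaultdict


def multi_law_reporters(rows):
    """Map each company to its jurisdictions, keeping only those with more than one."""
    jurisdictions = defaultdict(set)
    for row in rows:
        jurisdictions[row["company_id"]].add(row["jurisdiction"])
    return {cid: sorted(laws) for cid, laws in jurisdictions.items() if len(laws) > 1}


def statements_by(rows, *fields):
    """Count statements grouped by one or more fields, e.g. ("jurisdiction", "year")."""
    return Counter(tuple(row[field] for field in fields) for row in rows)
```

Calling `statements_by(rows, "jurisdiction")` would back a chart of statements per country, while `statements_by(rows, "jurisdiction", "year")` gives the same breakdown per year.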

Intuitive Visualizations: The dashboard presents complex data in clear, accessible visualizations, enabling users to quickly grasp key trends, identify outliers, and drill down into specific data points. The provided dashboard image serves as a prototype, illustrating the kind of aggregated insights and answerable questions available, such as the 'Number of statements by country,' 'Number of Statement by Year,' and 'Companies reporting to multiple laws' visualizations.

Web Application Interface

Our user-friendly web interface allows users to:

Upload PDF documents for processing
Choose between different extraction methods (PyPDF2, PDFPlumber, or OCR)
View the original document alongside extracted content
Access structured data in real-time

Extraction Pipeline

The system implements a robust extraction pipeline that:

Processes both digital and scanned PDFs
Extracts text, tables, and images with high accuracy
Supports multiple languages including English and French
Organizes extracted content into structured formats
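
One way to handle both digital and scanned PDFs is to try the embedded text layer first and fall back to OCR only when that layer is missing or nearly empty. The sketch below shows that decision logic in isolation. The two extractor callables are placeholders for wrappers around a PDF text library and an OCR engine, and the `min_chars` threshold is an assumed value, not one taken from the project.

```python
def extract_page(page, digital_extractor, ocr_extractor, min_chars=20):
    """Return the page text and the method that produced it.

    digital_extractor and ocr_extractor are callables taking a page object and
    returning a string (or None). They are stand-ins for real library wrappers.
    """
    text = (digital_extractor(page) or "").strip()
    if len(text) >= min_chars:
        return {"text": text, "method": "digital"}
    # Too little embedded text: treat the page as a scan and run OCR instead.
    return {"text": (ocr_extractor(page) or "").strip(), "method": "ocr"}


def extract_document(pages, digital_extractor, ocr_extractor, min_chars=20):
    """Apply the per-page fallback to every page of a document."""
    return [
        extract_page(page, digital_extractor, ocr_extractor, min_chars)
        for page in pages
    ]
```

Deciding per page rather than per document handles mixed files, for example a digitally produced report with a scanned signature page appended.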

Key Features

Intelligent Text Extraction
- Advanced OCR capabilities for scanned documents
- Multi-language support
- Preprocessing for improved accuracy
Table Detection & Structuring
- Automated table detection and extraction
- Conversion to structured formats (CSV, Excel)
- Preservation of table relationships and context
Image Processing
- Extraction of embedded images
- OCR processing of text within images
- Image quality enhancement for better results
Output Organization
- Structured folder hierarchy for extracted content
- Multiple export formats
- Detailed extraction reports and metrics
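
As a rough illustration of the table export and folder hierarchy described above, the sketch below writes extracted tables to CSV files under a per-document directory. It assumes tables arrive as lists of rows with possibly empty (`None`) cells; the function name, directory layout, and file naming scheme are hypothetical.

```python
import csv
from pathlib import Path


def save_tables(tables, out_dir, doc_name):
    """Write each table to <out_dir>/<doc_name>/tables/ as a CSV file.

    tables: list of (page_number, rows), where rows is a list of lists of cells.
    Returns the list of written paths.
    """
    table_dir = Path(out_dir) / doc_name / "tables"
    table_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # The table index counts across the whole document, not per page.
    for index, (page_number, rows) in enumerate(tables, start=1):
        path = table_dir / f"page{page_number:03d}_table{index:02d}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            for row in rows:
                # Empty cells become "", and line breaks inside a cell collapse to spaces.
                writer.writerow(
                    "" if cell is None else " ".join(str(cell).split())
                    for cell in row
                )
        written.append(path)
    return written
```

Encoding the page number in the file name keeps each table linked to its location in the source document, which is one simple way to preserve context.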

Technical Implementation

The solution leverages several advanced technologies:

Python-based extraction pipeline
Streamlit for web interface
PyMuPDF and PDFPlumber for PDF processing
Tesseract OCR for image text extraction
Tabula-py for table extraction
PIL for image processing

Performance Metrics

Processing time: Average 2-3 seconds per page
Text extraction accuracy: >95% for digital PDFs
Table detection rate: >90% for structured tables
OCR accuracy: >85% for high-quality scans
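
The figures above do not state how accuracy was computed. One common definition, shown here purely as an assumption, is character-level accuracy: one minus the edit distance between the OCR output and a manually verified reference text, divided by the reference length.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Share of reference characters recovered, clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))
```

For example, if OCR reads "modern slavery" as "modem slavery" (a typical confusion of "rn" with "m"), the edit distance is 2 over 14 reference characters, giving an accuracy of about 0.86.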

4. Pitch

Click the link below to watch our 5-minute presentation.

View Video Pitch

5. Datasets

Location: /Datasets

This section documents the datasets explored, accessed, and generated as part of Pillar 1 of the project. It highlights the sources implemented, those only explored, the encountered challenges, and the transformations applied.

📡 API Access to Required Data

Wikirate API: The main external source actively used. It enriched registry records with location, website, registration numbers, and ABNs. Its open API access was reliable and well-suited to large-scale integration.

🌐 Web Data Collected Directly

Company Websites (News & Profiles): News articles and basic profile information were scraped directly from company websites. This added context about company activities and provided material for further enrichment beyond registry statements.

📂 Other Datasets (Explored but Not Fully Implemented)

ABN Lookup (Australia): Explored for retrieving Australian Business Numbers but not integrated due to time constraints.
OpenSanctions: Several compliance datasets (BIC, FIRDS, GLEIF) were downloaded but not integrated because they did not contain information useful for this project.
GDELT: Considered for adverse media monitoring but not implemented due to scope/time limitations.
Bloomberg Search: Tested for corporate news monitoring, but postponed for later development.
ILO Data: Global labour datasets were explored but had limited relevance to company-specific modern slavery reporting.
Datanyze / OpenCorporates: Both identified as rich sources but not usable at scale due to commercial access restrictions.

⚠️ Data Quality & Transformation Issues

Identifier harmonization: Company numbers, ABNs, and registry IDs required normalization to avoid duplication.
Format standardization: Data arrived in mixed formats (JSON, CSV, scraped HTML) and had to be transformed into consistent tabular structures.
Access restrictions: Certain datasets could not be leveraged because of licensing or paywall barriers.
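
Identifier harmonization can be illustrated with Australian Business Numbers, which often appear with inconsistent spacing. The sketch below strips formatting and validates the result with the published ABN checksum (subtract 1 from the first digit, apply the weights, and check that the sum is divisible by 89). The function names and the deduplication rule are hypothetical and not taken from the project code.

```python
import re

# Official weighting factors for the 11 ABN digits.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def normalize_abn(raw):
    """Return the ABN as an 11-digit string, or None if it fails validation."""
    digits = re.sub(r"\D", "", str(raw))
    if len(digits) != 11:
        return None
    values = [int(d) for d in digits]
    values[0] -= 1  # the checksum subtracts 1 from the leading digit
    if sum(w * v for w, v in zip(ABN_WEIGHTS, values)) % 89 != 0:
        return None
    return digits


def dedupe_by_abn(records):
    """Keep the first record per valid ABN; pass through records without one."""
    seen = set()
    unique = []
    for record in records:
        abn = normalize_abn(record.get("abn", ""))
        if abn is None:
            unique.append(record)  # cannot be matched, so keep it for manual review
            continue
        if abn not in seen:
            seen.add(abn)
            unique.append({**record, "abn": abn})
    return unique
```

With this rule, "51 824 753 556" and "51824753556" collapse to the same key, while a mistyped number fails the checksum and is left for manual review rather than silently merged.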

🛠️ Transformations Applied

Integration of Wikirate attributes into registry datasets.
Normalization of company identifiers across sources.
Direct extraction of company news into structured fields for enrichment.
Addition of multilingual markers for Canadian statements.
Consolidation into harmonized CSV datasets ready for analysis and dashboards.
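
A multilingual marker can be as simple as a language column on each statement row. The sketch below guesses English or French from stopword counts. This heuristic, the word lists, and the `language` field name are assumptions chosen for illustration; the project may well use a different method.

```python
import re

# Small, hypothetical stopword lists; real lists would be longer.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "our", "we", "is", "for", "with"},
    "fr": {"le", "la", "les", "et", "des", "de", "nous", "notre", "est",
           "pour", "dans", "du"},
}


def guess_language(text, min_hits=3):
    """Return "en", "fr", or "unknown" based on stopword counts."""
    words = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    counts = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in STOPWORDS.items()
    }
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    (best, best_hits), (_, runner_up_hits) = ranked[0], ranked[1]
    # Require enough evidence and a clear winner before committing to a label.
    if best_hits < min_hits or best_hits == runner_up_hits:
        return "unknown"
    return best


def add_language_marker(rows, text_field="text"):
    """Return copies of the rows with a "language" column added."""
    return [{**row, "language": guess_language(row.get(text_field, ""))} for row in rows]
```

Returning "unknown" for short or ambiguous text avoids mislabeling cover pages and scanned documents whose text has not yet been extracted.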

📦 New Datasets Generated

combined_statements_with_profiles.csv → Registry statements enriched with Wikirate attributes and company website news.
final_statements_with_profiles.csv → Final harmonized dataset including website verification results and accessibility flags.

6. Project Code

Location: /Project code

This folder contains all the code for the two pillars of the hackathon. Each pillar is documented in its own README file.

7. Additional Documentation

Location: /Docs

Video Demonstration

We've created a comprehensive video demonstration showcasing our solution's capabilities. The demo covers:

The full extraction pipeline in action
Web application interface and features
Dashboard functionality and analytics

🎥 Watch the Demo

Our complete video demonstration is available here:
View Demo Video

Additional Resources

PowerPoint Presentation
Technical Documentation and User Guide

8. Intellectual Property

This project builds on the open research of Project AIMS (AI against Modern Slavery) by Mila and QUT.

GitHub repository: ai4h_aims-au.

Disclaimers

Computational Resources & Comparative Results

Describe here the resources used in developing your solution (e.g. GPUs, etc).

No Claims About Companies

This repository and its accompanying models, datasets, metrics, dashboards, and comparative analyses are provided strictly for research and demonstration purposes.

Any comparisons, rankings, or assessments of companies or organizations are exploratory in nature. They may be affected by incomplete data, modeling limitations, or methodological choices. These results must not be used to make factual, legal, or reputational claims about any entity without independent expert review and validation.

Do not use this repository’s contents to make public statements or claims about specific companies, organizations, or individuals.

Terms and Conditions

By submitting this solution to the AIMS Hackathon, our team acknowledges and agrees to abide by the Event’s Terms and Conditions.
