- Nji Ruth Mbikang
- Franck Justin Etape Etape
- Tecla Kyalo
- Uriel Nguefack Yefou
The global fight against modern slavery is a critical and complex challenge. In response, countries like Australia, the UK, and Canada have enacted legislation that requires companies to publish Modern Slavery Statements, detailing their practices and supply chain dynamics. However, these vital documents pose a significant barrier to analysis. They are often unstructured, in formats such as scanned PDFs, and contain a mix of content including plain text, infographics, and tables. Moreover, these reports can be in different languages, such as French for Canadian companies, making them difficult for automated systems to process.
Our challenge is to overcome these data-related obstacles. Our goal is to develop innovative solutions for Data Mining, Processing & Enrichment to transform these disparate, complex documents into a usable, structured dataset. By tackling issues like enhancing OCR accuracy for scanned PDFs, enabling the precise extraction of information from figures and tables, and developing smarter methods for multilingual document understanding and data enrichment, we will be able to unlock the full potential of the data. Our solutions will be the foundation for a powerful Knowledge Repository (Pillar 1), allowing researchers and civil society organizations to analyze company compliance, identify red flags, and ultimately strengthen the fight against modern slavery.
To search for and download corporate Modern Slavery Statements, build company profiles, and transform these complex, unstructured documents into a clean, structured, and usable dataset that accelerates research, analysis, and accountability in the fight against modern slavery.
Mine and Consolidate Data: Design and execute a strategy to systematically collect Modern Slavery Statements from official government registries (Australia, UK, and Canada) and company websites.
Enrich Company Profiles: Create a process to enrich the collected data by integrating information from external datasets (e.g., GDELT, OpenCorporates, Datanyze) to build comprehensive company profiles with details like revenue, directors, and other relevant metrics.
Build an Accessible Knowledge Repository: Develop a dashboard or knowledge repository that allows users to ask and answer complex analytical questions, such as identifying companies reporting to multiple laws or assessing the visibility of their statements online.
Enhance OCR and Document Understanding: Develop and implement methods to significantly improve the accuracy of Optical Character Recognition (OCR) for scanned PDF documents, including those in languages like French.
Extract Structured Data from Unstructured Formats: Create solutions to accurately identify and extract meaningful content from non-textual elements, such as tables, infographics, and figures embedded within documents.
Structure Mixed Content: Build a data pipeline or model that can intelligently structure mixed content (text, visuals, and tables) from corporate reports into a cohesive, machine-readable format.
Enable Social Impact: Provide a working prototype or a robust data processing framework that empowers civil society organizations and researchers with the tools needed to monitor, analyze, and report on corporate compliance, thereby contributing directly to efforts to combat modern slavery.
Our proposed solution is a data platform that transforms unstructured Modern Slavery Statements into a structured, searchable knowledge base. Its core is a universal document extraction pipeline that combines multiple methods to pull text, tables, and infographics from PDF documents, including those with multilingual content or complex formatting. The extracted data populates a knowledge repository and analytics dashboard. The platform lets users run comparative analyses: checking for multi-jurisdictional reporting, assessing statement visibility on company websites, and viewing enriched company profiles with external data such as revenues, directors, and sanctions. It gives researchers and stakeholders a centralized hub that turns disparate documents into actionable intelligence and a clear view of corporate transparency. Key functionalities include:
Systematic Data Ingestion: Our solution includes a sophisticated module for systematically collecting Modern Slavery Statements. This involves:
- Registry Harvesting: Automated search and download of statements from official government registries in Australia, the UK, and Canada.
- Agentic Website Browsing: Implementing advanced techniques, such as tree-structured HTML extraction and agentic browsing, to search and locate statements directly on company websites. This allows us to track not only if a statement is present but also its visibility and exact location (e.g., on the first page, within a specific CSR section) to assess corporate transparency.
- Language Identification: Identifying and downloading documents in specific languages, such as French documents from Canadian companies, to ensure comprehensive coverage.
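The website-browsing step above can be illustrated with a minimal sketch that walks a page's HTML tree, collects links that mention modern slavery, and records how deeply each link is nested as a rough visibility signal. The keyword list, the `StatementLinkFinder` class, and the use of nesting depth as a visibility proxy are assumptions made for illustration; they are not taken from the project code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed search terms; a real crawler would likely use a richer list.
KEYWORDS = ("modern slavery", "modern-slavery", "modern_slavery")
# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StatementLinkFinder(HTMLParser):
    """Collects links whose URL or anchor text mentions modern slavery."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0           # current nesting depth in the HTML tree
        self._open_link = None   # (absolute URL, depth) of the <a> being read
        self._text = []          # text fragments inside the open <a>
        self.matches = []        # dicts with "url", "text", "depth"

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open_link = (urljoin(self.base_url, href), self.depth)
                self._text = []

    def handle_data(self, data):
        if self._open_link is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self._open_link is not None:
            url, depth = self._open_link
            text = " ".join("".join(self._text).split())
            haystack = (url + " " + text).lower()
            if any(keyword in haystack for keyword in KEYWORDS):
                self.matches.append({"url": url, "text": text, "depth": depth})
            self._open_link = None
        self.depth -= 1


def find_statement_links(html, base_url):
    """Return candidate statement links found in one HTML page."""
    finder = StatementLinkFinder(base_url)
    finder.feed(html)
    return finder.matches
```

Calling `find_statement_links(page_html, "https://example.com")` on a home page returns the candidate links together with their depth, which could feed the "statement visibility" field of a company profile. Nesting depth is only a crude proxy; click distance from the home page would be a natural alternative.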
Data Fusion Engine: The collected and extracted data is significantly enriched by integrating information from a variety of external, authoritative datasets. This process creates holistic company profiles, which include:
- Financials & Corporate Structure: Leveraging data from sources like OpenCorporates and Datanyze to pull in revenues, director information, and organizational hierarchies.
- Adverse Media & Risk Indicators: Integrating GDELT for adverse media mentions and OpenSanctions for screening against sanction lists, providing crucial red flags.
- Compliance & Labor Standards: Incorporating data from the International Labour Organization (ILO) for labor standards context and ABN Lookup (for Australian entities) for official business registration details.
- Sustainability & ESG Ratings: Utilizing platforms like Wikirate to gather crowd-sourced and expert-assessed data points related to human rights and modern slavery performance.
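As a rough illustration of the fusion step, the sketch below merges records from several sources into one profile per company, matching on a normalized company name and recording which source supplied each attribute. The field names, the list of legal suffixes, and the "first source wins" rule are assumptions for this example only.

```python
import re

# Assumed set of legal suffixes to ignore when matching company names.
LEGAL_SUFFIXES = {"ltd", "limited", "pty", "inc", "plc", "corp"}


def normalize_name(name):
    """Lowercase a company name and drop punctuation and legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def fuse_profiles(statements, *sources):
    """Build one profile per company from statements and external sources.

    statements: list of dicts, each with a "company" key.
    sources: (source_name, records) tuples; each record has a "company" key
             plus arbitrary attributes.
    """
    profiles = {}
    for row in statements:
        key = normalize_name(row["company"])
        profile = profiles.setdefault(key, {
            "company": row["company"],
            "statements": [],
            "attributes": {},
            "provenance": {},
        })
        profile["statements"].append(row)

    for source_name, records in sources:
        for record in records:
            profile = profiles.get(normalize_name(record["company"]))
            if profile is None:
                continue  # no registry statement to attach this record to
            for field, value in record.items():
                if field == "company" or value is None or value == "":
                    continue
                # The first source to supply a field wins; later ones are ignored.
                if field not in profile["attributes"]:
                    profile["attributes"][field] = value
                    profile["provenance"][field] = source_name
    return profiles
```

Keeping a `provenance` entry per attribute makes it possible to trace every enriched value back to its source, which matters when profiles are used to flag potential risks.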
Advanced Querying: Lets government and civil society organizations drill down to the statements submitted by individual companies and organizations. Supported analyses include:
- Multi-Jurisdictional Reporting: Identifying and visualizing how many companies are reporting under multiple modern slavery laws (e.g., UK, Australian, Canadian).
- Geographical and Sectoral Analysis: Breaking down the number of statements by country, year, and sector to highlight trends and areas of focus.
- Statement Visibility & Accessibility: Providing insights into where statements are found on company websites, contributing to an assessment of corporate diligence.
- Red Flag Identification: Correlating internal statement data with external adverse media or sanction data to proactively identify potential risks.
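The first two analyses in this list reduce to simple aggregations over the structured dataset. The following sketch assumes each statement row carries hypothetical `company_id`, `jurisdiction`, and `year` fields; the actual column names in the project datasets may differ.

```python
from collections import Counter, defaultdict


def jurisdictions_per_company(statements):
    """Map each company to the sorted list of laws it has reported under."""
    laws = defaultdict(set)
    for row in statements:
        laws[row["company_id"]].add(row["jurisdiction"])
    return {company: sorted(found) for company, found in laws.items()}


def multi_jurisdiction_reporters(statements, min_laws=2):
    """Return only the companies reporting under at least `min_laws` laws."""
    return {
        company: found
        for company, found in jurisdictions_per_company(statements).items()
        if len(found) >= min_laws
    }


def statements_by_country_and_year(statements):
    """Count statements per (jurisdiction, year) pair."""
    return Counter((row["jurisdiction"], row["year"]) for row in statements)
```

The same grouped counts could back the "Companies reporting to multiple laws" and "Number of statements by country" charts mentioned for the dashboard.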
Intuitive Visualizations: The dashboard presents complex data in clear, accessible visualizations, enabling users to quickly grasp key trends, identify outliers, and drill down into specific data points. The provided dashboard image serves as a prototype, illustrating the kind of aggregated insights and answerable questions available, such as the 'Number of statements by country,' 'Number of Statement by Year,' and 'Companies reporting to multiple laws' visualizations.

Our user-friendly web interface allows users to:
- Upload PDF documents for processing
- Choose between different extraction methods (PyPDF2, PDFPlumber, or OCR)
- View the original document alongside extracted content
- Access structured data in real-time
The system implements a robust extraction pipeline that:
- Processes both digital and scanned PDFs
- Extracts text, tables, and images with high accuracy
- Supports multiple languages including English and French
- Organizes extracted content into structured formats
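One way to handle both digital and scanned PDFs is to try a fast text-layer extractor first and fall back to OCR only when a page yields too little text. The sketch below shows that control flow with pluggable extractor callables; the `min_chars` threshold and the function name are assumptions, and in practice the callables would wrap libraries such as PDFPlumber and Tesseract.

```python
def extract_page_text(page, extractors, min_chars=50):
    """Try extractors in order and return (method_name, text).

    extractors: ordered list of (name, callable) pairs; each callable takes
                the page and returns a string (or None).
    min_chars:  assumed threshold below which a result is treated as empty,
                for example a scanned page with no embedded text layer.
    """
    best_name, best_text = None, ""
    for name, extractor in extractors:
        try:
            text = (extractor(page) or "").strip()
        except Exception:
            # A failing extractor should not abort the whole document.
            continue
        if len(text) >= min_chars:
            return name, text
        if len(text) > len(best_text):
            best_name, best_text = name, text
    # Nothing reached the threshold: return the longest text seen, if any.
    return best_name, best_text
```

Returning the method name alongside the text lets the extraction report record which pages needed OCR, which is useful when reviewing accuracy per method.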
- Intelligent Text Extraction
  - Advanced OCR capabilities for scanned documents
  - Multi-language support
  - Preprocessing for improved accuracy
- Table Detection & Structuring
  - Automated table detection and extraction
  - Conversion to structured formats (CSV, Excel)
  - Preservation of table relationships and context
- Image Processing
  - Extraction of embedded images
  - OCR processing of text within images
  - Image quality enhancement for better results
- Output Organization
  - Structured folder hierarchy for extracted content
  - Multiple export formats
  - Detailed extraction reports and metrics
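To make the table-structuring step concrete, here is a minimal sketch that converts an extracted table, represented as a list of rows, into CSV text. It assumes the extractor returns cells as strings or `None` (as table extractors commonly do for empty cells) and that rows may be ragged; the function name is hypothetical.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize an extracted table (list of rows) as CSV text.

    Empty cells (None) become empty strings, line breaks inside a cell are
    collapsed to single spaces, and short rows are padded to the widest row.
    """
    width = max((len(row) for row in rows), default=0)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        cells = ["" if cell is None else " ".join(str(cell).split()) for cell in row]
        cells += [""] * (width - len(cells))
        writer.writerow(cells)
    return buffer.getvalue()
```

Padding ragged rows keeps every line at the same column count, so the resulting file loads cleanly into spreadsheet tools and dataframes.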
The solution leverages several advanced technologies:
- Python-based extraction pipeline
- Streamlit for web interface
- PyMuPDF and PDFPlumber for PDF processing
- Tesseract OCR for image text extraction
- Tabula-py for table extraction
- PIL for image processing
- Processing time: Average 2-3 seconds per page
- Text extraction accuracy: >95% for digital PDFs
- Table detection rate: >90% for structured tables
- OCR accuracy: >85% for high-quality scans
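The source does not state how these accuracy figures were measured. One common way to score text extraction against a hand-checked reference is character-level accuracy based on edit distance, sketched below as an assumption rather than as the method actually used here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, extracted):
    """Return 1 - (edit distance / reference length), clipped at 0."""
    if not reference:
        return 1.0 if not extracted else 0.0
    return max(0.0, 1.0 - edit_distance(reference, extracted) / len(reference))
```

Under this definition, a page whose extracted text needs 15 character edits per 100 reference characters would score 0.85.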
Click the link below to watch our 5-minute presentation.
Location: /Datasets
This section documents the datasets explored, accessed, and generated as part of Pillar 1 of the project. It highlights the sources implemented, those only explored, the encountered challenges, and the transformations applied.
- Wikirate API: The main external source actively used. It enriched registry records with location, website, registration numbers, and ABNs. Its open API access was reliable and well-suited to large-scale integration.
- Company Websites (News & Profiles): News articles and basic profile information were scraped directly from company websites. This added context about company activities and provided material for further enrichment beyond registry statements.
- ABN Lookup (Australia): Explored for retrieving Australian Business Numbers but not integrated due to time constraints.
- OpenSanctions: Several compliance datasets (BIC, FIRDS, GLEIF) were downloaded but not integrated because they did not contain information useful for this project.
- GDELT: Considered for adverse media monitoring but not implemented due to scope/time limitations.
- Bloomberg Search: Tested for corporate news monitoring, but postponed for later development.
- ILO Data: Global labour datasets were explored but had limited relevance to company-specific modern slavery reporting.
- Datanyze / OpenCorporates: Both identified as rich sources but not usable at scale due to commercial access restrictions.
- Identifier harmonization: Company numbers, ABNs, and registry IDs required normalization to avoid duplication.
- Format standardization: Data arrived in mixed formats (JSON, CSV, scraped HTML) and had to be transformed into consistent tabular structures.
- Access restrictions: Certain datasets could not be leveraged because of licensing or paywall barriers.
- Integration of Wikirate attributes into registry datasets.
- Normalization of company identifiers across sources.
- Direct extraction of company news into structured fields for enrichment.
- Addition of multilingual markers for Canadian statements.
- Consolidation into harmonized CSV datasets ready for analysis and dashboards.
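Identifier normalization can be sketched for the ABN case. The example strips formatting from a raw ABN, validates it with the standard ABN checksum (subtract 1 from the first digit, apply the published weights, and check divisibility by 89), and uses the result as a deduplication key. The record layout and the fallback to a lowercased name are assumptions for illustration.

```python
import re

# Weighting factors of the standard ABN checksum.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def normalize_abn(raw):
    """Return the ABN as 11 digits, or None if it is missing or invalid."""
    digits = re.sub(r"\D", "", str(raw))
    if len(digits) != 11:
        return None
    values = [int(d) for d in digits]
    values[0] -= 1  # the checksum subtracts 1 from the first digit
    checksum = sum(weight * value for weight, value in zip(ABN_WEIGHTS, values))
    return digits if checksum % 89 == 0 else None


def deduplicate(records):
    """Keep the first record per company, keyed by ABN when one is valid.

    Records without a valid ABN fall back to a lowercased name (assumed rule).
    """
    seen = {}
    for record in records:
        key = normalize_abn(record.get("abn")) or record.get("name", "").strip().lower()
        seen.setdefault(key, record)
    return list(seen.values())
```

With this key, `"51 824 753 556"` and `"ABN: 51-824-753-556"` resolve to the same company, while a mistyped number fails the checksum and is not trusted as an identifier.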
- combined_statements_with_profiles.csv → Registry statements enriched with Wikirate attributes and company website news.
- final_statements_with_profiles.csv → Final harmonized dataset including website verification results and accessibility flags.
Location: /Project code
This folder contains all the code for the two pillars of this hackathon. Each pillar is documented in its own README file.
Location: /Docs
We've created a comprehensive video demonstration showcasing our solution's capabilities. The demo covers:
- The full extraction pipeline in action
- Web application interface and features
- Dashboard functionality and analytics
Our complete video demonstration is available here:
View Demo Video
- PowerPoint Presentation
- Technical documentation and user guide
This project builds on the open research of Project AIMS (AI against Modern Slavery) by Mila and QUT.
GitHub repository: ai4h_aims-au.
This repository and its accompanying models, datasets, metrics, dashboards, and comparative analyses are provided strictly for research and demonstration purposes.
Any comparisons, rankings, or assessments of companies or organizations are exploratory in nature. They may be affected by incomplete data, modeling limitations, or methodological choices. These results must not be used to make factual, legal, or reputational claims about any entity without independent expert review and validation.
Do not use this repository’s contents to make public statements or claims about specific companies, organizations, or individuals.
By submitting this solution to the AIMS Hackathon, our team acknowledges and agrees to abide by the Event’s Terms and Conditions.
