A Composite Index for Pre-Training Dataset Risk Assessment
What happens when Artificial Intelligence starts feeding on its own scraps?
Inspired by the 2024 Nature paper "AI models collapse when trained on recursively generated data", GenProof is an open technical proposal and a quantitative framework designed to measure the risk of Model Collapse before a model is trained.
The full theoretical background, mathematical formulation, and architectural design are available in the Technical Report / Preprint:
👉 Read the GenProof Paper (PDF)

## 🎯 What is GenProof?
GenProof introduces the Inbreeding Coefficient Score (ICS), a calibrated probability score in [0, 1] derived from four independent measurements of dataset health:
- Semantic Entropy: Embedding-space kernel density estimation.
- Generational Similarity: Fréchet Distance and centroid drift against a pre-AI reference corpus (Wikipedia 2020).
- Tail-Density Depletion: Measured against an absolute reference threshold to avoid self-percentile tautologies.
- AI Detection Ensemble: A combined signal from stylometry, SimHash, perplexity, and n-gram repetition.
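
To make the idea of a composite score concrete, here is a minimal sketch of how four signals could be folded into a single value in [0, 1]. It is not the formulation from the paper: the logistic combination, the `Signals` field names, and the `WEIGHTS` and `BIAS` values are all assumptions chosen for illustration, and each signal is assumed to be pre-scaled so that higher means riskier.

```python
import math
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Signals:
    """Four dataset-level risk signals, each assumed pre-scaled to [0, 1].

    Higher always means "more risk" in this sketch; the field names are
    hypothetical and do not come from the GenProof code base.
    """
    semantic_entropy_risk: float
    generational_similarity_risk: float
    tail_depletion_risk: float
    ai_detection_risk: float


# Placeholder weights and bias. Real values would have to be fitted on a
# labelled benchmark, which the project states does not exist yet.
WEIGHTS = (1.5, 1.0, 1.25, 1.25)
BIAS = -2.5


def ics(signals: Signals) -> float:
    """Combine the four signals with a logistic function (an assumption)."""
    values = astuple(signals)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal out of range [0, 1]: {value}")
    z = BIAS + sum(w * v for w, v in zip(WEIGHTS, values))
    return 1.0 / (1.0 + math.exp(-z))


healthy = Signals(0.1, 0.1, 0.05, 0.1)
degraded = Signals(0.8, 0.7, 0.9, 0.85)
print(f"healthy:  {ics(healthy):.3f}")
print(f"degraded: {ics(degraded):.3f}")
```

A logistic link keeps the output strictly inside (0, 1) and increases monotonically with every signal, which is convenient for later calibration. Any other monotone combination would serve the same illustrative purpose.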
The system is designed to handle large-scale datasets (e.g., 500k+ rows) safely using an asynchronous, stream-based architecture:
- FastAPI for the HTTP layer.
- Celery + Redis for background job orchestration.
- Sentence-Transformers & FAISS for efficient CPU-based embedding and retrieval.
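
The stream-based idea can be illustrated without any of the services above. The sketch below uses only the Python standard library and is not the GenProof implementation: it reads a CSV in fixed-size batches and keeps a running mean of one cheap per-row signal (word n-gram repetition), so memory use does not grow with the number of rows. The function names, the `text` column, and the batch size are assumptions.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List, TextIO


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` rows without loading the whole input."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


class RunningMean:
    """Incremental mean, so per-row scores never need to be stored."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.mean += (value - self.mean) / self.count


def repetition_rate(text: str, n: int = 3) -> float:
    """Share of word n-grams that duplicate another n-gram in the same text."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def stream_score(handle: TextIO, batch_size: int = 1000,
                 text_column: str = "text") -> RunningMean:
    """Scan a CSV file-like object batch by batch and aggregate one signal."""
    stats = RunningMean()
    for batch in batched(csv.DictReader(handle), batch_size):
        for row in batch:
            stats.update(repetition_rate(row[text_column]))
    return stats


# Tiny in-memory stand-in for a large dataset file.
sample = io.StringIO("text\na b c a b c a b c\nthe quick brown fox\n")
result = stream_score(sample, batch_size=1)
print(result.count, round(result.mean, 3))
```

In a deployed system, a function like the hypothetical `stream_score` would typically run inside a background worker, with the HTTP layer only enqueuing the job and reporting its status.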
I am a manager in the insurance sector, not an academic researcher. I built GenProof during late nights and weekends, driven by an insatiable curiosity about AI and the future of data quality.
I am releasing this framework as open source because I believe we live in an era where anyone, armed with curiosity and an LLM, can try to contribute to frontier problems. The ideas are out in the world now, whether they serve as a starting point, a component of something larger, or simply inspiration for a different direction.
This is an open technical proposal. The code provided outlines the complete architecture, but the benchmark dataset for final calibration does not yet exist. Contributions, critiques, and forks are highly welcome!
This project is licensed under the MIT License - see the LICENSE file for details.