GenProof 🧬

A Composite Index for Pre-Training Dataset Risk Assessment

What happens when Artificial Intelligence starts feeding on its own scraps?

Inspired by the 2024 Nature paper "AI models collapse when trained on recursively generated data", GenProof is an open technical proposal and a quantitative framework designed to measure the risk of Model Collapse before a model is trained.

📖 The Whitepaper

The full theoretical background, mathematical formulation, and architectural design are available in the Technical Report / Preprint: 👉 Read the GenProof Paper (PDF)

🎯 What is GenProof?

GenProof introduces the Inbreeding Coefficient Score (ICS), a calibrated probability score in [0, 1] derived from four independent measurements of dataset health:

Semantic Entropy: Embedding-space kernel density estimation.
Generational Similarity: Fréchet Distance and centroid drift against a pre-AI reference corpus (Wikipedia 2020).
Tail-Density Depletion: Measured against an absolute reference threshold to avoid self-percentile tautologies.
AI Detection Ensemble: A combined signal from stylometry, SimHash, perplexity, and n-gram repetition.
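
To make two of the ensemble signals above more concrete, here is a minimal sketch of an n-gram repetition rate and a SimHash fingerprint in plain Python. It is an illustration under stated assumptions, not the implementation described in the paper: the function names, the whitespace tokenization, the trigram default, and the use of MD5 as the per-token hash are all hypothetical choices made for this example.

```python
import hashlib
from collections import Counter


def ngram_repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repetition.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint over lowercase whitespace tokens (assumed tokenization)."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two fingerprints suggests near-duplicate documents, while a high repetition rate suggests templated or looping text. How these raw values are normalized and weighted inside the ensemble is defined in the paper, not here.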

🏗️ System Architecture

The system is designed to handle large-scale datasets (e.g., 500k+ rows) safely using an asynchronous, stream-based architecture:

FastAPI for the HTTP layer.
Celery + Redis for background job orchestration.
Sentence-Transformers & FAISS for efficient CPU-based embedding and retrieval.
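
The stream-based idea can be sketched without any of the services above. The example below reads rows lazily in fixed-size batches, reduces each batch to a small partial summary, and merges the summaries into a dataset centroid, so memory use depends on the batch size rather than the dataset size. This is a simplified illustration only: `batched`, `summarize_batch`, and `streaming_centroid` are hypothetical names, the rows are assumed to be already-embedded vectors, and in the proposed system each batch would presumably be handled by a background worker rather than an in-process loop.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def batched(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items without loading all rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def summarize_batch(batch: List[Vector]) -> Tuple[List[float], int]:
    """Reduce one batch to (per-dimension sums, row count)."""
    sums = [0.0] * len(batch[0])
    for vec in batch:
        for i, x in enumerate(vec):
            sums[i] += x
    return sums, len(batch)


def streaming_centroid(rows: Iterable[Vector], batch_size: int = 1000) -> List[float]:
    """Merge per-batch summaries into the mean vector of the whole stream."""
    total: List[float] = []
    count = 0
    for batch in batched(rows, batch_size):
        sums, n = summarize_batch(batch)
        total = sums if not total else [a + b for a, b in zip(total, sums)]
        count += n
    if count == 0:
        raise ValueError("cannot compute a centroid of an empty stream")
    return [s / count for s in total]
```

Because each batch summary is independent of the others, the same split into "summarize" and "merge" steps would also work if the batches were processed by separate workers.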

💡 Why I built this

I am a manager in the insurance sector, not an academic researcher. I built GenProof during late nights and weekends, driven by an insatiable curiosity about AI and the future of data quality.

I am releasing this framework as open source because I believe we live in an era where anyone, armed with curiosity and an LLM, can try to contribute to frontier problems. The ideas are out in the world now, whether they serve as a starting point, a component of something larger, or simply inspiration for a different direction.

🤝 Contributing

This is an open technical proposal. The code provided outlines the complete architecture, but the benchmark dataset for final calibration does not yet exist. Contributions, critiques, and forks are highly welcome!

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

