Data deduplication engine, supporting optional compression and public key encryption.
Updated Aug 25, 2022 - Rust
Official Repository of "LLM × DATA" Survey Paper
🚢 Data Toolkit for Sailor Language Models
Self-contained C# library for data deduplication using SQLite
Fast and efficient content-defined chunking for data deduplication. Java implementation of FastCDC as library.
Optimal distributed data deduplication and supervised learning pipeline using Apache Spark
A Java project that splits data into blocks using hashing techniques and removes duplicate blocks to save cloud storage. It also uses the CloudSim framework for cloud storage simulation.
Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.
PolyDeDupe: Multi-Lingual Data Deduplication
A pure-JS, content-addressed, copy-on-write virtual filesystem for the browser, featuring: deduplication, filesystem universes (snapshots), events, and optional asynchronous sync.
Probabilistic record linkage across sites like Shopify & Stripe — 100% precision, zero false positives.
Enterprise-grade SaaS platform for importing, cleaning, and managing large-scale mailing lists with advanced deduplication and enrichment.
A calculator for storage and transmission of deduplicated data. Output: charts and tables
This project is a powerful tool for finding and analyzing duplicate files in a specified directory. The program efficiently identifies identical files based on their content, using the SHA-256 hashing algorithm. It supports configurable parameters, such as the minimum file size to check and ignoring certain…
Fellow is a package for creating people that can be unified by their shared values via a singleton list on the class
A web tool that compares two URL lists, identifies unique and matching domains, and creates a detailed report. It streamlines URL data analysis efficiently.
Practical backups. The Unix toolkit way.
This repository contains SQL scripts and documentation for cleaning and standardizing data in the NashvilleHousing table within the sqlproject2 database. The project aims to prepare the dataset for analysis by addressing inconsistencies, filling missing values, standardizing formats, and removing duplicates.
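Several of the projects above describe the same basic idea: split data into blocks, hash each block, and store each distinct block only once. A minimal Python sketch of that idea follows. It is not taken from any of the listed repositories; the function names `dedupe_blocks` and `restore`, the fixed 4096-byte block size, and the in-memory dictionary store are all assumptions made for illustration. Content-defined chunkers such as FastCDC choose block boundaries differently.

```python
import hashlib


def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipe): store maps a SHA-256 hex digest to the block
    bytes, and recipe lists the digests in order so the data can be rebuilt.
    Both names are hypothetical, chosen for this sketch.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # Store the block only the first time this digest is seen.
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original bytes by concatenating blocks in recipe order."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    # Three identical blocks of "A" followed by one block of "B".
    sample = b"A" * 4096 * 3 + b"B" * 4096
    store, recipe = dedupe_blocks(sample)
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
    assert restore(store, recipe) == sample
```

Fixed-size blocks keep the sketch short, but inserting a single byte near the start of the data shifts every later block boundary and defeats deduplication, which is the problem content-defined chunking is meant to address.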