data-deduplication

Here are 29 public repositories matching this topic...

dpc / rdedup

Data deduplication engine, supporting optional compression and public key encryption.

backup encryption data-deduplication deduplication

Updated Aug 25, 2022
Rust

OpenDataBox / awesome-data-llm

Official Repository of "LLM × DATA" Survey Paper

data-transformation data-acquisition data-deduplication data-filtering vlm data-selection data-synthesis data-provenance llm data-mixing

Updated Jun 15, 2026

sail-sg / sailcraft

🚢 Data Toolkit for Sailor Language Models

data-deduplication data-cleaning

Updated Feb 24, 2025
Python

jchristn / WatsonDedupe

Self-contained C# library for data deduplication using Sqlite

compression storage nuget dedupe sqlite-database data-deduplication chunk compress deduplication chunk-data duplicate-data chunk-key

Updated Apr 7, 2023
C#

Zabuzard / FastCDC4J

Fast and efficient content-defined chunking for data deduplication. Java implementation of FastCDC as library.

java library data-deduplication chunking cdc fastcdc content-defined-chunking

Updated Sep 21, 2023
Java

david-siqi-liu / sparklyclean

Optimal distributed data deduplication and supervised learning pipeline using Apache Spark

distributed-systems data-science spark hadoop data-deduplication data-engineering data-cleaning deduplication

Updated Aug 19, 2020
Scala

shubham-thakare / data-deduplication

A Java project that splits data using hashing techniques and removes duplicate blocks to save cloud storage. It also uses the CloudSim framework for cloud storage simulation.

java cloud-storage data-deduplication cloudsim cloudsim-framework

Updated Jan 6, 2021
Java

melkarama / makfuzz

java search-engine cross-platform record-linkage fuzzy-matching data-deduplication desktop-application jaro-winkler data-cleaning similarity-score data-quality string-similarity csv-processing excel-export phonetic-search names-matching internationalization-i18n

Updated Jan 1, 2026
Java

JochiRaider / sievio

Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.

python data-deduplication dataset-creation data-pipelines repository-mining jsonl github-repos rag text-preprocessing quality-filtering code-mining llm llm-training llm-datasets

Updated Dec 27, 2025
Python

gagan3012 / PolyDeDupe

PolyDeDupe: Multi-Lingual Data Deduplication

multilingual nlp data-deduplication

Updated Jun 29, 2026
Python

MS702 / jFS3

A pure-JS, content-addressed, copy-on-write virtual filesystem for the browser, featuring: deduplication, filesystem universes (snapshots), events, and optional asynchronous sync.

browser localstorage pure-javascript data-deduplication indexeddb leightweight content-addressable-storage copy-on-write delta-sync webxdc virtual-filesystem

Updated Jan 8, 2026
JavaScript

surpradhan / customer-unification-agent

Probabilistic record linkage across sites like Shopify & Stripe — 100% precision, zero false positives.

python stripe record-linkage shopify data-deduplication customer-data

Updated Apr 14, 2026
Python

rspeciale0519 / MailingListManager

Enterprise-grade SaaS platform for importing, cleaning, and managing large-scale mailing lists with advanced deduplication and enrichment.

react typescript postgresql saas data-deduplication mailing-list fastify contact-management-system

Updated Jan 1, 2026
TypeScript

Anveshika06 / VIT-VTAS-TY-2022

data-deduplication hashing-algorithm

Updated Jan 7, 2023
Python

Jim-JMCD / Data_storage_network_deduplication_calculator

A calculator for storage and transmission of deduplicated data. Output: charts and tables

data-deduplication deduplication deduplication-calculator storage-deduplication-calculator network-deduplication-calculator

Updated Jun 7, 2026

dffdgdg / FindDuplicates

This project is a powerful tool for finding and analyzing duplicate files in a specified directory. The program efficiently identifies identical files based on their content, using the SHA-256 hashing algorithm. It supports configurable parameters, such as the minimum file size to check and ignoring certain…

python hashing productivity multithreading data-deduplication file-system sha256 file-management system-utility cli-tool dev-tools file-deduplication file-comparison disk-cleanup command-line-utility duplicate-file-finder

Updated Feb 14, 2025
Python

bevry / fellow

Fellow is a package for creating people that can be unified by their shared values via a singleton list on the class

nodejs model data-deduplication client-side

Updated Jun 21, 2026
TypeScript

shakeeb-sa / domain-checker

A web tool that compares two URL lists, identifies unique and matching domains, and creates a detailed report. It streamlines URL data analysis efficiently.

data-deduplication react-js domain-analysis xlsx-parsing url-comparison

Updated Jan 24, 2026
JavaScript

baraverkstad / mixtape

Practical backups. The Unix toolkit way.

linux shell bash unix backup command-line data-deduplication

Updated Jun 28, 2025
Shell

KeerthanaPalanikumar / Data-Cleaning-on-SQL

This repository contains SQL scripts and documentation for cleaning and standardizing data in the NashvilleHousing table within the sqlproject2 database. The project aims to prepare the dataset for analysis by addressing inconsistencies, filling missing values, standardizing formats, and removing duplicates.

data-deduplication database-management mssql data-manipulation data-cleaning ssms data-standardization

Updated Jun 17, 2024
