
Deduplication

Joana Maia edited this page Jun 26, 2025 · 4 revisions

🔀 Merge Profiles - Deduplicating profiles

Principles behind deduplication

Profile (people and organizations) deduplication is the process of identifying and merging multiple representations of the same individual or entity within a dataset. In a system like the Community Data Platform, duplicate profiles arise because data is collected from multiple sources and the same entity can appear under inconsistent identities.

The deduplication process is guided by the following core principles:

  • Accuracy over Aggressiveness: Merging profiles should be conservative to avoid incorrect merges. False positives are more damaging than duplicates.
  • Traceability: Every merge should be recorded so it can be audited or reversed.
  • Source Priority: Data from more reliable sources (e.g. GitHub, LinkedIn) should take precedence when merging conflicting profile fields.
  • Minimal Data Loss: When merging, no important data should be discarded unless it is known to be redundant or superseded. Instead, data should be merged into one profile.
  • Automation with Oversight: Deduplication can be automated using confidence thresholds, but manual review must always remain available for edge cases.
  • Automation First: Wherever feasible, we favor automated processes over manual intervention to ensure scalability, consistency, and efficiency.
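As an illustration of the Source Priority principle above, here is a minimal sketch of resolving one conflicting profile field. The ranking list, the function name `resolve_field`, and the `(source, value)` input shape are assumptions made for this example and do not describe the platform's actual implementation. In particular, the relative order of GitHub and LinkedIn is arbitrary here.

```python
# Hypothetical ranking: earlier entries are treated as more reliable.
# The order between "github" and "linkedin" is an assumption for this sketch.
SOURCE_PRIORITY = ["github", "linkedin"]


def resolve_field(candidates):
    """Pick one value for a profile field from (source, value) pairs.

    Sources missing from SOURCE_PRIORITY all share the lowest priority;
    ties are resolved in favor of the candidate listed first.
    """
    rank = {source: i for i, source in enumerate(SOURCE_PRIORITY)}
    unknown = len(SOURCE_PRIORITY)
    _source, value = min(candidates, key=lambda c: rank.get(c[0], unknown))
    return value


display_name = resolve_field([("twitter", "Ada L."), ("github", "Ada Lovelace")])
```

Keeping the ranking in a single list makes the precedence rule easy to audit and change without touching the merge logic.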

Deduplication goal

The goal of deduplication is to ensure that each person or entity is represented by a single, unified profile that aggregates all related activity and metadata. This enhances:

  • Data Quality: Fewer duplicates, redundancies and inconsistencies.
  • Analytics Accuracy: More reliable metrics and insights.

A well-merged profile serves as the single source of truth for a person's identity and engagement history across platforms.

Deduplication concepts

Primary profile

The primary profile is the one the system keeps when merging duplicates. It retains all the data it initially had and inherits any data points linked to the secondary profile that did not already exist on the primary one.

When the system detects duplicated profiles, it suggests one primary and one secondary, where the secondary is merged into the primary one. These are the criteria for choosing the primary profile:

  • First priority: Member with more identities comes first.
  • Second priority: If both members have the same number of identities, the member with the higher activity count is marked as the primary one.
  • Equal case: If both criteria are equal, the order remains unchanged.
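The criteria above can be sketched as a small comparison function. The `Member` class, its field names, and `order_pair` are hypothetical names introduced for this example, not the platform's real data model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Member:
    # Hypothetical, simplified view of a profile for this sketch.
    id: str
    identity_count: int
    activity_count: int


def order_pair(a: Member, b: Member) -> Tuple[Member, Member]:
    """Return (primary, secondary) following the criteria listed above."""
    # First priority: the member with more identities becomes primary.
    if a.identity_count != b.identity_count:
        return (a, b) if a.identity_count > b.identity_count else (b, a)
    # Second priority: the member with the higher activity count.
    if a.activity_count != b.activity_count:
        return (a, b) if a.activity_count > b.activity_count else (b, a)
    # Equal case: keep the original order.
    return (a, b)
```

For example, given a member with 3 identities and 10 activities and another with 2 identities and 500 activities, the first one is chosen as primary, because the identity count is compared before the activity count.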

Secondary Profile

Secondary profiles are the ones that get merged into the primary profile. They might have overlapping or partial info (a GitHub handle here, a Twitter profile there), and we pull in anything useful that's not already in the primary profile.

Once merged, these profiles are hidden or removed, but their data sticks around if it's relevant.
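The merge behaviour described above (the primary keeps its own data and inherits only what it lacks) could look roughly like the following sketch. Profiles are modelled as plain dictionaries with an `identities` list of `(platform, handle)` pairs; this shape and the function name `merge_profiles` are assumptions made for illustration only.

```python
def merge_profiles(primary: dict, secondary: dict) -> dict:
    """Return a new profile: primary values win, gaps are filled from secondary."""
    merged = dict(primary)

    for key, value in secondary.items():
        if key == "identities":
            continue  # identities are combined separately below
        # Only inherit a field the primary does not have a value for.
        if merged.get(key) in (None, "", []):
            merged[key] = value

    # Combine identities from both profiles, dropping exact duplicates
    # while preserving order (primary identities first).
    seen = set()
    identities = []
    for identity in primary.get("identities", []) + secondary.get("identities", []):
        if identity not in seen:
            seen.add(identity)
            identities.append(identity)
    merged["identities"] = identities
    return merged
```

Returning a new dictionary instead of mutating the primary keeps the pre-merge state available, which fits the traceability principle of being able to audit or reverse a merge.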

Similarity and Confidence Thresholds

TBD

Manual Merges

If the system hasn't identified a duplicated profile, the user can still merge the profiles manually via the UI.

Merge Suggestions

Since we are collecting millions of profiles, relying on manual processes alone for deduplication wouldn't be scalable. Therefore, the system has an automatic mechanism in place to detect duplicates and mark them as merge suggestions. Merge suggestions can then be manually reviewed by a user, or handled automatically by an LLM agent. Each suggestion is either accepted or ignored, depending on whether it is correct.
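The review step described above can be sketched as a small state transition on a suggestion record. The `MergeSuggestion` class, the `Status` values, and the `review` function are hypothetical names for this example. How suggestions are detected and scored is not covered here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    IGNORED = "ignored"


@dataclass
class MergeSuggestion:
    # Hypothetical record shape for this sketch.
    primary_id: str
    secondary_id: str
    status: Status = Status.PENDING
    reviewed_by: Optional[str] = None


def review(suggestion: MergeSuggestion, is_same_entity: bool, reviewer: str) -> MergeSuggestion:
    """Accept the suggestion if both profiles are the same entity, else ignore it."""
    if suggestion.status is not Status.PENDING:
        raise ValueError("suggestion has already been reviewed")
    suggestion.status = Status.ACCEPTED if is_same_entity else Status.IGNORED
    # The reviewer can be a human user or an automated agent.
    suggestion.reviewed_by = reviewer
    return suggestion
```

Recording who reviewed a suggestion, whether a user or an LLM agent, keeps the decision auditable.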

Unmerging process

Manual deduplication process

TBD

Automatic deduplication / Merge suggestions process

TBD

Algorithm to generate people merge suggestions

People Merge suggestions

Algorithm to generate organization merge suggestions

Organization Merge suggestions

Technical implementation

TBD

Components

Deduplication components

People merges

People merge

Organization merges

Organization merge

Merge Suggestions and Automatic Merging

Merge Suggestions and Automatic Merging

Data Schema

Merges data schema
