Deduplication
Profile deduplication, for both people and organizations, is the process of identifying and merging multiple representations of the same individual or entity within a dataset. In a system like the Community Data Platform, duplicate profiles arise because data is collected from multiple sources and the same entity often appears under inconsistent identities.
The deduplication process is guided by the following core principles:
- Accuracy over Aggressiveness: Merging profiles should be conservative to avoid incorrect merges. False positives are more damaging than duplicates.
- Traceability: Every merge should be recorded so it can be audited or reversed.
- Source Priority: Data from more reliable sources (e.g. GitHub, LinkedIn) should take precedence when merging conflicting profile fields.
- Minimal Data Loss: When merging, no important data should be discarded unless it is known to be redundant or superseded. Instead, data should be merged into one profile.
- Automation with Oversight: Deduplication can be automated using confidence thresholds, but manual review must always remain available for edge cases.
- Automation First: Wherever feasible, we favor automated processes over manual intervention to ensure scalability, consistency, and efficiency.
The goal of deduplication is to ensure that each person or entity is represented by a single, unified profile that aggregates all related activity and metadata. This enhances:
- Data Quality: Fewer duplicates, redundancies and inconsistencies.
- Analytics Accuracy: More reliable metrics and insights.
A well-merged profile serves as the single source of truth for a person's identity and engagement history across platforms.
The primary profile is the one the system keeps when merging duplicates. It retains all the data it initially had, which takes priority, and inherits every data point linked to the secondary profile that did not already exist on the primary one.
When the system detects duplicated profiles, it suggests one primary and one secondary, where the secondary is merged into the primary one. These are the criteria for choosing the primary profile:
- First priority: Member with more identities comes first.
- Second priority: If both members have the same number of identities, the member with the higher activity count is marked as the primary one.
- Equal case: If both criteria are equal, the order remains unchanged.
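The criteria above can be sketched as a small comparison function. This is only an illustration: the `Member` class, its field names, and `choose_primary` are hypothetical and not taken from the platform's code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Member:
    # Hypothetical shape: only the two counts the criteria rely on.
    id: str
    identity_count: int
    activity_count: int


def choose_primary(a: Member, b: Member) -> Tuple[Member, Member]:
    """Return (primary, secondary) for a suggested pair of duplicates."""
    # First priority: the member with more identities comes first.
    if a.identity_count != b.identity_count:
        return (a, b) if a.identity_count > b.identity_count else (b, a)
    # Second priority: the member with the higher activity count.
    if a.activity_count != b.activity_count:
        return (a, b) if a.activity_count > b.activity_count else (b, a)
    # Equal case: the order remains unchanged.
    return (a, b)
```

For example, a member with three identities and one activity would be chosen as primary over a member with two identities and ten activities, because the identity count is compared first.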
Secondary profiles are the ones that get merged into the primary profile. They might have overlapping or partial info (a GitHub handle here, a Twitter profile there), and we pull in anything useful that's not already in the primary profile.
Once merged, these profiles are hidden or removed, but their data sticks around if it's relevant.
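A minimal sketch of that merge behavior, assuming profiles are plain dictionaries with an `identities` list of `(platform, handle)` pairs. The function name and field names are hypothetical, and the sketch deliberately ignores source priority and merge auditing, both of which a real implementation would need.

```python
def merge_profiles(primary: dict, secondary: dict) -> dict:
    """Return a new profile: primary values win, gaps are filled from secondary."""
    merged = dict(primary)  # shallow copy, so the input is not mutated

    for key, value in secondary.items():
        if key == "identities":
            continue  # identities are combined separately below
        # Inherit only data points the primary profile does not already have.
        if merged.get(key) in (None, "", []):
            merged[key] = value

    # Combine identities from both profiles, dropping exact duplicates
    # while keeping the primary profile's identities first.
    seen = set()
    identities = []
    for identity in primary.get("identities", []) + secondary.get("identities", []):
        if identity not in seen:
            seen.add(identity)
            identities.append(identity)
    merged["identities"] = identities
    return merged


# Usage with made-up data:
primary = {"displayName": "Ada", "location": None,
           "identities": [("github", "ada")]}
secondary = {"displayName": "A. L.", "location": "London",
             "identities": [("github", "ada"), ("twitter", "ada_l")]}
merged = merge_profiles(primary, secondary)
```

Here `merged` keeps the primary `displayName`, picks up `location` from the secondary profile, and ends up with both identities.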
TBD
If the system hasn't identified a duplicated profile, the user can still merge the profiles manually via the UI.
Since we are collecting millions of profiles, relying on manual processes alone for deduplication wouldn't be scalable. Therefore, the system has an automatic mechanism in place to detect duplicates and mark them as merge suggestions. Merge suggestions can then be manually reviewed by a user, or handled automatically by an LLM agent. Each suggestion is either accepted, if the merge is correct, or ignored, if it is wrong.
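The suggestion lifecycle described above could be modeled roughly as follows. Everything here is an assumption made for illustration: the `MergeSuggestion` shape, the `confidence` score, the `0.9` threshold, and the idea of routing by threshold are not taken from the platform's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    IGNORED = "ignored"


@dataclass
class MergeSuggestion:
    primary_id: str
    secondary_id: str
    confidence: float  # hypothetical similarity score in [0, 1]
    status: Status = Status.PENDING


# Assumed cut-off: at or above it, the suggestion goes to the automated
# reviewer; below it, the suggestion waits for a human.
AUTO_REVIEW_THRESHOLD = 0.9


def route(suggestion: MergeSuggestion) -> str:
    """Decide who reviews a suggestion: 'agent' or 'human'."""
    return "agent" if suggestion.confidence >= AUTO_REVIEW_THRESHOLD else "human"


def resolve(suggestion: MergeSuggestion, is_correct: bool) -> MergeSuggestion:
    """Accept a correct suggestion or ignore a wrong one."""
    if suggestion.status is not Status.PENDING:
        raise ValueError("suggestion has already been resolved")
    suggestion.status = Status.ACCEPTED if is_correct else Status.IGNORED
    return suggestion
```

Keeping the reviewer decision (`route`) separate from the outcome (`resolve`) means a human and an automated reviewer can share the same accept/ignore path.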
TBD
TBD


TBD

People merges

Organization merges

Merge Suggestions and Automatic Merging

