Problem
The memory extraction pipeline currently does a poor job of detecting entity identification information (phone numbers, Signal IDs, Discord IDs/usernames, email addresses, GitHub handles, etc.) and associating it with the correct entity record.
Example
During Neva's onboarding, the following information was available across Signal and Discord conversations but was never extracted or associated with her entity record (entity_id 24):
- Signal phone number (+18086498444)
- Signal UUID (bf688e71-9267-4b72-ba9e-21a506fbf190)
- Email (heyninarei@gmail.com)
- GitHub username (Pr1ncessN1na)
- Discord ID (603715052435931147)
- Discord username (ninarei)
- Discord display name (Nina Rei)
- Timezone (America/Chicago)
- Nickname (Princess)
All of this had to be manually inserted after the fact.
Desired Behavior
The memory extractor should:
- Detect identity-class information in conversations — phone numbers, email addresses, usernames, platform IDs, UUIDs, timezones, pronouns, real names, nicknames, etc.
- Associate detected facts with the correct entity — using conversational context, sender metadata, and existing entity records to determine who the information belongs to.
- Store as entity_facts with appropriate data_type = 'identity' and high confidence when self-reported.
- Cross-reference platform identifiers — when a Discord user mentions their Signal number or email, link those to the same entity.
- Leverage inbound message metadata — sender IDs, phone numbers, and usernames from message envelopes should be automatically captured and associated.
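A minimal sketch of the detection and association steps listed above, assuming a simple regex-based pass over message text. The pattern set, the `IdentityFact` shape, the `extract_identity_facts` name, and the confidence values are all hypothetical and not taken from the existing pipeline. Facts are attributed to the sender only when the message looks self-reported; otherwise the entity is left unresolved for a later step.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical patterns; a real extractor would need broader, locale-aware rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+\d{8,15}\b"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}

# Naive cue that the sender is describing themselves ("my email is ...").
SELF_REPORT = re.compile(r"\b(my|i'm|i am|me)\b", re.IGNORECASE)


@dataclass
class IdentityFact:
    entity_id: Optional[int]  # None means "owner not yet resolved"
    key: str
    value: str
    data_type: str
    confidence: float


def extract_identity_facts(text: str, sender_entity_id: int) -> List[IdentityFact]:
    """Find identity-class values in text and attribute them to the sender
    only when the message appears to be self-reported."""
    self_reported = bool(SELF_REPORT.search(text))
    # Assumed confidence values, chosen only for illustration.
    confidence = 0.9 if self_reported else 0.5
    owner = sender_entity_id if self_reported else None
    facts = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            facts.append(
                IdentityFact(owner, key, match.group(0), "identity", confidence)
            )
    return facts
```

Calling `extract_identity_facts("my email is nina@example.com", 24)` would yield one `email` fact attributed to entity 24, while a message such as "her email is a@b.io" would yield a fact with `entity_id=None` that still needs to be resolved from conversational context.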
Scope
- Focus on the memory extraction hooks/pipeline (not the embedding layer).
- Consider both explicit self-reporting ("my email is...") and implicit metadata (Signal sender phone number in message envelope).
- Should handle the common platforms: Signal, Discord, Telegram, Slack, email, GitHub, X/Twitter.
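The implicit-metadata case could be handled with a per-platform field mapping, sketched below. The envelope field names (`source_number`, `author_id`, and so on), the fact keys, and the confidence of 1.0 are assumptions made for illustration; the actual envelope shape for each platform would need to be checked against the real adapters.

```python
from typing import Any, Dict, List

# Hypothetical mapping: platform -> list of (envelope field, fact key).
ENVELOPE_FIELDS = {
    "signal": [("source_number", "signal_phone"), ("source_uuid", "signal_uuid")],
    "discord": [("author_id", "discord_id"), ("author_username", "discord_username")],
}


def facts_from_envelope(
    platform: str, envelope: Dict[str, Any], entity_id: int
) -> List[Dict[str, Any]]:
    """Turn sender metadata from a message envelope into identity fact rows."""
    rows = []
    for field, key in ENVELOPE_FIELDS.get(platform, []):
        value = envelope.get(field)
        if value:  # skip missing or empty fields
            rows.append(
                {
                    "entity_id": entity_id,
                    "key": key,
                    "value": str(value),
                    "data_type": "identity",
                    # Assumed: envelope data is treated as fully trusted.
                    "confidence": 1.0,
                }
            )
    return rows
```

Keeping the mapping as data rather than per-platform code means adding Telegram, Slack, or another platform would only require a new dictionary entry, under the assumption that each envelope is available as a flat dictionary.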
Related
- Entity facts table: entity_facts (key/value with entity_id FK)
- Memory extraction hook: memory-extract in hooks config
- Onboarding workflow also touches this (User Onboarding workflow)
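For illustration, storing and cross-referencing identity facts against the entity_facts table could look like the sketch below. It uses SQLite for self-containment, and everything beyond "key/value with entity_id FK" is an assumption: the `entities` table, the `data_type` and `confidence` columns as shown, the uniqueness constraint, and the `store_fact` and `find_entity` helpers. The upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3
from typing import Optional

# Assumed schema; the real entity_facts table may differ.
SCHEMA = """
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE entity_facts (
    entity_id  INTEGER NOT NULL REFERENCES entities(id),
    key        TEXT NOT NULL,
    value      TEXT NOT NULL,
    data_type  TEXT NOT NULL,
    confidence REAL NOT NULL,
    UNIQUE (entity_id, key, value)
);
"""


def store_fact(
    conn: sqlite3.Connection, entity_id: int, key: str, value: str, confidence: float
) -> None:
    """Insert an identity fact, keeping the highest confidence seen so far."""
    conn.execute(
        """
        INSERT INTO entity_facts (entity_id, key, value, data_type, confidence)
        VALUES (?, ?, ?, 'identity', ?)
        ON CONFLICT (entity_id, key, value)
        DO UPDATE SET confidence = MAX(confidence, excluded.confidence)
        """,
        (entity_id, key, value, confidence),
    )


def find_entity(conn: sqlite3.Connection, key: str, value: str) -> Optional[int]:
    """Look up which entity, if any, already owns a given identifier."""
    row = conn.execute(
        "SELECT entity_id FROM entity_facts "
        "WHERE key = ? AND value = ? AND data_type = 'identity'",
        (key, value),
    ).fetchone()
    return row[0] if row else None
```

A lookup like `find_entity` is one way to support cross-referencing: when a message on one platform mentions an identifier already stored for an entity, new facts from that message can be linked to the same record instead of creating a duplicate.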