[Bug]: phone regex over-matches 10-digit identifiers (GitHub comment IDs, order numbers, etc.) #195

@guolvlin-cn

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment (optional)

memoria 0.3.3, MCP mode, macOS

Actual behavior

The phone pattern in memoria/crates/memoria-core/src/sensitivity.rs redacts any unbroken 10- or 11-digit number to [phone], because both of its separators are optional:

regex: r"\b\d{3}[-.]?\d{3,4}[-.]?\d{4}\b",
replacement: "[phone]",

Because the ? makes each [-.] separator optional, a plain 10-digit sequence like 4345955198 matches and gets redacted. Real-world collateral damage I hit today:

  • GitHub comment URLs: https://github.com/org/repo/issues/123#issuecomment-4345955198 → the numeric suffix is rewritten to [phone], and the link becomes unusable.
  • memory_correct on a memory containing such a URL silently loses the reference; repeated corrections can't restore it.

Likely also affected: order IDs, transaction IDs, any 10–11 digit opaque identifier pasted in context.

Expected behavior

The filter should only redact digit sequences that actually look like phone numbers — i.e. require at least one separator, or require surrounding context (+, tel:, parens, international prefix) before redacting.
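
As a rough illustration of the "separator or context" rule, here is a std-only sketch. The names looks_like_phone, has_separated_phone_shape and has_international_prefix are hypothetical and not part of memoria. The 8 to 15 digit bound for the + form is my assumption (roughly E.164), and parentheses handling is left out for brevity.

```rust
/// Hypothetical helper, not part of memoria: true when `token` has the
/// 3-(3|4)-4 digit shape with a mandatory '-' or '.' between the groups.
fn has_separated_phone_shape(token: &str) -> bool {
    let groups: Vec<&str> = token.split(|c: char| c == '-' || c == '.').collect();
    groups.len() == 3
        && groups
            .iter()
            .all(|g| !g.is_empty() && g.bytes().all(|b| b.is_ascii_digit()))
        && groups[0].len() == 3
        && matches!(groups[1].len(), 3 | 4)
        && groups[2].len() == 4
}

/// Hypothetical helper: true for "+<digits>" or "tel:+<digits>".
/// The 8..=15 digit range is an assumption, not taken from memoria.
fn has_international_prefix(token: &str) -> bool {
    let rest = token.strip_prefix("tel:").unwrap_or(token);
    match rest.strip_prefix('+') {
        Some(digits) => {
            (8..=15).contains(&digits.len()) && digits.bytes().all(|b| b.is_ascii_digit())
        }
        None => false,
    }
}

/// A bare run of digits is never treated as a phone number; it needs
/// either separators or an explicit prefix.
fn looks_like_phone(token: &str) -> bool {
    has_separated_phone_shape(token) || has_international_prefix(token)
}
```

With this rule, 555-123-4567 and +8613812345678 still count as phone numbers, while the bare 4345955198 from the comment URL does not.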

Alternatives worth considering:

  1. Tighten the regex to require separators: \b\d{3}[-.]\d{3,4}[-.]\d{4}\b (no ?), plus a separate pattern for +<country>\d{7,} style.
  2. URL-aware skip: don't run MEDIUM-tier redaction inside obvious URLs (e.g. tokens that contain :// or match [a-z]+://\S+).
  3. Allow callers (especially memory_correct / memory_store) to opt out per-field via a flag like disable_pii_redaction=true when the caller knows the content is trusted.
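
Option 2 could look roughly like the following std-only sketch. redact_outside_urls is a hypothetical wrapper, not an existing memoria function, and naive_phone_redact is only a stand-in that imitates the over-matching (it rewrites a trailing run of exactly 10 digits); the real patterns would be plugged in instead.

```rust
/// Hypothetical wrapper: apply `redact` to each whitespace-separated token,
/// but pass tokens containing "://" through untouched. Whitespace is kept.
fn redact_outside_urls<F>(text: &str, redact: F) -> String
where
    F: Fn(&str) -> String,
{
    let mut out = String::with_capacity(text.len());
    // Each piece is a (possibly empty) token followed by at most one
    // whitespace character.
    for piece in text.split_inclusive(char::is_whitespace) {
        let token = piece.trim_end();
        let trailing = &piece[token.len()..];
        if token.is_empty() || token.contains("://") {
            out.push_str(token); // URL-like token: skip redaction
        } else {
            out.push_str(&redact(token));
        }
        out.push_str(trailing);
    }
    out
}

/// Stand-in for the current behavior, for demonstration only: replaces a
/// trailing run of exactly 10 ASCII digits with "[phone]".
fn naive_phone_redact(token: &str) -> String {
    let digits = token
        .bytes()
        .rev()
        .take_while(|b| b.is_ascii_digit())
        .count();
    if digits == 10 {
        format!("{}[phone]", &token[..token.len() - digits])
    } else {
        token.to_string()
    }
}
```

Calling redact_outside_urls(s, naive_phone_redact) on a string that contains both the issuecomment URL and a bare 4345955198 leaves the URL intact and only redacts the bare number. The trade-off is that a real phone number embedded in a URL would also be skipped.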

Steps to reproduce

// in any place that calls check_sensitivity
let s = "see https://github.com/foo/bar/issues/1#issuecomment-4345955198";
let r = memoria_core::sensitivity::check_sensitivity(s);
assert_eq!(r.redacted_content, None); // currently fails — gets redacted to [phone]

Or via MCP:

memory_store(content="ref https://github.com/farion1231/cc-switch/issues/2423#issuecomment-4345955198")
# retrieved content shows: ref https://github.com/farion1231/cc-switch/issues/2423#issuecomment-[phone]

Additional information

Same failure mode likely applies to credit_card (\b(?:\d[ -]*?){13,19}\b) — that pattern will match any 13–19 digit sequence without a Luhn check, so e.g. a 15-digit order number or bank reference would also get redacted to [card]. Worth auditing all MEDIUM patterns for the same over-reach.
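
For reference, a Luhn gate is small. This is a std-only sketch with a hypothetical name (passes_luhn), intended to run on a candidate after the credit_card pattern has matched. Roughly one in ten arbitrary digit strings still passes by chance, so it reduces false positives rather than eliminating them.

```rust
/// Hypothetical helper, not part of memoria: true when `candidate` contains
/// 13 to 19 digits (spaces and hyphens ignored) that pass the Luhn checksum.
fn passes_luhn(candidate: &str) -> bool {
    let mut digits: Vec<u32> = Vec::new();
    for c in candidate.chars() {
        match c {
            '0'..='9' => digits.push(c as u32 - '0' as u32),
            ' ' | '-' => {} // separators allowed by the existing pattern
            _ => return false,
        }
    }
    if !(13..=19).contains(&digits.len()) {
        return false;
    }
    // Walk from the rightmost digit; double every second digit and
    // subtract 9 when the doubled value exceeds 9.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}
```

The well-known test number 4111 1111 1111 1111 passes, while a sequential 15-digit value such as 123456789012345 does not and would be left unredacted.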

Suggested priority: low-med — bug degrades retrieval quality but doesn't cause data loss on the primary store (original stays intact in snapshots).
