Commit 74a0fc4

rofe and claude committed
fix: split bullet-list emails Teams concatenates without separators
Teams delivers bullet lists in activity.text as flat HTML-entity-encoded text with the <li> boundaries removed, so "a@adobe.com" and "b@adobe.com" arrive glued as "a@adobe.comb@adobe.com". The email regex then backtracks into nonsense (TLD=.comb, etc).

Decode common HTML entities, then insert a space after a known common TLD when it's immediately followed by a letter (only possible when two emails were concatenated).

Trade-off: addresses in less common TLDs that share a prefix with a common one (e.g. .coffee) lose correctness, but this covers the Adobe/customer corporate domains we actually see.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
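
The failure described in the commit message can be reproduced in isolation. The sketch below copies EMAIL_RE from src/parser.js; the input strings and the variable names `glued`, `beforeFix`, and `separated` are illustrative assumptions, not code from the repository.

```javascript
// EMAIL_RE is copied from src/parser.js; everything else is illustrative.
const EMAIL_RE = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;

// Two bullet-list items with no separator between them, as described
// in the commit message.
const glued = 'a@adobe.comb@adobe.com';

// The domain part greedily consumes "adobe.comb", backtracks to the last
// dot, and accepts "comb" as the TLD. The second address is lost because
// nothing remains in front of its "@" to act as a local part.
const beforeFix = glued.match(EMAIL_RE);
console.log(beforeFix); // [ 'a@adobe.comb' ]

// Once a separator is restored, both addresses are recovered.
const separated = 'a@adobe.com b@adobe.com'.match(EMAIL_RE);
console.log(separated); // [ 'a@adobe.com', 'b@adobe.com' ]
```

This is why the fix inserts a space after the TLD rather than changing EMAIL_RE itself: the regex behaves correctly once the boundary exists.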
1 parent 22f49c6 commit 74a0fc4

1 file changed

Lines changed: 24 additions & 7 deletions

src/parser.js

@@ -2,13 +2,30 @@ const EMAIL_RE = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;

 const INTENT_WORDS = /\b(add|invite|include|onboard|grant|give\s+access|join)\b/i;

-// Strip <at>...</at> mentions entirely, then replace every remaining HTML
-// tag with a space so adjacent <li> / <a> / <p> boundaries become real
-// separators for the email regex.
+// Common TLDs used to separate emails that Teams flattens together when
+// users send bullet lists. We insert a space after these when they are
+// immediately followed by another letter (which can only happen when two
+// emails were concatenated).
+const COMMON_TLDS = 'com|org|net|edu|gov|mil|io|co|us|uk|de|fr|jp|cn|au|in|br|ca|me|tv|info|biz|app|dev|ai|cloud';
+const TLD_GLUE_RE = new RegExp(`\\.(${COMMON_TLDS})(?=[a-zA-Z])`, 'gi');
+
+const HTML_ENTITIES = { lt: '<', gt: '>', amp: '&', quot: '"', apos: "'", nbsp: ' ' };
+
+function decodeEntities(text) {
+  return text.replace(/&(lt|gt|amp|quot|apos|nbsp|#\d+);/gi, (m, name) => {
+    if (name.startsWith('#')) return String.fromCharCode(Number(name.slice(1)));
+    return HTML_ENTITIES[name.toLowerCase()] ?? m;
+  });
+}
+
+// Decode HTML entities, strip <at>...</at> mentions entirely, replace
+// every remaining HTML tag with a space, then split emails that Teams
+// concatenated together (no separator between bullet list items).
 function stripMarkup(text) {
-  return text
+  return decodeEntities(text)
     .replace(/<at[^>]*>.*?<\/at>/gi, ' ')
-    .replace(/<[^>]+>/g, ' ');
+    .replace(/<[^>]+>/g, ' ')
+    .replace(TLD_GLUE_RE, '.$1 ');
 }

 export function hasAddIntent(text) {
@@ -18,8 +35,8 @@ export function hasAddIntent(text) {

 /**
  * Extract de-duplicated email addresses from a Teams message.
- * Pulls emails from mailto: links first (so bullet lists that get
- * flattened don't lose their separators), then from the stripped text.
+ * Pulls emails from mailto: links first (when chips are present), then
+ * from the stripped text (with concatenated bullet-list emails split).
  */
 export function extractEmails(text) {
   if (!text) return [];
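
The splitting step can be exercised on its own to see how it behaves on glued and unglued input. The sketch below copies EMAIL_RE, COMMON_TLDS, and TLD_GLUE_RE from the diff. The sample strings, `splitAsCommitted`, and `splitGuarded` are hypothetical and not part of the commit. Run in isolation, the pattern appears to also split an ordinary trailing ".com": when "com" fails the lookahead, the alternation falls through to "co", and the "m" then satisfies the lookahead. The same overlap applies to in/info and de/dev. Whether this surfaces in extractEmails depends on code not shown in the diff. `splitGuarded` is one possible guard, offered as an assumption rather than the author's method.

```javascript
// Patterns copied from the diff; sample strings and helper names are made up.
const EMAIL_RE = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;
const COMMON_TLDS = 'com|org|net|edu|gov|mil|io|co|us|uk|de|fr|jp|cn|au|in|br|ca|me|tv|info|biz|app|dev|ai|cloud';
const TLD_GLUE_RE = new RegExp(`\\.(${COMMON_TLDS})(?=[a-zA-Z])`, 'gi');

// The replacement step from stripMarkup, run on its own.
const splitAsCommitted = (s) => s.replace(TLD_GLUE_RE, '.$1 ');

// A glued .org/.net pair is split as intended.
const okCase = splitAsCommitted('a@adobe.orgb@adobe.net').match(EMAIL_RE);
console.log(okCase); // [ 'a@adobe.org', 'b@adobe.net' ]

// An unglued .com address: "co" matches and "m" satisfies the lookahead.
const overlapCase = splitAsCommitted('b@adobe.com');
console.log(overlapCase); // 'b@adobe.co m'

// Hypothetical guard: look at the whole run of letters after a dot, leave
// it alone when it is exactly a listed TLD, otherwise split after the
// longest listed TLD that prefixes it.
const TLD_SET = new Set(COMMON_TLDS.split('|'));
const LONGEST_FIRST = [...TLD_SET].sort((a, b) => b.length - a.length);

function splitGuarded(s) {
  return s.replace(/\.([a-zA-Z]+)/g, (m, run) => {
    const lower = run.toLowerCase();
    if (TLD_SET.has(lower)) return m; // ".com", ".info", ".dev" stay intact
    const tld = LONGEST_FIRST.find((t) => lower.startsWith(t));
    return tld ? `.${run.slice(0, tld.length)} ${run.slice(tld.length)}` : m;
  });
}

const guardedSingle = splitGuarded('b@adobe.com');
const guardedGlued = splitGuarded('a@adobe.comb@adobe.com').match(EMAIL_RE);
console.log(guardedSingle); // 'b@adobe.com'
console.log(guardedGlued);  // [ 'a@adobe.com', 'b@adobe.com' ]
```

Both variants still split a dotted local part or subdomain that begins with a listed TLD (for example a hypothetical "john.compton@..."), which is the same class of trade-off the commit message accepts for .coffee.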

0 commit comments
