Commit 2fe74de
Fix ISO 8601 date pattern accepting impossible month/day values (#2113)
* Fix ISO 8601 date pattern accepting impossible month/day values

The "ISO 8601 datetime" pattern in DateRecognizer used `[01]\d` for the month and `[0-3]\d` for the day. These ranges admit impossible values: month `00` and `13`-`19`, and day `00` and `32`-`39`. As a result, strings such as `2024-13-15T14:30:00Z` and `2024-12-32T14:30Z` were detected as DATE_TIME.

Every other date pattern in this same file already constrains the month to `01`-`12` and the day to `01`-`31`; only the ISO 8601 pattern was loose. Tighten the ISO month/day fields to match (using non-capturing groups so existing capture-group positions are unaffected). No valid ISO 8601 datetime is lost, since those values are not valid dates to begin with.

Adds parametrized cases for invalid month (00, 13) and day (00, 32).

* Address review: apply word boundary across all alternatives, rename pattern

Two review points from #2113:

1. `|` has lower precedence than concatenation, so the pattern `\b A | B | C \b` was parsed as `(\b A) | B | (C \b)`. The leading `\b` only guarded the first alternative (the full-fractional form) and the trailing `\b` only guarded the last (the minutes-only form). The seconds-only alternative in the middle had no word-boundary anchor at all, so a valid seconds/minutes datetime could match mid-word (e.g. `Today is2024-03-15T14:30:00+02:00`). Wrap the alternation in a non-capturing group so both `\b` anchors apply to every alternative.

2. Rename the pattern from "ISO 8601 datetime" to "Datetime (yyyy-mm-ddThh:mm[:ss[.f]] with timezone)": the pattern doesn't fully validate ISO 8601 (e.g. the hour field admits 24–29). The new name honestly describes the shape it accepts.

---------

Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>
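The precedence point in review item 1 can be reproduced with a toy pattern. The sketch below is only an illustration: `AAA`, `BBB`, `CCC` and the helper `matches` are hypothetical stand-ins for the three datetime alternatives, not code from this commit.

```python
import re

# Toy stand-ins (hypothetical) for the three alternatives of the real pattern.
# Ungrouped: parsed as (\bAAA) | (BBB) | (CCC\b), so each \b guards one side
# of one alternative only.
UNGROUPED = re.compile(r"\bAAA|BBB|CCC\b")
# Grouped: the non-capturing group makes both \b anchors apply to every
# alternative, which is the shape the commit moves to.
GROUPED = re.compile(r"\b(?:AAA|BBB|CCC)\b")


def matches(pattern, text):
    """Return every non-overlapping match of `pattern` in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]


for sample in ("xBBBx", "xCCC", "AAAx", "say BBB now"):
    print(sample, matches(UNGROUPED, sample), matches(GROUPED, sample))
```

With the ungrouped form the middle alternative matches inside `xBBBx`, while the grouped form only matches when the token stands on its own.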
1 parent ac56751 commit 2fe74de

2 files changed

Lines changed: 12 additions & 2 deletions

presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/date_recognizer.py

Lines changed: 2 additions & 2 deletions
@@ -15,8 +15,8 @@ class DateRecognizer(PatternRecognizer):
 
     PATTERNS = [
         Pattern(
-            "ISO 8601 datetime",
-            r"\b(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z))\b",
+            "Datetime (yyyy-mm-ddThh:mm[:ss[.f]] with timezone)",
+            r"\b(?:(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T[0-2]\d:[0-5]\d:[0-5]\d\.\d+([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T[0-2]\d:[0-5]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T[0-2]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z)))\b",
             0.8,
         ),
         Pattern(
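
To see the effect of the tightened month/day fields in isolation, the date portion of the old and new patterns can be compared directly. This is a minimal sketch: the fragments are copied from the two regex lines in this diff, while the names `OLD_DATE`, `NEW_DATE`, `accepts` and `SAMPLES` are introduced here purely for illustration.

```python
import re

# Date fragment of the removed pattern: month [01]\d, day [0-3]\d.
OLD_DATE = re.compile(r"\d{4}-[01]\d-[0-3]\d")
# Date fragment of the added pattern: month 01-12, day 01-31.
NEW_DATE = re.compile(r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])")


def accepts(pattern, date_text):
    """True when the whole string matches the date fragment."""
    return pattern.fullmatch(date_text) is not None


# Illustrative inputs; the last one is not part of the commit's tests.
SAMPLES = [
    "2024-03-15",  # valid date
    "2024-13-15",  # month 13
    "2024-00-15",  # month 00
    "2024-12-32",  # day 32
    "2024-12-00",  # day 00
    "2024-02-31",  # in range for both fields, but not a real calendar date
]

for sample in SAMPLES:
    print(sample, "old:", accepts(OLD_DATE, sample), "new:", accepts(NEW_DATE, sample))
```

The new fragment only range-checks each field, so a string like `2024-02-31` still passes; checking day counts per month is outside what this regex attempts.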

presidio-analyzer/tests/test_date_recognizer.py

Lines changed: 10 additions & 0 deletions
@@ -49,7 +49,17 @@ def entities():
         ("Today is 2024-03-15T14:30:00Z\r or not?", 1, ((9, 29),), ((0.6, 0.81),),),
         ("Today is 2024-03-15T14:30Z\n or not?", 1, ((9, 26),), ((0.6, 0.81),),),
         ("2024-03-15T14:30Z", 1, ((0, 17),), ((0.6, 1),),),
+        # Invalid ISO 8601 month/day values must not be detected as a date
+        ("2024-13-15T14:30:00Z", 0, (), (),),
+        ("2024-00-15T14:30:00Z", 0, (), (),),
+        ("2024-12-32T14:30Z", 0, (), (),),
+        ("2024-12-00T14:30Z", 0, (), (),),
         ("Today is2024-06-05T09:15:30.500-07:00", 0, (), (),),
+        # The leading `\b` must apply to every alternative in the pattern,
+        # not just the first one. Without a non-capturing wrapper, the
+        # seconds and minutes-only alternatives could match mid-word.
+        ("Today is2024-03-15T14:30:00+02:00", 0, (), (),),
+        ("Today is2024-03-15T14:30Z", 0, (), (),),
         # Word boundary tests
         ("Today is5/21", 0, (), (),),
         ("Today is5/21and it's sunny", 0, (), (),),