Fix ISO 8601 date pattern accepting impossible month/day values #2113
The "ISO 8601 datetime" pattern in DateRecognizer used `[01]\d` for the month and `[0-3]\d` for the day. These ranges admit impossible values: month `00` and `13`-`19`, and day `00` and `32`-`39`. As a result, strings such as `2024-13-15T14:30:00Z` and `2024-12-32T14:30Z` were detected as DATE_TIME.

Every other date pattern in this same file already constrains the month to `01`-`12` and the day to `01`-`31`; only the ISO 8601 pattern was loose.

Tighten the ISO month/day fields to match, using non-capturing groups so existing capture-group positions are unaffected. No valid ISO 8601 datetime is lost, since the excluded values are not valid dates to begin with.

Adds parametrized cases for invalid month (`00`, `13`) and day (`00`, `32`).
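To illustrate the difference between the loose and tightened fields, here is a minimal sketch. It assumes simplified stand-in patterns that cover only the `yyyy-mm-ddT` prefix; `LOOSE`, `TIGHT`, and `accepted` are illustrative names and not the actual DateRecognizer code.

```python
import re

# Simplified stand-ins for the month/day fields (assumption: the real
# DateRecognizer pattern also covers time and timezone, omitted here).
LOOSE = re.compile(r"\b\d{4}-[01]\d-[0-3]\dT")
TIGHT = re.compile(r"\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T")

samples = [
    "2024-03-15T14:30:00Z",  # valid month and day
    "2024-13-15T14:30:00Z",  # month 13
    "2024-00-15T14:30:00Z",  # month 00
    "2024-12-32T14:30Z",     # day 32
    "2024-12-00T14:30Z",     # day 00
]


def accepted(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern finds a date prefix anywhere in text."""
    return pattern.search(text) is not None


loose_hits = [s for s in samples if accepted(LOOSE, s)]
tight_hits = [s for s in samples if accepted(TIGHT, s)]

print(len(loose_hits))  # the loose fields accept every sample
print(tight_hits)       # the tight fields keep only the valid date
```

Note that range checks of this kind still accept calendar-impossible combinations such as February 31; they only rule out field values that can never occur.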
…attern

Two review points from data-privacy-stack#2113:

1. `|` has lower precedence than concatenation, so the pattern `\b A | B | C \b` was parsed as `(\b A) | B | (C \b)`. The leading `\b` only guarded the first alternative (the full-fractional form) and the trailing `\b` only guarded the last (the minutes-only form). The seconds-only alternative in the middle had no word-boundary anchor at all, so a valid seconds/minutes datetime could match mid-word (e.g. `Today is2024-03-15T14:30:00+02:00`). Wrap the alternation in a non-capturing group so both `\b` anchors apply to every alternative.
2. Rename the pattern from "ISO 8601 datetime" to "Datetime (yyyy-mm-ddThh:mm[:ss[.f]] with timezone)", because the pattern doesn't fully validate ISO 8601 (e.g. the hour field admits 24–29). The new name honestly describes the shape it accepts.
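The precedence issue in point 1 can be reproduced with a small sketch. The three alternatives below are simplified, hypothetical versions of the fractional, seconds, and minutes-only forms; they are assumptions for illustration, not the patterns from the file.

```python
import re

# Hypothetical building blocks (not the real DateRecognizer pattern).
DATE = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"
TZ = r"(?:Z|[+-]\d{2}:\d{2})"

FRACTION = DATE + r":\d{2}\.\d+" + TZ  # hh:mm:ss.f + timezone
SECONDS = DATE + r":\d{2}" + TZ        # hh:mm:ss + timezone
MINUTES = DATE + TZ                    # hh:mm + timezone

# Ungrouped: parsed as (\b FRACTION) | SECONDS | (MINUTES \b).
UNGROUPED = re.compile(rf"\b{FRACTION}|{SECONDS}|{MINUTES}\b")

# Grouped: both \b anchors apply to every alternative.
GROUPED = re.compile(rf"\b(?:{FRACTION}|{SECONDS}|{MINUTES})\b")

glued = "Today is2024-03-15T14:30:00+02:00"
spaced = "Today is 2024-03-15T14:30:00+02:00"

ungrouped_glued = UNGROUPED.search(glued)
grouped_glued = GROUPED.search(glued)
grouped_spaced = GROUPED.search(spaced)

print(ungrouped_glued is not None)  # the unanchored middle alternative matches mid-word
print(grouped_glued is None)        # the grouped form rejects the mid-word case
print(grouped_spaced.group(0))      # a properly delimited datetime still matches
```

The non-capturing `(?:...)` wrapper changes only how the anchors bind; it adds no new capture group.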
Thanks @SharonHart — both good points, addressed in the follow-up commit:
All 41 …
Quick heads-up on the red CI: the failures on this run come from a transient GitHub API outage in … The API is back …
Summary
The `"ISO 8601 datetime"` pattern in `DateRecognizer` uses `[01]\d` for the month and `[0-3]\d` for the day. These ranges admit values that cannot occur in a real date:

- month `00` and `13`–`19`
- day `00` and `32`–`39`

So strings like `2024-13-15T14:30:00Z` or `2024-12-32T14:30Z` are detected as `DATE_TIME`.

Every other date pattern in this same file already constrains the month to `01`–`12` (`([1-9]|0[1-9]|1[0-2])`) and the day to `01`–`31` (`([1-9]|0[1-9]|[1-2][0-9]|3[0-1])`). Only the ISO 8601 pattern was left loose; this looks like an oversight rather than intent.

Fix
Tighten the month/day fields of the ISO pattern to valid ISO 8601 ranges (zero-padded, since ISO 8601 always uses 2-digit fields).
Non-capturing groups are used so the existing capture-group positions in the pattern are unchanged. No valid ISO 8601 datetime is lost — the excluded values are not valid dates.
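Why non-capturing groups matter can be shown with a short sketch. It assumes a hypothetical pattern whose only existing capture group is the timezone; the real pattern's groups may differ.

```python
import re

# Hypothetical "before" pattern: one capture group, the timezone.
BEFORE = re.compile(r"\d{4}-[01]\d-[0-3]\dT\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})")

# Tightened with non-capturing groups: the timezone is still group 1.
NON_CAPTURING = re.compile(
    r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})"
)

# Tightened with capturing groups: the timezone shifts to group 3.
CAPTURING = re.compile(
    r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})"
)

text = "2024-03-15T14:30Z"

before_group_1 = BEFORE.search(text).group(1)
non_capturing_group_1 = NON_CAPTURING.search(text).group(1)
capturing_group_1 = CAPTURING.search(text).group(1)

print(before_group_1)         # timezone
print(non_capturing_group_1)  # still the timezone
print(capturing_group_1)      # now the month, so code reading group(1) would break
```

Any caller that indexes groups by position keeps working with the `(?:...)` form, which is the property the fix relies on.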
Tests
Added parametrized cases for invalid month (`00`, `13`) and day (`00`, `32`). All existing `test_date_recognizer.py` cases still pass (39 passed total). `ruff check .` is clean.