Skip to content

Fix ISO 8601 date pattern accepting impossible month/day values#2113

Merged
SharonHart merged 3 commits into
data-privacy-stack:mainfrom
jichaowang02-lang:fix/date-iso8601-month-day-range
Jul 5, 2026
Merged

Fix ISO 8601 date pattern accepting impossible month/day values#2113
SharonHart merged 3 commits into
data-privacy-stack:mainfrom
jichaowang02-lang:fix/date-iso8601-month-day-range

Conversation

@jichaowang02-lang

Copy link
Copy Markdown
Contributor

Summary

The "ISO 8601 datetime" pattern in DateRecognizer uses [01]\d for the month and [0-3]\d for the day. These ranges admit values that cannot occur in a real date:

  • month: 00 and 1319
  • day: 00 and 3239

So strings like 2024-13-15T14:30:00Z or 2024-12-32T14:30Z are detected as DATE_TIME.

Every other date pattern in this same file already constrains the month to 0112 ([1-9]|0[1-9]|1[0-2]) and the day to 0131 ([1-9]|0[1-9]|[1-2][0-9]|3[0-1]). Only the ISO 8601 pattern was left loose — this looks like an oversight rather than intent.

Fix

Tighten the month/day fields of the ISO pattern to valid ISO 8601 ranges (zero-padded, since ISO 8601 always uses 2-digit fields):

-\d{4}-[01]\d-[0-3]\dT...
+\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T...

Non-capturing groups are used so the existing capture-group positions in the pattern are unchanged. No valid ISO 8601 datetime is lost — the excluded values are not valid dates.

Tests

Added parametrized cases for invalid month (00, 13) and day (00, 32). All existing test_date_recognizer.py cases still pass (39 passed total). ruff check . is clean.

The "ISO 8601 datetime" pattern in DateRecognizer used `[01]\d` for the
month and `[0-3]\d` for the day. These ranges admit impossible values:
month `00` and `13`-`19`, and day `00` and `32`-`39`. As a result
strings such as `2024-13-15T14:30:00Z` and `2024-12-32T14:30Z` were
detected as DATE_TIME.

Every other date pattern in this same file already constrains the
month to `01`-`12` and the day to `01`-`31`; only the ISO 8601 pattern
was loose. Tighten the ISO month/day fields to match (using
non-capturing groups so existing capture-group positions are
unaffected). No valid ISO 8601 datetime is lost, since those values are
not valid dates to begin with.

Adds parametrized cases for invalid month (00, 13) and day (00, 32).
Copilot AI review requested due to automatic review settings June 27, 2026 16:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

…attern

Two review points from data-privacy-stack#2113:

1. `|` has lower precedence than concatenation, so the pattern
   `\b A | B | C \b` was parsed as `(\b A) | B | (C \b)`. The leading
   `\b` only guarded the first alternative (the full-fractional form)
   and the trailing `\b` only guarded the last (the minutes-only form).
   The seconds-only alternative in the middle had no word-boundary
   anchor at all, so a valid seconds/minutes datetime could match
   mid-word (e.g. `Today is2024-03-15T14:30:00+02:00`). Wrap the
   alternation in a non-capturing group so both `\b` anchors apply to
   every alternative.

2. Rename the pattern from "ISO 8601 datetime" to
   "Datetime (yyyy-mm-ddThh:mm[:ss[.f]] with timezone)" — the pattern
   doesn't fully validate ISO 8601 (e.g. the hour field admits 24–29).
   The new name honestly describes the shape it accepts.
@jichaowang02-lang

Copy link
Copy Markdown
Contributor Author

Thanks @SharonHart — both good points, addressed in the follow-up commit:

  1. Word-boundary across all alternatives. Confirmed the leak — Today is2024-03-15T14:30:00+02:00 matched via the middle alternative before this change. Wrapped the alternation in a non-capturing group \b(?: alt1 | alt2 | alt3 )\b so both \b anchors apply to every branch. New regression tests cover the seconds and minutes-only mid-word cases.

  2. Not-really-ISO-8601 name. You're right — the hour field is [0-2]\d (admits 24–29) and other fields aren't fully validated either. Renamed the pattern to "Datetime (yyyy-mm-ddThh:mm[:ss[.f]] with timezone)" which honestly describes the shape it accepts, without claiming ISO 8601 compliance. I kept the branch/PR title as-is; happy to rename those too if you'd prefer.

All 41 test_date_recognizer.py cases pass; ruff check . clean.

@jichaowang02-lang

Copy link
Copy Markdown
Contributor Author

Quick heads-up on the red CI: the failures on this run are a transient GitHub API outage in python-coverage-comment-action, not real test failures. The failed jobs all died at ~45–60s at the coverage-upload step:

httpx.HTTPStatusError: Server error '503 Service Unavailable' for url
'https://api.github.com/repos/data-privacy-stack/presidio/contents/data.json?ref=coverage-data-presidio-anonymizer'
##[error]Critical error. This error possibly occurred because the permissions of the workflow are set incorrectly.

The API is back — CI on main ran green a bit later today (15:26 UTC). I don't have rerun permission on the workflow; @SharonHart could you rerun the failed jobs when convenient? The follow-up commit that addresses your review points is b63950a1 and locally pytest tests/test_date_recognizer.py is 41 passed / ruff clean.

Copilot AI review requested due to automatic review settings July 5, 2026 07:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@SharonHart SharonHart merged commit 2fe74de into data-privacy-stack:main Jul 5, 2026
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants