fix(cdk): support start_date format without microseconds in file-based connectors#945

Merged: Daryna Ishchenko (darynaishchenko) merged 10 commits into main from devin/1773159423-fix-start-date-format-parsing on Mar 11, 2026

@darynaishchenko Daryna Ishchenko (darynaishchenko) commented Mar 10, 2026

fix(cdk): support start_date format without microseconds in file-based connectors

Summary

AbstractFileBasedStreamReader.filter_files_by_globs_and_start_date crashes at runtime when start_date is provided without microseconds (e.g. 2025-01-01T00:00:00Z), because datetime.strptime with %f requires the fractional-seconds component. This format is commonly produced by Terraform and other API clients.
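
The failure described above can be reproduced with the standard library alone. This is a minimal sketch, not the CDK code: the format string is the one quoted in this description, and try_parse is a hypothetical helper added only for illustration.

```python
from datetime import datetime

# Strict format quoted in the PR description (requires fractional seconds).
DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def try_parse(value: str):
    """Hypothetical helper: return the parsed datetime, or the ValueError raised."""
    try:
        return datetime.strptime(value, DATE_TIME_FORMAT)
    except ValueError as exc:
        return exc


with_micros = try_parse("2025-01-01T00:00:00.000000Z")
without_micros = try_parse("2025-01-01T00:00:00Z")

print(with_micros)                    # a naive datetime
print(type(without_micros).__name__)  # the exception type raised by strptime
```

The second input fails because the format expects a literal "." followed by digits after the seconds field, and the string goes straight to "Z".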

The spec validator (updated in CDK v7.7.1) already accepts this format, so the failure only surfaces at runtime during file filtering — not at config validation time.

Fix: Adds a _parse_start_date instance method with a three-tier fallback:

  1. Try self.DATE_TIME_FORMAT (%Y-%m-%dT%H:%M:%S.%fZ) — the strict format with microseconds
  2. Try %Y-%m-%dT%H:%M:%SZ — the shorter ISO8601 variant without microseconds
  3. Fall back to ab_datetime_parse() from airbyte_cdk.utils.datetime_helpers — handles date-only (YYYY-MM-DD), timezone offsets, and other formats supported by dateutil

For the ab_datetime_parse fallback, the result is first converted to UTC via .astimezone(timezone.utc) and then made naive via .replace(tzinfo=None) to remain comparable with RemoteFile.last_modified (a naive datetime). This ensures non-UTC offsets like +05:30 are correctly converted before comparison.
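
The three-tier fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: it is a free function rather than an instance method, and datetime.fromisoformat stands in for ab_datetime_parse so the example runs without the CDK installed. The stand-in accepts fewer formats than ab_datetime_parse, and the sketch assumes a value without an offset is meant as UTC.

```python
from datetime import datetime, timezone

# Formats quoted in the PR description.
DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
NO_MICROSECONDS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_start_date(start_date_str: str) -> datetime:
    """Return a naive datetime in UTC for the given start_date string."""
    # Tiers 1 and 2: strict formats, where "Z" is matched as a literal character.
    for fmt in (DATE_TIME_FORMAT, NO_MICROSECONDS_FORMAT):
        try:
            return datetime.strptime(start_date_str, fmt)
        except ValueError:
            continue
    # Tier 3: flexible parser. ASSUMPTION: fromisoformat is a stand-in for
    # ab_datetime_parse; it raises ValueError on input it cannot parse.
    parsed = datetime.fromisoformat(start_date_str)
    if parsed.tzinfo is None:
        # ASSUMPTION: treat offset-less input (e.g. date-only) as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Convert to UTC first, then drop tzinfo so the result stays comparable
    # with a naive last_modified value.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


for value in (
    "2025-01-01T00:00:00.000000Z",
    "2025-01-01T00:00:00Z",
    "2025-01-01",
    "2025-01-01T00:00:00+05:30",
):
    print(value, "->", parse_start_date(value))
```

Trying the strict formats first keeps the common cases on the same code path as before, so the flexible parser only sees input that the strict formats reject.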

Relates to: https://github.com/airbytehq/oncall/issues/9390

Updates since last revision

  • Fixed timezone handling in ab_datetime_parse fallback: now uses .astimezone(timezone.utc).replace(tzinfo=None) instead of .replace(tzinfo=None). Previously, a non-UTC offset like 2025-01-01T00:00:00+05:30 would silently discard the offset and produce 2025-01-01 00:00:00 instead of the correct UTC-equivalent 2024-12-31 18:30:00.
  • Added test case with_timezone_offset_converted_to_utc verifying correct UTC conversion for timezone-offset inputs.
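
The difference between the two approaches can be seen with the standard library. In this small sketch, datetime.fromisoformat is used only to obtain a timezone-aware value; it is not the parser used in the CDK.

```python
from datetime import datetime, timezone

# A timezone-aware value with a non-UTC offset.
aware = datetime.fromisoformat("2025-01-01T00:00:00+05:30")

# Previous behavior: dropping tzinfo keeps the local wall-clock time.
stripped = aware.replace(tzinfo=None)

# Fixed behavior: convert to UTC first, then drop tzinfo.
converted = aware.astimezone(timezone.utc).replace(tzinfo=None)

print(stripped)   # 2025-01-01 00:00:00
print(converted)  # 2024-12-31 18:30:00
```

With the old behavior, a file last modified at 2024-12-31 20:00:00 UTC would have been filtered out even though it is later than the configured start_date.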

Previous updates

  • Added ab_datetime_parse as a third-level fallback in _parse_start_date. If both strptime formats fail, the method now delegates to ab_datetime_parse() rather than raising ValueError.
  • Added test case for date-only format ("2025-01-01") exercising the ab_datetime_parse fallback path.
  • Removed @staticmethod decorator from _parse_start_date; the method now uses self.DATE_TIME_FORMAT instead of referencing AbstractFileBasedStreamReader.DATE_TIME_FORMAT directly. This ensures subclass overrides of DATE_TIME_FORMAT are respected.
  • Extended the docstring on _parse_start_date to document that the fallback format originates from AbstractFileBasedSpec's pattern_descriptor: "YYYY-MM-DD, YYYY-MM-DDTHH:mm:ssZ, or YYYY-MM-DDTHH:mm:ss.SSSSSSZ".
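
Why dropping @staticmethod matters for subclass overrides can be illustrated with a sketch. BaseReader, CustomReader, and both method names are hypothetical and exist only for this example; the override format is also made up.

```python
from datetime import datetime


class BaseReader:
    """Hypothetical stand-in for a reader class with a DATE_TIME_FORMAT attribute."""

    DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

    @staticmethod
    def parse_static(value: str) -> datetime:
        # Hard-codes the base class, so subclass overrides are ignored.
        return datetime.strptime(value, BaseReader.DATE_TIME_FORMAT)

    def parse_instance(self, value: str) -> datetime:
        # Looks the attribute up on the instance, so overrides are respected.
        return datetime.strptime(value, self.DATE_TIME_FORMAT)


class CustomReader(BaseReader):
    # Hypothetical override with a different layout.
    DATE_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


reader = CustomReader()
print(reader.parse_instance("2025-01-01 00:00:00"))

try:
    reader.parse_static("2025-01-01 00:00:00")
    static_error = None
except ValueError as exc:
    static_error = exc
print(type(static_error).__name__)
```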

Review & Testing Checklist for Human

  • Confirm the ab_datetime_parse fallback scope is acceptable. ab_datetime_parse is very flexible — it handles Unix timestamps, various ISO8601 variants, and anything dateutil.parser.parse() can handle. This is broader than what the spec's pattern_descriptor advertises. Verify that silently accepting unexpected formats won't mask config errors.
  • Verify naive datetime consistency across all three parsing branches. The two strptime branches parse Z as a literal character (not as a UTC designator), producing naive datetimes. The third branch converts to UTC, then strips tzinfo. This is consistent only as long as RemoteFile.last_modified is always naive and effectively in UTC; confirm this assumption holds across connectors.
  • Check for other rigid start_date parsing sites. The cursors (DefaultFileBasedCursor, FileBasedConcurrentCursor) also use DATE_TIME_FORMAT with strptime — but those parse state values the CDK itself serialized, so they should always include microseconds. Confirm this assumption holds.
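
The naive-datetime point in the checklist can be checked with a short sketch using only the standard library. It shows that strptime treats the trailing Z as plain text, and that ordering a naive value against a timezone-aware one fails, which is why every branch has to return a naive datetime.

```python
from datetime import datetime, timezone

# "Z" in the format string is matched literally; no timezone is attached.
parsed = datetime.strptime("2025-01-01T00:00:00Z", "%Y-%m-%dT%H:%M:%SZ")
print(parsed.tzinfo)  # None

# Ordering comparisons between naive and aware datetimes are not allowed.
aware = datetime(2025, 1, 1, tzinfo=timezone.utc)
try:
    _ = parsed <= aware
    comparison_error = None
except TypeError as exc:
    comparison_error = exc
print(type(comparison_error).__name__)
```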

Suggested test plan:

  1. Configure a file-based connector (e.g. source-s3 or source-gcs) with various start_date formats:
    • "2025-01-01T00:00:00Z" (without microseconds)
    • "2025-01-01T00:00:00.000000Z" (with microseconds)
    • "2025-01-01" (date-only)
    • "2025-01-01T00:00:00+05:30" (with non-UTC offset)
  2. Verify syncs complete without error and files are filtered correctly based on last_modified >= start_date.
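
The filtering check in step 2 can also be exercised locally before running a full sync. This is a rough sketch under assumptions: FileInfo is a hypothetical stand-in for RemoteFile, filter_by_start_date is not a CDK function, and last_modified is assumed to be naive and in UTC.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class FileInfo:
    """Hypothetical stand-in for RemoteFile (naive last_modified, assumed UTC)."""

    uri: str
    last_modified: datetime


def filter_by_start_date(files: List[FileInfo], start_date: datetime) -> List[FileInfo]:
    # Keep files modified at or after start_date (last_modified >= start_date).
    return [f for f in files if f.last_modified >= start_date]


files = [
    FileInfo("a.csv", datetime(2024, 12, 31, 23, 59, 59)),
    FileInfo("b.csv", datetime(2025, 1, 1, 0, 0, 0)),
    FileInfo("c.csv", datetime(2025, 1, 2, 12, 0, 0)),
]

# start_date without microseconds, parsed with the shorter format.
start_date = datetime.strptime("2025-01-01T00:00:00Z", "%Y-%m-%dT%H:%M:%SZ")
kept = [f.uri for f in filter_by_start_date(files, start_date)]
print(kept)  # ['b.csv', 'c.csv']
```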

Notes

Requested by: Daryna Ishchenko (@darynaishchenko)

Summary by CodeRabbit

  • Bug Fixes

    • Improved start_date parsing to accept multiple ISO‑8601 variants (with/without microseconds, timezone offsets) and to normalize to UTC when needed for consistent filtering.
  • Tests

    • Added comprehensive tests covering microsecond/no-microsecond formats, end-of-day and date-only inputs, timezone-offset conversion, and invalid inputs to ensure robust parsing and error handling.

…d connectors

Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1773159423-fix-start-date-format-parsing#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1773159423-fix-start-date-format-parsing

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

devin-ai-integration bot and others added 3 commits March 10, 2026 16:25
…TIME_FORMAT

Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
…Spec pattern_descriptor

Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
…ate, not cursor values

Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
@darynaishchenko Daryna Ishchenko (darynaishchenko) marked this pull request as ready for review March 10, 2026 16:39

coderabbitai bot commented Mar 10, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 09bdff2 and 9bcab01.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/test_file_based_stream_reader.py
Walkthrough

Adds a private _parse_start_date helper to parse start_date strings (microsecond and non-microsecond ISO variants plus a final parse fallback) and replaces direct datetime.strptime usage; corresponding unit tests for multiple date formats and invalid input were added.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Date Parsing Enhancement (airbyte_cdk/sources/file_based/file_based_stream_reader.py) | Adds private _parse_start_date(self, start_date_str: str) -> datetime that tries primary DATE_TIME_FORMAT (with microseconds), %Y-%m-%dT%H:%M:%SZ fallback, then ab_datetime_parse with UTC normalization; replaces direct datetime.strptime calls and uses parsed value for config.start_date. |
| Date Parsing Tests (unit_tests/sources/file_based/test_file_based_stream_reader.py) | Adds parametrized test_parse_start_date (variants: with/without microseconds, end-of-day, date-only fallback, timezone offset), test_parse_start_date_invalid_raises, and a globs test all_csvs_start_date_without_microseconds; tests duplicate coverage in a second location within the file. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Would you like me to suggest a brief docstring and a couple of edge-case tests for leap seconds or uncommon timezone formats, wdyt?

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding support for start_date formats without microseconds in file-based connectors, which is the core fix implemented in the PR. |

coderabbitai[bot]

This comment was marked as resolved.

github-actions bot commented Mar 10, 2026

PyTest Results (Fast)

3 914 tests (+9): 3 902 ✅ passed (+9), 12 💤 skipped (±0), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), duration 6m 42s ⏱️ (+25s)

Results for commit 9bcab01. ± Comparison against base commit 3e65ad5.

♻️ This comment has been updated with latest results.

github-actions bot commented Mar 10, 2026

PyTest Results (Full)

3 917 tests (+9): 3 905 ✅ passed (+9), 12 💤 skipped (±0), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), duration 11m 20s ⏱️ (+10s)

Results for commit 9bcab01. ± Comparison against base commit 3e65ad5.

♻️ This comment has been updated with latest results.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration bot and others added 3 commits March 11, 2026 13:11
…llback

Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
…ipping tzinfo

Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
coderabbitai[bot]

This comment was marked as resolved.

…tart_date

Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
@darynaishchenko Daryna Ishchenko (darynaishchenko) merged commit 6876663 into main Mar 11, 2026
28 of 29 checks passed
@darynaishchenko Daryna Ishchenko (darynaishchenko) deleted the devin/1773159423-fix-start-date-format-parsing branch March 11, 2026 14:30

2 participants