Skip to content

fix: use flexible datetime parsing in filter_files_by_globs_and_start_date#921

Open
Aaron ("AJ") Steers (aaronsteers) wants to merge 3 commits intomainfrom
devin/1772050585-fix-datetime-parsing
Open

fix: use flexible datetime parsing in filter_files_by_globs_and_start_date#921
Aaron ("AJ") Steers (aaronsteers) wants to merge 3 commits intomainfrom
devin/1772050585-fix-datetime-parsing

Conversation

@aaronsteers
Copy link
Contributor

@aaronsteers Aaron ("AJ") Steers (aaronsteers) commented Feb 25, 2026

Summary

Fixes #920. Replaces strict datetime.strptime (which required microseconds via %Y-%m-%dT%H:%M:%S.%fZ) with the flexible ab_datetime_parse helper in filter_files_by_globs_and_start_date. This allows valid ISO8601 dates like 2025-01-01T00:00:00Z (without microseconds) to be accepted at runtime, consistent with the spec validation fix in CDK v7.7.1 (PR #887).

Because ab_datetime_parse returns timezone-aware datetimes while file.last_modified may be naive, the comparison now normalizes naive last_modified values to UTC before comparing. This matches the old implicit behavior (where strptime treated Z as a literal, producing naive datetimes that were effectively UTC).

Review & Testing Checklist for Human

  • Timezone normalization correctness: Verify that assuming naive file.last_modified is UTC is safe across all file-based connectors (S3, GCS, Azure Blob, SharePoint, OneDrive, SFTP). If any source produces non-UTC naive datetimes, the comparison could be off.
  • Permissiveness of ab_datetime_parse: The new parsing accepts date-only strings (e.g. "2023-06-01") and various offset formats. Confirm this is acceptable for start_date, or whether upstream config validation already constrains this sufficiently.
  • Other DATE_TIME_FORMAT usages unchanged: Cursor classes (DefaultFileBasedCursor, FileBasedConcurrentCursor) also use datetime.strptime with the same strict format for reading cursor state. Those were not changed — verify they don't face the same issue (they likely don't since cursor state is always serialized by the CDK itself with microseconds).
  • Recommended test plan: Configure a file-based connector (e.g. source-s3) with start_date = "2025-01-01T00:00:00Z" (no microseconds) and run a sync to confirm it no longer raises ValueError.

Notes


Open with Devin

Summary by CodeRabbit

  • Bug Fixes

    • Improved start_date parsing and timezone handling for file filtering, normalizing timestamps to ensure consistent comparisons.
  • Tests

    • Added comprehensive tests covering multiple date formats and timezones (ISO timestamps, offsets, and date-only) to validate start_date-based filtering.

…_date

Replace strict datetime.strptime with ab_datetime_parse in
filter_files_by_globs_and_start_date to accept valid ISO8601 dates
without microseconds (e.g. 2025-01-01T00:00:00Z).

Also handle timezone-naive file.last_modified by assuming UTC for
comparison, preventing TypeError on mixed naive/aware comparisons.

Closes #920

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Copy link
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1772050585-fix-datetime-parsing#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1772050585-fix-datetime-parsing

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@aaronsteers Aaron ("AJ") Steers (aaronsteers) marked this pull request as ready for review February 25, 2026 20:20
Copilot AI review requested due to automatic review settings February 25, 2026 20:20
github-code-quality[bot]

This comment was marked as resolved.

Co-Authored-By: AJ Steers <aj@airbyte.io>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f52c10 and 432976e.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/test_file_based_stream_reader.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/file_based/test_file_based_stream_reader.py

📝 Walkthrough

Walkthrough

Replaces strict datetime parsing with ab_datetime_parse for start_date and normalizes file.last_modified to UTC when timezone is missing, ensuring timezone-consistent comparisons in the file-based stream reader.

Changes

Cohort / File(s) Summary
File-based stream reader
airbyte_cdk/sources/file_based/file_based_stream_reader.py
Use ab_datetime_parse instead of datetime.strptime for start_date; add timezone-aware normalization of file.last_modified (set to UTC when tzinfo is missing); add imports for ab_datetime_parse and timezone utilities.
Unit tests
unit_tests/sources/file_based/test_file_based_stream_reader.py
Add parameterized test cases to cover start_date formats: without microseconds, with +00:00 offset, with non-zero offset, and date-only format; validate matching and prefix expectations for glob filtering.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Would you like me to suggest a follow-up test that explicitly asserts naive vs. aware datetime comparisons to prevent regressions, wdyt?

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: replacing strict datetime parsing with flexible datetime parsing in the filter_files_by_globs_and_start_date function.
Linked Issues check ✅ Passed The PR fully addresses issue #920 by replacing datetime.strptime with ab_datetime_parse to accept ISO8601 dates without microseconds and normalizing naive file.last_modified to UTC.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the datetime parsing issue in filter_files_by_globs_and_start_date and adding comprehensive test coverage for the fix.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/1772050585-fix-datetime-parsing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the file-based stream reader’s runtime start_date filtering to accept flexible ISO8601/RFC3339 inputs (fixing runtime failures when microseconds are omitted), aligning behavior with the existing spec validation change.

Changes:

  • Replace strict datetime.strptime(...%fZ) parsing with ab_datetime_parse in filter_files_by_globs_and_start_date.
  • Normalize naive file.last_modified values to UTC before comparing against the (timezone-aware) parsed start_date.
  • Add unit test cases covering start_date values without microseconds, with explicit UTC offsets, and date-only strings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
airbyte_cdk/sources/file_based/file_based_stream_reader.py Switches to flexible datetime parsing and fixes naive/aware datetime comparison by normalizing naive last_modified to UTC.
unit_tests/sources/file_based/test_file_based_stream_reader.py Adds regression tests for flexible start_date formats to ensure runtime filtering no longer rejects valid inputs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

import time
from abc import ABC, abstractmethod
from datetime import datetime
from datetime import timezone
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime is no longer used in this module after switching to ab_datetime_parse; consider removing it from the from datetime import ... import to avoid dead code / lint noise.

Suggested change
from datetime import timezone

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in the second commit (1f52c10) — datetime was removed, only timezone is imported now.


Devin session

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (1)

425-431: Could we add one non-zero timezone-offset case too, wdyt?

Line 427 uses +00:00, which is equivalent to Z. Would you consider adding one case with a non-zero offset (e.g. +05:00 or -04:00) to prove offset normalization behavior end-to-end?

➕ Optional test case
         pytest.param(
             ["**/*.csv"],
             {"start_date": "2023-06-01T00:00:00+00:00", "streams": []},
             {"a.csv", "a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"},
             set(),
             id="start_date_with_utc_offset",
         ),
+        pytest.param(
+            ["**/*.csv"],
+            {"start_date": "2023-06-05T08:54:07+05:00", "streams": []},
+            {"a.csv", "a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"},
+            set(),
+            id="start_date_with_non_zero_offset",
+        ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@unit_tests/sources/file_based/test_file_based_stream_reader.py` around lines
425 - 431, Add a new pytest.param alongside the existing one that uses
{"start_date": "2023-06-01T00:00:00+00:00", "streams": []} (id
"start_date_with_utc_offset") but with a non-zero timezone offset (e.g.
"2023-06-01T00:00:00+05:00" or "-04:00") to validate offset normalization; keep
the same pattern argument (["**/*.csv"]), same expected matched set ({"a.csv",
"a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"}) and same empty
excluded set, and give it a distinct id like "start_date_with_nonzero_offset" so
the test exercises end-to-end timezone-offset handling in the reader.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@airbyte_cdk/sources/file_based/file_based_stream_reader.py`:
- Around line 117-121: RemoteFile.last_modified is sometimes produced as a
local-naive datetime (e.g., via datetime.now()) but the reader assumes naive
means UTC; fix by making producers emit timezone-aware UTC datetimes (replace
datetime.now() with datetime.now(timezone.utc) in producers such as
unstructured_parser.py), and add a defensive validator/normalizer on the
RemoteFile model (e.g., RemoteFile.__post_init__ or factory) that asserts or
converts last_modified to an aware UTC datetime (use .astimezone(timezone.utc)
if tzinfo present or attach timezone.utc if you choose to treat naive as UTC);
also add a unit test that constructs RemoteFile with naive and aware datetimes
to ensure the validator enforces/normalizes to UTC so
file_based_stream_reader.py no longer needs to assume naive==UTC.

---

Nitpick comments:
In `@unit_tests/sources/file_based/test_file_based_stream_reader.py`:
- Around line 425-431: Add a new pytest.param alongside the existing one that
uses {"start_date": "2023-06-01T00:00:00+00:00", "streams": []} (id
"start_date_with_utc_offset") but with a non-zero timezone offset (e.g.
"2023-06-01T00:00:00+05:00" or "-04:00") to validate offset normalization; keep
the same pattern argument (["**/*.csv"]), same expected matched set ({"a.csv",
"a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"}) and same empty
excluded set, and give it a distinct id like "start_date_with_nonzero_offset" so
the test exercises end-to-end timezone-offset handling in the reader.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f41401 and 1f52c10.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py
  • unit_tests/sources/file_based/test_file_based_stream_reader.py

Comment on lines +117 to +121
last_modified = (
file.last_modified
if file.last_modified.tzinfo is not None
else file.last_modified.replace(tzinfo=timezone.utc)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Inspect RemoteFile definition (last_modified type/contract):"
fd 'remote_file.py$' --exec sed -n '1,260p' {}

echo
echo "2) Find call sites constructing RemoteFile / UploadableRemoteFile with last_modified:"
rg -nP --type=py -C3 'RemoteFile\(|UploadableRemoteFile\('

echo
echo "3) Find last_modified assignments and datetime parsing sources:"
rg -nP --type=py -C3 'last_modified\s*=|strptime\(|fromtimestamp\(|parser\.parse\(|ab_datetime_parse\('

Repository: airbytehq/airbyte-python-cdk

Length of output: 50385


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Production code (non-test) that constructs RemoteFile with last_modified ==="
rg -t py 'RemoteFile\(' --glob '!unit_tests' --glob '!test_' -B2 -A2 'last_modified' | head -100

echo
echo "=== Datetime parsing in production (non-test) sources ==="
rg -t py '(strptime|fromtimestamp|parser\.parse|datetime\.now)' --glob '!unit_tests' --glob '!test_' -B2 -A2 | head -150

echo
echo "=== Check for validators on RemoteFile or last_modified ==="
rg -t py 'validator|@root_validator' airbyte_cdk/sources/file_based/

echo
echo "=== Search for actual S3/GCS/HTTPS remote file implementations ==="
fd -t f -e py --glob '*remote*file*' --glob '!test' | xargs rg -l 'RemoteFile\|UploadableRemoteFile' 2>/dev/null | head -20

Repository: airbytehq/airbyte-python-cdk

Length of output: 223


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Production code RemoteFile/UploadableRemoteFile implementations ==="
find airbyte_cdk/sources/file_based -name '*.py' -type f ! -path '*/test*' -exec grep -l 'class.*RemoteFile' {} \;

echo
echo "=== Search for datetime creation in these implementation files ==="
find airbyte_cdk/sources/file_based -name '*.py' -type f ! -path '*/test*' -exec grep -H 'last_modified.*=' {} \; | head -50

echo
echo "=== Check for datetime/timezone imports in file_based_stream_reader.py ==="
sed -n '1,50p' airbyte_cdk/sources/file_based/file_based_stream_reader.py

echo
echo "=== Look at code around line 117-121 to see imports and context ==="
sed -n '1,15p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
rg -t py 'from datetime|import.*timezone' airbyte_cdk/sources/file_based/file_based_stream_reader.py

echo
echo "=== Search for test that explicitly tests UTC-naive assumption ==="
rg -t py 'UTC|utc.*naive|naive.*utc|tzinfo' --glob '*test*' -B2 -A2 | head -80

Repository: airbytehq/airbyte-python-cdk

Length of output: 10704


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Tests in file_based_stream_reader test file ==="
rg -t py 'def test.*last_modified|def test.*utc|def test.*naive' unit_tests/sources/file_based/test_file_based_stream_reader.py -A 5

echo
echo "=== Look at the actual code lines 117-121 context ==="
sed -n '110,130p' airbyte_cdk/sources/file_based/file_based_stream_reader.py

echo
echo "=== Check if RemoteFile has any validators ==="
cat airbyte_cdk/sources/file_based/remote_file.py

Repository: airbytehq/airbyte-python-cdk

Length of output: 123


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Code context around lines 117-121 ==="
sed -n '110,130p' airbyte_cdk/sources/file_based/file_based_stream_reader.py

echo
echo "=== Full RemoteFile definition ==="
cat airbyte_cdk/sources/file_based/remote_file.py

Repository: airbytehq/airbyte-python-cdk

Length of output: 2412


Verify that all RemoteFile.last_modified producers generate UTC-naive datetimes.

The concern is valid: the code assumes naive datetimes are UTC (line 120: .replace(tzinfo=timezone.utc)), but evidence shows datetime.now() is used in production code (e.g., unstructured_parser.py), which generates local-naive datetimes, not UTC-naive. If the system runs in a non-UTC timezone, this would cause incorrect time-based filtering.

Could we ensure all RemoteFile producers either (a) use timezone-aware UTC datetimes, or (b) consistently use a UTC-aware datetime parser? Adding a guard test or validator on RemoteFile.last_modified to document/enforce this contract would also help prevent regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@airbyte_cdk/sources/file_based/file_based_stream_reader.py` around lines 117
- 121, RemoteFile.last_modified is sometimes produced as a local-naive datetime
(e.g., via datetime.now()) but the reader assumes naive means UTC; fix by making
producers emit timezone-aware UTC datetimes (replace datetime.now() with
datetime.now(timezone.utc) in producers such as unstructured_parser.py), and add
a defensive validator/normalizer on the RemoteFile model (e.g.,
RemoteFile.__post_init__ or factory) that asserts or converts last_modified to
an aware UTC datetime (use .astimezone(timezone.utc) if tzinfo present or attach
timezone.utc if you choose to treat naive as UTC); also add a unit test that
constructs RemoteFile with naive and aware datetimes to ensure the validator
enforces/normalizes to UTC so file_based_stream_reader.py no longer needs to
assume naive==UTC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the non-zero timezone offset test case in 432976e. Good suggestion — this exercises the full offset normalization path.


Devin session

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation about datetime.now() in producers like unstructured_parser.py — that's a pre-existing issue where naive-local datetimes could be produced in non-UTC environments. However, fixing all RemoteFile producers and adding a model-level validator is out of scope for this PR, which is focused on the start_date parsing bug (#920).

The defensive normalization here (assuming naive = UTC) preserves the existing behavior from before this change — the old strptime also produced naive datetimes that were implicitly treated as UTC. This PR doesn't make that situation worse; it just makes the assumption explicit.

A follow-up to enforce UTC-aware datetimes at the RemoteFile model level would be a good improvement but should be a separate effort.


Devin session

@github-actions
Copy link

PyTest Results (Fast)

422 tests   - 3 447   412 ✅  - 3 445   1m 39s ⏱️ - 5m 12s
  1 suites ±    0     9 💤  -     3 
  1 files   ±    0     1 ❌ +    1 

For more details on these failures, see this check.

Results for commit 432976e. ± Comparison against base commit 7f41401.

This pull request removes 3447 tests.
unit_tests.resources.source_pokeapi_w_components_py.integration_tests.test_airbyte_standards.TestSuiteSourcePokeAPI ‑ test_basic_read['valid_config' Test Scenario]
unit_tests.resources.source_pokeapi_w_components_py.integration_tests.test_airbyte_standards.TestSuiteSourcePokeAPI ‑ test_check['valid_config' Test Scenario]
unit_tests.resources.source_pokeapi_w_components_py.integration_tests.test_airbyte_standards.TestSuiteSourcePokeAPI ‑ test_discover['valid_config' Test Scenario]
unit_tests.resources.source_pokeapi_w_components_py.integration_tests.test_airbyte_standards.TestSuiteSourcePokeAPI ‑ test_docker_image_build_and_check['valid_config' Test Scenario]
unit_tests.resources.source_pokeapi_w_components_py.integration_tests.test_airbyte_standards.TestSuiteSourcePokeAPI ‑ test_docker_image_build_and_read['valid_config' Test Scenario]
unit_tests.resources.source_pokeapi_w_components_py.integration_tests.test_airbyte_standards.TestSuiteSourcePokeAPI ‑ test_fail_read_with_bad_catalog['valid_config' Test Scenario]
unit_tests.resources.source_pokeapi_w_components_py.integration_tests.test_airbyte_standards.TestSuiteSourcePokeAPI ‑ test_spec
unit_tests.source_declarative_manifest.test_source_declarative_local_manifest ‑ test_given_invalid_config_then_unsuccessful_check
unit_tests.source_declarative_manifest.test_source_declarative_local_manifest ‑ test_given_valid_config_with_successful_check
unit_tests.source_declarative_manifest.test_source_declarative_local_manifest ‑ test_invalid_yaml_throws
…

@github-actions
Copy link

PyTest Results (Full)

3 877 tests   3 865 ✅  11m 16s ⏱️
    1 suites     12 💤
    1 files        0 ❌

Results for commit 432976e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: file_based_stream_reader.filter_files_by_globs_and_start_date uses strict datetime parsing, rejecting valid ISO8601 dates

2 participants