fix: use flexible datetime parsing in filter_files_by_globs_and_start_date#921
fix: use flexible datetime parsing in filter_files_by_globs_and_start_date#921Aaron ("AJ") Steers (aaronsteers) wants to merge 3 commits intomainfrom
Conversation
…_date Replace strict datetime.strptime with ab_datetime_parse in filter_files_by_globs_and_start_date to accept valid ISO8601 dates without microseconds (e.g. 2025-01-01T00:00:00Z). Also handle timezone-naive file.last_modified by assuming UTC for comparison, preventing TypeError on mixed naive/aware comparisons. Closes #920 Co-Authored-By: AJ Steers <aj@airbyte.io>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1772050585-fix-datetime-parsing#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1772050585-fix-datetime-parsingPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
Co-Authored-By: AJ Steers <aj@airbyte.io>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review infoConfiguration used: Repository UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughReplaces strict datetime parsing with Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Would you like me to suggest a follow-up test that explicitly asserts naive vs. aware datetime comparisons to prevent regressions, wdyt? 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Updates the file-based stream reader’s runtime start_date filtering to accept flexible ISO8601/RFC3339 inputs (fixing runtime failures when microseconds are omitted), aligning behavior with the existing spec validation change.
Changes:
- Replace strict
datetime.strptime(...%fZ)parsing withab_datetime_parseinfilter_files_by_globs_and_start_date. - Normalize naive
file.last_modifiedvalues to UTC before comparing against the (timezone-aware) parsedstart_date. - Add unit test cases covering
start_datevalues without microseconds, with explicit UTC offsets, and date-only strings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
airbyte_cdk/sources/file_based/file_based_stream_reader.py |
Switches to flexible datetime parsing and fixes naive/aware datetime comparison by normalizing naive last_modified to UTC. |
unit_tests/sources/file_based/test_file_based_stream_reader.py |
Adds regression tests for flexible start_date formats to ensure runtime filtering no longer rejects valid inputs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import time | ||
| from abc import ABC, abstractmethod | ||
| from datetime import datetime | ||
| from datetime import timezone |
There was a problem hiding this comment.
datetime is no longer used in this module after switching to ab_datetime_parse; consider removing it from the from datetime import ... import to avoid dead code / lint noise.
| from datetime import timezone |
There was a problem hiding this comment.
Already addressed in the second commit (1f52c10) — datetime was removed, only timezone is imported now.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (1)
425-431: Could we add one non-zero timezone-offset case too, wdyt?Line 427 uses
+00:00, which is equivalent toZ. Would you consider adding one case with a non-zero offset (e.g.+05:00or-04:00) to prove offset normalization behavior end-to-end?➕ Optional test case
pytest.param( ["**/*.csv"], {"start_date": "2023-06-01T00:00:00+00:00", "streams": []}, {"a.csv", "a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"}, set(), id="start_date_with_utc_offset", ), + pytest.param( + ["**/*.csv"], + {"start_date": "2023-06-05T08:54:07+05:00", "streams": []}, + {"a.csv", "a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"}, + set(), + id="start_date_with_non_zero_offset", + ),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@unit_tests/sources/file_based/test_file_based_stream_reader.py` around lines 425 - 431, Add a new pytest.param alongside the existing one that uses {"start_date": "2023-06-01T00:00:00+00:00", "streams": []} (id "start_date_with_utc_offset") but with a non-zero timezone offset (e.g. "2023-06-01T00:00:00+05:00" or "-04:00") to validate offset normalization; keep the same pattern argument (["**/*.csv"]), same expected matched set ({"a.csv", "a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"}) and same empty excluded set, and give it a distinct id like "start_date_with_nonzero_offset" so the test exercises end-to-end timezone-offset handling in the reader.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@airbyte_cdk/sources/file_based/file_based_stream_reader.py`:
- Around line 117-121: RemoteFile.last_modified is sometimes produced as a
local-naive datetime (e.g., via datetime.now()) but the reader assumes naive
means UTC; fix by making producers emit timezone-aware UTC datetimes (replace
datetime.now() with datetime.now(timezone.utc) in producers such as
unstructured_parser.py), and add a defensive validator/normalizer on the
RemoteFile model (e.g., RemoteFile.__post_init__ or factory) that asserts or
converts last_modified to an aware UTC datetime (use .astimezone(timezone.utc)
if tzinfo present or attach timezone.utc if you choose to treat naive as UTC);
also add a unit test that constructs RemoteFile with naive and aware datetimes
to ensure the validator enforces/normalizes to UTC so
file_based_stream_reader.py no longer needs to assume naive==UTC.
---
Nitpick comments:
In `@unit_tests/sources/file_based/test_file_based_stream_reader.py`:
- Around line 425-431: Add a new pytest.param alongside the existing one that
uses {"start_date": "2023-06-01T00:00:00+00:00", "streams": []} (id
"start_date_with_utc_offset") but with a non-zero timezone offset (e.g.
"2023-06-01T00:00:00+05:00" or "-04:00") to validate offset normalization; keep
the same pattern argument (["**/*.csv"]), same expected matched set ({"a.csv",
"a/b.csv", "a/c.csv", "a/b/c.csv", "a/c/c.csv", "a/b/c/d.csv"}) and same empty
excluded set, and give it a distinct id like "start_date_with_nonzero_offset" so
the test exercises end-to-end timezone-offset handling in the reader.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/file_based_stream_reader.pyunit_tests/sources/file_based/test_file_based_stream_reader.py
| last_modified = ( | ||
| file.last_modified | ||
| if file.last_modified.tzinfo is not None | ||
| else file.last_modified.replace(tzinfo=timezone.utc) | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "1) Inspect RemoteFile definition (last_modified type/contract):"
fd 'remote_file.py$' --exec sed -n '1,260p' {}
echo
echo "2) Find call sites constructing RemoteFile / UploadableRemoteFile with last_modified:"
rg -nP --type=py -C3 'RemoteFile\(|UploadableRemoteFile\('
echo
echo "3) Find last_modified assignments and datetime parsing sources:"
rg -nP --type=py -C3 'last_modified\s*=|strptime\(|fromtimestamp\(|parser\.parse\(|ab_datetime_parse\('Repository: airbytehq/airbyte-python-cdk
Length of output: 50385
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Production code (non-test) that constructs RemoteFile with last_modified ==="
rg -t py 'RemoteFile\(' --glob '!unit_tests' --glob '!test_' -B2 -A2 'last_modified' | head -100
echo
echo "=== Datetime parsing in production (non-test) sources ==="
rg -t py '(strptime|fromtimestamp|parser\.parse|datetime\.now)' --glob '!unit_tests' --glob '!test_' -B2 -A2 | head -150
echo
echo "=== Check for validators on RemoteFile or last_modified ==="
rg -t py 'validator|@root_validator' airbyte_cdk/sources/file_based/
echo
echo "=== Search for actual S3/GCS/HTTPS remote file implementations ==="
fd -t f -e py --glob '*remote*file*' --glob '!test' | xargs rg -l 'RemoteFile\|UploadableRemoteFile' 2>/dev/null | head -20Repository: airbytehq/airbyte-python-cdk
Length of output: 223
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Production code RemoteFile/UploadableRemoteFile implementations ==="
find airbyte_cdk/sources/file_based -name '*.py' -type f ! -path '*/test*' -exec grep -l 'class.*RemoteFile' {} \;
echo
echo "=== Search for datetime creation in these implementation files ==="
find airbyte_cdk/sources/file_based -name '*.py' -type f ! -path '*/test*' -exec grep -H 'last_modified.*=' {} \; | head -50
echo
echo "=== Check for datetime/timezone imports in file_based_stream_reader.py ==="
sed -n '1,50p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Look at code around line 117-121 to see imports and context ==="
sed -n '1,15p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
rg -t py 'from datetime|import.*timezone' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Search for test that explicitly tests UTC-naive assumption ==="
rg -t py 'UTC|utc.*naive|naive.*utc|tzinfo' --glob '*test*' -B2 -A2 | head -80Repository: airbytehq/airbyte-python-cdk
Length of output: 10704
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Tests in file_based_stream_reader test file ==="
rg -t py 'def test.*last_modified|def test.*utc|def test.*naive' unit_tests/sources/file_based/test_file_based_stream_reader.py -A 5
echo
echo "=== Look at the actual code lines 117-121 context ==="
sed -n '110,130p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Check if RemoteFile has any validators ==="
cat airbyte_cdk/sources/file_based/remote_file.pyRepository: airbytehq/airbyte-python-cdk
Length of output: 123
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Code context around lines 117-121 ==="
sed -n '110,130p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Full RemoteFile definition ==="
cat airbyte_cdk/sources/file_based/remote_file.pyRepository: airbytehq/airbyte-python-cdk
Length of output: 2412
Verify that all RemoteFile.last_modified producers generate UTC-naive datetimes.
The concern is valid: the code assumes naive datetimes are UTC (line 120: .replace(tzinfo=timezone.utc)), but evidence shows datetime.now() is used in production code (e.g., unstructured_parser.py), which generates local-naive datetimes, not UTC-naive. If the system runs in a non-UTC timezone, this would cause incorrect time-based filtering.
Could we ensure all RemoteFile producers either (a) use timezone-aware UTC datetimes, or (b) consistently use a UTC-aware datetime parser? Adding a guard test or validator on RemoteFile.last_modified to document/enforce this contract would also help prevent regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@airbyte_cdk/sources/file_based/file_based_stream_reader.py` around lines 117
- 121, RemoteFile.last_modified is sometimes produced as a local-naive datetime
(e.g., via datetime.now()) but the reader assumes naive means UTC; fix by making
producers emit timezone-aware UTC datetimes (replace datetime.now() with
datetime.now(timezone.utc) in producers such as unstructured_parser.py), and add
a defensive validator/normalizer on the RemoteFile model (e.g.,
RemoteFile.__post_init__ or factory) that asserts or converts last_modified to
an aware UTC datetime (use .astimezone(timezone.utc) if tzinfo present or attach
timezone.utc if you choose to treat naive as UTC); also add a unit test that
constructs RemoteFile with naive and aware datetimes to ensure the validator
enforces/normalizes to UTC so file_based_stream_reader.py no longer needs to
assume naive==UTC.
There was a problem hiding this comment.
Added the non-zero timezone offset test case in 432976e. Good suggestion — this exercises the full offset normalization path.
There was a problem hiding this comment.
Valid observation about datetime.now() in producers like unstructured_parser.py — that's a pre-existing issue where naive-local datetimes could be produced in non-UTC environments. However, fixing all RemoteFile producers and adding a model-level validator is out of scope for this PR, which is focused on the start_date parsing bug (#920).
The defensive normalization here (assuming naive = UTC) preserves the existing behavior from before this change — the old strptime also produced naive datetimes that were implicitly treated as UTC. This PR doesn't make that situation worse; it just makes the assumption explicit.
A follow-up to enforce UTC-aware datetimes at the RemoteFile model level would be a good improvement but should be a separate effort.
Co-Authored-By: AJ Steers <aj@airbyte.io>
PyTest Results (Fast)422 tests - 3 447 412 ✅ - 3 445 1m 39s ⏱️ - 5m 12s For more details on these failures, see this check. Results for commit 432976e. ± Comparison against base commit 7f41401. This pull request removes 3447 tests. |
PyTest Results (Full)3 877 tests 3 865 ✅ 11m 16s ⏱️ Results for commit 432976e. |
Summary
Fixes #920. Replaces strict
datetime.strptime(which required microseconds via%Y-%m-%dT%H:%M:%S.%fZ) with the flexibleab_datetime_parsehelper infilter_files_by_globs_and_start_date. This allows valid ISO8601 dates like2025-01-01T00:00:00Z(without microseconds) to be accepted at runtime, consistent with the spec validation fix in CDK v7.7.1 (PR #887).Because
ab_datetime_parsereturns timezone-aware datetimes whilefile.last_modifiedmay be naive, the comparison now normalizes naivelast_modifiedvalues to UTC before comparing. This matches the old implicit behavior (wherestrptimetreatedZas a literal, producing naive datetimes that were effectively UTC).Review & Testing Checklist for Human
file.last_modifiedis UTC is safe across all file-based connectors (S3, GCS, Azure Blob, SharePoint, OneDrive, SFTP). If any source produces non-UTC naive datetimes, the comparison could be off.ab_datetime_parse: The new parsing accepts date-only strings (e.g."2023-06-01") and various offset formats. Confirm this is acceptable forstart_date, or whether upstream config validation already constrains this sufficiently.DATE_TIME_FORMATusages unchanged: Cursor classes (DefaultFileBasedCursor,FileBasedConcurrentCursor) also usedatetime.strptimewith the same strict format for reading cursor state. Those were not changed — verify they don't face the same issue (they likely don't since cursor state is always serialized by the CDK itself with microseconds).start_date = "2025-01-01T00:00:00Z"(no microseconds) and run a sync to confirm it no longer raisesValueError.Notes
Summary by CodeRabbit
Bug Fixes
Tests