-
Notifications
You must be signed in to change notification settings - Fork 40
fix: use flexible datetime parsing in filter_files_by_globs_and_start_date #921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
c7dee62
1f52c10
432976e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,7 +5,7 @@ | |
| import logging | ||
| import time | ||
| from abc import ABC, abstractmethod | ||
| from datetime import datetime | ||
| from datetime import timezone | ||
| from enum import Enum | ||
| from io import IOBase | ||
| from os import makedirs, path | ||
|
|
@@ -24,6 +24,7 @@ | |
| from airbyte_cdk.sources.file_based.exceptions import FileSizeLimitError | ||
| from airbyte_cdk.sources.file_based.file_record_data import FileRecordData | ||
| from airbyte_cdk.sources.file_based.remote_file import RemoteFile, UploadableRemoteFile | ||
| from airbyte_cdk.utils.datetime_helpers import ab_datetime_parse | ||
|
|
||
|
|
||
| class FileReadMode(Enum): | ||
|
|
@@ -105,15 +106,20 @@ def filter_files_by_globs_and_start_date( | |
| Utility method for filtering files based on globs. | ||
| """ | ||
| start_date = ( | ||
| datetime.strptime(self.config.start_date, self.DATE_TIME_FORMAT) | ||
| ab_datetime_parse(self.config.start_date) | ||
| if self.config and self.config.start_date | ||
| else None | ||
| ) | ||
| seen = set() | ||
|
|
||
| for file in files: | ||
| if self.file_matches_globs(file, globs): | ||
| if file.uri not in seen and (not start_date or file.last_modified >= start_date): | ||
| last_modified = ( | ||
| file.last_modified | ||
| if file.last_modified.tzinfo is not None | ||
| else file.last_modified.replace(tzinfo=timezone.utc) | ||
| ) | ||
|
Comment on lines
+117
to
+121
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "1) Inspect RemoteFile definition (last_modified type/contract):"
fd 'remote_file.py$' --exec sed -n '1,260p' {}
echo
echo "2) Find call sites constructing RemoteFile / UploadableRemoteFile with last_modified:"
rg -nP --type=py -C3 'RemoteFile\(|UploadableRemoteFile\('
echo
echo "3) Find last_modified assignments and datetime parsing sources:"
rg -nP --type=py -C3 'last_modified\s*=|strptime\(|fromtimestamp\(|parser\.parse\(|ab_datetime_parse\('Repository: airbytehq/airbyte-python-cdk Length of output: 50385 🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "=== Production code (non-test) that constructs RemoteFile with last_modified ==="
rg -t py 'RemoteFile\(' --glob '!unit_tests' --glob '!test_' -B2 -A2 'last_modified' | head -100
echo
echo "=== Datetime parsing in production (non-test) sources ==="
rg -t py '(strptime|fromtimestamp|parser\.parse|datetime\.now)' --glob '!unit_tests' --glob '!test_' -B2 -A2 | head -150
echo
echo "=== Check for validators on RemoteFile or last_modified ==="
rg -t py 'validator|@root_validator' airbyte_cdk/sources/file_based/
echo
echo "=== Search for actual S3/GCS/HTTPS remote file implementations ==="
fd -t f -e py --glob '*remote*file*' --glob '!test' | xargs rg -l 'RemoteFile\|UploadableRemoteFile' 2>/dev/null | head -20Repository: airbytehq/airbyte-python-cdk Length of output: 223 🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "=== Production code RemoteFile/UploadableRemoteFile implementations ==="
find airbyte_cdk/sources/file_based -name '*.py' -type f ! -path '*/test*' -exec grep -l 'class.*RemoteFile' {} \;
echo
echo "=== Search for datetime creation in these implementation files ==="
find airbyte_cdk/sources/file_based -name '*.py' -type f ! -path '*/test*' -exec grep -H 'last_modified.*=' {} \; | head -50
echo
echo "=== Check for datetime/timezone imports in file_based_stream_reader.py ==="
sed -n '1,50p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Look at code around line 117-121 to see imports and context ==="
sed -n '1,15p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
rg -t py 'from datetime|import.*timezone' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Search for test that explicitly tests UTC-naive assumption ==="
rg -t py 'UTC|utc.*naive|naive.*utc|tzinfo' --glob '*test*' -B2 -A2 | head -80Repository: airbytehq/airbyte-python-cdk Length of output: 10704 🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "=== Tests in file_based_stream_reader test file ==="
rg -t py 'def test.*last_modified|def test.*utc|def test.*naive' unit_tests/sources/file_based/test_file_based_stream_reader.py -A 5
echo
echo "=== Look at the actual code lines 117-121 context ==="
sed -n '110,130p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Check if RemoteFile has any validators ==="
cat airbyte_cdk/sources/file_based/remote_file.pyRepository: airbytehq/airbyte-python-cdk Length of output: 123 🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "=== Code context around lines 117-121 ==="
sed -n '110,130p' airbyte_cdk/sources/file_based/file_based_stream_reader.py
echo
echo "=== Full RemoteFile definition ==="
cat airbyte_cdk/sources/file_based/remote_file.pyRepository: airbytehq/airbyte-python-cdk Length of output: 2412 Verify that all The concern is valid: the code assumes naive datetimes are UTC (line 120: Could we ensure all 🤖 Prompt for AI Agents
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added the non-zero timezone offset test case in 432976e. Good suggestion — this exercises the full offset normalization path.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Valid observation about The defensive normalization here (assuming naive = UTC) preserves the existing behavior from before this change — the old A follow-up to enforce UTC-aware datetimes at the |
||
| if file.uri not in seen and (not start_date or last_modified >= start_date): | ||
| seen.add(file.uri) | ||
| yield file | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datetimeis no longer used in this module after switching toab_datetime_parse; consider removing it from thefrom datetime import ...import to avoid dead code / lint noise.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already addressed in the second commit (1f52c10) —
datetimewas removed, onlytimezoneis imported now.Devin session