"Add S3 data loader support to DBTableManager and data formulator" #159
Conversation
Commits:
- …abase support) data formulator v0.2, with database support to handle large-sized data
- fix the openai endpoint issue with custom api_base and api_keys
- External data loader class
@microsoft-github-policy-service agree
This looks fantastic! I don't have an S3 environment to test it. I briefly reviewed the code and it looks good. Would you be able to test it a bit? I will see if I can create a mock instance to do some testing. Thanks!
Pull Request Overview
Adds support for loading data from S3 in both the React UI and the Python data formulator.
- UI: add an "s3" tab for external data sources in DBTableManager.
- Python: implement S3DataLoader with listing, sampling, row-count estimation, and ingestion logic.
- Registration: import and register S3DataLoader in the loader registry.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/views/DBTableManager.tsx | Added "s3" to external data loader tabs |
| py-src/data_formulator/data_loader/s3_data_loader.py | New S3DataLoader implementation for listing and ingesting S3 data |
| py-src/data_formulator/data_loader/__init__.py | Imported and registered S3DataLoader in DATA_LOADERS |
Comments suppressed due to low confidence (2)
py-src/data_formulator/data_loader/s3_data_loader.py:9
- Add a class-level docstring for S3DataLoader to explain its role, expected parameters, and any side effects (e.g., DuckDB extension installation).
class S3DataLoader(ExternalDataLoader):
src/views/DBTableManager.tsx:656
- Ensure there’s a corresponding TabPanel for the new "s3" tab; otherwise the panels will be misaligned after introducing this extra tab.
{["file upload", "mysql", "kusto", "s3"].map((dataLoaderType, i) => (
```python
    })
except Exception as e:
    # Skip files that can't be read
    print(f"Error reading {s3_url}: {e}")
```
Copilot AI · May 29, 2025
[nitpick] Replace print statements with a proper logging framework to avoid uncontrolled console output in production libraries.
Suggested change:
```diff
- print(f"Error reading {s3_url}: {e}")
+ logging.error(f"Error reading {s3_url}: {e}")
```
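The suggestion above can be sketched as a small helper. This is a minimal illustration, not the loader's actual code: `safe_read` and its `reader` callable are hypothetical stand-ins for the DuckDB read logic in the real `S3DataLoader`.

```python
import logging

logger = logging.getLogger("s3_data_loader")


def safe_read(s3_url, reader):
    """Attempt to read an S3 object, logging (not printing) failures.

    `reader` is a hypothetical callable standing in for the real
    DuckDB-based read logic.
    """
    try:
        return reader(s3_url)
    except Exception as e:
        # Library code should log rather than print, so applications
        # can control verbosity via the logging configuration.
        logger.error("Error reading %s: %s", s3_url, e)
        return None
```

With a module-level logger, applications embedding the loader can silence or redirect these messages through standard `logging` configuration instead of capturing stdout.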
```python
"""Estimate the number of rows in a file."""
try:
    # For parquet files, we can get the exact count
    if s3_url.lower().endswith('.parquet'):
```
Copilot AI · May 29, 2025
Using a full COUNT(*) scan on large Parquet files may be slow; consider reading row-count metadata from the file footer instead of scanning all rows.
```python
# Use boto3 to list objects in the bucket
import boto3

s3_client = boto3.client(
```
Copilot AI · May 29, 2025
[nitpick] The boto3 client is created in multiple methods; consider extracting common initialization into a private helper to reduce duplication and simplify updates.
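One way to realize this refactor is a cached private helper; the parameter names below (`aws_access_key_id`, etc.) are assumptions about the loader's config dict, and the injectable `client_factory` is added here purely so the sketch can be exercised without real AWS credentials.

```python
class S3ClientMixin:
    """Sketch of a shared-client helper for S3DataLoader.

    Attribute and parameter names are hypothetical; the real loader's
    params dict may differ.
    """

    def __init__(self, params, client_factory=None):
        self.params = params
        self._client = None
        self._client_factory = client_factory or self._default_factory

    def _default_factory(self):
        import boto3  # imported lazily so tests can inject a stub instead

        return boto3.client(
            "s3",
            aws_access_key_id=self.params.get("aws_access_key_id"),
            aws_secret_access_key=self.params.get("aws_secret_access_key"),
            region_name=self.params.get("region_name"),
        )

    def _get_client(self):
        # Every method calls _get_client() instead of constructing its
        # own boto3 client; the client is built once and cached.
        if self._client is None:
            self._client = self._client_factory()
        return self._client
```

Each method (listing, sampling, ingestion) would then call `self._get_client()`, so credential or region changes only need to be made in one place.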
This generally looks good. I'll merge to the dev branch and test it (along with building an Azure blob data reader based on this) before merging to main.
I did test this with S3. Adding a video of the test. Uploading good_one.mov…
Added S3 data loaders