
Conversation

@sumanth-dhanya
Contributor

Added S3 data loaders

@sumanth-dhanya
Contributor Author

@microsoft-github-policy-service agree

@Chenglong-MS
Collaborator

This looks fantastic! I don't have an S3 environment to test it.

I briefly reviewed the code and it looks good. Would you be able to test it a bit? I will see if I can create a mock instance to do some testing. Thanks!

@Chenglong-MS Chenglong-MS requested a review from Copilot May 29, 2025 22:35
Contributor

Copilot AI left a comment


Pull Request Overview

Adds support for loading data from S3 in both the React UI and the Python data formulator.

  • UI: Include an “s3” tab for external data sources in DBTableManager.
  • Python: Implement S3DataLoader with listing, sampling, row-count estimation, and ingestion logic.
  • Registration: Import and register S3DataLoader in the loader registry.
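As a rough sketch of the registration step, the loader package's `__init__.py` presumably does something like the following (only the `DATA_LOADERS` and `S3DataLoader` names come from the PR; the `register_loader` helper and the `"s3"` key are illustrative assumptions):

```python
# Hypothetical registry sketch; register_loader is an assumed helper,
# not necessarily the PR's actual mechanism.
DATA_LOADERS = {}

def register_loader(name, loader_cls):
    """Map a short data-source name (e.g. "s3") to its loader class."""
    DATA_LOADERS[name] = loader_cls

class S3DataLoader:  # stand-in for the real implementation
    pass

register_loader("s3", S3DataLoader)
```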

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File | Description
src/views/DBTableManager.tsx | Added "s3" to external data loader tabs
py-src/data_formulator/data_loader/s3_data_loader.py | New S3DataLoader implementation for listing and ingesting S3 data
py-src/data_formulator/data_loader/__init__.py | Imported and registered S3DataLoader in DATA_LOADERS
Comments suppressed due to low confidence (2)

py-src/data_formulator/data_loader/s3_data_loader.py:9

  • Add a class-level docstring for S3DataLoader to explain its role, expected parameters, and any side effects (e.g., DuckDB extension installation).
class S3DataLoader(ExternalDataLoader):
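A docstring along these lines would address the comment; the wording, listed parameters, and side effects below are illustrative guesses based on the PR description, not the actual source:

```python
class ExternalDataLoader:  # minimal stand-in for the real base class
    pass

class S3DataLoader(ExternalDataLoader):
    """Load tabular data from Amazon S3 into the local DuckDB store.

    Expected parameters (illustrative): AWS credentials, region, and a
    bucket/prefix to scan. Supports listing objects, sampling rows, and
    estimating row counts before ingestion.

    Side effects: installs and loads DuckDB's httpfs extension so that
    s3:// URLs can be queried directly.
    """
```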

src/views/DBTableManager.tsx:656

  • Ensure there’s a corresponding TabPanel for the new "s3" tab; otherwise the panels will be misaligned after introducing this extra tab.
{["file upload", "mysql", "kusto","s3"].map((dataLoaderType, i) => (

        })
    except Exception as e:
        # Skip files that can't be read
        print(f"Error reading {s3_url}: {e}")

Copilot AI May 29, 2025


[nitpick] Replace print statements with a proper logging framework to avoid uncontrolled console output in production libraries.

Suggested change
print(f"Error reading {s3_url}: {e}")
logging.error(f"Error reading {s3_url}: {e}")
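A minimal sketch of that pattern with a module-level logger (the `try_read` function and the simulated failure are illustrative, not the PR's actual method):

```python
import logging

# Module-level logger named after the module, the usual library convention
logger = logging.getLogger(__name__)

def try_read(s3_url: str):
    """Attempt to read an object; log the error and skip it on failure."""
    try:
        raise OSError("simulated unreadable object")  # placeholder failure
    except Exception as e:
        # %-style lazy formatting defers string building until a handler
        # actually emits the record
        logger.error("Error reading %s: %s", s3_url, e)
        return None
```

With this, applications embedding the library decide via their own logging configuration whether and where these errors appear, instead of getting unconditional console output.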

Copilot uses AI. Check for mistakes.
"""Estimate the number of rows in a file."""
try:
# For parquet files, we can get the exact count
if s3_url.lower().endswith('.parquet'):

Copilot AI May 29, 2025


Using a full COUNT(*) scan on large Parquet files may be slow; consider reading row-count metadata from the file footer instead of scanning all rows.

# Use boto3 to list objects in the bucket
import boto3

s3_client = boto3.client(

Copilot AI May 29, 2025


[nitpick] The boto3 client is created in multiple methods; consider extracting common initialization into a private helper to reduce duplication and simplify updates.
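A possible shape for that helper; the attribute and parameter names are assumptions, and only `boto3.client("s3", ...)` itself is the real API:

```python
class S3DataLoader:
    """Sketch showing a cached boto3 client behind one private helper."""

    def __init__(self, params: dict):
        self._params = params
        self._client = None  # created lazily on first use

    def _get_client(self):
        """Build the S3 client once; every method reuses this instance."""
        if self._client is None:
            import boto3  # imported lazily, as in the original snippet

            self._client = boto3.client(
                "s3",
                aws_access_key_id=self._params.get("aws_access_key_id"),
                aws_secret_access_key=self._params.get("aws_secret_access_key"),
                region_name=self._params.get("region_name"),
            )
        return self._client
```

Each method then calls `self._get_client()` instead of constructing its own client, so credential handling lives in exactly one place.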

@Chenglong-MS Chenglong-MS changed the base branch from main to dev May 29, 2025 22:51
@Chenglong-MS
Collaborator

This generally looks good. I'll merge it to the dev branch and test it (along with building an Azure blob data reader based on this) before merging to main.

@Chenglong-MS Chenglong-MS merged commit e8799a6 into microsoft:dev May 29, 2025
3 checks passed
@sumanth-dhanya
Contributor Author

sumanth-dhanya commented May 30, 2025

I did do a test on this with S3. Adding a video of the test.

Uploading good_one.mov…
