feat(etl): Script to export github data into bigquery #2
base: main
Pull request overview
This PR adds a comprehensive GitHub ETL pipeline that extracts pull request data from GitHub repositories and loads it into Google BigQuery. The implementation uses a streaming/chunked architecture to process data in batches of 100 PRs, includes rate limit handling, and provides local testing capabilities through mock services.
- Implements chunked extraction, transformation, and loading of GitHub PR data to BigQuery
- Adds mock GitHub API server and BigQuery emulator for local testing without rate limits
- Includes comprehensive documentation with setup instructions and schema definitions
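The chunked architecture described above could be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: the names `chunked` and `run_pipeline`, and the `transform` and `load` callables, are hypothetical, and only the batch size of 100 comes from the overview.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int = 100) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_pipeline(
    pull_requests: Iterable[Any],
    transform: Callable[[Any], Any],   # hypothetical: PR record -> table row
    load: Callable[[List[Any]], None],  # hypothetical: writes one batch of rows
    size: int = 100,
) -> int:
    """Transform and load PRs one chunk at a time; return the row count."""
    total = 0
    for chunk in chunked(pull_requests, size):
        rows = [transform(pr) for pr in chunk]
        load(rows)
        total += len(rows)
    return total
```

Because `chunked` consumes its input lazily, only one batch needs to be held in memory at a time, which is presumably the motivation for processing in groups of 100.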
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| requirements.txt | Adds BigQuery client library and testing dependencies (pytest, pytest-mock, pytest-cov) |
| mock_github_api.py | Creates Flask-based mock GitHub API server that generates 250 sample PRs with commits, comments, and reviewers for testing |
| main.py | Implements the core ETL pipeline with chunked processing: extraction from GitHub API with pagination/rate limiting, data transformation to BigQuery format, and insertion using BigQuery client |
| docker-compose.yml | Orchestrates three services: mock GitHub API, BigQuery emulator, and the ETL service for local development |
| data.yml | Defines BigQuery table schemas for pull_requests, commits, reviewers, and comments tables used by the emulator |
| README.md | Provides comprehensive documentation including setup, architecture, authentication methods, and usage examples |
| Dockerfile.mock | Creates container image for the mock GitHub API service using Python 3.11 and Flask |
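The rate limit handling mentioned for `main.py` might look something like the sketch below. It assumes the pipeline reads GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers; the helper name `seconds_until_reset` is hypothetical and the PR's actual approach may differ.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(
    headers: Mapping[str, str], now: Optional[float] = None
) -> float:
    """Return how long to wait before the next request.

    Returns 0.0 while requests remain in the current window. Otherwise
    returns the time left until the reset timestamp (epoch seconds),
    never a negative value.
    """
    # Assumption: a missing header means the request was not rate limited.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

A caller would pass the response headers and sleep for the returned number of seconds before retrying. Note that the lookup here is case-sensitive, so a plain `dict` must use the exact header spelling shown.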
Comments suppressed due to low confidence (1)
main.py:18
- Import of 'pprint' is not used: `import pprint`
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 15 comments.
Co-authored-by: dklawren <826315+dklawren@users.noreply.github.com>
Refactor pagination to preserve request parameters
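One way pagination can preserve the caller's request parameters is sketched below. This is an illustration under assumptions, not the commit's actual change: `paginate` and `fetch_page` are hypothetical names, and the sketch relies on GitHub's `page` and `per_page` query parameters rather than on parsing the `Link` header.

```python
from typing import Any, Callable, Dict, Iterator, List


def paginate(
    fetch_page: Callable[[Dict[str, Any]], List[Any]],  # hypothetical fetcher
    params: Dict[str, Any],
    per_page: int = 100,
) -> Iterator[Any]:
    """Yield items from every page, re-sending `params` on each request."""
    page = 1
    while True:
        # Copy the caller's parameters so filters such as `state` are kept
        # on every page, and so the caller's dict is never mutated.
        page_params = {**params, "per_page": per_page, "page": page}
        items = fetch_page(page_params)
        yield from items
        # A short (or empty) page means there is nothing left to fetch.
        if len(items) < per_page:
            return
        page += 1
```

Injecting `fetch_page` keeps the loop independent of any HTTP client, so it can be exercised against a mock server or a plain function in tests.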
No description provided.