Machine Learning Operations Playbook Adoption Workshop – Phase 2: Data Services Integration Architecture (Hands-On)
- Review knowledge of Amazon S3 integration with SageMaker workflows
- Review knowledge of Amazon Redshift integration with ML pipelines
- Review how EC2/SageMaker models use S3 for artifact persistence and CSV/flat file input/output
- Review how Redshift pipelines use COPY/UNLOAD and SQL transformations for ML data preparation
- Prepare for migration by mapping AWS data service patterns to Google Cloud equivalents (Vertex AI, GCS, BigQuery)
- Practice searching for and interpreting `# TODO: Lab 6.X.X.X.X` markers in code files (`model.py`, `ingest_model.py`)
- AWS Management Console access with SageMaker and Redshift permissions
- IAM role with `AmazonSageMakerFullAccess`, `AmazonS3FullAccess`, and Redshift query permissions
- Python 3.9+ environment with `boto3`, `joblib`, `pandas`, and `psycopg2` or `sqlalchemy` installed
- Access to the GitHub Training Repo containing `model.py` and `ingest_model.py` files with embedded TODO markers
- Pre-created S3 bucket for input/output artifacts and Redshift cluster with sample schema
Difficulty: Introductory Review
Tools Required: GitHub Training Repo (model.py)
- Review how ML models running on EC2 or SageMaker use S3 for artifacts
- Identify boto3 usage patterns for uploading/downloading models and data
- Understand local staging → S3 upload → SageMaker registry integration
- Explore logging and metrics evolution from print/MLflow → CloudWatch/SageMaker metrics
- AWS account with SageMaker Studio or Notebook access
- IAM role with `AmazonSageMakerFullAccess` and `AmazonS3FullAccess`
- Python 3.9+ environment with `boto3` and `joblib` installed
- Access to `model.py` in the GitHub Training Repo
Amazon S3 is the primary durable artifact store for ML workloads in AWS.
- Trainers on EC2 or SageMaker stage files locally, then upload to S3 for persistence.
- Models are serialized with `joblib` or `pickle` and stored in S3 buckets.
- S3 also serves as input/output for CSVs and flat files.
- Logging evolves from local print statements to CloudWatch metrics and SageMaker structured metrics.
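The stage-locally-then-upload pattern described above can be sketched as follows. This is a minimal illustration, not the repo's `model.py` code: the function and bucket/key names are hypothetical, `pickle` stands in for the `joblib` serialization the lab uses, and the S3 client is injectable so the flow can be exercised without AWS credentials.

```python
import os
import pickle
import tempfile


def persist_model(model, bucket, key, s3_client=None):
    """Stage a serialized model locally, then upload the file to S3.

    model.py uses joblib; pickle is shown here as the stdlib equivalent.
    s3_client is injectable so the flow can be tested without AWS access.
    """
    if s3_client is None:
        import boto3  # deferred import: only needed for a real upload
        s3_client = boto3.client("s3")
    with tempfile.TemporaryDirectory() as tmp:
        local_path = os.path.join(tmp, "model.pkl")
        with open(local_path, "wb") as f:
            pickle.dump(model, f)                       # local staging
        s3_client.upload_file(local_path, bucket, key)  # durable copy in S3
    return f"s3://{bucket}/{key}"
```

The returned `s3://` URI is what downstream steps (e.g., SageMaker model registry entries) reference instead of a local path.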
Each task maps directly to commented sections in `model.py`. Use VSCode/PyCharm search for `# TODO: Lab 6.1.X.X`.
- Search: `# TODO: Lab 6.1.1 - Line-by-Line Import Exploration`
  - Where: Top of `model.py` imports
  - What: Inspect `joblib`, `boto3` references
  - Why: Understand which libraries enable local vs S3 persistence
- Search: `# TODO: Lab 6.1.2 - Execution Model Translation`
  - Where: `train()` function signature and persistence branches
  - What: Compare `dat.get_handle()` vs manual S3 upload
  - Why: Distinguish local dev vs orchestrated SageMaker workflows
- Search: `# TODO: Lab 6.1.3 - Parameter and Artifact Evolution`
  - Where: `_select_features()` and artifact naming logic
  - What: Observe feature selection rules and artifact naming conventions
  - Why: Understand reproducibility and artifact compatibility
- Search: `# TODO: Lab 6.1.4 - S3 Data Loading Conversion`
  - Where: `_s3_persist()` function
  - What: Inspect `boto3` `upload_file` usage
  - Why: Learn durable storage and IAM/cost implications
- Search: `# TODO: Lab 6.1.5 - Artifact Persistence and Model Registry Integration`
  - Where: Branch where `model_path` starts with `s3://`
  - What: Inspect local temp file → S3 upload pattern
  - Why: Understand registry integration and artifact governance
- Search: `# TODO: Lab 6.1.6 - Logging and Metrics Evolution`
  - Where: End of `train()` function
  - What: Inspect `LOG.info` statements for elapsed time, feature count, artifact path
  - Why: Understand structured logging for reproducibility and monitoring
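The kind of structured logging Task 6.1.6 points at can be sketched like this. The wrapper function and exact message formats are illustrative, not copied from `model.py`; the point is that elapsed time, feature count, and artifact path are emitted as parseable key-value log lines that CloudWatch or SageMaker metrics can pick up.

```python
import logging
import time

LOG = logging.getLogger("model")


def train_and_log(train_fn, feature_names):
    """Run a training callable and emit structured end-of-train() log lines.

    train_fn is any callable that trains a model and returns its artifact
    path (e.g. an s3:// URI); both names here are illustrative.
    """
    start = time.time()
    artifact_path = train_fn(feature_names)
    elapsed = time.time() - start
    LOG.info("elapsed_seconds=%.2f", elapsed)           # training duration
    LOG.info("feature_count=%d", len(feature_names))    # reproducibility signal
    LOG.info("artifact_path=%s", artifact_path)         # where the model landed
    return artifact_path
```

Key-value pairs like `elapsed_seconds=12.40` are easy to turn into CloudWatch metric filters, which is the evolution away from bare `print` statements the lab describes.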
Difficulty: Introductory Review
Tools Required: GitHub Training Repo (ingest_model.py)
- Review Redshift ingestion/extraction patterns (COPY, UNLOAD, Data API)
- Identify ETL task mapping (CTAS, materialized views, pre-aggregation)
- Understand orchestration and dependency mapping (Step Functions, Airflow, SageMaker Pipelines)
- Explore data movement and performance considerations (distribution/sort keys, compression, cluster sizing)
- Review model training integration with Redshift outputs staged to S3 or streamed via Data API
- Discuss monitoring and cost trade-offs
- AWS account with Redshift cluster access
- IAM role with `AmazonRedshiftFullAccess` and `AmazonS3FullAccess`
- Python 3.9+ environment with `psycopg2` or `sqlalchemy` installed
- Access to `ingest_model.py` in the GitHub Training Repo
Amazon Redshift serves as a data warehouse for ML pipelines.
- Data access patterns include COPY from S3, UNLOAD to S3, and direct queries via JDBC/Data API.
- ETL tasks transform data using SQL (CTAS, materialized views).
- Trainers consume Redshift outputs either staged in S3 or streamed directly.
- Performance depends on distribution/sort keys, compression, and cluster sizing.
- Monitoring includes query cost, cluster utilization, and egress charges.
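The COPY and UNLOAD patterns above can be sketched as simple SQL builders. These are minimal sketches, not the repo's `ingest_model.py` code: table names, S3 paths, and IAM role ARNs are placeholders you would substitute with your own, and the statements would be executed over a `psycopg2` or Data API connection.

```python
def copy_from_s3(table, s3_uri, iam_role):
    """Build a Redshift COPY statement that ingests CSV files from S3."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1;"
    )


def unload_to_s3(query, s3_prefix, iam_role):
    """Build an UNLOAD statement that stages query results to S3 as Parquet."""
    return (
        f"UNLOAD ('{query}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )
```

UNLOAD writes results in parallel from the cluster nodes, which is why it usually beats pulling rows through a client and re-uploading them (the trade-off Task 6.2.4 examines).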
Each task maps directly to commented sections in `ingest_model.py`. Use VSCode/PyCharm search for `# TODO: Lab 6.2.X.X`.
- Search: `# TODO: Lab 6.2.1 - Data Access Pattern Conversion`
  - Where: `_read_from_redshift()` function
  - What: Inspect `select_sql_from_dict` or `pd.read_sql` usage
  - Why: Understand Redshift → DataFrame conversion patterns
- Search: `# TODO: Lab 6.2.2 - ETL Task Mapping`
  - Where: `prepare_redshift_training()` function
  - What: Inspect `ingest_obj` methods and SQL transformations
  - Why: Map SQL queries to ETL tasks
- Search: `# TODO: Lab 6.2.3 - Orchestration and Dependency Mapping`
  - Where: `orchestration_hint_tasks()` function
  - What: Inspect documented task list (extract, transform, stage, train, register)
  - Why: Understand DAG dependencies and orchestration
- Search: `# TODO: Lab 6.2.4 - Data Movement and Performance Considerations`
  - Where: `stage_table_to_s3()` and `quick_sample_pipeline()` functions
  - What: Inspect UNLOAD vs client-side upload patterns
  - Why: Learn efficiency vs cost trade-offs
- Search: `# TODO: Lab 6.2.5 - Model Training Integration`
  - Where: Return of `prepare_redshift_training()`
  - What: Inspect `(df, s3_uri)` outputs
  - Why: Understand training consumption patterns (in-memory vs staged artifacts)
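The shape of that `(df, s3_uri)` contract can be sketched as follows. This is illustrative only: `df` stands in for a pandas DataFrame, `stage_fn` stands in for an UNLOAD or S3 upload step, and none of the names are taken from `ingest_model.py`.

```python
def prepare_training_inputs(df, bucket, key, stage_fn):
    """Return both consumption handles for a prepared training set.

    Small datasets can be consumed in memory via df; SageMaker training
    jobs instead read the durable staged copy at the returned S3 URI.
    """
    s3_uri = f"s3://{bucket}/{key}"
    stage_fn(df, s3_uri)   # stage a durable copy (UNLOAD or client upload)
    return df, s3_uri      # in-memory handle plus staged-artifact URI
```

Returning both lets the caller choose in-memory experimentation or staged-artifact training without re-running the Redshift extraction.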
- Search: `# TODO: Lab 6.2.6 - Monitoring and Cost Trade-offs`
  - Where: `teaching_checklist()` function
  - What: Inspect listed monitoring signals and cost drivers
  - Why: Understand observability and economics of Redshift pipelines
- Understand Cloud Storage integration with Vertex AI workflows
- Learn how BigQuery ML can train models directly using SQL
- GCP project with Vertex AI, BigQuery, and Cloud Storage enabled
- Service account with `Storage Admin`, `BigQuery Admin`, and `Vertex AI Admin` roles
- Python 3.9+ environment with `google-cloud-aiplatform`, `google-cloud-bigquery`, `google-cloud-storage` installed
- Access to BigQuery public datasets (e.g., Penguins dataset)
- GitHub Training Repo containing starter notebook code
Difficulty: Intermediate
Tools Required: Google Cloud Console, Cloud Storage, Vertex AI SDK for Python
- Install and configure Vertex AI and BigQuery SDKs
- Authenticate and set up project information
- Create and configure a Cloud Storage bucket for ML artifacts
- Assign service accounts for pipeline execution
- Initialize Vertex AI SDK with Cloud Storage staging bucket
- Compare Cloud Storage integration to Amazon S3 workflows from Lab 6.1
- GCP project with Vertex AI and Cloud Storage enabled
- Service account with `Storage Admin` and `Vertex AI Admin` roles
- Python environment with required SDKs installed
Cloud Storage is the artifact backbone for Vertex AI workflows.
- Buckets store datasets, intermediate files, and exported models
- Vertex AI pipelines reference GCS URIs (`gs://bucket-name/path`) for inputs and outputs
- IAM roles control access to buckets, ensuring reproducibility and governance
- Cloud Storage replaces S3 in AWS workflows, but the integration pattern is nearly identical
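To see how close the integration pattern is, here is the Cloud Storage analog of Lab 6.1's `upload_file` flow. This is a sketch: the function and bucket/blob names are examples, and the client is injectable so the wiring can be checked without GCP credentials.

```python
def upload_artifact(local_path, bucket_name, blob_name, storage_client=None):
    """Upload a local file to Cloud Storage and return its gs:// URI.

    Mirrors the boto3 upload_file pattern from Lab 6.1, using the
    google-cloud-storage client API instead.
    """
    if storage_client is None:
        from google.cloud import storage  # deferred: needs GCP credentials
        storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"
```

Side by side with the S3 version, only the client API and the URI scheme (`gs://` vs `s3://`) change; the stage-then-upload shape is identical.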
- Task 6.4.1 — Install SDKs
  - Command: `! pip3 install --upgrade --quiet pyarrow google-cloud-aiplatform google-cloud-bigquery google-cloud-bigquery-storage db-dtypes`
  - WHY: Enable the Python environment to interact with Vertex AI and BigQuery
- Task 6.4.2 — Authenticate Environment
  - Code: `auth.authenticate_user()` (Colab only)
  - WHY: Ensure secure access to GCP resources
- Task 6.4.3 — Set Project Information
  - Code: `PROJECT_ID = "..."`, `LOCATION = "us-east1"`
  - WHY: Required for all GCP API calls
- Task 6.4.4 — Create Cloud Storage Bucket
  - Code: `BUCKET_URI = f"gs://churn-user19-{PROJECT_ID}-unique"`
  - Command: `! gsutil mb -l {LOCATION} {BUCKET_URI}`
  - WHY: Buckets store datasets and exported models
- Task 6.4.5 — Configure Service Account
  - Code: `SERVICE_ACCOUNT = "vertex-pipeline-executor@..."`
  - WHY: Service accounts provide secure, auditable access
- Task 6.4.6 — Initialize SDKs
  - Code: `aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)`
  - Code: `bqclient = bigquery.Client(project=PROJECT_ID)`
  - WHY: Required for pipeline jobs and queries
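The setup steps above can be gathered into one helper. This is a sketch of the wiring only: the function name is hypothetical, and the SDK modules are injectable so the call pattern can be verified without GCP credentials (by default the real `google.cloud` SDKs are imported).

```python
def init_clients(project_id, location, staging_bucket,
                 aiplatform_mod=None, bigquery_mod=None):
    """Initialize the Vertex AI SDK and return a BigQuery client.

    Consolidates Tasks 6.4.3 through 6.4.6: project info, staging
    bucket, and SDK initialization.
    """
    if aiplatform_mod is None:
        from google.cloud import aiplatform as aiplatform_mod  # needs creds
    if bigquery_mod is None:
        from google.cloud import bigquery as bigquery_mod
    aiplatform_mod.init(project=project_id, location=location,
                        staging_bucket=staging_bucket)
    return bigquery_mod.Client(project=project_id)
```

The `staging_bucket` is the direct counterpart of the S3 artifact bucket from Lab 6.1: Vertex AI pipeline jobs stage their inputs and outputs there.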
Difficulty: Intermediate to Advanced
Tools Required: Google Cloud Console, BigQuery, Vertex AI SDK for Python
- Train ML models directly in BigQuery using SQL
- Evaluate BigQuery ML models with SQL functions
- GCP project with BigQuery and Vertex AI enabled
- Service account with `BigQuery Admin` and `Vertex AI Admin` roles
- Python environment with required SDKs installed
- Access to BigQuery public datasets (e.g., Penguins dataset)
BigQuery ML allows you to train ML models directly in SQL.
- Models are stored as BigQuery resources
- Evaluation uses SQL functions like `ML.EVALUATE`
- Exported models can be uploaded to Vertex AI for deployment
- Integration pattern: BigQuery ML → Cloud Storage → Vertex AI Model Registry → Endpoint deployment
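Submitting BigQuery ML statements from Python can be sketched as below. The SQL follows the lab's `penguins` example (the `WHERE species IS NOT NULL` filter is a common addition, since NULL labels fail training), and the client is injectable so the wiring can be checked without GCP credentials.

```python
TRAIN_SQL = """
CREATE OR REPLACE MODEL penguins
OPTIONS (model_type = 'DNN_CLASSIFIER', input_label_cols = ['species']) AS
SELECT * FROM `bigquery-public-data.ml_datasets.penguins`
WHERE species IS NOT NULL  -- drop rows with a NULL label
"""

EVAL_SQL = "SELECT * FROM ML.EVALUATE(MODEL penguins)"


def run_bqml(sql, client=None):
    """Submit a BigQuery ML statement and wait for the job to finish.

    The default client needs GCP credentials; pass one in for testing.
    """
    if client is None:
        from google.cloud import bigquery  # deferred import
        client = bigquery.Client()
    return client.query(sql).result()  # blocks until training/evaluation ends
```

Training and evaluation are both plain query jobs, which is what makes BigQuery ML easy to slot into an orchestrated pipeline ahead of the Vertex AI registry step.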
- Task 6.5.1 — Create BigQuery Dataset
  - SQL: `CREATE SCHEMA penguins`
  - WHY: Organize tables and models in BigQuery
- Task 6.5.2 — Train BigQuery ML Model
  - SQL: ``CREATE OR REPLACE MODEL penguins OPTIONS(model_type='DNN_CLASSIFIER', input_label_cols=['species']) AS SELECT * FROM `bigquery-public-data.ml_datasets.penguins` ``
  - WHY: Learn SQL-based ML training (note: BigQuery ML names the label column with `input_label_cols`, not `labels`)
- Task 6.5.3 — Evaluate Model
  - SQL: `SELECT * FROM ML.EVALUATE(MODEL penguins) ORDER BY roc_auc DESC LIMIT 1`
  - WHY: Validate model performance
- Lab 6.5 teaches BigQuery ML integration with Vertex AI, mirroring Redshift + AWS ML workflows
- Both labs use the scavenger hunt style with `# TODO` anchors, guiding learners to record WHERE, WHAT, and WHY for each integration step.