From 988bb34e6c57cb94c6699855fd35d3ae61ba2c33 Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Fri, 29 Aug 2025 06:58:12 +0200 Subject: [PATCH 01/14] hanzhizh-xzuo demo week4 --- .../demo/week4/hanzhizh-xzuo/README.md | 26 +++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 contributions/demo/week4/hanzhizh-xzuo/README.md diff --git a/contributions/demo/week4/hanzhizh-xzuo/README.md b/contributions/demo/week4/hanzhizh-xzuo/README.md new file mode 100644 index 0000000000..2535f0fd3b --- /dev/null +++ b/contributions/demo/week4/hanzhizh-xzuo/README.md @@ -0,0 +1,26 @@ +# Assignment Proposal + +## Title + +Blue-Green Deployment of AI Models Based on MLflow + +## Names and KTH ID + + - Hanzhi Zhang (hanzhizh@kth.se) + - Xu Zuo (xzuo@kth.se) + +## Deadline + +Week 4 + +## Category + +Demo + +## Description + +This project showcases Blue-Green deployment of a Iris classification AI model using MLflow, running two model versions (Blue and Green) in parallel. A FastAPI load balancer enables seamless switching and rollback, while a Streamlit dashboard provides a simple interface for monitoring, testing, and controlling traffic between the models. + +**Relevance** + +In this project, we demonstrate how the Blue-Green deployment strategy can be applied to AI model deployment using MLflow. By running two versions of the Iris classification model in parallel (Blue and Green), we can safely switch traffic between them, test new models in production, and roll back instantly if needed. This directly reflects DevOps principles such as continuous delivery, automation, reliability, and rapid feedback, showing how modern MLOps practices align with core DevOps methodologies. 
From 3484504227109add2f26f144b6f6b1c2b05fd192 Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Mon, 1 Sep 2025 16:35:23 +0200 Subject: [PATCH 02/14] hanzhizh-xzuo demo week4 modified --- contributions/demo/week4/hanzhizh-xzuo/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/contributions/demo/week4/hanzhizh-xzuo/README.md b/contributions/demo/week4/hanzhizh-xzuo/README.md index 2535f0fd3b..4e6a7b29cd 100644 --- a/contributions/demo/week4/hanzhizh-xzuo/README.md +++ b/contributions/demo/week4/hanzhizh-xzuo/README.md @@ -19,7 +19,9 @@ Demo ## Description -This project showcases Blue-Green deployment of a Iris classification AI model using MLflow, running two model versions (Blue and Green) in parallel. A FastAPI load balancer enables seamless switching and rollback, while a Streamlit dashboard provides a simple interface for monitoring, testing, and controlling traffic between the models. +This project highlights the use of MLflow in a Blue-Green deployment of an Iris classification AI model. Instead of focusing only on traffic switching, we demonstrate how MLflow supports the end-to-end lifecycle of the model: training and packaging with MLflow Models, versioning and promotion with the Model Registry, and experiment tracking with the Tracking Server. Rollback and health checks go beyond service availability by validating model functionality and performance through MLflow. In this way, the project showcases how MLflow enhances Blue-Green deployment by integrating model management, monitoring, and lifecycle control. + +There are several differences from the similar topic in previous years. Our demo mainly focuses on traffic switching in production and visual control via a load balancer, by running two different model versions on two separate deployments and switching traffic between them. 
The previous demo, however, switches model versions in the MLflow Registry, which means it runs different versions of the model within a single virtual machine. **Relevance** From 1613f0cf603d4541815871be0f08b960cf12debd Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Tue, 23 Sep 2025 11:50:14 +0200 Subject: [PATCH 03/14] hanzhizh-bingjiez tutorial proposal --- .../hanzhizh-bingjiez/README.md | 38 +++++++++++++++++++ 1 file changed, 38 insertions(+) create mode 100644 contributions/executable-tutorial/hanzhizh-bingjiez/README.md diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md new file mode 100644 index 0000000000..2b5db3267d --- /dev/null +++ b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md @@ -0,0 +1,38 @@ +# Assignment Proposal + +## Title + +Container Vulnerability Scanning and Remediation with Trivy and GitHub Actions + +## Names and KTH ID + +- Hanzhi Zhang (hanzhizh@kth.se) +- Bingjie Zhao (bingjiez@kth.se) + +## Deadline + +- Task 3 + +## Category + +- Executable tutorial + +## Description + +Ensuring container security is a critical part of DevSecOps practices. In this tutorial, we want to use [Trivy](https://aquasecurity.github.io/trivy) and GitHub Actions to demonstrate the following scenario: + +1. **Build a vulnerable Docker container** with outdated dependencies (e.g., `lodash@4.17.15`), scan it with Trivy, and observe vulnerabilities being reported. + +2. **Remediate the vulnerabilities** by updating dependencies (e.g., upgrading `lodash` and `cross-spawn`) and rebuilding the container. Then, scan again with Trivy to confirm that vulnerabilities are fixed or reduced. + +3. **Handle unfixable vulnerabilities** (e.g., `zlib1g` with `CVE-2023-45853`) by using Trivy flags such as `--ignore-unfixed` or `.trivyignore`. This highlights the reality that not all vulnerabilities have patches and shows how to configure scanning policies. + +4. 
**Integrate vulnerability scanning into CI/CD** using GitHub Actions. We configure a workflow that automatically builds images and runs Trivy scans, failing the pipeline if `HIGH` or `CRITICAL` vulnerabilities are found. + +With this tutorial, we want to highlight the **before (vulnerable)** vs **after (remediated)** condition, and demonstrate how vulnerability scanning becomes part of an automated DevSecOps workflow. We plan to deliver our tutorial on [KillerCoda](https://killercoda.com). + +Trivy is chosen in the tutorial as an open-source, lightweight, and widely used vulnerability scanner for containers, file systems, and Infrastructure-as-Code (IaC). + +**Relevance** + +Vulnerability management is central to DevSecOps because it ensures that insecure dependencies are detected early, developers can remediate issues quickly, and CI/CD pipelines enforce security gates. This aligns security scanning with continuous delivery workflows and raises awareness about both remediable and unfixable vulnerabilities in containerized applications. 
From 7422fa374f3c5e28a3c2b2f963d85547a67e07c7 Mon Sep 17 00:00:00 2001 From: dazhijiong <69571910+dazhijiong@users.noreply.github.com> Date: Tue, 23 Sep 2025 11:53:35 +0200 Subject: [PATCH 04/14] Delete contributions/demo/week4/hanzhizh-xzuo/README.md --- .../demo/week4/hanzhizh-xzuo/README.md | 28 ------------------- 1 file changed, 28 deletions(-) delete mode 100644 contributions/demo/week4/hanzhizh-xzuo/README.md diff --git a/contributions/demo/week4/hanzhizh-xzuo/README.md b/contributions/demo/week4/hanzhizh-xzuo/README.md deleted file mode 100644 index 4e6a7b29cd..0000000000 --- a/contributions/demo/week4/hanzhizh-xzuo/README.md +++ /dev/null @@ -1,28 +0,0 @@ -# Assignment Proposal - -## Title - -Blue-Green Deployment of AI Models Based on MLflow - -## Names and KTH ID - - - Hanzhi Zhang (hanzhizh@kth.se) - - Xu Zuo (xzuo@kth.se) - -## Deadline - -Week 4 - -## Category - -Demo - -## Description - -This project highlights the use of MLflow in a Blue-Green deployment of an Iris classification AI model. Instead of focusing only on traffic switching, we demonstrate how MLflow supports the end-to-end lifecycle of the model: training and packaging with MLflow Models, versioning and promotion with the Model Registry, and experiment tracking with the Tracking Server. Rollback and health checks go beyond service availability by validating model functionality and performance through MLflow. In this way, the project showcases how MLflow enhances Blue-Green deployment by integrating model management, monitoring, and lifecycle control. - -There are several differences from the similar topic in previous years. Our demo mainly focuses on traffic switching in production and visual control via a load balancer, by running two different model versions on two separate deployments and switching traffic between them. The previous demo, however, switches model versions in the MLflow Registry, which means it runs different versions of the model within a single virtual machine. 
- -**Relevance** - -In this project, we demonstrate how the Blue-Green deployment strategy can be applied to AI model deployment using MLflow. By running two versions of the Iris classification model in parallel (Blue and Green), we can safely switch traffic between them, test new models in production, and roll back instantly if needed. This directly reflects DevOps principles such as continuous delivery, automation, reliability, and rapid feedback, showing how modern MLOps practices align with core DevOps methodologies. From 7059b2c87021c634fc53b70aac71a0ee6efaca69 Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Tue, 23 Sep 2025 16:07:55 +0200 Subject: [PATCH 05/14] hanzhizh-bingjiez tutorial proposal --- .../hanzhizh-bingjiez/README.md | 28 ++++++++++--------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md index 2b5db3267d..d661fbfc77 100644 --- a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md +++ b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md @@ -2,7 +2,7 @@ ## Title -Container Vulnerability Scanning and Remediation with Trivy and GitHub Actions +Chaos Engineering with Pumba: Building Fault-Tolerant Containerized Applications ## Names and KTH ID @@ -11,28 +11,30 @@ Container Vulnerability Scanning and Remediation with Trivy and GitHub Actions ## Deadline -- Task 3 +- Task 3 ## Category -- Executable tutorial +- Executable tutorial ## Description -Ensuring container security is a critical part of DevSecOps practices. In this tutorial, we want to use [Trivy](https://aquasecurity.github.io/trivy) and GitHub Actions to demonstrate the following scenario: +This executable tutorial demonstrates how to apply **Chaos Engineering** in containerized environments to improve system **fault tolerance** and **resilience**. 
Using [Pumba](https://github.com/alexei-led/pumba), a chaos testing tool for Docker, learners will inject controlled failures (random restarts, network delays, resource stress) into a containerized web application, and then gradually enhance the system to withstand these disruptions. -1. **Build a vulnerable Docker container** with outdated dependencies (e.g., `lodash@4.17.15`), scan it with Trivy, and observe vulnerabilities being reported. - -2. **Remediate the vulnerabilities** by updating dependencies (e.g., upgrading `lodash` and `cross-spawn`) and rebuilding the container. Then, scan again with Trivy to confirm that vulnerabilities are fixed or reduced. +The tutorial will guide users through the following scenario: -3. **Handle unfixable vulnerabilities** (e.g., `zlib1g` with `CVE-2023-45853`) by using Trivy flags such as `--ignore-unfixed` or `.trivyignore`. This highlights the reality that not all vulnerabilities have patches and shows how to configure scanning policies. +1. **Baseline setup**: Run a single-container Flask web application. Inject chaos (kill/restart, latency) with Pumba and observe how the service becomes unavailable. -4. **Integrate vulnerability scanning into CI/CD** using GitHub Actions. We configure a workflow that automatically builds images and runs Trivy scans, failing the pipeline if `HIGH` or `CRITICAL` vulnerabilities are found. +2. **Redundancy with multiple replicas**: Deploy multiple web containers behind an NGINX load balancer. Repeat the chaos experiments and see how the system remains available despite single-container failures. -With this tutorial, we want to highlight the **before (vulnerable)** vs **after (remediated)** condition, and demonstrate how vulnerability scanning becomes part of an automated DevSecOps workflow. We plan to deliver our tutorial on [KillerCoda](https://killercoda.com). +3. 
**Self-healing with restart policies**: Enable Docker restart policies so containers automatically recover after being killed by Pumba. Demonstrate improved resilience. -Trivy is chosen in the tutorial as an open-source, lightweight, and widely used vulnerability scanner for containers, file systems, and Infrastructure-as-Code (IaC). +4. **Application-level resilience**: Modify the Flask app to handle timeouts and provide fallback responses when backend services are delayed by chaos injections. -**Relevance** +5. **Iterative improvement**: Compare “before vs after” conditions, highlighting how redundancy, self-healing, and graceful degradation improve system reliability. -Vulnerability management is central to DevSecOps because it ensures that insecure dependencies are detected early, developers can remediate issues quickly, and CI/CD pipelines enforce security gates. This aligns security scanning with continuous delivery workflows and raises awareness about both remediable and unfixable vulnerabilities in containerized applications. +This tutorial will be delivered through [KillerKoda](https://killercoda.com), using Docker Playground to safely run chaos experiments. + +## Relevance + +Chaos Engineering is a key practice in **Site Reliability Engineering (SRE)**. By deliberately introducing controlled failures, teams can identify weaknesses early and build systems that remain reliable under stress. This tutorial highlights the transition from a fragile single-container system to a resilient multi-container setup, aligning with real-world practices for designing reliable, fault-tolerant cloud-native applications. 
From e75a33e1831940e0c84c0b0a7ca074d2e3bba919 Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Tue, 23 Sep 2025 16:27:03 +0200 Subject: [PATCH 06/14] hanzhizh-bingjiez tutorial proposal --- .../hanzhizh-bingjiez/README.md | 29 ++++++++++++------- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md index d661fbfc77..db6d7c308e 100644 --- a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md +++ b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md @@ -2,7 +2,7 @@ ## Title -Chaos Engineering with Pumba: Building Fault-Tolerant Containerized Applications +Observability with eBPF and FlameGraphs: Profiling Containerized Applications ## Names and KTH ID @@ -19,22 +19,31 @@ Chaos Engineering with Pumba: Building Fault-Tolerant Containerized Applications ## Description -This executable tutorial demonstrates how to apply **Chaos Engineering** in containerized environments to improve system **fault tolerance** and **resilience**. Using [Pumba](https://github.com/alexei-led/pumba), a chaos testing tool for Docker, learners will inject controlled failures (random restarts, network delays, resource stress) into a containerized web application, and then gradually enhance the system to withstand these disruptions. +This executable tutorial demonstrates how to use **eBPF** for observability and performance profiling of containerized applications, combined with **FlameGraph visualization** to identify bottlenecks. Unlike traditional monitoring tools, eBPF allows dynamic tracing at the kernel level with minimal overhead, giving developers deep insights into system and application behavior. The tutorial will guide users through the following scenario: -1. **Baseline setup**: Run a single-container Flask web application. Inject chaos (kill/restart, latency) with Pumba and observe how the service becomes unavailable. +1. 
**Baseline setup**: Run a simple containerized Python/Flask application and perform a stress test to generate system load. -2. **Redundancy with multiple replicas**: Deploy multiple web containers behind an NGINX load balancer. Repeat the chaos experiments and see how the system remains available despite single-container failures. +2. **Traditional monitoring**: Use `top` and `strace` to show the limitations of conventional tools for observability. -3. **Self-healing with restart policies**: Enable Docker restart policies so containers automatically recover after being killed by Pumba. Demonstrate improved resilience. +3. **eBPF tracing**: + - Use `bpftrace` and `bcc-tools` to trace syscalls (`execve`, `open`), file I/O, and TCP traffic of the containerized app. + - Use `profile` sampling to capture CPU stack traces during load testing. -4. **Application-level resilience**: Modify the Flask app to handle timeouts and provide fallback responses when backend services are delayed by chaos injections. +4. **FlameGraph visualization**: + - Collect profiling data with `perf` or `bpftrace`. + - Convert stack traces into folded format and generate an interactive `flamegraph.svg` using Brendan Gregg’s FlameGraph toolkit. + - Observe which functions and code paths consume the most CPU time. -5. **Iterative improvement**: Compare “before vs after” conditions, highlighting how redundancy, self-healing, and graceful degradation improve system reliability. +5. **Before vs After comparison**: + - Show how traditional monitoring only reveals high-level metrics. + - Demonstrate how eBPF + FlameGraph provides deep, actionable insights into performance bottlenecks. -This tutorial will be delivered through [KillerKoda](https://killercoda.com), using Docker Playground to safely run chaos experiments. +The tutorial will be delivered on [KillerKoda](https://killercoda.com), using Linux playgrounds that support Docker, bpftrace, and perf tools. 
-## Relevance +**Relevance** -Chaos Engineering is a key practice in **Site Reliability Engineering (SRE)**. By deliberately introducing controlled failures, teams can identify weaknesses early and build systems that remain reliable under stress. This tutorial highlights the transition from a fragile single-container system to a resilient multi-container setup, aligning with real-world practices for designing reliable, fault-tolerant cloud-native applications. +Observability and performance profiling are critical aspects of DevOps and Site Reliability Engineering (SRE). This tutorial introduces students to modern, cutting-edge techniques using **eBPF**, which is increasingly adopted in production environments for performance debugging, security, and monitoring. + +By combining eBPF with **FlameGraph visualization**, learners can bridge the gap between raw system-level tracing and intuitive performance insights, aligning with DevOps principles of continuous improvement and operational excellence. From 320a1a910ab6037ea9e43c60890a007c16d6794f Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Fri, 3 Oct 2025 23:09:35 +0200 Subject: [PATCH 07/14] submit tutorial --- contributions/executable-tutorial/hanzhizh-bingjiez/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md index db6d7c308e..521fbb91f0 100644 --- a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md +++ b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md @@ -47,3 +47,5 @@ The tutorial will be delivered on [KillerKoda](https://killercoda.com), using Li Observability and performance profiling are critical aspects of DevOps and Site Reliability Engineering (SRE). This tutorial introduces students to modern, cutting-edge techniques using **eBPF**, which is increasingly adopted in production environments for performance debugging, security, and monitoring. 
By combining eBPF with **FlameGraph visualization**, learners can bridge the gap between raw system-level tracing and intuitive performance insights, aligning with DevOps principles of continuous improvement and operational excellence. + +Tutorial Link: https://killercoda.com/dazhi/scenario/my-ebpf-tutorial \ No newline at end of file From d313d85da9da64ba9ca7bbf14ef91301706a01de Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Thu, 9 Oct 2025 08:27:51 +0200 Subject: [PATCH 08/14] submit feedback --- .../feedback/hanzhizh-bingjiez/README.md | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 contributions/feedback/hanzhizh-bingjiez/README.md diff --git a/contributions/feedback/hanzhizh-bingjiez/README.md b/contributions/feedback/hanzhizh-bingjiez/README.md new file mode 100644 index 0000000000..ccc62c8e75 --- /dev/null +++ b/contributions/feedback/hanzhizh-bingjiez/README.md @@ -0,0 +1,22 @@ +# Assignment Proposal + +## Title + +Feedback on executable tutorial:"Zero-Trust Data Pipelines: A Practical DevOps Security Tutorial" + +## Names and KTH ID + +- Hanzhi Zhang (hanzhizh@kth.se) +- Bingjie Zhao (bingjiez@kth.se) + +## Deadline + +- Task 3 + +## Category + +- Feedback + +## Description + +Link to [Zero-Trust Data Pipelines: A Practical DevOps Security Tutorial comment](https://github.com/KTH/devops-course/pull/2882#issuecomment-3380911246) \ No newline at end of file From 7085926801e4a1f96d947e6099b587ec6100a1e8 Mon Sep 17 00:00:00 2001 From: dazhijiong <69571910+dazhijiong@users.noreply.github.com> Date: Thu, 9 Oct 2025 08:31:44 +0200 Subject: [PATCH 09/14] Delete contributions/executable-tutorial/hanzhizh-bingjiez/README.md --- .../hanzhizh-bingjiez/README.md | 51 ------------------- 1 file changed, 51 deletions(-) delete mode 100644 contributions/executable-tutorial/hanzhizh-bingjiez/README.md diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md 
deleted file mode 100644 index dc2ea444e8..0000000000 --- a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md +++ /dev/null @@ -1,51 +0,0 @@ -# Assignment Proposal - -## Title - -Observability with eBPF and FlameGraphs: Profiling Containerized Applications - -## Names and KTH ID - -- Hanzhi Zhang (hanzhizh@kth.se) -- Bingjie Zhao (bingjiez@kth.se) - -## Deadline - -- Task 3 - -## Category - -- Executable tutorial - -## Description - -This executable tutorial demonstrates how to use **eBPF** for observability and performance profiling of containerized applications, combined with **FlameGraph visualization** to identify bottlenecks. Unlike traditional monitoring tools, eBPF allows dynamic tracing at the kernel level with minimal overhead, giving developers deep insights into system and application behavior. - -The tutorial will guide users through the following scenario: - -1. **Baseline setup**: Run a simple containerized Python/Flask application and perform a stress test to generate system load. - -2. **Traditional monitoring**: Use `top` and `strace` to show the limitations of conventional tools for observability. - -3. **eBPF tracing**: - - Use `bpftrace` and `bcc-tools` to trace syscalls (`execve`, `open`), file I/O, and TCP traffic of the containerized app. - - Use `profile` sampling to capture CPU stack traces during load testing. - -4. **FlameGraph visualization**: - - Collect profiling data with `perf` or `bpftrace`. - - Convert stack traces into folded format and generate an interactive `flamegraph.svg` using Brendan Gregg’s FlameGraph toolkit. - - Observe which functions and code paths consume the most CPU time. - -5. **Before vs After comparison**: - - Show how traditional monitoring only reveals high-level metrics. - - Demonstrate how eBPF + FlameGraph provides deep, actionable insights into performance bottlenecks. 
- -The tutorial will be delivered on [KillerKoda](https://killercoda.com), using Linux playgrounds that support Docker, bpftrace, and perf tools. - -**Relevance** - -Observability and performance profiling are critical aspects of DevOps and Site Reliability Engineering (SRE). This tutorial introduces students to modern, cutting-edge techniques using **eBPF**, which is increasingly adopted in production environments for performance debugging, security, and monitoring. - -By combining eBPF with **FlameGraph visualization**, learners can bridge the gap between raw system-level tracing and intuitive performance insights, aligning with DevOps principles of continuous improvement and operational excellence. - -Tutorial Link: https://killercoda.com/dazhi/scenario/my-ebpf-tutorial From 67f4af7fad2b2b82b61da43f64b188618c954baf Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Thu, 9 Oct 2025 08:34:01 +0200 Subject: [PATCH 10/14] submit feedback --- .../hanzhizh-bingjiez/README.md | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644 contributions/executable-tutorial/hanzhizh-bingjiez/README.md diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md new file mode 100644 index 0000000000..dc2ea444e8 --- /dev/null +++ b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md @@ -0,0 +1,51 @@ +# Assignment Proposal + +## Title + +Observability with eBPF and FlameGraphs: Profiling Containerized Applications + +## Names and KTH ID + +- Hanzhi Zhang (hanzhizh@kth.se) +- Bingjie Zhao (bingjiez@kth.se) + +## Deadline + +- Task 3 + +## Category + +- Executable tutorial + +## Description + +This executable tutorial demonstrates how to use **eBPF** for observability and performance profiling of containerized applications, combined with **FlameGraph visualization** to identify bottlenecks. 
Unlike traditional monitoring tools, eBPF allows dynamic tracing at the kernel level with minimal overhead, giving developers deep insights into system and application behavior. + +The tutorial will guide users through the following scenario: + +1. **Baseline setup**: Run a simple containerized Python/Flask application and perform a stress test to generate system load. + +2. **Traditional monitoring**: Use `top` and `strace` to show the limitations of conventional tools for observability. + +3. **eBPF tracing**: + - Use `bpftrace` and `bcc-tools` to trace syscalls (`execve`, `open`), file I/O, and TCP traffic of the containerized app. + - Use `profile` sampling to capture CPU stack traces during load testing. + +4. **FlameGraph visualization**: + - Collect profiling data with `perf` or `bpftrace`. + - Convert stack traces into folded format and generate an interactive `flamegraph.svg` using Brendan Gregg’s FlameGraph toolkit. + - Observe which functions and code paths consume the most CPU time. + +5. **Before vs After comparison**: + - Show how traditional monitoring only reveals high-level metrics. + - Demonstrate how eBPF + FlameGraph provides deep, actionable insights into performance bottlenecks. + +The tutorial will be delivered on [KillerKoda](https://killercoda.com), using Linux playgrounds that support Docker, bpftrace, and perf tools. + +**Relevance** + +Observability and performance profiling are critical aspects of DevOps and Site Reliability Engineering (SRE). This tutorial introduces students to modern, cutting-edge techniques using **eBPF**, which is increasingly adopted in production environments for performance debugging, security, and monitoring. + +By combining eBPF with **FlameGraph visualization**, learners can bridge the gap between raw system-level tracing and intuitive performance insights, aligning with DevOps principles of continuous improvement and operational excellence. 
+ +Tutorial Link: https://killercoda.com/dazhi/scenario/my-ebpf-tutorial From bd4af8fb3f0ef98a337d2fe96a8d50a07a462d76 Mon Sep 17 00:00:00 2001 From: dazhijiong <69571910+dazhijiong@users.noreply.github.com> Date: Thu, 9 Oct 2025 08:37:04 +0200 Subject: [PATCH 11/14] Delete contributions/executable-tutorial/hanzhizh-bingjiez/README.md --- .../hanzhizh-bingjiez/README.md | 51 ------------------- 1 file changed, 51 deletions(-) delete mode 100644 contributions/executable-tutorial/hanzhizh-bingjiez/README.md diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md deleted file mode 100644 index dc2ea444e8..0000000000 --- a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md +++ /dev/null @@ -1,51 +0,0 @@ -# Assignment Proposal - -## Title - -Observability with eBPF and FlameGraphs: Profiling Containerized Applications - -## Names and KTH ID - -- Hanzhi Zhang (hanzhizh@kth.se) -- Bingjie Zhao (bingjiez@kth.se) - -## Deadline - -- Task 3 - -## Category - -- Executable tutorial - -## Description - -This executable tutorial demonstrates how to use **eBPF** for observability and performance profiling of containerized applications, combined with **FlameGraph visualization** to identify bottlenecks. Unlike traditional monitoring tools, eBPF allows dynamic tracing at the kernel level with minimal overhead, giving developers deep insights into system and application behavior. - -The tutorial will guide users through the following scenario: - -1. **Baseline setup**: Run a simple containerized Python/Flask application and perform a stress test to generate system load. - -2. **Traditional monitoring**: Use `top` and `strace` to show the limitations of conventional tools for observability. - -3. **eBPF tracing**: - - Use `bpftrace` and `bcc-tools` to trace syscalls (`execve`, `open`), file I/O, and TCP traffic of the containerized app. 
- - Use `profile` sampling to capture CPU stack traces during load testing. - -4. **FlameGraph visualization**: - - Collect profiling data with `perf` or `bpftrace`. - - Convert stack traces into folded format and generate an interactive `flamegraph.svg` using Brendan Gregg’s FlameGraph toolkit. - - Observe which functions and code paths consume the most CPU time. - -5. **Before vs After comparison**: - - Show how traditional monitoring only reveals high-level metrics. - - Demonstrate how eBPF + FlameGraph provides deep, actionable insights into performance bottlenecks. - -The tutorial will be delivered on [KillerKoda](https://killercoda.com), using Linux playgrounds that support Docker, bpftrace, and perf tools. - -**Relevance** - -Observability and performance profiling are critical aspects of DevOps and Site Reliability Engineering (SRE). This tutorial introduces students to modern, cutting-edge techniques using **eBPF**, which is increasingly adopted in production environments for performance debugging, security, and monitoring. - -By combining eBPF with **FlameGraph visualization**, learners can bridge the gap between raw system-level tracing and intuitive performance insights, aligning with DevOps principles of continuous improvement and operational excellence. 
- -Tutorial Link: https://killercoda.com/dazhi/scenario/my-ebpf-tutorial From a02a6e4f123ec20d45e937a7a52b780121af2230 Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Thu, 9 Oct 2025 08:39:42 +0200 Subject: [PATCH 12/14] Fix: update feedback README only --- contributions/feedback/hanzhizh-bingjiez/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contributions/feedback/hanzhizh-bingjiez/README.md b/contributions/feedback/hanzhizh-bingjiez/README.md index ccc62c8e75..12ecdf0a17 100644 --- a/contributions/feedback/hanzhizh-bingjiez/README.md +++ b/contributions/feedback/hanzhizh-bingjiez/README.md @@ -19,4 +19,4 @@ Feedback on executable tutorial:"Zero-Trust Data Pipelines: A Practical DevOps S ## Description -Link to [Zero-Trust Data Pipelines: A Practical DevOps Security Tutorial comment](https://github.com/KTH/devops-course/pull/2882#issuecomment-3380911246) \ No newline at end of file +Link to the feedback [Zero-Trust Data Pipelines: A Practical DevOps Security Tutorial comment](https://github.com/KTH/devops-course/pull/2882#issuecomment-3380911246) \ No newline at end of file From cac6ee8ddaedd9575720c8024042fe5b7fa38ea8 Mon Sep 17 00:00:00 2001 From: zhangqiang Date: Thu, 9 Oct 2025 08:43:37 +0200 Subject: [PATCH 13/14] Fix: update feedback --- .../hanzhizh-bingjiez/README.md | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644 contributions/executable-tutorial/hanzhizh-bingjiez/README.md diff --git a/contributions/executable-tutorial/hanzhizh-bingjiez/README.md b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md new file mode 100644 index 0000000000..dc2ea444e8 --- /dev/null +++ b/contributions/executable-tutorial/hanzhizh-bingjiez/README.md @@ -0,0 +1,51 @@ +# Assignment Proposal + +## Title + +Observability with eBPF and FlameGraphs: Profiling Containerized Applications + +## Names and KTH ID + +- Hanzhi Zhang (hanzhizh@kth.se) +- Bingjie Zhao (bingjiez@kth.se) + +## Deadline + +- Task 3 + +## 
Category
+
+- Executable tutorial
+
+## Description
+
+This executable tutorial demonstrates how to use **eBPF** for observability and performance profiling of containerized applications, combined with **FlameGraph visualization** to identify bottlenecks. Unlike traditional monitoring tools, eBPF allows dynamic tracing at the kernel level with minimal overhead, giving developers deep insights into system and application behavior.
+
+The tutorial will guide users through the following scenario:
+
+1. **Baseline setup**: Run a simple containerized Python/Flask application and perform a stress test to generate system load.
+
+2. **Traditional monitoring**: Use `top` and `strace` to show the limitations of conventional tools for observability.
+
+3. **eBPF tracing**:
+   - Use `bpftrace` and `bcc-tools` to trace syscalls (`execve`, `open`), file I/O, and TCP traffic of the containerized app.
+   - Use `profile` sampling to capture CPU stack traces during load testing.
+
+4. **FlameGraph visualization**:
+   - Collect profiling data with `perf` or `bpftrace`.
+   - Convert stack traces into folded format and generate an interactive `flamegraph.svg` using Brendan Gregg’s FlameGraph toolkit.
+   - Observe which functions and code paths consume the most CPU time.
+
+5. **Before vs After comparison**:
+   - Show how traditional monitoring only reveals high-level metrics.
+   - Demonstrate how eBPF + FlameGraph provides deep, actionable insights into performance bottlenecks.
+
+The tutorial will be delivered on [KillerKoda](https://killercoda.com), using Linux playgrounds that support Docker, bpftrace, and perf tools.
+
+**Relevance**
+
+Observability and performance profiling are critical aspects of DevOps and Site Reliability Engineering (SRE). This tutorial introduces students to modern, cutting-edge techniques using **eBPF**, which is increasingly adopted in production environments for performance debugging, security, and monitoring.
+
+By combining eBPF with **FlameGraph visualization**, learners can bridge the gap between raw system-level tracing and intuitive performance insights, aligning with DevOps principles of continuous improvement and operational excellence.
+
+Tutorial Link: https://killercoda.com/dazhi/scenario/my-ebpf-tutorial
From c4fabe585ecf1216fe23664e3d1b96f21a4ec00c Mon Sep 17 00:00:00 2001
From: zhangqiang
Date: Thu, 9 Oct 2025 08:48:21 +0200
Subject: [PATCH 14/14] Fix: update feedback

---
 contributions/feedback/hanzhizh-bingjiez/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contributions/feedback/hanzhizh-bingjiez/README.md b/contributions/feedback/hanzhizh-bingjiez/README.md
index 12ecdf0a17..398b34b769 100644
--- a/contributions/feedback/hanzhizh-bingjiez/README.md
+++ b/contributions/feedback/hanzhizh-bingjiez/README.md
@@ -19,4 +19,4 @@ Feedback on executable tutorial:"Zero-Trust Data Pipelines: A Practical DevOps S
 
 ## Description
 
-Link to the feedback [Zero-Trust Data Pipelines: A Practical DevOps Security Tutorial comment](https://github.com/KTH/devops-course/pull/2882#issuecomment-3380911246)
\ No newline at end of file
+Link to feedback [Zero-Trust Data Pipelines: A Practical DevOps Security Tutorial comment](https://github.com/KTH/devops-course/pull/2882#issuecomment-3380911246)
\ No newline at end of file