diff --git a/docs/machine-learning/projects-and-case-studies/advanced-projects.mdx b/docs/machine-learning/projects-and-case-studies/advanced-projects.mdx index e69de29..759abd4 100644 --- a/docs/machine-learning/projects-and-case-studies/advanced-projects.mdx +++ b/docs/machine-learning/projects-and-case-studies/advanced-projects.mdx @@ -0,0 +1,87 @@ +--- +title: Advanced & Generative AI Projects +sidebar_label: Advanced Projects +description: "Master the cutting edge with projects in LLM Agents, Generative Adversarial Networks (GANs), and Reinforcement Learning." +tags: [gen-ai, llm-agents, gan, reinforcement-learning, pytorch, transformers] +--- + +Advanced projects involve systems that don't just analyze data but **create** new data or **interact** autonomously with environments. At this level, you will work with transformer architectures, diffusion models, and feedback-based learning. + +## Project 1: Multi-Agent Research Assistant (LLM Ops) +**Goal:** Build a system where multiple AI agents collaborate to research a topic, verify facts, and write a formatted report. + +### Project Overview +This project moves from simple "Chat" to **Agentic Workflows**. You will learn how to orchestrate different LLM "personas" and give them tools to browse the web and write files. + +* **Tech Stack:** `LangChain` or `CrewAI`, `OpenAI API` or `Llama 3 (Ollama)`. +* **Key Concept:** **Tool Use (Function Calling)** and **Multi-Agent Orchestration**. +* **Success Metric:** Accuracy of citations and coherence of the final multi-step report. + +### Advanced Skills +1. **Orchestration:** Managing the "handoff" of data from one agent to the next. +2. **State Management:** Ensuring the agents remember what has already been researched. +3. **Prompt Engineering:** Writing system prompts that prevent agents from getting stuck in infinite loops. 
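
The three skills above (handoff, shared state, and loop prevention) can be sketched without any agent framework. This is a minimal illustration under stated assumptions, not the `LangChain` or `CrewAI` API: the agent functions, the `ResearchState` class, and the `max_steps` budget are all hypothetical names, and the "research" step returns a stub string instead of calling an LLM or the web.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    """Shared memory handed from agent to agent (hypothetical structure)."""
    topic: str
    notes: list[str] = field(default_factory=list)
    verified: list[str] = field(default_factory=list)
    report: str = ""
    seen_queries: set[str] = field(default_factory=set)


def researcher(state: ResearchState) -> str:
    """Collect one note; skip queries that were already researched."""
    query = f"background on {state.topic}"
    if query in state.seen_queries:
        return "verifier"  # nothing new to look up, so hand off
    state.seen_queries.add(query)
    state.notes.append(f"[stub result for: {query}]")  # stand-in for a tool call
    return "researcher"


def verifier(state: ResearchState) -> str:
    """Stand-in check: keep non-empty notes. A real verifier would check sources."""
    state.verified = [note for note in state.notes if note.strip()]
    return "writer"


def writer(state: ResearchState) -> str:
    """Format the verified notes into a short report."""
    body = "\n".join(f"- {note}" for note in state.verified)
    state.report = f"# Report: {state.topic}\n{body}"
    return "done"


AGENTS = {"researcher": researcher, "verifier": verifier, "writer": writer}


def run(topic: str, max_steps: int = 10) -> ResearchState:
    """Route control between agents until one returns 'done'."""
    state = ResearchState(topic=topic)
    current = "researcher"
    steps = 0
    while current != "done":
        if steps >= max_steps:  # hard cap that guards against infinite loops
            raise RuntimeError("step budget exhausted before the report was written")
        current = AGENTS[current](state)
        steps += 1
    return state
```

Each agent returns the name of the next agent, so the handoff logic stays in one place (`run`), and the `seen_queries` set is what lets the researcher notice it has already covered a query. In a real project the step budget would complement, not replace, a well-written system prompt.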
+ +## Project 2: Synthetic Image Generation (GANs or Diffusion) +**Goal:** Train a model to generate realistic images (e.g., human faces or artistic styles) that do not exist in the real world. + +### Project Overview +You will explore the "Generative" side of AI. You can choose between **Generative Adversarial Networks (GANs)** or the more modern **Latent Diffusion Models**. + +* **Key Algorithm:** $G$ (Generator) vs $D$ (Discriminator) or **Denoising Diffusion Probabilistic Models (DDPM)**. +* **Framework:** `PyTorch`. +* **Dataset:** [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) (Faces) or [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html). + +[Image showing the Denoising process: starting with pure noise and slowly revealing a clear image] + +### Key Mathematical Concepts +* **Adversarial Loss:** The generator learns to fool the discriminator: + + $$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$ + +* **Latent Space:** Understanding how low-dimensional "noise" maps to high-dimensional images. + +## Project 3: Autonomous RL Agent (Reinforcement Learning) +**Goal:** Train an agent to master a game (like Lunar Lander or Atari) or optimize a trading strategy through trial and error. + +### Project Overview +Reinforcement Learning (RL) is about maximizing rewards in an environment. There are no labels; only "points" for good actions and "penalties" for bad ones. + +* **Environment:** `OpenAI Gym` (Gymnasium). +* **Key Algorithm:** **Deep Q-Learning (DQN)** or **Proximal Policy Optimization (PPO)**. +* **Primary Metric:** Cumulative Reward over Time. + +## Advanced Architecture: The Transformer + +Most advanced projects today rely on the **Transformer** architecture, which uses **Self-Attention** to process data in parallel. 
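
To make the self-attention idea concrete, here is a dependency-free sketch of single-head scaled dot-product attention. It makes a simplifying assumption: queries, keys, and values are the raw input vectors themselves, whereas a real Transformer layer first applies learned projection matrices. The function names are illustrative only.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """For each query, return a softmax-weighted average of the value vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:  # every query is independent, which is what allows parallelism
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in K
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return outputs


def self_attention(X):
    """Self-attention without learned projections: Q, K, and V are all X."""
    return scaled_dot_product_attention(X, X, X)
```

Because each output row depends only on its own query and the full set of keys and values, all rows can be computed at once on a GPU. A recurrent network, by contrast, has to process tokens one step at a time.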
+ +```mermaid +graph TD + Input[Input Sequence] --> Embed[Input Embedding + Positional Encoding] + Embed --> MultiHead[Multi-Head Self-Attention] + MultiHead --> Norm[Layer Norm & Residual Connection] + Norm --> FFN[Feed Forward Network] + FFN --> Output[Output Probabilities] + + style MultiHead fill:#fce4ec,stroke:#d81b60,stroke-width:2px,color:#334 + style Input fill:#e1f5fe,stroke:#01579b,color:#334 + style Output fill:#c8e6c9,stroke:#2e7d32,color:#334 + +``` + +## The Advanced AI Stack + +* **Deployment:** `BentoML`, `Triton Inference Server`, or `vLLM` for fast LLM serving. +* **Optimization:** **Quantization** (making models smaller) and **LoRA** (Low-Rank Adaptation for fine-tuning). +* **Tracking:** `Weights & Biases` for monitoring complex training runs. +* **Compute:** Heavy reliance on **CUDA** and high-performance GPUs (A100/H100). + +## References + +* **Attention is All You Need:** [The original Transformer Paper](https://arxiv.org/abs/1706.03762) +* **OpenAI:** [Spinning Up in Deep RL](https://spinningup.openai.com/) +* **Hugging Face:** [Diffusion Models Course](https://huggingface.co/learn/diffusion-course/) + +--- + +**Advanced projects are the gateway to a career as an AI Engineer or Researcher. How do these technologies apply to real businesses?** \ No newline at end of file diff --git a/docs/machine-learning/projects-and-case-studies/beginner-projects.mdx b/docs/machine-learning/projects-and-case-studies/beginner-projects.mdx index e69de29..5867f87 100644 --- a/docs/machine-learning/projects-and-case-studies/beginner-projects.mdx +++ b/docs/machine-learning/projects-and-case-studies/beginner-projects.mdx @@ -0,0 +1,89 @@ +--- +title: Beginner ML Projects +sidebar_label: Beginner Projects +description: "Hands-on machine learning projects for beginners, including house price prediction, iris classification, and customer segmentation." 
+tags: [projects, regression, classification, clustering, python, scikit-learn] +--- + +The best way to learn Machine Learning is by building. These three projects are the "Hello World" of ML, covering the fundamental types of supervised and unsupervised learning. + +## Project 1: House Price Predictor (Regression) +**Goal:** Predict the continuous price of a house based on features like square footage, number of bedrooms, and location. + +### Project Overview +This project introduces **Linear Regression**. You will learn how to handle numerical data and minimize the error between your prediction and the actual price. + +* **Dataset:** [Ames Housing Dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) or California Housing. +* **Key Algorithm:** `LinearRegression` or `RandomForestRegressor`. +* **Primary Metric:** Mean Squared Error (MSE) or $R^2$ Score. + +### Implementation Steps +1. **Exploratory Data Analysis (EDA):** Visualize correlations using a heatmap. +2. **Preprocessing:** Handle missing values and scale features using `StandardScaler`. +3. **Training:** Split data into 80% training and 20% testing. +4. **Evaluation:** Calculate the $R^2$ score to see how much variance your model explains. + +## Project 2: Iris Flower Classifier (Classification) +**Goal:** Predict the species of an iris flower (Setosa, Versicolour, or Virginica) based on its petal and sepal measurements. + +### Project Overview +This is the classic "classification" problem. You will learn how to handle categorical targets and evaluate accuracy across multiple classes. + +* **Dataset:** [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris) (built into Scikit-Learn). +* **Key Algorithm:** `LogisticRegression` or `K-Nearest Neighbors (KNN)`. +* **Primary Metric:** Accuracy and the **Confusion Matrix**. + +### Implementation Steps +1. **Pairplots:** Use Seaborn to see how the species cluster based on petal width vs length. +2. 
**Training:** Use a Simple Decision Tree to see how the model "splits" the data. +3. **Evaluation:** Generate a classification report to check **Precision** and **Recall** for each flower type. + +## Project 3: Customer Segmentation (Clustering) +**Goal:** Group customers into "segments" based on their spending habits and income without using any pre-defined labels. + +### Project Overview +This project introduces **Unsupervised Learning**. Unlike the first two, there is no "correct answer." You are asking the model to find hidden patterns. + +* **Dataset:** [Mall Customer Segmentation](https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python). +* **Key Algorithm:** `K-Means Clustering`. +* **Primary Metric:** Silhouette Score or the "Elbow Method." + +### Implementation Steps +1. **Feature Selection:** Focus on "Annual Income" and "Spending Score." +2. **The Elbow Method:** Run K-Means for $k=1$ to $10$ to find the optimal number of clusters. +3. **Visualization:** Plot the clusters in different colors and identify the "Big Spenders" vs "Frugal" groups. + +## Project Workflow Summary + +The following diagram illustrates the standard workflow you should follow for every beginner project. + +```mermaid +graph LR + Data[Load Data] --> Clean[Clean & Preprocess] + Clean --> Split[Train/Test Split] + Split --> Model[Train Model] + Model --> Eval[Evaluate Metrics] + Eval --> Tune[Hyperparameter Tuning] + + style Data fill:#e1f5fe,stroke:#01579b,color:#333 + style Model fill:#fff3e0,stroke:#ef6c00,color:#333 + style Eval fill:#c8e6c9,stroke:#2e7d32,color:#333 + +``` + +## Recommended Tools for Beginners + +* **Google Colab:** No setup required; run Python in your browser. +* **Scikit-Learn:** The industry-standard library for classical ML. +* **Pandas & NumPy:** For data manipulation. +* **Matplotlib & Seaborn:** For data visualization. 
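
The workflow above (load, split, train, evaluate) can be sketched in plain Python to show what the metrics from Project 1 actually compute. This is an illustration under stated assumptions: the data is synthetic (a made-up linear price rule plus noise), the model is a one-feature least-squares line, and every function name is hypothetical. In a real project you would more likely use Scikit-Learn's `LinearRegression`, `mean_squared_error`, and `r2_score`.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and hold out the last test_ratio share."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = len(rows) - round(len(rows) * test_ratio)
    return rows[:cut], rows[cut:]


def fit_line(train):
    """Ordinary least squares for a single feature: y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the target variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def run_workflow():
    """Load synthetic data, split 80/20, train, and evaluate on the test set."""
    rng = random.Random(0)
    # Synthetic (square footage, price) pairs; the pricing rule is made up.
    data = [
        (sqft, 150 * sqft + 20_000 + rng.gauss(0, 10_000))
        for sqft in range(600, 3100, 50)
    ]
    train, test = train_test_split(data)
    slope, intercept = fit_line(train)
    y_true = [y for _, y in test]
    y_pred = [slope * x + intercept for x, _ in test]
    return {
        "slope": slope,
        "intercept": intercept,
        "mse": mse(y_true, y_pred),
        "r2": r_squared(y_true, y_pred),
        "n_train": len(train),
        "n_test": len(test),
    }
```

Evaluating only on the held-out 20% is the point of the split: a score computed on the training rows would overstate how well the model generalizes.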
+ +## References + +* **Kaggle:** [House Prices Competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) +* **Scikit-Learn Docs:** [Supervised Learning Guide](https://scikit-learn.org/stable/supervised_learning.html) +* **UCI Machine Learning Repository:** [Classic Datasets](https://archive.ics.uci.edu/ml/index.php) + +--- + +**Building these projects provides the foundation for more complex systems. Once you have mastered these, are you ready to tackle real-world case studies?** \ No newline at end of file diff --git a/docs/machine-learning/projects-and-case-studies/industry-case-studies.mdx b/docs/machine-learning/projects-and-case-studies/industry-case-studies.mdx index e69de29..4cf951d 100644 --- a/docs/machine-learning/projects-and-case-studies/industry-case-studies.mdx +++ b/docs/machine-learning/projects-and-case-studies/industry-case-studies.mdx @@ -0,0 +1,78 @@ +--- +title: "Industry Case Studies: ML at Scale" +sidebar_label: Case Studies +description: "Examining how top-tier tech companies implement machine learning to solve real-world business challenges." +tags: [case-studies, netflix, uber, amazon, recommendation-engines, mlops] +--- + +Moving from a local Jupyter notebook to a system serving millions of users requires a shift in thinking. These case studies highlight how industry giants solve problems regarding **scale**, **latency**, and **data drift**. + +## 1. Netflix: The Artwork Personalization Engine +**The Problem:** How do you convince a user to click on a movie they’ve never heard of? + +**The Solution:** Netflix doesn't just recommend movies; they recommend **artwork**. If you watch many romantic movies, you might see a thumbnail of the lead couple. If you watch comedies, you might see the same movie represented by a funny side-character. + +* **Technology:** Multi-Armed Bandits (MAB). 
+* **Logic:** The system continuously tests different images (arms) for the same title and exploits the one with the highest Click-Through Rate (CTR) for your specific profile. +* **Outcome:** Significant increase in "Take-rate" (the percentage of recommendations that result in a play). + +## 2. Uber: Michelangelo & Marketplace Forecasting +**The Problem:** Predicting "Estimated Time of Arrival" (ETA) and "Surge Pricing" in real-time across thousands of cities. + +**The Solution:** Uber built **Michelangelo**, an internal ML-as-a-Service platform. It allows data scientists to train and deploy models that process trillions of data points, including weather, historical traffic, and current driver supply. + +* **Technology:** Deep Learning and Gradient Boosted Decision Trees (GBDT). +* **Key Challenge:** Feature Store management. Ensuring that "training data" and "serving data" are identical to avoid **Training-Serving Skew**. + +## 3. Amazon: Predictive Supply Chain +**The Problem:** How can Amazon offer "Same-Day Delivery" without knowing exactly what people will buy? + +**The Solution:** **Anticipatory Shipping**. Amazon uses deep learning to predict what customers in a specific zip code are likely to purchase *before* they actually click "Buy." They move those items to a local fulfillment center in advance. + +* **Technology:** Time-Series Forecasting (DeepAR). +* **Impact:** Massive reduction in shipping costs and delivery times. + +## 4. Comparing Architectures + +The transition from a simple model to an industry-grade system involves adding layers for monitoring and data validation. 
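
As one small illustration of the monitoring layer mentioned above, the sketch below flags drift when the mean of a live feature moves away from its training mean. This is a simple heuristic chosen for illustration, not a method attributed to any of the companies discussed here; the function names and the `threshold` value of 0.5 are assumptions.

```python
from statistics import mean, stdev


def mean_shift_score(train_values, live_values):
    """How many training standard deviations the live mean has moved."""
    mu = mean(train_values)
    sigma = stdev(train_values)
    if sigma == 0:
        raise ValueError("training feature is constant; the score is undefined")
    return abs(mean(live_values) - mu) / sigma


def has_drifted(train_values, live_values, threshold=0.5):
    """Flag drift when the shift exceeds a (hypothetical) threshold."""
    return mean_shift_score(train_values, live_values) > threshold
```

A production monitor would usually track many features, compare whole distributions rather than just means, and alert over time windows, but the core idea is the same: compare what the model sees now with what it saw during training.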
+ +```mermaid +graph TD + Data[Raw Data Lake] --> Valid[Data Validation & Cleaning] + Valid --> Feat[Feature Store: Reusable Features] + Feat --> Train[Distributed Training Cluster] + Train --> Eval[Automated Model Evaluation] + Eval --> Deploy[Blue/Green Deployment] + Deploy --> Monitor[Monitoring: Drift & Bias Detection] + Monitor --> Data + + style Feat fill:#fff3e0,stroke:#ef6c00,color:#333 + style Monitor fill:#fce4ec,stroke:#d81b60,color:#333 + style Deploy fill:#e8f5e9,stroke:#2e7d32,color:#333 + +``` + +## 5. Key Lessons from the Industry + +| Challenge | Industry Solution | Why it Matters | +| --- | --- | --- | +| **Data Drift** | Continuous Monitoring | Models degrade as the world changes (e.g., shopping habits during a pandemic). | +| **Latency** | Model Quantization | A recommendation is useless if it takes 5 seconds to load a webpage. | +| **Scalability** | Distributed Computing | Training on petabytes of data requires clusters (Spark/Ray), not single GPUs. | + +## 6. Emerging Case Study: AI Agents in FinTech + +In 2026, companies like **Klarna** and **Stripe** are replacing traditional support flows with **Autonomous Agents**. + +* **Case:** An agent handles a "disputed transaction." +* **Workflow:** The agent queries the merchant API → Checks the user's location history → Compares with fraud patterns → Decides to approve/deny the refund → Updates the ledger. + +## References + +* **Netflix Tech Blog:** [Artwork Personalization at Netflix](https://netflixtechblog.com/artwork-personalization-c589f074ad76) +* **Uber Engineering:** [Meet Michelangelo: Uber’s ML Platform](https://www.uber.com/en-IN/blog/michelangelo-machine-learning-platform/) +* **Amazon Science:** [The Science of Anticipatory Shipping](https://www.amazon.science/) + +--- + +**Case studies prove that ML is about more than just accuracy: it's about reliability and system design. 
Now that you've seen the "what," are you ready to learn the "how" of deployment?** \ No newline at end of file diff --git a/docs/machine-learning/projects-and-case-studies/intermediate-projects.mdx b/docs/machine-learning/projects-and-case-studies/intermediate-projects.mdx index e69de29..0c10e37 100644 --- a/docs/machine-learning/projects-and-case-studies/intermediate-projects.mdx +++ b/docs/machine-learning/projects-and-case-studies/intermediate-projects.mdx @@ -0,0 +1,89 @@ +--- +title: Intermediate ML Projects +sidebar_label: Intermediate Projects +description: "Intermediate-level ML projects focusing on NLP, Computer Vision, and Time-Series forecasting." +tags: [nlp, computer-vision, time-series, deep-learning, xgboost, lstm] +--- + +Intermediate projects move beyond basic Scikit-Learn pipelines. At this level, you will deal with **unstructured data** (text and images) and **temporal data**, requiring more sophisticated feature engineering and deep learning frameworks. + +## Project 1: Sentiment Analysis on Movie Reviews (NLP) +**Goal:** Classify a text review as positive or negative using natural language processing. + +### Project Overview +This project introduces the challenges of turning text into numbers. You will explore word importance and sequence. + +* **Dataset:** [IMDb Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). +* **Key Techniques:** TF-IDF Vectorization, Word Embeddings, or BERT. +* **Algorithm:** `XGBoost` or a simple `RNN/LSTM`. + +### Challenges +1. **Text Cleaning:** Removing HTML tags, emojis, and stopwords. +2. **Sparsity:** Managing high-dimensional data created by large vocabularies. +3. **Context:** Moving from "Bag of Words" (ignoring order) to "Word Sequences" (preserving context). + +## Project 2: Digit Recognition (Computer Vision) +**Goal:** Correctly identify handwritten digits (0-9) from grayscale images. + +### Project Overview +This is the entry point into **Deep Learning**. 
You will move from flat feature vectors to spatial data processing. + +* **Dataset:** [MNIST Database](http://yann.lecun.com/exdb/mnist/). +* **Key Algorithm:** Convolutional Neural Networks (CNN). +* **Framework:** `TensorFlow/Keras` or `PyTorch`. + +### Implementation Steps +1. **Reshaping:** Convert image arrays into a format compatible with CNNs (Height, Width, Channels). +2. **Normalization:** Scale pixel values from [0, 255] to [0, 1]. +3. **Architecture:** Build a model with `Conv2D` and `MaxPooling2D` layers, plus `Dropout` to reduce overfitting. + +## Project 3: Stock Price or Weather Forecasting (Time-Series) +**Goal:** Predict future values based on historical sequential data. + +### Project Overview +Time-series data is unique because the order of data points matters. You will learn to handle "autocorrelation." + +* **Dataset:** Yahoo Finance (Stock) or NOAA (Weather). +* **Key Algorithm:** `Prophet` (by Meta), `ARIMA`, or `LSTMs`. +* **Primary Metric:** Root Mean Squared Error (RMSE). + +### Key Concepts +1. **Stationarity:** Checking if the mean and variance change over time. +2. **Windowing:** Creating "Sliding Windows" where the previous $N$ days are used to predict the next day. +3. **Seasonality:** Identifying repeating patterns (e.g., higher sales during holidays). + +## Intermediate Project Workflow + +At this stage, your workflow adds "Feature Engineering" and "Architecture Design" phases. + +```mermaid +graph TD + Data[Unstructured Data: Text/Images] --> Prep[Advanced Preprocessing: NLP/Vision] + Prep --> Design[Model Architecture Design: CNN/LSTM/XGB] + Design --> Train[GPU Accelerated Training] + Train --> Eval[Evaluation: F1-Score/RMSE] + Eval --> Error[Error Analysis: Why did it fail?] 
+ Error --> Design + + style Data fill:#e1f5fe,stroke:#01579b,color:#333 + style Design fill:#fff3e0,stroke:#ef6c00,color:#333 + style Error fill:#fce4ec,stroke:#d81b60,color:#333 + +``` + +## Recommended Tools for Intermediate Level + +* **Frameworks:** `PyTorch` or `TensorFlow`. +* **Boosting:** `XGBoost`, `LightGBM`, or `CatBoost`. +* **NLP Tools:** `Hugging Face Transformers`, `spaCy`. +* **Hardware:** Access to GPUs (Google Colab or Kaggle Kernels). + +## References + +* **Hugging Face:** [NLP Course](https://huggingface.co/learn/nlp-course/) +* **DeepLearning.AI:** [Convolutional Neural Networks Course](https://www.coursera.org/learn/convolutional-neural-networks) +* **Prophet:** [Forecasting at Scale](https://facebook.github.io/prophet/) + +--- + +**Intermediate projects transition you from a "user" of libraries to a "builder" of architectures. Are you ready to dive into the cutting edge of AI?** \ No newline at end of file