| 1 | +{ |
| 2 | +"cells": [ |
| 3 | +{ |
| 4 | +"cell_type": "markdown", |
| 5 | +"id": "27b09e13", |
| 6 | +"metadata": {}, |
| 7 | +"source": [ |
| 8 | +"# Algorithm Optimization problems" |
| 9 | +] |
| 10 | +}, |
| 11 | +{ |
| 12 | +"cell_type": "markdown", |
| 13 | +"id": "6cb6a105", |
| 14 | +"metadata": {}, |
| 15 | +"source": [ |
| 16 | +"## Exercise 3: Implementing a KNN Algorithm with Scikit-learn\n", |
| 17 | +"\n", |
| 18 | +"In this exercise, we will have a brief introduction to the Scikit-learn library. We will see how this library can be used to implement the KNN algorithm in less than 20 lines of code. Then the task will be to try to optimize the parameter of this algorithm. Don't worry! Scikit-Learn, and also de KNN algorithm will be explained very well in future modules.\n", |
| 19 | +"\n", |
| 20 | +"**The dataset**\n", |
| 21 | +"\n", |
| 22 | +"We are going to use the famous iris data set for our KNN example. The dataset consists of four attributes: sepal-width, sepal-length, petal-width and petal-length. These are the attributes of specific types of iris plant. The goal is to predict the class to which these plants belong. There are three classes in the dataset: Iris-setosa, Iris-versicolor and Iris-virginica.\n", |
| 23 | +"\n", |
| 24 | +"\n", |
| 25 | +"**Library installation**\n", |
| 26 | +"\n", |
| 27 | +"\n", |
| 28 | +"First let's install the Scikit-learn library by entering the following command in the terminal:\n", |
| 29 | +"\n", |
| 30 | +"```py\n", |
| 31 | +"pip install -U scikit-learn\n", |
| 32 | +"```\n", |
| 33 | +"\n", |
| 34 | +"Now let's see how to implement the KNN algorithm:\n", |
| 35 | +"\n", |
| 36 | +"**Importing libraries**" |
| 37 | +] |
| 38 | +}, |
| 39 | +{ |
| 40 | +"cell_type": "code", |
| 41 | +"execution_count": null, |
| 42 | +"id": "9d0f17f0", |
| 43 | +"metadata": {}, |
| 44 | +"outputs": [], |
| 45 | +"source": [ |
| 46 | +"import numpy as np\n", |
| 47 | +"import matplotlib.pyplot as plt\n", |
| 48 | +"import pandas as pd" |
| 49 | +] |
| 50 | +}, |
| 51 | +{ |
| 52 | +"cell_type": "markdown", |
| 53 | +"id": "d3e87b86", |
| 54 | +"metadata": {}, |
| 55 | +"source": [ |
| 56 | +"**Importing and loading the dataset**" |
| 57 | +] |
| 58 | +}, |
| 59 | +{ |
| 60 | +"cell_type": "code", |
| 61 | +"execution_count": null, |
| 62 | +"id": "ca2f71e1", |
| 63 | +"metadata": {}, |
| 64 | +"outputs": [], |
| 65 | +"source": [ |
| 66 | +"url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\"\n", |
| 67 | +"\n", |
| 68 | +"# Assign colum names to the dataset\n", |
| 69 | +"names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']\n", |
| 70 | +"\n", |
| 71 | +"# Read dataset to pandas dataframe\n", |
| 72 | +"dataset = pd.read_csv(url, names=names)" |
| 73 | +] |
| 74 | +}, |
| 75 | +{ |
| 76 | +"cell_type": "markdown", |
| 77 | +"id": "9fd366f8", |
| 78 | +"metadata": {}, |
| 79 | +"source": [ |
| 80 | +"**Looking at the first rows of the data**" |
| 81 | +] |
| 82 | +}, |
| 83 | +{ |
| 84 | +"cell_type": "code", |
| 85 | +"execution_count": null, |
| 86 | +"id": "058c3171", |
| 87 | +"metadata": {}, |
| 88 | +"outputs": [], |
| 89 | +"source": [ |
| 90 | +"dataset.head()" |
| 91 | +] |
| 92 | +}, |
| 93 | +{ |
| 94 | +"cell_type": "markdown", |
| 95 | +"id": "e3dae7c8", |
| 96 | +"metadata": {}, |
| 97 | +"source": [ |
| 98 | +"**Pre-processing**\n", |
| 99 | +"\n", |
| 100 | +"Split dataset into attributes and labels. Again, don't worry, this data preprocessing part will also be very well explained in a future module. Now, we will focus on the algorithm optimization." |
| 101 | +] |
| 102 | +}, |
| 103 | +{ |
| 104 | +"cell_type": "code", |
| 105 | +"execution_count": null, |
| 106 | +"id": "ce94dbb1", |
| 107 | +"metadata": {}, |
| 108 | +"outputs": [], |
| 109 | +"source": [ |
| 110 | +"X = dataset.iloc[:, :-1].values\n", |
| 111 | +"y = dataset.iloc[:, 4].values" |
| 112 | +] |
| 113 | +}, |
| 114 | +{ |
| 115 | +"cell_type": "markdown", |
| 116 | +"id": "d8b9dbed", |
| 117 | +"metadata": {}, |
| 118 | +"source": [ |
| 119 | +"**Train Test split**\n", |
| 120 | +"\n", |
| 121 | +"In machine learning, when we are building a model, we have to be careful to do not overfit it. Overfitting means that our model works very well in known data but when it works with unseen data, it has a poor performance. To avoid this, we divide our dataset into training and test datasets. This way our algorithm is tested on un-seen data to evaluate the performance of our algorithm. We will divide it into 80% training data and 20% for testing. This 4 blocks of data would be:\n", |
| 122 | +"\n", |
| 123 | +"-X_train: training features\n", |
| 124 | +"\n", |
| 125 | +"-X_test: test features\n", |
| 126 | +"\n", |
| 127 | +"-y_train: training labels\n", |
| 128 | +"\n", |
| 129 | +"-y_test: test labels" |
| 130 | +] |
| 131 | +}, |
| 132 | +{ |
| 133 | +"cell_type": "code", |
| 134 | +"execution_count": null, |
| 135 | +"id": "8ccaf14e", |
| 136 | +"metadata": {}, |
| 137 | +"outputs": [], |
| 138 | +"source": [ |
| 139 | +"from sklearn.model_selection import train_test_split\n", |
| 140 | +"\n", |
| 141 | +"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)" |
| 142 | +] |
| 143 | +}, |
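+{
+"cell_type": "markdown",
+"id": "a1f3c9d2",
+"metadata": {},
+"source": [
+"A quick, optional sanity check: since the iris dataset has 150 rows, a test_size of 0.20 should leave roughly 120 rows for training and 30 for testing. The cell below (not part of the original exercise) simply prints the shapes of the four blocks, assuming the split above has been run."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "b2e4d7f1",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Optional check: print the shapes of the four blocks produced by the split above\n",
+"print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)"
+]
+},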
| 144 | +{ |
| 145 | +"cell_type": "markdown", |
| 146 | +"id": "21b18743", |
| 147 | +"metadata": {}, |
| 148 | +"source": [ |
| 149 | +"**Feature Scaling**\n", |
| 150 | +"\n", |
| 151 | +"The majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.\n", |
| 152 | +"\n", |
| 153 | +"It is a good practice to scale the features so that all of them can be uniformly evaluated.\n", |
| 154 | +"We will read more about feature scaling in the data preprocessing module, but now, let's have a first look at how it is implemented." |
| 155 | +] |
| 156 | +}, |
| 157 | +{ |
| 158 | +"cell_type": "code", |
| 159 | +"execution_count": null, |
| 160 | +"id": "9904f4a0", |
| 161 | +"metadata": {}, |
| 162 | +"outputs": [], |
| 163 | +"source": [ |
| 164 | +"from sklearn.preprocessing import StandardScaler\n", |
| 165 | +"scaler = StandardScaler()\n", |
| 166 | +"scaler.fit(X_train)\n", |
| 167 | +"\n", |
| 168 | +"X_train = scaler.transform(X_train)\n", |
| 169 | +"X_test = scaler.transform(X_test)" |
| 170 | +] |
| 171 | +}, |
| 172 | +{ |
| 173 | +"cell_type": "markdown", |
| 174 | +"id": "6c368418", |
| 175 | +"metadata": {}, |
| 176 | +"source": [ |
| 177 | +"**Train the KNN algorithm**\n", |
| 178 | +"\n", |
| 179 | +"Now our data is ready for training. We will begin by importing the algorithm we want to train from the scikit-learn library, and then we will initialize it with one parameter: n_neigbours. This is basically the value for 'K'. We don't know what is the ideal number of neighbors(K) yet so we will start it with 5. After making the predictions we will try to otpimize this K value.\n", |
| 180 | +"\n", |
| 181 | +"Finally, we train our model using .fit on our training features and training labels (they both represent 80% of the data)." |
| 182 | +] |
| 183 | +}, |
| 184 | +{ |
| 185 | +"cell_type": "code", |
| 186 | +"execution_count": null, |
| 187 | +"id": "fa6cda99", |
| 188 | +"metadata": {}, |
| 189 | +"outputs": [], |
| 190 | +"source": [ |
| 191 | +"from sklearn.neighbors import KNeighborsClassifier\n", |
| 192 | +"classifier = KNeighborsClassifier(n_neighbors=5)\n", |
| 193 | +"classifier.fit(X_train, y_train)" |
| 194 | +] |
| 195 | +}, |
| 196 | +{ |
| 197 | +"cell_type": "markdown", |
| 198 | +"id": "14b37ff9", |
| 199 | +"metadata": {}, |
| 200 | +"source": [ |
| 201 | +"**Make predictions on our test data**\n", |
| 202 | +"\n", |
| 203 | +"The model has been trained, so now we use the same algorithm (stored in the 'classifier' variable) to predict only on the features of the test dataset (20% of the data). This time we don't use the labels, because we want to predict them, and then compare them with the real test labels." |
| 204 | +] |
| 205 | +}, |
| 206 | +{ |
| 207 | +"cell_type": "code", |
| 208 | +"execution_count": null, |
| 209 | +"id": "2dab6f34", |
| 210 | +"metadata": {}, |
| 211 | +"outputs": [], |
| 212 | +"source": [ |
| 213 | +"y_pred = classifier.predict(X_test)" |
| 214 | +] |
| 215 | +}, |
| 216 | +{ |
| 217 | +"cell_type": "markdown", |
| 218 | +"id": "5ebcc483", |
| 219 | +"metadata": {}, |
| 220 | +"source": [ |
| 221 | +"**Evaluate the algorithm**\n", |
| 222 | +"\n", |
| 223 | +"Confusion matrix, precision, recall and f1 score are the most commonly used evaluation metrics for classification problems.\n", |
| 224 | +"The confusion_matrix and classification_report methods of the sklearn.metrics can be used to calculate these metrics. Here we will compare our y_pred results versus our y_test true labels." |
| 225 | +] |
| 226 | +}, |
| 227 | +{ |
| 228 | +"cell_type": "code", |
| 229 | +"execution_count": null, |
| 230 | +"id": "7eaac127", |
| 231 | +"metadata": {}, |
| 232 | +"outputs": [], |
| 233 | +"source": [ |
| 234 | +"from sklearn.metrics import classification_report, confusion_matrix\n", |
| 235 | +"print(confusion_matrix(y_test, y_pred))\n", |
| 236 | +"print(classification_report(y_test, y_pred))" |
| 237 | +] |
| 238 | +}, |
| 239 | +{ |
| 240 | +"cell_type": "markdown", |
| 241 | +"id": "0365c265", |
| 242 | +"metadata": {}, |
| 243 | +"source": [ |
| 244 | +"**Optimizing the KNN algorithm**\n", |
| 245 | +"\n", |
| 246 | +"There is no way to know beforehand which value of 'K' yields the best results since the beginning. We randomly chose 5 as the 'K' value.\n", |
| 247 | +"\n", |
| 248 | +"This is where we optimize our algorithm. To help find the best value of K:\n", |
| 249 | +"\n", |
| 250 | +"1. First create a function that calculates the mean of error for all the predicted values where K ranges from 1 to 40. \n", |
| 251 | +"2. Plot the error values against K values.\n", |
| 252 | +"3. Vary the test and training size along with the K value to see how your results differ and how can you improve the accuracy of your algorithm. \n", |
| 253 | +"\n" |
| 254 | +] |
| 255 | +}, |
| 256 | +{ |
| 257 | +"cell_type": "code", |
| 258 | +"execution_count": null, |
| 259 | +"id": "6418e5e4", |
| 260 | +"metadata": {}, |
| 261 | +"outputs": [], |
| 262 | +"source": [ |
| 263 | +"error = []\n", |
| 264 | +"\n", |
| 265 | +"# Calculating error for K values between 1 and 40\n", |
| 266 | +"for i in range(1, 40):\n", |
| 267 | +" knn = KNeighborsClassifier(n_neighbors=i)\n", |
| 268 | +" knn.fit(X_train, y_train)\n", |
| 269 | +" pred_i = knn.predict(X_test)\n", |
| 270 | +" error.append(np.mean(pred_i != y_test))\n", |
| 271 | +" " |
| 272 | +] |
| 273 | +}, |
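+{
+"cell_type": "markdown",
+"id": "c5a8e3b7",
+"metadata": {},
+"source": [
+"As a convenience, the optional cell below reports which K produced the lowest mean error in the loop above. This is just a sketch and assumes the error list from the previous cell has already been populated."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "d9f2a6c4",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Optional sketch: report the K with the lowest mean error.\n",
+"# Assumes the `error` list from the previous cell is populated; K starts at 1,\n",
+"# so the best K is the index of the minimum error plus 1.\n",
+"best_k = int(np.argmin(error)) + 1\n",
+"print(f'Lowest mean error {min(error):.4f} at K = {best_k}')"
+]
+},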
| 274 | +{ |
| 275 | +"cell_type": "code", |
| 276 | +"execution_count": null, |
| 277 | +"id": "9483300a", |
| 278 | +"metadata": {}, |
| 279 | +"outputs": [], |
| 280 | +"source": [ |
| 281 | +"# plot the error values against K values. Establish the graph size.\n", |
| 282 | +"\n", |
| 283 | +"plt.figure(figsize=(12, 6))\n", |
| 284 | +"plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',\n", |
| 285 | +" markerfacecolor='blue', markersize=10)\n", |
| 286 | +"\n", |
| 287 | +"# title and labels (uncomment the following code lines)\n", |
| 288 | +"plt.title('Error Rate K Value') #plt.title('Error Rate K Value')\n", |
| 289 | +"plt.xlabel('K Value') #plt.xlabel('K Value')\n", |
| 290 | +"plt.ylabel('Mean Error') #plt.ylabel('Mean Error')" |
| 291 | +] |
| 292 | +}, |
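+{
+"cell_type": "markdown",
+"id": "e7b1c4a9",
+"metadata": {},
+"source": [
+"For step 3 of the exercise, the optional cell below is one possible sketch of how to vary the test size along with K. It assumes X and y from the earlier cells are available; the test sizes, the fixed random_state and the K range are arbitrary choices, not prescribed values, so feel free to experiment with your own."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "f4d8b2e6",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Optional sketch for step 3: vary the test size along with the K value.\n",
+"# Assumes X and y from the earlier cells are available.\n",
+"import numpy as np\n",
+"from sklearn.model_selection import train_test_split\n",
+"from sklearn.preprocessing import StandardScaler\n",
+"from sklearn.neighbors import KNeighborsClassifier\n",
+"\n",
+"for test_size in [0.1, 0.2, 0.3]:\n",
+"    # Re-split and re-scale for each test size (scaler fitted on training data only)\n",
+"    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=0)\n",
+"    scaler = StandardScaler().fit(X_tr)\n",
+"    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)\n",
+"    errors = []\n",
+"    for k in range(1, 41):\n",
+"        knn = KNeighborsClassifier(n_neighbors=k)\n",
+"        knn.fit(X_tr, y_tr)\n",
+"        errors.append(np.mean(knn.predict(X_te) != y_te))\n",
+"    best_k = int(np.argmin(errors)) + 1\n",
+"    print(f'test_size={test_size}: lowest mean error {min(errors):.4f} at K = {best_k}')"
+]
+},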
| 293 | +{ |
| 294 | +"cell_type": "markdown", |
| 295 | +"id": "b281af6e", |
| 296 | +"metadata": {}, |
| 297 | +"source": [ |
| 298 | +"Source:\n", |
| 299 | +"\n", |
| 300 | +"https://www.geeksforgeeks.org/\n", |
| 301 | +"\n", |
| 302 | +"https://stackabuse.com/big-o-notation-and-algorithm-analysis-with-python-examples/\n", |
| 303 | +"\n", |
| 304 | +"https://stackabuse.com/quicksort-in-python/\n", |
| 305 | +"\n", |
| 306 | +"https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/" |
| 307 | +] |
| 308 | +} |
| 309 | +], |
| 310 | +"metadata": { |
| 311 | +"interpreter": { |
| 312 | +"hash": "9248718ffe6ce6938b217e69dbcc175ea21f4c6b28a317e96c05334edae734bb" |
| 313 | +}, |
| 314 | +"kernelspec": { |
| 315 | +"display_name": "Python 3 (ipykernel)", |
| 316 | +"language": "python", |
| 317 | +"name": "python3" |
| 318 | +}, |
| 319 | +"language_info": { |
| 320 | +"codemirror_mode": { |
| 321 | +"name": "ipython", |
| 322 | +"version": 3 |
| 323 | +}, |
| 324 | +"file_extension": ".py", |
| 325 | +"mimetype": "text/x-python", |
| 326 | +"name": "python", |
| 327 | +"nbconvert_exporter": "python", |
| 328 | +"pygments_lexer": "ipython3", |
| 329 | +"version": "3.9.12" |
| 330 | +} |
| 331 | +}, |
| 332 | +"nbformat": 4, |
| 333 | +"nbformat_minor": 5 |
| 334 | +} |