diff --git a/notebooks/EvaluationDaskMLTaxyCabEmpty.ipynb b/notebooks/EvaluationDaskMLTaxyCabEmpty.ipynb
index 6d0bbe8..8f5b4fc 100644
--- a/notebooks/EvaluationDaskMLTaxyCabEmpty.ipynb
+++ b/notebooks/EvaluationDaskMLTaxyCabEmpty.ipynb
@@ -81,7 +81,7 @@
"source": [
"%%time\n",
"import pandas as pd\n",
- "train_df = pd.read_csv('gs://aibt-2024/train.csv', nrows = 10_000_000, storage_options={'token': 'anon'})\n",
+ "train_df = pd.read_csv('gs://supaero/train.csv', nrows = 10_000_000, storage_options={'token': 'anon'})\n",
"train_df.dtypes"
]
},
@@ -276,7 +276,7 @@
"metadata": {},
"outputs": [],
"source": [
- "test_df = pd.read_csv('gs://aibt-2024/test.csv', storage_options={'token': 'anon'})\n",
+ "test_df = pd.read_csv('gs://supaero/test.csv', storage_options={'token': 'anon'})\n",
"test_df.dtypes"
]
},
@@ -339,7 +339,7 @@
"\n",
"### Some questions on this first Analysis\n",
"\n",
- "- What is the most expensive part of the analysis, the one that takes the most time (see the %%time we used above)?\n",
+ "- What is the most time-consuming part of the analysis (see the %%time we used in all the cells above)?\n",
""
]
},
@@ -474,7 +474,7 @@
"metadata": {},
"outputs": [],
"source": [
- "cluster.scale(10)"
+ "cluster.scale(16)"
]
},
{
@@ -654,7 +654,7 @@
"source": [
"\n",
" \n",
- "- Compare the time of this computation to the time of loading a subset of the Dataset with Pandas. Was it as fast as expected considering the number of workers we have?\n",
+ "- Compare the time of this computation to the time of loading a subset of the Dataset with Pandas. Was it as fast as expected considering the number of workers we have? (Remember, counting items already in memory is almost instantaneous compared to loading the data from an object store.)\n",
" \n",
"I recommend trying to calculate an estimation of the time it would take with Pandas to read the entire dataset, and comparing it to the total time it took by **all** our Dask workers.\n",
""
@@ -744,7 +744,7 @@
"\n",
"\n",
"\n",
- "- This computation is slow, especially compared with Pandas, why? (Look a the Dashboard, again).\n",
+ "- This computation is slow, especially compared to Pandas. Why? (Look at the Dashboard, again.)\n",
"- Which part of the computation is slow, look at the Dashboard to see the name of the tasks. Hint, this is the same as Pandas.\n",
""
]
@@ -766,7 +766,7 @@
"source": [
"\n",
" \n",
- "- How could we optimize the next computations, using which Dask method? Same as Spark...\n",
+ "- How could we optimize the next computations, using which Dask method? Same as Spark... With a cluster, we have enough memory to avoid repeating the most expensive operation...\n",
"- Where will be the data at the end of the computation triggered by this call?\n",
""
]
@@ -800,7 +800,7 @@
" \n",
"- Look at the Dashboard at what is happening beind the scene.\n",
" \n",
- "Wait for the end of this call on the Dashboard, then try again the previous computation on fare_amout.mean():\n",
+ "Wait for the end of this call **by watching the Dashboard**, then try again the previous computation on fare_amount.mean():\n",
""
]
},
@@ -1057,7 +1057,7 @@
"source": [
"\n",
" \n",
- "### BONUS Questions (you don't have to do this, just go back to it if you want to improve, skip it at first)\n",
+ "### 2-point question (it might be a bit complex; consider skipping it at first and coming back to it later)\n",
"\n",
"Some other questions to practice\n",
"\n",
@@ -1131,8 +1131,6 @@
"\n",
"Here again define a method so that we can use it later for our test set evaluation.\n",
"\n",
- "
\n",
- "\n",
"\n",
" \n",
"- Just do the same as with the Pandas example by defining a get_input_matrix(df) function. But this time you'll generate a dask array (not numpy) using `to_dask_array(lengths=True)` method on the dataframe object instead of `np.column_stack` (look a bit a dask docs in order to find how to use this method). You should do a method that generate the X input features dask array, and also the same with y training results. You can do just one method that return both (return X, y). \n",
@@ -1185,8 +1183,6 @@
"see https://ml.dask.org/glm.html \n",
"and https://ml.dask.org/modules/generated/dask_ml.linear_model.LinearRegression.html#dask_ml.linear_model.LinearRegression\n",
"\n",
- "
\n",
- "\n",
"\n",
" \n",
"- Train a LinearRegression model from dask_ml.linear_model on our inputs\n",
@@ -1227,7 +1223,7 @@
"metadata": {},
"outputs": [],
"source": [
- "test_df = dd.read_csv('gs://aibt-2024/test.csv', storage_options={'token': 'anon'})\n",
+ "test_df = dd.read_csv('gs://supaero/test.csv', storage_options={'token': 'anon'})\n",
"test_df"
]
},
@@ -1306,7 +1302,6 @@
"metadata": {},
"source": [
"\n",
- " \n",
"- What RMSE did you get? Compare it to the Pandas only computation.\n",
""
]
@@ -1326,13 +1321,12 @@
"id": "c5c75934",
"metadata": {},
"source": [
- "# Distributed XGboost (optionnal, you can skip it at first)\n",
+ "### Distributed XGBoost (consider skipping it at first if you have trouble with it)\n",
"\n",
"Just use the documentation here https://xgboost.readthedocs.io/en/stable/tutorials/dask.html#overview to train a model on this dataset using xgboost.\n",
"\n",
- "
\n",
"\n",
- " \n",
+ "\n",
"- Just copy/paste the example (dtrain = ..., output = ...), and modify some input variables.\n",
"- Then make a prediction (but don't forget to use your test set, not as in the prediction = ... example from the Xgboost doc).\n",
"- Compute the mean square error on it.\n",
@@ -1375,7 +1369,7 @@
"id": "993d1443",
"metadata": {},
"source": [
- "## Use Dask to scale computation on Hyper Parameter Search\n",
+ "# Use Dask to scale computation on Hyper Parameter Search\n",
"\n",
"As seen above, Dask is well suited to distribute Data and learn a model on a big Data set. However, not all the models can be trained in parallel on sub chunks of Data. See https://scikit-learn.org/stable/computing/scaling_strategies.html for the compatible models of Scikit learn for example.\n",
"\n",