notebooks/EvaluationDaskMLTaxyCabEmpty.ipynb (32 changes: 13 additions & 19 deletions)
@@ -81,7 +81,7 @@
"source": [
"%%time\n",
"import pandas as pd\n",
"train_df = pd.read_csv('gs://aibt-2024/train.csv', nrows = 10_000_000, storage_options={'token': 'anon'})\n",
"train_df = pd.read_csv('gs://supaero/train.csv', nrows = 10_000_000, storage_options={'token': 'anon'})\n",
"train_df.dtypes"
]
},
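For readers following the diff outside the notebook, here is a minimal sketch of what this cell does, assuming the `gcsfs` package is installed so pandas can read `gs://` URLs (the bucket path is the one the diff changes the cell to):

```python
import pandas as pd

# Read only the first 10M rows of the taxi-fare training set from a public
# GCS bucket; storage_options={'token': 'anon'} requests anonymous access.
train_df = pd.read_csv(
    "gs://supaero/train.csv",
    nrows=10_000_000,
    storage_options={"token": "anon"},
)
print(train_df.dtypes)
```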
@@ -276,7 +276,7 @@
"metadata": {},
"outputs": [],
"source": [
"test_df = pd.read_csv('gs://aibt-2024/test.csv', storage_options={'token': 'anon'})\n",
"test_df = pd.read_csv('gs://supaero/test.csv', storage_options={'token': 'anon'})\n",
"test_df.dtypes"
]
},
@@ -339,7 +339,7 @@
"\n",
"### Some questions on this first Analysis\n",
"\n",
"- What is the most expensive part of the analysis, the one that takes the most time (see the %%time we used above)?\n",
"- What is the most time consuming part of the analysis (see the %%time we used in all the cells above)?\n",
"</span>"
]
},
@@ -474,7 +474,7 @@
"metadata": {},
"outputs": [],
"source": [
"cluster.scale(10)"
"cluster.scale(16)"
]
},
{
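The cluster object itself is created in cells outside this diff; in the sketch below a `LocalCluster` from `dask.distributed` stands in for it (the course setup may well use a different cluster class, e.g. from dask-gateway):

```python
from dask.distributed import Client, LocalCluster

# Stand-in for the notebook's cluster; scale(16) requests 16 workers,
# matching the value this cell is changed to in the diff.
cluster = LocalCluster(n_workers=1)
cluster.scale(16)
client = Client(cluster)
print(client.dashboard_link)  # open this URL to watch the tasks run
```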
@@ -654,7 +654,7 @@
"source": [
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
" \n",
"- Compare the time of this computation to the time of loading a subset of the Dataset with Pandas. Was it as fast as expected considering the number of workers we have?\n",
"- Compare the time of this computation to the time of loading a subset of the Dataset with Pandas. Was it as fast as expected considering the number of workers we have (Remember, counting items in memory is almost instataneous compared to loading the data from an object store)?\n",
" \n",
"I recommend trying to calculate an estimation of the time it would take with Pandas to read the entire dataset, and comparing it to the total time it took by **all** our Dask workers.\n",
"</span>"
@@ -744,7 +744,7 @@
"\n",
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
"\n",
"- This computation is slow, especially compared with Pandas, why? (Look a the Dashboard, again).\n",
"- This computation is slow, especially compared to Pandas, why? (Look a the Dashboard, again).\n",
"- Which part of the computation is slow, look at the Dashboard to see the name of the tasks. Hint, this is the same as Pandas.\n",
"</span>"
]
@@ -766,7 +766,7 @@
"source": [
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
" \n",
"- How could we optimize the next computations, using which Dask method? Same as Spark...\n",
"- How could we optimize the next computations, using which Dask method? Same as Spark... With a cluster, we have enough memory to avoid repeating the most time expensive operation...\n",
"- Where will be the data at the end of the computation triggered by this call?\n",
"</span>"
]
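The Dask method being hinted at is `persist()`, the analogue of Spark's `cache()`/`persist()`: it materializes the partitions in the workers' distributed memory so the expensive CSV read is not repeated. A sketch (the `fare_amount` column name comes from the taxi-fare dataset):

```python
from dask.distributed import wait

# Load the partitions into distributed memory once; later computations
# reuse them instead of re-reading from the object store.
ddf = ddf.persist()
wait(ddf)  # block until the data actually sits in worker memory

# This aggregation now runs on in-memory partitions and is much faster:
ddf.fare_amount.mean().compute()
```

This also answers the second question: after the call, the data lives spread across the workers' memory, not on the client machine.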
@@ -800,7 +800,7 @@
" \n",
"- Look at the Dashboard at what is happening beind the scene.\n",
" \n",
"Wait for the end of this call on the Dashboard, then try again the previous computation on fare_amout.mean():\n",
"Wait for the end of this call **by looking on the Dashboard**, then try again the previous computation on fare_amout.mean():\n",
"</span>"
]
},
@@ -1057,7 +1057,7 @@
"source": [
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
" \n",
"### BONUS Questions (you don't have to do this, just go back to it if you want to improve, skip it at first)\n",
"### 2 points question (it might be a bit complex, consider skipping it at first, just go back to it later if you're stuck)\n",
"\n",
"Some other questions to practice\n",
"\n",
@@ -1131,8 +1131,6 @@
"\n",
"Here again define a method so that we can use it later for our test set evaluation.\n",
"\n",
"<br>\n",
"\n",
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
" \n",
"- Just do the same as with the Pandas example by defining a get_input_matrix(df) function. But this time you'll generate a dask array (not numpy) using `to_dask_array(lengths=True)` method on the dataframe object instead of `np.column_stack` (look a bit a dask docs in order to find how to use this method). You should do a method that generate the X input features dask array, and also the same with y training results. You can do just one method that return both (return X, y). \n",
@@ -1185,8 +1183,6 @@
"see https://ml.dask.org/glm.html \n",
"and https://ml.dask.org/modules/generated/dask_ml.linear_model.LinearRegression.html#dask_ml.linear_model.LinearRegression\n",
"\n",
"<br>\n",
"\n",
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
" \n",
"- Train a LinearRegression model from dask_ml.linear_model on our inputs\n",
@@ -1227,7 +1223,7 @@
"metadata": {},
"outputs": [],
"source": [
"test_df = dd.read_csv('gs://aibt-2024/test.csv', storage_options={'token': 'anon'})\n",
"test_df = dd.read_csv('gs://supaero/test.csv', storage_options={'token': 'anon'})\n",
"test_df"
]
},
@@ -1306,7 +1302,6 @@
"metadata": {},
"source": [
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
" \n",
"- What RMSE did you get? Compare it to the Pandas only computation.\n",
"</span>"
]
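A sketch of the RMSE computation; `X_test` and `y_test` are assumed to come from applying the same `get_input_matrix` helper to a labelled hold-out split:

```python
import dask.array as da

y_pred = lr.predict(X_test)
# Root mean squared error, built lazily then materialized in one compute().
rmse = da.sqrt(((y_pred - y_test) ** 2).mean()).compute()
print(f"RMSE: {rmse:.3f}")
```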
@@ -1326,13 +1321,12 @@
"id": "c5c75934",
"metadata": {},
"source": [
"# Distributed XGboost (optionnal, you can skip it at first)\n",
"### Distributed XGboost (consider skipping it at first if you've trouble doing it)\n",
"\n",
"Just use the documentation here https://xgboost.readthedocs.io/en/stable/tutorials/dask.html#overview to train a model on this dataset using xgboost.\n",
"\n",
"<br>\n",
"<span style=\"color:#EB5E0B;font-style:italic\">\n",
" \n",
"\n",
"- Just copy/paste the example (dtrain = ..., output = ...), and modify some input variables.\n",
"- Then make a prediction (but don't forget to use your test set, not as in the prediction = ... example from the Xgboost doc).\n",
"- Compute the mean square error on it.\n",
@@ -1375,7 +1369,7 @@
"id": "993d1443",
"metadata": {},
"source": [
"## Use Dask to scale computation on Hyper Parameter Search\n",
"# Use Dask to scale computation on Hyper Parameter Search\n",
"\n",
"As seen above, Dask is well suited to distribute Data and learn a model on a big Data set. However, not all the models can be trained in parallel on sub chunks of Data. See https://scikit-learn.org/stable/computing/scaling_strategies.html for the compatible models of Scikit learn for example.\n",
"\n",