modules/tutorials/pages/jupyterhub.adoc (52 additions, 1 deletion)
@@ -375,7 +375,7 @@ As mentioned in an earlier section, we want to define the endpoints dynamically
=== Driver Service (Spark)
-NOTE: when using Spark, please the `Provisos` section below.
+NOTE: When using Spark from within a notebook, please see the `Provisos` section below.
In the same way, we can use another script to define a driver service for each user.
This is essential when using Spark from within a JupyterHub notebook so that executor pods can be spawned from the user's kernel in a user-specific way.
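
A minimal sketch of what such a script could look like is shown below, using the Python `kubernetes` client. The function name, the `hub.jupyter.org/username` selector label, the service naming and port 2222 (the driver port visible in the executor logs further down) are assumptions for illustration; the actual script used in the demo may differ.

[source, python]
----
# Hypothetical per-user driver Service, created e.g. from a spawner hook.
# Assumptions: the `kubernetes` Python client is available, the user pod carries
# the `hub.jupyter.org/username` label set by KubeSpawner, and the Spark driver
# listens on port 2222 (the port visible in the executor logs below).
from kubernetes import client, config


def create_driver_service(username: str, pod_name: str, namespace: str) -> None:
    config.load_incluster_config()  # we run inside the cluster
    service = client.V1Service(
        metadata=client.V1ObjectMeta(name=pod_name),  # executors resolve this name
        spec=client.V1ServiceSpec(
            cluster_ip="None",  # headless: DNS resolves straight to the notebook pod
            selector={"hub.jupyter.org/username": username},
            ports=[client.V1ServicePort(name="spark-driver", port=2222, target_port=2222)],
            # a block-manager port would typically be exposed here as well
        ),
    )
    client.CoreV1Api().create_namespaced_service(namespace=namespace, body=service)
----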
@@ -538,3 +538,54 @@ These Pods in turn can mount *all* volumes and Secrets in that namespace.
To prevent this from breaking user separation, it is planned to use OPA Gatekeeper to define OPA rules that restrict what the created executor Pods can mount. This is not yet implemented in the demo nor reflected in this tutorial.
=== Overview
The notebook starts a distributed Spark cluster, which runs until the notebook kernel is stopped.

In order to connect to the S3 backend, the following settings must be configured in the Spark session:
Since the notebook image does not include any AWS or Hadoop libraries, these are listed under `spark.jars.packages`.
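
A minimal sketch of such a Spark session is shown below. The S3 endpoint, credentials and library versions are placeholders (the `hadoop-aws` version is chosen to match the `hadoop-client-api;3.3.4` seen in the logs below), not necessarily the exact values used in the tutorial.

[source, python]
----
# Placeholder values throughout: endpoint, credentials and versions are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jupyterhub-s3-demo")
    # The notebook image ships no AWS/Hadoop libraries, so Spark pulls them via Ivy:
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262",
    )
    # S3A connection settings (placeholder endpoint and credentials):
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
----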
How these libraries are handled can be seen by looking at the logs for the user pod and the executor pods that are spawned when the Spark session is created.
In the notebook pod (e.g. `jupyter-isla-williams---14730816`) we see that Spark uses Ivy to fetch each library and resolve the dependencies:

[source, console]
----
found org.apache.hadoop#hadoop-client-api;3.3.4 in central
found org.xerial.snappy#snappy-java;1.1.8.2 in central
...
----

And in the executor, we see from the logs (simplified for clarity) that the user-specific driver service is used to provide these libraries.
The executor connects to the service and then iterates through the list of resolved dependencies, fetching each package to a temporary folder (`/var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/spark-a03b09a7-869e-4778-ac04-fa935bbca5ab`) before copying it to the working folder (`/opt/spark/work-dir`):

[source, console]
----
Successfully created connection to jupyter-isla-williams---14730816/10.96.29.131:2222
Created local directory at /var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/blockmgr-5b70510d-7d4d-452f-818a-2a02bd0d4227
Connecting to driver: spark://CoarseGrainedScheduler@jupyter-isla-williams---14730816:2222
Successfully registered with driver
Fetching spark://jupyter-isla-williams---14730816:2222/files/org.checkerframework_checker-qual-2.5.2.jar with timestamp 1741174390840
Fetching spark://jupyter-isla-williams---14730816:2222/files/org.checkerframework_checker-qual-2.5.2.jar to /var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/spark-a03b09a7-869e-4778-ac04-fa935bbca5ab/fetchFileTemp8701341596301771486.tmp
Copying /var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/spark-a03b09a7-869e-4778-ac04-fa935bbca5ab/1075326831741174390840_cache to /opt/spark/work-dir/./org.checkerframework_checker-qual-2.5.2.jar
----

Once the Spark session has been created, the notebook reads data from S3, performs a simple aggregation and re-writes it in different formats.
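
As a rough illustration of such a notebook cell, re-using the `spark` session from above: the bucket, paths and column names below are placeholders, not the data set used in the tutorial.

[source, python]
----
# Illustrative only: bucket, paths and column names are placeholders.
from pyspark.sql import functions as F

df = spark.read.csv("s3a://demo-bucket/input/data.csv", header=True, inferSchema=True)

# A simple aggregation: row counts per value of one grouping column.
agg = df.groupBy("some_column").agg(F.count("*").alias("count"))

# Re-write the result in different formats.
agg.write.mode("overwrite").parquet("s3a://demo-bucket/output/agg-parquet")
agg.write.mode("overwrite").csv("s3a://demo-bucket/output/agg-csv", header=True)
----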