
Commit 9828d97

notebook summary
1 parent d845462 commit 9828d97

File tree

1 file changed

modules/tutorials/pages/jupyterhub.adoc

Lines changed: 52 additions & 1 deletion
@@ -375,7 +375,7 @@ As mentioned in an earlier section, we want to define the endpoints dynamically
=== Driver Service (Spark)

-NOTE: when using Spark, please the `Provisos` section below.
+NOTE: When using Spark from within a notebook, please see the `Provisos` section below.

In the same way, we can use another script to define a driver service for each user.
This is essential when using Spark from within a JupyterHub notebook so that executor pods can be spawned from the user's kernel in a user-specific way.
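For context, the per-user service matters because executor pods must be able to reach back to the driver running inside that user's notebook pod. The sketch below shows the kind of session settings involved; the host name pattern and port are illustrative assumptions (the tutorial's own script defines the real values), not code taken from the demo.

[source, python]
----
# Hypothetical sketch only: tell Spark executors where to find the driver.
# The service name pattern and port are assumptions, not values from the demo.
.config("spark.driver.host", "jupyter-<username>")  # user-specific driver Service
.config("spark.driver.port", "2222")                # port exposed by that Service
----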
@@ -538,3 +538,54 @@ These Pods in turn can mount *all* volumes and Secrets in that namespace.
To prevent this from breaking user separation, it is planned to use an OPA gatekeeper to define OPA rules that restrict what the created executor Pods can mount. This is not yet implemented in the demo nor reflected in this tutorial.

=== Overview

The notebook starts a distributed Spark cluster, which runs until the notebook kernel is stopped.
In order to connect to the S3 backend, the following settings must be configured in the Spark session:

[source, python]
----
...
.config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000/")
.config("spark.hadoop.fs.s3a.path.style.access", "true")
.config("spark.hadoop.fs.s3a.access.key", ...)
.config("spark.hadoop.fs.s3a.secret.key", ...)
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
.config("spark.jars.packages", "org.apache.hadoop:hadoop-client-api:3.3.4,org.apache.hadoop:hadoop-client-runtime:3.3.4,org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.162")
...
----
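To show how these settings fit together, here is a minimal sketch of a `SparkSession` builder that uses them. The application name, the Kubernetes master URL, and the environment variables used for the credentials are assumptions made for illustration, not the notebook's actual code.

[source, python]
----
# Hypothetical sketch: a SparkSession with the S3A settings shown above.
# Master URL, app name and credential handling are assumptions, not demo code.
import os

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("notebook-spark")                            # assumed name
    .master("k8s://https://kubernetes.default.svc:443")   # assumed in-cluster master
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000/")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["S3_ACCESS_KEY"])  # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["S3_SECRET_KEY"])  # placeholder
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.162")  # abridged; full list above
    .getOrCreate()
)
----

Once `getOrCreate()` returns, the executor pods are spawned and stay up until the session is stopped or the notebook kernel is shut down.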

Since the notebook image does not include any AWS or Hadoop libraries, these are listed under `spark.jars.packages`.
How these libraries are handled can be seen in the logs of the user pod and of the executor pods that are spawned when the Spark session is created.
In the notebook pod (e.g. `jupyter-isla-williams---14730816`) we can see that Spark uses Ivy to fetch each library and resolve its dependencies:

[source, console]
----
:: loading settings :: url = jar:file:/usr/local/spark-3.5.2-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.hadoop#hadoop-client-api added as a dependency
org.apache.hadoop#hadoop-client-runtime added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
org.apache.hadoop#hadoop-common added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-bf8973c2-1a2f-425e-a272-2ef86cb852f8;1.0
confs: [default]
found org.apache.hadoop#hadoop-client-api;3.3.4 in central
found org.xerial.snappy#snappy-java;1.1.8.2 in central
...
----

And in the executor, we see from the logs (simplified for clarity) that the user-specific driver service is used to provide these libraries.
The executor connects to the service and then iterates through the list of resolved dependencies, fetching each package to a temporary folder (`/var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/spark-a03b09a7-869e-4778-ac04-fa935bbca5ab`) before copying it to the working folder (`/opt/spark/work-dir`):

[source, console]
----
Successfully created connection to jupyter-isla-williams---14730816/10.96.29.131:2222
Created local directory at /var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/blockmgr-5b70510d-7d4d-452f-818a-2a02bd0d4227
Connecting to driver: spark://CoarseGrainedScheduler@jupyter-isla-williams---14730816:2222
Successfully registered with driver
Fetching spark://jupyter-isla-williams---14730816:2222/files/org.checkerframework_checker-qual-2.5.2.jar with timestamp 1741174390840
Fetching spark://jupyter-isla-williams---14730816:2222/files/org.checkerframework_checker-qual-2.5.2.jar to /var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/spark-a03b09a7-869e-4778-ac04-fa935bbca5ab/fetchFileTemp8701341596301771486.tmp
Copying /var/data/spark-bfed3050-5f63-441d-9799-a196d7b54ce9/spark-a03b09a7-869e-4778-ac04-fa935bbca5ab/1075326831741174390840_cache to /opt/spark/work-dir/./org.checkerframework_checker-qual-2.5.2.jar
----

Once the Spark session has been created, the notebook reads data from S3, performs a simple aggregation, and writes the result back out in different formats.
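As a rough illustration of that last step, the sketch below shows the kind of read/aggregate/write round trip involved; the bucket name, paths, column name and output formats are invented for the example and are not taken from the notebook.

[source, python]
----
# Hypothetical sketch of the read -> aggregate -> write round trip.
# Bucket, paths and column names are invented for illustration.
df = spark.read.csv("s3a://demo-bucket/input/", header=True, inferSchema=True)

# A simple aggregation: row counts per category.
agg = df.groupBy("category").count()

# Write the result back to S3 in several formats.
agg.write.mode("overwrite").parquet("s3a://demo-bucket/output/parquet/")
agg.write.mode("overwrite").json("s3a://demo-bucket/output/json/")
agg.write.mode("overwrite").csv("s3a://demo-bucket/output/csv/", header=True)
----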
