Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script #1374
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,6 +15,7 @@ Specifying a supported value for the `cuda-version` metadata variable | |
| will select compatible values for Driver, cuDNN, and NCCL from the script's | ||
| internal matrix. Default CUDA versions are typically: | ||
|
|
||
| + * Dataproc 1.5: `11.6.2` | ||
| * Dataproc 2.0: `12.1.1` | ||
| * Dataproc 2.1: `12.4.1` | ||
| * Dataproc 2.2 & 2.3: `12.6.3` | ||
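
The default mapping listed in this hunk can be sketched as a small lookup. This is only an illustration of the table above, not the script's internal matrix: the function name `default_cuda_version` is hypothetical, and the assumption that image versions arrive as strings such as `2.1` or `2.2.5-debian12` is mine.

```shell
#!/bin/sh
# Hypothetical helper (not from the install script): return the default CUDA
# version for a Dataproc image version, using the values listed in the README.
default_cuda_version() {
  case "$1" in
    1.5|1.5[.-]*)                  echo "11.6.2" ;;
    2.0|2.0[.-]*)                  echo "12.1.1" ;;
    2.1|2.1[.-]*)                  echo "12.4.1" ;;
    2.2|2.2[.-]*|2.3|2.3[.-]*)     echo "12.6.3" ;;
    *)
      # Unknown image version: report on stderr and signal failure.
      echo "no default CUDA version known for image '$1'" >&2
      return 1
      ;;
  esac
}
```

An explicit `cuda-version` metadata value would still take priority over any such default, per the paragraph above.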
|
|
@@ -191,20 +192,19 @@ This script accepts the following metadata parameters: | |
| * `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`). | ||
| * `nccl-version`: (Optional) Specify NCCL version. | ||
| * `include-pytorch`: (Optional) `yes`|`no`. Default: `no`. | ||
| - If `yes`, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda | ||
| + If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda | ||
| environment. | ||
| * `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment. | ||
| Default: `dpgce`. | ||
| * `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`. | ||
| For NVIDIA Container Toolkit configuration. Auto-detected if not specified. | ||
| - * `http-proxy`: (Optional) URL of an HTTP proxy for downloads. | ||
| + * `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`). | ||
| + * `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set. | ||
| + * `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set. | ||
| + * `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults. | ||
| * `http-proxy-pem-uri`: (Optional) A `gs://` path to the | ||
| - PEM-encoded certificate file used by the proxy specified in | ||
| - `http-proxy`. This is needed if the proxy uses TLS and its | ||
| - certificate is not already trusted by the cluster's default trust | ||
| - store (e.g., if it's a self-signed certificate or signed by an | ||
| - internal CA). The script will install this certificate into the | ||
| - system and Java trust stores. | ||
| + PEM-encoded CA certificate file for the proxy specified in | ||
| + `http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS. | ||
|
Comment on lines +195 to +207 (Contributor):
The updated descriptions for |
||
| * `invocation-type`: (For Custom Images) Set to `custom-images` by image | ||
| building tools. Not typically set by end-users creating clusters. | ||
| * **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and | ||
|
|
@@ -217,6 +217,20 @@ This script accepts the following metadata parameters: | |
| modulus_md5sum=<md5sum-of-your-mok-key-modulus> | ||
| ``` | ||
|
|
||
| + ### Enhanced Proxy Support | ||
|
|
||
| + This script includes robust support for environments requiring an HTTP/HTTPS proxy: | ||
|
|
||
| + * **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port). | ||
| + * **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path). The script will: | ||
| +   * Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`). | ||
| +   * Add the CA to the Java cacerts trust store. | ||
| +   * Configure Conda to use the system trust store. | ||
| +   * Switch proxy communications to use HTTPS. | ||
| + * **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, and Java to use the specified proxy settings and custom CA if provided. | ||
| + * **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly. | ||
| + * **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads. | ||
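
The precedence and bypass rules described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the function names (`strip_scheme`, `resolve_proxy`, `merge_no_proxy`) are hypothetical, the exact fallback order when `proxy-uri` and only one of `http-proxy`/`https-proxy` are both set is my reading of the parameter descriptions, and the metadata-server entries in the default bypass list (`metadata.google.internal`, `169.254.169.254`) are assumed spellings of "the metadata server".

```shell
#!/bin/sh
# Sketch only: all function names below are hypothetical.

# Drop a leading scheme such as "http://" so values are plain host:port.
strip_scheme() {
  printf '%s\n' "${1#*://}"
}

# Args: http-proxy, https-proxy, proxy-uri, http-proxy-pem-uri (any may be empty).
# Prints HTTP_PROXY= and HTTPS_PROXY= lines for whichever values resolve.
resolve_proxy() {
  local http_meta https_meta uri_meta pem_uri scheme http_hostport https_hostport
  http_meta="$(strip_scheme "$1")"
  https_meta="$(strip_scheme "$2")"
  uri_meta="$(strip_scheme "$3")"
  pem_uri="$4"

  # Per the README, supplying a custom CA switches proxy connections to HTTPS.
  scheme="http"
  if [ -n "${pem_uri}" ]; then scheme="https"; fi

  # Assumed precedence: the specific keys win over proxy-uri, and
  # https-proxy falls back to http-proxy, then to proxy-uri.
  http_hostport="${http_meta:-${uri_meta}}"
  https_hostport="${https_meta:-${http_meta:-${uri_meta}}}"

  if [ -n "${http_hostport}" ]; then echo "HTTP_PROXY=${scheme}://${http_hostport}"; fi
  if [ -n "${https_hostport}" ]; then echo "HTTPS_PROXY=${scheme}://${https_hostport}"; fi
}

# Arg: user-supplied no-proxy value (comma- or space-separated, may be empty).
# Prints the defaults with the normalized user entries appended.
merge_no_proxy() {
  local defaults extra
  # Assumed default list; the README names these categories, not exact strings.
  defaults="localhost,metadata.google.internal,169.254.169.254,.google.com,.googleapis.com"
  # Turn spaces into commas, squeeze repeats, trim leading/trailing commas.
  extra="$(printf '%s\n' "$1" | tr ' ' ',' | tr -s ',' | sed -e 's/^,//' -e 's/,$//')"
  if [ -n "${extra}" ]; then
    echo "${defaults},${extra}"
  else
    echo "${defaults}"
  fi
}
```

For example, `resolve_proxy "" "" "your-proxy.com:3128" ""` yields the same `http://your-proxy.com:3128` value for both variables, while passing a `gs://` PEM path as the fourth argument changes the scheme to `https`.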
|
Comment on lines +220 to +232 (Contributor)
|
||
|
|
||
| ### Loading Built Kernel Module & Secure Boot | ||
|
|
||
| When the script needs to build NVIDIA kernel modules from source (e.g., using | ||
|
|
@@ -280,6 +294,7 @@ handles metric creation and reporting. | |
| * **Installation Failures:** Examine the initialization action log on the | ||
| affected node, typically `/var/log/dataproc-initialization-script-0.log` | ||
| (or a similar name if multiple init actions are used). | ||
| + * **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures. | ||
|
||
| * **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`), | ||
| check its service logs using `sudo journalctl -u gpu-utilization-agent.service`. | ||
| * **Driver Load or Secure Boot Problems:** Review `dmesg` output and | ||
|
|
@@ -298,7 +313,7 @@ handles metric creation and reporting. | |
| * The script extensively caches downloaded artifacts (drivers, CUDA `.run` | ||
| files) and compiled components (kernel modules, NCCL, Conda environments) | ||
| to a GCS bucket. This bucket is typically specified by the | ||
| - `dataproc-temp-bucket` cluster property or metadata. | ||
| + `dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware. | ||
|
||
| * **First Run / Cache Warming:** Initial runs on new configurations (OS, | ||
| kernel, or driver version combinations) that require source compilation | ||
| (e.g., for NCCL or kernel modules when no pre-compiled version is | ||
|
|
@@ -324,4 +339,4 @@ handles metric creation and reporting. | |
| Debian-based systems, including handling of archived backports repositories | ||
| to ensure dependencies can be met. | ||
| * Tested primarily with Dataproc 2.0+ images. Support for older Dataproc | ||
| - 1.5 images is limited. | ||
| + 1.5 images is limited. | ||
Adding Dataproc 1.5 CUDA version to the default CUDA versions list improves the documentation's completeness and helps users understand the supported versions for older Dataproc images.