35 changes: 25 additions & 10 deletions gpu/README.md
@@ -15,6 +15,7 @@ Specifying a supported value for the `cuda-version` metadata variable
will select compatible values for Driver, cuDNN, and NCCL from the script's
internal matrix. Default CUDA versions are typically:

+ * Dataproc 1.5: `11.6.2`
Review comment (Contributor) [medium]:
Adding the Dataproc 1.5 CUDA version to the list of default CUDA versions makes the documentation more complete and helps users understand which versions are supported on older Dataproc images.

* Dataproc 2.0: `12.1.1`
* Dataproc 2.1: `12.4.1`
* Dataproc 2.2 & 2.3: `12.6.3`
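
As a rough illustration of the defaults listed in this hunk, the mapping can be pictured as a lookup keyed by Dataproc image version. This is a hypothetical sketch, not the script's internal matrix: `DEFAULT_CUDA` and `default_cuda_version` are names invented here, and only the version pairs come from the README text above.

```python
import re

# Hypothetical sketch: default CUDA version per Dataproc image version,
# copied from the README list above. The real script keeps its own internal
# matrix (which also selects Driver, cuDNN, and NCCL); these names are illustrative.
DEFAULT_CUDA = {
    "1.5": "11.6.2",
    "2.0": "12.1.1",
    "2.1": "12.4.1",
    "2.2": "12.6.3",
    "2.3": "12.6.3",
}


def default_cuda_version(image_version: str) -> str:
    """Return the documented default for an image such as '2.2' or '2.2-debian12'."""
    # Keep only the leading major.minor part of the image version string.
    match = re.match(r"\d+\.\d+", image_version)
    key = match.group(0) if match else image_version
    if key not in DEFAULT_CUDA:
        raise ValueError(f"no documented default CUDA version for Dataproc {image_version!r}")
    return DEFAULT_CUDA[key]
```

Calling `default_cuda_version("2.2-debian12")` looks up the `2.2` entry; an image version outside the documented list raises `ValueError` rather than guessing.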
@@ -191,20 +192,19 @@ This script accepts the following metadata parameters:
* `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`).
* `nccl-version`: (Optional) Specify NCCL version.
* `include-pytorch`: (Optional) `yes`|`no`. Default: `no`.
- If `yes`, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda
+ If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda
environment.
* `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment.
Default: `dpgce`.
* `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`.
For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
- * `http-proxy`: (Optional) URL of an HTTP proxy for downloads.
+ * `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
+ * `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
+ * `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
+ * `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults.
* `http-proxy-pem-uri`: (Optional) A `gs://` path to the
- PEM-encoded certificate file used by the proxy specified in
- `http-proxy`. This is needed if the proxy uses TLS and its
- certificate is not already trusted by the cluster's default trust
- store (e.g., if it's a self-signed certificate or signed by an
- internal CA). The script will install this certificate into the
- system and Java trust stores.
+ PEM-encoded CA certificate file for the proxy specified in
+ `http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.
Review comment (Contributor) [medium] on lines +195 to +207:
The updated descriptions for the `include-pytorch`, `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata parameters provide much-needed clarity on the new proxy configuration options and their usage. Replacing TensorFlow with Numba in the `include-pytorch` description also reflects the updated Conda environment package list.

* `invocation-type`: (For Custom Images) Set to `custom-images` by image
building tools. Not typically set by end-users creating clusters.
* **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and
@@ -217,6 +217,20 @@ This script accepts the following metadata parameters:
modulus_md5sum=<md5sum-of-your-mok-key-modulus>
```

+ ### Enhanced Proxy Support
+
+ This script includes robust support for environments requiring an HTTP/HTTPS proxy:
+
+ * **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port).
+ * **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path). The script will:
+     * Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`).
+     * Add the CA to the Java cacerts trust store.
+     * Configure Conda to use the system trust store.
+     * Switch proxy communications to use HTTPS.
+ * **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, and Java to use the specified proxy settings and custom CA if provided.
+ * **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly.
+ * **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads.
Review comment (Contributor) [medium] on lines +220 to +232:
The new 'Enhanced Proxy Support' section clearly outlines the capabilities and configuration options for proxy usage, including custom CA certificates and tool integration. This detailed explanation is highly beneficial for users operating in proxied environments.
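
The precedence and bypass rules described in this hunk can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the script's implementation: the function names, the exact entries in `DEFAULT_NO_PROXY` (the README only names `localhost`, the metadata server, `.google.com`, and `.googleapis.com`), the fallback order when `https-proxy` is unset but both `http-proxy` and `proxy-uri` are given, and the suffix-matching rule are all assumptions made here.

```python
import re

# Assumed default bypass list. The README names localhost, the metadata server,
# .google.com, and .googleapis.com; the exact entries in the script may differ.
DEFAULT_NO_PROXY = [
    "localhost",
    "metadata.google.internal",
    "169.254.169.254",
    ".google.com",
    ".googleapis.com",
]


def resolve_proxy_settings(metadata: dict) -> dict:
    """Apply the documented precedence to a dict of metadata values."""
    proxy_uri = metadata.get("proxy-uri") or None
    # http-proxy overrides proxy-uri when it is set.
    http_proxy = metadata.get("http-proxy") or proxy_uri
    # Assumption: an unset https-proxy falls back to http-proxy, then proxy-uri.
    https_proxy = metadata.get("https-proxy") or http_proxy

    # no-proxy accepts comma- or space-separated entries, appended to the defaults.
    raw = metadata.get("no-proxy") or ""
    no_proxy = list(DEFAULT_NO_PROXY)
    for entry in re.split(r"[,\s]+", raw):
        if entry and entry not in no_proxy:
            no_proxy.append(entry)

    return {"http": http_proxy, "https": https_proxy, "no_proxy": no_proxy}


def bypasses_proxy(host: str, no_proxy: list) -> bool:
    """Return True if host equals a no-proxy entry or is a subdomain of one.

    This suffix rule is a simplification; individual tools apply their own
    matching rules to no-proxy lists.
    """
    host = host.lower()
    for entry in no_proxy:
        domain = entry.lower().lstrip(".")
        if host == domain or host.endswith("." + domain):
            return True
    return False
```

For example, passing `{"proxy-uri": "your-proxy.com:3128", "no-proxy": "internal.example.com"}` (the second value is a made-up host) yields the same proxy for HTTP and HTTPS and a bypass list that starts with the assumed defaults and ends with the user-supplied entry.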


### Loading Built Kernel Module & Secure Boot

When the script needs to build NVIDIA kernel modules from source (e.g., using
@@ -280,6 +294,7 @@ handles metric creation and reporting.
* **Installation Failures:** Examine the initialization action log on the
affected node, typically `/var/log/dataproc-initialization-script-0.log`
(or a similar name if multiple init actions are used).
+ * **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures.
Review comment (Contributor) [medium]:
Adding a specific troubleshooting entry for 'Network/Proxy Issues' is a valuable addition. It guides users directly to relevant metadata settings and log checks, which will significantly aid in diagnosing connectivity problems.

* **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`),
check its service logs using `sudo journalctl -u gpu-utilization-agent.service`.
* **Driver Load or Secure Boot Problems:** Review `dmesg` output and
@@ -298,7 +313,7 @@ handles metric creation and reporting.
* The script extensively caches downloaded artifacts (drivers, CUDA `.run`
files) and compiled components (kernel modules, NCCL, Conda environments)
to a GCS bucket. This bucket is typically specified by the
- `dataproc-temp-bucket` cluster property or metadata.
+ `dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware.
Review comment (Contributor) [medium]:
Clarifying that 'Downloads and cache operations are proxy-aware' in the 'Performance & Caching' section is important. It assures users that the caching mechanism will function correctly even when a proxy is configured.

* **First Run / Cache Warming:** Initial runs on new configurations (OS,
kernel, or driver version combinations) that require source compilation
(e.g., for NCCL or kernel modules when no pre-compiled version is
@@ -324,4 +339,4 @@ handles metric creation and reporting.
Debian-based systems, including handling of archived backports repositories
to ensure dependencies can be met.
* Tested primarily with Dataproc 2.0+ images. Support for older Dataproc
- 1.5 images is limited.
+ 1.5 images is limited.