@cjac cjac commented Jan 23, 2026

Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script

This commit significantly enhances the robustness and configurability of the GPU driver installation script, particularly for environments with HTTP/HTTPS proxies and those using Secure Boot.

Key Changes:

  • Enhanced Proxy Configuration (set_proxy):

    • Added support for https-proxy and proxy-uri metadata, providing more flexibility in proxy setups.
    • Improved NO_PROXY handling with sensible defaults (including Google APIs) and user-configurable additions.
    • Integrated support for custom proxy CA certificates via http-proxy-pem-uri, including installation into system, Java, and Conda trust stores.
    • Connections to the proxy now use HTTPS when a custom CA is provided.
    • Added proxy connection and reachability tests to fail fast on misconfiguration.
    • Ensures curl, apt, dnf, gpg, and Java all respect the proxy settings.
  • Robust GPG Key Import (import_gpg_keys):

    • Introduced a new function to reliably import GPG keys from URLs or keyservers, fully respecting the configured proxy and custom CA settings.
    • This replaces direct curl | gpg --import calls, making key fetching more resilient in restricted network environments.
  • Secure Boot Signing Refinements:

    • The configure_dkms_certs function now always fetches keys from Secret Manager if private_secret_name is set, ensuring modulus_md5sum is available for GCS cache paths.
    • Kernel module signing is now more clearly integrated into the build process.
    • Improved checks to ensure modules are actually signed and loadable after installation when Secure Boot is active.
  • Resilient Driver Installation:

    • The script now checks if the nvidia module can be loaded at the beginning of install_nvidia_gpu_driver and will re-attempt installation if it fails.
    • curl calls for downloading drivers and other artifacts now use retry flags and honor proxy settings.
  • Conda Environment for PyTorch:

    • Adjusted the package list for the Conda environment, removing TensorFlow to streamline it.
    • Added specific workarounds for Debian 10, using conda instead of mamba.
  • Documentation Updates (gpu/README.md):

    • Added details on the new proxy metadata: https-proxy, proxy-uri, no-proxy.
    • Created a new section "Enhanced Proxy Support" explaining the features.
    • Updated http-proxy-pem-uri description.
    • Added proxy considerations to the "Troubleshooting" section.

These changes aim to make the GPU initialization action more reliable across a wider range of network environments and improve the Secure Boot workflow.
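
The NO_PROXY behaviour described above (built-in defaults plus user-supplied additions) could be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `build_no_proxy` and the exact default host list are assumptions made for the example.

```bash
# Hypothetical defaults; the list used by the real script may differ.
DEFAULT_NO_PROXY="localhost,127.0.0.1,metadata.google.internal,169.254.169.254,.googleapis.com"

# build_no_proxy USER_VALUE
# Prints the defaults followed by any user-supplied hosts, comma-separated.
# USER_VALUE may be comma- or space-separated; duplicates are dropped.
build_no_proxy() {
  local user_value="${1:-}"
  local result="${DEFAULT_NO_PROXY}"
  local host
  # Turn commas into spaces so word splitting yields one host per item.
  for host in $(printf '%s' "${user_value}" | tr ',' ' '); do
    case ",${result}," in
      *",${host},"*) ;;                     # already present, skip it
      *) result="${result},${host}" ;;
    esac
  done
  printf '%s\n' "${result}"
}

NO_PROXY="$(build_no_proxy "internal.example.com, .corp.example.com localhost")"
export NO_PROXY no_proxy="${NO_PROXY}"
echo "${NO_PROXY}"
```

Appending rather than replacing means a user cannot accidentally drop the metadata server from the bypass list, which is one plausible reason for the "user-configurable additions" wording.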

@cjac cjac self-assigned this Jan 23, 2026
@gemini-code-assist

Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the GPU driver installation script's reliability and adaptability, particularly in complex network environments requiring HTTP/HTTPS proxies and for systems utilizing Secure Boot. The changes focus on making the installation process more robust, configurable, and resilient to common issues like network restrictions and module signing requirements, while also refining the Conda environment setup and updating documentation.

Highlights

  • Enhanced Proxy Configuration: Introduced support for https-proxy and proxy-uri metadata, refined NO_PROXY handling with intelligent defaults, and enabled custom CA certificate installation into system, Java, and Conda trust stores. Proxy connections now default to HTTPS when a custom CA is provided, and the script includes pre-flight tests for proxy connectivity and reachability.
  • Robust GPG Key Import: A new import_gpg_keys function centralizes GPG key fetching from URLs or keyservers, ensuring it respects proxy settings and custom CA certificates, replacing previous direct curl | gpg --import calls.
  • Secure Boot Signing Refinements: The configure_dkms_certs function now consistently fetches keys from Secret Manager if private_secret_name is set, ensuring modulus_md5sum is available. Kernel module signing is integrated into the build process, with checks to verify modules are signed and loadable after installation.
  • Resilient Driver Installation: The install_nvidia_gpu_driver function now includes an initial check for the nvidia module's loadability, triggering a re-installation attempt if it fails. curl commands for downloads now incorporate retry flags and honor proxy settings.
  • Conda Environment Adjustments: The PyTorch Conda environment package list was streamlined by removing TensorFlow. Specific workarounds were added for Debian 10, including using conda instead of mamba and disabling SSL verification.
  • Documentation Updates: The gpu/README.md file has been updated to reflect the new proxy metadata (https-proxy, proxy-uri, no-proxy), detail the enhanced proxy support, and add proxy-related troubleshooting guidance.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances the GPU driver installation script by introducing robust proxy handling, improving Secure Boot integration, and refining Conda environment setup. Key improvements include flexible proxy configuration with support for HTTPS proxies and custom CA certificates, a new import_gpg_keys function for reliable GPG key fetching, and more thorough verification steps for signed kernel modules under Secure Boot. The documentation has also been updated to reflect these new features and provide better troubleshooting guidance. Overall, these changes make the script more resilient and configurable for diverse network environments and security requirements.

Comment on lines +1213 to 1234
# Verify signature after installation
if [[ -n "${PSN}" ]]; then
configure_dkms_certs

# Verify signatures and load
local signed=true
for module in $(find /lib/modules/${uname_r}/ -iname 'nvidia*.ko'); do
if ! modinfo "${module}" | grep -qi sig ; then
echo "ERROR: Module ${module} is NOT signed after installation."
signed=false
fi
done
if [[ "${signed}" != "true" ]]; then
echo "ERROR: Module signing failed."
exit 1
fi
execute_with_retries make -j$(nproc) modules \
> kernel-open/build.log \
2> kernel-open/build_error.log
# Sign kernel modules
if [[ -n "${PSN}" ]]; then
configure_dkms_certs
for module in $(find open-gpu-kernel-modules/kernel-open -name '*.ko'); do
"/lib/modules/${uname_r}/build/scripts/sign-file" sha256 \
"${mok_key}" \
"${mok_der}" \
"${module}"
done
clear_dkms_key

if ! modprobe nvidia; then
echo "ERROR: Failed to load nvidia module after build and sign."
exit 1
fi
make modules_install \
>> kernel-open/build.log \
2>> kernel-open/build_error.log
# Collect build logs and installed binaries
tar czvf "${local_tarball}" \
"${workdir}/open-gpu-kernel-modules/kernel-open/"*.log \
$(find /lib/modules/${uname_r}/ -iname 'nvidia*.ko')
${gsutil_cmd} cp "${local_tarball}" "${gcs_tarball}"
if ${gsutil_stat_cmd} "${gcs_tarball}.building" ; then ${gsutil_cmd} rm "${gcs_tarball}.building" || true ; fi
building_file=""
rm "${local_tarball}"
make clean
popd
echo "NVIDIA modules built, signed, and loaded successfully."
fi

critical

Adding post-installation verification for module signatures and loadability is a critical improvement for Secure Boot. This ensures that the installed kernel modules are correctly signed and can be loaded by the system, preventing silent failures in secure environments.
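
A minimal sketch of the signature check discussed here, written as a reusable function. The name `unsigned_modules` is hypothetical, and anchoring the match to lines starting with `sig` (the `signer:`, `sig_key:`, `sig_hashalgo:` and `signature:` fields of `modinfo`) is an assumption of this example; the PR itself uses a plain `grep -qi sig`, which would also match the substring anywhere in the output.

```bash
# unsigned_modules DIR
# Prints the path of every nvidia*.ko under DIR whose modinfo output carries
# no signature fields. Prints nothing when every module is signed.
unsigned_modules() {
  local dir="$1"
  local module
  find "${dir}" -iname 'nvidia*.ko' | while IFS= read -r module; do
    if ! modinfo "${module}" | grep -qi '^sig'; then
      printf '%s\n' "${module}"
    fi
  done
}

# Possible use inside the installer (not executed here):
#   if [ -n "$(unsigned_modules "/lib/modules/$(uname -r)")" ]; then
#     echo "ERROR: Module signing failed." >&2
#     exit 1
#   fi
```

Returning the list of offending paths, rather than a boolean, lets the caller log exactly which modules were left unsigned.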

Comment on lines +3271 to +3275
if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then
  if [[ -z "${trusted_pem_path:-}" ]]; then
    echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2
  else
    curl_retry_args+=(--cacert "${trusted_pem_path}")
  fi

high

The warning METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined indicates a potential issue. trusted_pem_path is only set within set_proxy if both a proxy (http-proxy/https-proxy) and a PEM URI are provided. If http-proxy-pem-uri is provided but no http-proxy or https-proxy is set, set_proxy returns early, leaving trusted_pem_path undefined. This could lead to GPG key imports failing to use the custom CA, even if the PEM URI is present.
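
One way to make this condition harder to miss is to treat "PEM URI set but no local CA file" as an error instead of a warning. The sketch below assumes a hypothetical helper, `ca_cert_args`, that callers such as `import_gpg_keys` would consult; it does not address where the certificate gets downloaded, only how the missing-file case is surfaced.

```bash
# ca_cert_args PEM_URI PEM_PATH
# Prints "--cacert" and PEM_PATH on separate lines when a PEM URI is
# configured and the local copy exists and is non-empty.
# Prints nothing when no PEM URI is configured.
# Returns 1 when the URI is set but no usable local file is available.
ca_cert_args() {
  local pem_uri="${1:-}"
  local pem_path="${2:-}"
  [ -n "${pem_uri}" ] || return 0            # no custom CA configured
  if [ -z "${pem_path}" ] || [ ! -s "${pem_path}" ]; then
    echo "ERROR: ${pem_uri} is set, but no local CA file is available." >&2
    return 1
  fi
  printf '%s\n%s\n' '--cacert' "${pem_path}"
}
```

In bash, the two output lines can be read into an array (for example with `mapfile -t`) and appended to `curl_retry_args`, so a path containing spaces stays a single argument.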

Comment on lines 2755 to +2768
      pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy"
-     cat > "${pkg_proxy_conf_file}" <<EOF
- Acquire::http::Proxy "http://${METADATA_HTTP_PROXY}";
- Acquire::https::Proxy "http://${METADATA_HTTP_PROXY}";
- EOF
+     echo "Acquire::http::Proxy \"http://${effective_proxy}\";" > "${pkg_proxy_conf_file}"
+     echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}"
+     echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
    elif is_rocky ; then
      pkg_proxy_conf_file="/etc/dnf/dnf.conf"
      touch "${pkg_proxy_conf_file}"
-     if grep -q "^proxy=" "${pkg_proxy_conf_file}"; then
-       sed -i.bak "s@^proxy=.*@proxy=${HTTP_PROXY}@" "${pkg_proxy_conf_file}"
-     elif grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
-       sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}"
+     sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}"
+     if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
+       sed -i.bak "/^\\\[main\\\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}"
      else
-       local TMP_FILE=$(mktemp)
-       printf "[main]\nproxy=%s\n" "${HTTP_PROXY}" > "${TMP_FILE}"
-       cat "${TMP_FILE}" "${pkg_proxy_conf_file}" > "${pkg_proxy_conf_file}".new
-       mv "${pkg_proxy_conf_file}".new "${pkg_proxy_conf_file}"
+       echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}"
      fi
+     echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}"
    fi

high

The apt and dnf proxy configurations (Acquire::http::Proxy "http://${effective_proxy}"; and proxy=http://${effective_proxy}) use an http:// prefix. If effective_proxy is derived solely from https_proxy_val (meaning only an HTTPS proxy was specified), this could lead to apt/dnf attempting to connect to an HTTPS proxy using an HTTP scheme. While a later sed command attempts to correct this if http-proxy-pem-uri is set, it might be incorrect if http-proxy-pem-uri is not provided.

Suggested change
echo "Acquire::http::Proxy \"${HTTP_PROXY}\";" > "${pkg_proxy_conf_file}"
echo "Acquire::https::Proxy \"${HTTPS_PROXY}\";" >> "${pkg_proxy_conf_file}"
echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
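
The scheme problem raised in this comment could also be handled by normalising the proxy value once, before it is written into any config file. The sketch below is an assumption-laden illustration: `proxy_url` is a hypothetical helper, and tying the `https://` choice to a boolean argument mirrors the PR description ("connections to the proxy now use HTTPS when a custom CA is provided") rather than any code shown here.

```bash
# proxy_url VALUE USE_TLS
# Prints VALUE with a scheme. An existing http:// or https:// prefix is kept;
# otherwise https:// is used when USE_TLS is "true", and http:// when it is not.
proxy_url() {
  local value="$1"
  local use_tls="${2:-false}"
  case "${value}" in
    http://*|https://*)
      printf '%s\n' "${value}"
      ;;
    *)
      if [ "${use_tls}" = "true" ]; then
        printf 'https://%s\n' "${value}"
      else
        printf 'http://%s\n' "${value}"
      fi
      ;;
  esac
}

proxy_url "your-proxy.com:3128" false
proxy_url "your-proxy.com:3128" true
```

With a helper like this, the apt and dnf snippets would write the returned value verbatim instead of hard-coding the `http://` prefix.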

      break
    fi
-   sleep 5m
+   sleep 1m # could take up to 180 minutes on single core nodes

medium

Changing the sleep duration from 5 minutes to 1 minute with a clarifying comment is a good optimization. It reduces potential waiting time on single-core nodes while still allowing for concurrent builds to complete.

Comment on lines +1506 to +1509
if ! modprobe nvidia > /dev/null 2>&1; then
  echo "NVIDIA module not loading. Removing completion marker to force re-install."
  mark_incomplete gpu-driver
fi

medium

The early check for nvidia module loadability and the subsequent mark_incomplete gpu-driver call significantly improve the resilience of the driver installation. If the module isn't loading, it correctly forces a re-installation attempt, which can resolve transient issues or incomplete previous installations.
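
The completion-marker pattern behind this check can be sketched as below. Only `mark_incomplete` and the `gpu-driver` marker name appear in the PR; `mark_complete`, `is_complete`, `ensure_driver` and the marker directory are assumptions made so the example is self-contained.

```bash
# Hypothetical marker directory; the real script's location may differ.
MARKER_DIR="${MARKER_DIR:-/tmp/install-markers}"

mark_complete()   { mkdir -p "${MARKER_DIR}" && touch "${MARKER_DIR}/$1"; }
mark_incomplete() { rm -f "${MARKER_DIR}/$1"; }
is_complete()     { [ -f "${MARKER_DIR}/$1" ]; }

# ensure_driver LOAD_CMD INSTALL_CMD
# Runs INSTALL_CMD unless the marker exists and LOAD_CMD succeeds.
# Both arguments are names of commands or functions taking no arguments.
ensure_driver() {
  local load_cmd="$1"
  local install_cmd="$2"
  if is_complete gpu-driver && ! "${load_cmd}" > /dev/null 2>&1; then
    echo "Module not loading. Removing completion marker to force re-install."
    mark_incomplete gpu-driver
  fi
  if ! is_complete gpu-driver; then
    "${install_cmd}" && mark_complete gpu-driver
  fi
}
```

A caller would wrap `modprobe nvidia` in a small function and pass its name as `LOAD_CMD`. The marker is only written after the install command succeeds, so a failed install is retried on the next run.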

Comment on lines +3039 to +3040
if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
time dd if=/dev/zero of=/zero status=none ; sync ; sleep 3s ; rm -f /zero
) fi
fi

medium

Similar to the exit_handler, disabling the dd if=/dev/zero operation in prepare_to_install using [[ "0" == "1" ]] is an effective way to prevent it from running. However, it would be clearer for future maintainers to either remove or explicitly comment out the code block if it's permanently disabled.

Suggested change
if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
time dd if=/dev/zero of=/zero status=none ; sync ; sleep 3s ; rm -f /zero
) fi
fi
# zero free disk space (only if creating image)
# This operation is currently disabled.
# if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
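
If the block is meant to stay available rather than be deleted, a named switch is a third option besides removing it or commenting it out. This is only a sketch: the `ZERO_FREE_SPACE` variable and the `should_zero_free_space` function are hypothetical and do not exist in the PR.

```bash
# Hypothetical switch; the operation stays off unless explicitly enabled.
ZERO_FREE_SPACE="${ZERO_FREE_SPACE:-false}"

# should_zero_free_space
# Succeeds only when the switch is on and a custom image is being built.
should_zero_free_space() {
  [ "${ZERO_FREE_SPACE}" = "true" ] && [ "${IS_CUSTOM_IMAGE_BUILD:-false}" = "true" ]
}

if should_zero_free_space; then
  echo "would zero free disk space here"
fi
```

Compared with `[[ "0" == "1" ]]`, the variable name documents why the branch is disabled and how to turn it back on.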


- curl ${curl_retry_args} "${signing_key_url}" \
-   | gpg --import --no-default-keyring --keyring "${kr_path}"
+ import_gpg_keys --keyring-file "${kr_path}" --key-url "${signing_key_url}"

medium

The adoption of the import_gpg_keys function in os_add_repo ensures consistent and proxy-aware handling of GPG key imports for all repositories. This is a significant improvement for maintainability and reliability in diverse network environments.

Comment on lines +3198 to +3354
# Function to download GPG keys from URLs or Keyservers and import them to a specific keyring
# Usage:
#   import_gpg_keys --keyring-file <PATH> \
#     [--key-url <URL1> [--key-url <URL2> ...]] \
#     [--key-id <ID1> [--key-id <ID2> ...]] \
#     [--keyserver <KEYSERVER_URI>]
function import_gpg_keys() {
  local keyring_file=""
  local key_urls=()
  local key_ids=()
  local keyserver="hkp://keyserver.ubuntu.com:80" # Default keyserver
  local tmpdir="${TMPDIR:-/tmp}" # Use TMPDIR if set, otherwise /tmp
  local curl_retry_args=(-sSLf --retry 3 --retry-delay 5) # Basic curl retry args

  # Parse named arguments
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --keyring-file)
        keyring_file="$2"
        shift 2
        ;;
      --key-url)
        key_urls+=("$2")
        shift 2
        ;;
      --key-id)
        key_ids+=("$2")
        shift 2
        ;;
      --keyserver)
        keyserver="$2"
        shift 2
        ;;
      *)
        echo "Unknown option: $1" >&2
        return 1
        ;;
    esac
  done

  # Validate arguments
  if [[ -z "${keyring_file}" ]]; then
    echo "ERROR: --keyring-file is required." >&2
    return 1
  fi
  if [[ ${#key_urls[@]} -eq 0 && ${#key_ids[@]} -eq 0 ]]; then
    echo "ERROR: At least one --key-url or --key-id must be specified." >&2
    return 1
  fi

  # Ensure the directory for the keyring file exists
  local keyring_dir
  keyring_dir=$(dirname "${keyring_file}")
  if [[ ! -d "${keyring_dir}" ]]; then
    echo "Creating directory for keyring: ${keyring_dir}"
    mkdir -p "${keyring_dir}"
  fi

  local tmp_key_file=""
  local success=true

  # Setup curl proxy arguments if environment variables are set
  local proxy_to_use=""
  if [[ -n "${HTTPS_PROXY:-}" ]]; then
    proxy_to_use="${HTTPS_PROXY}"
  elif [[ -n "${HTTP_PROXY:-}" ]]; then
    proxy_to_use="${HTTP_PROXY}"
  fi

  if [[ -n "${proxy_to_use}" ]]; then
    curl_retry_args+=(-x "${proxy_to_use}")
  fi

  if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then
    if [[ -z "${trusted_pem_path:-}" ]]; then
      echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2
    else
      curl_retry_args+=(--cacert "${trusted_pem_path}")
    fi
  fi

  # Process Key URLs
  for current_key_url in "${key_urls[@]}"; do
    echo "Attempting to download GPG key from URL: ${current_key_url}"
    tmp_key_file="${tmpdir}/key_$(basename "${current_key_url}")_$(date +%s).asc"

    if curl "${curl_retry_args[@]}" "${current_key_url}" -o "${tmp_key_file}"; then
      if [[ -s "${tmp_key_file}" ]]; then
        echo "Key file downloaded to ${tmp_key_file}."
        if gpg --no-default-keyring --keyring "${keyring_file}" --import "${tmp_key_file}"; then
          echo "Key from ${current_key_url} imported successfully to ${keyring_file}."
        else
          echo "ERROR: gpg --import failed for ${tmp_key_file} from ${current_key_url}." >&2
          success=false
        fi
      else
        echo "ERROR: Downloaded key file ${tmp_key_file} from ${current_key_url} is empty." >&2
        success=false
      fi
    else
      echo "ERROR: curl failed to download key from ${current_key_url}." >&2
      success=false
    fi
    [[ -f "${tmp_key_file}" ]] && rm -f "${tmp_key_file}"
  done

  # Process Key IDs
  for key_id in "${key_ids[@]}"; do
    # Strip 0x prefix if present
    clean_key_id="${key_id#0x}"
    echo "Attempting to fetch GPG key ID ${clean_key_id} using curl from ${keyserver}"

    local fallback_key_url
    local server_host
    server_host=$(echo "${keyserver}" | sed -e 's#hkp[s]*://##' -e 's#:[0-9]*##')

    # Common keyserver URL patterns
    if [[ "${server_host}" == "keyserver.ubuntu.com" ]]; then
      fallback_key_url="https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x${clean_key_id}"
    elif [[ "${server_host}" == "pgp.mit.edu" ]]; then
      fallback_key_url="https://pgp.mit.edu/pks/lookup?op=get&search=0x${clean_key_id}"
    elif [[ "${server_host}" == "keys.openpgp.org" ]]; then
      fallback_key_url="https://keys.openpgp.org/vks/v1/by-fpr/${clean_key_id}"
    else
      fallback_key_url="https://${server_host}/pks/lookup?op=get&search=0x${clean_key_id}"
      echo "WARNING: Using best-guess fallback URL for ${keyserver}: ${fallback_key_url}"
    fi

    tmp_key_file="${tmpdir}/${clean_key_id}.asc"
    if curl "${curl_retry_args[@]}" "${fallback_key_url}" -o "${tmp_key_file}"; then
      if [[ -s "${tmp_key_file}" ]]; then
        if grep -q -iE '<html|<head|<!DOCTYPE' "${tmp_key_file}"; then
          echo "ERROR: Output from keyserver for ${clean_key_id} appears to be HTML, not a key. Key likely not found at ${fallback_key_url}." >&2
          success=false
        elif gpg --no-default-keyring --keyring "${keyring_file}" --import "${tmp_key_file}"; then
          echo "Key ${clean_key_id} imported successfully to ${keyring_file}."
        else
          echo "ERROR: gpg --import failed for ${clean_key_id} from ${fallback_key_url}." >&2
          success=false
        fi
      else
        echo "ERROR: Downloaded key file for ${clean_key_id} is empty from ${fallback_key_url}." >&2
        success=false
      fi
    else
      echo "ERROR: curl failed to download key ${clean_key_id} from ${fallback_key_url}." >&2
      success=false
    fi
    [[ -f "${tmp_key_file}" ]] && rm -f "${tmp_key_file}"
  done

  if [[ "${success}" == "true" ]]; then
    return 0
  else
    echo "ERROR: One or more keys failed to import." >&2
    return 1
  fi
}

medium

The new import_gpg_keys function is a well-designed and crucial addition. It centralizes the logic for fetching GPG keys from URLs or keyservers, correctly handles proxy settings, and includes retry mechanisms. This significantly improves the robustness and maintainability of repository configuration across the script.
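
The keyserver-to-URL step of the function can be isolated and tested without network access. The sketch below covers only the generic HKP lookup path (`/pks/lookup?op=get&search=0x...`) and deliberately leaves out the keys.openpgp.org special case from the function above; `keyserver_lookup_url` is a hypothetical name for this example.

```bash
# keyserver_lookup_url KEYSERVER KEY_ID
# Prints an HTTPS lookup URL for KEY_ID on KEYSERVER using the HKP
# "pks/lookup" path. KEYSERVER may carry an hkp:// or hkps:// scheme and
# a :port suffix; KEY_ID may carry a leading 0x.
keyserver_lookup_url() {
  local keyserver="$1"
  local key_id="${2#0x}"      # strip the optional 0x prefix
  local host
  # Drop the scheme and any trailing :port.
  host="$(printf '%s' "${keyserver}" | sed -e 's#^hkps\{0,1\}://##' -e 's#:[0-9]*$##')"
  printf 'https://%s/pks/lookup?op=get&search=0x%s\n' "${host}" "${key_id}"
}

keyserver_lookup_url "hkp://keyserver.ubuntu.com:80" "0xABCD1234"
```

Keeping this logic in a pure function makes it possible to unit-test the URL construction separately from the `curl` and `gpg` calls.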

local key_ids=()
local keyserver="hkp://keyserver.ubuntu.com:80" # Default keyserver
local tmpdir="${TMPDIR:-/tmp}" # Use TMPDIR if set, otherwise /tmp
local curl_retry_args=(-sSLf --retry 3 --retry-delay 5) # Basic curl retry args

medium

The curl_retry_args in import_gpg_keys are set to (-sSLf --retry 3 --retry-delay 5). This is less robust than the curl_retry_args (-fsSL --retry-connrefused --retry 10 --retry-max-time 30) used elsewhere in the script. Given that GPG key imports are critical for repository setup, using the more aggressive retry settings would improve reliability, especially in flaky network environments.

Suggested change
- local curl_retry_args=(-sSLf --retry 3 --retry-delay 5) # Basic curl retry args
+ local curl_retry_args=(-fsSL --retry-connrefused --retry 10 --retry-max-time 30) # Basic curl retry args
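
An alternative to repeating the flag list in each function is a single wrapper, so the retry policy lives in one place. This is a sketch under that assumption; `curl_with_retries` is a hypothetical name, and the flags are the ones quoted in the comment above.

```bash
# curl_with_retries ARGS...
# Calls curl with the shared retry policy followed by the caller's arguments.
curl_with_retries() {
  curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 "$@"
}

# Possible use (not executed here):
#   curl_with_retries -x "${HTTPS_PROXY}" -o /tmp/key.asc "https://example.com/key.asc"
```

Callers still add per-call options such as `-x` or `--cacert`, but a change to the retry policy then needs only one edit.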

Comment on lines +195 to +207
If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda
environment.
* `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment.
Default: `dpgce`.
* `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`.
For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
- * `http-proxy`: (Optional) URL of an HTTP proxy for downloads.
+ * `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
+ * `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
+ * `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
+ * `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults.
  * `http-proxy-pem-uri`: (Optional) A `gs://` path to the
-   PEM-encoded certificate file used by the proxy specified in
-   `http-proxy`. This is needed if the proxy uses TLS and its
-   certificate is not already trusted by the cluster's default trust
-   store (e.g., if it's a self-signed certificate or signed by an
-   internal CA). The script will install this certificate into the
-   system and Java trust stores.
+   PEM-encoded CA certificate file for the proxy specified in
+   `http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.

medium

The updated descriptions for include-pytorch, http-proxy, https-proxy, proxy-uri, no-proxy, and http-proxy-pem-uri metadata parameters provide much-needed clarity on the new proxy configuration options and their usage. Replacing TensorFlow with Numba in the include-pytorch description also reflects the updated Conda environment package list.
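
The precedence among `http-proxy`, `https-proxy` and `proxy-uri` described in the README text can be read in more than one way. The sketch below shows one possible reading: each specific key wins over `proxy-uri`, and `https-proxy` falls back to `http-proxy` before falling back to `proxy-uri`. The function name `resolve_proxies` is hypothetical, and the actual `set_proxy` logic may differ.

```bash
# resolve_proxies HTTP_PROXY_META HTTPS_PROXY_META PROXY_URI_META
# Prints two lines: the effective HTTP proxy, then the effective HTTPS proxy.
# Empty arguments mean "metadata key not set".
resolve_proxies() {
  local http_val="${1:-}"
  local https_val="${2:-}"
  local uri_val="${3:-}"
  # proxy-uri is only a fallback for whichever specific value is missing.
  local effective_http="${http_val:-${uri_val}}"
  local effective_https="${https_val:-${http_val:-${uri_val}}}"
  printf '%s\n%s\n' "${effective_http}" "${effective_https}"
}

resolve_proxies "your-proxy.com:3128" "" ""
```

Stating the fallback order in one small function also gives the README a single behaviour to document.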


cjac commented Jan 24, 2026

/gcbrun
