From ae9be1e191ca3cf5e30f0d0a5a5b28d67397f8a7 Mon Sep 17 00:00:00 2001 From: "C.J. Collier" Date: Fri, 23 Jan 2026 22:43:15 +0000 Subject: [PATCH] Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script This commit significantly enhances the robustness and configurability of the GPU driver installation script, particularly for environments with HTTP/HTTPS proxies and those using Secure Boot. **Key Changes:** * **Enhanced Proxy Configuration (`set_proxy`):** * Added support for `https-proxy` and `proxy-uri` metadata, providing more flexibility in proxy setups. * Improved `NO_PROXY` handling with sensible defaults (including Google APIs) and user-configurable additions. * Integrated support for custom proxy CA certificates via `http-proxy-pem-uri`, including installation into system, Java, and Conda trust stores. * Connections to the proxy now use HTTPS when a custom CA is provided. * Added proxy connection and reachability tests to fail fast on misconfiguration. * Ensures `curl`, `apt`, `dnf`, `gpg`, and Java all respect the proxy settings. * **Robust GPG Key Import (`import_gpg_keys`):** * Introduced a new function to reliably import GPG keys from URLs or keyservers, fully respecting the configured proxy and custom CA settings. * This replaces direct `curl | gpg --import` calls, making key fetching more resilient in restricted network environments. * **Secure Boot Signing Refinements:** * The `configure_dkms_certs` function now always fetches keys from Secret Manager if `private_secret_name` is set, ensuring `modulus_md5sum` is available for GCS cache paths. * Kernel module signing is now more clearly integrated into the build process. * Improved checks to ensure modules are actually signed and loadable after installation when Secure Boot is active. * **Resilient Driver Installation:** * The script now checks if the `nvidia` module can be loaded at the beginning of `install_nvidia_gpu_driver` and will re-attempt installation if it fails. * `curl` calls for downloading drivers and other artifacts now use retry flags and honor proxy settings. * **Conda Environment for PyTorch:** * Adjusted package list for Conda environment, removing TensorFlow to streamline. * Added specific workarounds for Debian 10, using `conda` instead of `mamba`. * **Documentation Updates (`gpu/README.md`):** * Added details on the new proxy metadata: `https-proxy`, `proxy-uri`, `no-proxy`. * Created a new section "Enhanced Proxy Support" explaining the features. * Updated `http-proxy-pem-uri` description. * Added proxy considerations to the "Troubleshooting" section. These changes aim to make the GPU initialization action more reliable across a wider range of network environments and improve the Secure Boot workflow. --- gpu/README.md | 35 +- gpu/install_gpu_driver.sh | 863 +++++++++++++++++++++++++++----------- 2 files changed, 632 insertions(+), 266 deletions(-) diff --git a/gpu/README.md b/gpu/README.md index c4b2935eb..7a1b59e84 100644 --- a/gpu/README.md +++ b/gpu/README.md @@ -15,6 +15,7 @@ Specifying a supported value for the `cuda-version` metadata variable will select compatible values for Driver, cuDNN, and NCCL from the script's internal matrix. Default CUDA versions are typically: + * Dataproc 1.5: `11.6.2` * Dataproc 2.0: `12.1.1` * Dataproc 2.1: `12.4.1` * Dataproc 2.2 & 2.3: `12.6.3` @@ -191,20 +192,19 @@ This script accepts the following metadata parameters: * `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`). * `nccl-version`: (Optional) Specify NCCL version. 
* `include-pytorch`: (Optional) `yes`|`no`. Default: `no`. - If `yes`, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda + If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda environment. * `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment. Default: `dpgce`. * `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`. For NVIDIA Container Toolkit configuration. Auto-detected if not specified. - * `http-proxy`: (Optional) URL of an HTTP proxy for downloads. + * `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`). + * `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set. + * `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set. + * `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults. * `http-proxy-pem-uri`: (Optional) A `gs://` path to the - PEM-encoded certificate file used by the proxy specified in - `http-proxy`. This is needed if the proxy uses TLS and its - certificate is not already trusted by the cluster's default trust - store (e.g., if it's a self-signed certificate or signed by an - internal CA). The script will install this certificate into the - system and Java trust stores. + PEM-encoded CA certificate file for the proxy specified in + `http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS. * `invocation-type`: (For Custom Images) Set to `custom-images` by image building tools. Not typically set by end-users creating clusters. * **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and @@ -217,6 +217,20 @@ This script accepts the following metadata parameters: modulus_md5sum= ``` +### Enhanced Proxy Support + +This script includes robust support for environments requiring an HTTP/HTTPS proxy: + + * **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port). + * **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path). The script will: + * Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`). + * Add the CA to the Java cacerts trust store. + * Configure Conda to use the system trust store. + * Switch proxy communications to use HTTPS. + * **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, and Java to use the specified proxy settings and custom CA if provided. + * **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly. + * **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads. 
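+
+For example, a cluster that reaches the internet through a TLS-inspecting proxy might be created along the following lines (the proxy host, bucket names, accelerator, and region below are placeholders to adapt):
+
+```
+gcloud dataproc clusters create proxied-gpu-cluster \
+    --region us-central1 \
+    --worker-accelerator type=nvidia-tesla-t4,count=1 \
+    --initialization-actions gs://my-bucket/gpu/install_gpu_driver.sh \
+    --metadata http-proxy=proxy.example.com:3128,https-proxy=proxy.example.com:3128,http-proxy-pem-uri=gs://my-bucket/proxy_ca.pem
+```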
+ ### Loading Built Kernel Module & Secure Boot When the script needs to build NVIDIA kernel modules from source (e.g., using @@ -280,6 +294,7 @@ handles metric creation and reporting. * **Installation Failures:** Examine the initialization action log on the affected node, typically `/var/log/dataproc-initialization-script-0.log` (or a similar name if multiple init actions are used). + * **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures. * **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`), check its service logs using `sudo journalctl -u gpu-utilization-agent.service`. * **Driver Load or Secure Boot Problems:** Review `dmesg` output and @@ -298,7 +313,7 @@ handles metric creation and reporting. * The script extensively caches downloaded artifacts (drivers, CUDA `.run` files) and compiled components (kernel modules, NCCL, Conda environments) to a GCS bucket. This bucket is typically specified by the - `dataproc-temp-bucket` cluster property or metadata. + `dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware. * **First Run / Cache Warming:** Initial runs on new configurations (OS, kernel, or driver version combinations) that require source compilation (e.g., for NCCL or kernel modules when no pre-compiled version is @@ -324,4 +339,4 @@ handles metric creation and reporting. Debian-based systems, including handling of archived backports repositories to ensure dependencies can be met. * Tested primarily with Dataproc 2.0+ images. Support for older Dataproc - 1.5 images is limited. \ No newline at end of file + 1.5 images is limited. diff --git a/gpu/install_gpu_driver.sh b/gpu/install_gpu_driver.sh index 9a1ee94cd..6ba740aa3 100644 --- a/gpu/install_gpu_driver.sh +++ b/gpu/install_gpu_driver.sh @@ -276,19 +276,19 @@ function set_driver_version() { echo "Checking for cached NVIDIA driver at: ${gcs_cache_path}" - if ! gsutil -q stat "${gcs_cache_path}"; then + if ! ${gsutil_stat_cmd} "${gcs_cache_path}" 2>/dev/null; then echo "Driver not found in GCS cache. Validating URL: ${gpu_driver_url}" # Use curl to check if the URL is valid (HEAD request) - if curl -sSLfI --connect-timeout 10 --max-time 30 "${gpu_driver_url}" 2>/dev/null | grep -E -q 'HTTP.*200'; then + if curl -fLIv --retry-connrefused --retry 3 --retry-max-time 10 --proxy "${HTTP_PROXY:-}" ${METADATA_HTTP_PROXY_PEM_URI:+"--cacert"} ${trusted_pem_path:-} "${gpu_driver_url}" ; then echo "NVIDIA URL is valid. Downloading to cache..." local temp_driver_file="${tmpdir}/${driver_filename}" # Download the file echo "Downloading from ${gpu_driver_url} to ${temp_driver_file}" - if curl -sSLf -o "${temp_driver_file}" "${gpu_driver_url}"; then + if curl -fLv --retry-connrefused --retry 3 --retry-max-time 10 -o "${temp_driver_file}" --proxy "${HTTP_PROXY:-}" ${METADATA_HTTP_PROXY_PEM_URI:+"--cacert"} ${trusted_pem_path:-} "${gpu_driver_url}"; then echo "Download complete. Uploading to ${gcs_cache_path}" # Upload to GCS - if gsutil cp "${temp_driver_file}" "${gcs_cache_path}"; then + if ${gsutil_cmd} cp "${temp_driver_file}" "${gcs_cache_path}"; then echo "Successfully cached to GCS."
rm -f "${temp_driver_file}" else @@ -893,12 +893,26 @@ function install_pytorch() { if test -d "${envpath}" ; then verb=install ; fi cudart_spec="cuda-cudart" if le_cuda11 ; then cudart_spec="cudatoolkit" ; fi + # Select the conda front-end; Dataproc 2.0 and earlier (Debian 10) fall back from mamba to conda below + local conda_path="${conda_root_path}/bin/mamba" + + conda_pkg_list=( + "numba" "pytorch" "rapids" "pyspark" "cuda-version<=${CUDA_VERSION}" "${cudart_spec}" + ) + + if version_le "${DATAPROC_IMAGE_VERSION}" "2.0" ; then + + conda_pkg_list=("numba" "pytorch" "rapids" "pyspark" "cudatoolkit<=12.4" "python=3.10") + + conda_path="${conda_root_path}/bin/conda" + fi + conda_pkg=$( IFS=' ' ; echo "${conda_pkg_list[*]}" ) + # Install pytorch and company to this environment - "${conda_root_path}/bin/mamba" "${verb}" -n "${env}" \ + "${conda_path}" "${verb}" -n "${env}" \ -c conda-forge -c nvidia -c rapidsai \ - numba pytorch tensorflow[and-cuda] rapids pyspark \ - "cuda-version<=${CUDA_VERSION}" "${cudart_spec}" + ${conda_pkg} # Install jupyter kernel in this environment "${envpath}/bin/python3" -m pip install ipykernel @@ -923,70 +937,47 @@ function configure_dkms_certs() { echo "No signing secret provided. skipping"; return 0 fi - if [[ -f "${mok_der}" ]] ; then return 0; fi - - mkdir -p "${CA_TMPDIR}" - - # If the private key exists, verify it - if [[ -f "${CA_TMPDIR}/db.rsa" ]]; then - echo "Private key material exists" - local expected_modulus_md5sum - expected_modulus_md5sum=$(get_metadata_attribute modulus_md5sum) - if [[ -n "${expected_modulus_md5sum}" ]]; then - modulus_md5sum="${expected_modulus_md5sum}" - - # Verify that cert md5sum matches expected md5sum - if [[ "${modulus_md5sum}" != "$(openssl rsa -noout -modulus -in "${CA_TMPDIR}/db.rsa" | openssl md5 | awk '{print $2}')" ]]; then - echo "unmatched rsa key" - fi - - # Verify that key md5sum matches expected md5sum - if [[ "${modulus_md5sum}" != "$(openssl x509 -noout -modulus -in ${mok_der} | openssl md5 | awk '{print $2}')" ]]; then - echo "unmatched x509 cert" - fi - else - modulus_md5sum="$(openssl rsa -noout -modulus -in "${CA_TMPDIR}/db.rsa" | openssl md5 | awk '{print $2}')" - fi + # Always fetch keys if PSN is set to ensure modulus_md5sum is calculated.
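+  # (modulus_md5sum is part of the GCS cache path for prebuilt signed kernel
+  # modules, so it must be derived from the same signing key that sign-file
+  # will use; it is computed below via
+  # `openssl rsa -noout -modulus -in "${mok_key}" | openssl md5`.)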
+ if [[ -n "${PSN}" ]]; then + mkdir -p "${CA_TMPDIR}" + + # Retrieve cloud secrets keys + local sig_priv_secret_name + sig_priv_secret_name="${PSN}" + local sig_pub_secret_name + sig_pub_secret_name="$(get_metadata_attribute public_secret_name)" + local sig_secret_project + sig_secret_project="$(get_metadata_attribute secret_project)" + local sig_secret_version + sig_secret_version="$(get_metadata_attribute secret_version)" + + # If metadata values are not set, do not write mok keys + if [[ -z "${sig_priv_secret_name}" ]]; then return 0 ; fi + + # Write private material to volatile storage + gcloud secrets versions access "${sig_secret_version}" \ + --project="${sig_secret_project}" \ + --secret="${sig_priv_secret_name}" \ + | dd status=none of="${CA_TMPDIR}/db.rsa" + + # Write public material to volatile storage + gcloud secrets versions access "${sig_secret_version}" \ + --project="${sig_secret_project}" \ + --secret="${sig_pub_secret_name}" \ + | base64 --decode \ + | dd status=none of="${CA_TMPDIR}/db.der" + + local mok_directory="$(dirname "${mok_key}")" + mkdir -p "${mok_directory}" + + # symlink private key and copy public cert from volatile storage to DKMS directory ln -sf "${CA_TMPDIR}/db.rsa" "${mok_key}" + cp -f "${CA_TMPDIR}/db.der" "${mok_der}" - return + modulus_md5sum="$(openssl rsa -noout -modulus -in "${mok_key}" | openssl md5 | awk '{print $2}')" + echo "DEBUG: modulus_md5sum set to: ${modulus_md5sum}" fi - - # Retrieve cloud secrets keys - local sig_priv_secret_name - sig_priv_secret_name="${PSN}" - local sig_pub_secret_name - sig_pub_secret_name="$(get_metadata_attribute public_secret_name)" - local sig_secret_project - sig_secret_project="$(get_metadata_attribute secret_project)" - local sig_secret_version - sig_secret_version="$(get_metadata_attribute secret_version)" - - # If metadata values are not set, do not write mok keys - if [[ -z "${sig_priv_secret_name}" ]]; then return 0 ; fi - - # Write private material to volatile storage - gcloud secrets versions access "${sig_secret_version}" \ - --project="${sig_secret_project}" \ - --secret="${sig_priv_secret_name}" \ - | dd status=none of="${CA_TMPDIR}/db.rsa" - - # Write public material to volatile storage - gcloud secrets versions access "${sig_secret_version}" \ - --project="${sig_secret_project}" \ - --secret="${sig_pub_secret_name}" \ - | base64 --decode \ - | dd status=none of="${CA_TMPDIR}/db.der" - - local mok_directory="$(dirname "${mok_key}")" - mkdir -p "${mok_directory}" - - # symlink private key and copy public cert from volatile storage to DKMS directory - ln -sf "${CA_TMPDIR}/db.rsa" "${mok_key}" - cp -f "${CA_TMPDIR}/db.der" "${mok_der}" - - modulus_md5sum="$(openssl rsa -noout -modulus -in "${mok_key}" | openssl md5 | awk '{print $2}')" } function clear_dkms_key { @@ -1042,10 +1033,11 @@ function add_repo_nvidia_container_toolkit() { elif [[ -v http_proxy ]] ; then GPG_PROXY="--keyserver-options http-proxy=${http_proxy}" fi - execute_with_retries gpg --keyserver keyserver.ubuntu.com \ - ${GPG_PROXY_ARGS} \ - --no-default-keyring --keyring "${kr_path}" \ - --recv-keys "0xae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80" "0xeb693b3035cd5710e231e123a4b469963bf863cc" "0xc95b321b61e88c1809c4f759ddcae044f796ecb0" + import_gpg_keys --keyring-file "${kr_path}" \ + --key-id "0xae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80" \ + --key-id "0xeb693b3035cd5710e231e123a4b469963bf863cc" \ + --key-id "0xc95b321b61e88c1809c4f759ddcae044f796ecb0" + local -r repo_data="${nvctk_root}/stable/deb/\$(ARCH) /" local -r 
repo_path="/etc/apt/sources.list.d/${repo_name}.list" echo "deb [signed-by=${kr_path}] ${repo_data}" > "${repo_path}" @@ -1074,9 +1066,9 @@ function add_repo_cuda() { elif [[ -n "${http_proxy}" ]] ; then GPG_PROXY="--keyserver-options http-proxy=${http_proxy}" fi - execute_with_retries gpg --keyserver keyserver.ubuntu.com ${GPG_PROXY_ARGS} \ - --no-default-keyring --keyring "${kr_path}" \ - --recv-keys "0xae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80" "0xeb693b3035cd5710e231e123a4b469963bf863cc" + import_gpg_keys --keyring-file "${kr_path}" \ + --key-id "0xae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80" \ + --key-id "0xeb693b3035cd5710e231e123a4b469963bf863cc" else install_cuda_keyring_pkg # 11.7+, 12.0+ fi @@ -1085,13 +1077,57 @@ function add_repo_cuda() { fi } +function execute_github_driver_build() { + pushd open-gpu-kernel-modules + install_build_dependencies + if ( is_cuda11 && is_ubuntu22 ) ; then + echo "Kernel modules cannot be compiled for CUDA 11 on ${_shortname}" + exit 1 + fi + execute_with_retries make -j$(nproc) modules \ + > kernel-open/build.log \ + 2> kernel-open/build_error.log + make modules_install + # Sign kernel modules + if [[ -n "${PSN}" ]]; then + configure_dkms_certs + echo "DEBUG: mok_key=${mok_key}" + echo "DEBUG: mok_der=${mok_der}" + if [[ -f "${mok_key}" ]]; then ls -l "${mok_key}"; fi + if [[ -f "${mok_der}" ]]; then ls -l "${mok_der}"; fi + set -x + for module in $(find /lib/modules/${uname_r}/kernel/drivers/video -name '*nvidia*.ko') ; do + echo "DEBUG: Signing ${module}" + "/lib/modules/${uname_r}/build/scripts/sign-file" sha256 \ + "${mok_key}" \ + "${mok_der}" \ + "${module}" + echo "DEBUG: sign-file result: $?" + done + set +x + clear_dkms_key + fi + # Collect build logs and installed binaries + tar czvf "${local_tarball}" \ + "${workdir}/open-gpu-kernel-modules/kernel-open/"*.log \ + $(find /lib/modules/${uname_r}/ -iname 'nvidia*.ko') + ${gsutil_cmd} cp "${local_tarball}" "${gcs_tarball}" + if ${gsutil_stat_cmd} "${gcs_tarball}.building" ; then ${gsutil_cmd} rm "${gcs_tarball}.building" || true ; fi + building_file="" + rm "${local_tarball}" + make clean + popd +} + function build_driver_from_github() { # non-GPL driver will have been built on rocky8, or when driver # version is prior to open driver min, or GPU architecture is prior # to Turing if ( is_rocky8 \ || version_lt "${DRIVER_VERSION}" "${MIN_OPEN_DRIVER_VER}" \ - || [[ "$((16#${pci_device_id}))" < "$((16#1E00))" ]] ) ; then return 0 ; fi + || [[ "$((16#${pci_device_id}))" < "$((16#1E00))" ]] ) ; then + return 0 + fi pushd "${workdir}" test -d "${workdir}/open-gpu-kernel-modules" || { tarball_fn="${DRIVER_VERSION}.tar.gz" @@ -1100,9 +1136,35 @@ function build_driver_from_github() { \| tar xz mv "open-gpu-kernel-modules-${DRIVER_VERSION}" open-gpu-kernel-modules } + local nvidia_ko_path="$(find /lib/modules/$(uname -r)/ -name 'nvidia.ko' | head -n1)" + + local needs_build=false + if [[ -n "${nvidia_ko_path}" && -f "${nvidia_ko_path}" ]]; then + if modinfo "${nvidia_ko_path}" | grep -qi sig ; then + echo "NVIDIA kernel module found and appears signed." + # Try to load it to be sure + if ! modprobe nvidia > /dev/null 2>&1; then + echo "Module signed but failed to load. Rebuilding." + needs_build=true + else + echo "Module loaded successfully." + fi + else + echo "NVIDIA kernel module found but NOT signed. Rebuilding." + needs_build=true + fi + else + echo "NVIDIA kernel module not found. Building." 
+ needs_build=true + fi + + + if [[ "${needs_build}" == "true" ]]; then + # Configure certs to get modulus_md5sum for the path + if [[ -n "${PSN}" ]]; then + configure_dkms_certs + fi - local nvidia_ko_path="$(find /lib/modules/$(uname -r)/ -name 'nvidia.ko')" - test -n "${nvidia_ko_path}" && test -f "${nvidia_ko_path}" || { local build_tarball="kmod_${_shortname}_${DRIVER_VERSION}.tar.gz" local local_tarball="${workdir}/${build_tarball}" local build_dir @@ -1128,7 +1190,7 @@ function build_driver_from_github() { ${gsutil_cmd} rm "${gcs_tarball}.building" || echo "might have been deleted by a peer" break fi - sleep 5m + sleep 1m # could take up to 180 minutes on single core nodes done fi fi @@ -1140,45 +1202,39 @@ function build_driver_from_github() { touch "${local_tarball}.building" ${gsutil_cmd} cp "${local_tarball}.building" "${gcs_tarball}.building" building_file="${gcs_tarball}.building" - pushd open-gpu-kernel-modules - install_build_dependencies - if ( is_cuda11 && is_ubuntu22 ) ; then - echo "Kernel modules cannot be compiled for CUDA 11 on ${_shortname}" + + execute_github_driver_build + fi + + ${gsutil_cmd} cat "${gcs_tarball}" | tar -C / -xzv + depmod -a + + # Verify signature after installation + if [[ -n "${PSN}" ]]; then + configure_dkms_certs + + # Verify signatures and load + local signed=true + for module in $(find /lib/modules/${uname_r}/ -iname 'nvidia*.ko'); do + if ! modinfo "${module}" | grep -qi sig ; then + echo "ERROR: Module ${module} is NOT signed after installation." + signed=false + fi + done + if [[ "${signed}" != "true" ]]; then + echo "ERROR: Module signing failed." exit 1 fi - execute_with_retries make -j$(nproc) modules \ - > kernel-open/build.log \ - 2> kernel-open/build_error.log - # Sign kernel modules - if [[ -n "${PSN}" ]]; then - configure_dkms_certs - for module in $(find open-gpu-kernel-modules/kernel-open -name '*.ko'); do - "/lib/modules/${uname_r}/build/scripts/sign-file" sha256 \ - "${mok_key}" \ - "${mok_der}" \ - "${module}" - done - clear_dkms_key + + if ! modprobe nvidia; then + echo "ERROR: Failed to load nvidia module after build and sign." + exit 1 fi - make modules_install \ - >> kernel-open/build.log \ - 2>> kernel-open/build_error.log - # Collect build logs and installed binaries - tar czvf "${local_tarball}" \ - "${workdir}/open-gpu-kernel-modules/kernel-open/"*.log \ - $(find /lib/modules/${uname_r}/ -iname 'nvidia*.ko') - ${gsutil_cmd} cp "${local_tarball}" "${gcs_tarball}" - if ${gsutil_stat_cmd} "${gcs_tarball}.building" ; then ${gsutil_cmd} rm "${gcs_tarball}.building" || true ; fi - building_file="" - rm "${local_tarball}" - make clean - popd + echo "NVIDIA modules built, signed, and loaded successfully." fi - ${gsutil_cmd} cat "${gcs_tarball}" | tar -C / -xzv - depmod -a - } + fi - popd + popd # ${workdir} } function build_driver_from_packages() { @@ -1446,6 +1502,12 @@ function install_nvidia_container_toolkit() { # Install NVIDIA GPU driver provided by NVIDIA function install_nvidia_gpu_driver() { + if ! modprobe nvidia > /dev/null 2>&1; then + echo "NVIDIA module not loading. Removing completion marker to force +re-install." 
+ mark_incomplete gpu-driver + fi + is_complete gpu-driver && return if [[ "${gpu_count}" == "0" ]] ; then return ; fi @@ -1511,7 +1573,7 @@ function install_gpu_agent() { "${python_interpreter}" -m venv "${venv}" ( source "${venv}/bin/activate" - if [[ -v METADATA_HTTP_PROXY_PEM_URI ]]; then + if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then export REQUESTS_CA_BUNDLE="${trusted_pem_path}" pip install pip-system-certs unset REQUESTS_CA_BUNDLE @@ -2030,6 +2092,10 @@ function create_deferred_config_files() { # Deferred configuration script generated by install_gpu_driver.sh set -xeuo pipefail +readonly config_script_path="${config_script_path}" +readonly service_name="${service_name}" +readonly service_file="${service_file}" + # --- Minimal necessary functions and variables --- # Define constants readonly HADOOP_CONF_DIR='/etc/hadoop/conf' @@ -2355,8 +2421,7 @@ function clean_up_sources_lists() { local -r bigtop_kr_path="/usr/share/keyrings/bigtop-keyring.gpg" rm -f "${bigtop_kr_path}" - curl ${curl_retry_args} \ - "${bigtop_key_uri}" | gpg --dearmor -o "${bigtop_kr_path}" + import_gpg_keys --keyring-file "${bigtop_kr_path}" --key-url "${bigtop_key_uri}" sed -i -e "s:deb https:deb [signed-by=${bigtop_kr_path}] https:g" "${dataproc_repo_file}" sed -i -e "s:deb-src https:deb-src [signed-by=${bigtop_kr_path}] https:g" "${dataproc_repo_file}" @@ -2373,10 +2438,9 @@ function clean_up_sources_lists() { if test -f "${old_adoptium_list}" ; then rm -f "${old_adoptium_list}" fi - for keyid in "0x3b04d753c9050d9a5d343f39843c48a565f8f04b" "0x35baa0b33e9eb396f59ca838c0ba5ce6dc6315a3" ; do - curl ${curl_retry_args} "https://keyserver.ubuntu.com/pks/lookup?op=get&search=${keyid}" \ - | gpg --import --no-default-keyring --keyring "${adoptium_kr_path}" - done + import_gpg_keys --keyring-file "${adoptium_kr_path}" \ + --key-id "0x3b04d753c9050d9a5d343f39843c48a565f8f04b" \ + --key-id "0x35baa0b33e9eb396f59ca838c0ba5ce6dc6315a3" echo "deb [signed-by=${adoptium_kr_path}] https://packages.adoptium.net/artifactory/deb/ $(os_codename) main" \ > /etc/apt/sources.list.d/adoptium.list @@ -2388,8 +2452,7 @@ function clean_up_sources_lists() { local -r docker_key_url="https://download.docker.com/linux/$(os_id)/gpg" rm -f "${docker_kr_path}" - curl ${curl_retry_args} "${docker_key_url}" \ - | gpg --import --no-default-keyring --keyring "${docker_kr_path}" + import_gpg_keys --keyring-file "${docker_kr_path}" --key-url "${docker_key_url}" echo "deb [signed-by=${docker_kr_path}] https://download.docker.com/linux/$(os_id) $(os_codename) stable" \ > ${docker_repo_file} @@ -2399,8 +2462,7 @@ function clean_up_sources_lists() { local gcloud_kr_path="/usr/share/keyrings/cloud.google.gpg" if ls /etc/apt/sources.list.d/google-clou*.list ; then rm -f "${gcloud_kr_path}" - curl ${curl_retry_args} https://packages.cloud.google.com/apt/doc/apt-key.gpg \ - | gpg --import --no-default-keyring --keyring "${gcloud_kr_path}" + import_gpg_keys --keyring-file "${gcloud_kr_path}" --key-url "https://packages.cloud.google.com/apt/doc/apt-key.gpg" for list in google-cloud google-cloud-logging google-cloud-monitoring ; do list_file="/etc/apt/sources.list.d/${list}.list" if [[ -f "${list_file}" ]]; then @@ -2415,10 +2477,9 @@ function clean_up_sources_lists() { if [[ -f /etc/apt/sources.list.d/cran-r.list ]]; then local cranr_kr_path="/usr/share/keyrings/cran-r.gpg" rm -f "${cranr_kr_path}" - for keyid in "0x95c0faf38db3ccad0c080a7bdc78b2ddeabc47b7" "0xe298a3a825c0d65dfd57cbb651716619e084dab9" ; do - curl 
${curl_retry_args} "https://keyserver.ubuntu.com/pks/lookup?op=get&search=${keyid}" \ - | gpg --import --no-default-keyring --keyring "${cranr_kr_path}" - done + import_gpg_keys --keyring-file "${cranr_kr_path}" \ + --key-id "0x95c0faf38db3ccad0c080a7bdc78b2ddeabc47b7" \ + --key-id "0xe298a3a825c0d65dfd57cbb651716619e084dab9" sed -i -e "s:deb http:deb [signed-by=${cranr_kr_path}] http:g" /etc/apt/sources.list.d/cran-r.list fi @@ -2427,8 +2488,9 @@ function clean_up_sources_lists() { # if [[ -f /etc/apt/sources.list.d/mysql.list ]]; then rm -f /usr/share/keyrings/mysql.gpg - curl ${curl_retry_args} 'https://keyserver.ubuntu.com/pks/lookup?op=get&search=0xBCA43417C3B485DD128EC6D4B7B3B788A8D3785C' | \ - gpg --dearmor -o /usr/share/keyrings/mysql.gpg + + import_gpg_keys --keyring-file /usr/share/keyrings/mysql.gpg --key-id "0xBCA43417C3B485DD128EC6D4B7B3B788A8D3785C" + sed -i -e 's:deb https:deb [signed-by=/usr/share/keyrings/mysql.gpg] https:g' /etc/apt/sources.list.d/mysql.list fi @@ -2534,7 +2596,7 @@ print( " samples-taken: ", scalar @siz, $/, echo "exit_handler has completed" # zero free disk space (only if creating image) - if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then + if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then dd if=/dev/zero of=/zero status=progress || true sync sleep 3s @@ -2545,165 +2607,288 @@ print( " samples-taken: ", scalar @siz, $/, } function set_proxy(){ - METADATA_HTTP_PROXY="$(get_metadata_attribute http-proxy '')" + local meta_http_proxy meta_https_proxy meta_proxy_uri + meta_http_proxy=$(get_metadata_attribute 'http-proxy' '') + meta_https_proxy=$(get_metadata_attribute 'https-proxy' '') + meta_proxy_uri=$(get_metadata_attribute 'proxy-uri' '') - if [[ -z "${METADATA_HTTP_PROXY}" ]] ; then return ; fi + echo "DEBUG: set_proxy: meta_http_proxy='${meta_http_proxy}'" + echo "DEBUG: set_proxy: meta_https_proxy='${meta_https_proxy}'" + echo "DEBUG: set_proxy: meta_proxy_uri='${meta_proxy_uri}'" - no_proxy_list=("localhost" "127.0.0.0/8" "::1" "metadata.google.internal" "169.254.169.254") + local http_proxy_val="" + local https_proxy_val="" - services=( compute secretmanager dns servicedirectory networkmanagement - bigquery composer pubsub bigquerydatatransfer networkservices - storage datafusion dataproc certificatemanager networksecurity - dataflow privateca logging ) + # Determine HTTP_PROXY value + if [[ -n "${meta_http_proxy}" ]] && [[ "${meta_http_proxy}" != ":" ]]; then + http_proxy_val="${meta_http_proxy}" + elif [[ -n "${meta_proxy_uri}" ]] && [[ "${meta_proxy_uri}" != ":" ]]; then + http_proxy_val="${meta_proxy_uri}" + fi - for svc in "${services[@]}"; do - no_proxy_list+=("${svc}.googleapis.com") - done + # Determine HTTPS_PROXY value + if [[ -n "${meta_https_proxy}" ]] && [[ "${meta_https_proxy}" != ":" ]]; then + https_proxy_val="${meta_https_proxy}" + elif [[ -n "${meta_proxy_uri}" ]] && [[ "${meta_proxy_uri}" != ":" ]]; then + https_proxy_val="${meta_proxy_uri}" + fi + + if [[ -z "${http_proxy_val}" && -z "${https_proxy_val}" ]]; then + echo "DEBUG: set_proxy: No valid proxy metadata found (http-proxy, https-proxy, or proxy-uri). Skipping proxy setup." 
+ return 0 + fi + + + + local default_no_proxy_list=( + + "localhost" + + "127.0.0.1" + + "::1" + + "metadata.google.internal" + + "169.254.169.254" + + # *** Add Google APIs to NO_PROXY for Private Google Access *** + + ".google.com" - no_proxy="$( IFS=',' ; echo "${no_proxy_list[*]}" )" + ".googleapis.com" - export http_proxy="http://${METADATA_HTTP_PROXY}" - export https_proxy="http://${METADATA_HTTP_PROXY}" - export no_proxy - export HTTP_PROXY="http://${METADATA_HTTP_PROXY}" - export HTTPS_PROXY="http://${METADATA_HTTP_PROXY}" - export NO_PROXY="${no_proxy}" + ) + + + + local user_no_proxy + + user_no_proxy=$(get_metadata_attribute 'no-proxy' '') + + local user_no_proxy_list=() + + if [[ -n "${user_no_proxy}" ]]; then + + # Replace spaces with commas, then split by comma + + IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy// /,}" + + fi + + + + local combined_no_proxy_list=( "${default_no_proxy_list[@]}" "${user_no_proxy_list[@]}" ) + + local no_proxy + + no_proxy=$( IFS=',' ; echo "${combined_no_proxy_list[*]}" ) + + export NO_PROXY="${no_proxy}" + + export no_proxy="${no_proxy}" + + - # configure gcloud - gcloud config set proxy/type http - gcloud config set proxy/address "${METADATA_HTTP_PROXY%:*}" - gcloud config set proxy/port "${METADATA_HTTP_PROXY#*:}" + # Export environment variables + if [[ -n "${http_proxy_val}" ]]; then + export HTTP_PROXY="http://${http_proxy_val}" + export http_proxy="http://${http_proxy_val}" + else + unset HTTP_PROXY + unset http_proxy + fi + echo "DEBUG: set_proxy: Initial HTTP_PROXY='${HTTP_PROXY:-}'" - # add proxy environment variables to /etc/environment - grep http_proxy /etc/environment || echo "http_proxy=${http_proxy}" >> /etc/environment - grep https_proxy /etc/environment || echo "https_proxy=${https_proxy}" >> /etc/environment - grep no_proxy /etc/environment || echo "no_proxy=${no_proxy}" >> /etc/environment - grep HTTP_PROXY /etc/environment || echo "HTTP_PROXY=${HTTP_PROXY}" >> /etc/environment - grep HTTPS_PROXY /etc/environment || echo "HTTPS_PROXY=${HTTPS_PROXY}" >> /etc/environment - grep NO_PROXY /etc/environment || echo "NO_PROXY=${NO_PROXY}" >> /etc/environment + if [[ -n "${https_proxy_val}" ]]; then + export HTTPS_PROXY="http://${https_proxy_val}" + export https_proxy="http://${https_proxy_val}" + else + unset HTTPS_PROXY + unset https_proxy + fi + echo "DEBUG: set_proxy: Initial HTTPS_PROXY='${HTTPS_PROXY:-}'" + + # Clear existing proxy settings in /etc/environment + sed -i -e '/^http_proxy=/d' -e '/^https_proxy=/d' -e '/^no_proxy=/d' \ + -e '/^HTTP_PROXY=/d' -e '/^HTTPS_PROXY=/d' -e '/^NO_PROXY=/d' /etc/environment + + # Add current proxy environment variables to /etc/environment + if [[ -n "${HTTP_PROXY:-}" ]]; then echo "HTTP_PROXY=${HTTP_PROXY}" >> /etc/environment; fi + if [[ -n "${http_proxy:-}" ]]; then echo "http_proxy=${http_proxy}" >> /etc/environment; fi + if [[ -n "${HTTPS_PROXY:-}" ]]; then echo "HTTPS_PROXY=${HTTPS_PROXY}" >> /etc/environment; fi + if [[ -n "${https_proxy:-}" ]]; then echo "https_proxy=${https_proxy}" >> /etc/environment; fi + echo "DEBUG: set_proxy: Effective HTTP_PROXY=${HTTP_PROXY:-}" + echo "DEBUG: set_proxy: Effective HTTPS_PROXY=${HTTPS_PROXY:-}" + echo "DEBUG: set_proxy: Effective NO_PROXY=${NO_PROXY:-}" + + if [[ -n "${http_proxy_val}" ]]; then + local proxy_host=$(echo "${http_proxy_val}" | cut -d: -f1) + local proxy_port=$(echo "${http_proxy_val}" | cut -d: -f2) + + echo "DEBUG: set_proxy: Testing TCP connection to proxy ${proxy_host}:${proxy_port}..." + if ! 
nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then + echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}." + exit 1 + else + echo "DEBUG: set_proxy: TCP connection to proxy successful." + fi + echo "DEBUG: set_proxy: Testing external site access via proxy..." + local test_url="https://www.google.com" + if curl -vL --retry-connrefused --retry 2 --retry-max-time 10 --proxy "${HTTP_PROXY}" -o /dev/null "${test_url}"; then + echo "DEBUG: set_proxy: Successfully fetched ${test_url} via proxy." + else + echo "ERROR: Failed to fetch ${test_url} via proxy ${HTTP_PROXY}." + exit 1 + fi + fi + + # Configure package managers local pkg_proxy_conf_file - if is_debuntu ; then - # configure Apt to use the proxy: + local effective_proxy="${http_proxy_val:-${https_proxy_val}}" # Use a single value for apt/dnf + + if [[ -z "${effective_proxy}" ]]; then + echo "DEBUG: set_proxy: No HTTP or HTTPS proxy set for package managers." + elif is_debuntu ; then pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy" - cat > "${pkg_proxy_conf_file}" < "${pkg_proxy_conf_file}" + echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}" + echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}" elif is_rocky ; then pkg_proxy_conf_file="/etc/dnf/dnf.conf" - touch "${pkg_proxy_conf_file}" - - if grep -q "^proxy=" "${pkg_proxy_conf_file}"; then - sed -i.bak "s@^proxy=.*@proxy=${HTTP_PROXY}@" "${pkg_proxy_conf_file}" - elif grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then - sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}" + sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}" + if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then + sed -i.bak "/^\\\[main\\\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}" else - local TMP_FILE=$(mktemp) - printf "[main]\nproxy=%s\n" "${HTTP_PROXY}" > "${TMP_FILE}" - - cat "${TMP_FILE}" "${pkg_proxy_conf_file}" > "${pkg_proxy_conf_file}".new - mv "${pkg_proxy_conf_file}".new "${pkg_proxy_conf_file}" + echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}" + fi + echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}" + fi - rm "${TMP_FILE}" + # Configure dirmngr to use the HTTP proxy if set + if is_debuntu ; then + if ! dpkg -l | grep -q dirmngr; then + echo "DEBUG: set_proxy: dirmngr package not found, installing..." + execute_with_retries apt-get install -y -qq dirmngr + fi + elif is_rocky ; then + if ! rpm -q gnupg2-smime; then + echo "DEBUG: set_proxy: gnupg2-smime package not found, installing..." + execute_with_retries dnf install -y -q gnupg2-smime fi - else - echo "unknown OS" - exit 1 fi - # configure gpg to use the proxy: - if ! grep 'keyserver-options http-proxy' /etc/gnupg/dirmngr.conf ; then - mkdir -p /etc/gnupg - cat >> /etc/gnupg/dirmngr.conf <> "${dirmngr_conf}" + echo "DEBUG: set_proxy: Configured dirmngr proxy in ${dirmngr_conf}" fi - # Install the HTTPS proxy's certificate in the system and Java trust databases + # Install the HTTPS proxy's certificate METADATA_HTTP_PROXY_PEM_URI="$(get_metadata_attribute http-proxy-pem-uri '')" + if [[ -z "${METADATA_HTTP_PROXY_PEM_URI}" ]] ; then + echo "DEBUG: set_proxy: No http-proxy-pem-uri metadata found. Skipping cert install." + return 0 + fi + if [[ ! "${METADATA_HTTP_PROXY_PEM_URI}" =~ ^gs:// ]] ; then echo "ERROR: http-proxy-pem-uri value must start with gs://" ; exit 1 ; fi - if [[ -z "${METADATA_HTTP_PROXY_PEM_URI}" ]] ; then return ; fi - if [[ ! 
"${METADATA_HTTP_PROXY_PEM_URI}" =~ ^gs ]] ; then echo "http-proxy-pem-uri value should start with gs://" ; exit 1 ; fi - - local trusted_pem_dir - # Add this certificate to the OS trust database - # When proxy cert is provided, speak to the proxy over https + echo "DEBUG: set_proxy: http-proxy-pem-uri='${METADATA_HTTP_PROXY_PEM_URI}'" + local trusted_pem_dir proxy_ca_pem ca_subject if is_debuntu ; then trusted_pem_dir="/usr/local/share/ca-certificates" - mkdir -p "${trusted_pem_dir}" proxy_ca_pem="${trusted_pem_dir}/proxy_ca.crt" gsutil cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}" update-ca-certificates - trusted_pem_path="/etc/ssl/certs/ca-certificates.crt" - sed -i -e 's|http://|https://|' "${pkg_proxy_conf_file}" + export trusted_pem_path="/etc/ssl/certs/ca-certificates.crt" + if [[ -n "${effective_proxy}" ]]; then + sed -i -e 's|http://|https://|' "${pkg_proxy_conf_file}" + fi elif is_rocky ; then trusted_pem_dir="/etc/pki/ca-trust/source/anchors" - mkdir -p "${trusted_pem_dir}" proxy_ca_pem="${trusted_pem_dir}/proxy_ca.crt" gsutil cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}" update-ca-trust - trusted_pem_path="/etc/ssl/certs/ca-bundle.crt" - sed -i -e 's|^proxy=http://|proxy=https://|' "${pkg_proxy_conf_file}" - else - echo "unknown OS" - exit 1 + export trusted_pem_path="/etc/ssl/certs/ca-bundle.crt" + if [[ -n "${effective_proxy}" ]]; then + sed -i -e "s|^proxy=http://|proxy=https://|" "${pkg_proxy_conf_file}" + fi fi + export REQUESTS_CA_BUNDLE="${trusted_pem_path}" + echo "DEBUG: set_proxy: trusted_pem_path='${trusted_pem_path}'" - # configure gcloud to respect proxy ca cert - #gcloud config set core/custom_ca_certs_file "${proxy_ca_pem}" + local proxy_host="${http_proxy_val:-${https_proxy_val}}" + # Update env vars to use https + if [[ -n "${http_proxy_val}" ]]; then + export HTTP_PROXY="https://${http_proxy_val}" + export http_proxy="https://${http_proxy_val}" + fi + if [[ -n "${https_proxy_val}" ]]; then + export HTTPS_PROXY="https://${https_proxy_val}" + export https_proxy="https://${https_proxy_val}" + fi + sed -i -e 's|http://|https://|g' /etc/environment + echo "DEBUG: set_proxy: Final HTTP_PROXY='${HTTP_PROXY:-}'" + echo "DEBUG: set_proxy: Final HTTPS_PROXY='${HTTPS_PROXY:-}'" + + if [[ -n "${http_proxy_val}" ]]; then + sed -i -e "s|^http-proxy http://.*|http-proxy https://${http_proxy_val}|" /etc/gnupg/dirmngr.conf + fi + + # Verification steps from original script... 
ca_subject="$(openssl crl2pkcs7 -nocrl -certfile "${proxy_ca_pem}" | openssl pkcs7 -print_certs -noout | grep ^subject)" - # Verify that the proxy certificate is trusted - local output - output=$(echo | openssl s_client \ - -connect "${METADATA_HTTP_PROXY}" \ - -proxy "${METADATA_HTTP_PROXY}" \ - -CAfile "${proxy_ca_pem}") || { - echo "proxy certificate verification failed" - echo "${output}" - exit 1 - } - output=$(echo | openssl s_client \ - -connect "${METADATA_HTTP_PROXY}" \ - -proxy "${METADATA_HTTP_PROXY}" \ - -CAfile "${trusted_pem_path}") || { - echo "proxy ca certificate not included in system bundle" - echo "${output}" - exit 1 - } - output=$(curl --verbose -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://google.com" 2>&1)|| { - echo "curl rejects proxy configuration" - echo "${curl_output}" - exit 1 - } - output=$(curl --verbose -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run" 2>&1)|| { - echo "curl rejects proxy configuration" - echo "${output}" - exit 1 - } + openssl s_client -connect "${proxy_host}" -CAfile "${proxy_ca_pem}" < /dev/null || { echo "ERROR: proxy cert verification failed" ; exit 1 ; } + openssl s_client -connect "${proxy_host}" -CAfile "${trusted_pem_path}" < /dev/null || { echo "ERROR: proxy ca not in system bundle" ; exit 1 ; } + + curl --verbose --cacert "${trusted_pem_path}" -x "${HTTPS_PROXY}" -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://google.com" || { echo "ERROR: curl rejects proxy config for google.com" ; exit 1 ; } + curl --verbose --cacert "${trusted_pem_path}" -x "${HTTPS_PROXY}" -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://developer.download.nvidia.com" || { echo "ERROR: curl rejects proxy config for nvidia.com" ; exit 1 ; } - # Instruct conda to use the system certificate - echo "Attempting to install pip-system-certs using the proxy certificate..." - export REQUESTS_CA_BUNDLE="${trusted_pem_path}" pip install pip-system-certs unset REQUESTS_CA_BUNDLE - # For the binaries bundled with conda, append our certificate to the bundle - openssl crl2pkcs7 -nocrl -certfile /opt/conda/default/ssl/cacert.pem | openssl pkcs7 -print_certs -noout | grep -Fx "${ca_subject}" || { - cat "${proxy_ca_pem}" >> /opt/conda/default/ssl/cacert.pem - } + if command -v conda &> /dev/null ; then + local conda_cert_file="/opt/conda/default/ssl/cacert.pem" + if [[ -f "${conda_cert_file}" ]]; then + openssl crl2pkcs7 -nocrl -certfile "${conda_cert_file}" | openssl pkcs7 -print_certs -noout | grep -Fxq "${ca_subject}" || { + cat "${proxy_ca_pem}" >> "${conda_cert_file}" + } + fi + fi + + if [[ -f "/etc/environment" ]]; then + JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)" + if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then + "${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}" + fi + fi + + echo "DEBUG: set_proxy: Verifying proxy connectivity..." 
- sed -i -e 's|http://|https://|' /etc/gnupg/dirmngr.conf - export http_proxy="https://${METADATA_HTTP_PROXY}" - export https_proxy="https://${METADATA_HTTP_PROXY}" - export HTTP_PROXY="https://${METADATA_HTTP_PROXY}" - export HTTPS_PROXY="https://${METADATA_HTTP_PROXY}" - sed -i -e 's|proxy=http://|proxy=https://|' -e 's|PROXY=http://|PROXY=https://|' /etc/environment + # Test fetching a small file through the proxy before any large downloads + local test_url="https://www.gstatic.com/generate_204" + + echo "DEBUG: set_proxy: Attempting to download ${test_url} via proxy ${HTTPS_PROXY}" + if curl -fsSL --cacert "${trusted_pem_path}" --retry-connrefused --retry 3 --retry-max-time 30 --connect-timeout 10 --proxy "${HTTPS_PROXY:-${HTTP_PROXY}}" -o /dev/null "${test_url}"; then + echo "DEBUG: set_proxy: Successfully downloaded test file through proxy." + else + echo "ERROR: Proxy test failed. Unable to download ${test_url} via ${HTTPS_PROXY}" + exit 1 + fi - # Instruct the JRE to trust the certificate - JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)" - "${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}" + echo "DEBUG: set_proxy: Proxy verification successful." + + echo "DEBUG: set_proxy: Proxy setup complete." } function mount_ramdisk(){ @@ -2765,6 +2950,10 @@ function prepare_to_install(){ # Verify OS compatability and Secure boot state check_os check_secure_boot + # Setup temporary directories (potentially on RAM disk) + tmpdir=/tmp/ # Default + mount_ramdisk # Updates tmpdir if successful + install_log="${tmpdir}/install.log" # Set install log path based on final tmpdir set_proxy # --- Detect Image Build Context --- @@ -2811,11 +3000,6 @@ function prepare_to_install(){ # ["NVIDIA-Linux-x86_64-550.135.run"]="a8c3ae0076f11e864745fac74bfdb01f" # ["NVIDIA-Linux-x86_64-550.142.run"]="e507e578ecf10b01a08e5424dddb25b8" - # Setup temporary directories (potentially on RAM disk) - tmpdir=/tmp/ # Default - mount_ramdisk # Updates tmpdir if successful - install_log="${tmpdir}/install.log" # Set install log path based on final tmpdir - workdir=/opt/install-dpgce # Set GCS bucket for caching temp_bucket="$(get_metadata_attribute dataproc-temp-bucket)" @@ -2851,9 +3035,9 @@ function prepare_to_install(){ fi # zero free disk space (only if creating image) - if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then ( set +e + if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then time dd if=/dev/zero of=/zero status=none ; sync ; sleep 3s ; rm -f /zero - ) fi + fi install_dependencies @@ -2953,8 +3137,7 @@ function os_add_repo() { mkdir -p "$(dirname "${kr_path}")" - curl ${curl_retry_args} "${signing_key_url}" \ - | gpg --import --no-default-keyring --keyring "${kr_path}" + import_gpg_keys --keyring-file "${kr_path}" --key-url "${signing_key_url}" if is_debuntu ; then apt_add_repo "${repo_name}" "${signing_key_url}" "${repo_data}" "${4:-yes}" "${kr_path}" "${6:-}" else dnf_add_repo "${repo_name}" "${signing_key_url}" "${repo_data}" "${4:-yes}" "${kr_path}" "${6:-}" ; fi @@ -3011,6 +3194,174 @@ function install_spark_rapids() { "${spark_jars_dir}/${jar_basename}" } +# Function to download GPG keys from
URLs or Keyservers and import them to a specific keyring +# Usage: +# import_gpg_keys --keyring-file \ +# [--key-url [--key-url ...]] \ +# [--key-id [--key-id ...]] \ +# [--keyserver ] +function import_gpg_keys() { + local keyring_file="" + local key_urls=() + local key_ids=() + local keyserver="hkp://keyserver.ubuntu.com:80" # Default keyserver + local tmpdir="${TMPDIR:-/tmp}" # Use TMPDIR if set, otherwise /tmp + local curl_retry_args=(-sSLf --retry 3 --retry-delay 5) # Basic curl retry args + + # Parse named arguments + while [[ $# -gt 0 ]]; do + case "$1" in + --keyring-file) + keyring_file="$2" + shift 2 + ;; + --key-url) + key_urls+=("$2") + shift 2 + ;; + --key-id) + key_ids+=("$2") + shift 2 + ;; + --keyserver) + keyserver="$2" + shift 2 + ;; + *) + echo "Unknown option: $1" >&2 + return 1 + ;; + esac + done + + # Validate arguments + if [[ -z "${keyring_file}" ]]; then + echo "ERROR: --keyring-file is required." >&2 + return 1 + fi + if [[ ${#key_urls[@]} -eq 0 && ${#key_ids[@]} -eq 0 ]]; then + echo "ERROR: At least one --key-url or --key-id must be specified." >&2 + return 1 + fi + + # Ensure the directory for the keyring file exists + local keyring_dir + keyring_dir=$(dirname "${keyring_file}") + if [[ ! -d "${keyring_dir}" ]]; then + echo "Creating directory for keyring: ${keyring_dir}" + mkdir -p "${keyring_dir}" + fi + + local tmp_key_file="" + local success=true + + # Setup curl proxy arguments if environment variables are set + local proxy_to_use="" + if [[ -n "${HTTPS_PROXY:-}" ]]; then + proxy_to_use="${HTTPS_PROXY}" + elif [[ -n "${HTTP_PROXY:-}" ]]; then + proxy_to_use="${HTTP_PROXY}" + fi + + if [[ -n "${proxy_to_use}" ]]; then + curl_retry_args+=(-x "${proxy_to_use}") + fi + + if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then + if [[ -z "${trusted_pem_path:-}" ]]; then + echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2 + else + curl_retry_args+=(--cacert "${trusted_pem_path}") + fi + fi + + # Process Key URLs + for current_key_url in "${key_urls[@]}"; do + echo "Attempting to download GPG key from URL: ${current_key_url}" + tmp_key_file="${tmpdir}/key_$(basename "${current_key_url}")_$(date +%s).asc" + + if curl "${curl_retry_args[@]}" "${current_key_url}" -o "${tmp_key_file}"; then + if [[ -s "${tmp_key_file}" ]]; then + echo "Key file downloaded to ${tmp_key_file}." + if gpg --no-default-keyring --keyring "${keyring_file}" --import "${tmp_key_file}"; then + echo "Key from ${current_key_url} imported successfully to ${keyring_file}." + else + echo "ERROR: gpg --import failed for ${tmp_key_file} from ${current_key_url}." >&2 + success=false + fi + else + echo "ERROR: Downloaded key file ${tmp_key_file} from ${current_key_url} is empty." >&2 + success=false + fi + else + echo "ERROR: curl failed to download key from ${current_key_url}." 
>&2 + success=false + fi + [[ -f "${tmp_key_file}" ]] && rm -f "${tmp_key_file}" + done + + # Process Key IDs + for key_id in "${key_ids[@]}"; do + # Strip 0x prefix if present + clean_key_id="${key_id#0x}" + echo "Attempting to fetch GPG key ID ${clean_key_id} using curl from ${keyserver}" + + local fallback_key_url + local server_host + server_host=$(echo "${keyserver}" | sed -e 's#hkp[s]*://##' -e 's#:[0-9]*##') + + # Common keyserver URL patterns + if [[ "${server_host}" == "keyserver.ubuntu.com" ]]; then + fallback_key_url="https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x${clean_key_id}" + elif [[ "${server_host}" == "pgp.mit.edu" ]]; then + fallback_key_url="https://pgp.mit.edu/pks/lookup?op=get&search=0x${clean_key_id}" + elif [[ "${server_host}" == "keys.openpgp.org" ]]; then + fallback_key_url="https://keys.openpgp.org/vks/v1/by-fpr/${clean_key_id}" + else + fallback_key_url="https://${server_host}/pks/lookup?op=get&search=0x${clean_key_id}" + echo "WARNING: Using best-guess fallback URL for ${keyserver}: ${fallback_key_url}" + fi + + tmp_key_file="${tmpdir}/${clean_key_id}.asc" + if curl "${curl_retry_args[@]}" "${fallback_key_url}" -o "${tmp_key_file}"; then + if [[ -s "${tmp_key_file}" ]]; then + if grep -q -iE '<html|<!DOCTYPE' "${tmp_key_file}"; then + echo "ERROR: Downloaded key ${clean_key_id} from ${fallback_key_url} appears to be an HTML error page, not a key." >&2 + success=false + elif gpg --no-default-keyring --keyring "${keyring_file}" --import "${tmp_key_file}"; then + echo "Key ${clean_key_id} imported successfully to ${keyring_file}." + else + echo "ERROR: gpg --import failed for ${clean_key_id} from ${fallback_key_url}." >&2 + success=false + fi + else + echo "ERROR: Downloaded key file for ${clean_key_id} is empty from ${fallback_key_url}." >&2 + success=false + fi + else + echo "ERROR: curl failed to download key ${clean_key_id} from ${fallback_key_url}." >&2 + success=false + fi + [[ -f "${tmp_key_file}" ]] && rm -f "${tmp_key_file}" + done + + if [[ "${success}" == "true" ]]; then + return 0 + else + echo "ERROR: One or more keys failed to import." >&2 + return 1 + fi +} + +# Example Usage (uncomment to test) +# import_gpg_keys --keyring-file "/tmp/test-keyring.gpg" --key-url "https://nvidia.github.io/libnvidia-container/gpgkey" +# import_gpg_keys --keyring-file "/tmp/test-keyring.gpg" --key-id "A040830F7FAC5991" +# import_gpg_keys --keyring-file "/tmp/test-keyring.gpg" --key-id "B82D541C" --keyserver "hkp://keyserver.ubuntu.com:80" + +# To use this in another script: +# source ./gpg-import.sh +# import_gpg_keys --keyring-file "/usr/share/keyrings/my-repo.gpg" --key-url "https://example.com/repo.key" + # --- Script Entry Point --- prepare_to_install # Run preparation steps first main # Call main logic