
[ET-VK][CI] Add test-vulkan-genai job for Parakeet on NVIDIA GPU runner#18335

Open
SS-JIA wants to merge 6 commits into gh/SS-JIA/497/base from gh/SS-JIA/497/head

Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Mar 19, 2026

Stack from ghstack (oldest at bottom):

Add a new GitHub CI job that exports and runs the Parakeet TDT model with
the Vulkan backend on an NVIDIA GPU runner. The Vulkan export and runner
code already exists but had no CI coverage.

  • Add --gpu flag to setup-vulkan-linux-deps.sh to skip SwiftShader
    installation when running on machines with a real GPU driver
  • Add vulkan as a supported device in export_model_artifact.sh and
    test_model_e2e.sh
  • Add test-vulkan-genai job to pull.yml on linux.g5.4xlarge.nvidia.gpu
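
The flag described in the first bullet can be sketched as a small gate (hypothetical helper; the real setup-vulkan-linux-deps.sh may parse its arguments differently):

```shell
# Hypothetical sketch: decide whether to install SwiftShader based on a
# --gpu flag, as described above. Not the actual script contents.
should_install_swiftshader() {
  for arg in "$@"; do
    if [ "$arg" = "--gpu" ]; then
      return 1  # real GPU driver present: skip SwiftShader
    fi
  done
  return 0  # no --gpu flag: fall back to SwiftShader
}

if should_install_swiftshader "$@"; then
  echo "installing SwiftShader"
else
  echo "skipping SwiftShader (using the host GPU driver)"
fi
```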

Differential Revision: D97344728

cc @manuelcandales @digantdesai @cbilgin

@pytorch-bot

pytorch-bot bot commented Mar 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18335

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 208 Pending

As of commit affcc89 with merge base fb1618e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

SS-JIA pushed a commit that referenced this pull request Mar 19, 2026
ghstack-source-id: 354758467
Pull Request resolved: #18335
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@SS-JIA SS-JIA temporarily deployed to upload-benchmark-results March 19, 2026 20:09 — with GitHub Actions Inactive
SS-JIA pushed a commit that referenced this pull request Mar 19, 2026
@SS-JIA SS-JIA temporarily deployed to upload-benchmark-results March 19, 2026 23:53 — with GitHub Actions Inactive
SS-JIA pushed a commit that referenced this pull request Mar 20, 2026
- Fix Parakeet CMakeLists.txt to guard `quantized_ops_lib` and `custom_ops`
  with `if(TARGET ...)` and IMPORTED checks. When the Parakeet runner is
  built as a standalone CMake project (the second step of `make
  parakeet-vulkan`), these targets are found via `find_package(executorch)`
  as imported targets. The existing code called
  `executorch_target_link_options_shared_lib()` on them unconditionally,
  which internally calls `target_link_options()` — and CMake does not allow
  `target_link_options()` on imported targets. Other targets in the same
  file (e.g. `optimized_native_cpu_ops_lib`, `xnnpack_backend`) already had
  the correct guards; `quantized_ops_lib` and `custom_ops` were simply
  missing them.
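
A sketch of the guard described in the last bullet, assuming the target name from this PR (illustrative, not the exact CMakeLists.txt change):

```cmake
# Only apply link options to quantized_ops_lib when it exists as a
# regular (non-imported) target; find_package(executorch) produces
# imported targets, which reject target_link_options().
if(TARGET quantized_ops_lib)
  get_target_property(_qol_imported quantized_ops_lib IMPORTED)
  if(NOT _qol_imported)
    executorch_target_link_options_shared_lib(quantized_ops_lib)
  endif()
endif()
```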
ghstack-source-id: 355090034
@exported-using-ghexport

Differential Revision: [D97344728](https://our.internmc.facebook.com/intern/diff/D97344728/)
@SS-JIA SS-JIA temporarily deployed to upload-benchmark-results March 20, 2026 07:22 — with GitHub Actions Inactive
SS-JIA pushed a commit that referenced this pull request Mar 27, 2026
@SS-JIA SS-JIA temporarily deployed to upload-benchmark-results March 27, 2026 19:23 — with GitHub Actions Inactive
@SS-JIA
Contributor Author

SS-JIA commented Mar 27, 2026

@claude investigate the test-vulkan-genai job failure

@claude

claude bot commented Mar 27, 2026

Claude finished @SS-JIA's task in 3m 58s


Investigating test-vulkan-genai Job Failure

  • Get CI status and identify the failing workflow run
  • Download and analyze job logs
  • Review relevant code changes
  • Report findings

Root Cause

The build fails at examples/models/parakeet/timestamp_utils.h:10 with:

fatal error: 'pytorch/tokenizers/tokenizer.h' file not found

This happens during the make parakeet-vulkan step inside test_model_e2e.sh.

Analysis

The CI job script (.github/workflows/pull.yml:1356-1375) runs two build stages:

  1. setup-linux.sh — Configures and builds ExecuTorch into cmake-out/ using raw cmake with only -DEXECUTORCH_BUILD_VULKAN=ON. This creates a CMake cache where most extensions are OFF, including:

    • EXECUTORCH_BUILD_EXTENSION_LLM=OFF
    • EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=OFF
    • EXECUTORCH_BUILD_EXTENSION_MODULE=OFF
    • EXECUTORCH_BUILD_EXTENSION_TENSOR=OFF
    • etc.
  2. make parakeet-vulkan (called from test_model_e2e.sh) — Runs cmake --workflow --preset llm-debug-vulkan, which uses the llm.cmake preset file. This preset attempts to enable EXECUTORCH_BUILD_EXTENSION_LLM=ON (among other things) via set_overridable_option. However, set_overridable_option (tools/cmake/common/preset.cmake:98-106) checks if(NOT DEFINED ${NAME}) — since the CMake cache in cmake-out/ already has these variables defined (set to OFF from step 1), the preset cannot override them.

As a result, the tokenizers library (built under EXECUTORCH_BUILD_EXTENSION_LLM) is never compiled or installed. When the parakeet runner tries to #include <pytorch/tokenizers/tokenizer.h>, the header doesn't exist.

The CI log confirms this at the configure output (log line ~23618):

EXECUTORCH_BUILD_EXTENSION_LLM              : OFF
EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER       : OFF
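
The if(NOT DEFINED ...) behavior described above can be illustrated with a minimal sketch of such a macro (an assumed shape, not the real definition in tools/cmake/common/preset.cmake):

```cmake
# A cached variable set by an earlier configure wins: the macro only
# assigns when the name is not already defined, so a stale cmake-out/
# cache silently overrides the preset's intended value.
macro(set_overridable_option NAME VALUE)
  if(NOT DEFINED ${NAME})
    set(${NAME} ${VALUE} CACHE STRING "Overridable build option" FORCE)
  endif()
endmacro()
```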

The Fix

The test-vulkan-genai CI job needs the setup-linux.sh step to also build with the LLM extensions enabled, or the cmake-out directory should be cleaned between the two build stages so the llm-debug-vulkan preset can configure from scratch.

Option A (Recommended): Skip the setup-linux.sh step entirely and let make parakeet-vulkan handle the full build. The parakeet-vulkan Makefile target already calls cmake --workflow --preset llm-debug-vulkan (which builds everything needed) before building the parakeet runner. The setup-linux.sh step is redundant (and harmful due to cache conflicts) when using the Makefile workflow.

The CI script should look like:

# Setup Vulkan SDK (no SwiftShader — use NVIDIA driver's Vulkan ICD)
source .ci/scripts/setup-vulkan-linux-deps.sh --gpu

# Install ExecuTorch Python package (for export step)
# ... (just the pip install, not the cmake build)

# Export parakeet with Vulkan backend
bash .ci/scripts/export_model_artifact.sh vulkan "nvidia/parakeet-tdt" "quantized-8da4w" "${RUNNER_ARTIFACT_DIR}"

# Build runner and test e2e (make parakeet-vulkan handles the full C++ build)
bash .ci/scripts/test_model_e2e.sh vulkan "nvidia/parakeet-tdt" "quantized-8da4w" "${RUNNER_ARTIFACT_DIR}"

Option B: Add additional CMAKE_ARGS to the setup-linux.sh call:

CMAKE_ARGS="-DEXECUTORCH_BUILD_VULKAN=ON -DEXECUTORCH_BUILD_EXTENSION_LLM=ON -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON"

Option C: Clean cmake-out between the two build stages so the preset can configure cleanly:

rm -rf cmake-out
make parakeet-vulkan

Note: The existing test-parakeet-xnnpack-linux job likely works because the XNNPACK preset path doesn't have this same cache conflict, or the required extensions happen to be enabled by default.

@SS-JIA
Contributor Author

SS-JIA commented Apr 6, 2026

@claude investigate the test-vulkan-genai job failure

@claude

claude bot commented Apr 6, 2026

Claude finished @SS-JIA's task in 3m 14s


Investigating test-vulkan-genai Job Failure

  • Get CI status and identify failing runs
  • Download and analyze job logs
  • Review relevant code and CI configuration
  • Report findings with root cause and fix recommendations

Root Cause

The build fails at examples/models/parakeet/timestamp_utils.h:10 with:

fatal error: 'pytorch/tokenizers/tokenizer.h' file not found

This happens during the make parakeet-vulkan step called from test_model_e2e.sh.

Detailed Analysis

The CI job script (.github/workflows/pull.yml:1356-1375) runs these build stages in sequence:

  1. install_executorch.sh (called inside setup-linux.sh) — Runs pip install which internally uses the pybind preset to configure cmake-out/. This step actually has EXECUTORCH_BUILD_EXTENSION_LLM=ON (log line ~4866). It then cleans cmake-out/ and moves to step 2.

  2. build_executorch_runner_cmake (also inside setup-linux.sh) — Cleans cmake-out/, then runs:

    cmake -DEXECUTORCH_BUILD_VULKAN=ON -Bcmake-out .
    

    This creates a CMake cache where most extensions are OFF by default (log line ~19751):

    EXECUTORCH_BUILD_EXTENSION_LLM              : OFF
    EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER       : OFF
    EXECUTORCH_BUILD_EXTENSION_MODULE           : OFF
    EXECUTORCH_BUILD_EXTENSION_TENSOR           : OFF
    
  3. make parakeet-vulkan → first calls cmake --workflow --preset llm-debug-vulkan. This preset uses tools/cmake/preset/llm.cmake which calls set_overridable_option(EXECUTORCH_BUILD_EXTENSION_LLM ON). However, set_overridable_option (tools/cmake/common/preset.cmake:98-106) checks if(NOT DEFINED ${NAME}) — since the CMake cache in cmake-out/ already has these variables defined (set to OFF from step 2), the preset cannot override them. The log confirms at line ~23618:

    EXECUTORCH_BUILD_EXTENSION_LLM              : OFF
    EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER       : OFF
    
  4. Parakeet runner build — runs cd examples/models/parakeet && cmake --workflow --preset parakeet-vulkan. This tries to build the parakeet runner, but the tokenizers library was never built (it requires EXECUTORCH_BUILD_EXTENSION_LLM=ON), so the header pytorch/tokenizers/tokenizer.h doesn't exist in the install prefix.

Why test-parakeet-xnnpack-linux works

The XNNPACK CI job (.github/workflows/pull.yml:167-198) does not call setup-linux.sh. Instead it only runs ./install_executorch.sh (Python-only install), then goes straight to test_model_e2e.sh → make parakeet-cpu → cmake --workflow --preset llm-release, which configures cmake-out/ from scratch with no conflicting cache — so set_overridable_option can set EXECUTORCH_BUILD_EXTENSION_LLM=ON.

Recommended Fix

Option A (Recommended — matches the working XNNPACK pattern): Replace the setup-linux.sh call with just ./install_executorch.sh:

script: |
  set -eux
  CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
  conda activate "${CONDA_ENV}"

  # Setup Vulkan SDK (no SwiftShader — use NVIDIA driver's Vulkan ICD)
  source .ci/scripts/setup-vulkan-linux-deps.sh --gpu

  # Install ExecuTorch Python package only (no C++ build — the Makefile
  # workflow preset handles that)
  ./install_executorch.sh

  # Export parakeet with Vulkan backend
  bash .ci/scripts/export_model_artifact.sh vulkan "nvidia/parakeet-tdt" "quantized-8da4w" "${RUNNER_ARTIFACT_DIR}"

  # Build runner and test e2e (make parakeet-vulkan does the full C++ build)
  bash .ci/scripts/test_model_e2e.sh vulkan "nvidia/parakeet-tdt" "quantized-8da4w" "${RUNNER_ARTIFACT_DIR}"

The setup-linux.sh step is redundant (and harmful due to cache conflicts) because make parakeet-vulkan already does a full CMake workflow configure+build via the llm-debug-vulkan preset.

Option B: Clean cmake-out/ between the two build stages:

rm -rf cmake-out
make parakeet-vulkan

Option C: Pass all the required flags to setup-linux.sh:

CMAKE_ARGS="-DEXECUTORCH_BUILD_VULKAN=ON -DEXECUTORCH_BUILD_EXTENSION_LLM=ON -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON" \
.ci/scripts/setup-linux.sh --build-tool "cmake"

Option A is cleanest since it matches the proven pattern from test-parakeet-xnnpack-linux.


SS-JIA pushed a commit that referenced this pull request Apr 6, 2026
@SS-JIA
Contributor Author

SS-JIA commented Apr 6, 2026

@claude investigate the test-vulkan-genai job failure

@claude

claude bot commented Apr 6, 2026

Claude finished @SS-JIA's task in 6m 3s


Investigating test-vulkan-genai Job Failure

  • Get CI status and identify the failing workflow run
  • Review relevant code changes since last investigation
  • Analyze whether the previous fix addresses the failure
  • Report findings

Status

The current CI run (pull #24047249156) was triggered at 19:27 UTC today and is still in progress. The test-vulkan-genai job requires a linux.g5.4xlarge.nvidia.gpu runner and hasn't been provisioned yet (the job is beyond the API's 30-job pagination window). There are no completed failures in the current run.

Code Changes Since Last Investigation

The latest commit (b83c0a2) has applied a fix for the root cause identified in my previous investigation. Specifically, .github/workflows/pull.yml:1372-1377 now includes:

# Clean cmake-out/ — install_executorch.sh creates a CMake cache with
# most extensions OFF, and set_overridable_option in the Makefile
# presets cannot override cached variables. Without this, the
# llm-debug-vulkan preset fails to enable EXECUTORCH_BUILD_EXTENSION_LLM
# and the tokenizers header is never installed.
rm -rf cmake-out

This directly addresses the root cause: install_executorch.sh runs pip install . --no-build-isolation which internally uses CMake with the pybind preset, creating a cmake-out/ directory with EXECUTORCH_BUILD_EXTENSION_LLM=OFF cached. Subsequently, make parakeet-vulkan calls cmake --workflow --preset llm-debug-vulkan, but set_overridable_option in tools/cmake/common/preset.cmake:98-106 refuses to override already-defined cache variables. The rm -rf cmake-out step clears the stale cache so the preset can configure cleanly.

The CI job also no longer calls setup-linux.sh (which ran an unnecessary C++ build), and instead only runs ./install_executorch.sh (Python package install), matching the working pattern from test-parakeet-xnnpack-linux.

Assessment

The fix looks correct. The rm -rf cmake-out between install_executorch.sh and test_model_e2e.sh should resolve the 'pytorch/tokenizers/tokenizer.h' file not found build error. This is essentially Option C from the previous analysis, combined with Option A (dropping setup-linux.sh in favor of install_executorch.sh).

The job needs to run on the GPU runner to confirm the fix works end-to-end. Monitor the current CI run at pull #24047249156 — the test-vulkan-genai job should appear once a GPU runner is provisioned.


@SS-JIA
Contributor Author

SS-JIA commented Apr 6, 2026

@claude investigate the test-vulkan-genai job failure. You should see that the job failed due to being unable to initialize the global Vulkan runtime

@claude

claude bot commented Apr 6, 2026

Claude finished @SS-JIA's task in 5m 33s


Investigating test-vulkan-genai Job Failure

  • Get CI status and identify the failing workflow run
  • Download and analyze job logs
  • Review relevant code and CI configuration
  • Report findings with root cause and fix recommendations

Root Cause

The parakeet_runner crashes at startup with:

terminate called after throwing an instance of 'vkcompute::vkapi::Error'
  what():  Exception raised from runtime at backends/vulkan/runtime/vk_api/Runtime.cpp:407:
    (p_runtime) is false! Pytorch Vulkan Runtime: The global runtime could not be retrieved
    because it failed to initialize.

The Vulkan ICD loader library (libvulkan.so.1) is missing from the Docker container.

Detailed Analysis

The Vulkan backend uses volk to dynamically load Vulkan at runtime (see backends/vulkan/CMakeLists.txt:54, where USE_VULKAN_VOLK is defined). When volkInitialize() is called, it does dlopen("libvulkan.so.1"). If this library isn't found, it returns failure, and init_global_vulkan_runtime() at Runtime.cpp:265 returns nullptr.

Why the library is missing:

  1. The Docker image (executorch-ubuntu-22.04-clang12) does not include libvulkan1. Checking the Dockerfile and install_base.sh confirms no Vulkan packages are installed (libvulkan1 only appears in install_arm.sh for the ARM docker image).

  2. The setup-vulkan-linux-deps.sh --gpu script (.ci/scripts/setup-vulkan-linux-deps.sh:48-58) skips SwiftShader when --gpu is passed. SwiftShader would normally provide libvulkan.so via its LD_LIBRARY_PATH export (line 25). Without SwiftShader, only the Vulkan SDK is installed — which adds bin/ tools to PATH but does not provide the loader library.

  3. The NVIDIA driver installer on the host even warns about this at log line 1282:

    WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD
    loader was detected on this system. The NVIDIA Vulkan ICD will not function without
    the loader. Most distributions package the Vulkan loader; try installing the
    "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
    

    The NVIDIA driver installs the Vulkan ICD driver (nvidia_icd.json + libGLX_nvidia.so), but libvulkan.so.1 (the ICD loader that dispatches to drivers) must be installed separately.

  4. Although the Docker container is launched with --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all (which makes CUDA and the NVIDIA GPU visible), libvulkan.so.1 is not part of the NVIDIA Container Toolkit's automatic library injection — it must be explicitly installed in the container.

Why other Vulkan CI jobs work: The test-vulkan-operators-linux and test-vulkan-models-linux jobs run on linux.2xlarge (no GPU) and use SwiftShader, which bundles its own libvulkan.so and sets LD_LIBRARY_PATH.

Recommended Fix

Option A (Recommended — minimal change): Install libvulkan1 in the --gpu path of setup-vulkan-linux-deps.sh:

if [ "$USE_GPU" = false ]; then
  install_swiftshader
else
  # The Vulkan ICD loader (libvulkan.so.1) is needed to dispatch to the
  # NVIDIA ICD driver. SwiftShader bundles its own, but on GPU runners
  # we need to install it separately.
  apt-get update -q && apt-get install -y --no-install-recommends libvulkan1
fi

This installs the ~100KB Vulkan loader package which provides the libvulkan.so.1 that volk's volkInitialize() needs.

Option B (Docker image change): Add libvulkan1 to the base Docker image in .ci/docker/common/install_base.sh or the Dockerfile. This is cleaner but requires a docker image rebuild.

Option C (Use Vulkan SDK's loader): The LunarG Vulkan SDK includes libvulkan.so in its x86_64/lib/ directory. Add the SDK lib path to LD_LIBRARY_PATH:

install_vulkan_sdk() {
  # ... existing code ...
  export PATH="${PATH}:${_vulkan_sdk_dir}/${VULKAN_SDK_VERSION}/x86_64/bin/"
  export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}:${_vulkan_sdk_dir}/${VULKAN_SDK_VERSION}/x86_64/lib/"
}

This avoids needing apt-get (which may require sudo inside the container) and is self-contained.
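
One caveat with the snippet above: when LD_LIBRARY_PATH starts out unset, appending with a bare colon leaves an empty leading entry, which the dynamic loader treats as the current directory. A sketch of a safer append, with an assumed SDK path:

```shell
# Append the Vulkan SDK lib dir (path assumed for illustration) without
# creating an empty entry -- an empty LD_LIBRARY_PATH component means
# "current directory" to the dynamic loader.
sdk_lib="/opt/vulkansdk/1.3.296.0/x86_64/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${sdk_lib}"
echo "${LD_LIBRARY_PATH}"
```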

Option A or C would fix the immediate issue. Option C is cleanest since it uses what's already downloaded.


SS-JIA pushed a commit that referenced this pull request Apr 6, 2026

Labels

- CLA Signed: this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed
- fb-exported
- meta-exported
- module: vulkan (issues related to the Vulkan delegate and code under backends/vulkan/)
