AI-Hypercomputer/gpu-recipes

Cloud GPU performance benchmark recipes

This repository contains recipes that provide instructions to reproduce specific workload performance measurements, which are part of a confidential benchmarking program. These recipes focus on helping you reliably achieve performance metrics, such as throughput, that demonstrate the combined hardware and software stack on GPUs.

Note: The recipes in this repository are not designed as general-purpose code samples or tutorials for using Compute Engine-based products.

Intended audience

This content is for you if you are a customer or partner who needs to:

  • Validate hardware performance with your suppliers.
  • Inform purchasing decisions using the benchmarking data.
  • Reproduce optimal performance scenarios before you customize workflows for your own requirements.

How to use these recipes

To reproduce a benchmark, follow these steps:

  1. Identify your requirements: determine the model, GPU type, workload, framework, and orchestrator that you are interested in.
  2. Select a recipe: based on your requirements, use the Benchmarks support matrix to find a recipe that meets your needs.
  3. Follow the recipe: each recipe provides procedures to complete the following tasks:
    • Prepare your environment.
    • Run the benchmark.
    • Analyze the benchmark results. This includes not just the results but also detailed logs for further analysis.

You can automate your infrastructure setup using Cluster Toolkit. For more information, see Automated GPU environment deployment with Cluster Toolkit.
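Steps 1 and 2 amount to filtering the support matrix by your requirements. The following is a minimal sketch of that lookup, not something the repository ships: the `Recipe` class, the `MATRIX` list, and the `find_recipes` helper are hypothetical names introduced for illustration, and the rows are a small hand-copied subset of the tables below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One row of the support matrix (hypothetical structure, for illustration)."""
    model: str
    machine: str
    framework: str
    workload: str
    orchestrator: str


# A few rows copied by hand from the support matrix; this is not the full matrix.
MATRIX = [
    Recipe("Llama-3.1-70B", "A4 (NVIDIA B200)", "MaxText", "Pre-training", "GKE"),
    Recipe("Llama-3.1-70B", "A4 (NVIDIA B200)", "Megatron-Bridge (25.09)", "Pre-training", "Slurm"),
    Recipe("DeepSeek R1 671B", "A4 (NVIDIA B200)", "vLLM", "Inference", "GKE"),
    Recipe("Qwen3 8B", "G4 (NVIDIA RTX PRO 6000 Blackwell)", "vLLM", "Inference", "GCE"),
]


def find_recipes(matrix, **criteria):
    """Return the rows whose fields contain every given value.

    Matching is a case-insensitive substring test, so machine="B200"
    matches "A4 (NVIDIA B200)".
    """
    def matches(recipe):
        return all(
            value.lower() in getattr(recipe, field).lower()
            for field, value in criteria.items()
        )

    return [recipe for recipe in matrix if matches(recipe)]


if __name__ == "__main__":
    # Requirements from step 1: a Llama-3.1-70B run orchestrated with Slurm.
    for recipe in find_recipes(MATRIX, model="Llama-3.1-70B", orchestrator="Slurm"):
        print(recipe)
```

Substring matching is a deliberate simplification so that a short query such as `machine="B200"` works; an exact-match comparison would be stricter but would need the full cell text.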

Benchmarks support matrix

Training benchmarks A3 Mega

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| GPT3-175B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |
| Mixtral-8x7B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |

Training benchmarks A3 Ultra

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A3 Ultra (NVIDIA H200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-70B | A3 Ultra (NVIDIA H200) | NeMo (24.07) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3-8B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | NeMo (24.12) | Pre-training | GKE | Link |
| Mixtral-8x7B | A3 Ultra (NVIDIA H200) | NeMo (24.07) | Pre-training | GKE | Link |
| DeepSeek-V3 | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| GPT OSS 120B | A3 Ultra (NVIDIA H200) | NeMo (26.02) | Pre-training | GKE | Link |
| Qwen-3-30B | A3 Ultra (NVIDIA H200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1 | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |

Training benchmarks A4

| Models | GPU Machine Type | Framework / Library | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A4 (NVIDIA B200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Mixtral-8x7B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| PaliGemma2 | A4 (NVIDIA B200) | Hugging Face Accelerate | Finetuning | GKE | Link |
| DeepSeek-V3 | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| DeepSeek-V3 | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| GPT OSS 120B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3-8B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Qwen-3-30B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4 (NVIDIA B200) | NeMo (25.11) | Pre-training | GKE | Link |

Training benchmarks A4X

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-8B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Llama-3.1-8B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3.1-70B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4X (NVIDIA GB200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Nemotron-4-340B | A4X (NVIDIA GB200) | NeMo (25.09) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (25.11) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (25.11) | Pre-training | Slurm | Link |
| DeepSeek-V3 | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Qwen-3-30B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-30B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |

Inference benchmarks A3 Mega

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-4 | A3 Mega (NVIDIA H100) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Mega (NVIDIA H100) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Mega (NVIDIA H100) | vLLM | Inference | GKE | Link |

Inference benchmarks A3 Ultra

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| GPT OSS 120B | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |
| Llama-4 | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | TensorRT-LLM | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Ultra (NVIDIA H200) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |

Inference benchmarks A4

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| DeepSeek R1 671B | A4 (NVIDIA B200) | vLLM | Inference | GKE | Link |
| DeepSeek R1 671B | A4 (NVIDIA B200) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Llama 3.1 405B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 2.5 VL 7B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 3 235B A22B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 3 32B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |

Inference benchmarks A4X

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| DeepSeek R1 671B | A4X (NVIDIA GB200) | vLLM (v0.14.0rc1) | Inference | GKE | Link |
| Wan2.2 T2V A14B Diffusers | A4X (NVIDIA GB200) | SGLang (latest) | Inference | GKE | Link |
| Wan2.2 I2V A14B Diffusers | A4X (NVIDIA GB200) | SGLang (latest) | Inference | GKE | Link |
| DeepSeek R1 671B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link; Link for Using Google Cloud Storage (GCS) as Storage Option; Link for Using Lustre as Storage Option |
| Llama 3.1 405B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Llama 3.1 70B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Llama 3.1 8B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 2.5 VL 7B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 235B A22B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 32B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 4B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |

Inference benchmarks G4

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Qwen3 8B | G4 (NVIDIA RTX PRO 6000 Blackwell) | vLLM | Inference | GCE | Link |
| Qwen3 30B A3B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 4B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 8B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 32B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 32B | G4 (NVIDIA RTX PRO 6000 Blackwell) | vLLM | Inference | GCE | Link |
| Llama3.1 70B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| DeepSeek R1 | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 235B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Wan2.2 14B | G4 (NVIDIA RTX PRO 6000 Blackwell) | SGLang | Inference | GCE | Link |

Checkpointing benchmarks

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A3 Mega (NVIDIA H100) | NeMo | Pre-training using Google Cloud Storage buckets for checkpoints | GKE | Link |

Goodput benchmarks

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A3 Mega (NVIDIA H100) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |
| Mixtral-8x7B | A3 Ultra (NVIDIA H200) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |

Repository organization

  • ./training: this directory contains recipes with instructions to reproduce training benchmarks with GPUs.
  • ./inference: this directory contains recipes with instructions to reproduce inference benchmarks with GPUs.
  • ./src: this directory contains the shared dependencies required to run benchmarks, such as Docker images and Helm charts.
  • ./docs: this directory contains supporting documentation for explanations of benchmark methodologies or configurations.

Repository scope

This repository provides the steps that you can use to reproduce a specific benchmark. The actual performance measurements and the complete, confidential benchmark report are not included.

Methodology

Performance benchmarks measure the performance of various workloads on the platform. These benchmarks are primarily used to validate performance with hardware suppliers and to provide you with data for purchasing decisions.

Maintenance policy

Benchmark data is a point-in-time measurement, and completed benchmarks are not repeated. We maintain and update the recipes in this repository on a best-effort basis.

Resources

For general guidance on how to get started using Compute products, refer to the official documentation and tutorials.

Report issues

If you have questions or encounter problems with this repository, report them through GitHub Issues or reach out to your Google Cloud account team for assistance.

Contributor notes

Note: This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.

About

Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.
