2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -17,14 +17,14 @@ modelopt/deploy @NVIDIA/modelopt-deploy-codeowners
modelopt/onnx @NVIDIA/modelopt-onnx-codeowners
modelopt/onnx/autocast @NVIDIA/modelopt-onnx-autocast-codeowners
modelopt/torch @NVIDIA/modelopt-torch-codeowners
modelopt/torch/_compress @NVIDIA/modelopt-torch-compress-codeowners
modelopt/torch/_deploy @NVIDIA/modelopt-torch-deploy-codeowners
modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
12 changes: 6 additions & 6 deletions .pre-commit-config.yaml
@@ -24,17 +24,17 @@ repos:
hooks:
- id: ruff-check
args: [--fix, --exit-non-zero-on-fix]
# See: commit hooks modifies block_config.py leading to test_compress.py failing (#25) · Issues · omniml / modelopt · GitLab
# See: commit hooks modifies block_config.py leading to test_puzzletron.py failing (#25) · Issues · omniml / modelopt · GitLab
exclude: >
(?x)^(
modelopt/torch/_compress/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/_compress/decilm/deci_lm_hf_code/transformers_.*\.py
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py
)$
- id: ruff-format
exclude: >
(?x)^(
modelopt/torch/_compress/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/_compress/decilm/deci_lm_hf_code/transformers_.*\.py
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py
)$

- repo: https://github.com/pre-commit/mirrors-mypy
@@ -107,7 +107,7 @@ repos:
examples/speculative_decoding/main.py|
examples/speculative_decoding/medusa_utils.py|
examples/speculative_decoding/server_generate.py|
modelopt/torch/_compress/decilm/deci_lm_hf_code/transformers_.*\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
)$

# Default hook for Apache 2.0 in c/c++/cuda files
3 changes: 1 addition & 2 deletions examples/pruning/README.md
@@ -7,6 +7,7 @@ Pruning can involve removal (prune) of Linear and Conv layers; and Transformer a
This section focuses on applying Model Optimizer's state-of-the-art complementary pruning modes to enable you to search for the best subnet architecture from your provided base model:

1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT (and later extended to Mamba, MoE, and Hybrid Transformer Mamba) models in NVIDIA Megatron-LM or NeMo framework. It uses the activation magnitudes to prune the embedding hidden size; mlp ffn hidden size; transformer attention heads; mamba heads and head dimension; MoE number of experts, ffn hidden size, and shared expert intermediate size; and number of layers of the model.
1. [Puzzletron](../puzzletron/README.md): An advanced pruning method by NVIDIA using a Mixed Integer Programming (MIP)-based NAS search algorithm.
1. FastNAS: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
1. GradNAS: A light-weight pruning method recommended for language models like Hugging Face BERT, GPT-J. It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.

@@ -23,8 +24,6 @@ This section focuses on applying Model Optimizer's state-of-the-art complementar

</div>

For more advanced pruning strategies, such as the [Puzzle methodology](https://arxiv.org/pdf/2411.19146), please see [Puzzle pruning example](../compress/README.md).

## Pre-Requisites

For Minitron pruning for Megatron-LM / NeMo models, use the NeMo container (e.g., `nvcr.io/nvidia/nemo:25.09`) which has all the dependencies installed.
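As a quick illustration, the container can be launched as shown below (a minimal sketch; the mount path and working directory are placeholders, not requirements):

```bash
# Illustrative only: adjust the mount path to point at your own checkout.
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace -w /workspace \
  nvcr.io/nvidia/nemo:25.09 bash
```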
30 changes: 15 additions & 15 deletions examples/compress/README.md → examples/puzzletron/README.md
@@ -1,6 +1,6 @@
# Compress Algorithm Tutorial
# Puzzletron Algorithm Tutorial

This tutorial demonstrates how to compress large language models using the compress algorithm based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
This tutorial demonstrates how to compress large language models using the puzzletron algorithm, based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
The goal of the algorithm is to find the optimal modifications to the MLP and attention layers of the model, resulting in a heterogeneous model architecture.
The supported modifications are:

@@ -16,7 +16,7 @@ In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/
- Install Model-Optimizer in editable mode with the corresponding dependencies:

```bash
pip install -e .[hf,compress]
pip install -e .[hf,puzzletron]
```

- For this example we are using 2x NVIDIA H100 80GB HBM3 to show the multi-GPU steps. You can also use a single GPU.
@@ -34,7 +34,7 @@ hf auth login
dataset split: "code", "math", "stem", "chat", excluding reasoning samples (2.62GB)

```bash
python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
python -m modelopt.torch.puzzletron.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

2. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.
@@ -51,23 +51,23 @@ hf auth login

We can also set the target size of the resulting model using `num_params = 7_000_000_000`. This will be used as an upper bound for the number of parameters of the model.
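   For orientation, here is a minimal sketch of how these fields fit together in the YAML. The placement of `num_params` under `human_constraints` is an assumption; the other fields mirror the shipped config:

   ```yaml
   puzzle_dir: /workspace/puzzle_dir
   input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct
   dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

   mip:
     human_constraints:
       target_memory: 78_000        # MiB
       # num_params: 7_000_000_000  # optional parameter-count upper bound (assumed placement)

   pruning:
     intermediate_size_list: [3072, 5888, 8704, 11520]  # teacher_intermediate_size is 14336
   ```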

3. Run the compression script.
3. Run the puzzletron pipeline.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config examples/compress/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
torchrun --nproc_per_node 2 examples/puzzletron/main.py --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
```

This will save the full output to `log.txt` and display the following progress on screen:

```bash
[2025-11-02 12:06:34][rank-0][main.py:71] Compress Progress 1/8: starting compression pipeline
[2025-11-02 12:06:45][rank-0][compress_nas_plugin.py:123] Compress Progress 2/8: converting model from HF to DeciLM (single-gpu)
[2025-11-02 12:07:07][rank-0][compress_nas_plugin.py:132] Compress Progress 3/8: scoring pruning activations (multi-gpu)
[2025-11-02 12:11:36][rank-0][compress_nas_plugin.py:137] Compress Progress 4/8: pruning the model and saving pruned checkpoints (single-gpu)
[2025-11-02 12:12:20][rank-0][compress_nas_plugin.py:217] Compress Progress 5/8: building replacement library and subblock statistics (single-gpu)
[2025-11-02 12:12:21][rank-0][compress_nas_plugin.py:222] Compress Progress 6/8: calculating one block scores (multi-gpu)
[2025-11-02 12:50:41][rank-0][compress_nas_plugin.py:226] Compress Progress 7/8: running MIP and realizing models (multi-gpu)
[2025-11-02 12:52:34][rank-0][main.py:115] Compress Progress 8/8: compression pipeline completed (multi-gpu)
[2025-11-02 12:06:34][rank-0][main.py:71] Puzzletron Progress 1/8: starting puzzletron pipeline
[2025-11-02 12:06:45][rank-0][puzzletron_nas_plugin.py:123] Puzzletron Progress 2/8: converting model from HF to DeciLM (single-gpu)
[2025-11-02 12:07:07][rank-0][puzzletron_nas_plugin.py:132] Puzzletron Progress 3/8: scoring pruning activations (multi-gpu)
[2025-11-02 12:11:36][rank-0][puzzletron_nas_plugin.py:137] Puzzletron Progress 4/8: pruning the model and saving pruned checkpoints (single-gpu)
[2025-11-02 12:12:20][rank-0][puzzletron_nas_plugin.py:217] Puzzletron Progress 5/8: building replacement library and subblock statistics (single-gpu)
[2025-11-02 12:12:21][rank-0][puzzletron_nas_plugin.py:222] Puzzletron Progress 6/8: calculating one block scores (multi-gpu)
[2025-11-02 12:50:41][rank-0][puzzletron_nas_plugin.py:226] Puzzletron Progress 7/8: running MIP and realizing models (multi-gpu)
[2025-11-02 12:52:34][rank-0][main.py:115] Puzzletron Progress 8/8: puzzletron pipeline completed (multi-gpu)
```

Once the process is complete, the resulting network architecture will be recorded in `log.txt` for your review:
@@ -132,7 +132,7 @@ This assumes pruning, replacement library building, NAS scoring, and subblock st
For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Compress Progress"
torchrun --nproc_per_node 2 examples/puzzletron/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
```
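The only change relative to the earlier run is the constraint value, using the same nesting as in the shipped config:

```yaml
mip:
  human_constraints:
    target_memory: 96_000  # 96 GiB
```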

This will generate the following network architecture (see `log.txt`):
Original file line number Diff line number Diff line change
@@ -8,14 +8,14 @@ input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct
# Dataset path for pruning and NAS scoring
dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

# Working directory for compression outputs
# Working directory for puzzletron outputs
puzzle_dir: /workspace/puzzle_dir

# MIP memory constraint (in MiB)
# MIP memory constraint (in MiB)
mip:
human_constraints:
target_memory: 78_000 # 78 GiB

# FFN intermediate sizes to search over (heterogeneous architecture)
pruning:
intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
Original file line number Diff line number Diff line change
@@ -14,4 +14,4 @@ write_results: false
calc_losses_on_cpu: false
activations_log_dir:
model_name_or_path:
load_dataset_fn: ${get_object:modelopt.torch._compress.utils.data.dataloaders.load_from_disk_fn}
load_dataset_fn: ${get_object:modelopt.torch.puzzletron.utils.data.dataloaders.load_from_disk_fn}
36 changes: 18 additions & 18 deletions examples/compress/main.py → examples/puzzletron/main.py
@@ -14,14 +14,14 @@
# limitations under the License.

"""
Main script for running the compress algorithm on large language models (based on Puzzle paper https://arxiv.org/abs/2411.19146).
Main script for running the puzzletron algorithm on large language models (based on the Puzzle paper https://arxiv.org/abs/2411.19146).

This script provides two modes:
1. Default mode: Runs the full compression pipeline
1. Default mode: Runs the full puzzletron pipeline
2. MIP-only mode: Runs only the MIP search and realize models phase

Usage:
# Full compression pipeline
# Full puzzletron pipeline
torchrun main.py --config ./configs/llama_3.2_1B_pruneffn_memory.yaml

# Only MIP search and realize models phase
@@ -32,21 +32,21 @@
from datetime import timedelta
from pathlib import Path

import modelopt.torch._compress.mip.mip_and_realize_models as mip_and_realize_models
import modelopt.torch.nas as mtn
import modelopt.torch.puzzletron.mip.mip_and_realize_models as mip_and_realize_models
import modelopt.torch.utils.distributed as dist
from modelopt.torch._compress.nas.plugins.compress_nas_plugin import CompressModel
from modelopt.torch._compress.tools.hydra_utils import (
from modelopt.torch.puzzletron.nas.plugins.puzzletron_nas_plugin import PuzzletronModel
from modelopt.torch.puzzletron.tools.hydra_utils import (
initialize_hydra_config_for_dir,
register_hydra_resolvers,
)
from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch.puzzletron.tools.logger import mprint


def parse_args():
"""Parse command line arguments."""
parser = argparse.ArgumentParser(
description="Compress large language models using the Compress algorithm (based on Puzzle paper https://arxiv.org/abs/2411.19146)"
description="Compress large language models using the Puzzletron algorithm (based on Puzzle paper https://arxiv.org/abs/2411.19146)"
)
parser.add_argument(
"--config",
@@ -63,13 +63,13 @@ def parse_args():
return parser.parse_args()


def run_full_compress(hydra_config_path: str):
"""Run the full compression pipeline.
def run_full_puzzletron(hydra_config_path: str):
"""Run the full puzzletron pipeline.

Args:
config_path: Path to the YAML configuration file
"""
mprint("Compress Progress 1/8: starting compression pipeline")
mprint("Puzzletron Progress 1/8: starting puzzletron pipeline")
dist.setup(timeout=timedelta(10))

# Register Hydra custom resolvers (needed for config resolution)
@@ -88,12 +88,12 @@ def run_full_compress(hydra_config_path: str):

# Convert model (convert from HF to DeciLM, score pruning activations,
# prune the model and save pruned checkpoints)
input_model = CompressModel()
input_model = PuzzletronModel()
converted_model = mtn.convert(
input_model,
mode=[
(
"compress",
"puzzletron",
{
"puzzle_dir": str(hydra_cfg.puzzle_dir),
"input_model_path": hydra_cfg.input_hf_model_path,
@@ -115,7 +115,7 @@ )
)

dist.cleanup()
mprint("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
mprint("Puzzletron Progress 8/8: puzzletron pipeline completed (multi-gpu)")


def run_mip_only(hydra_config_path: str):
@@ -144,12 +144,12 @@ def run_mip_only(hydra_config_path: str):
)

# mip_and_realize_models (distributed processing)
# TODO: How to make it part of mnt.search() api, similarly to run_full_compress() API
mprint("Compress Progress 7/8: running MIP and realizing models (multi-gpu)")
# TODO: How to make it part of the mtn.search() API, similarly to the run_full_puzzletron() API
mprint("Puzzletron Progress 7/8: running MIP and realizing models (multi-gpu)")
mip_and_realize_models.launch_mip_and_realize_model(hydra_cfg)

dist.cleanup()
mprint("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
mprint("Puzzletron Progress 8/8: puzzletron pipeline completed (multi-gpu)")


def main():
@@ -158,7 +158,7 @@ def main():
if args.mip_only:
run_mip_only(hydra_config_path=args.config)
else:
run_full_compress(hydra_config_path=args.config)
run_full_puzzletron(hydra_config_path=args.config)


if __name__ == "__main__":
4 changes: 2 additions & 2 deletions modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py
@@ -26,8 +26,8 @@
from torch import nn

import modelopt.torch.utils.distributed as dist
from modelopt.torch._compress.tools.logger import aprint
from modelopt.torch._compress.tools.robust_json import json_dump
from modelopt.torch.puzzletron.tools.logger import aprint
from modelopt.torch.puzzletron.tools.robust_json import json_dump

__all__ = [
"ForwardHook",
Original file line number Diff line number Diff line change
@@ -15,18 +15,19 @@
# mypy: ignore-errors

"""Provides a function to register activation hooks for a model.
Activation hooks are used to compute activation scores for pruning."""
Activation hooks are used to compute activation scores for pruning.
"""

import re

from modelopt.torch._compress.decilm.deci_lm_hf_code.modeling_decilm import DeciLMForCausalLM
from modelopt.torch.nas.plugins.megatron_hooks.base_hooks import (
ForwardHook,
IndependentChannelContributionHook,
IndependentKvHeadContributionHook,
IterativeChannelContributionHook,
LayerNormContributionHook,
)
from modelopt.torch.puzzletron.decilm.deci_lm_hf_code.modeling_decilm import DeciLMForCausalLM


def register_activation_hooks(
Original file line number Diff line number Diff line change
@@ -19,13 +19,12 @@
from omegaconf import DictConfig

import modelopt.torch.utils.distributed as dist
from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch._compress.tools.validate_model import validate_model
from modelopt.torch.puzzletron.tools.logger import mprint
from modelopt.torch.puzzletron.tools.validate_model import validate_model


def has_checkpoint_support(activation_hooks_kwargs: dict) -> bool:
"""
Determine if the activation hook method has proper checkpoint support implemented.
"""Determine if the activation hook method has proper checkpoint support implemented.

Args:
activation_hooks_kwargs: Hook configuration
@@ -47,8 +46,7 @@ def has_checkpoint_support(activation_hooks_kwargs: dict) -> bool:


def check_scoring_completion(activations_log_dir: str, activation_hooks_kwargs=None) -> bool:
"""
Check if scoring is already completed by looking for the expected output files.
"""Check if scoring is already completed by looking for the expected output files.
Also checks if the scoring method is safe for resume.

Args:
@@ -89,8 +87,7 @@ def check_scoring_completion(activations_log_dir: str, activation_hooks_kwargs=N


def should_skip_scoring_completely(cfg: DictConfig) -> bool:
"""
Determine if we should skip scoring entirely (only if 100% complete).
"""Determine if we should skip scoring entirely (only if 100% complete).
Partial progress should proceed to validate_model for proper resume.

Args:
Original file line number Diff line number Diff line change
@@ -14,36 +14,31 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Unified command that runs build_replacement_library followed by calc_subblock_stats.
"""Unified command that runs build_replacement_library followed by calc_subblock_stats.

This script combines the functionality of both commands into a single workflow:
1. First, it builds the replacement library for the puzzle
2. Then, it calculates subblock statistics

Usage:

python modelopt.torch._compress.build_library_and_stats.py --config-dir configs --config-name Llama-3_1-8B puzzle_dir=/path/to/puzzle/dir dataset_path=/path/to/dataset
python -m modelopt.torch.puzzletron.build_library_and_stats --config-dir configs --config-name Llama-3_1-8B puzzle_dir=/path/to/puzzle/dir dataset_path=/path/to/dataset

The script uses the same Hydra configuration as the individual commands and supports
all the same configuration parameters for both build_replacement_library and calc_subblock_stats.
"""

import hydra
from omegaconf import DictConfig

from modelopt.torch._compress.replacement_library.build_replacement_library import (
from modelopt.torch.puzzletron.replacement_library.build_replacement_library import (
launch_build_replacement_library,
)
from modelopt.torch._compress.subblock_stats.calc_subblock_stats import launch_calc_subblock_stats
from modelopt.torch._compress.tools.hydra_utils import register_hydra_resolvers
from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch._compress.utils.parsing import format_global_config
from modelopt.torch.puzzletron.subblock_stats.calc_subblock_stats import launch_calc_subblock_stats
from modelopt.torch.puzzletron.tools.logger import mprint


def launch_build_library_and_stats(cfg: DictConfig) -> None:
"""
Launch both build_replacement_library and calc_subblock_stats in sequence.
"""Launch both build_replacement_library and calc_subblock_stats in sequence.

Args:
cfg: Hydra configuration containing settings for both commands
Original file line number Diff line number Diff line change
@@ -19,7 +19,7 @@
import fire
import numpy as np

from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch.puzzletron.tools.logger import mprint


def process_and_save_dataset(