2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -17,14 +17,14 @@ modelopt/deploy @NVIDIA/modelopt-deploy-codeowners
modelopt/onnx @NVIDIA/modelopt-onnx-codeowners
modelopt/onnx/autocast @NVIDIA/modelopt-onnx-autocast-codeowners
modelopt/torch @NVIDIA/modelopt-torch-codeowners
modelopt/torch/_compress @NVIDIA/modelopt-torch-compress-codeowners
modelopt/torch/_deploy @NVIDIA/modelopt-torch-deploy-codeowners
modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
12 changes: 6 additions & 6 deletions .pre-commit-config.yaml
@@ -24,17 +24,17 @@ repos:
hooks:
- id: ruff-check
args: [--fix, --exit-non-zero-on-fix]
# See: commit hooks modifies block_config.py leading to test_compress.py failing (#25) · Issues · omniml / modelopt · GitLab
# See: commit hooks modifies block_config.py leading to test_puzzletron.py failing (#25) · Issues · omniml / modelopt · GitLab
exclude: >
(?x)^(
modelopt/torch/_compress/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/_compress/decilm/deci_lm_hf_code/transformers_.*\.py
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py
)$
- id: ruff-format
exclude: >
(?x)^(
modelopt/torch/_compress/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/_compress/decilm/deci_lm_hf_code/transformers_.*\.py
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py
)$

- repo: https://github.com/pre-commit/mirrors-mypy
@@ -107,7 +107,7 @@ repos:
examples/speculative_decoding/main.py|
examples/speculative_decoding/medusa_utils.py|
examples/speculative_decoding/server_generate.py|
modelopt/torch/_compress/decilm/deci_lm_hf_code/transformers_.*\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
)$

# Default hook for Apache 2.0 in c/c++/cuda files
3 changes: 1 addition & 2 deletions examples/pruning/README.md
@@ -7,6 +7,7 @@ Pruning can involve removal (prune) of Linear and Conv layers; and Transformer a
This section focuses on applying Model Optimizer's state-of-the-art complementary pruning modes to enable you to search for the best subnet architecture from your provided base model:

1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT (and later extended to Mamba, MoE, and Hybrid Transformer Mamba) models in NVIDIA Megatron-LM or NeMo framework. It uses the activation magnitudes to prune the embedding hidden size; mlp ffn hidden size; transformer attention heads; mamba heads and head dimension; MoE number of experts, ffn hidden size, and shared expert intermediate size; and number of layers of the model.
1. [Puzzletron](../puzzletron/README.md): An advanced pruning method by NVIDIA using a Mixed Integer Programming (MIP)-based NAS search algorithm.
1. FastNAS: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
1. GradNAS: A light-weight pruning method recommended for language models like Hugging Face BERT, GPT-J. It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.

@@ -23,8 +24,6 @@ This section focuses on applying Model Optimizer's state-of-the-art complementar

</div>

For more advanced pruning strategies, such as the [Puzzle methodology](https://arxiv.org/pdf/2411.19146), please see [Puzzle pruning example](../compress/README.md).

## Pre-Requisites

For Minitron pruning for Megatron-LM / NeMo models, use the NeMo container (e.g., `nvcr.io/nvidia/nemo:25.09`) which has all the dependencies installed.
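As a quick illustration, the container can be launched as shown below (a minimal sketch; the mount path and working directory are placeholders, not requirements):

```bash
# Illustrative only: adjust the mount path to point at your own checkout.
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace -w /workspace \
  nvcr.io/nvidia/nemo:25.09 bash
```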
30 changes: 15 additions & 15 deletions examples/compress/README.md → examples/puzzletron/README.md
@@ -1,6 +1,6 @@
# Compress Algorithm Tutorial
# Puzzletron Algorithm Tutorial

This tutorial demonstrates how to compress large language models using the compress algorithm based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
This tutorial demonstrates how to compress large language models using the puzzletron algorithm, based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
The goal of the algorithm is to find the optimal modifications to the MLP and attention layers of the model, resulting in a heterogeneous model architecture.
The supported modifications are:

@@ -16,7 +16,7 @@ In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/
- Install Model-Optimizer in editable mode with the corresponding dependencies:

```bash
pip install -e .[hf,compress]
pip install -e .[hf,puzzletron]
```

- For this example we are using 2x NVIDIA H100 80GB HBM3 to show the multi-GPU steps. You can also use a single GPU.
@@ -34,7 +34,7 @@ hf auth login
dataset split: "code", "math", "stem", "chat", excluding reasoning samples (2.62GB)

```bash
python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
python -m modelopt.torch.puzzletron.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

2. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.
@@ -51,23 +51,23 @@ hf auth login

We can also set the target size of the resulting model using `num_params = 7_000_000_000`. This will be used as an upper bound for the number of parameters of the model.
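   For orientation, here is a minimal sketch of how these fields fit together in the YAML. The placement of `num_params` under `human_constraints` is an assumption; the other fields mirror the shipped config:

   ```yaml
   puzzle_dir: /workspace/puzzle_dir
   input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct
   dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

   mip:
     human_constraints:
       target_memory: 78_000        # MiB
       # num_params: 7_000_000_000  # optional parameter-count upper bound (assumed placement)

   pruning:
     intermediate_size_list: [3072, 5888, 8704, 11520]  # teacher_intermediate_size is 14336
   ```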

3. Run the compression script.
3. Run the puzzletron pipeline.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config examples/compress/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
torchrun --nproc_per_node 2 examples/puzzletron/main.py --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
```

This will save the full output to `log.txt` and display the following progress on screen:

```bash
[2025-11-02 12:06:34][rank-0][main.py:71] Compress Progress 1/8: starting compression pipeline
[2025-11-02 12:06:45][rank-0][compress_nas_plugin.py:123] Compress Progress 2/8: converting model from HF to DeciLM (single-gpu)
[2025-11-02 12:07:07][rank-0][compress_nas_plugin.py:132] Compress Progress 3/8: scoring pruning activations (multi-gpu)
[2025-11-02 12:11:36][rank-0][compress_nas_plugin.py:137] Compress Progress 4/8: pruning the model and saving pruned checkpoints (single-gpu)
[2025-11-02 12:12:20][rank-0][compress_nas_plugin.py:217] Compress Progress 5/8: building replacement library and subblock statistics (single-gpu)
[2025-11-02 12:12:21][rank-0][compress_nas_plugin.py:222] Compress Progress 6/8: calculating one block scores (multi-gpu)
[2025-11-02 12:50:41][rank-0][compress_nas_plugin.py:226] Compress Progress 7/8: running MIP and realizing models (multi-gpu)
[2025-11-02 12:52:34][rank-0][main.py:115] Compress Progress 8/8: compression pipeline completed (multi-gpu)
[2025-11-02 12:06:34][rank-0][main.py:71] Puzzletron Progress 1/8: starting puzzletron pipeline
[2025-11-02 12:06:45][rank-0][puzzletron_nas_plugin.py:123] Puzzletron Progress 2/8: converting model from HF to DeciLM (single-gpu)
[2025-11-02 12:07:07][rank-0][puzzletron_nas_plugin.py:132] Puzzletron Progress 3/8: scoring pruning activations (multi-gpu)
[2025-11-02 12:11:36][rank-0][puzzletron_nas_plugin.py:137] Puzzletron Progress 4/8: pruning the model and saving pruned checkpoints (single-gpu)
[2025-11-02 12:12:20][rank-0][puzzletron_nas_plugin.py:217] Puzzletron Progress 5/8: building replacement library and subblock statistics (single-gpu)
[2025-11-02 12:12:21][rank-0][puzzletron_nas_plugin.py:222] Puzzletron Progress 6/8: calculating one block scores (multi-gpu)
[2025-11-02 12:50:41][rank-0][puzzletron_nas_plugin.py:226] Puzzletron Progress 7/8: running MIP and realizing models (multi-gpu)
[2025-11-02 12:52:34][rank-0][main.py:115] Puzzletron Progress 8/8: puzzletron pipeline completed (multi-gpu)
```

Once the process is complete, the resulting network architecture will be recorded in `log.txt` for your review:
@@ -132,7 +132,7 @@ This assumes pruning, replacement library building, NAS scoring, and subblock st
For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Compress Progress"
torchrun --nproc_per_node 2 examples/puzzletron/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
```
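The only change relative to the earlier run is the constraint value, using the same nesting as in the shipped config:

```yaml
mip:
  human_constraints:
    target_memory: 96_000  # 96 GiB
```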

This will generate the following network architecture (see `log.txt`):
Original file line number Diff line number Diff line change
@@ -8,14 +8,14 @@ input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct
# Dataset path for pruning and NAS scoring
dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

# Working directory for compression outputs
# Working directory for puzzletron outputs
puzzle_dir: /workspace/puzzle_dir

# MIP memory constraint (in MiB)
# MIP memory constraint (in MiB)
mip:
human_constraints:
target_memory: 78_000 # 78 GiB

# FFN intermediate sizes to search over (heterogeneous architecture)
pruning:
intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
Original file line number Diff line number Diff line change
@@ -14,4 +14,4 @@ write_results: false
calc_losses_on_cpu: false
activations_log_dir:
model_name_or_path:
load_dataset_fn: ${get_object:modelopt.torch._compress.utils.data.dataloaders.load_from_disk_fn}
load_dataset_fn: ${get_object:modelopt.torch.puzzletron.utils.data.dataloaders.load_from_disk_fn}
36 changes: 18 additions & 18 deletions examples/compress/main.py → examples/puzzletron/main.py
@@ -14,14 +14,14 @@
# limitations under the License.

"""
Main script for running the compress algorithm on large language models (based on Puzzle paper https://arxiv.org/abs/2411.19146).
Main script for running the puzzletron algorithm on large language models (based on the Puzzle paper https://arxiv.org/abs/2411.19146).

This script provides two modes:
1. Default mode: Runs the full compression pipeline
1. Default mode: Runs the full puzzletron pipeline
2. MIP-only mode: Runs only the MIP search and realize models phase

Usage:
# Full compression pipeline
# Full puzzletron pipeline
torchrun main.py --config ./configs/llama_3.2_1B_pruneffn_memory.yaml

# Only MIP search and realize models phase
@@ -32,21 +32,21 @@
from datetime import timedelta
from pathlib import Path

import modelopt.torch._compress.mip.mip_and_realize_models as mip_and_realize_models
import modelopt.torch.nas as mtn
import modelopt.torch.puzzletron.mip.mip_and_realize_models as mip_and_realize_models
import modelopt.torch.utils.distributed as dist
from modelopt.torch._compress.nas.plugins.compress_nas_plugin import CompressModel
from modelopt.torch._compress.tools.hydra_utils import (
from modelopt.torch.puzzletron.nas.plugins.puzzletron_nas_plugin import PuzzletronModel
from modelopt.torch.puzzletron.tools.hydra_utils import (
initialize_hydra_config_for_dir,
register_hydra_resolvers,
)
from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch.puzzletron.tools.logger import mprint


def parse_args():
"""Parse command line arguments."""
parser = argparse.ArgumentParser(
description="Compress large language models using the Compress algorithm (based on Puzzle paper https://arxiv.org/abs/2411.19146)"
description="Compress large language models using the Puzzletron algorithm (based on Puzzle paper https://arxiv.org/abs/2411.19146)"
)
parser.add_argument(
"--config",
@@ -63,13 +63,13 @@ def parse_args():
return parser.parse_args()


def run_full_compress(hydra_config_path: str):
"""Run the full compression pipeline.
def run_full_puzzletron(hydra_config_path: str):
"""Run the full puzzletron pipeline.

Args:
config_path: Path to the YAML configuration file
"""
mprint("Compress Progress 1/8: starting compression pipeline")
mprint("Puzzletron Progress 1/8: starting puzzletron pipeline")
dist.setup(timeout=timedelta(10))

# Register Hydra custom resolvers (needed for config resolution)
@@ -88,12 +88,12 @@ def run_full_compress(hydra_config_path: str):

# Convert model (convert from HF to DeciLM, score pruning activations,
# prune the model and save pruned checkpoints)
input_model = CompressModel()
input_model = PuzzletronModel()
converted_model = mtn.convert(
input_model,
mode=[
(
"compress",
"puzzletron",
{
"puzzle_dir": str(hydra_cfg.puzzle_dir),
"input_model_path": hydra_cfg.input_hf_model_path,
@@ -115,7 +115,7 @@ )
)

dist.cleanup()
mprint("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
mprint("Puzzletron Progress 8/8: puzzletron pipeline completed (multi-gpu)")


def run_mip_only(hydra_config_path: str):
@@ -144,12 +144,12 @@ def run_mip_only(hydra_config_path: str):
)

# mip_and_realize_models (distributed processing)
# TODO: How to make it part of mnt.search() api, similarly to run_full_compress() API
mprint("Compress Progress 7/8: running MIP and realizing models (multi-gpu)")
# TODO: How to make it part of the mtn.search() API, similarly to the run_full_puzzletron() API
mprint("Puzzletron Progress 7/8: running MIP and realizing models (multi-gpu)")
mip_and_realize_models.launch_mip_and_realize_model(hydra_cfg)

dist.cleanup()
mprint("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
mprint("Puzzletron Progress 8/8: puzzletron pipeline completed (multi-gpu)")


def main():
@@ -158,7 +158,7 @@ def main():
if args.mip_only:
run_mip_only(hydra_config_path=args.config)
else:
run_full_compress(hydra_config_path=args.config)
run_full_puzzletron(hydra_config_path=args.config)


if __name__ == "__main__":
4 changes: 2 additions & 2 deletions modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py
@@ -26,8 +26,8 @@
from torch import nn

import modelopt.torch.utils.distributed as dist
from modelopt.torch._compress.tools.logger import aprint
from modelopt.torch._compress.tools.robust_json import json_dump
from modelopt.torch.puzzletron.tools.logger import aprint
from modelopt.torch.puzzletron.tools.robust_json import json_dump

__all__ = [
"ForwardHook",
Original file line number Diff line number Diff line change
@@ -15,18 +15,19 @@
# mypy: ignore-errors

"""Provides a function to register activation hooks for a model.
Activation hooks are used to compute activation scores for pruning."""
Activation hooks are used to compute activation scores for pruning.
"""

import re

from modelopt.torch._compress.decilm.deci_lm_hf_code.modeling_decilm import DeciLMForCausalLM
from modelopt.torch.nas.plugins.megatron_hooks.base_hooks import (
ForwardHook,
IndependentChannelContributionHook,
IndependentKvHeadContributionHook,
IterativeChannelContributionHook,
LayerNormContributionHook,
)
from modelopt.torch.puzzletron.decilm.deci_lm_hf_code.modeling_decilm import DeciLMForCausalLM


def register_activation_hooks(
Original file line number Diff line number Diff line change
@@ -19,13 +19,12 @@
from omegaconf import DictConfig

import modelopt.torch.utils.distributed as dist
from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch._compress.tools.validate_model import validate_model
from modelopt.torch.puzzletron.tools.logger import mprint
from modelopt.torch.puzzletron.tools.validate_model import validate_model


def has_checkpoint_support(activation_hooks_kwargs: dict) -> bool:
"""
Determine if the activation hook method has proper checkpoint support implemented.
"""Determine if the activation hook method has proper checkpoint support implemented.

Args:
activation_hooks_kwargs: Hook configuration
@@ -47,8 +46,7 @@ def has_checkpoint_support(activation_hooks_kwargs: dict) -> bool:


def check_scoring_completion(activations_log_dir: str, activation_hooks_kwargs=None) -> bool:
"""
Check if scoring is already completed by looking for the expected output files.
"""Check if scoring is already completed by looking for the expected output files.
Also checks if the scoring method is safe for resume.

Args:
@@ -89,8 +87,7 @@ def check_scoring_completion(activations_log_dir: str, activation_hooks_kwargs=N


def should_skip_scoring_completely(cfg: DictConfig) -> bool:
"""
Determine if we should skip scoring entirely (only if 100% complete).
"""Determine if we should skip scoring entirely (only if 100% complete).
Partial progress should proceed to validate_model for proper resume.

Args:
Original file line number Diff line number Diff line change
@@ -14,36 +14,31 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Unified command that runs build_replacement_library followed by calc_subblock_stats.
"""Unified command that runs build_replacement_library followed by calc_subblock_stats.

This script combines the functionality of both commands into a single workflow:
1. First, it builds the replacement library for the puzzle
2. Then, it calculates subblock statistics

Usage:

python modelopt.torch._compress.build_library_and_stats.py --config-dir configs --config-name Llama-3_1-8B puzzle_dir=/path/to/puzzle/dir dataset_path=/path/to/dataset
python -m modelopt.torch.puzzletron.build_library_and_stats --config-dir configs --config-name Llama-3_1-8B puzzle_dir=/path/to/puzzle/dir dataset_path=/path/to/dataset

The script uses the same Hydra configuration as the individual commands and supports
all the same configuration parameters for both build_replacement_library and calc_subblock_stats.
"""

import hydra
from omegaconf import DictConfig

from modelopt.torch._compress.replacement_library.build_replacement_library import (
from modelopt.torch.puzzletron.replacement_library.build_replacement_library import (
launch_build_replacement_library,
)
from modelopt.torch._compress.subblock_stats.calc_subblock_stats import launch_calc_subblock_stats
from modelopt.torch._compress.tools.hydra_utils import register_hydra_resolvers
from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch._compress.utils.parsing import format_global_config
from modelopt.torch.puzzletron.subblock_stats.calc_subblock_stats import launch_calc_subblock_stats
from modelopt.torch.puzzletron.tools.logger import mprint


def launch_build_library_and_stats(cfg: DictConfig) -> None:
"""
Launch both build_replacement_library and calc_subblock_stats in sequence.
"""Launch both build_replacement_library and calc_subblock_stats in sequence.

Args:
cfg: Hydra configuration containing settings for both commands
Original file line number Diff line number Diff line change
@@ -19,7 +19,7 @@
import fire
import numpy as np

from modelopt.torch._compress.tools.logger import mprint
from modelopt.torch.puzzletron.tools.logger import mprint


def process_and_save_dataset(