[NNX] NNX migration prep (7/N): NNX-native MaxEngine inference by ecnal-cienet · Pull Request #3821 · AI-Hypercomputer/maxtext

ecnal-cienet · 2026-05-05T19:33:03Z

NNX Migration Route Map

1. ✅ Add NNX scaffolding: pure_nnx flag, init_state_fn, TrainStateNNX, NNX utils. Linen workflow unchanged. (PR NNX migration prep (1/N): pure_nnx flag and init_state_fn scaffolding #3427)
2. ✅ NNX sharding utilities: get_abstract_state_nnx, get_named_sharding_nnx, set_named_sharding_nnx, get_partition_spec_nnx, get_mesh_from_config. (PR NNX migration prep (2/N): NNX utils and sharding utilities #3470)
3. ✅ NNX fully supported end-to-end: TrainStateNNX, model creation, gradient accumulation, checkpointing, and training loop dispatch. (PR NNX migration prep (3/N): TrainState, model creation, and end-to-end training loop #3500)
4. ✅ Sharding diagnostics on NNX, plus post-training bugfixes that surfaced once the NNX path got exercised end-to-end. (PR [NNX] NNX migration prep (4/N): sharding tools and post-training fixes #3652)
4.5. ✅ Linen↔NNX checkpoint converter. (PR [NNX] NNX migration prep (4.5/N): Linen<->NNX checkpoint converter #3843)
4.6. ❌ Linen↔NNX checkpoint comparator (sibling branch on PR4.5).
5. ✅ NNX correctness fixes, feature enablements, and vocab tiling on NNX.
6. ✅ NNX-native DPO.
7. 🔄 [This PR] NNX-native MaxEngine inference. Drops the route-to-Linen path in maxengine.py; pure_nnx=True now drives a real NNX flow (two-mode abstract Transformer, nnx.merge/nnx.state per JIT body, pure-dict cache adapter so bulk_insert etc. reuse Linen plumbing). Stacks on PR6. Carve-outs (NotImplementedError'd with PR pointers): NNX+quantization → PR9.5; NNX+LoRA → PR8; prefill_multisampling/prefill_concat/stack_prefill_result_cache=True → follow-up.
8. ❌ NNX-native LoRA + GRPO.
9. ❌ NNX-aware QK-Clip + remaining checkpoint utilities.
9.5. ❌ NNX + AQT in MaxEngine + serve-mode reload + gpt3 prefill fix.
10. ❌ Vocab tiling custom_vjp for NNX.
11. ❌ Set NNX defaults to True; regenerate sharding goldens; flip back integration-test pure_nnx=False annotations.
12. ❌ Delete Linen-specific code paths and NNX compatibility flags.

Description

PR5 routed maxengine.py inference to the Linen implementation regardless of pure_nnx, so pure_nnx=True users silently got the Linen engine. This PR replaces that route with a real NNX flow.

When config.pure_nnx=True, the engine builds an NNX Transformer, splits it into (params, cache, rest) with nnx.split, and at every JIT body reconstructs the model with nnx.merge(..., copy=True) to run the forward pass. Linen is preserved byte-for-byte: every NNX edit is gated if config.pure_nnx:, and pure_nnx=False stays the default.

Diff: +404 / −33 across 3 files. The only deletion is the "use Linen path regardless of pure_nnx" comment block in MaxEngine.__init__.

Design notes

Two abstract models. __init__ builds self.model (PREFILL, batch 1) and self.model_ar (AUTOREGRESSIVE, decode-state batch). NNX cache vars take their logical axis names (CACHE_BATCH vs CACHE_BATCH_PREFILL) from the construction model_mode; the decode-state cache must use AR-mode names so bulk_insert's cache_batch substring lookup hits. nnx.eval_shape is called directly (not via create_nnx_abstract_model) to avoid the jax.set_mesh wrap that trips Flax 0.12.6 on logical-only axes like "norm".
Cache adapter. The engine's cache helpers (bulk_insert, _insert_jit, _maybe_*_prefill_result_cache) walk the cache by path[-1].key. NNX nnx.Cache state would expose a .value accessor there, so the cache flows as a plain dict on both paths: nnx.State.to_pure_dict() after the run, nnx.replace_by_pure_dict() before nnx.merge. The Linen helpers stay unchanged.
Param loading (_load_params_nnx): accepts user-provided NNX params or loads via from_pretrained, materializes the rest (RNG) state once for nnx.merge, and builds the prefill/AR KV-cache shardings.
Annotation helpers (maxtext_utils.py): get_prefill_kv_cache_annotations_nnx / get_kv_cache_annotations_nnx mirror the Linen helpers' return shape (a tree of PartitionSpec).
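
To illustrate why the cache adapter keeps the cache as a plain dict, here is a small sketch of a helper that walks a nested dict with `jax.tree_util.tree_map_with_path` and switches on `path[-1].key`. The helper name `insert_prefill_into_slot`, the layer name, and the shapes are assumptions made for the example; it is not the engine's `bulk_insert`.

```python
import jax
import jax.numpy as jnp


def insert_prefill_into_slot(decode_cache, prefill_cache, slot):
  """Copies each batch-1 prefill cache leaf into row `slot` of the decode cache."""

  def _copy(path, decode_leaf, prefill_leaf):
    # For a plain nested dict the last path entry is a DictKey, so .key is the
    # variable name itself (for example "cached_prefill_key").
    name = path[-1].key
    if name.startswith("cached_"):
      return decode_leaf.at[slot].set(prefill_leaf[0])
    return decode_leaf  # leave non-cache bookkeeping leaves untouched

  return jax.tree_util.tree_map_with_path(_copy, decode_cache, prefill_cache)


decode_cache = {
    "layer_0": {
        "cached_prefill_key": jnp.zeros((4, 2)),
        "cache_index": jnp.zeros((4,), dtype=jnp.int32),
    }
}
prefill_cache = {
    "layer_0": {
        "cached_prefill_key": jnp.ones((1, 2)),
        "cache_index": jnp.zeros((1,), dtype=jnp.int32),
    }
}

updated = insert_prefill_into_slot(decode_cache, prefill_cache, slot=2)
```

If the leaves were wrapped in variable objects, the final path entry would no longer be the variable name, which is the mismatch the pure-dict conversion avoids.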

Carve-outs

These raise NotImplementedError on the NNX path — deliberate scope cuts, not silent fallbacks. Each points at its follow-up; the Linen path (pure_nnx=False) remains the workaround.

| Site | Tracked in |
| --- | --- |
| quantization (`quantize_params`, quantized-checkpoint load) | PR9.5 |
| LoRA (`load_single_adapter`) | PR8 |
| `prefill_multisampling`, `prefill_concat`, `stack_prefill_result_cache=True` | PR7.5 |
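
The "raise instead of silently falling back" behaviour can be sketched as a small guard. Everything here is hypothetical: `EngineConfig`, its field names, and `check_nnx_carve_outs` are illustrative and do not come from maxengine.py. LoRA is left out because, per the table, it is rejected at the `load_single_adapter` call site rather than from a config flag.

```python
from dataclasses import dataclass


@dataclass
class EngineConfig:
  """Hypothetical subset of the engine config used only for this sketch."""
  pure_nnx: bool = False
  quantization: str = ""
  prefill_multisampling: bool = False
  prefill_concat: bool = False
  stack_prefill_result_cache: bool = False


# (config field, PR that tracks the follow-up), taken from the table above.
_CARVE_OUTS = (
    ("quantization", "PR9.5"),
    ("prefill_multisampling", "PR7.5"),
    ("prefill_concat", "PR7.5"),
    ("stack_prefill_result_cache", "PR7.5"),
)


def check_nnx_carve_outs(config: EngineConfig) -> None:
  """Raises on the NNX path for unsupported features; the Linen path is never blocked."""
  if not config.pure_nnx:
    return
  for field, tracking_pr in _CARVE_OUTS:
    if getattr(config, field):
      raise NotImplementedError(
          f"pure_nnx=True with {field} is not supported yet "
          f"(tracked in {tracking_pr}); use pure_nnx=False as the workaround."
      )


check_nnx_carve_outs(EngineConfig(pure_nnx=False, quantization="int8"))  # Linen: allowed
try:
  check_nnx_carve_outs(EngineConfig(pure_nnx=True, quantization="int8"))
  message = ""
except NotImplementedError as err:
  message = str(err)
```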

Tests

$ pytest tests/integration/maxengine_test.py
$ pytest tests/unit/maxengine_test.py -v

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-05-05T20:58:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


…e.py)

PR5 audited maxengine.py and routed the inference path to the Linen implementation regardless of pure_nnx, with a comment block explaining that "the flag affects training, not inference serving." That kept the Linen serving path unchanged but meant pure_nnx=True users silently got the Linen engine.

This change replaces the route with a real NNX flow: when config.pure_nnx=True, the engine builds an NNX Transformer, splits out (params, cache, rest) with nnx.split, and at every JIT body merges the model concretely with nnx.merge to run the forward pass. Linen is preserved byte-for-byte; every NNX edit is gated `if config.pure_nnx:` and pure_nnx=False is still the default.

maxengine.py (__init__):
- Build two abstract NNX Transformers on the NNX path: self.model with model_mode=PREFILL (batch=1, single padded prompt) and self.model_ar with model_mode=AUTOREGRESSIVE (batch=micro_batch_size_to_train_on, decode_state shape). Both are needed because NNX cache vars inherit CACHE_BATCH_PREFILL vs CACHE_BATCH from the construction model_mode, and bulk_insert searches for the substring "cache_batch" in the AR-mode logical-axes tuple. nnx.eval_shape is called directly inside nn_partitioning.axis_rules rather than through create_nnx_abstract_model to avoid the jax.set_mesh wrap that trips Flax 0.12.6 on logical-only axes like "norm" (same reason get_abstract_state_nnx avoids set_mesh).
- Cache the graphdef from a 3-way nnx.split(Param, Cache, ...) so JIT bodies can pass (params, cache, rest) separately to nnx.merge. The rest slot (RNG vars etc.) is materialized concretely in load_params.

maxengine.py (cache adapter + _nnx_run_model):
- bulk_insert / _insert_jit / _maybe_*_prefill_result_cache walk the cache via tree_map_with_path and switch on path[-1].key (the cache variable name like "cached_prefill_key"). Linen mutable cache is a plain nested dict. NNX Cache state would expose a ".value" accessor at that position. Bridge via nnx.State.to_pure_dict() (after the model run) and nnx.replace_by_pure_dict (before nnx.merge), so the cache plumbing helpers see the same shape on both paths.
- Add _nnx_run_model: nnx.merge(graphdef, params, cache, rest, copy=True) -> model(...) -> nnx.state(model, nnx.Cache).to_pure_dict(). copy=True avoids reusing Variable objects across traces (TraceContextError), mirroring train.py's diff_wrapper workaround.
- Add _nnx_cache_state_template / _nnx_init_cache_dict helpers parametrised by mode so prefill (batch 1) and decode_state (batch N) pull from the right abstract model.

maxengine.py (load_params):
- New _load_params_nnx: accepts user-provided NNX-shape params or loads via from_pretrained. For user-provided params, materializes a concrete model once via _create_model_fn() to capture a real rest state for nnx.merge (wasteful but simple; the from_pretrained branch avoids this). Refreshes self.graphdef from the concrete model so subsequent merges line up exactly.
- Builds self.abstract_params, populates self.prefill_kv_cache_annotations and self.kv_cache_annotations (using model_ar for the latter so bulk_insert's substring lookup hits), wraps both into NamedSharding.
- pure_nnx + quantization, pure_nnx + LoRA, pure_nnx + stack_prefill_result_cache=True, pure_nnx + prefill_multisampling, and pure_nnx + prefill_concat raise NotImplementedError for now; the Linen path is the workaround. AOT compilation (aot_compile / _compile_generate_and_get_layouts) is not gated and may work as-is; not exercised by tests yet.

maxengine.py (init_decode_state, _prefill_jit, _generate_jit):
- _init_decode_state_nnx zero-initializes a pure-dict cache from model_ar (so the leading batch dim matches generate's input shape) and builds kv_cache_annotations_named per leaf by reading nnx.Cache.metadata. Tries "out_sharding", "sharding", and "sharding_names" because Flax 0.12.6 renamed these.
- _prefill_jit / _generate_jit add an `if config.pure_nnx:` branch that calls _nnx_run_model in place of self.model.apply with mutable=["cache"]. existing_prefix.cache is threaded as a pure-dict cache directly (no params|{"cache":...} dict-merge, because params is an nnx.State, not a dict).

maxtext_utils.py:
- New get_prefill_kv_cache_annotations_nnx / get_kv_cache_annotations_nnx that mirror the Linen helpers' return shape (per-leaf PartitionSpec tree). Both delegate to _nnx_cache_partition_specs which extracts nnx.Cache state via nnx.split, calls get_nnx_named_sharding_with_scan_axis inside nn_partitioning.axis_rules so logical axes ("layers", "cache_batch", "norm", ...) resolve to physical mesh axes, and converts the result to a pure-dict tree.

tests/unit/maxengine_test.py:
- New tests: test_init_nnx, test_basic_prefill_nnx (with NaN/inf and per-layer cache shape checks), test_basic_decode_nnx (4-step generate with next_pos advancement check), test_quantize_raises_for_nnx, test_lora_raises_for_nnx.
- New test_linen_nnx_parity_prefill: bridges Linen-init params into the NNX engine via linen_nnx_converter (convert_linen_to_nnx -> _strip_value_wrappers -> nnx.replace_by_pure_dict) and asserts the NNX engine's prefill matches Linen on the same weights: logits within bf16 tolerance (rtol=0.05, atol=0.1; the test config uses bf16 compute) and exact greedy first-token argmax.
- Existing Linen tests untouched.

Test summary: 9 passed, 1 skipped (test_chunked_prefill is a pre-existing CPU-only skip). bash lint.sh: codespell + pylint + pyink all green.
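
The parity criterion quoted for test_linen_nnx_parity_prefill (rtol=0.05, atol=0.1, plus an exact greedy argmax match) can be expressed as a short check. This is a sketch assuming the logits are available as arrays; `assert_prefill_parity` and the sample values are hypothetical and are not taken from the test file.

```python
import jax.numpy as jnp
import numpy as np


def assert_prefill_parity(linen_logits, nnx_logits, rtol=0.05, atol=0.1):
  """Checks bf16-tolerant closeness and an identical greedy token choice."""
  # Upcast to float32 first so the comparison itself adds no bf16 rounding.
  ref = np.asarray(jnp.asarray(linen_logits, dtype=jnp.float32))
  got = np.asarray(jnp.asarray(nnx_logits, dtype=jnp.float32))
  np.testing.assert_allclose(got, ref, rtol=rtol, atol=atol)
  # Greedy decoding picks the argmax, so the two paths must agree exactly here.
  assert np.array_equal(got.argmax(axis=-1), ref.argmax(axis=-1))


# Illustrative values only: the "NNX" logits are the reference rounded to bf16.
linen_logits = jnp.array([0.1, 2.5, -1.0, 0.7], dtype=jnp.float32)
nnx_logits = linen_logits.astype(jnp.bfloat16)
assert_prefill_parity(linen_logits, nnx_logits)
```

The loose tolerance absorbs bf16 rounding, while the argmax check guards the quantity that affects the generated token.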

ecnal-cienet mentioned this pull request May 6, 2026

[NNX] NNX migration prep (8/N): NNX native lora grpo #3824

Open

4 tasks

ecnal-cienet force-pushed the feat/nnx-native-maxengine branch 2 times, most recently from 48da456 to 4dc3ae2 Compare May 7, 2026 04:38

ecnal-cienet mentioned this pull request May 7, 2026

[NNX] NNX migration prep (9/N): NNX-aware QK-Clip + checkpoint utilities #3836

Draft

4 tasks

ecnal-cienet force-pushed the feat/nnx-native-maxengine branch from 4dc3ae2 to 0dbc411 Compare May 7, 2026 19:49

ecnal-cienet mentioned this pull request May 7, 2026

[NNX] NNX migration prep (9.5/N): NNX + AQT in MaxEngine + serve-mode reload + gpt3 prefill fix #3844

Draft

4 tasks

ecnal-cienet force-pushed the feat/nnx-native-maxengine branch from 0dbc411 to 7e9d8d1 Compare May 7, 2026 21:47

ecnal-cienet mentioned this pull request May 8, 2026

[NNX] NNX migration prep (10/N): vocab tiling custom_vjp with output-head carve-out #3849

Draft

4 tasks

ecnal-cienet force-pushed the feat/nnx-native-maxengine branch 11 times, most recently from 9189f6f to cfc5fc4 Compare May 14, 2026 22:51

ecnal-cienet mentioned this pull request May 15, 2026

[NNX] NNX migration (11/N): set pure_nnx / enable_nnx / pure_nnx_decoder defaults to True #3526

Draft

4 tasks

ecnal-cienet force-pushed the feat/nnx-native-maxengine branch 9 times, most recently from 65c1910 to e86b140 Compare May 22, 2026 21:09

ecnal-cienet mentioned this pull request May 26, 2026

[NNX] NNX migration prep (7.5/N): finish MaxEngine inference carve-outs #3984

Open

4 tasks

ecnal-cienet force-pushed the feat/nnx-native-maxengine branch from 40714e4 to 09200fa Compare May 26, 2026 22:29

ecnal-cienet wants to merge 1 commit into main from feat/nnx-native-maxengine
