[NNX] NNX migration prep (7/N): NNX-native MaxEngine inference #3821
Open
ecnal-cienet wants to merge 1 commit into
Codecov Report: ✅ All modified and coverable lines are covered by tests.
…e.py)
PR5 audited maxengine.py and routed the inference path to the Linen
implementation regardless of pure_nnx, with a comment block explaining
that "the flag affects training, not inference serving." That kept the
Linen serving path unchanged but meant pure_nnx=True users silently got
the Linen engine. This change replaces the route with a real NNX flow:
when config.pure_nnx=True, the engine builds an NNX Transformer, splits
out (params, cache, rest) with nnx.split, and at every JIT body merges
the model concretely with nnx.merge to run the forward pass. Linen is
preserved byte-for-byte; every NNX edit is gated `if config.pure_nnx:`
and pure_nnx=False is still the default.
maxengine.py (__init__):
- Build two abstract NNX Transformers on the NNX path: self.model with
model_mode=PREFILL (batch=1, single padded prompt) and self.model_ar
with model_mode=AUTOREGRESSIVE (batch=micro_batch_size_to_train_on,
decode_state shape). Both are needed because NNX cache vars inherit
CACHE_BATCH_PREFILL vs CACHE_BATCH from the construction model_mode,
and bulk_insert searches for the substring "cache_batch" in the
AR-mode logical-axes tuple. nnx.eval_shape is called directly inside
nn_partitioning.axis_rules rather than through create_nnx_abstract_model
to avoid the jax.set_mesh wrap that trips Flax 0.12.6 on logical-only
axes like "norm" (same reason get_abstract_state_nnx avoids set_mesh).
- Cache the graphdef from a 3-way nnx.split(Param, Cache, ...) so JIT
bodies can pass (params, cache, rest) separately to nnx.merge. The
rest slot (RNG vars etc.) is materialized concretely in load_params.
maxengine.py (cache adapter + _nnx_run_model):
- bulk_insert / _insert_jit / _maybe_*_prefill_result_cache walk the
cache via tree_map_with_path and switch on path[-1].key (the cache
variable name like "cached_prefill_key"). Linen mutable cache is a
plain nested dict. NNX Cache state would expose a ".value" accessor
at that position. Bridge via nnx.State.to_pure_dict() (after the
model run) and nnx.replace_by_pure_dict (before nnx.merge), so the
cache plumbing helpers see the same shape on both paths.
- Add _nnx_run_model: nnx.merge(graphdef, params, cache, rest, copy=True)
-> model(...) -> nnx.state(model, nnx.Cache).to_pure_dict(). copy=True
avoids reusing Variable objects across traces (TraceContextError),
mirroring train.py's diff_wrapper workaround.
- Add _nnx_cache_state_template / _nnx_init_cache_dict helpers
parametrised by mode so prefill (batch 1) and decode_state (batch N)
pull from the right abstract model.
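The path-keyed dispatch above can be illustrated without JAX. The sketch below is a pure-Python stand-in for tree_map_with_path, not MaxEngine code: the helpers map_with_path and strip_wrappers, the toy cache contents, and the name "some_other_var" are all hypothetical. It shows why an extra trailing level hides the variable name from helpers that switch on the last path element, and why flattening to a plain dict restores the shape they expect.

```python
def map_with_path(fn, tree, path=()):
    """Apply fn(path, leaf) to every leaf of a nested dict (hypothetical helper)."""
    if isinstance(tree, dict):
        return {k: map_with_path(fn, v, path + (k,)) for k, v in tree.items()}
    return fn(path, tree)


def tag_by_name(path, leaf):
    # The cache helpers dispatch on the last path element (the variable name).
    if path[-1] == "cached_prefill_key":
        return ("prefill_key", leaf)
    return ("other", leaf)


def strip_wrappers(tree):
    """Collapse {"value": x} wrappers into x so the last path element is the name."""
    if isinstance(tree, dict):
        if set(tree) == {"value"}:
            return tree["value"]
        return {k: strip_wrappers(v) for k, v in tree.items()}
    return tree


# Plain nested dict: the shape the Linen helpers expect.
plain_cache = {"layer_0": {"cached_prefill_key": [1, 2], "some_other_var": [3]}}

# Toy model of a wrapped cache: every variable adds a trailing "value" level.
wrapped_cache = {
    "layer_0": {
        "cached_prefill_key": {"value": [1, 2]},
        "some_other_var": {"value": [3]},
    }
}

tagged_plain = map_with_path(tag_by_name, plain_cache)
tagged_wrapped = map_with_path(tag_by_name, wrapped_cache)
tagged_bridged = map_with_path(tag_by_name, strip_wrappers(wrapped_cache))
```

On the wrapped tree the dispatch sees "value" as the last path element and never matches the variable name; after stripping the wrappers the result is identical to the plain-dict case.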
maxengine.py (load_params):
- New _load_params_nnx: accepts user-provided NNX-shape params or loads
via from_pretrained. For user-provided params, materializes a concrete
model once via _create_model_fn() to capture a real rest state for
nnx.merge (wasteful but simple; the from_pretrained branch avoids
this). Refreshes self.graphdef from the concrete model so subsequent
merges line up exactly.
- Builds self.abstract_params, populates self.prefill_kv_cache_annotations
and self.kv_cache_annotations (using model_ar for the latter so
bulk_insert's substring lookup hits), wraps both into NamedSharding.
- pure_nnx + quantization, pure_nnx + LoRA, pure_nnx +
stack_prefill_result_cache=True, pure_nnx + prefill_multisampling,
and pure_nnx + prefill_concat raise NotImplementedError for now;
the Linen path is the workaround. AOT compilation
(aot_compile / _compile_generate_and_get_layouts) is not gated and
may work as-is; not exercised by tests yet.
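A minimal sketch of this fail-loudly pattern, under stated assumptions: EngineConfig is a hypothetical stand-in for the real config object, and the quantization and lora_enabled fields are illustrative names only. In the actual change the errors are raised at the relevant call sites rather than in one central check.

```python
from dataclasses import dataclass


@dataclass
class EngineConfig:
    # Hypothetical stand-in for the real config. pure_nnx and
    # stack_prefill_result_cache are names from the commit message;
    # quantization and lora_enabled are illustrative only.
    pure_nnx: bool = False
    quantization: str = ""
    lora_enabled: bool = False
    stack_prefill_result_cache: bool = False


def unsupported_nnx_features(config):
    """Return the requested features that the NNX path does not cover yet."""
    if not config.pure_nnx:
        return []  # The Linen path handles all of these.
    found = []
    if config.quantization:
        found.append("quantization")
    if config.lora_enabled:
        found.append("LoRA")
    if config.stack_prefill_result_cache:
        found.append("stack_prefill_result_cache=True")
    return found


def require_nnx_support(config):
    """Raise instead of silently falling back to another engine."""
    found = unsupported_nnx_features(config)
    if found:
        raise NotImplementedError(
            "pure_nnx=True does not support: "
            + ", ".join(found)
            + "; use pure_nnx=False (Linen) as the workaround."
        )
```

Raising here, rather than quietly routing to Linen, avoids repeating the PR5 situation in which pure_nnx=True users got a different engine than they asked for.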
maxengine.py (init_decode_state, _prefill_jit, _generate_jit):
- _init_decode_state_nnx zero-initializes a pure-dict cache from
model_ar (so the leading batch dim matches generate's input shape)
and builds kv_cache_annotations_named per leaf by reading
nnx.Cache.metadata. Tries "out_sharding", "sharding", and
"sharding_names" because Flax 0.12.6 renamed these.
- _prefill_jit / _generate_jit add an `if config.pure_nnx:` branch
that calls _nnx_run_model in place of self.model.apply with
mutable=["cache"]. existing_prefix.cache is threaded as a pure-dict
cache directly (no params|{"cache":...} dict-merge — params is an
nnx.State, not a dict).
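The metadata lookup in _init_decode_state_nnx amounts to trying several key names in order. A minimal sketch, assuming the metadata is available as a plain mapping; read_sharding_metadata is a hypothetical helper name, and only the three key strings come from the commit message:

```python
def read_sharding_metadata(
    metadata, keys=("out_sharding", "sharding", "sharding_names")
):
    """Return the first sharding entry present in metadata, or None.

    The key order follows the commit message; the function name and the
    plain-dict metadata are assumptions made for illustration.
    """
    for key in keys:
        value = metadata.get(key)
        if value is not None:
            return value
    return None


# Example metadata under two different key names.
older_style = {"sharding": ("layers", "cache_batch")}
newer_style = {"out_sharding": ("layers", "cache_batch")}
no_sharding = {"dtype": "bfloat16"}

axes_old = read_sharding_metadata(older_style)
axes_new = read_sharding_metadata(newer_style)
axes_none = read_sharding_metadata(no_sharding)
```

Both spellings resolve to the same logical-axes tuple, so the per-leaf annotation code does not need to know which Flax version produced the metadata.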
maxtext_utils.py:
- New get_prefill_kv_cache_annotations_nnx / get_kv_cache_annotations_nnx
that mirror the Linen helpers' return shape (per-leaf PartitionSpec
tree). Both delegate to _nnx_cache_partition_specs which extracts
nnx.Cache state via nnx.split, calls
get_nnx_named_sharding_with_scan_axis inside
nn_partitioning.axis_rules so logical axes ("layers", "cache_batch",
"norm", ...) resolve to physical mesh axes, and converts the result
to a pure-dict tree.
tests/unit/maxengine_test.py:
- New tests: test_init_nnx, test_basic_prefill_nnx (with NaN/inf and
per-layer cache shape checks), test_basic_decode_nnx (4-step generate
with next_pos advancement check), test_quantize_raises_for_nnx,
test_lora_raises_for_nnx.
- New test_linen_nnx_parity_prefill: bridges Linen-init params into
the NNX engine via linen_nnx_converter (convert_linen_to_nnx ->
_strip_value_wrappers -> nnx.replace_by_pure_dict) and asserts the
NNX engine's prefill matches Linen on the same weights — logits
within bf16 tolerance (rtol=0.05, atol=0.1; the test config uses
bf16 compute) and exact greedy first-token argmax.
- Existing Linen tests untouched.
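The parity criterion can be written out in plain Python. This sketch assumes the tolerance follows the usual NumPy convention, |actual - reference| <= atol + rtol * |reference|; the logit values are made up for illustration and are not test output.

```python
def within_tolerance(actual, reference, rtol, atol):
    """Element-wise closeness check using the NumPy allclose convention."""
    if len(actual) != len(reference):
        return False
    return all(
        abs(a - r) <= atol + rtol * abs(r) for a, r in zip(actual, reference)
    )


def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits standing in for the two engines' prefill outputs.
linen_logits = [0.10, 2.50, -1.00, 0.75]
nnx_logits = [0.12, 2.45, -1.03, 0.80]

logits_match = within_tolerance(nnx_logits, linen_logits, rtol=0.05, atol=0.1)
same_first_token = argmax(nnx_logits) == argmax(linen_logits)
```

The tolerance absorbs bf16 rounding differences between the two implementations, while the exact argmax check ensures those differences do not change the greedy first token.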
Test summary: 9 passed, 1 skipped (test_chunked_prefill is a
pre-existing CPU-only skip). bash lint.sh: codespell + pylint + pyink
all green.
NNX Migration Route Map
pure_nnx flag, init_state_fn, TrainStateNNX, NNX utils. Linen workflow unchanged. (PR NNX migration prep (1/N): pure_nnx flag and init_state_fn scaffolding #3427)
get_abstract_state_nnx, get_named_sharding_nnx, set_named_sharding_nnx, get_partition_spec_nnx, get_mesh_from_config. (PR NNX migration prep (2/N): NNX utils and sharding utilities #3470)
TrainStateNNX, model creation, gradient accumulation, checkpointing, and training loop dispatch. (PR NNX migration prep (3/N): TrainState, model creation, and end-to-end training loop #3500)
4.5. ✅ Linen↔NNX checkpoint converter. (PR [NNX] NNX migration prep (4.5/N): Linen<->NNX checkpoint converter #3843)
4.6. ❌ Linen↔NNX checkpoint comparator (sibling branch on PR4.5).
maxengine.py; pure_nnx=True now drives a real NNX flow (two-mode abstract Transformer, nnx.merge / nnx.state per JIT body, pure-dict cache adapter so bulk_insert etc. reuse Linen plumbing). Stacks on PR6. Carve-outs (NotImplementedError'd with PR pointers): NNX+quantization → PR9.5; NNX+LoRA → PR8; prefill_multisampling / prefill_concat / stack_prefill_result_cache=True → follow-up.
9.5. ❌ NNX + AQT in MaxEngine + serve-mode reload + gpt3 prefill fix.
custom_vjp for NNX.
True; regenerate sharding goldens; flip back integration-test pure_nnx=False annotations.
Description
PR5 routed maxengine.py inference to the Linen implementation regardless of pure_nnx, so pure_nnx=True users silently got the Linen engine. This PR replaces that route with a real NNX flow.
When config.pure_nnx=True, the engine builds an NNX Transformer, splits it into (params, cache, rest) with nnx.split, and at every JIT body reconstructs the model with nnx.merge(..., copy=True) to run the forward pass. Linen is preserved byte-for-byte: every NNX edit is gated `if config.pure_nnx:`, and pure_nnx=False stays the default.
Diff: +404 / −33 across 3 files. The only deletion is the "use Linen path regardless of pure_nnx" comment block in MaxEngine.__init__.
Design notes
- __init__ builds self.model (PREFILL, batch 1) and self.model_ar (AUTOREGRESSIVE, decode-state batch). NNX cache vars take their logical axis names (CACHE_BATCH vs CACHE_BATCH_PREFILL) from the construction model_mode; the decode-state cache must use AR-mode names so bulk_insert's cache_batch substring lookup hits. nnx.eval_shape is called directly (not via create_nnx_abstract_model) to avoid the jax.set_mesh wrap that trips Flax 0.12.6 on logical-only axes like "norm".
- The cache helpers (bulk_insert, _insert_jit, _maybe_*_prefill_result_cache) walk the cache by path[-1].key. NNX nnx.Cache state would expose a .value accessor there, so the cache flows as a plain dict on both paths: nnx.State.to_pure_dict() after the run, nnx.replace_by_pure_dict() before nnx.merge. The Linen helpers stay unchanged.
- load_params (_load_params_nnx): accepts user-provided NNX params or loads via from_pretrained, materializes the rest (RNG) state once for nnx.merge, and builds the prefill/AR KV-cache shardings.
- maxtext_utils.py: get_prefill_kv_cache_annotations_nnx / get_kv_cache_annotations_nnx mirror the Linen helpers' return shape (a tree of PartitionSpec).
Carve-outs
These raise NotImplementedError on the NNX path. They are deliberate scope cuts, not silent fallbacks. Each points at its follow-up; the Linen path (pure_nnx=False) remains the workaround.
- Quantization (quantize_params, quantized-checkpoint load): PR9.5
- LoRA (load_single_adapter): PR8
- prefill_multisampling, prefill_concat, stack_prefill_result_cache=True: follow-up