Declarative MTP-Llama head stack + Gemma4 weight aggregation#528
Merged
Conversation
Express the embeddings/decoder weight mapping as NestedWeightConverter declarations in _create_weight_converters and reduce get_converters to the canonical emit+head form, mirroring LlamaBaseModelConverter. No change to emitted converters. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erter Add RepeatWeightConverter, a structural primitive that fans a sub-section converter over a config-driven count with index-templated prefixes (reusing one sub-config per iteration, unlike BlockSequenceWeightConverter which walks a per-position config list). Use it to declare MTP-Llama's per-prediction-distance blocks and norms on the base-model converter (where they belong: they live at the model root as multi_token_prediction.*, not under head), dropping the imperative get_converters override on the head converter. Emitted converters are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The top-level `from transformers import Gemma4ForCausalLM, Gemma4TextConfig` (added with Gemma 4 support in #492) makes the entire test_hf_roundtrip module fail to collect on transformers builds without Gemma 4 — taking the llama/mistral/qwen2/mixtral/mtp_llama round-trips down with it. Guard the import (matching the try/except pattern used elsewhere in tests) and skip any round-trip case whose model class is unavailable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ness The HF-coverage check rejected configs from supported transformers builds: the hand-curated _HF_METADATA_ALLOWLIST omits generic PretrainedConfig fields (torchscript and the generation kwargs) that v4 dumps into to_dict() but v5 moved out to GenerationConfig. With setup.cfg pinning transformers>=4.57.3,<6.0.0, every format's HF import failed on v4 with "unknown key 'torchscript'". Union the static allowlist with the live PretrainedConfig().to_dict() key set — every key a bare PretrainedConfig carries is generic metadata, never architecture — so the check adapts across the whole supported range instead of pinning a version-specific set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ig class Conversion tests instantiate the format's transformers config class; on supported-but-older transformers builds the class may not exist (e.g. no Gemma 4 before v5), making test_conversion[gemma4] error with `module transformers has no attribute Gemma4TextConfig`. Mirror the existing requires_cuda env-skip: in testing_group_enabled, skip the convert group when the class can't be imported, so the suite stays green across the supported transformers range (>=4.57,<6.0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dConfig union The static _HF_METADATA_ALLOWLIST kept 13 generic fields (architectures, model_type, dtype, …) that a bare PretrainedConfig().to_dict() carries on every supported transformers version, so the dynamic union already covers them. Drop those; keep only what the union misses across the range: auto_map / torch_dtype / use_cache (absent from a bare config on both v4 and v5), the token ids (absent from a bare v5 config), and the model-specific init/pretraining metadata. Behavior-preserving. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The example named a downstream consumer (prediction-distance head stack); the preceding sentence already conveys the generic intent (runtime count drives the repeat). Per the style guide, explanatory text should not reference specific consumers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Continues the declarative weight-conversion initiative (#523, #527): removes the last imperative
get_convertersoverrides that hand-rolled weight aggregation, and makes HF conversion work and stay green across the supported transformers range (setup.cfg:>=4.57.3,<6.0.0).Changes
Gemma4 weight aggregation → declarative.
Gemma4BaseModelConverterdeclares embeddings + decoder asNestedWeightConverters;get_converterscollapses to the canonicalemit + headform, mirroringLlamaBaseModelConverter. Emitted converters unchanged.RepeatWeightConverterprimitive + MTP-Llama migration. New structural primitive that fans a sub-section converter over a config-driven count with index-templated prefixes, reusing one sub-config per iteration (unlikeBlockSequenceWeightConverter, which walks a per-position config list). Used to declare MTP-Llama's per-prediction-distance blocks and norms on the base-model converter — their correct home, since they live at the model root asmulti_token_prediction.*, not underhead— dropping the imperativeget_convertersoverride on the head converter. Emitted converters unchanged.HF metadata allowlist derived from
PretrainedConfig. The coverage check rejected configs from supported transformers builds: the hand-curated_HF_METADATA_ALLOWLISTomits genericPretrainedConfigfields (torchscript+ generation kwargs) that v4 dumps intoto_dict()but v5 moved toGenerationConfig. So every format's HF import failed on v4 withunknown key 'torchscript'. Now unions the static allowlist with the livePretrainedConfig().to_dict()key set — generic-by-definition, never architecture — adapting across the whole supported range.Gate optional Gemma4 import + conditional convert-skip. The top-level
from transformers import Gemma4ForCausalLM, Gemma4TextConfig(from #492) madetest_hf_roundtripfail to collect on builds without Gemma 4; guarded behind try/except with per-case skip. Symmetrically,testing_group_enablednow skips theconvertgroup when the format's HF config class can't be imported (mirroring the existingrequires_cudaenv-skip), sotest_conversion[gemma4]skips on transformers v4 instead of erroring.Verification (both supported transformers versions)
test_converters.pywalkertest_hf_roundtrip.pytest_checkpoint.py(gpt formats)gemma4 conversion is exercised on v5 and cleanly skipped on v4 (no Gemma 4 class there). The
torchscriptfailures that previously hit every format — including untouchedllama— are gone.🤖 Generated with Claude Code