Add GGUF conversion support for MOSS-TTS-Local (1.7B) #6
Open
obirije wants to merge 5 commits into OpenMOSS:main from
Conversation
The MossTTSDelayModel converter currently supports only the 8B Delay model. This adds tensor mappings for the Local variant's additional components:

- 4-layer local transformer (attention + FFN)
- 33 local-to-speech bridge MLPs (indexed per codebook)
- 1 speech-to-local bridge MLP
- 33 layer norms before the audio LM heads
- Embedding offset fix: model.embedding_list.0 is the text embedding (skipped); indices 1-32 map to the audio codebooks

Changes:

- constants.py: 19 new MODEL_TENSOR entries, GGUF name mappings, and an architecture-list update
- convert_hf_to_gguf.py: ~60 lines in modify_tensors() for Local tensor mapping

Tested: successfully converts OpenMOSS-Team/MOSS-TTS-Local-Transformer to GGUF and quantizes to Q2_K (1.1 GB), Q3_K_M (1.4 GB), and Q4_K_M (1.9 GB). Audio quality was verified via PyTorch inference on the original safetensors.
…t path

All 555 tensors from the Local 1.7B GGUF now load successfully. Bridge MLPs (speech_to_local + local_to_speech) and audio layer norms are wired into the inference graph. Produces audio tokens end-to-end.

Still TODO:

- Local transformer (4 layers) forward pass in the graph
- Sequential autoregressive channel loop in moss-tts.cpp (each channel feeds back into the local transformer before the next channel)
- Without these, audio quality is degraded (channels lack coherence)

Files changed:

- llama-arch.h: 19 new LLM_TENSOR enum entries
- llama-arch.cpp: tensor name strings + info mappings + xid format handling
- llama-model.h: local_layers struct + bridge MLP fields on llama_model
- llama-model.cpp: tensor loading with correct dimensions (local_dim = 1536)
- moss-tts-delay.cpp: Local variant output path with bridge MLPs + audio LNs
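The autoregressive channel loop in the TODO list can be illustrated with a toy NumPy sketch. Everything here is a stand-in under assumed shapes: the real pipeline uses SwiGLU bridge MLPs and a 4-layer local transformer, while this sketch uses plain linear layers and a trivial `local_transformer` stub purely to show the feedback structure (sample a token for channel N, re-embed it, and feed it back before channel N+1).

```python
# Toy sketch of the per-channel autoregressive loop (all dims and the
# stand-in local_transformer are illustrative, not the real model's).
import numpy as np

rng = np.random.default_rng(0)
n_channels, backbone_dim, local_dim, vocab = 4, 8, 6, 16

speech_to_local = rng.normal(size=(local_dim, backbone_dim))
local_to_speech = [rng.normal(size=(backbone_dim, local_dim)) for _ in range(n_channels)]
lm_heads = [rng.normal(size=(vocab, backbone_dim)) for _ in range(n_channels)]
audio_embd = [rng.normal(size=(vocab, local_dim)) for _ in range(n_channels)]

def local_transformer(seq):
    # Stand-in for the 4-layer local transformer: collapse the growing
    # local sequence to a single hidden state.
    return seq.mean(axis=0)

hidden = rng.normal(size=backbone_dim)   # backbone hidden state for one frame
seq = [speech_to_local @ hidden]         # growing local-dim sequence
tokens = []
for ch in range(n_channels):
    h_local = local_transformer(np.stack(seq))
    h = local_to_speech[ch] @ h_local    # bridge back to backbone dim
    logits = lm_heads[ch] @ h            # per-channel LM head (norm omitted)
    tok = int(np.argmax(logits))         # greedy "sampling" for the sketch
    tokens.append(tok)
    seq.append(audio_embd[ch][tok])      # re-embed: feeds back before next channel
```

The key property is that channel N+1 sees the sampled token of channel N through `seq`, which is exactly what the parallel output path lacks.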
All Local variant weights now participate in inference:

- speech_to_local SwiGLU MLP (backbone → local dim)
- 4-layer local transformer FFN pass (attention skipped for the static case)
- local_to_speech SwiGLU MLPs per channel (local → backbone dim)
- audio layer norms per channel
- per-channel audio LM heads

Audio output improves (96 generated frames vs. 37 without the local FFN). It is still short because channels are produced in parallel from the same hidden state. Full quality requires the autoregressive channel loop in moss-tts.cpp, where each channel's sampled token feeds back through speech_to_local → local_transformer before the next channel is processed.
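The SwiGLU bridge MLPs mentioned above follow the standard gated pattern. A minimal NumPy sketch, assuming conventional gate/up/down weight names (the actual tensor names come from the PR's mappings, not this sketch):

```python
# Minimal sketch of a SwiGLU bridge MLP: down(silu(x @ gate) * (x @ up)).
# Weight names and shapes are illustrative assumptions.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Project x through a SwiGLU MLP, e.g. backbone dim -> local dim
    (speech_to_local) or local dim -> backbone dim (local_to_speech)."""
    return (silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T
```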
Detects the Local variant at runtime, pre-caches all local weights on the CPU, and runs sequential per-channel inference with autoregressive feedback:

- speech_to_local SwiGLU MLP
- 4-layer local transformer FFN
- local_to_speech SwiGLU MLP per channel
- audio_ln per channel
- LM head per channel → sample → re-embed → next channel

Includes CPU helper functions: swiglu_cpu, rms_norm_cpu, matmul_cpu, and tensor_to_float (dequantizes quantized tensors to fp32). The pipeline runs end-to-end, but output is still short; this likely needs debugging of the quantized-weight dequantization in the CPU path. The bf16 GGUF may produce better results (no dequantization needed).
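For reference, the rms_norm_cpu helper named above computes the usual llama-style RMSNorm. A NumPy sketch under that assumption (the C++ helper's exact signature and epsilon are not shown in this PR):

```python
# RMSNorm as typically applied before each audio LM head:
# x scaled by 1/rms(x), then elementwise weight. eps value is assumed.
import numpy as np

def rms_norm_cpu(x, weight, eps=1e-6):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```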
…actical use)

Added mha_gqa_cpu: multi-head attention with GQA, RoPE, Q/K norms, and causal masking. The channel loop now maintains a growing sequence and runs the full 4-layer local transformer (attention + FFN) per channel.

Functionally correct but impractical on CPU: the bf16 model took 42+ minutes per backbone step due to naive O(n^3) matmuls without SIMD or threading. Next step: either ONNX export (LUI has ONNX Runtime) or building dynamic GGML graphs that execute on the CUDA backend.
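The RoPE step inside mha_gqa_cpu rotates adjacent pairs of each head vector by position-dependent angles. A NumPy sketch, assuming the common llama-style convention (theta base 10000, interleaved pairs); the PR's actual pairing and base are not shown here:

```python
# Rotary position embedding sketch: rotate pairs (x[2i], x[2i+1]) by
# angle pos * theta_base^(-2i/d). Convention assumed, not taken from the PR.
import numpy as np

def rope(x, pos, theta_base=10000.0):
    d = x.shape[-1]
    freqs = theta_base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2D rotation per frequency pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, vector norms are preserved, which is a cheap sanity check when debugging a CPU attention path like this one.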
Currently only the 8B Delay model is supported; this adds the tensor mappings needed to convert MOSS-TTS-Local-Transformer (1.7B) to GGUF.
constants.py: 19 new MODEL_TENSOR entries for the local transformer, bridge MLPs, and audio LNs; added to the MOSS_TTS_DELAY architecture list.
convert_hf_to_gguf.py: ~60 lines in modify_tensors() covering the embedding offset fix (embedding_list.0 is the text embedding, skipped; indices 1-32 are audio codebooks), local transformer mapping, 33 indexed local-to-speech bridge MLPs, the speech-to-local MLP, 33 audio LNs, and prefix normalization.
Converts OpenMOSS-Team/MOSS-TTS-Local-Transformer (555 tensors, 5.5 GB bf16). Quantized sizes: Q2_K 1.1 GB, Q3_K_M 1.4 GB, Q4_K_M 1.9 GB. Audio verified via PyTorch.
Pre-quantized GGUFs: https://huggingface.co/John9007/MOSS-TTS-Local-GGUF
Note: C++ inference (llama-moss-tts) does not yet load the Local variant's extra tensors; this PR enables conversion and quantization only.