Add GGUF conversion support for MOSS-TTS-Local (1.7B) #6

Open
obirije wants to merge 5 commits into OpenMOSS:main from obirije:feature/moss-tts-local-gguf

Conversation


obirije commented Apr 14, 2026

Adds tensor mappings so MOSS-TTS-Local-Transformer (1.7B) can be converted to GGUF. Currently only the 8B Delay model is supported.

constants.py: 19 new MODEL_TENSOR entries for local transformer, bridge MLPs, audio LNs. Added to MOSS_TTS_DELAY architecture list.

convert_hf_to_gguf.py: ~60 lines in modify_tensors() covering the embedding offset fix (embedding_list.0 is the text embedding and is skipped; indices 1-32 are audio codebooks), the local transformer mapping, 33 indexed bridge MLPs, the speech-to-local MLP, 33 audio LNs, and prefix normalization.
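The embedding offset fix can be sketched as follows. This is an illustrative reimplementation, not the actual modify_tensors() code from the PR; the helper name remap_embedding and the `audio_embd.*` GGUF names are hypothetical.

```python
# Hypothetical sketch of the embedding-offset remap: embedding_list.0 is the
# text embedding (handled elsewhere, so skipped), and indices 1-32 are audio
# codebooks shifted down by one. Names here are illustrative, not the PR's.
import re

EMBED_RE = re.compile(r"model\.embedding_list\.(\d+)\.weight")

def remap_embedding(name: str):
    """Return the remapped tensor name, None to skip, or the name unchanged."""
    m = EMBED_RE.fullmatch(name)
    if m is None:
        return name                        # not an embedding_list tensor
    idx = int(m.group(1))
    if idx == 0:
        return None                        # text embedding: skip here
    return f"audio_embd.{idx - 1}.weight"  # hypothetical GGUF name, 0-based
```

A modify_tensors()-style hook would call this per tensor, dropping tensors that map to None and emitting the rest under the new name.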

Converts OpenMOSS-Team/MOSS-TTS-Local-Transformer (555 tensors, 5.5 GB bf16). Quantized sizes: Q2_K 1.1 GB, Q3_K_M 1.4 GB, Q4_K_M 1.9 GB. Audio verified via PyTorch.

Pre-quantized GGUFs: https://huggingface.co/John9007/MOSS-TTS-Local-GGUF

Note: C++ inference (llama-moss-tts) does not yet load the Local variant's extra tensors. This PR enables conversion and quantization only.

The MossTTSDelayModel converter currently supports only the 8B Delay model.
This adds tensor mappings for the Local variant's additional components:

- 4-layer local transformer (attention + FFN)
- 33 local-to-speech bridge MLPs (indexed per codebook)
- 1 speech-to-local bridge MLP
- 33 layer norms before audio LM heads
- Embedding offset fix: model.embedding_list.0 = text (skip), 1-32 = audio codebooks

Changes:
- constants.py: 19 new MODEL_TENSOR entries + GGUF name mappings + architecture list
- convert_hf_to_gguf.py: ~60 lines in modify_tensors() for Local tensor mapping

Tested: successfully converts OpenMOSS-Team/MOSS-TTS-Local-Transformer to GGUF
and quantizes to Q2_K (1.1GB), Q3_K_M (1.4GB), Q4_K_M (1.9GB).

Audio quality verified via PyTorch inference on the original safetensors.
…t path

All 555 tensors from the Local 1.7B GGUF now load successfully.
Bridge MLPs (speech_to_local + local_to_speech) and audio layer norms
are wired into the inference graph. Produces audio tokens end-to-end.

Still TODO:
- Local transformer (4 layers) forward pass in the graph
- Sequential autoregressive channel loop in moss-tts.cpp
  (each channel feeds back into local transformer before next channel)
- Without these, audio quality is degraded (channels lack coherence)

Files changed:
- llama-arch.h: 19 new LLM_TENSOR enum entries
- llama-arch.cpp: tensor name strings + info mappings + xid format handling
- llama-model.h: local_layers struct + bridge MLP fields on llama_model
- llama-model.cpp: tensor loading with correct dimensions (local_dim=1536)
- moss-tts-delay.cpp: Local variant output path with bridge MLPs + audio LNs
github-actions bot added the model label Apr 14, 2026
obirije added 2 commits April 14, 2026 20:38
All Local variant weights now participate in inference:
- speech_to_local SwiGLU MLP (backbone → local dim)
- 4-layer local transformer FFN pass (attention skipped for static case)
- local_to_speech SwiGLU MLPs per channel (local → backbone dim)
- audio layer norms per channel
- Per-channel audio LM heads

Audio output improves (96 generated frames vs 37 without the local FFN).
Output is still short because channels are produced in parallel from the same hidden state.
Full quality requires autoregressive channel loop in moss-tts.cpp where each
channel's sampled token feeds back through speech_to_local → local_transformer
before the next channel is processed.
Detects the Local variant at runtime, pre-caches all local weights on CPU,
and runs sequential per-channel inference with autoregressive feedback:
- speech_to_local SwiGLU MLP
- 4-layer local transformer FFN
- local_to_speech SwiGLU MLP per channel
- audio_ln per channel
- LM head per channel → sample → re-embed → next channel
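The per-channel loop above can be sketched roughly in numpy. Everything here is a stand-in under assumed dimensions: the MLPs are random projections, the local transformer is an identity placeholder, and re-embedding is stubbed; it mirrors the control flow, not the moss-tts.cpp implementation.

```python
# Sketch of the sequential per-channel loop with autoregressive feedback.
# All dimensions and helpers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
backbone_dim, local_dim, n_channels, vocab = 2048, 1536, 32, 1024

def mlp(x, d_out):
    # Stand-in for a SwiGLU bridge MLP (random projection for illustration).
    w = rng.standard_normal((x.shape[-1], d_out)) * 0.02
    return x @ w

def local_transformer(seq):
    # Stand-in for the 4-layer local transformer (identity placeholder).
    return seq

h_backbone = rng.standard_normal(backbone_dim)
seq = [mlp(h_backbone, local_dim)]       # speech_to_local: backbone -> local dim
tokens = []
for ch in range(n_channels):
    h = local_transformer(np.stack(seq))[-1]   # last position's hidden state
    logits = mlp(h, vocab)                     # per-channel LM head stand-in
    tok = int(np.argmax(logits))               # greedy "sampling"
    tokens.append(tok)
    seq.append(rng.standard_normal(local_dim) * 0.02)  # re-embed, feed back
```

The key point is that the sequence grows by one embedding per channel, so each channel's prediction conditions on all previously sampled channels.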

Includes CPU helper functions: swiglu_cpu, rms_norm_cpu, matmul_cpu,
tensor_to_float (dequantize quantized tensors to fp32).
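For reference, numpy equivalents of the swiglu_cpu and rms_norm_cpu helpers might look like the following; the actual C++ helpers in this PR may differ in layout, dtype handling, and epsilon.

```python
# Numpy reference versions of the CPU helpers named above (assumed semantics).
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal RMS over its last axis, then weight."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU MLP: down_proj( silu(gate_proj(x)) * up_proj(x) )."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```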

Pipeline runs end-to-end but output is still short; this likely needs
debugging of the quantized-weight dequantization in the CPU path. The bf16
GGUF may produce better results (no dequantization needed).
…actical use)

Added mha_gqa_cpu: multi-head attention with GQA, RoPE, Q/K norms,
causal masking. Channel loop now maintains growing sequence and runs
full 4-layer local transformer (attention + FFN) per channel.
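The grouped-query attention (GQA) head-sharing idea behind mha_gqa_cpu can be illustrated in a few lines of numpy. Dimensions are made up, and RoPE, Q/K norms, and numerical-stability tricks are omitted.

```python
# GQA illustration: fewer K/V heads than Q heads, each K/V head shared by a
# group of Q heads, with a causal mask. Dimensions are arbitrary examples.
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 16, 5
group = n_q_heads // n_kv_heads        # Q heads per shared K/V head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Expand K/V so each group of Q heads attends to its shared K/V head.
k = np.repeat(k, group, axis=0)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
mask = np.triu(np.full((seq, seq), -np.inf), k=1)   # causal: no future tokens
probs = np.exp(scores + mask)
probs /= probs.sum(axis=-1, keepdims=True)
out = probs @ v
print(out.shape)  # (8, 5, 16)
```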

Functionally correct but impractical on CPU: the bf16 model took 42+ minutes
per backbone step due to naive O(n^3) matmuls without SIMD or threading.

Next step: either ONNX export (LUI has ONNX Runtime) or building dynamic
GGML graphs that execute on the CUDA backend.