Add GGUF conversion support for MOSS-TTS-Local (1.7B) #6

Open
obirije wants to merge 5 commits into OpenMOSS:main from obirije:feature/moss-tts-local-gguf

Conversation


obirije commented Apr 14, 2026

Adds tensor mappings so MOSS-TTS-Local-Transformer (1.7B) can be converted to GGUF. Currently only the 8B Delay model is supported.

constants.py: 19 new MODEL_TENSOR entries for local transformer, bridge MLPs, audio LNs. Added to MOSS_TTS_DELAY architecture list.

convert_hf_to_gguf.py: ~60 lines in modify_tensors() covering the embedding offset fix (embedding_list.0 is the text embedding and is skipped; indices 1-32 are audio codebooks), the local transformer mapping, 33 indexed bridge MLPs, the speech-to-local MLP, 33 audio LNs, and prefix normalization.
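The embedding offset fix can be sketched as follows. This is an illustrative reimplementation, not the actual modify_tensors() code from the PR; the helper name remap_embedding and the `audio_embd.*` GGUF names are hypothetical.

```python
# Hypothetical sketch of the embedding-offset remap: embedding_list.0 is the
# text embedding (handled elsewhere, so skipped), and indices 1-32 are audio
# codebooks shifted down by one. Names here are illustrative, not the PR's.
import re

EMBED_RE = re.compile(r"model\.embedding_list\.(\d+)\.weight")

def remap_embedding(name: str):
    """Return the remapped tensor name, None to skip, or the name unchanged."""
    m = EMBED_RE.fullmatch(name)
    if m is None:
        return name                        # not an embedding_list tensor
    idx = int(m.group(1))
    if idx == 0:
        return None                        # text embedding: skip here
    return f"audio_embd.{idx - 1}.weight"  # hypothetical GGUF name, 0-based
```

A modify_tensors()-style hook would call this per tensor, dropping tensors that map to None and emitting the rest under the new name.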

Converts OpenMOSS-Team/MOSS-TTS-Local-Transformer (555 tensors, 5.5 GB bf16). Quantized sizes: Q2_K 1.1 GB, Q3_K_M 1.4 GB, Q4_K_M 1.9 GB. Audio verified via PyTorch.

Pre-quantized GGUFs: https://huggingface.co/John9007/MOSS-TTS-Local-GGUF

Note: C++ inference (llama-moss-tts) does not yet load the Local variant's extra tensors. This PR enables conversion and quantization only.

The MossTTSDelayModel converter currently supports only the 8B Delay model.
This adds tensor mappings for the Local variant's additional components:

- 4-layer local transformer (attention + FFN)
- 33 local-to-speech bridge MLPs (indexed per codebook)
- 1 speech-to-local bridge MLP
- 33 layer norms before audio LM heads
- Embedding offset fix: model.embedding_list.0 = text (skip), 1-32 = audio codebooks

Changes:
- constants.py: 19 new MODEL_TENSOR entries + GGUF name mappings + architecture list
- convert_hf_to_gguf.py: ~60 lines in modify_tensors() for Local tensor mapping

Tested: successfully converts OpenMOSS-Team/MOSS-TTS-Local-Transformer to GGUF
and quantizes to Q2_K (1.1GB), Q3_K_M (1.4GB), Q4_K_M (1.9GB).

Audio quality verified via PyTorch inference on the original safetensors.
…t path

All 555 tensors from the Local 1.7B GGUF now load successfully.
Bridge MLPs (speech_to_local + local_to_speech) and audio layer norms
are wired into the inference graph. Produces audio tokens end-to-end.

Still TODO:
- Local transformer (4 layers) forward pass in the graph
- Sequential autoregressive channel loop in moss-tts.cpp
  (each channel feeds back into local transformer before next channel)
- Without these, audio quality is degraded (channels lack coherence)

Files changed:
- llama-arch.h: 19 new LLM_TENSOR enum entries
- llama-arch.cpp: tensor name strings + info mappings + xid format handling
- llama-model.h: local_layers struct + bridge MLP fields on llama_model
- llama-model.cpp: tensor loading with correct dimensions (local_dim=1536)
- moss-tts-delay.cpp: Local variant output path with bridge MLPs + audio LNs
github-actions bot added the model label Apr 14, 2026
obirije added 2 commits April 14, 2026 20:38
All Local variant weights now participate in inference:
- speech_to_local SwiGLU MLP (backbone → local dim)
- 4-layer local transformer FFN pass (attention skipped for static case)
- local_to_speech SwiGLU MLPs per channel (local → backbone dim)
- audio layer norms per channel
- Per-channel audio LM heads

Audio output improves (96 generated frames vs 37 without the local FFN).
Output is still short because channels are produced in parallel from the same hidden state.
Full quality requires autoregressive channel loop in moss-tts.cpp where each
channel's sampled token feeds back through speech_to_local → local_transformer
before the next channel is processed.
Detects the Local variant at runtime, pre-caches all local weights on CPU,
and runs sequential per-channel inference with autoregressive feedback:
- speech_to_local SwiGLU MLP
- 4-layer local transformer FFN
- local_to_speech SwiGLU MLP per channel
- audio_ln per channel
- LM head per channel → sample → re-embed → next channel
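The per-channel loop above can be sketched roughly in numpy. Everything here is a stand-in under assumed dimensions: the MLPs are random projections, the local transformer is an identity placeholder, and re-embedding is stubbed; it mirrors the control flow, not the moss-tts.cpp implementation.

```python
# Sketch of the sequential per-channel loop with autoregressive feedback.
# All dimensions and helpers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
backbone_dim, local_dim, n_channels, vocab = 2048, 1536, 32, 1024

def mlp(x, d_out):
    # Stand-in for a SwiGLU bridge MLP (random projection for illustration).
    w = rng.standard_normal((x.shape[-1], d_out)) * 0.02
    return x @ w

def local_transformer(seq):
    # Stand-in for the 4-layer local transformer (identity placeholder).
    return seq

h_backbone = rng.standard_normal(backbone_dim)
seq = [mlp(h_backbone, local_dim)]       # speech_to_local: backbone -> local dim
tokens = []
for ch in range(n_channels):
    h = local_transformer(np.stack(seq))[-1]   # last position's hidden state
    logits = mlp(h, vocab)                     # per-channel LM head stand-in
    tok = int(np.argmax(logits))               # greedy "sampling"
    tokens.append(tok)
    seq.append(rng.standard_normal(local_dim) * 0.02)  # re-embed, feed back
```

The key point is that the sequence grows by one embedding per channel, so each channel's prediction conditions on all previously sampled channels.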

Includes CPU helper functions: swiglu_cpu, rms_norm_cpu, matmul_cpu,
tensor_to_float (dequantize quantized tensors to fp32).
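For reference, numpy equivalents of the swiglu_cpu and rms_norm_cpu helpers might look like the following; the actual C++ helpers in this PR may differ in layout, dtype handling, and epsilon.

```python
# Numpy reference versions of the CPU helpers named above (assumed semantics).
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal RMS over its last axis, then weight."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU MLP: down_proj( silu(gate_proj(x)) * up_proj(x) )."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```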

Pipeline runs end-to-end but output is still short; this likely needs
debugging of the quantized-weight dequantization in the CPU path. The bf16
GGUF may produce better results (no dequantization needed).
…actical use)

Added mha_gqa_cpu: multi-head attention with GQA, RoPE, Q/K norms,
causal masking. Channel loop now maintains growing sequence and runs
full 4-layer local transformer (attention + FFN) per channel.
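The grouped-query attention (GQA) head-sharing idea behind mha_gqa_cpu can be illustrated in a few lines of numpy. Dimensions are made up, and RoPE, Q/K norms, and numerical-stability tricks are omitted.

```python
# GQA illustration: fewer K/V heads than Q heads, each K/V head shared by a
# group of Q heads, with a causal mask. Dimensions are arbitrary examples.
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 16, 5
group = n_q_heads // n_kv_heads        # Q heads per shared K/V head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Expand K/V so each group of Q heads attends to its shared K/V head.
k = np.repeat(k, group, axis=0)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
mask = np.triu(np.full((seq, seq), -np.inf), k=1)   # causal: no future tokens
probs = np.exp(scores + mask)
probs /= probs.sum(axis=-1, keepdims=True)
out = probs @ v
print(out.shape)  # (8, 5, 16)
```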

Functionally correct but impractical on CPU: the bf16 model took 42+ minutes
per backbone step due to naive O(n^3) matmuls without SIMD or threading.

Next step: either ONNX export (LUI has ONNX Runtime) or building dynamic
GGML graphs that execute on the CUDA backend.