Issue Title: SGLang Backend: Context Drop / Identity Bleed in Multi-Speaker voice_clone_and_continuation (MOSS-TTSD v1.0) #119

@nonsense2025

Description:
When running the fused MOSS-TTSD v1.0 + Tokenizer model via the moss-ttsd-v1.0-with-cat SGLang branch, the inference server struggles to maintain multimodal context during long, multi-speaker scenes.

The native Hugging Face transformers implementation holds the voice identity by using a concatenated assistant_message acoustic history, but the SGLang backend appears to drop or mishandle the corresponding acoustic KV-cache. During a multi-turn continuation, the model suddenly drops the cloned identity ([S1] or [S2]) and hallucinates a generic, unconditioned voice (male or female) mid-scene.

Environment:

OS: AlmaLinux 10.1

Hardware: NVIDIA RTX 5090 / 32GB RAM

SGLang Branch: moss-ttsd-v1.0-with-cat

Model: Fused OpenMOSS-Team/MOSS-TTSD-v1.0 + MOSS-Audio-Tokenizer

Startup Command: sglang serve --model-path --delay-pattern --trust-remote-code --port 30000

Steps to Reproduce:
The issue is easily reproducible using a multi-speaker prompt where the text exceeds a few hundred tokens. Below is a standalone Python reproduction script that hits the http://localhost:30000/generate endpoint using standard base64 encoding for the reference audio.

It passes two references (Harold and Johnny) and asks the SGLang server to generate a continuation. Midway through the response, the server loses the voice binding.
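
Before blaming the backend, it may be worth ruling out a malformed prompt and narrowing down how long the text has to be before the drift appears. The sketch below is a client-side helper only; `split_turns`, `check_references`, and `prefix_prompts` are hypothetical names introduced here and are not part of SGLang or MOSS-TTSD. It assumes speaker tags have the `[S1]` / `[S2]` form used in this report and that the first two turns are the reference transcripts, so they are always kept.

```python
import re

# Speaker tags such as [S1] or [S2]; format assumed from the prompt in this report.
TAG = re.compile(r"\[S(\d+)\]")


def split_turns(prompt):
    """Split a tagged prompt into (speaker, text) pairs, in order of appearance."""
    parts = TAG.split(prompt)
    # parts looks like [preamble, id, text, id, text, ...]; the preamble is ignored.
    return [(f"S{parts[i]}", parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def check_references(prompt, num_references):
    """Return a list of problems; an empty list means tags and references line up."""
    speakers = sorted({speaker for speaker, _ in split_turns(prompt)})
    problems = []
    if len(speakers) != num_references:
        problems.append(
            f"{len(speakers)} distinct speakers {speakers} but {num_references} reference clips"
        )
    return problems


def prefix_prompts(prompt, min_turns=2):
    """Yield progressively longer prompts, one extra turn at a time.

    min_turns=2 keeps the two reference transcripts (an assumption about
    how this particular prompt is laid out).
    """
    turns = split_turns(prompt)
    for n in range(min_turns, len(turns) + 1):
        yield " ".join(f"[{speaker}] {text}" for speaker, text in turns[:n])
```

Sending each prompt from `prefix_prompts(prompt_text)` with the same `audio_data` and sampling parameters would show roughly at which turn count the identity starts to drift. Sampling is stochastic at temperature 0.8, so each length probably needs a few runs.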

Reproduction Script (sglang_multispeaker_bug.py):

Python
import base64

import requests

# =====================================================================
# SGLANG BUG REPORT: Multi-Speaker Identity Bleed
# Ensure the SGLang server is running on port 30000 with the fused model
# =====================================================================

SERVER_URL = "http://localhost:30000/generate"


def encode_audio(file_path):
    """Read an audio file and return its contents as a base64 string."""
    with open(file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


# Encode reference files
# Note: Ensure harold.wav and johnny.wav are in the same directory
audio_s1 = encode_audio("harold.wav")
audio_s2 = encode_audio("johnny.wav")

# Construct the multi-speaker prompt
# The text contains the reference transcripts followed by the continuation text
prompt_text = (
    "[S1] According to subsection four, paragraph B of the policy, we are not liable for acts of... spontaneous combustion. "
    "[S2] Your marriage is none of my business. But that policy his paper took out on him is. "
    "[S1] This is Harold at Intermountain Indemnity in Dallas. "
    "[S2] Harold. Sounds like you have either trouble or overtime. "
    "[S1] Both. We insure the Dallas Mining Company. "
    "[S2] Mining, huh. Caveins and bad coffee. "
)

payload = {
    "text": prompt_text,
    "audio_data": [audio_s1, audio_s2],
    "sampling_params": {
        "temperature": 0.8,
        "top_p": 0.80,
        "repetition_penalty": 1.15,
        "max_new_tokens": 2000,
    },
}

print("Sending multi-speaker payload to SGLang server...")
response = requests.post(SERVER_URL, json=payload)

if response.status_code == 200:
    result = response.json()
    # SGLang returns the generated audio as a base64 string in the 'text' field
    if "text" in result:
        output_audio_bytes = base64.b64decode(result["text"])
        with open("sglang_bug_output.wav", "wb") as f:
            f.write(output_audio_bytes)
        print("Saved sglang_bug_output.wav. Listen to the voice drift midway through the dialogue.")
    else:
        print(f"No 'text' field in response: {list(result.keys())}")
else:
    print(f"Server Error {response.status_code}: {response.text}")

Expected Behavior:
The SGLang server should seamlessly handle the acoustic baton pass between [S1] and [S2], maintaining both distinct identities throughout the entire generated WAV file, matching the behavior of the native Hugging Face pipeline.

Actual Behavior:
The server begins the generation correctly but then appears to drop the multimodal context from the KV-cache: partway through the output, the voices abruptly morph into unconditioned base states.
