Description:
When running the fused MOSS-TTSD v1.0 + Tokenizer model via the moss-ttsd-v1.0-with-cat SGLang branch, the inference server struggles to maintain multimodal context during long, multi-speaker scenes.
While the native Hugging Face transformers implementation holds voice identity by carrying a concatenated assistant_message acoustic history, the SGLang backend appears to drop or mishandle this acoustic context in its KV cache. During a multi-turn continuation, the model suddenly loses the cloned identity ([S1] or [S2]) and hallucinates a generic, unconditioned voice (male or female) mid-scene.
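For context, the working HF-side flow conditions generation on a history in which both speakers' reference audio is concatenated into the assistant turn. The sketch below is purely illustrative of that structure; the field names ("role", "text", "audio") are hypothetical and not the actual MOSS-TTSD / transformers API:

```python
# Illustrative only: field names are hypothetical, not the real MOSS-TTSD API.
# The point is that both speakers' reference audio is carried forward together
# as one assistant-side acoustic history, so [S1]/[S2] stay bound to voices.
def build_multispeaker_history(ref_s1, ref_s2, transcript_s1, transcript_s2):
    """Concatenate per-speaker reference audio into one assistant message."""
    return [
        {"role": "user", "text": f"[S1] {transcript_s1} [S2] {transcript_s2}"},
        # Acoustic history: references concatenated in speaker-tag order
        {"role": "assistant", "audio": ref_s1 + ref_s2},
    ]
```

It is this concatenated history that the SGLang branch appears to lose partway through generation.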
Environment:
OS: AlmaLinux 10.1
Hardware: NVIDIA RTX 5090 / 32GB RAM
SGLang Branch: moss-ttsd-v1.0-with-cat
Model: Fused OpenMOSS-Team/MOSS-TTSD-v1.0 + MOSS-Audio-Tokenizer
Startup Command: sglang serve --model-path --delay-pattern --trust-remote-code --port 30000
Steps to Reproduce:
The issue is easily reproducible using a multi-speaker prompt where the text exceeds a few hundred tokens. Below is a standalone Python reproduction script that hits the http://localhost:30000/generate endpoint using standard base64 encoding for the reference audio.
It passes two references (Harold and Johnny) and asks the SGLang server to generate a continuation. Midway through the response, the server loses the voice binding.
Reproduction Script (sglang_multispeaker_bug.py):
Python
import requests
import base64
import json

# =====================================================================
# SGLANG BUG REPORT: Multi-Speaker Identity Bleed
# Ensure the SGLang server is running on port 30000 with the fused model
# =====================================================================

SERVER_URL = "http://localhost:30000/generate"

def encode_audio(file_path):
    # Read a reference WAV and return its contents as a base64 string
    with open(file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Encode reference files
# Note: ensure harold.wav and johnny.wav are in the same directory
audio_s1 = encode_audio("harold.wav")
audio_s2 = encode_audio("johnny.wav")

# Construct the multi-speaker prompt
# The text contains the reference transcripts followed by the continuation text
prompt_text = (
    "[S1] According to subsection four, paragraph B of the policy, we are not liable for acts of... spontaneous combustion. "
    "[S2] Your marriage is none of my business. But that policy his paper took out on him is. "
    "[S1] This is Harold at Intermountain Indemnity in Dallas. "
    "[S2] Harold. Sounds like you have either trouble or overtime. "
    "[S1] Both. We insure the Dallas Mining Company. "
    "[S2] Mining, huh. Cave-ins and bad coffee. "
)

payload = {
    "text": prompt_text,
    "audio_data": [audio_s1, audio_s2],
    "sampling_params": {
        "temperature": 0.8,
        "top_p": 0.80,
        "repetition_penalty": 1.15,
        "max_new_tokens": 2000,
    },
}

print("Sending multi-speaker payload to SGLang server...")
response = requests.post(SERVER_URL, json=payload)

if response.status_code == 200:
    result = response.json()
    # SGLang returns the generated audio as a base64 string in the 'text' field
    if "text" in result:
        output_audio_bytes = base64.b64decode(result["text"])
        with open("sglang_bug_output.wav", "wb") as f:
            f.write(output_audio_bytes)
        print("Saved sglang_bug_output.wav. Listen for the voice drift midway through the dialogue.")
    else:
        print(f"Unexpected response shape: {json.dumps(result)[:200]}")
else:
    print(f"Server Error {response.status_code}: {response.text}")
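To rule out simple truncation when triaging the output, a quick standard-library sanity check on the saved file helps (this assumes the server returns ordinary PCM WAV, which may not hold if it streams raw codec frames):

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

# If the identity drift happens midway, the clip should still be full length;
# a much shorter file would instead point at truncated generation.
```

Run it against sglang_bug_output.wav and compare against the duration of the equivalent Hugging Face output.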
Expected Behavior:
The SGLang server should seamlessly handle the acoustic baton pass between [S1] and [S2], maintaining both distinct identities throughout the entire generated WAV file, matching the behavior of the native Hugging Face pipeline.
Actual Behavior:
The server begins the generation correctly but drops the multimodal context from the KV-cache, resulting in the sudden morphing of the voices into unconditioned base states.