Description:
When running the fused MOSS-TTSD v1.0 + Tokenizer model via the moss-ttsd-v1.0-with-cat SGLang branch, the inference server struggles to maintain multimodal context during long, multi-speaker scenes.
While the native Hugging Face transformers implementation holds voice identity by carrying a concatenated assistant_message acoustic history, the SGLang backend appears to drop or mishandle this acoustic context in its KV cache. During a multi-turn continuation, the model suddenly loses the cloned identity ([S1] or [S2]) and hallucinates a generic, unconditioned voice (male or female) mid-scene.
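For context, the working HF-side flow conditions generation on a history in which both speakers' reference audio is concatenated into the assistant turn. The sketch below is purely illustrative of that structure; the field names ("role", "text", "audio") are hypothetical and not the actual MOSS-TTSD / transformers API:

```python
# Illustrative only: field names are hypothetical, not the real MOSS-TTSD API.
# The point is that both speakers' reference audio is carried forward together
# as one assistant-side acoustic history, so [S1]/[S2] stay bound to voices.
def build_multispeaker_history(ref_s1, ref_s2, transcript_s1, transcript_s2):
    """Concatenate per-speaker reference audio into one assistant message."""
    return [
        {"role": "user", "text": f"[S1] {transcript_s1} [S2] {transcript_s2}"},
        # Acoustic history: references concatenated in speaker-tag order
        {"role": "assistant", "audio": ref_s1 + ref_s2},
    ]
```

It is this concatenated history that the SGLang branch appears to lose partway through generation.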
Environment:
OS: AlmaLinux 10.1
Hardware: NVIDIA RTX 5090 / 32GB RAM
SGLang Branch: moss-ttsd-v1.0-with-cat
Model: Fused OpenMOSS-Team/MOSS-TTSD-v1.0 + MOSS-Audio-Tokenizer
Startup Command: sglang serve --model-path --delay-pattern --trust-remote-code --port 30000
Steps to Reproduce:
The issue is easily reproducible using a multi-speaker prompt where the text exceeds a few hundred tokens. Below is a standalone Python reproduction script that hits the http://localhost:30000/generate endpoint using standard base64 encoding for the reference audio.
It passes two references (Harold and Johnny) and asks the SGLang server to generate a continuation. Midway through the response, the server loses the voice binding.
Reproduction Script (sglang_multispeaker_bug.py):
Python
import requests
import base64
import json

# =====================================================================
# SGLANG BUG REPORT: Multi-Speaker Identity Bleed
# Ensure the SGLang server is running on port 30000 with the fused model
# =====================================================================

SERVER_URL = "http://localhost:30000/generate"

def encode_audio(file_path):
    # Read a reference WAV and return its contents as a base64 string
    with open(file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Encode reference files
# Note: ensure harold.wav and johnny.wav are in the same directory
audio_s1 = encode_audio("harold.wav")
audio_s2 = encode_audio("johnny.wav")

# Construct the multi-speaker prompt
# The text contains the reference transcripts followed by the continuation text
prompt_text = (
    "[S1] According to subsection four, paragraph B of the policy, we are not liable for acts of... spontaneous combustion. "
    "[S2] Your marriage is none of my business. But that policy his paper took out on him is. "
    "[S1] This is Harold at Intermountain Indemnity in Dallas. "
    "[S2] Harold. Sounds like you have either trouble or overtime. "
    "[S1] Both. We insure the Dallas Mining Company. "
    "[S2] Mining, huh. Cave-ins and bad coffee. "
)

payload = {
    "text": prompt_text,
    "audio_data": [audio_s1, audio_s2],
    "sampling_params": {
        "temperature": 0.8,
        "top_p": 0.80,
        "repetition_penalty": 1.15,
        "max_new_tokens": 2000,
    },
}

print("Sending multi-speaker payload to SGLang server...")
response = requests.post(SERVER_URL, json=payload)

if response.status_code == 200:
    result = response.json()
    # SGLang returns the generated audio as a base64 string in the 'text' field
    if "text" in result:
        output_audio_bytes = base64.b64decode(result["text"])
        with open("sglang_bug_output.wav", "wb") as f:
            f.write(output_audio_bytes)
        print("Saved sglang_bug_output.wav. Listen for the voice drift midway through the dialogue.")
    else:
        print(f"Unexpected response shape: {json.dumps(result)[:200]}")
else:
    print(f"Server Error {response.status_code}: {response.text}")
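To rule out simple truncation when triaging the output, a quick standard-library sanity check on the saved file helps (this assumes the server returns ordinary PCM WAV, which may not hold if it streams raw codec frames):

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

# If the identity drift happens midway, the clip should still be full length;
# a much shorter file would instead point at truncated generation.
```

Run it against sglang_bug_output.wav and compare against the duration of the equivalent Hugging Face output.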
Expected Behavior:
The SGLang server should seamlessly handle the acoustic baton pass between [S1] and [S2], maintaining both distinct identities throughout the entire generated WAV file, matching the behavior of the native Hugging Face pipeline.
Actual Behavior:
The server begins the generation correctly but drops the multimodal context from the KV-cache, resulting in the sudden morphing of the voices into unconditioned base states.