
fix: stabilize model-level TTS routing and add model TTS settings#1632

Merged
zerob13 merged 1 commit into dev from fix/model-level-tts-runtime-and-settings on May 16, 2026

@zhangmo8 (Collaborator) commented May 15, 2026

Summary

This PR finishes and hardens model-level TTS support across runtime, model DB inference, and settings UI.

Runtime / Routing

  • normalize TTS model IDs so prefixed IDs (e.g. openai/tts-1) are recognized
  • expand chat-audio TTS detection to include Xiaomi alias prefix
  • improve Pattern B chat-completions request by adding modalities: ["text", "audio"]
  • improve Pattern B response parsing with fallback path extraction
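The ID normalization described above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `normalizeTtsModelId` and `isTtsModelId` and the ID set are hypothetical, and the real helpers may apply different rules.

```typescript
// Hypothetical ID list for illustration; the PR's real classification may differ.
const KNOWN_TTS_MODEL_IDS = new Set(['tts-1', 'tts-1-hd', 'gpt-4o-mini-tts'])

// Drop any provider prefix ("openai/tts-1" -> "tts-1") and normalize case.
function normalizeTtsModelId(modelId: string): string {
  const trimmed = modelId.trim().toLowerCase()
  const lastSlash = trimmed.lastIndexOf('/')
  return lastSlash === -1 ? trimmed : trimmed.slice(lastSlash + 1)
}

// A prefixed ID is recognized once the prefix has been stripped.
function isTtsModelId(modelId: string): boolean {
  return KNOWN_TTS_MODEL_IDS.has(normalizeTtsModelId(modelId))
}
```

With this shape, `isTtsModelId('openai/tts-1')` and `isTtsModelId('tts-1')` take the same path, so routing does not depend on how a provider prefixes its model IDs.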

Model Type / Endpoint Inference

  • support tts in provider-model type inference paths
  • map inferred ModelType.TTS to ApiEndpointType.AudioSpeech
  • keep provider-managed model config sync aligned with TTS endpoint behavior
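A minimal sketch of the inference path above, under stated assumptions: only the names `ModelType.TTS` and `ApiEndpointType.AudioSpeech` come from this PR. The other members, all string values, and the function names `inferModelType` and `endpointForModelType` are hypothetical.

```typescript
// String values and the non-TTS members are assumptions for illustration.
const ModelType = { Chat: 'chat', ImageGeneration: 'imageGeneration', TTS: 'tts' } as const
type ModelType = (typeof ModelType)[keyof typeof ModelType]

const ApiEndpointType = { Chat: 'chat', Image: 'image', AudioSpeech: 'audio-speech' } as const
type ApiEndpointType = (typeof ApiEndpointType)[keyof typeof ApiEndpointType]

// Infer the model type from a provider-model `type` field, defaulting to chat.
function inferModelType(providerModelType: string | undefined): ModelType {
  switch (providerModelType) {
    case 'tts':
      return ModelType.TTS
    case 'imageGeneration':
      return ModelType.ImageGeneration
    default:
      return ModelType.Chat
  }
}

// Map the inferred model type to the endpoint used for requests.
function endpointForModelType(type: ModelType): ApiEndpointType {
  switch (type) {
    case ModelType.TTS:
      return ApiEndpointType.AudioSpeech
    case ModelType.ImageGeneration:
      return ApiEndpointType.Image
    default:
      return ApiEndpointType.Chat
  }
}
```

Keeping both mappings as pure functions means the provider-managed config sync and the settings dialog can share them instead of each re-deriving the endpoint.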

Settings UI

  • add model-level TTS settings fields (voice, responseFormat, speed, instructions)
  • integrate TTS settings into ModelConfigDialog
  • add TTS model type option and audio-speech endpoint option in model config
  • auto-sync API endpoint with selected model type (chat / image / audio-speech)
  • improve TTS type labeling in provider model list
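The four settings fields could be normalized before saving roughly as below. This is a sketch only: the field names come from the PR description, while the defaults, the allowed format list, and the speed range are assumptions modeled on the OpenAI speech API and may not match `src/shared/ttsSettings.ts`.

```typescript
interface TtsSettings {
  voice?: string
  responseFormat?: string
  speed?: number
  instructions?: string
}

// Assumed format list and MIME types; a given provider may support a different set.
const MIME_BY_FORMAT: Record<string, string> = {
  mp3: 'audio/mpeg',
  opus: 'audio/opus',
  aac: 'audio/aac',
  flac: 'audio/flac',
  wav: 'audio/wav',
  pcm: 'audio/pcm'
}

function normalizeTtsSettings(input: TtsSettings): TtsSettings {
  const voice = input.voice?.trim() || 'alloy' // assumed default voice
  const format = input.responseFormat?.trim().toLowerCase() ?? ''
  const responseFormat = format in MIME_BY_FORMAT ? format : 'mp3'
  // Clamp speed to the assumed 0.25..4 range; fall back to 1 for invalid input.
  const speed =
    typeof input.speed === 'number' && Number.isFinite(input.speed)
      ? Math.min(4, Math.max(0.25, input.speed))
      : 1
  const instructions = input.instructions?.trim() || undefined
  return { voice, responseFormat, speed, instructions }
}

// Look up the MIME type for an already-normalized response format.
function mimeTypeForFormat(format: string): string {
  return MIME_BY_FORMAT[format] ?? 'audio/mpeg'
}
```

Normalizing at save time keeps the runtime free of validation branches: it can read `speed` and `responseFormat` directly and trust that they are within range.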

Model DB

  • mark relevant aihubmix TTS models as type: "tts"
  • document skip for built-in xiaomimimo provider entry (not present)

Verification

  • pnpm run typecheck
  • pnpm run format
  • pnpm run i18n
  • pnpm run lint

All checks pass locally.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added text-to-speech (TTS) model support with configurable settings including voice selection, audio format, speed, and synthesis instructions.
    • Integrated TTS model detection and routing for enhanced audio generation capabilities.
    • Added new TTS provider models (OpenAI TTS, Google Gemini audio variants) to the model database.


@coderabbitai (Bot, Contributor) commented May 15, 2026

Walkthrough

This PR implements unified TTS provider support as a first-class model capability. It introduces ModelType.TTS and ApiEndpointType.AudioSpeech into shared contracts, adds provider routing to recognize TTS models, implements dual-pattern runtime execution (OpenAI-compatible /audio/speech and /chat/completions-with-audio), extends the frontend with TTS model detection and configuration UI, and populates the model database with TTS model entries.
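To make the two execution patterns concrete, here is a hedged sketch of the request each one might send. The body fields follow the OpenAI API (`input`, `voice`, `response_format` for the speech endpoint; `modalities` and `audio` for chat completions). The function name `buildTtsRequest`, the `TtsOptions` shape, and the `'A'`/`'B'` selector are hypothetical and not taken from runtime.ts.

```typescript
interface TtsOptions {
  voice: string
  format: string
  speed?: number
  instructions?: string
}

interface TtsRequest {
  path: string // relative to the provider's base URL
  body: Record<string, unknown>
}

function buildTtsRequest(
  pattern: 'A' | 'B',
  model: string,
  text: string,
  opts: TtsOptions
): TtsRequest {
  if (pattern === 'A') {
    // Pattern A: dedicated speech endpoint, which returns raw audio bytes.
    const body: Record<string, unknown> = {
      model,
      input: text,
      voice: opts.voice,
      response_format: opts.format
    }
    if (opts.speed !== undefined) body.speed = opts.speed
    if (opts.instructions) body.instructions = opts.instructions
    return { path: '/audio/speech', body }
  }
  // Pattern B: chat completions with audio output requested via `modalities`.
  return {
    path: '/chat/completions',
    body: {
      model,
      modalities: ['text', 'audio'],
      audio: { voice: opts.voice, format: opts.format },
      messages: [{ role: 'user', content: text }]
    }
  }
}
```

The two patterns differ mainly in the response: Pattern A yields binary audio, while Pattern B yields JSON from which base64 audio has to be extracted.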

Changes

Unified TTS Provider

| Layer / File(s) | Summary |
| --- | --- |
| Shared type and settings foundations: src/shared/model.ts, src/shared/ttsSettings.ts, src/shared/contracts/common.ts, src/shared/contracts/domainSchemas.ts, src/shared/types/model-db.ts, src/shared/types/presenters/legacy.presenters.d.ts | ModelType.TTS and ApiEndpointType.AudioSpeech enums are added; TtsSettings interface with voice, responseFormat, speed, and instructions fields; helpers for model ID classification, MIME-type mapping, and settings normalization; model-db schema extended to parse type: 'tts'; ModelConfig contract gains optional tts field. |
| Backend presenter model routing: src/main/presenter/configPresenter/index.ts, src/main/presenter/configPresenter/modelConfig.ts | Provider models with type: 'tts' are inferred as ModelType.TTS; buildConfigFromProviderModel maps ModelType.TTS to ApiEndpointType.AudioSpeech endpoint. |
| Provider TTS capability detection: src/main/presenter/llmProviderPresenter/providers/aiSdkProvider.ts | shouldUseOpenAITtsRoute predicate identifies OpenAI-compatible TTS models; shouldUseTts callback is wired into AiSdkRuntimeContext for compatible non-cloud providers; "new-api" model type recognition extended to classify TTS patterns; computed apiEndpoint maps TTS to AudioSpeech. |
| Runtime TTS execution: src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts | AiSdkRuntimeContext gains optional shouldUseTts hook; runAiSdkCoreStream branches to TTS when enabled: extracts text from last user message, executes either Pattern A (/v1/audio/speech) or Pattern B (/v1/chat/completions with audio), encodes response as base64 data URL, caches via device presenter, emits image_data stream event with cached audio, and terminates with stop event. |
| Frontend model detection: src/renderer/src/composables/useModelTypeDetection.ts | useModelTypeDetection composable extended to recognize 'tts' model type and expose isTtsModel computed boolean. |
| Frontend configuration UI: src/renderer/src/components/settings/TtsSettingsFields.vue, src/renderer/src/components/settings/ModelConfigDialog.vue, src/renderer/settings/components/ProviderModelList.vue | New TtsSettingsFields.vue component renders voice, response format, speed, and instructions inputs with normalization; ModelConfigDialog conditionally renders TTS settings, adds tts type option, selects audio-speech endpoint for TTS, synchronizes endpoint based on model type, and normalizes settings on save; ProviderModelList adds TTS icon and i18n label. |
| Model database and documentation: resources/model-db/providers.json, docs/features/unified-tts-provider/* | TTS model entries added: tts-1, tts-1-hd, gpt-4o-mini-tts, gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts with modalities, limits, and metadata; feature specification, implementation plan, and task checklist documentation. |
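Two steps of the runtime TTS path summarized in the table, taking the text of the last user message and encoding the audio response as a base64 data URL, might look like the sketch below. The message shape and the names `lastUserText` and `toAudioDataUrl` are assumptions for illustration. The encoder is written by hand only to keep the example self-contained, and Node code would more typically use `Buffer.from(bytes).toString('base64')`.

```typescript
type ChatMessage = {
  role: 'system' | 'user' | 'assistant'
  content: string | Array<{ type: string; text?: string }>
}

// Return the text of the most recent user message, or '' if there is none.
function lastUserText(messages: ChatMessage[]): string {
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i]
    if (message.role !== 'user') continue
    if (typeof message.content === 'string') return message.content.trim()
    return message.content
      .filter((part) => part.type === 'text' && typeof part.text === 'string')
      .map((part) => part.text as string)
      .join('\n')
      .trim()
  }
  return ''
}

const B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

// Standard base64 with '=' padding, three input bytes per four output chars.
function toBase64(bytes: Uint8Array): string {
  let out = ''
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i]
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0
    out += B64[b0 >> 2]
    out += B64[((b0 & 0x03) << 4) | (b1 >> 4)]
    out += i + 1 < bytes.length ? B64[((b1 & 0x0f) << 2) | (b2 >> 6)] : '='
    out += i + 2 < bytes.length ? B64[b2 & 0x3f] : '='
  }
  return out
}

function toAudioDataUrl(bytes: Uint8Array, mimeType: string): string {
  return `data:${mimeType};base64,${toBase64(bytes)}`
}
```

Only the last user turn is synthesized in this sketch, which matches the summary's "extracts text from last user message". Whether earlier turns are ever used is not stated in the PR.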

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • ThinkInAIXYZ/deepchat#1449: The main PR's unified TTS implementation directly extends the AI-SDK runtime by adding a new shouldUseTts hook and inserting a TTS routing path into runAiSdkCoreStream, which is part of the same AI-SDK runtime migration.
  • ThinkInAIXYZ/deepchat#1244: Main PR extends the existing apiEndpoint plumbing by updating ModelConfig/ModelConfigDialog routing and endpoint handling to add TTS-specific audio-speech behavior alongside chat/image endpoint selection.
  • ThinkInAIXYZ/deepchat#1291: Both PRs connect at the message/audio plumbing level; the main PR adds model-level TTS execution, while the retrieved PR implements renderer logic to emit and render audio blocks for voice/TTS output.

Suggested reviewers

  • zerob13

Poem

🐰 Hops through audio dreams so bright,
Where text meets voice in flight,
TTS now speaks with model grace,
A melody in every place! 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the core changes: stabilizing model-level TTS routing and adding TTS settings support across runtime, inference, and UI layers. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
resources/model-db/providers.json (1)

181588-181610: ⚡ Quick win

Normalize gpt-4o-mini-tts field shape with peer TTS entries.

Line 181588 onward omits limit and open_weights, while neighboring TTS entries include them. Keeping a consistent record shape reduces downstream null-guard branching.

Proposed diff
         {
           "id": "gpt-4o-mini-tts",
           "name": "gpt-4o-mini-tts",
           "display_name": "gpt-4o-mini-tts",
           "modalities": {
             "input": [
               "text"
             ],
             "output": [
               "audio"
             ]
           },
+          "limit": {
+            "context": 8192,
+            "output": 8192
+          },
           "temperature": false,
           "tool_call": false,
           "reasoning": {
             "supported": false
           },
           "attachment": false,
+          "open_weights": false,
           "cost": {
             "input": 0.48,
             "output": 0.96
           },
           "type": "tts"
         },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@resources/model-db/providers.json` around lines 181588 - 181610, The JSON
entry for the TTS model with id/name "gpt-4o-mini-tts" is missing the standard
fields "limit" and "open_weights" used by other TTS entries; update the object
for "gpt-4o-mini-tts" to include the same "limit" structure that peer entries
use (the proposed diff uses "context" and "output") and the "open_weights"
boolean with the same default as neighboring TTS entries so the record shape
matches peers and avoids extra null checks in consumers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts`:
- Around line 313-318: The audio extraction is brittle: update the logic around
firstMessage and audioData to defensively check both paths and handle missing
fields—inspect firstMessage.audio?.data first, then iterate firstMessage.content
(if Array.isArray) and return the first item where item?.type === 'audio' &&
item?.audio?.data exists; if none found, also consider item?.audio (in case data
is top-level) or item?.data as fallback before throwing. Modify the code around
the audioData computation (referencing firstMessage and audioData) to perform
these guarded checks and only throw the Error('TTS response missing audio data
in choices[0].message.audio.data') after all fallbacks are exhausted.

In `@src/renderer/src/components/settings/TtsSettingsFields.vue`:
- Around line 54-61: The Label and Input in TtsSettingsFields.vue are using the
wrong i18n keys (settings.model.modelConfig.timeout.label and
settings.model.modelConfig.name.placeholder); replace them with dedicated
"instructions" keys (for example settings.model.instructions.label and
settings.model.instructions.placeholder) in the two t(...) calls used by Label
and the Input's placeholder, keep the binding to t and the @update:model-value
handler onInstructionsInput unchanged, and then add/update those new keys in the
i18n resource files so translations are available.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d3fd03ba-47f7-4fee-9ec4-b6c314ffb6a4

📥 Commits

Reviewing files that changed from the base of the PR and between 5727aff and 8fa2e72.

📒 Files selected for processing (18)
  • docs/features/unified-tts-provider/plan.md
  • docs/features/unified-tts-provider/spec.md
  • docs/features/unified-tts-provider/tasks.md
  • resources/model-db/providers.json
  • src/main/presenter/configPresenter/index.ts
  • src/main/presenter/configPresenter/modelConfig.ts
  • src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts
  • src/main/presenter/llmProviderPresenter/providers/aiSdkProvider.ts
  • src/renderer/settings/components/ProviderModelList.vue
  • src/renderer/src/components/settings/ModelConfigDialog.vue
  • src/renderer/src/components/settings/TtsSettingsFields.vue
  • src/renderer/src/composables/useModelTypeDetection.ts
  • src/shared/contracts/common.ts
  • src/shared/contracts/domainSchemas.ts
  • src/shared/model.ts
  • src/shared/ttsSettings.ts
  • src/shared/types/model-db.ts
  • src/shared/types/presenters/legacy.presenters.d.ts

Comment on lines +313 to +318
    const audioData =
      firstMessage?.audio?.data ??
      firstMessage?.content?.find((item) => item?.type === 'audio')?.audio?.data
    if (!audioData) {
      throw new Error('TTS response missing audio data in choices[0].message.audio.data')
    }

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pattern B audio extraction may miss edge cases.

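The fallback path at line 315 searches message.content for items with type === 'audio', while the primary path reads message.audio.data. If the response structure varies (e.g., message.content exists but no item has type 'audio', or item.audio is missing), no audio data is found and the "missing audio data" error is thrown. If message.content is a string rather than an array, calling .find on it throws a TypeError instead, which an Array.isArray guard avoids.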

🛡️ Consider a more defensive extraction
-    const audioData =
-      firstMessage?.audio?.data ??
-      firstMessage?.content?.find((item) => item?.type === 'audio')?.audio?.data
+    const audioData =
+      firstMessage?.audio?.data ??
+      (Array.isArray(firstMessage?.content)
+        ? firstMessage.content.find((item) => item?.type === 'audio')?.audio?.data
+        : undefined)
     if (!audioData) {
       throw new Error('TTS response missing audio data in choices[0].message.audio.data')
     }
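A fuller version of this defensive extraction, covering the extra fallbacks that the review's agent prompt mentions (`item.audio` as a plain string and `item.data`), could be sketched as below. Only the `message.audio.data` path is a documented response shape. The other shapes, and the helper names `isRecord`, `nonEmptyString`, and `extractAudioData`, are assumptions for illustration.

```typescript
type UnknownRecord = Record<string, unknown>

function isRecord(value: unknown): value is UnknownRecord {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function nonEmptyString(value: unknown): string | undefined {
  return typeof value === 'string' && value.length > 0 ? value : undefined
}

// Returns base64 audio from a chat-completions message, or undefined if none is found.
function extractAudioData(message: unknown): string | undefined {
  if (!isRecord(message)) return undefined

  // Primary path: message.audio.data
  const primary = isRecord(message.audio) ? nonEmptyString(message.audio.data) : undefined
  if (primary) return primary

  // Fallback path: only walk content when it is really an array (it may be a string).
  if (!Array.isArray(message.content)) return undefined
  for (const item of message.content) {
    if (!isRecord(item) || item.type !== 'audio') continue
    const nested = isRecord(item.audio) ? nonEmptyString(item.audio.data) : undefined
    // Assumed variants: audio given as a bare string, or data at the item's top level.
    const found = nested ?? nonEmptyString(item.audio) ?? nonEmptyString(item.data)
    if (found) return found
  }
  return undefined
}
```

The caller keeps the original behavior by throwing the existing "missing audio data" error only when this function returns `undefined`.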

Comment on lines +54 to +61
    <div class="space-y-2">
      <Label>{{ t('settings.model.modelConfig.timeout.label') }}</Label>
      <Input
        :model-value="tts.instructions ?? ''"
        :placeholder="t('settings.model.modelConfig.name.placeholder')"
        @update:model-value="onInstructionsInput"
      />
    </div>

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect i18n keys for instructions field.

Line 55 uses settings.model.modelConfig.timeout.label for the instructions label, and Line 58 uses settings.model.modelConfig.name.placeholder for the instructions placeholder. These keys appear to be copy-paste errors and do not match the semantic purpose of the "instructions" field.

🔧 Proposed fix to use dedicated i18n keys
     <div class="space-y-2">
-      <Label>{{ t('settings.model.modelConfig.timeout.label') }}</Label>
+      <Label>{{ t('settings.provider.voiceai.instructions.label') }}</Label>
       <Input
         :model-value="tts.instructions ?? ''"
-        :placeholder="t('settings.model.modelConfig.name.placeholder')"
+        :placeholder="t('settings.provider.voiceai.instructions.placeholder')"
         @update:model-value="onInstructionsInput"
       />
     </div>

@zerob13 zerob13 merged commit f76d07f into dev May 16, 2026
4 checks passed