diff --git a/.mintlify/skills/fish-audio-api/SKILL.md b/.mintlify/skills/fish-audio-api/SKILL.md index 2bef9af..68e21c3 100644 --- a/.mintlify/skills/fish-audio-api/SKILL.md +++ b/.mintlify/skills/fish-audio-api/SKILL.md @@ -64,7 +64,7 @@ Response: streaming audio bytes (`Transfer-Encoding: chunked`) in the format set | `format` | `wav` \| `pcm` \| `mp3` \| `opus` | `mp3` | Output format. | | `sample_rate` | int \| null | null (44100, or 48000 for opus) | Output sample rate. | | `mp3_bitrate` | 64 \| 128 \| 192 | 128 | Only when `format=mp3`. | -| `opus_bitrate` | -1000 \| 24 \| 32 \| 48 \| 64 | -1000 (auto) | Only when `format=opus`. | +| `opus_bitrate` | -1000 \| 24000 \| 32000 \| 48000 \| 64000 | -1000 (auto) | Opus bitrate in **bps**. Only when `format=opus`. | | `latency` | `low` \| `normal` \| `balanced` | `normal` | Quality vs latency. | | `max_new_tokens` | int | 1024 | Per-chunk audio token cap. | | `repetition_penalty` | number | 1.2 | >1.0 reduces repeats. | diff --git a/.mintlify/skills/fish-audio-sdk/SKILL.md b/.mintlify/skills/fish-audio-sdk/SKILL.md new file mode 100644 index 0000000..605f0fe --- /dev/null +++ b/.mintlify/skills/fish-audio-sdk/SKILL.md @@ -0,0 +1,119 @@ +--- +name: fish-audio-sdk +description: Write code with the official Fish Audio SDKs — Python (`fishaudio`, PyPI `fish-audio-sdk`) and JavaScript/TypeScript (`fish-audio`). Use when the user wants text-to-speech, speech-to-text, voice cloning / voice-model management, or realtime WebSocket TTS through the installed SDK rather than raw HTTP. Covers install and auth, sync + async Python, the TypeScript client, exact method signatures and defaults, model selection (s1 / s2-pro), the real exception types, and the Python↔JavaScript naming differences. For raw REST/WebSocket calls without an SDK (curl, unsupported languages, edge runtimes), use the `fish-audio-api` skill instead. 
+--- + +# Fish Audio SDK Skill + +Use this skill to generate correct, runnable code with the **official Fish Audio SDKs**: + +- **Python** — package `fish-audio-sdk` on PyPI, imported as `fishaudio`. (The same wheel still ships a separate legacy `fish_audio_sdk` package — do **not** mix them; everything here is the modern `fishaudio` package.) +- **JavaScript / TypeScript** — package `fish-audio` on npm, imported as `FishAudioClient`. + +If the user wants raw `curl` / HTTP / WebSocket without installing an SDK, use the **`fish-audio-api`** skill instead. + +> This file is the index. Deeper, task-specific rules and full examples live in [`references/`](references/). Read the reference for the task you're doing before writing code. + +## Global facts + +- **Auth:** both SDKs read the API key from the `FISH_API_KEY` environment variable automatically. Get keys at `https://fish.audio/app/api-keys`. Never hardcode a key — read it from the environment. +- **Base URL:** `https://api.fish.audio` (override with `base_url=` in Python / `baseUrl:` in JS). +- **Models:** `s2-pro` (default — highest quality) and `s1`. `speech-1.5` / `speech-1.6` are **deprecated**. In Python pass `model="s2-pro"` (keyword); in JS pass the **positional** `backend` argument. +- **Audio formats:** `mp3` (default), `wav`, `pcm`, `opus`. +- **Playback in examples:** `play()` shells out to a system audio tool — Python uses **ffmpeg/ffplay** (or `mpv`), JS uses **ffplay**. It is for local/desktop use; in a server, `save()` to a file or stream the bytes instead. See [references/installation.md](references/installation.md). 
+ +## Quick start — Python + +```python +from fishaudio import FishAudio +from fishaudio.utils import play, save + +client = FishAudio() # reads FISH_API_KEY + +# Generate speech (returns the full audio as bytes) +audio = client.tts.convert(text="Hello from Fish Audio!") + +save(audio, "output.mp3") # write to a file +# play(audio) # or play locally (needs ffmpeg) +``` + +Async — identical resource tree on `AsyncFishAudio`, used as a context manager: + +```python +import asyncio +from fishaudio import AsyncFishAudio +from fishaudio.utils import save + +async def main(): + async with AsyncFishAudio() as client: + audio = await client.tts.convert(text="Hello from Fish Audio!") + save(audio, "output.mp3") + +asyncio.run(main()) +``` + +## Quick start — JavaScript / TypeScript + +```ts +import { FishAudioClient, play } from "fish-audio"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// convert() returns audio you can play or pipe to a file +const audio = await client.textToSpeech.convert({ + text: "Hello from Fish Audio!", +}); // defaults to model "s2-pro" +await play(audio); // local playback (needs ffplay) +``` + +To pick a model in JS, pass `backend` as the **positional** argument (not a named option): + +```ts +const audio = await client.textToSpeech.convert({ text: "Hi" }, "s1"); +``` + +## Capabilities → references + +| Task | Reference | +| ---------------------------------------------------------------- | ------------------------------------------------------------ | +| Install, auth, playback deps, verify a key | [references/installation.md](references/installation.md) | +| Text-to-Speech (convert, stream, formats, prosody, model select) | [references/text-to-speech.md](references/text-to-speech.md) | +| Voice cloning (instant references + persistent voice models) | [references/voice-cloning.md](references/voice-cloning.md) | +| Speech-to-Text (transcribe, segments, timestamps) | 
[references/speech-to-text.md](references/speech-to-text.md) | +| Realtime WebSocket TTS (stream text → audio) | [references/websocket.md](references/websocket.md) | +| Errors, retries, and timeouts (the **real** exception types) | [references/errors.md](references/errors.md) | + +## Python ↔ JavaScript name map + +The two SDKs do **not** use the same names. Use this map when porting code between them. + +| Concept | Python (`fishaudio`) | JavaScript (`fish-audio`) | +| --------------------- | ------------------------------------------------- | ---------------------------------------------------------- | +| Client | `FishAudio()` / `AsyncFishAudio()` | `new FishAudioClient({ apiKey })` | +| Text-to-Speech | `client.tts.convert(text=...)` → `bytes` | `client.textToSpeech.convert({ text })` | +| TTS HTTP stream | `client.tts.stream(...)` → `AudioStream` | (use `convert`; realtime streaming is `convertRealtime`) | +| Realtime WebSocket | `client.tts.stream_websocket(text_stream)` | `client.textToSpeech.convertRealtime(request, textStream)` | +| Speech-to-Text | `client.asr.transcribe(audio=...)` | `client.speechToText.convert({ audio })` | +| List voice models | `client.voices.list()` | `client.voices.search()` | +| Get voice model | `client.voices.get(id)` | `client.voices.get(id)` | +| Create voice (clone) | `client.voices.create(title=..., voices=[bytes])` | `client.voices.ivc.create({ title, voices: [File] })` | +| Update / delete voice | `client.voices.update(id, ...)` / `delete(id)` | `client.voices.update(id, ...)` / `delete(id)` | +| Credit balance | `client.account.get_credits()` | `client.user.get_api_credit()` | +| Subscription package | `client.account.get_package()` | `client.user.get_package()` | +| Choose model | `model="s2-pro"` keyword arg | positional `backend` arg, e.g. `convert(req, "s2-pro")` | + +## Decision shortcuts + +- **Audio from text** → `tts.convert` (Python) / `textToSpeech.convert` (JS). 
+- **Reuse a saved voice** → pass `reference_id` (the voice model `id`). +- **Clone a voice instantly from a clip** → pass `references=[ReferenceAudio(audio=..., text=...)]` (Python) / `references: [{ audio, text }]` (JS). See [voice-cloning](references/voice-cloning.md). +- **Persistent custom voice to reuse** → create a voice model, then use its `id` as `reference_id`. +- **Stream tokens from an LLM and play speech as it arrives** → `tts.stream_websocket` (Python) / `textToSpeech.convertRealtime` (JS). See [websocket](references/websocket.md). +- **Transcribe audio** → `asr.transcribe` (Python) / `speechToText.convert` (JS). + +## Gotchas (verified against the SDK source) + +- Python `latency` accepts only **`"normal"` or `"balanced"`** (default `"balanced"`) — there is no `"low"`. +- The Python client has **no `max_retries`** and does **not** auto-retry; the JS client **does** auto-retry (configurable via per-call `requestOptions.maxRetries`). See [errors](references/errors.md). +- Python defines a `ValidationError` class but **never raises it** — don't catch it expecting validation failures; a 422 surfaces as `APIError`. The JS SDK throws `UnprocessableEntityError` on 422. +- ASR segment `start` / `end` are in **seconds**, but `duration` is in **milliseconds**. See [speech-to-text](references/speech-to-text.md). diff --git a/.mintlify/skills/fish-audio-sdk/references/errors.md b/.mintlify/skills/fish-audio-sdk/references/errors.md new file mode 100644 index 0000000..b5b15bd --- /dev/null +++ b/.mintlify/skills/fish-audio-sdk/references/errors.md @@ -0,0 +1,126 @@ +# Errors, Retries & Timeouts + +The two SDKs have **different** exception models. The tables below reflect what the SDK source actually raises — not every exported class is thrown. 
+ +## Python exceptions + +Hierarchy (all subclasses of `FishAudioError`): + +| Exception | When | Attributes | +| --------------------- | ---------------------------------------------- | --------------------------------- | +| `APIError` | base for HTTP errors | `.status`, `.message`, `.body` | +| `AuthenticationError` | 401 — bad/missing key | (APIError) | +| `PermissionError` | 403 | (APIError) | +| `NotFoundError` | 404 — voice id not found | (APIError) | +| `RateLimitError` | 429 | (APIError) | +| `ServerError` | 5xx | (APIError) | +| `WebSocketError` | realtime stream failed | — | +| `DependencyError` | missing system tool (e.g. ffmpeg for `play()`) | `.dependency`, `.install_command` | + +```python +from fishaudio import FishAudio +from fishaudio.exceptions import ( + AuthenticationError, + RateLimitError, + NotFoundError, + APIError, + FishAudioError, +) + +client = FishAudio() +try: + audio = client.tts.convert(text="Hello!", reference_id="maybe-missing") +except AuthenticationError: + ... # bad API key +except RateLimitError: + ... # slow down / out of quota +except NotFoundError: + ... # reference_id doesn't exist +except APIError as e: + print(e.status, e.message) # any other HTTP error +except FishAudioError as e: + print("SDK error:", e) # non-HTTP (e.g. WebSocketError, DependencyError) +``` + +> **Do not catch `ValidationError`.** The class exists and is exported, but the SDK **never raises it**. Invalid input comes back as an `APIError` (HTTP 422). Catch `APIError` (and read `.status == 422`) instead. + +### Retries & timeouts (Python) + +- **No automatic retries.** The Python client makes a single request and raises on failure. Implement your own retry loop if you need one (e.g. back off on `RateLimitError`). +- **Timeout** is set on the client: `FishAudio(timeout=240.0)` (seconds, default 240). +- `RequestOptions(max_retries=...)` exists but is currently a **no-op** — don't rely on it. 
`RequestOptions(timeout=..., additional_headers=...)` does work per request: + +```python +from fishaudio.core.request_options import RequestOptions + +audio = client.tts.convert( + text="Hello!", + request_options=RequestOptions(timeout=30.0, additional_headers={"X-Trace": "abc"}), +) +``` + +## JavaScript exceptions + +```ts +import { + FishAudioClient, + FishAudioError, + FishAudioTimeoutError, +} from "fish-audio"; +import { UnprocessableEntityError } from "fish-audio"; // re-exported from the package root + +const client = new FishAudioClient(); +try { + const audio = await client.textToSpeech.convert({ + text: "Hello!", + reference_id: "maybe-missing", + }); +} catch (err) { + if (err instanceof UnprocessableEntityError) { + console.error("422 validation:", err.body?.detail); // [{ loc, msg, type }] + } else if (err instanceof FishAudioTimeoutError) { + console.error("request timed out"); + } else if (err instanceof FishAudioError) { + console.error(err.statusCode, err.body); // branch on err.statusCode (401/403/404/...) + } else { + throw err; + } +} +``` + +What the JS client actually throws: + +| Error | When | +| ----------------------------------------------------- | -------------------------------------------------------------------------------------------- | +| `UnprocessableEntityError` (extends `FishAudioError`) | 422 — the **only** typed HTTP subclass thrown; `.body` is `{ detail: [{ loc, msg, type }] }` | +| `FishAudioError` | every other non-2xx response; read `.statusCode`, `.body`, `.rawResponse` | +| `FishAudioTimeoutError` | request exceeded the timeout | + +> The package also exports `BadRequestError`, `UnauthorizedError`, `ForbiddenError`, `NotFoundError`, and `TooEarlyError`, but the current client throws a generic `FishAudioError` for those statuses. **Branch on `err.statusCode`** rather than relying on `instanceof NotFoundError`. 
+ +### Retries & timeouts (JavaScript) + +- **Automatic retries are built in.** The client retries `408`, `429`, and `>= 500` with exponential backoff (≈1 s base, 60 s cap) plus jitter, honoring `Retry-After`. You don't need to hand-roll a 429 loop. +- Tune per call via `requestOptions` (the trailing argument on every method): + +```ts +const audio = await client.textToSpeech.convert({ text: "Hello!" }, "s2-pro", { + maxRetries: 5, + timeoutInSeconds: 30, + abortSignal: controller.signal, +}); +``` + +- Default request timeout is **240 s** (`240000 ms`); override with `requestOptions.timeoutInSeconds`. +- `requestOptions` also accepts per-request `apiKey`, `headers`, and `queryParams`. + +## Inspecting raw responses (JS) + +Every method returns an awaitable that also exposes the raw response: + +```ts +const { data, rawResponse } = await client.textToSpeech + .convert({ text: "Hi" }) + .withRawResponse(); +console.log(rawResponse.headers); +``` diff --git a/.mintlify/skills/fish-audio-sdk/references/installation.md b/.mintlify/skills/fish-audio-sdk/references/installation.md new file mode 100644 index 0000000..1c1459d --- /dev/null +++ b/.mintlify/skills/fish-audio-sdk/references/installation.md @@ -0,0 +1,91 @@ +# Installation & Authentication + +## Python (`fishaudio`) + +```bash +pip install fish-audio-sdk # imported as `fishaudio` +pip install "fish-audio-sdk[utils]" # adds local audio playback helpers (play) +``` + +- Requires Python 3.9+. +- Import name is **`fishaudio`** even though the PyPI/dist name is `fish-audio-sdk`. + +## JavaScript / TypeScript (`fish-audio`) + +```bash +npm install fish-audio +# or: pnpm add fish-audio / yarn add fish-audio +``` + +- Requires Node.js 18+ (uses the global `fetch` / Web Streams). + +## Authentication + +Get an API key at `https://fish.audio/app/api-keys`. Both SDKs read `FISH_API_KEY` from the environment automatically. 
+ +```bash +export FISH_API_KEY=your_api_key_here +``` + +```python +from fishaudio import FishAudio + +client = FishAudio() # reads FISH_API_KEY +client = FishAudio(api_key="your_api_key") # or pass explicitly +``` + +```ts +import { FishAudioClient } from "fish-audio"; + +const client = new FishAudioClient(); // reads FISH_API_KEY +const client2 = new FishAudioClient({ apiKey: process.env.MY_KEY }); // or pass explicitly +``` + +Never hardcode a key in source. If neither the argument nor `FISH_API_KEY` is set, the Python client raises `ValueError` at construction time. + +### Other client options + +| Option | Python | JavaScript | +| ------------------ | ----------------------------------- | ------------------------------------------ | +| API key | `api_key=` | `apiKey:` | +| Base URL | `base_url="https://api.fish.audio"` | `baseUrl:` / `environment:` | +| Request timeout | `timeout=240.0` (seconds) | per-call `requestOptions.timeoutInSeconds` | +| Custom HTTP client | `httpx_client=` | (not exposed) | + +> Python caveat: if you pass your own `httpx_client`, the SDK uses it **as-is** — your `base_url`, `timeout`, and the `Authorization` header are **not** applied to it. Pre-configure those on the client you inject. + +There is no client-level `max_retries` or `default_headers` option in Python. Per-request headers go through `request_options`. See [errors.md](errors.md) for retry/timeout behavior. + +## Local audio playback + +The `play()` helper is for local/desktop use and shells out to a system tool: + +- **Python:** needs `ffmpeg` (or pass `use_ffmpeg=False` to try `mpv`). Install the `[utils]` extra. Missing tools raise `DependencyError` with the install command. +- **JavaScript:** spawns `ffplay` (from ffmpeg) and is **Node-only**. + +Install ffmpeg: + +```bash +# macOS +brew install ffmpeg +# Debian/Ubuntu +sudo apt-get install ffmpeg +``` + +In a server or browser context, don't use `play()` — use `save()` (Python) or write/stream the bytes yourself. 
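For the server case, writing the audio yourself needs nothing beyond the standard library. A minimal sketch, assuming the audio is either the `bytes` returned by `convert()` or an iterable of `bytes` chunks such as `stream()` yields; `write_audio` is a hypothetical helper, not part of the SDK:

```python
from pathlib import Path
from typing import Iterable, Union


def write_audio(audio: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write audio to `path` and return the number of bytes written.

    Hypothetical helper (not part of the SDK). Accepts either a single
    bytes object or any iterable of bytes chunks.
    """
    # Normalize both input shapes to "an iterable of chunks".
    chunks = [audio] if isinstance(audio, (bytes, bytearray)) else audio
    written = 0
    with Path(path).open("wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written
```

Because chunks are written as they are read, a streamed response never has to be held in memory in full.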
+ +## Verify a key works + +```python +from fishaudio import FishAudio + +client = FishAudio() +print(client.account.get_credits()) # raises AuthenticationError (401) if the key is bad +``` + +```ts +import { FishAudioClient } from "fish-audio"; + +const client = new FishAudioClient(); +console.log(await client.user.get_api_credit()); +``` diff --git a/.mintlify/skills/fish-audio-sdk/references/speech-to-text.md b/.mintlify/skills/fish-audio-sdk/references/speech-to-text.md new file mode 100644 index 0000000..d7769f3 --- /dev/null +++ b/.mintlify/skills/fish-audio-sdk/references/speech-to-text.md @@ -0,0 +1,63 @@ +# Speech-to-Text (ASR) + +## Python — `client.asr.transcribe` + +```python +from fishaudio import FishAudio + +client = FishAudio() + +with open("audio.wav", "rb") as f: + result = client.asr.transcribe(audio=f.read(), language="en") + +print(result.text) + +for seg in result.segments: + print(f"[{seg.start:.2f}s - {seg.end:.2f}s] {seg.text}") +``` + +Keyword params: + +| Param | Type | Default | Notes | +| -------------------- | ------------------------ | ------------ | ----------------------------------------------------------------------------------------------------------------------- | +| `audio` | `bytes` | — (required) | Raw audio bytes. | +| `language` | `str` | auto-detect | Omit to auto-detect (e.g. `"en"`, `"zh"`, `"ja"`). | +| `include_timestamps` | `bool` | `True` | `False` omits per-segment timestamps (and `segments` is empty). Computing timestamps adds latency on clips under ~30 s. | +| `request_options` | `RequestOptions \| None` | `None` | Per-request timeout / headers. 
| + +### Response shape (`ASRResponse`) + +```python +result.text # str — full transcript +result.duration # float — total audio duration in MILLISECONDS +result.segments # list[ASRSegment] +# each segment: +seg.text # str +seg.start # float — seconds +seg.end # float — seconds +``` + +> **Unit gotcha (verified in source):** segment `start` / `end` are in **seconds**, but `duration` is in **milliseconds**. Don't assume they share a unit. + +## JavaScript — `client.speechToText.convert` + +```ts +import { FishAudioClient } from "fish-audio"; +import { readFile } from "node:fs/promises"; + +const client = new FishAudioClient(); + +const buf = await readFile("audio.wav"); +const result = await client.speechToText.convert({ + audio: new File([buf], "audio.wav"), + language: "en", // optional; omit to auto-detect + ignore_timestamps: false, // false → include per-segment timestamps +}); + +console.log(result.text); +for (const seg of result.segments) { + console.log(`[${seg.start}-${seg.end}] ${seg.text}`); +} +``` + +`STTRequest` = `{ audio: File; language?: string; ignore_timestamps?: boolean }`. Note JS uses `ignore_timestamps` (the inverse of Python's `include_timestamps`). `STTResponse` mirrors the Python shape: `{ text, duration, segments }`. diff --git a/.mintlify/skills/fish-audio-sdk/references/text-to-speech.md b/.mintlify/skills/fish-audio-sdk/references/text-to-speech.md new file mode 100644 index 0000000..b5d51db --- /dev/null +++ b/.mintlify/skills/fish-audio-sdk/references/text-to-speech.md @@ -0,0 +1,119 @@ +# Text-to-Speech + +## Python — `client.tts` + +`convert()` returns the **complete audio as `bytes`**. `stream()` returns an iterable of byte chunks. 
+ +```python +from fishaudio import FishAudio +from fishaudio.utils import play, save + +client = FishAudio() + +# Simplest: default model voice +audio = client.tts.convert(text="Hello, world!") +save(audio, "out.mp3") + +# Use a specific voice model by id +audio = client.tts.convert( + text="Using a saved voice.", + reference_id="802e3bc2b27e49c2995d23ef70e6ac89", +) + +# Pick a model and adjust speed +audio = client.tts.convert(text="Speaking faster.", model="s1", speed=1.5) +``` + +### `tts.convert` parameters + +All keyword-only: + +| Param | Type | Default | Notes | +| ----------------- | ----------------------------------- | ------------- | ---------------------------------------------------------------- | +| `text` | `str` | — (required) | Text to synthesize. | +| `reference_id` | `str \| None` | `None` | Voice model id to speak with. | +| `references` | `list[ReferenceAudio] \| None` | `None` | Inline clone samples — see [voice-cloning.md](voice-cloning.md). | +| `format` | `"mp3" \| "wav" \| "pcm" \| "opus"` | `"mp3"` | Output format. | +| `latency` | `"normal" \| "balanced"` | `"balanced"` | `normal` = higher quality, `balanced` = faster. (No `"low"`.) | +| `speed` | `float` | — | Shortcut for prosody speed (0.5–2.0). | +| `config` | `TTSConfig` | `TTSConfig()` | Reusable bundle of the settings below. | +| `model` | `"s2-pro" \| "s1"` | `"s2-pro"` | Synthesis model. `speech-1.5` / `speech-1.6` are deprecated. | +| `request_options` | `RequestOptions \| None` | `None` | Per-request timeout / headers — see [errors.md](errors.md). | + +Direct params (`reference_id`, `format`, `latency`, `speed`) override the matching field on `config` when set. 
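The override rule can be pictured with a small stand-in. This is a sketch of the precedence only, assuming "set" means "not `None`"; `FakeConfig` and `resolve` are hypothetical names and are not the SDK's `TTSConfig` or its internals:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FakeConfig:
    """Stand-in for a config bundle; NOT the SDK's TTSConfig."""
    reference_id: Optional[str] = None
    format: str = "mp3"
    latency: str = "balanced"


def resolve(config: FakeConfig, **direct) -> FakeConfig:
    """Return a copy of `config` in which every direct argument that is
    not None replaces the matching field (the assumed meaning of "set")."""
    overrides = {k: v for k, v in direct.items() if v is not None}
    return replace(config, **overrides)


base = FakeConfig(reference_id="voice-a", format="wav")
# format is passed directly, so it wins; latency=None falls back to the config.
merged = resolve(base, format="opus", latency=None)
print(merged)
```

The point of the sketch is that the config stays reusable: a direct argument changes one call without mutating the shared bundle.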
+ +### Reusable config with `TTSConfig` + +```python +from fishaudio.types import TTSConfig, Prosody + +config = TTSConfig( + reference_id="933563129e564b19a115bedd57b7406a", + format="wav", + latency="normal", + prosody=Prosody(speed=1.2, volume=-5), # speed 0.5–2.0, volume dB -20..20 + temperature=0.7, # 0.0–1.0 + top_p=0.7, # 0.0–1.0 + chunk_length=200, # 100–300 +) + +audio1 = client.tts.convert(text="First line.", config=config) +audio2 = client.tts.convert(text="Second line.", config=config) +``` + +`TTSConfig` fields (with defaults): `format="mp3"`, `sample_rate=None`, `mp3_bitrate=128` (`64|128|192`), `opus_bitrate=32` (kbps: `-1000|24|32|48|64`, `-1000`=auto), `normalize=True`, `chunk_length=200`, `latency="balanced"`, `reference_id=None`, `references=[]`, `prosody=None`, `top_p=0.7`, `temperature=0.7`, `max_new_tokens=1024`, `repetition_penalty=1.2`, `min_chunk_length=50`, `condition_on_previous_chunks=True`, `early_stop_threshold=1.0`. + +### Streaming the HTTP response + +```python +# Iterate chunks as they arrive (`sink` is any writable binary file object you opened earlier) +for chunk in client.tts.stream(text="Long passage..."): + sink.write(chunk) + +# Or collect everything into bytes +audio = client.tts.stream(text="Hello!").collect() +``` + +Async: every method has a counterpart on `AsyncFishAudio`. Use `await client.tts.convert(...)`; `client.tts.stream(...)` must itself be awaited before you iterate the result with `async for`. + +## JavaScript — `client.textToSpeech` + +`convert(request, backend?, requestOptions?)` resolves to a `ReadableStream` you can `play()` or pipe to a file. `backend` is the **second positional** argument (default `"s2-pro"`) — **not** a named option. + +```ts +import { FishAudioClient, play } from "fish-audio"; +import { createWriteStream } from "node:fs"; +import { Readable } from "node:stream"; + +const client = new FishAudioClient(); + +// default model (s2-pro) +const audio = await client.textToSpeech.convert({ text: "Hello, world!" 
}); +await play(audio); + +// specific voice + model +const audio2 = await client.textToSpeech.convert( + { + text: "Using a saved voice.", + reference_id: "802e3bc2b27e49c2995d23ef70e6ac89", + }, + "s1" // <-- positional backend, not { backend: "s1" } +); + +// pipe to a file instead of playing +await new Promise((resolve, reject) => + Readable.fromWeb(audio2) + .pipe(createWriteStream("out.mp3")) + .on("finish", resolve) + .on("error", reject) +); +``` + +`TTSRequest` (the first argument) fields: `text` (required), `reference_id?`, `references?`, `format?`, `latency?`, `prosody?: { speed?; volume? }`, `temperature?`, `top_p?`, `chunk_length?`, `mp3_bitrate?`, `opus_bitrate?`, `sample_rate?`, `normalize?`, plus the advanced generation knobs (`max_new_tokens`, `repetition_penalty`, etc.). Field names are `snake_case`, matching the API. + +> JS `backend` accepts the full union `'s1' | 's1-mini' | 's2-pro' | 'speech-1.5' | 'speech-1.6' | 'agent-x0'`. Prefer `s2-pro` (default) or `s1`. + +## Model & expression notes + +- `s2-pro` is the default and highest quality; `s1` is the previous generation. +- Emotion/expression is controlled inline in `text` (S1 uses `(parenthesis)` tags, S2-Pro uses free-form `[bracket]` tags) — there is no separate SDK parameter. Full tag list: `https://docs.fish.audio/api-reference/emotion-reference`. diff --git a/.mintlify/skills/fish-audio-sdk/references/voice-cloning.md b/.mintlify/skills/fish-audio-sdk/references/voice-cloning.md new file mode 100644 index 0000000..a56903b --- /dev/null +++ b/.mintlify/skills/fish-audio-sdk/references/voice-cloning.md @@ -0,0 +1,113 @@ +# Voice Cloning & Voice Models + +Two ways to use a custom voice: + +1. **Instant (zero-shot)** — pass reference audio inline on each `convert` call. Nothing is saved. +2. **Persistent voice model** — create a model once, then reuse its `id` as `reference_id`. + +## 1. 
Instant cloning (inline references) + +### Python + +```python +from fishaudio import FishAudio +from fishaudio.types import ReferenceAudio +from fishaudio.utils import save + +client = FishAudio() + +with open("reference.wav", "rb") as f: + audio = client.tts.convert( + text="This is spoken in the cloned voice.", + references=[ReferenceAudio(audio=f.read(), text="Exact transcript of reference.wav.")], + ) +save(audio, "cloned.mp3") +``` + +`ReferenceAudio` = `{ audio: bytes, text: str }`. `text` must match what's spoken in `audio` (include punctuation for prosody). 10–30 s of clean speech works best. + +### JavaScript + +```ts +import { FishAudioClient, play } from "fish-audio"; +import { readFile } from "node:fs/promises"; + +const client = new FishAudioClient(); + +const buf = await readFile("reference.wav"); +const audio = await client.textToSpeech.convert({ + text: "This is spoken in the cloned voice.", + references: [ + { + audio: new File([buf], "reference.wav"), + text: "Exact transcript of reference.wav.", + }, + ], +}); +await play(audio); +``` + +In JS, `ReferenceAudio.audio` is a `File`. + +## 2. Persistent voice models + +### Create — Python `voices.create` + +```python +with open("sample1.wav", "rb") as f1, open("sample2.wav", "rb") as f2: + voice = client.voices.create( + title="My Voice", + voices=[f1.read(), f2.read()], # list of raw audio bytes, one per sample + description="Custom clone", + texts=["Transcript of sample1.", "Transcript of sample2."], # optional; ASR is run if omitted + tags=["en", "narration"], + visibility="private", # "public" | "unlist" | "private" (default "private") + ) + +print(voice.id, voice.state) # state: created | training | trained | failed +``` + +`voices.create` keyword params: `title` (required), `voices: list[bytes]` (required), `description`, `texts`, `tags`, `cover_image: bytes`, `visibility="private"`, `train_mode="fast"`, `enhance_audio_quality=True`. 
+ +### Create — JavaScript `voices.ivc.create` + +```ts +import { readFile } from "node:fs/promises"; + +const buf = await readFile("sample1.wav"); +const voice = await client.voices.ivc.create({ + title: "My Voice", + voices: [new File([buf], "sample1.wav")], // File[] + description: "Custom clone", + visibility: "private", +}); +console.log(voice._id, voice.state); +``` + +> Note the JS path is `client.voices.ivc.create` (IVC = instant voice cloning), and `voices` are `File[]`. + +### Use a created model + +```python +audio = client.tts.convert(text="Using my saved voice.", reference_id=voice.id) +``` + +```ts +const audio = await client.textToSpeech.convert({ + text: "Using my saved voice.", + reference_id: voice._id, +}); +``` + +## Managing voice models + +| Action | Python | JavaScript | +| ------ | -------------------------------------------------------------------------------------- | ----------------------------------------------------------- | +| List | `client.voices.list(self_only=True)` → `PaginatedResponse[Voice]` (`.total`, `.items`) | `client.voices.search({ self: true })` → `{ total, items }` | +| Get | `client.voices.get(voice_id)` | `client.voices.get(voiceId)` | +| Update | `client.voices.update(voice_id, title=..., visibility=...)` | `client.voices.update(voiceId, { title, visibility })` | +| Delete | `client.voices.delete(voice_id)` | `client.voices.delete(voiceId)` | + +Python `voices.list` is manually paged: `page_size` (default 10), `page_number` (default 1), plus filters `title`, `tags`, `self_only`, `author_id`, `language`, `title_language`, and `sort_by` (`"task_count"` default, or `"created_at"`). There is no auto-pager — loop `page_number` yourself. + +A model is usable as a `reference_id` once its `state` is `"trained"`. States: `created → training → trained` (or `failed`). 
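The manual paging and the state check described above can be sketched with two small helpers. Both are hypothetical (not part of the SDK) and take plain callables, so they make no assumption about the client beyond what this page states: a page object with `.total` and `.items`, and a `state` string that ends at `"trained"` or `"failed"`. The interval and timeout defaults are arbitrary.

```python
import time
from typing import Any, Callable, Iterator


def iter_all(fetch_page: Callable[[int, int], Any], page_size: int = 10) -> Iterator[Any]:
    """Yield every item across pages.

    `fetch_page(page_number, page_size)` must return an object exposing
    `.total` and `.items`. Pages are numbered from 1.
    """
    page_number, seen = 1, 0
    while True:
        page = fetch_page(page_number, page_size)
        if not page.items:
            return  # empty page: nothing more to read
        yield from page.items
        seen += len(page.items)
        if seen >= page.total:
            return
        page_number += 1


def wait_until_trained(
    get_state: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `get_state()` until it returns "trained".

    Raises RuntimeError on "failed" and TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == "trained":
            return
        if state == "failed":
            raise RuntimeError("voice model training failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout} s")
        sleep(interval)
```

With the Python client these might be wired up as `iter_all(lambda n, size: client.voices.list(self_only=True, page_number=n, page_size=size))` and `wait_until_trained(lambda: client.voices.get(voice.id).state)`.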
diff --git a/.mintlify/skills/fish-audio-sdk/references/websocket.md b/.mintlify/skills/fish-audio-sdk/references/websocket.md new file mode 100644 index 0000000..470af14 --- /dev/null +++ b/.mintlify/skills/fish-audio-sdk/references/websocket.md @@ -0,0 +1,107 @@ +# Realtime WebSocket TTS + +Stream text in and get audio out as it's generated — ideal for piping an LLM's token stream to speech. + +## Python — `client.tts.stream_websocket` + +The first argument is an **iterable of text chunks** (plain `str`, or `TextEvent` / `FlushEvent` for fine control). The sync method returns an `Iterator[bytes]`; the async method must be awaited and returns an `AsyncIterator[bytes]`. + +### Sync + +```python +from fishaudio import FishAudio +from fishaudio.utils import play + +client = FishAudio() + +def text_chunks(): + yield "Hello, " + yield "this is " + yield "streaming speech." + +# Consume audio chunks as they arrive... +for chunk in client.tts.stream_websocket(text_chunks(), reference_id="", model="s2-pro"): + sink.write(chunk) + +# ...or just play the whole stream locally: +play(client.tts.stream_websocket(text_chunks())) +``` + +### Async + +```python +import asyncio +from fishaudio import AsyncFishAudio + +async def text_chunks(): + yield "Hello, " + yield "streaming speech." + +async def main(): + async with AsyncFishAudio() as client: + audio_stream = await client.tts.stream_websocket(text_chunks()) + with open("out.mp3", "wb") as f: + async for chunk in audio_stream: + f.write(chunk) + +asyncio.run(main()) +``` + +`stream_websocket` keyword params mirror `convert`: `reference_id`, `references`, `format`, `latency`, `speed`, `config`, `model` (default `"s2-pro"`), plus `ws_options` (a `WebSocketOptions` for keepalive/message-size tuning). The sync version also accepts `max_workers` (default `10`) for its background sender thread; the async version does not. 
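When the text arrives from an LLM one token at a time, it can help to regroup the tokens into sentence-sized chunks before handing them to `stream_websocket`. A minimal sketch, assuming sentences end in `.`, `!`, or `?`; `sentence_chunks` is a hypothetical helper, and whether regrouping improves the audio depends on how the service buffers text, which is not verified here:

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Regroup a token stream into chunks that end at sentence boundaries.

    Hypothetical helper, not part of the SDK. Trailing text without
    terminal punctuation is yielded last so nothing is dropped.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        # Ignore trailing whitespace when looking for the sentence end.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer
```

Since the first argument of `stream_websocket` is any iterable of text chunks, the generator can be passed directly, for example `client.tts.stream_websocket(sentence_chunks(llm_tokens))`, where `llm_tokens` stands for your own token iterable. The boundary test is deliberately naive: a token such as `3.` would also end a chunk.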
+ +For manual control, yield events instead of strings: + +```python +from fishaudio.types import TextEvent, FlushEvent + +def events(): + yield TextEvent(text="First sentence.") + yield FlushEvent() # force synthesis of buffered text now + yield TextEvent(text="Second sentence.") +``` + +The SDK sends the start/stop frames for you — you only supply text/flush. + +## JavaScript — `client.textToSpeech.convertRealtime` + +Returns a `RealtimeConnection`; subscribe to events with `RealtimeEvents`. Set `request.text` to `""` and stream the real text via the second argument. + +```ts +import { FishAudioClient, RealtimeEvents } from "fish-audio"; +import { writeFile } from "node:fs/promises"; + +const client = new FishAudioClient(); + +async function* textStream() { + for (const chunk of [ + "Hello from Fish Audio! ", + "Streaming over WebSocket.", + ]) { + yield chunk; + } +} + +const connection = await client.textToSpeech.convertRealtime( + { text: "", reference_id: "" }, + textStream() + // optional positional backend, e.g. "s2-pro" +); + +const chunks: Buffer[] = []; +connection.on(RealtimeEvents.OPEN, () => console.log("open")); +connection.on(RealtimeEvents.AUDIO_CHUNK, audio => { + if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) + chunks.push(Buffer.from(audio)); +}); +connection.on(RealtimeEvents.ERROR, err => console.error(err)); +connection.on(RealtimeEvents.CLOSE, async () => { + await writeFile("out.mp3", Buffer.concat(chunks)); +}); +``` + +`RealtimeEvents`: `OPEN`, `AUDIO_CHUNK`, `ERROR`, `CLOSE`. `textStream` may be an `Iterable` or `AsyncIterable`. + +## Protocol notes + +- The close frame's event literal is **`"stop"`**, not `"close"` (handled for you by both SDKs; relevant only if you drop to raw frames — use the `fish-audio-api` skill for that). +- A realtime run that fails mid-stream surfaces as `WebSocketError` (Python) / an `ERROR` event (JS). Reconnect rather than retrying on the same socket. 
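The reconnect advice above can be sketched as a loop that opens a fresh stream on every attempt. This is a sketch under assumptions, not SDK behavior: `open_stream` is a hypothetical zero-argument callable that starts a new realtime run and returns an iterable of audio chunks, `retry_on` is the exception type(s) to recover from, and the partial audio of a failed attempt is discarded rather than resumed.

```python
import time
from typing import Callable, Iterable, Tuple, Type


def collect_with_reconnect(
    open_stream: Callable[[], Iterable[bytes]],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Collect one complete stream, reconnecting from scratch on failure."""
    for attempt in range(1, attempts + 1):
        try:
            # Each attempt calls open_stream() again: a new connection,
            # never a retry on the socket that just failed.
            return b"".join(open_stream())
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1 s, 2 s, 4 s, ...
    raise ValueError("attempts must be >= 1")
```

With the Python SDK, `open_stream` might be `lambda: client.tts.stream_websocket(text_chunks())` and `retry_on` might be `(WebSocketError,)`, assuming that class is importable from `fishaudio.exceptions` like the other exception types. The text source has to be recreated per attempt, which is why the helper takes a factory rather than an already-open stream.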
diff --git a/api-reference/asyncapi.yml b/api-reference/asyncapi.yml index 68cba7d..d8c40f3 100644 --- a/api-reference/asyncapi.yml +++ b/api-reference/asyncapi.yml @@ -488,10 +488,10 @@ components: type: integer enum: - -1000 - - 24 - - 32 - - 48 - - 64 + - 24000 + - 32000 + - 48000 + - 64000 default: -1000 description: | Opus bitrate in bps. -1000 for automatic. Only applies when format diff --git a/api-reference/emotion-reference.mdx b/api-reference/emotion-reference.mdx deleted file mode 100644 index 75f488d..0000000 --- a/api-reference/emotion-reference.mdx +++ /dev/null @@ -1,137 +0,0 @@ ---- -title: "Emotion Reference" -description: "Complete reference guide for all 64+ emotional expressions in Fish Audio" -icon: "book-open" ---- - -import BasicEmotions from '/snippets/emotion-list-basic.mdx'; -import AdvancedEmotions from '/snippets/emotion-list-advanced.mdx'; -import ToneMarkers from '/snippets/emotion-list-tones.mdx'; -import AudioEffects from '/snippets/emotion-list-effects.mdx'; -import SpecialEffects from '/snippets/emotion-list-special.mdx'; - -## Complete Emotion List - -This reference guide provides a comprehensive list of all 64+ supported emotional expressions and voice styles available in Fish Audio's S1 TTS model. The latest S2-Pro model supports free-form natural language emotion tags. - - -The `(parenthesis)` syntax on this page applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. - - -## Basic Emotions (24) - - - -## Advanced Emotions (25) - - - -## Tone Markers (5) - - - -## Audio Effects (10) - - - -## Special Effects - - - -## Usage Examples - -### Single Emotion -``` -(happy) What a beautiful day! -(sad) I'm sorry for your loss. -(excited) We won the championship! -``` - -### Combined Effects -``` -(sad)(whispering) I'll miss you so much. 
-(angry)(shouting) Get out of here now! -(excited)(laughing) We did it! Ha ha ha! -``` - -### Natural Expressions -``` -That's hilarious! Ha ha ha! // Natural laughter -(sighing) Sigh... what a long day. -(panting) Huff... puff... almost there! -``` - -## Quick Selection Guide - -### For Customer Service -- **Greetings**: `(friendly)`, `(cheerful)`, `(helpful)` -- **Understanding**: `(empathetic)`, `(concerned)`, `(sympathetic)` -- **Problem-solving**: `(confident)`, `(determined)`, `(professional)` -- **Apologies**: `(apologetic)`, `(regretful)`, `(sincere)` - -### For Storytelling -- **Narration**: `(narrator)`, `(calm)`, `(mysterious)` -- **Character emotions**: Any from basic/advanced lists -- **Atmosphere**: `(whispering)`, `(dramatic)`, background effects -- **Action**: `(shouting)`, `(panting)`, `(struggling)` - -### For Educational Content -- **Introduction**: `(enthusiastic)`, `(welcoming)`, `(friendly)` -- **Explanations**: `(calm)`, `(clear)`, `(patient)` -- **Questions**: `(curious)`, `(encouraging)`, `(thoughtful)` -- **Praise**: `(proud)`, `(delighted)`, `(impressed)` - -### For Marketing -- **Excitement**: `(excited)`, `(enthusiastic)`, `(energetic)` -- **Trust**: `(confident)`, `(professional)`, `(sincere)` -- **Urgency**: `(urgent)`, `(in a hurry tone)`, `(important)` -- **Celebration**: `(celebrating)`, `(triumphant)`, `(joyful)` - -## Emotion Categories - -### Positive Emotions -`(happy)` `(excited)` `(delighted)` `(satisfied)` `(proud)` `(grateful)` `(confident)` `(relaxed)` `(hopeful)` `(optimistic)` `(moved)` `(compassionate)` - -### Negative Emotions -`(sad)` `(angry)` `(frustrated)` `(depressed)` `(upset)` `(worried)` `(scared)` `(nervous)` `(disappointed)` `(regretful)` `(guilty)` `(ashamed)` `(lonely)` `(bored)` - -### Neutral/Complex Emotions -`(calm)` `(curious)` `(surprised)` `(confused)` `(uncertain)` `(doubtful)` `(indifferent)` `(nostalgic)` `(sarcastic)` `(determined)` `(resigned)` - -### Social/Interpersonal Emotions 
-`(empathetic)` `(sympathetic)` `(embarrassed)` `(jealous)` `(envious)` `(disdainful)` `(contemptuous)` `(disgusted)` - -## Model Support Matrix - -| Model | Basic | Advanced | Tones | Effects | Intensity | -|-------|-------|----------|-------|---------|----------| -| Fish Speech 1.5 | ✓ | Limited | ✓ | 6/10 | No | -| Fish Audio S1 | ✓ | ✓ | ✓ | ✓ | ✓ | -| Fish Audio S2-Pro | ✓ | ✓ | ✓ | ✓ | ✓ | - -## Tips for Natural Speech - -1. **Start Simple**: Begin with basic emotions before combining -2. **Test Variations**: Different voices handle emotions differently -3. **Context Matters**: Match emotions to content logically -4. **Less is More**: Avoid overusing emotions in short text -5. **Natural Flow**: Space out emotional changes -6. **Sound Effects**: Include appropriate text after audio tags -7. **Preview Often**: Test how emotions sound with your voice - -## Common Mistakes to Avoid - -- ❌ Placing emotion tags mid-sentence in English -- ❌ Forgetting parentheses around tags -- ❌ Using unsupported custom tags -- ❌ Mixing conflicting emotions -- ❌ Overusing effects in short text -- ❌ Missing text for sound effects -- ❌ Using wrong language placement rules - -## See Also - -- [Emotion Control Guide](/developer-guide/core-features/emotions) - Technical implementation -- [Text-to-Speech Best Practices](/developer-guide/core-features/text-to-speech) -- [API Reference](/api-reference/introduction) -- [Try it live](https://fish.audio) - Test emotions in the playground \ No newline at end of file diff --git a/api-reference/endpoint/openapi-v1/text-to-speech.mdx b/api-reference/endpoint/openapi-v1/text-to-speech.mdx index 1dea609..f581b93 100644 --- a/api-reference/endpoint/openapi-v1/text-to-speech.mdx +++ b/api-reference/endpoint/openapi-v1/text-to-speech.mdx @@ -11,7 +11,7 @@ This endpoint only accepts `application/json` and `application/msgpack`. 
For best results, upload reference audio using the [create model](/api-reference/endpoint/model/create-model) before using this one. This improves speech quality and reduces latency. -To upload audio clips directly, without pre-uploading, serialize the request body with MessagePack as per the [instructions](/developer-guide/core-features/text-to-speech#direct-api-usage). +To upload audio clips directly, without pre-uploading, serialize the request body with MessagePack as per the [instructions](/features/text-to-speech#direct-api-messagepack). diff --git a/api-reference/endpoint/wallet/get-user-package.mdx b/api-reference/endpoint/wallet/get-user-package.mdx index d441cd8..9e8a66a 100644 --- a/api-reference/endpoint/wallet/get-user-package.mdx +++ b/api-reference/endpoint/wallet/get-user-package.mdx @@ -1,6 +1,6 @@ --- openapi: get /wallet/{user_id}/package -title: 'Get User Premium' +title: 'Get User Package' description: 'Get current user premium information' icon: "star" iconType: "solid" diff --git a/api-reference/errors.mdx b/api-reference/errors.mdx new file mode 100644 index 0000000..10d2648 --- /dev/null +++ b/api-reference/errors.mdx @@ -0,0 +1,104 @@ +--- +title: "Errors" +description: "HTTP status codes, the error response shape, and how to handle them in any language" +icon: "triangle-exclamation" +--- + +Every Fish Audio error comes back as JSON with a `message` and a `status`: + +```json +{ "message": "Invalid Token", "status": 401 } +``` + +(A request whose body can't be parsed returns a plain-text parse error instead.) + +## Status codes + +| Status | Meaning | What to do | +|---|---|---| +| `400` | Bad request — invalid parameters, or a `reference_id` / voice that doesn't exist | Fix the request; read the `message`. | +| `401` | Invalid or missing API key | Send `Authorization: Bearer `; check it on [API Keys](https://fish.audio/app/api-keys). | +| `402` | Insufficient credits | Top up on [Billing](https://fish.audio/app/billing). 
| +| `403` | Not permitted for this key/resource | Check the key's scope and the resource owner. | +| `404` | Model or voice not found | Verify the `model_id` / `reference_id`. | +| `429` | Rate limit exceeded | Back off and retry (see below). | +| `5xx` | Server error | Retry with backoff; if it persists, contact support. | + +## Retries + +Retry `429` and `5xx` with exponential backoff. Don't retry other `4xx` codes — they won't succeed without a change to the request. + +```python +import time +from fishaudio import FishAudio +from fishaudio.exceptions import RateLimitError, APIError + +client = FishAudio() + +for attempt in range(5): + try: + audio = client.tts.convert(text="Hello!") + break + except RateLimitError: + time.sleep(2 ** attempt) # 1s, 2s, 4s, ... + except APIError as e: + if e.status >= 500: + time.sleep(2 ** attempt) + else: + raise # 4xx — fix the request +``` + +## Handling errors in the SDKs + +Both SDKs raise typed exceptions you can branch on. The base class carries the status and the parsed body. + + +```python Python +from fishaudio.exceptions import ( + AuthenticationError, # 401 + RateLimitError, # 429 + NotFoundError, # 404 + APIError, # any other HTTP error — has .status and .message + FishAudioError, # base class for all SDK errors +) + +try: + audio = client.tts.convert(text="Hello!") +except AuthenticationError: + ... # invalid or missing key +except RateLimitError: + ... # back off and retry +except NotFoundError: + ... # bad reference_id / model id +except APIError as e: + print(e.status, e.message) # 400, 402, 5xx, ... +``` + +```javascript JavaScript +import { + UnauthorizedError, // 401 + TooEarlyError, // 429 + NotFoundError, // 404 + BadRequestError, // 400 + UnprocessableEntityError, // 422 + FishAudioError, // base — has .statusCode and .body +} from "fish-audio"; + +try { + await client.textToSpeech.convert({ text: "Hello!" 
}, "s2-pro"); +} catch (err) { + if (err instanceof UnauthorizedError) { + // invalid or missing key + } else if (err instanceof TooEarlyError) { + // back off and retry + } else if (err instanceof FishAudioError) { + console.error(err.statusCode, err.body); + } +} +``` + + + + Audio playback via `play()` needs `ffmpeg`. If it's missing, the Python SDK raises + `DependencyError` — install `ffmpeg` or save the audio to a file instead. + diff --git a/api-reference/introduction.mdx b/api-reference/introduction.mdx index 55ee40c..aba1caa 100644 --- a/api-reference/introduction.mdx +++ b/api-reference/introduction.mdx @@ -13,6 +13,10 @@ You can generate a new API key at [https://fish.audio/app/api-keys/](https://fis See our [Quick Start](/developer-guide/getting-started/quickstart) guide to generate audio in under 2 minutes. +## Errors + +Every error returns a JSON body with a `message` and a `status`. See [Errors](/api-reference/errors) for the full status-code table, retry guidance, and SDK exception handling. + ## OpenAPI Schema Fish Audio publishes a canonical OpenAPI schema at [https://api.fish.audio/openapi.json](https://api.fish.audio/openapi.json). @@ -28,7 +32,7 @@ Use our [/v1/tts endpoint](/api-reference/endpoint/openapi-v1/text-to-speech) to ## Real-time Streaming -Use our [Python SDK](/developer-guide/sdk-guide/python/websocket) or [JavaScript SDK](/developer-guide/sdk-guide/javascript/websocket) for real-time audio streaming with WebSocket. +Use our [Python SDK](/features/realtime-streaming) or [JavaScript SDK](/features/realtime-streaming) for real-time audio streaming with WebSocket. 
## Rate Limits diff --git a/api-reference/sdk/javascript/api-reference.mdx b/api-reference/sdk/javascript/api-reference.mdx index 902bbdf..34ed8ff 100644 --- a/api-reference/sdk/javascript/api-reference.mdx +++ b/api-reference/sdk/javascript/api-reference.mdx @@ -1,5 +1,6 @@ --- -title: "API Reference" +title: "JavaScript SDK Reference" +sidebarTitle: "Reference" description: "Complete reference for Fish Audio JavaScript SDK" icon: "book" --- diff --git a/api-reference/sdk/python/overview.mdx b/api-reference/sdk/python/overview.mdx index 362510f..d32fce5 100644 --- a/api-reference/sdk/python/overview.mdx +++ b/api-reference/sdk/python/overview.mdx @@ -138,7 +138,7 @@ for chunk in client.tts.stream(text="Long content..."): audio = client.tts.stream(text="Hello!").collect() ``` -[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/text-to-speech) +[Learn more](https://docs.fish.audio/features/text-to-speech) ### Speech-to-Text @@ -154,7 +154,7 @@ for segment in result.segments: print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}") ``` -[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/speech-to-text) +[Learn more](https://docs.fish.audio/features/speech-to-text) ### Real-time Streaming @@ -175,16 +175,26 @@ play(audio_stream) **Asynchronous:** ```python +import asyncio +from fishaudio import AsyncFishAudio + async def text_chunks(): yield "Hello, " yield "this is " yield "streaming!" 
-audio_stream = await client.tts.stream_websocket(text_chunks(), latency="balanced") -play(audio_stream) +async def main(): + async with AsyncFishAudio() as client: + # stream_websocket is an async generator — iterate it, don't await the call + audio_stream = client.tts.stream_websocket(text_chunks(), latency="balanced") + with open("out.mp3", "wb") as f: + async for chunk in audio_stream: + f.write(chunk) + +asyncio.run(main()) ``` -[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/websocket) +[Learn more](https://docs.fish.audio/features/realtime-streaming) ### Voice Cloning @@ -222,7 +232,7 @@ audio = client.tts.convert( ) ``` -[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/voice-cloning) +[Learn more](https://docs.fish.audio/features/voice-cloning) ## Resource Clients @@ -239,8 +249,9 @@ audio = client.tts.convert( from fishaudio.exceptions import ( AuthenticationError, RateLimitError, - ValidationError, - FishAudioError + NotFoundError, + APIError, + FishAudioError, ) try: @@ -249,10 +260,12 @@ except AuthenticationError: print("Invalid API key") except RateLimitError: print("Rate limit exceeded") -except ValidationError as e: - print(f"Invalid request: {e}") +except NotFoundError: + print("Voice model not found") +except APIError as e: + print(f"API error {e.status}: {e.message}") # any other HTTP error, including 422 validation except FishAudioError as e: - print(f"API error: {e}") + print(f"SDK error: {e}") ``` ## Resources diff --git a/api-reference/sdk/python/types.mdx b/api-reference/sdk/python/types.mdx index df761b3..1fce2c5 100644 --- a/api-reference/sdk/python/types.mdx +++ b/api-reference/sdk/python/types.mdx @@ -57,7 +57,7 @@ Represents a TTS voice that can be used for synthesis. - `description` - Detailed description of the voice model - `cover_image` - URL to the voice model's cover image - `train_mode` - Training mode used. 
Options: "fast" -- `state` - Current model state (e.g., "ready", "training", "failed") +- `state` - Current model state: "created", "training", "trained", or "failed" - `tags` - List of tags for categorization (e.g., ["male", "english", "young"]) - `samples` - List of audio samples demonstrating the voice - `created_at` - Timestamp when the model was created @@ -378,12 +378,12 @@ Response from speech-to-text transcription. **Attributes**: - `text` - Complete transcription of the entire audio -- `duration` - Total audio duration in milliseconds +- `duration` - Total audio duration in seconds - `segments` - List of timestamped text segments. Empty if include_timestamps=False #### duration -Duration in milliseconds +Duration in seconds diff --git a/archive/python-sdk-legacy/migration-guide.mdx b/archive/python-sdk-legacy/migration-guide.mdx index d4098b0..647c8d5 100644 --- a/archive/python-sdk-legacy/migration-guide.mdx +++ b/archive/python-sdk-legacy/migration-guide.mdx @@ -389,11 +389,11 @@ seconds = segment.start / 1000 Detailed API documentation - + TTS features and examples - + Clone voices and manage models diff --git a/archive/python-sdk-legacy/text-to-speech.mdx b/archive/python-sdk-legacy/text-to-speech.mdx index eefc459..71c5bf5 100644 --- a/archive/python-sdk-legacy/text-to-speech.mdx +++ b/archive/python-sdk-legacy/text-to-speech.mdx @@ -210,7 +210,7 @@ def generate_with_retry(request, max_retries=3): ## Next Steps - [Fine-grained control](/developer-guide/core-features/fine-grained-control) for phoneme-level control and paralanguage -- [Voice cloning](/developer-guide/sdk-guide/python/voice-cloning) for custom voices -- [WebSocket streaming](/developer-guide/sdk-guide/python/websocket) for real-time apps -- [Guide and Best Practices](/developer-guide/core-features/text-to-speech) for production use +- [Voice cloning](/features/voice-cloning) for custom voices +- [WebSocket streaming](/features/realtime-streaming) for real-time apps +- [Guide and Best 
Practices](/features/text-to-speech) for production use - [API reference](/api-reference/endpoint/openapi-v1/text-to-speech) for direct API calls \ No newline at end of file diff --git a/developer-guide/best-practices/emotion-control.mdx b/developer-guide/best-practices/emotion-control.mdx index 589a6d9..4c97c1d 100644 --- a/developer-guide/best-practices/emotion-control.mdx +++ b/developer-guide/best-practices/emotion-control.mdx @@ -70,7 +70,7 @@ What a (happy) wonderful day! ## Available Emotions -See the [Emotion Reference](/api-reference/emotion-reference) for the full list of supported emotions. +See the [Emotion Control guide](/developer-guide/core-features/emotions) for the full list of supported emotions. ## Scene Examples diff --git a/developer-guide/core-features/creating-models.mdx b/developer-guide/core-features/creating-models.mdx deleted file mode 100644 index 5f6da13..0000000 --- a/developer-guide/core-features/creating-models.mdx +++ /dev/null @@ -1,400 +0,0 @@ ---- -title: "Creating Voice Models" -description: "Learn how to create custom voice models with Fish Audio" -icon: "wand-magic-sparkles" ---- - -import { AudioTranscript } from "/snippets/audio-transcript.jsx"; - -{/* speak-mintlify-hash: 6a43e7312895ae0c33a68fad2e95821fbecd8a5bfe0c250d3ee631871dc8d410 */} - - - - - -## Overview - -Create custom voice models to generate consistent, high-quality speech. You can create models through our web interface or programmatically via API. 
- -## Web Interface - -The easiest way to create a voice model: - - - - Visit [fish.audio](https://fish.audio) and log in - - Click on "Models" in your dashboard - Select "Create New Model" - - Add 1 or more voice samples (at least 10 seconds each) - - - Choose privacy settings and training options - - Click "Create" and wait for processing - - -## Using the API - -### Using the SDK - -Create models with the Python or JavaScript SDK: - - - - First, install the SDK: - - ```bash - pip install fish-audio-sdk - ``` - - Then create a model: - - ```python - from fishaudio import FishAudio - - client = FishAudio(api_key="your_api_key_here") - - with open("sample1.mp3", "rb") as f1, open("sample2.wav", "rb") as f2: - voice = client.voices.create( - title="My Voice Model", - voices=[f1.read(), f2.read()], - description="Custom voice for storytelling", - visibility="private", - enhance_audio_quality=True, - ) - - # The Python SDK maps the REST `_id` field to `voice.id`. - print(f"Voice model ID: {voice.id}") - ``` - - - - First, install the SDK: - - ```bash - npm install fish-audio - ``` - - Then create a model: - - ```javascript - import { FishAudioClient } from "fish-audio"; - import { createReadStream } from "fs"; - - const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - - try { - const response = await fishAudio.voices.ivc.create({ - title: "My Voice Model", - voices: [ - createReadStream("sample1.mp3"), - createReadStream("sample2.wav"), - ], - description: "Custom voice for storytelling", - visibility: "private", - enhance_audio_quality: true, - }); - - console.log("Voice created:", { - _id: response._id, - title: response.title, - state: response.state, - }); - } catch (err) { - console.error("Create voice request failed:", err); - } - ``` - - - - -### Direct API - -Create models directly using the REST API: - - - The REST API accepts uploaded audio as `multipart/form-data`. 
Let your HTTP - client set the multipart `Content-Type` boundary for you. - - - - - ```bash - curl --request POST "https://api.fish.audio/model" \ - --header "Authorization: Bearer $FISH_API_KEY" \ - --form "type=tts" \ - --form "train_mode=fast" \ - --form "title=My Voice Model" \ - --form "visibility=private" \ - --form "description=Custom voice model" \ - --form "voices=@sample1.mp3" \ - --form "voices=@sample2.wav" \ - --form "enhance_audio_quality=true" - ``` - - - ```python - import requests - - with open("sample1.mp3", "rb") as f1, open("sample2.wav", "rb") as f2: - response = requests.post( - "https://api.fish.audio/model", - headers={"Authorization": "Bearer YOUR_API_KEY"}, - data=[ - ("type", "tts"), - ("train_mode", "fast"), - ("title", "My Voice Model"), - ("description", "Custom voice model"), - ("visibility", "private"), - ("enhance_audio_quality", "true"), - ], - files=[ - ("voices", f1), - ("voices", f2), - ], - ) - - response.raise_for_status() - result = response.json() - print(f"Model ID: {result['_id']}") - print(f"State: {result['state']}") - ``` - - - - ```javascript - import { readFile } from "fs/promises"; - - const form = new FormData(); - form.append("title", "My Voice Model"); - form.append("description", "Custom voice model"); - form.append("visibility", "private"); - form.append("type", "tts"); - form.append("train_mode", "fast"); - form.append("enhance_audio_quality", "true"); - - const v1 = await readFile("sample1.mp3"); - const v2 = await readFile("sample2.wav"); - form.append("voices", new Blob([v1]), "sample1.mp3"); - form.append("voices", new Blob([v2]), "sample2.wav"); - - const res = await fetch("https://api.fish.audio/model", { - method: "POST", - headers: { Authorization: "Bearer " }, - body: form, - }); - - if (!res.ok) throw new Error(await res.text()); - - const result = await res.json(); - console.log("Model ID:", result._id); - console.log("State:", result.state); - ``` - - - - -## Model Settings - -### Required 
Parameters - -| Parameter | Description | Type | Options | -| ---------------- | ------------------------------------------------------------------------------ | ----------------------- | ----------------------- | -| **title** | Name of your model | `string` | Any text | -| **voices** | One or more audio samples | `File` or `Array` | .mp3, .wav, .m4a, .opus | -| **type\*** | Model type | `enum` | `tts` | -| **train_mode\*** | Model train mode. `fast` means the model is instantly available after creation | `enum` | `fast` | - -\*Automatically set by Python and JavaScript SDKs - -### Optional Parameters - -| Parameter | Description | Type | Options | -| ------------------------- | ------------------------------------------------------------------- | --------------------------- | ---------------------------------------------------- | -| **visibility** | Who can use your model | `enum` | `private`, `public`, `unlist`
`default: public` | -| **description** | Model description | `string` or `null` | Any text | -| **cover_image** | Model cover image, required if the model is public | `File` | .jpg, .png | -| **texts** | Transcripts of audio samples. If omitted, ASR transcribes the audio | `string` or `Array` | Must match number of audio files | -| **tags** | Tags for your model | `string` or `Array` | Any text | -| **enhance_audio_quality** | Remove background noise and normalize audio | `boolean` | `true`, `false`
`default: true` | -| **generate_sample** | Generate a default sample text for the model | `boolean` | `true`, `false`
`default: false` | - - - The REST API defaults `visibility` to `public`. The SDK examples above set - `visibility` to `private`, which is safer for personal voice models and avoids - requiring a public `cover_image`. - - -For detailed explanations view our [API reference](/api-reference/endpoint/model/create-model). - -## Audio Requirements - -### Quality Guidelines - -**Minimum Requirements:** - -- At least 1 audio sample -- 10+ seconds per sample - -**Best Practices:** - -- Use multiple diverse samples -- 1 consistent speaker throughout -- Include different emotions and tones -- Record in a quiet environment -- Maintain steady volume - -## Adding Transcripts - -Including text transcripts improves model quality: - - - - ```python - import requests - - with open("hello.mp3", "rb") as f1, open("world.wav", "rb") as f2: - response = requests.post( - "https://api.fish.audio/model", - headers={"Authorization": "Bearer YOUR_API_KEY"}, - files=[ - ("voices", f1), - ("voices", f2), - ], - data=[ - ("type", "tts"), - ("train_mode", "fast"), - ("title", "Enhanced Model"), - ("texts", "Hello, this is my first recording."), - ("texts", "Welcome to the world of AI voices."), - ("visibility", "private"), - ], - ) - - response.raise_for_status() - print(response.json()["_id"]) - ``` - - - - ```javascript - import { FishAudioClient } from "fish-audio"; - import { createReadStream } from "fs"; - - const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - - const response = await fishAudio.voices.ivc.create({ - title: "Enhanced Model", - voices: [ - createReadStream("hello.mp3"), - createReadStream("world.wav"), - ], - texts: [ - "Hello, this is my first recording.", - "Welcome to the world of AI voices.", - ], - // other optional fields: - // visibility: "private", - // enhance_audio_quality: true, - }); - - console.log("Model ID:", response._id); - ``` - - - - - - Text transcripts must match the exact number of audio files. 
If you provide 3 - audio files, you must provide exactly 3 text transcripts. - - -## Using Your Model - -Once training is complete: - -Use the SDK `voice.id` or the REST response `_id` as the TTS `reference_id`. - - - - ```python - from fishaudio import FishAudio - from fishaudio.utils import save - - client = FishAudio(api_key="your_api_key_here") - - audio = client.tts.convert( - text="Hello from my custom voice!", - reference_id="your_voice_model_id", - format="mp3", - ) - - save(audio, "output.mp3") - ``` - - - - ```javascript - import { FishAudioClient } from "fish-audio"; - import { writeFile } from "fs/promises"; - - const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - - const audio = await fishAudio.textToSpeech.convert({ - text: "Hello from my custom voice!", - reference_id: "your_voice_model_id", - format: "mp3", - }); - - const buffer = Buffer.from(await new Response(audio).arrayBuffer()); - await writeFile("output.mp3", buffer); - console.log("✓ Audio saved to output.mp3"); - ``` - - - - -## Troubleshooting - -### Common Issues - -**Model training fails:** - -- Check audio quality and format -- Ensure single speaker in all samples -- Verify files are not corrupted -- Confirm REST requests include `type=tts`, `train_mode=fast`, `title`, and at least one `voices` file -- If `texts` are provided, make sure the count matches the number of `voices` files - -**Poor voice quality:** - -- Add more diverse audio samples -- Enable audio enhancement -- Use higher quality recording - -**Public model creation fails:** - -- Add a `cover_image`, or set `visibility` to `private` or `unlist` - -**Cannot use the created voice in TTS:** - -- Use REST `_id` or SDK `voice.id` as the TTS `reference_id` -- If the model state is not `trained`, check it with [Get Model](/api-reference/endpoint/model/get-model) - -## Best Practices - -1. **Start Simple:** Begin with 2-3 samples in fast mode to test -2. 
**Iterate:** Refine with cleaner samples, transcripts, and audio enhancement -3. **Document:** Keep track of which samples work best -4. **Test Thoroughly:** Try different texts and emotions -5. **Privacy First:** Keep personal models private - -## Support - -Need help creating models? - -- **API Documentation:** [Full API Reference](/api-reference/introduction) -- **Discord Community:** [Join our Discord](https://discord.gg/fish-audio) -- **Email Support:** support@fish.audio diff --git a/developer-guide/core-features/emotions.mdx b/developer-guide/core-features/emotions.mdx index 1af8b51..c66f8b6 100644 --- a/developer-guide/core-features/emotions.mdx +++ b/developer-guide/core-features/emotions.mdx @@ -9,6 +9,11 @@ import AdvancedEmotions from "/snippets/emotion-list-advanced-s2.mdx"; import ToneMarkers from "/snippets/emotion-list-tones-s2.mdx"; import AudioEffects from "/snippets/emotion-list-effects-s2.mdx"; import SpecialEffects from "/snippets/emotion-list-special-s2.mdx"; +import BasicEmotionsS1 from "/snippets/emotion-list-basic.mdx"; +import AdvancedEmotionsS1 from "/snippets/emotion-list-advanced.mdx"; +import ToneMarkersS1 from "/snippets/emotion-list-tones.mdx"; +import AudioEffectsS1 from "/snippets/emotion-list-effects.mdx"; +import SpecialEffectsS1 from "/snippets/emotion-list-special.mdx"; import { AudioTranscript } from "/snippets/audio-transcript.jsx"; {/* speak-mintlify-hash: 702d6176919d2e53007c71cc0850ac755add8d539e6fbd4c55fd20e8332821d7 */} @@ -18,15 +23,19 @@ import { AudioTranscript } from "/snippets/audio-transcript.jsx"; + + Drop text with the markers below into the `text` field and send a real request to hear the emotion. + + ## Overview Fish Audio models support 64+ emotional expressions and voice styles that can be controlled through text markers in your input. Add natural pauses, laughter, and other human-like elements to make speech more engaging and realistic. - This page shows S2 usage with `[bracket]` cues. 
If you use S1, you must use - markers wrapped in parentheses. See the [Models - Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) - for details. + This page shows S2 usage with `[bracket]` cues. If you use the legacy S1 model, + wrap markers in parentheses instead — see [S1 (legacy) syntax](#s1-legacy-syntax) + below for the full list, or the [Models + Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control). ## How It Works @@ -240,8 +249,34 @@ All 13 supported languages can use emotion markers. For sentence-level control, | Excited Laugh | `[excited][laughing]` | "We did it! Ha ha!" | | Nervous Question | `[nervous][uncertain]` | "Are you sure about this?" | +## S1 (legacy) syntax + +The default **S2-Pro** model uses `[bracket]` cues with free-form natural language. The previous-generation **S1** model uses the same emotion names but requires `(parentheses)` and a fixed tag set: + +```text +(happy) What a beautiful day! +(sad)(whispering) I'll miss you so much. 
+``` + + + + + + + + + + + + + + + + + + + ## See Also -- [Emotion Reference Guide](/api-reference/emotion-reference) - S1 emotion list with examples - [API Reference](/api-reference/introduction) - Implementation details -- [Text-to-Speech Guide and Best Practices](/developer-guide/core-features/text-to-speech) +- [Text-to-Speech Guide and Best Practices](/features/text-to-speech) diff --git a/developer-guide/core-features/fine-grained-control.mdx b/developer-guide/core-features/fine-grained-control.mdx index a1b86ef..023187a 100644 --- a/developer-guide/core-features/fine-grained-control.mdx +++ b/developer-guide/core-features/fine-grained-control.mdx @@ -1,5 +1,6 @@ --- title: "Fine-grained Control" +sidebarTitle: "Overview" description: "Advanced control over speech generation" icon: "sliders" iconType: "solid" @@ -13,6 +14,10 @@ import { AudioTranscript } from "/snippets/audio-transcript.jsx"; + + Put your phoneme or paralanguage tags into the `text` field and send a real request to hear the result. + + ## Getting Started To use fine-grained control, you can use either our SDK, API, or Playground. diff --git a/developer-guide/core-features/speech-to-text.mdx b/developer-guide/core-features/speech-to-text.mdx deleted file mode 100644 index fee3351..0000000 --- a/developer-guide/core-features/speech-to-text.mdx +++ /dev/null @@ -1,325 +0,0 @@ ---- -title: "Speech to Text Guide" -description: "Convert audio recordings into accurate text transcriptions" -icon: "microphone-lines" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: d76297f22aba43ae28b2b18a4102cd1dd58a82a461670bbb30192a9195ae5bda */} - - - - - -## Overview - -Transform any audio recording into text with Fish Audio's speech recognition. Perfect for transcriptions, subtitles, and voice commands. 
- -## Getting Started - -### Web Interface - -Transcribe audio instantly: - - - - Go to [fish.audio](https://fish.audio) and log in - - - Click on "Speech to Text" in your dashboard - - - Select your audio file (MP3, WAV, M4A) - - - Click "Transcribe" and copy your text - - - -## Supported Formats - -### Audio Files - -**Accepted formats:** -- MP3 (recommended) -- WAV -- M4A -- OGG -- FLAC -- AAC - -**File requirements:** -- Maximum size: 20MB -- Maximum duration: 60 minutes -- Minimum duration: 1 second - -## Language Support - -### Automatic Detection - -The system automatically detects the language spoken in your audio. No configuration needed! - -### Manual Selection - -For better accuracy, specify the language: - -**Major Languages:** -- English (en) -- Chinese (zh) -- Japanese (ja) - -With **additional languages** to be supported soon! - -## Audio Quality Tips - -### For Best Results - -**Recording Environment:** -- Quiet room with minimal echo -- No background music -- Clear, consistent speaking voice -- One speaker at a time - -**Audio Settings:** -- Sample rate: 16kHz or higher -- Bit rate: 128kbps or higher -- Mono or stereo (mono preferred) - -### Common Issues - -**Poor transcription quality?** -- Remove background noise -- Increase microphone volume -- Speak clearly and not too fast -- Avoid multiple speakers talking over each other - -## Use Cases - -### Meeting Transcription - -Convert recorded meetings into searchable text: - -1. Record your meeting (Zoom, Teams, etc.) -2. Export the audio file -3. Upload to Fish Audio -4. Get formatted transcription with timestamps - -### Podcast Transcripts - -Create written versions of your podcasts: - -- Generate show notes automatically -- Create searchable content -- Improve accessibility -- Enable translations - -### Video Subtitles - -Generate subtitles for your videos: - -1. Extract audio from video -2. Transcribe with Fish Audio -3. Get timestamped text -4. 
Import into video editor - -### Voice Notes - -Convert voice memos to text: - -- Dictate ideas quickly -- Transcribe later for editing -- Search through voice notes -- Share as text documents - -## Advanced Features - -### Timestamps - -Get precise timing for each spoken segment: - -``` -[00:00:00] Welcome to our podcast. -[00:00:03] Today we're discussing AI technology. -[00:00:07] Let's dive right in. -``` - -Perfect for: -- Creating subtitles -- Navigating long recordings -- Synchronizing with video -- Building searchable archives - -### Speaker Detection - -Identify different speakers in conversations: - -``` -Speaker 1: "What do you think about the proposal?" -Speaker 2: "I think it has potential." -Speaker 1: "Let's discuss the details." -``` - -### Punctuation & Formatting - -Automatic formatting includes: -- Sentence capitalization -- Punctuation marks -- Paragraph breaks -- Number formatting - -## Tips for Different Content - -### Interviews - -**Best practices:** -- Use a good microphone for each speaker -- Record in a quiet environment -- Speak one at a time -- Keep consistent volume levels - -### Lectures & Presentations - -**Optimize for:** -- Clear articulation of technical terms -- Pause between topics -- Repeat important points -- Avoid reading too fast - -### Phone Calls - -**Considerations:** -- Phone audio is lower quality -- Expect slightly lower accuracy -- Speak clearly and slowly -- Avoid speakerphone if possible - -## Accuracy Expectations - -### What Affects Accuracy - -**Positive factors:** -- Clear audio quality -- Native speaker accent -- Common vocabulary -- Single speaker - -**Challenging factors:** -- Heavy accents -- Technical jargon -- Multiple speakers -- Background noise - -### Typical Accuracy Rates - -- **Professional recording:** 95-98% -- **Clean amateur recording:** 90-95% -- **Phone/video calls:** 85-90% -- **Noisy environments:** 75-85% - -## Post-Processing Tips - -### Editing Transcriptions - -After transcription: - -1. 
**Review for accuracy** - Check names and technical terms -2. **Add formatting** - Break into paragraphs -3. **Correct errors** - Fix any misheard words -4. **Add context** - Include speaker names - -### Export Options - -Save your transcriptions as: -- Plain text (.txt) -- Word document (.docx) -- Subtitle file (.srt) -- PDF document - -## Common Applications - -### Business - -- Meeting minutes -- Interview transcripts -- Call recordings -- Training materials - -### Education - -- Lecture notes -- Research interviews -- Student recordings -- Language learning - -### Content Creation - -- Video scripts -- Podcast show notes -- Social media captions -- Blog post drafts - -### Accessibility - -- Hearing impaired support -- Multi-language content -- Searchable archives -- Documentation - -## Troubleshooting - -### No Text Output - -**Check:** -- Audio file isn't corrupted -- File format is supported -- Audio contains speech -- Volume is audible - -### Incorrect Language - -**Solutions:** -- Manually select the correct language -- Ensure majority of audio is in one language -- Separate multi-language content - -### Missing Words - -**Common causes:** -- Speaking too fast -- Mumbling or unclear speech -- Technical terms not recognized -- Very quiet sections - -## Privacy & Security - -### Your Data - -- Audio files are processed securely -- Transcriptions are private to your account -- Files are not used for training -- Delete anytime from your account - -### Sensitive Content - -For confidential audio: -- Use on-premise solutions if available -- Review privacy policy -- Consider redacting sensitive information -- Download and delete after processing - -## Best Practices Summary - -1. **Start with quality audio** - Good input = good output -2. **Choose the right environment** - Quiet spaces work best -3. **Speak clearly** - Articulate and consistent pace -4. **Review and edit** - All transcriptions benefit from review -5. 
**Use appropriate tools** - Different content needs different approaches - -## Get Support - -Need help with transcription? - -- **Try it free:** [fish.audio](https://fish.audio) -- **Community:** [Discord](https://discord.gg/fish-audio) -- **Email:** support@fish.audio -- **Status:** [status.fish.audio](https://status.fish.audio) \ No newline at end of file diff --git a/developer-guide/core-features/text-to-speech.mdx b/developer-guide/core-features/text-to-speech.mdx deleted file mode 100644 index 8ab70ce..0000000 --- a/developer-guide/core-features/text-to-speech.mdx +++ /dev/null @@ -1,657 +0,0 @@ ---- -title: "Text to Speech" -description: "Convert text to natural-sounding speech with Fish Audio" -icon: "volume-high" ---- - -import { AudioSample } from '/snippets/audio-sample.jsx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 97976236c8a840e3fa946aaa32fbe89b46871f27c4ff63c296aeadf1bdb00b36 */} - - - - - -## Overview - -Transform any text into natural, expressive speech using Fish Audio's advanced TTS models. Choose from pre-made voices or use your own cloned voices. - - - - - -Discover the world's best cloned voices models on our [Discovery](https://fish.audio/discovery) page. 
- -## Quick Start - -### Web Interface - -The easiest way to generate speech: - - - - Go to [fish.audio](https://fish.audio) and log in - - - Type or paste the text you want to convert - - - Select from available voices or use your own - - - Click "Generate" and download your audio - - - -## Using the SDK - - - - - - ```bash - pip install fish-audio-sdk - ``` - - - - Generate speech with just a few lines of code: - ```python - from fishaudio import FishAudio - from fishaudio.utils import save - - # Initialize client - client = FishAudio(api_key="your_api_key_here") - - # Generate speech - audio = client.tts.convert( - text="Hello, world!", - reference_id="your_voice_model_id" - ) - save(audio, "output.mp3") - - print("✓ Audio saved to output.mp3") - ``` - - - - - - - - ```bash - npm install fish-audio - ``` - - - - Generate speech with just a few lines of code: - ```javascript - import { FishAudioClient } from "fish-audio"; - import { writeFile } from "fs/promises"; - - // Initialize session - const fishAudio = new FishAudioClient({ apiKey: "your_api_key_here" }); - - const audio = await fishAudio.textToSpeech.convert({ - text: "Hello, world!", - reference_id: "your_voice_model_id", - }); - - const buffer = Buffer.from(await new Response(audio).arrayBuffer()); - await writeFile("output.mp3", buffer); - - console.log("✓ Audio saved to output.mp3"); - ``` - - - - - -## Voice Options - -### Using Pre-made Voices - -Browse and select voices from the playground: - - - - ```python - # Use a voice from the playground - audio = client.tts.convert( - text="Welcome to Fish Audio!", - reference_id="7f92f8afb8ec43bf81429cc1c9199cb1" - ) - ``` - - - ```javascript - # Use a voice from the playground - const audio = await fishAudio.textToSpeech.convert({ - text: "Welcome to Fish Audio!", - reference_id: "7f92f8afb8ec43bf81429cc1c9199cb1", - }); - ``` - - - -### Using Your Cloned Voice - -Use voices you've created: - - - - ```python - # Use your own cloned voice - audio = 
client.tts.convert( - text="This is my custom voice speaking", - reference_id="your_model_id" - ) - ``` - - - ```javascript - # Use your own cloned voice - const audio = await fishAudio.textToSpeech.convert({ - text: "This is my custom voice speaking", - reference_id: "your_model_id", - }); - ``` - - - -### Using Reference Audio - -Provide reference audio directly: - - - - ```python - from fishaudio.types import ReferenceAudio - - # Use reference audio on-the-fly - with open("voice_sample.wav", "rb") as f: - audio = client.tts.convert( - text="Hello from reference audio", - references=[ - ReferenceAudio( - audio=f.read(), - text="Sample text from the audio" - ) - ] - ) - ``` - - - ```javascript - // Use reference audio on-the-fly - const fileBuffer = await readFile("voice_sample.wav"); - const voiceFile = new File([fileBuffer], "voice_sample.wav"); - - const audio = await fishAudio.textToSpeech.convert({ - text: "Hello from reference audio", - references: [ - { audio: voiceFile, text: "Sample text from the audio" } - ] - }); - ``` - - - -## Model Selection - -Choose the right model for your needs: - -| Model | Best For | Quality | Speed | -|---|---|---|---| -| **s1** | Prototyping | Excellent | Fast | -| **s2-pro** | Latest features | Excellent | Fastest | - -Specify a model in your request: - - - - ```python - # Using the latest model (default) - audio = client.tts.convert(text="Hello world") - ``` - - - ```javascript - // Using the latest S2-Pro model - const audio = await fishAudio.textToSpeech.convert( - { text: "Hello world" }, - "s2-pro" - ); - ``` - - - -## Advanced Options - -### Audio Formats - -Choose your output format: - - - - ```python - audio = client.tts.convert( - text="Your text here", - format="mp3", # Options: "mp3", "wav", "pcm", "opus" - mp3_bitrate=128 # For MP3: 64, 128, or 192 - ) - ``` - - - ```javascript - const audio = await fishAudio.textToSpeech.convert({ - text: "Your text here", - format: "mp3", // Options: "mp3", "wav", "pcm", "opus" 
- mp3_bitrate: 128, // For MP3: 64, 128, or 192 - }); - ``` - - - -### Chunk Length - -Control text processing chunks: - - - - ```python - audio = client.tts.convert( - text="Long text content...", - chunk_length=200 # 100-300 characters per chunk - ) - ``` - - - ```javascript - const audio = await fishAudio.textToSpeech.convert({ - text: "Long text content...", - chunk_length: 200, // 100-300 characters per chunk - }); - ``` - - - -### Latency Mode - -Optimize for speed or quality: - - - - ```python - audio = client.tts.convert( - text="Quick response needed", - latency="balanced" # "normal" or "balanced" - ) - ``` - - - ```javascript - const audio = await fishAudio.textToSpeech.convert({ - text: "Quick response needed", - latency: "balanced", // "normal" or "balanced" - }); - ``` - - - - -Balanced mode reduces latency to ~300ms but may slightly decrease stability. - - -## Direct API Usage - -For direct API calls without the SDK: - - - - ```python - import httpx - import ormsgpack - - # Prepare request - request_data = { - "text": "Hello, world!", - "reference_id": "your_model_id", - "format": "mp3" - } - - # Make API call - with httpx.Client() as client: - response = client.post( - "https://api.fish.audio/v1/tts", - content=ormsgpack.packb(request_data), - headers={ - "authorization": "Bearer YOUR_API_KEY", - "content-type": "application/msgpack", - "model": "s2-pro" - } - ) - - # Save audio - with open("output.mp3", "wb") as f: - f.write(response.content) - ``` - - - ```javascript - import { encode } from "@msgpack/msgpack"; - import { writeFile } from "fs/promises"; - - const body = encode({ - text: "Hello, world!", - reference_id: "your_model_id", - format: "mp3", - }); - - const res = await fetch("https://api.fish.audio/v1/tts", { - method: "POST", - headers: { - Authorization: "Bearer ", - "Content-Type": "application/msgpack", - model: "s2-pro", - }, - body, - }); - - const buffer = Buffer.from(await res.arrayBuffer()); - await writeFile("output.mp3", 
buffer); - ``` - - - -## Streaming Audio - -Stream audio for real-time applications: - - - - ```python - # Stream audio chunks - audio_stream = client.tts.stream( - text="Streaming this text in real-time", - reference_id="model_id" - ) - - with open("stream_output.mp3", "wb") as f: - for chunk in audio_stream: - f.write(chunk) - # Process chunk immediately for real-time playback - ``` - - - ```javascript - // Use a Websocket to stream real-time audio - - import { FishAudioClient, RealtimeEvents } from "fish-audio"; - import { writeFile } from "fs/promises"; - import path from "path"; - - // Simple async generator that yields text chunks - async function* makeTextStream() { - const chunks = [ - "Hello from Fish Audio! ", - "This is a realtime text-to-speech test. ", - "We are streaming multiple chunks over WebSocket.", - ]; - for (const chunk of chunks) { - yield chunk; - } - } - - const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - - // For realtime, set text to "" and stream the content via makeTextStream - const request = { text: "" }; - - const connection = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream()); - - // Collect audio and write to a file when the stream ends - const chunks: Buffer[] = []; - connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened")); - connection.on(RealtimeEvents.AUDIO_CHUNK, (audio: unknown): void => { - if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) { - chunks.push(Buffer.from(audio)); - } - }); - connection.on(RealtimeEvents.ERROR, (err) => console.error("WebSocket error:", err)); - connection.on(RealtimeEvents.CLOSE, async () => { - const outPath = path.resolve(process.cwd(), "out.mp3"); - await writeFile(outPath, Buffer.concat(chunks)); - console.log("Saved to", outPath); - }); - ``` - - - -### Streaming with Timestamps - -Use the [Text to Speech Stream with Timestamps API](/api-reference/endpoint/openapi-v1/text-to-speech-stream-with-timestamps) when you 
need generated audio and alignment data in the same stream. This endpoint returns Server-Sent Events where each event includes an `audio_base64` chunk and, when available, the latest cumulative `alignment` snapshot for a `chunk_seq`. Clients should concatenate audio chunks in arrival order and replace stored alignment snapshots by `chunk_seq`. - - - Timestamped streaming is best for karaoke-style highlighting, synchronized - captions, phrase progress indicators, and timeline editing. For this endpoint, - prefer `opus` over `mp3` when possible because Opus provides cleaner streaming - boundaries for alignment. - - -## Adding Emotions - - -The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. - - -Make your speech more expressive: - - - - ```python - # Add emotion markers to your text - emotional_text = """ - (excited) I just won the lottery! - (sad) But then I lost the ticket. - (laughing) Just kidding, I found it! - """ - - audio = client.tts.convert( - text=emotional_text, - reference_id="model_id" - ) - ``` - - - ```javascript - // Add emotion markers to your text - const emotionalText = `(excited) I just won the lottery! -(sad) But then I lost the ticket. -(laughing) Just kidding, I found it!`; - - const audio = await fishAudio.textToSpeech.convert({ - text: emotionalText, - reference_id: "model_id", - }); - ``` - - - -Available emotions: -- Basic: `(happy)`, `(sad)`, `(angry)`, `(excited)`, `(calm)` -- Tones: `(shouting)`, `(whispering)`, `(soft tone)` -- Effects: `(laughing)`, `(sighing)`, `(crying)` - -For more precise control over pronunciation and additional paralanguage features like pauses and breathing, see [Fine-grained Control](/developer-guide/core-features/fine-grained-control). 
- -## Best Practices - -### Text Preparation - -**Do:** -- Use proper punctuation for natural pauses -- Add emotion markers for expression -- Break long texts into paragraphs -- Use consistent formatting - -**Don't:** -- Use ALL CAPS (unless shouting) -- Mix multiple languages randomly -- Include special characters unnecessarily -- Forget punctuation - -### Performance Tips - -1. **Batch Processing:** Process multiple texts efficiently -2. **Cache Models:** Store frequently used model IDs -3. **Optimize Chunk Size:** Use 200 characters for best balance -4. **Handle Errors:** Implement retry logic for network issues - -### Quality Optimization - -For best results: -- Use high-quality reference audio for cloning -- Choose appropriate emotion markers -- Test different latency modes -- Monitor API rate limits - -## Troubleshooting - -### Common Issues - -**No audio output:** -- Check API key validity -- Verify model ID exists -- Ensure proper audio format - -**Poor quality:** -- Use better reference audio -- Try normal latency mode -- Check text formatting - -**Slow generation:** -- Use balanced latency mode -- Reduce chunk length -- Check network connection - -## Code Examples - -### Batch Processing - - - - ```python - from fishaudio.utils import save - - texts = [ - "First announcement", - "Second announcement", - "Third announcement" - ] - - for i, text in enumerate(texts): - audio = client.tts.convert( - text=text, - reference_id="model_id" - ) - save(audio, f"output_{i}.mp3") - ``` - - - ```javascript - const texts = [ - "First announcement", - "Second announcement", - "Third announcement", - ]; - - for (let i = 0; i < texts.length; i++) { - const audio = await fishAudio.textToSpeech.convert({ - text: texts[i], - reference_id: "model_id", - }); - const buffer = Buffer.from(await new Response(audio).arrayBuffer()); - await writeFile(`output_${i}.mp3`, buffer); - } - ``` - - - -### Error Handling - - - - ```python - import time - from fishaudio.exceptions import 
FishAudioError - - def generate_with_retry(text, max_retries=3): - for attempt in range(max_retries): - try: - audio = client.tts.convert( - text=text, - reference_id="model_id" - ) - return audio - except FishAudioError as e: - if attempt < max_retries - 1: - time.sleep(2 ** attempt) # Exponential backoff - else: - raise e - ``` - - - ```javascript - async function generateWithRetry(text, maxRetries = 3) { - for (let attempt = 0; attempt < maxRetries; attempt++) { - try { - const audio = await fishAudio.textToSpeech.convert({ - text, - reference_id: "model_id", - }); - const buffer = Buffer.from(await new Response(audio).arrayBuffer()); - return buffer; - } catch (err) { - if (attempt < maxRetries - 1) { - const delayMs = 2 ** attempt * 1000; - await new Promise((r) => setTimeout(r, delayMs)); - } else { - throw err; - } - } - } - } - - const buffer = await generateWithRetry("Hello with retry"); - await writeFile("retry_output.mp3", buffer); - ``` - - - -## API Reference - -### Request Parameters - -| Parameter | Type | Description | Default | -|---|---|---|---| -| **text** | string | Text to convert | Required | -| **reference_id** | string | Model/voice ID | None | -| **format** | string | Audio format | "mp3" | -| **chunk_length** | integer | Characters per chunk | 200 | -| **normalize** | boolean | Normalize text | true | -| **latency** | string | Speed vs quality | "normal" | - -### Response - -Returns audio data in the specified format as binary stream. - -## Get Support - -Need help with text-to-speech? 
- -- [API Reference](/api-reference/introduction) -- **Discord Community:** [Join our Discord](https://discord.gg/fish-audio) -- **Email Support:** support@fish.audio diff --git a/developer-guide/getting-started/api-key.mdx b/developer-guide/getting-started/api-key.mdx new file mode 100644 index 0000000..6c48152 --- /dev/null +++ b/developer-guide/getting-started/api-key.mdx @@ -0,0 +1,76 @@ +--- +title: "Get Your API Key" +description: "Create a Fish Audio account, generate an API key, and make your first request" +icon: "key" +--- + +Everything you build with Fish Audio — the API, the Python library, JavaScript — authenticates with a single **API key**. Here's how to get one and make your first call in a couple of minutes. + +## 1. Create an account and key + + + + Go to [fish.audio/auth/signup](https://fish.audio/auth/signup), create an account, and verify your email. + + + Sign in and open [fish.audio/app/api-keys](https://fish.audio/app/api-keys). + + + Click **Create New Key**, give it a descriptive name (and an expiration if you want), then **copy the key and store it securely** — treat it like a password. + + Never commit your API key to version control or share it publicly. + + + +## 2. Store it as an environment variable + +The SDKs and the examples throughout these docs read your key from `FISH_API_KEY`: + +```bash +export FISH_API_KEY="your_api_key_here" +``` + +This keeps the key out of your code and lets you use different keys for development and production. + +## 3. 
Make your first request + + +```python Python +from fishaudio import FishAudio +from fishaudio.utils import save + +client = FishAudio() # reads FISH_API_KEY +audio = client.tts.convert(text="Hello from Fish Audio!") +save(audio, "hello.mp3") +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ "text": "Hello from Fish Audio!", "format": "mp3" }' \ + --output hello.mp3 +``` + + +You just generated your first audio. Where to next: + + + + Voices, formats, speed, and streaming. + + + A fuller first-request walkthrough. + + + Build a custom voice from your own audio. + + + Transcribe audio with timestamps. + + + + + **Building with an AI coding agent?** Install the Fish Audio skill so it writes correct SDK/API code — `npx skills add docs.fish.audio`. See [AI Coding Agents](/developer-guide/resources/coding-agents). + diff --git a/developer-guide/getting-started/introduction.mdx b/developer-guide/getting-started/introduction.mdx deleted file mode 100644 index b872c5b..0000000 --- a/developer-guide/getting-started/introduction.mdx +++ /dev/null @@ -1,139 +0,0 @@ ---- -title: "Overview of Fish Audio" -description: "Discover Fish Audio's powerful voice generation platform and what you can build" -icon: "whale" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 8c07d4d6124c05fdf89f134c80a6b162016fc0a0632dddffb3c16cec3d5ca0d3 */} - - - - - -## What is Fish Audio? - -Fish Audio is a cutting-edge AI platform for voice generation, voice cloning, and audio storytelling. -Our technology brings dynamic, natural-sounding voices to your applications, enabling immersive experiences across industries. 
- - - Introducing our latest generation voice models: - - **Fish Audio S2-Pro:** Our latest model delivers unparalleled naturalness and emotion, setting a new standard for AI-generated speech. [Learn more about our models →](/developer-guide/models-pricing/models-overview) - - -## Core Capabilities - - - - Generate natural, expressive speech from text in multiple languages and styles - - - - Create custom voice models from as little as 15 seconds of audio - - - - Build multi-character narratives with emotion and dynamic voice switching - - - -## Try It Now - - - - Test our voices in the interactive playground - no code required - - - - Browse available voice models and their capabilities - - - -## Ready to Start? - -Get your API key and make your first API call in minutes. - - - Generate your first AI voice in under 5 minutes - - - -## Platform Capabilities - -Fish Audio empowers developers to create innovative voice experiences across diverse industries. Whether you're building consumer apps, enterprise solutions, or creative tools, our platform provides the flexibility and power you need. 
- -### What You Can Build - - - - Automate podcast production, YouTube narration, and audiobook generation - - - - Create dynamic NPC dialogue and real-time character voices - - - - Build interactive language learning tools and accessible educational content - - - - Deploy natural-sounding IVR systems and support agents - - - - Develop screen readers and voice restoration tools - - - - Generate ASMR content, music vocals, interactive stories, and adult content - - - -### Key Features - - - - Stream audio in real-time for live applications - - - - Industry-leading naturalness and clarity - - - - Generate speech in 30+ languages - - - - Fine-tune prosody, emotion, and speaking style - - - - RESTful API with SDKs for Python, Node.js, and more - - - - Handle everything from prototypes to production workloads - - - -## Learn More - -- [Models & Pricing](/developer-guide/models-pricing/models-overview) - Explore voice models and pricing options -- [Core Features](/developer-guide/core-features/text-to-speech) - Deep dive into TTS and voice cloning -- [SDKs & Tools](/developer-guide/sdk-guide/python/installation) - Install language-specific libraries -- [Best Practices](/developer-guide/best-practices/voice-cloning) - Production-ready tips and optimization for voice cloning, emotion and expression control, and real-time voice streaming \ No newline at end of file diff --git a/developer-guide/getting-started/quickstart.mdx b/developer-guide/getting-started/quickstart.mdx index dcc9722..6538170 100644 --- a/developer-guide/getting-started/quickstart.mdx +++ b/developer-guide/getting-started/quickstart.mdx @@ -90,7 +90,7 @@ Choose your preferred method to generate speech: from fishaudio.utils import save # Initialize with your API key - client = FishAudio(api_key="your_api_key_here") + client = FishAudio() # reads FISH_API_KEY # Generate speech audio = client.tts.convert(text="Hello! 
Welcome to Fish Audio.") @@ -182,6 +182,17 @@ Choose your preferred method to generate speech: +## If your first request fails + +| Status | Likely cause | Fix | +|---|---|---| +| `401` | Invalid or missing API key | Check `FISH_API_KEY` and your key on [API Keys](https://fish.audio/app/api-keys). | +| `402` | Out of credits | Top up on [Billing](https://fish.audio/app/billing). | +| `400` | Bad `reference_id` / parameters | Verify the voice id; read the error `message`. | +| `429` | Rate limit | Wait and retry with backoff. | + +See [Errors](/api-reference/errors) for the full table and retry handling. + ## Customizing Your Voice The examples above use the default voice. To use a different voice, add the `reference_id` parameter with a model ID from [fish.audio](https://fish.audio). You can find the model ID in the URL or use the copy button when viewing any voice. @@ -214,7 +225,7 @@ Then generate speech with your chosen voice: curl -X POST https://api.fish.audio/v1/tts \ -H "Authorization: Bearer $FISH_API_KEY" \ -H "Content-Type: application/json" \ - -H "model: s2" \ + -H "model: s2-pro" \ -d '{ "text": "This is a custom voice from Fish Audio! 
You can explore hundreds of different voices on the platform, or even create your own.", "reference_id": "'"$REFERENCE_ID"'", @@ -230,7 +241,7 @@ Then generate speech with your chosen voice: from fishaudio import FishAudio from fishaudio.utils import save - client = FishAudio(api_key="your_api_key_here") + client = FishAudio() # reads FISH_API_KEY # Generate speech with custom voice audio = client.tts.convert( diff --git a/developer-guide/products/story-studio.mdx b/developer-guide/products/story-studio.mdx deleted file mode 100644 index a6a626d..0000000 --- a/developer-guide/products/story-studio.mdx +++ /dev/null @@ -1,22 +0,0 @@ ---- -title: "Story Studio" -description: "Build immersive audio stories and narratives" -icon: "wand-magic-sparkles" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: de306c9f3b406de604d8bad8fabe70e7d955d04a136d1aaf49c69d0ea87dd072 */} - - - - - - -Coming soon! We're preparing comprehensive documentation for Story Studio. - - -In the meantime, you can: -- Visit the [Fish Audio Playground](https://fish.audio) to explore our storytelling features -- Check back soon for detailed guides and tutorials - -Join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. \ No newline at end of file diff --git a/developer-guide/products/tts.mdx b/developer-guide/products/tts.mdx deleted file mode 100644 index 964c6a5..0000000 --- a/developer-guide/products/tts.mdx +++ /dev/null @@ -1,23 +0,0 @@ ---- -title: "Text to Speech" -description: "Convert text into natural-sounding speech with Fish Audio's AI voices" -icon: "volume-high" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 320a06bd0e91ba30dae7508f5d75d09d4b59171d5b825cf336d4027ab1e55323 */} - - - - - - -Coming soon! We're preparing comprehensive documentation for our Text-to-Speech web interface. 
- - -In the meantime, you can: -- Visit the [Fish Audio Playground](https://fish.audio) to try our TTS features -- Check our [API documentation](/api-reference/endpoint/openapi-v1/text-to-speech) for programmatic access -- Read our [TTS Guide and Best Practices](/developer-guide/core-features/text-to-speech) - -Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. \ No newline at end of file diff --git a/developer-guide/products/voice-cloning.mdx b/developer-guide/products/voice-cloning.mdx deleted file mode 100644 index fdf3a26..0000000 --- a/developer-guide/products/voice-cloning.mdx +++ /dev/null @@ -1,23 +0,0 @@ ---- -title: "Voice Cloning" -description: "Create custom voice models from audio samples" -icon: "microphone" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 22bd06fb705ab4f510dcb10a5a0a976e777ebfb717f33f3039ff2ac373eb37c0 */} - - - - - - -Coming soon! We're preparing comprehensive documentation for our Voice Cloning web interface. - - -In the meantime, you can: -- Visit the [Fish Audio Playground](https://fish.audio) to try voice cloning -- View our [Python SDK voice cloning guide](/developer-guide/sdk-guide/python/voice-cloning) -- Read our [voice cloning best practices](/developer-guide/best-practices/voice-cloning) - -Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. 
\ No newline at end of file diff --git a/developer-guide/resources/agent-quickstart.mdx b/developer-guide/resources/agent-quickstart.mdx index 0da5df0..0f4202b 100644 --- a/developer-guide/resources/agent-quickstart.mdx +++ b/developer-guide/resources/agent-quickstart.mdx @@ -1,113 +1,176 @@ --- title: "Agent Quickstart" -description: "Low-noise entry points and canonical URLs for AI agents using Fish Audio documentation" +description: "Build with Fish Audio using your AI coding agent — install the skill and start prompting in a minute" icon: "robot" --- -## Purpose - -This page is the recommended starting point for AI agents, RAG pipelines, and documentation crawlers that need accurate Fish Audio references with minimal markup noise. - -## Built-In Agent Indexes - -This documentation site already provides built-in LLM-friendly indexes: - -- [llms.txt](https://docs.fish.audio/llms.txt) for the curated documentation index -- [llms-full.txt](https://docs.fish.audio/llms-full.txt) for broader site context - -In most cases, agents should read `llms.txt` first and only fetch `llms-full.txt` when they need wider context across the whole documentation set. - -## Install the Agent Skill - -For coding agents that support [Agent Skills](https://github.com/vercel-labs/skills) (Claude Code, Cursor, Windsurf, Codex, and others), install the ready-made raw-API skill with a single command: - -```bash -npx skills add https://docs.fish.audio --skill fish-audio-api +Install the Fish Audio **agent skill** and your coding agent — Claude Code, Cursor, Codex, and others — writes correct, current Fish Audio code: right method names, units, and error types, instead of guessing. Here's the fastest path. + + + + ```bash + npx skills add https://docs.fish.audio + ``` + + This installs both Fish Audio skills (a canonical copy in `.agents/skills/`, with symlinks for Claude Code and Cursor). Run `npx skills update` later to refresh them. 
+ + + + Python (`fish-audio-sdk`) and JavaScript (`fish-audio`) — exact method signatures, sync + async, model selection, and the real exception types. + + + Raw REST + WebSocket for any language — auth, endpoints, MessagePack/JSON/multipart rules, and the streaming protocol. + + + + + + [Create a key](/developer-guide/getting-started/api-key) and export it — the code your agent writes reads it from the environment: + + ```bash + export FISH_API_KEY="your_api_key_here" + ``` + + + + Prompt in plain language — it uses the correct client, methods, and error types: + + + + "Generate speech with Fish Audio in a cloned voice and save it to a file." + + + "Transcribe `speech.wav` with Fish Audio and print the segments." + + + "Stream an LLM's tokens to Fish Audio TTS over the WebSocket." + + + "Call the Fish Audio TTS REST API from Go, no SDK." + + + + + +## Install options + + +```bash All skills +npx skills add https://docs.fish.audio ``` -The skill teaches the agent how to call the Fish Audio REST and WebSocket APIs directly from `curl`, Python, Node.js, or any HTTP client — no SDK required. It covers authentication, every endpoint in our [OpenAPI schema](https://docs.fish.audio/api-reference/openapi.json), MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, and the WebSocket streaming protocol. - -Discovery endpoint: [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json). Run `npx skills add https://docs.fish.audio` (without `--skill`) to install every skill published here, including the auto-generated product overview skill. - -## Retrieval Order - -1. Read [llms.txt](https://docs.fish.audio/llms.txt) for the curated documentation index. -2. Read [llms-full.txt](https://docs.fish.audio/llms-full.txt) when broad site context is needed. -3. Read [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) for REST schemas, parameters, and examples. -4. 
Read [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) for the WebSocket streaming protocol. -5. Fetch individual `.md` pages only after narrowing to a specific task. - -## Canonical API Facts - -- Base API URL: `https://api.fish.audio` -- Authentication: `Authorization: Bearer ` -- TTS model selection: send a required `model` header. Recommended default: `s2-pro` -- Main REST endpoints: - - `POST /v1/tts` - - `POST /v1/asr` - - `GET /model` - - `POST /model` - - `GET /model/{id}` - - `PATCH /model/{id}` - - `DELETE /model/{id}` -- Real-time streaming endpoint: `wss://api.fish.audio/v1/tts/live` - -## High-Value URLs - -### Start Here - -- [Agent Quickstart](https://docs.fish.audio/developer-guide/resources/agent-quickstart.md) -- [Quick Start](https://docs.fish.audio/developer-guide/getting-started/quickstart.md) -- [AI Coding Agents](https://docs.fish.audio/developer-guide/resources/coding-agents.md) - -### API Specs - -- [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) -- [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) -- [API Introduction](https://docs.fish.audio/api-reference/introduction.md) - -### Authentication And SDK Setup - -- [Python Authentication](https://docs.fish.audio/developer-guide/sdk-guide/python/authentication.md) -- [JavaScript Authentication](https://docs.fish.audio/developer-guide/sdk-guide/javascript/authentication.md) -- [Python SDK Overview](https://docs.fish.audio/developer-guide/sdk-guide/python/overview.md) -- [JavaScript Installation](https://docs.fish.audio/developer-guide/sdk-guide/javascript/installation.md) - -### Core Product Tasks - -- [Text to Speech Guide](https://docs.fish.audio/developer-guide/core-features/text-to-speech.md) -- [Speech to Text Guide](https://docs.fish.audio/developer-guide/core-features/speech-to-text.md) -- [Creating Voice Models](https://docs.fish.audio/developer-guide/core-features/creating-models.md) -- [Emotion 
Control](https://docs.fish.audio/developer-guide/core-features/emotions.md) -- [Fine-grained Control](https://docs.fish.audio/developer-guide/core-features/fine-grained-control.md) - -### Real-Time And Integrations - -- [WebSocket TTS Streaming](https://docs.fish.audio/api-reference/endpoint/websocket/tts-live.md) -- [Real-time Voice Streaming Best Practices](https://docs.fish.audio/developer-guide/best-practices/real-time-streaming.md) -- [Python WebSocket Streaming](https://docs.fish.audio/developer-guide/sdk-guide/python/websocket.md) -- [JavaScript WebSocket](https://docs.fish.audio/developer-guide/sdk-guide/javascript/websocket.md) -- [LiveKit Integration](https://docs.fish.audio/developer-guide/integrations/livekit.md) -- [Pipecat Integration](https://docs.fish.audio/developer-guide/integrations/pipecat.md) - -### Models, Pricing, And Lifecycle - -- [Models Overview](https://docs.fish.audio/developer-guide/models-pricing/models-overview.md) -- [Choosing a Model](https://docs.fish.audio/developer-guide/models-pricing/choosing-a-model.md) -- [Pricing And Rate Limits](https://docs.fish.audio/developer-guide/models-pricing/pricing-and-rate-limits.md) -- [Model Deprecations](https://docs.fish.audio/developer-guide/models-pricing/deprecations.md) - -## Task Routing - -- If the task is "generate speech", start with Quick Start, the Text to Speech guide, and `POST /v1/tts`. -- If the task is "transcribe audio", start with the Speech to Text guide and `POST /v1/asr`. -- If the task is "clone or manage voices", start with Creating Voice Models and the `/model` endpoints. -- If the task is "stream audio in real time", start with AsyncAPI, WebSocket TTS Streaming, and the WebSocket SDK guides. -- If the task is "pick the right model or estimate cost", start with Models Overview and Pricing And Rate Limits. +```bash One skill +npx skills add https://docs.fish.audio --skill fish-audio-sdk +``` -## Notes For Agents +```bash Target an agent +# claude-code, cursor, codex, ... 
+npx skills add https://docs.fish.audio -a claude-code +``` -- Prefer `openapi.json` and `asyncapi.yml` for machine-readable schemas. -- Prefer `.md` URLs when you need a single human-authored page in Markdown form. -- Some richer pages use interactive MDX widgets. If a fetched page contains UI or component noise, fall back to this page, `llms.txt`, `llms-full.txt`, or the API spec files first. -- Treat this page as the canonical low-noise entry point for Fish Audio documentation retrieval. +```bash List / inspect first +npx skills add https://docs.fish.audio --list +``` + + + + Targeting specific agents, the live-docs MCP server, skill-vs-MCP, and reading the skills before you install. + + +## Next steps + + + + Create a key and make your first request. + + + Generate your first audio by hand, in any language. + + + Voices, formats, streaming, and the direct API. + + + Endpoints, parameters, and the OpenAPI schema. + + + +## For autonomous agents & RAG pipelines + +Not a coding agent installing a skill — an autonomous agent, RAG pipeline, or crawler? Start from these low-noise, machine-readable entry points: + +- [llms.txt](https://docs.fish.audio/llms.txt) — curated documentation index (read this first) +- [llms-full.txt](https://docs.fish.audio/llms-full.txt) — broader context across the whole site +- [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) — REST schemas, parameters, and examples +- [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) — the WebSocket streaming protocol + + + + - Base API URL: `https://api.fish.audio` + - Authentication: `Authorization: Bearer ` + - TTS model selection: send a required `model` header. Recommended default: `s2-pro` + - Main REST endpoints: + - `POST /v1/tts` + - `POST /v1/asr` + - `GET /model` + - `POST /model` + - `GET /model/{id}` + - `PATCH /model/{id}` + - `DELETE /model/{id}` + - Real-time streaming endpoint: `wss://api.fish.audio/v1/tts/live` + + + + 1. 
Read [llms.txt](https://docs.fish.audio/llms.txt) for the curated documentation index. + 2. Read [llms-full.txt](https://docs.fish.audio/llms-full.txt) when broad site context is needed. + 3. Read [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) for REST schemas, parameters, and examples. + 4. Read [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) for the WebSocket streaming protocol. + 5. Fetch individual `.md` pages only after narrowing to a specific task. + + + + **API specs** + - [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) + - [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) + - [API Introduction](https://docs.fish.audio/api-reference/introduction.md) + + **Auth & SDK setup** + - [Python Authentication](https://docs.fish.audio/developer-guide/sdk-guide/python/authentication.md) + - [JavaScript Authentication](https://docs.fish.audio/developer-guide/getting-started/api-key.md) + - [Python SDK Overview](https://docs.fish.audio/api-reference/sdk/python/overview.md) + - [JavaScript SDK Reference](https://docs.fish.audio/api-reference/sdk/javascript/api-reference.md) + + **Core product tasks** + - [Text to Speech Guide](https://docs.fish.audio/features/text-to-speech.md) + - [Speech to Text Guide](https://docs.fish.audio/features/speech-to-text.md) + - [Creating Voice Models](https://docs.fish.audio/features/voice-cloning.md) + - [Emotion Control](https://docs.fish.audio/developer-guide/core-features/emotions.md) + - [Fine-grained Control](https://docs.fish.audio/developer-guide/core-features/fine-grained-control.md) + + **Real-time & integrations** + - [WebSocket TTS Streaming](https://docs.fish.audio/api-reference/endpoint/websocket/tts-live.md) + - [Real-time Streaming Best Practices](https://docs.fish.audio/developer-guide/best-practices/real-time-streaming.md) + - [Realtime Streaming (SDK)](https://docs.fish.audio/features/realtime-streaming.md) + - [LiveKit 
Integration](https://docs.fish.audio/developer-guide/integrations/livekit.md) + - [Pipecat Integration](https://docs.fish.audio/developer-guide/integrations/pipecat.md) + + **Models, pricing & lifecycle** + - [Models Overview](https://docs.fish.audio/developer-guide/models-pricing/models-overview.md) + - [Choosing a Model](https://docs.fish.audio/developer-guide/models-pricing/choosing-a-model.md) + - [Pricing & Rate Limits](https://docs.fish.audio/developer-guide/models-pricing/pricing-and-rate-limits.md) + - [Model Deprecations](https://docs.fish.audio/developer-guide/models-pricing/deprecations.md) + + + + - **Generate speech** → Quick Start, the Text to Speech guide, and `POST /v1/tts`. + - **Transcribe audio** → the Speech to Text guide and `POST /v1/asr`. + - **Clone or manage voices** → Creating Voice Models and the `/model` endpoints. + - **Stream audio in real time** → AsyncAPI, WebSocket TTS Streaming, and the realtime guides. + - **Pick a model or estimate cost** → Models Overview and Pricing & Rate Limits. + + + + - Prefer `openapi.json` and `asyncapi.yml` for machine-readable schemas. + - Append `.md` to any page URL to fetch the human-authored page as plain Markdown. + - Some richer pages use interactive MDX widgets. If a fetched page contains UI or component noise, fall back to `llms.txt`, `llms-full.txt`, or the API spec files. 
+ + diff --git a/developer-guide/resources/coding-agents.mdx b/developer-guide/resources/coding-agents.mdx index 8c8444b..8967979 100644 --- a/developer-guide/resources/coding-agents.mdx +++ b/developer-guide/resources/coding-agents.mdx @@ -1,478 +1,153 @@ --- title: "AI Coding Agents" -description: "Connect AI coding assistants to Fish Audio documentation via MCP for real-time API guidance" +description: "Install the Fish Audio skill so your coding agent writes correct SDK and API code" icon: "robot" --- import { AudioTranscript } from "/snippets/audio-transcript.jsx"; -{/* speak-mintlify-hash: bb48040abec73604a76c532d042b645b95eeccb2d2d82929888bf41b3180935a */} - -## Overview - -Integrate Fish Audio's comprehensive documentation directly into your AI coding assistants. Using MCP (Model Context Protocol), coding agents like Claude Code, Cursor, and Windsurf can access our latest API references, guides, and examples in real-time. - - -The Fish Audio MCP server provides instant access to: -- Complete API documentation -- SDK usage examples -- Best practices and implementation patterns -- Troubleshooting guides - -Connect once and get accurate, up-to-date Fish Audio knowledge in your coding environment. - - - - -This documentation site also exposes built-in LLM-friendly indexes: - -- [llms.txt](https://docs.fish.audio/llms.txt) for the curated page index -- [llms-full.txt](https://docs.fish.audio/llms-full.txt) for broader site context - -If your coding agent supports direct document fetching, start with `llms.txt` before pulling individual pages. - - - -## Install as an Agent Skill - -Fish Audio publishes a ready-made [Agent Skill](https://github.com/vercel-labs/skills) that teaches your coding agent how to call the Fish Audio REST and WebSocket APIs directly, without an SDK. It covers authentication, every endpoint in our OpenAPI schema, MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, and the WebSocket streaming protocol. 
- - - - ```bash - npx skills add https://docs.fish.audio --skill fish-audio-api - ``` - - This installs the skill into your agent's local skill directory (for example `~/.claude/skills/fish-audio-api/`). Once installed, ask your agent to "call the Fish Audio TTS API with curl" or "stream TTS over WebSocket in Python" and it will follow the skill's conventions. - - - - ```bash - npx skills add https://docs.fish.audio - ``` - - Installs every skill advertised at [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json), including the auto-generated product overview skill and the raw-API skill. - - - - The discovery index lives at [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json) and each skill's raw markdown is served at [/.well-known/agent-skills/<skill>/SKILL.md](https://docs.fish.audio/.well-known/agent-skills/fish-audio-api/SKILL.md). Review the skill content first, then install with: - - ```bash - npx skills add https://docs.fish.audio --list # show available skills - npx skills add https://docs.fish.audio --skill fish-audio-api - ``` - - +Install the Fish Audio **agent skill**, and your coding agent — Claude Code, Cursor, Codex, and others — writes correct, current Fish Audio code: right method names, units, and error types, instead of guessing. - -The `skills` CLI works with any agent that uses `SKILL.md` conventions — Claude Code, Cursor, Windsurf, Codex, and others. See [`npx skills --help`](https://github.com/vercel-labs/skills) for agent-specific install flags such as `-a claude-code` or `-a cursor`. - +## Install the skill - -Prefer MCP if you want live documentation search inside your editor. Prefer the Agent Skill if you want a self-contained instruction file that works offline after install and doesn't rely on a running MCP server. - +```bash +npx skills add https://docs.fish.audio +``` -## Why Use MCP Integration? 
+This installs both Fish Audio skills into your agent (a canonical copy in `.agents/skills/`, with symlinks for Claude Code and Cursor). Run `npx skills update` later to refresh them. - - - Access the latest API documentation without leaving your editor + + + Python (`fish-audio-sdk`) and JavaScript (`fish-audio`) — exact method signatures and defaults, sync + async, model selection, and the real exception types. - - - Generate working code based on current API specifications - - - - Get context-aware help for debugging and optimization + + Raw REST + WebSocket for any language or edge runtime — auth, endpoints, MessagePack/JSON/multipart rules, and the streaming protocol. -## Setup - - - - - - Open your terminal in your project directory and run: - - ```bash - claude mcp add --transport http fish-audio --scope project https://docs.fish.audio/mcp - ``` - - This creates a `.mcp.json` file in your project root with the Fish Audio documentation server configuration. - - - Claude Code supports three installation scopes: - - - **`--scope project`** (recommended): Stores configuration in `.mcp.json` at project root. Version-controlled and shared with your team. - - **`--scope user`**: Available globally across all your projects, but private to your account. - - **`--scope local`** (default): Project-specific but private to you only. Good for experimentation. - - For team collaboration, use project scope and commit the `.mcp.json` file to git. - - - - - Check that the server is connected: +### Install options - ```bash - claude mcp list - ``` - - You should see `fish-audio` in the list of configured servers. - - - - Ask Claude Code: "What Fish Audio models are available?" or "How do I use Fish Audio's TTS API?" - - - - - - - - - Use `Cmd+Shift+P` (Mac) or `Ctrl+Shift+P` (Windows/Linux) to open the command palette, then search for "Open MCP settings". - - - - Select "Add custom MCP" to open the `mcp.json` configuration file. 
- - - - Add the Fish Audio documentation server: - - ```json - { - "mcpServers": { - "fish-audio": { - "url": "https://docs.fish.audio/mcp" - } - } - } - ``` - - - - Save the configuration file and reload Cursor to apply changes. - - - - In Cursor's chat, ask: "What tools do you have available?" You should see the Fish Audio MCP server listed. Then try: "What Fish Audio TTS models are available?" - - - - - Cursor's MCP support was added in early 2025. Ensure you're running the latest version for full functionality. - + +```bash All skills +npx skills add https://docs.fish.audio +``` - +```bash One skill +npx skills add https://docs.fish.audio --skill fish-audio-sdk +``` - - - - Go to `File > Preferences > Windsurf Settings`, then navigate to `Cascade > Model Context Protocol (MCP) Servers`. - - - - Click "Add custom server +" or "View raw config" to edit the configuration file at `~/.codeium/windsurf/mcp_config.json`. - - - - Add the Fish Audio documentation server: - - ```json - { - "mcpServers": { - "fish-audio": { - "url": "https://docs.fish.audio/mcp" - } - } - } - ``` - - - - Save the configuration and click the refresh button in Windsurf to apply changes. - - - - Open Cascade chat (Ctrl+L) and ask: "Search Fish Audio docs for TTS API usage" or "What emotion parameters does Fish Audio support?" - - - - - Windsurf's MCP support was introduced in Wave 3 (February 2025). Ensure you're running the latest version. - +```bash Target an agent +# claude-code, cursor, codex, ... +npx skills add https://docs.fish.audio -a claude-code +``` - - +```bash List / inspect first +npx skills add https://docs.fish.audio --list +``` + -## Using the Integration +Want to read them before installing? The skills are served at [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json), with each skill's markdown at `/.well-known/agent-skills//SKILL.md`. 
-### Example Queries +## Try it -Once connected, ask your coding agent questions naturally: +Once installed, ask your agent in plain language — it will use the correct client, methods, and error types: - - "How do I authenticate with Fish Audio API?" + + "Generate speech with Fish Audio in a cloned voice and save it to a file." - - "Show me Python code for text-to-speech" - + + "Transcribe `speech.wav` with Fish Audio and print the segments." + - - "What emotion parameters are available?" - + + "Stream an LLM's tokens to Fish Audio TTS over the WebSocket." + - - "Help me implement real-time streaming" + + "Call the Fish Audio TTS REST API from Go, no SDK." -### Code Generation Examples - - - - Ask: "Generate a Python function for text-to-speech with Fish Audio" - - ```python - from fish_audio import FishAudioClient - - def text_to_speech(text: str, voice_id: str, output_file: str): - """Convert text to speech using Fish Audio API""" - client = FishAudioClient(api_key="your-api-key") - - response = client.tts.create( - text=text, - model_id=voice_id, - format="mp3" - ) +## Or connect via MCP - with open(output_file, "wb") as f: - f.write(response.audio_data) +Prefer live documentation search inside your editor over a self-contained skill file? Connect the Fish Audio MCP server, which serves the latest docs to your agent. - return output_file + + + ```bash + claude mcp add --transport http fish-audio --scope project https://docs.fish.audio/mcp ``` - - - - Ask: "Create a voice cloning pipeline with error handling" - - ```python - from fish_audio import FishAudioClient - import logging - - def clone_voice(audio_path: str, name: str): - """Clone a voice from audio sample""" - client = FishAudioClient(api_key="your-api-key") + This writes a `.mcp.json` in your project root. Verify with `claude mcp list` (you should see `fish-audio`), then ask "What Fish Audio TTS models are available?" 
- try: - # Upload audio sample - with open(audio_path, "rb") as f: - model = client.models.create( - name=name, - audio_data=f.read(), - description="Custom cloned voice" - ) - - logging.info(f"Voice cloned: {model.id}") - return model.id + + - **`--scope project`** (recommended): config in `.mcp.json`, version-controlled and shared with your team. + - **`--scope user`**: global across your projects, private to your account. + - **`--scope local`** (default): project-specific and private to you. + + - except Exception as e: - logging.error(f"Cloning failed: {e}") - raise + + Open the command palette (`Cmd/Ctrl+Shift+P`) → "Open MCP settings" → "Add custom MCP", and add: + + ```json + { + "mcpServers": { + "fish-audio": { "url": "https://docs.fish.audio/mcp" } + } + } ``` + Save and reload Cursor, then ask "What Fish Audio TTS models are available?" - - Ask: "Implement real-time TTS streaming" - - ```python - from fish_audio import FishAudioClient - import asyncio - - async def stream_tts(text: str, voice_id: str): - """Stream TTS audio in real-time""" - client = FishAudioClient(api_key="your-api-key") - - async for chunk in client.tts.stream( - text=text, - model_id=voice_id, - chunk_size=1024 - ): - # Process audio chunk - yield chunk + + Go to `Settings → Cascade → MCP Servers → View raw config` (`~/.codeium/windsurf/mcp_config.json`) and add: + + ```json + { + "mcpServers": { + "fish-audio": { "url": "https://docs.fish.audio/mcp" } + } + } ``` + Save, refresh, then in Cascade ask "Search Fish Audio docs for TTS API usage." -## Available Documentation + + **Skill vs MCP:** the skill is a self-contained instruction file that works offline after install; MCP fetches the latest docs live. You can use both. This site also exposes [llms.txt](https://docs.fish.audio/llms.txt) and [llms-full.txt](https://docs.fish.audio/llms-full.txt) for agents that fetch docs directly. 
+ -Your coding agent can access: +## Next steps - - - Complete endpoint documentation with parameters + + + Create a key and make your first request. - - Python SDK usage and examples - - - - Optimization patterns and tips - - - - Available models and rate limits - - - - Custom voice creation guides - - - - Common issues and solutions + + Voices, formats, streaming, and the direct API. - -## Advanced Usage - -### Custom Commands - -Create agent workflows for common tasks: - - -```text Voice Pipeline -"Create a complete voice generation pipeline with: -- Authentication -- Voice selection -- Emotion control -- Error handling -- Audio export" -``` - -```text Batch Processing -"Build a batch TTS processor that: -- Reads from CSV -- Handles rate limits -- Retries on failure -- Tracks progress" -``` - -```text WebSocket Client -"Implement a WebSocket client for: -- Real-time streaming -- Auto-reconnection -- Buffer management -- Error recovery" -``` - - - -### Context-Aware Features - -With MCP integration, your agent can: - -- Suggest appropriate models based on use case -- Handle rate limiting automatically -- Provide inline documentation -- Validate API calls against specifications -- Recommend optimization strategies - -## Troubleshooting - - - - If the MCP server isn't connecting: - - 1. Verify internet connectivity - 2. Check `https://docs.fish.audio/mcp` is accessible - 3. Ensure your agent supports MCP protocol - 4. Restart your coding environment - 5. Clear any cached configurations - - - - - The MCP server always serves the latest documentation: - - 1. Refresh the MCP connection in settings - 2. Clear documentation cache if available - 3. Report persistent issues to support@fish.audio - - - - - If certain features aren't available: - - 1. Verify you're using the latest agent version - 2. Check MCP protocol compatibility - 3. Ensure proper server configuration - 4. 
Contact support for assistance - - - - -## Security - - - **Your data is safe:** - MCP provides read-only access to public documentation - - No API keys are transmitted through MCP - All connections use HTTPS - encryption - No user queries or usage data is stored - - -## Next Steps - - - - Start with Fish Audio API basics + + Endpoints, parameters, and the OpenAPI schema. - - Install and configure the Python SDK - - - - Learn text-to-speech optimization - - - - Create custom voice models + + Status codes, retries, and SDK exception handling. ## Support -Need help with MCP integration? - -- **Technical Support**: [support@fish.audio](mailto:support@fish.audio) -- **Documentation Issues**: [GitHub](https://github.com/fishaudio) +- **Technical support**: [support@fish.audio](mailto:support@fish.audio) +- **Issues**: [GitHub](https://github.com/fishaudio) - **Community**: [Discord](https://discord.gg/dF9Db2Tt3Y) diff --git a/developer-guide/resources/migration.mdx b/developer-guide/resources/migration.mdx deleted file mode 100644 index 83788a0..0000000 --- a/developer-guide/resources/migration.mdx +++ /dev/null @@ -1,25 +0,0 @@ ---- -title: "Migration Guide" -description: "Switch from ElevenLabs, OpenAI, or other TTS providers to Fish Audio" -icon: "arrow-right-arrow-left" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: c215f39e8e8f5d56a6dd2f4f770673e67e8392e0cdda36a70906fbbad83c9957 */} - - - - - - -Coming soon! We're preparing comprehensive migration guides to help you seamlessly switch to Fish Audio. - - -We're working on detailed migration guides for: -- ElevenLabs -- OpenAI TTS -- Google Cloud Text-to-Speech -- Amazon Polly -- Other TTS providers - -Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. 
\ No newline at end of file diff --git a/developer-guide/resources/roadmap.mdx b/developer-guide/resources/roadmap.mdx deleted file mode 100644 index b64f2a6..0000000 --- a/developer-guide/resources/roadmap.mdx +++ /dev/null @@ -1,36 +0,0 @@ ---- -title: "Roadmap" -description: "Upcoming features and improvements for Fish Audio" -icon: "map" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 5e6a0d1e04bf7ca95de425bdc59aaa4966137474d47fa7e89ed49a51d00ac2f7 */} - - - - - -## Roadmap - -Explore what's coming next for Fish Audio. Our roadmap reflects our current priorities and vision for the platform. - - - This roadmap is subject to change based on user feedback and technical considerations. Features may be added, modified, or removed as we continue to develop the platform. - - -### Coming Soon - -Details about our upcoming features and improvements will be published here. - -## Feature Requests - -Have a feature request or want to vote on priorities? We'd love to hear from you: - -- **Email**: [support@fish.audio](mailto:support@fish.audio) -- **Discord**: Join our [community Discord](https://discord.gg/dF9Db2Tt3Y) -- **GitHub**: Open an issue on our [GitHub repository](https://github.com/fishaudio) - -## Stay Updated - -Subscribe to our [changelog](/developer-guide/getting-started/changelog) RSS feed to get notified when new features are released. 
\ No newline at end of file diff --git a/developer-guide/sdk-guide/cookbook/batch-transcribe-with-language-hint.mdx b/developer-guide/sdk-guide/cookbook/batch-transcribe-with-language-hint.mdx new file mode 100644 index 0000000..78d9ab6 --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/batch-transcribe-with-language-hint.mdx @@ -0,0 +1,107 @@ +--- +title: "Batch-transcribe files with a language hint" +description: "Loop over local audio files and transcribe each with an explicit language, collecting text and duration per file" +icon: "layer-group" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +Read each file's bytes from disk and pass them to [`asr.transcribe()`](/api-reference/sdk/python/resources#transcribe) with an explicit `language`. A language hint is more reliable than auto-detection when you already know the source language, especially for phonetically similar languages. Collect one result row per file as you go. + + +```python Synchronous +from fishaudio import FishAudio + +client = FishAudio() + +paths = ["speech.wav"] # add more file paths here +language = "en" + +results = [] +for path in paths: + with open(path, "rb") as f: + audio = f.read() + + transcript = client.asr.transcribe(audio=audio, language=language) + results.append({ + "file": path, + "text": transcript.text, + "duration": transcript.duration, # seconds + }) + +for row in results: + print(f"{row['file']} ({row['duration']:.1f}s): {row['text']}") +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio + +async def main(): + async with AsyncFishAudio() as client: + paths = ["speech.wav"] # add more file paths here + language = "en" + + results = [] + for path in paths: + with open(path, "rb") as f: + audio = f.read() + + transcript = await client.asr.transcribe(audio=audio, language=language) + results.append({ + "file": path, + "text": transcript.text, + "duration": transcript.duration, # seconds + 
}) + + for row in results: + print(f"{row['file']} ({row['duration']:.1f}s): {row['text']}") + +asyncio.run(main()) +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { readFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +const paths = ["speech.wav"]; // add more file paths here +const language = "en"; + +const results = []; +for (const path of paths) { + const audio = new File([await readFile(path)], path); + + const transcript = await client.speechToText.convert({ audio, language }); + results.push({ + file: path, + text: transcript.text, + duration: transcript.duration, // seconds + }); +} + +for (const row of results) { + console.log(`${row.file} (${row.duration.toFixed(1)}s): ${row.text}`); +} +``` + + +Each call returns an [`ASRResponse`](/api-reference/sdk/python/types#asrresponse-objects) with `.text`, a `.duration` in seconds, and per-phrase `.segments`. The loop keeps files independent, so one bad file does not block the rest of the batch. + + + Auto-detection (omit `language`) works well, but passing an explicit + `language` improves accuracy for similar-sounding languages. Use one + `language` per batch — split mixed-language files into separate lists. 
+ + +## Related + +- [Speech-to-Text guide](/features/speech-to-text) +- [Instant voice cloning](/developer-guide/sdk-guide/cookbook/instant-voice-cloning) diff --git a/developer-guide/sdk-guide/cookbook/clone-and-wait-until-ready.mdx b/developer-guide/sdk-guide/cookbook/clone-and-wait-until-ready.mdx new file mode 100644 index 0000000..756fcea --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/clone-and-wait-until-ready.mdx @@ -0,0 +1,135 @@ +--- +title: "Clone a voice and wait until it is ready" +description: "Create a persistent voice from a reference clip, poll until training finishes, then synthesize with it" +icon: "clock" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +A persistent voice is trained asynchronously: `voices.create()` returns immediately with a voice whose `state` is `created` or `training`. Before you can synthesize with it, you need to wait until its `state` becomes `trained`. + +This recipe creates a voice from `reference.wav`, polls [`voices.get()`](/api-reference/sdk/python/resources#get) until training finishes (with a timeout), then synthesizes with `reference_id`. + +## Prerequisites + + + +## Recipe + +Poll `voices.get(voice.id).state` on an interval, stopping when it reaches `trained` (or raising if it `failed` or the timeout elapses). Then pass the voice id as `reference_id` on `convert()`. + + +```python Synchronous +import time + +from fishaudio import FishAudio +from fishaudio.utils import save + +client = FishAudio() + +# 1. Create a persistent voice from a reference clip. +with open("reference.wav", "rb") as f: + voice = client.voices.create(title="My Voice", voices=[f.read()]) + +# 2. Poll until the voice finishes training. 
+deadline = time.time() + 300 # 5-minute timeout +while voice.state != "trained": + if voice.state == "failed": + raise RuntimeError(f"Voice {voice.id} failed to train") + if time.time() > deadline: + raise TimeoutError(f"Voice {voice.id} not ready (state={voice.state})") + time.sleep(5) + voice = client.voices.get(voice.id) + +# 3. Synthesize with the trained voice. +audio = client.tts.convert( + text="My voice is ready to use.", + reference_id=voice.id, +) +save(audio, "out.mp3") +``` + +```python Asynchronous +import asyncio + +from fishaudio import AsyncFishAudio +from fishaudio.utils import save + + +async def main(): + async with AsyncFishAudio() as client: + # 1. Create a persistent voice from a reference clip. + with open("reference.wav", "rb") as f: + voice = await client.voices.create(title="My Voice", voices=[f.read()]) + + # 2. Poll until the voice finishes training. + deadline = asyncio.get_event_loop().time() + 300 # 5-minute timeout + while voice.state != "trained": + if voice.state == "failed": + raise RuntimeError(f"Voice {voice.id} failed to train") + if asyncio.get_event_loop().time() > deadline: + raise TimeoutError(f"Voice {voice.id} not ready (state={voice.state})") + await asyncio.sleep(5) + voice = await client.voices.get(voice.id) + + # 3. Synthesize with the trained voice. + audio = await client.tts.convert( + text="My voice is ready to use.", + reference_id=voice.id, + ) + save(audio, "out.mp3") + + +asyncio.run(main()) +``` + +```javascript JavaScript +import { readFile, writeFile } from "fs/promises"; + +import { FishAudioClient } from "fish-audio"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// 1. Create a persistent voice from a reference clip. +const sample = new File([await readFile("reference.wav")], "reference.wav"); +let voice = await client.voices.ivc.create({ + title: "My Voice", + visibility: "private", + voices: [sample], +}); + +// 2. Poll until the voice finishes training. 
+const deadline = Date.now() + 300_000; // 5-minute timeout +while (voice.state !== "trained") { + if (voice.state === "failed") { + throw new Error(`Voice ${voice._id} failed to train`); + } + if (Date.now() > deadline) { + throw new Error(`Voice ${voice._id} not ready (state=${voice.state})`); + } + await new Promise((resolve) => setTimeout(resolve, 5000)); + voice = await client.voices.get(voice._id); +} + +// 3. Synthesize with the trained voice. +const stream = await client.textToSpeech.convert( + { text: "My voice is ready to use.", reference_id: voice._id }, + "s2-pro", +); +const chunks = []; +for await (const chunk of stream) chunks.push(Buffer.from(chunk)); +await writeFile("out.mp3", Buffer.concat(chunks)); +``` + + +A voice moves through `created` → `training` → `trained`, or ends in `failed`. Always handle `failed` and the timeout so a stuck voice cannot loop forever. + + + Training a persistent voice takes time, so only create one when you will reuse the voice across many requests. For one-off synthesis, skip the wait entirely and pass a `ReferenceAudio` inline — see [Instant voice cloning](/developer-guide/sdk-guide/cookbook/instant-voice-cloning). 
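The polling logic in the recipe can also be factored into a reusable helper. The following is a minimal sketch under stated assumptions: `wait_for_state` and all of its parameters are hypothetical names (nothing here is an SDK function), and `get_state` is any zero-argument callable that returns the current state string.

```python
import time


def wait_for_state(get_state, target="trained", failed="failed",
                   timeout=300.0, interval=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns `target`.

    Hypothetical helper, not part of the Fish Audio SDK.
    Raises RuntimeError if the state becomes `failed`, and TimeoutError
    if `timeout` seconds pass without reaching `target`.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state == target:
            return state
        if state == failed:
            raise RuntimeError("voice failed to train")
        if clock() > deadline:
            raise TimeoutError(f"voice not ready (state={state})")
        sleep(interval)
```

With the synchronous client from the recipe, this could be called as `wait_for_state(lambda: client.voices.get(voice.id).state)`. The sketch uses `time.monotonic` for the deadline because a monotonic clock is not affected by system clock adjustments, and it takes `sleep` and `clock` as parameters so the loop can be exercised without real waiting.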
+ + +## Related + +- [Instant voice cloning](/developer-guide/sdk-guide/cookbook/instant-voice-cloning) +- [Voice Cloning guide](/features/voice-cloning) +- [Voices API reference](/api-reference/sdk/python/resources#voices) diff --git a/developer-guide/sdk-guide/cookbook/discover-library-voice.mdx b/developer-guide/sdk-guide/cookbook/discover-library-voice.mdx new file mode 100644 index 0000000..4982003 --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/discover-library-voice.mdx @@ -0,0 +1,96 @@ +--- +title: "Discover and reuse a Voice Library voice" +description: "Search the public Voice Library by title, pick a result, and synthesize speech with it as your reference_id" +icon: "magnifying-glass" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +Set `self_only=False` on [`voices.list()`](/api-reference/sdk/python/resources#list) to search the public [Voice Library](/features/manage-voices) instead of only your own models. The response carries `total` (matches across all pages) and `items` (this page). Pick a result's `id` and pass it straight to [`tts.convert()`](/api-reference/sdk/python/resources#convert) as `reference_id` — no cloning, no model to manage. 
+ + +```python Python +from fishaudio import FishAudio +from fishaudio.utils import save + +client = FishAudio() # reads FISH_API_KEY + +# Search the public library by title (not just your own voices) +page = client.voices.list(title="narration", self_only=False, page_size=10) +print(f"{page.total} public voices match") + +# Pick the first result; fall back to a known id if the search is empty +reference_id = "" +for voice in page.items: + print(voice.id, voice.title, voice.languages) + reference_id = reference_id if reference_id != "" else voice.id + +# Synthesize with the discovered voice as the reference +audio = client.tts.convert( + text="Speaking with a voice I found in the public library.", + reference_id=reference_id, +) +save(audio, "out.mp3") +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { writeFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// Search the public Voice Library by title (not just your own voices) +const page = await client.voices.search({ title: "narration", page_size: 10 }); +console.log(`${page.total} public voices match`); + +// Pick the first result; fall back to a known id if the search is empty +let referenceId = ""; +for (const voice of page.items) { + console.log(voice._id, voice.title, voice.languages); + referenceId = referenceId !== "" ? referenceId : voice._id; +} + +// Synthesize with the discovered voice as the reference +const stream = await client.textToSpeech.convert( + { + text: "Speaking with a voice I found in the public library.", + reference_id: referenceId, + format: "mp3", + }, + "s2-pro" +); + +const chunks = []; +for await (const chunk of stream) chunks.push(Buffer.from(chunk)); +await writeFile("out.mp3", Buffer.concat(chunks)); +``` + + +`page.total` is the full match count, so `total > len(page.items)` tells you there are more pages — bump `page_number` to walk them. 
Any public voice `id` is a ready-to-use `reference_id`; nothing is saved to your account. + +You can hit the same endpoint directly: + +```bash +curl "https://api.fish.audio/model?title=narration&page_size=10" \ + --header "Authorization: Bearer $FISH_API_KEY" + +# Response: { "total": 128, "items": [ { "_id": "...", "title": "...", ... } ] } +``` + + + Title search is fuzzy and ranked by usage, so the top result is usually the + most popular match. Add `language=["en"]` to narrow by spoken language, or + raise `page_size` and page with `page_number` to scan deeper. + + +## Related + +- [Manage Voices](/features/manage-voices) +- [Instant voice cloning](/developer-guide/sdk-guide/cookbook/instant-voice-cloning) +- [Python reference: voices](/api-reference/sdk/python/resources) diff --git a/developer-guide/sdk-guide/cookbook/instant-voice-cloning.mdx b/developer-guide/sdk-guide/cookbook/instant-voice-cloning.mdx new file mode 100644 index 0000000..ba681b1 --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/instant-voice-cloning.mdx @@ -0,0 +1,109 @@ +--- +title: "Instant voice cloning" +description: "Clone a voice on the fly from a short reference clip, with no model to manage" +icon: "clone" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +Pass a [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects) (raw audio bytes + an exact transcript) on the `convert` call. Nothing is saved server-side — the clone applies to that request only. 
+ + +```python Synchronous +from fishaudio import FishAudio +from fishaudio.types import ReferenceAudio +from fishaudio.utils import save + +client = FishAudio() + +with open("reference.wav", "rb") as f: + audio = client.tts.convert( + text="This sentence is spoken in the cloned voice.", + references=[ReferenceAudio( + audio=f.read(), + text="Exact transcript of what is said in reference.wav.", + )], + ) + +save(audio, "cloned.mp3") +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio +from fishaudio.types import ReferenceAudio +from fishaudio.utils import save + +async def main(): + async with AsyncFishAudio() as client: + with open("reference.wav", "rb") as f: + audio = await client.tts.convert( + text="This sentence is spoken in the cloned voice.", + references=[ReferenceAudio( + audio=f.read(), + text="Exact transcript of what is said in reference.wav.", + )], + ) + save(audio, "cloned.mp3") + +asyncio.run(main()) +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { readFile, writeFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +const reference = new File( + [await readFile("reference.wav")], + "reference.wav", +); + +const stream = await client.textToSpeech.convert( + { + text: "This sentence is spoken in the cloned voice.", + references: [ + { + audio: reference, + text: "Exact transcript of what is said in reference.wav.", + }, + ], + format: "mp3", + }, + "s2-pro", +); + +const chunks = []; +for await (const chunk of stream) chunks.push(Buffer.from(chunk)); + +await writeFile("cloned.mp3", Buffer.concat(chunks)); +``` + + + + Use 10–30 s of clean speech, and make `text` match the audio exactly + (including punctuation) for the best prosody. 
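The length guideline above can be checked locally before uploading a clip. This is a minimal sketch using only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; `wav_duration_seconds` and `check_reference_length` are hypothetical helpers, and the 10 to 30 s bounds are simply the guideline from the tip, not a limit enforced by the API.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_length(path, min_seconds=10.0, max_seconds=30.0):
    """Return (ok, duration) for the suggested reference-clip length.

    Hypothetical helper, not part of the Fish Audio SDK. The default
    bounds mirror the 10 to 30 s guideline and can be overridden.
    """
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration
```

For example, `ok, duration = check_reference_length("reference.wav")` lets a script warn about a clip that is too short or too long before it is sent as a reference.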
+ + +## Reuse a voice across many requests + +If you'll use the voice repeatedly, create a persistent model once and pass its id as `reference_id` — see the [Voice Cloning guide](/features/voice-cloning). + +```python +with open("sample.wav", "rb") as f: + voice = client.voices.create(title="My Voice", voices=[f.read()]) + +audio = client.tts.convert(text="Reusing my saved voice.", reference_id=voice.id) +``` + +## Related + +- [Voice Cloning guide](/features/voice-cloning) +- [Stream TTS to a file](/developer-guide/sdk-guide/cookbook/streaming-to-file) diff --git a/developer-guide/sdk-guide/cookbook/oneshot-vs-persistent-cloning.mdx b/developer-guide/sdk-guide/cookbook/oneshot-vs-persistent-cloning.mdx new file mode 100644 index 0000000..2b9b386 --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/oneshot-vs-persistent-cloning.mdx @@ -0,0 +1,139 @@ +--- +title: "One-shot vs persistent cloning: pick the right approach" +description: "Choose between instant per-request cloning and a saved, reusable voice model" +icon: "clone" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +There are two ways to clone a voice. Pick by how often you'll reuse it: + +- **One-shot (instant)** — pass a [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects) (raw bytes + exact transcript) on each `convert` call. Nothing is stored server-side; the clone lives only for that request. +- **Persistent** — call `voices.create` once to train a model, then reuse its id as `reference_id` on every request. No reference upload per call, and the same voice is shared across processes. + +Start with one-shot. 
Below, a single reference clip is cloned inline with no model to manage: + + +```python Synchronous +from fishaudio import FishAudio +from fishaudio.types import ReferenceAudio +from fishaudio.utils import save + +client = FishAudio() + +with open("reference.wav", "rb") as f: + audio = client.tts.convert( + text="This line is spoken in the cloned voice, no model required.", + references=[ReferenceAudio( + audio=f.read(), + text="Exact transcript of what is said in reference.wav.", + )], + ) + +save(audio, "oneshot.mp3") +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio +from fishaudio.types import ReferenceAudio +from fishaudio.utils import save + +async def main(): + async with AsyncFishAudio() as client: + with open("reference.wav", "rb") as f: + audio = await client.tts.convert( + text="This line is spoken in the cloned voice, no model required.", + references=[ReferenceAudio( + audio=f.read(), + text="Exact transcript of what is said in reference.wav.", + )], + ) + save(audio, "oneshot.mp3") + +asyncio.run(main()) +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { readFile, writeFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// One-shot: clone inline by sending the reference bytes + exact transcript. +// Nothing is stored server-side; the clone lives only for this request. 
+const reference = await readFile("reference.wav"); + +const stream = await client.textToSpeech.convert( + { + text: "This line is spoken in the cloned voice, no model required.", + references: [ + { + audio: new File([reference], "reference.wav"), + text: "Exact transcript of what is said in reference.wav.", + }, + ], + format: "mp3", + }, + "s2-pro" +); + +const chunks = []; +for await (const chunk of stream) chunks.push(Buffer.from(chunk)); +await writeFile("oneshot.mp3", Buffer.concat(chunks)); +``` + + + + One-shot re-sends the reference bytes on every request, so it's ideal for + one-off or rarely-repeated voices. Once a voice is used more than a handful of + times, switch to a persistent model to skip the per-call upload. + + +## Train a persistent voice once, reuse forever + +Call [`voices.create`](/api-reference/sdk/python/resources#create) to train a model, then pass `voice.id` as `reference_id`. The same id works from any process and across SDK and REST. + +```python +with open("reference.wav", "rb") as f: + voice = client.voices.create(title="My Narrator", voices=[f.read()]) + +# reuse the same id on every later request — no reference upload +audio = client.tts.convert( + text="Reusing my saved voice across many requests.", + reference_id=voice.id, +) +save(audio, "persistent.mp3") +``` + +Already have a trained voice id? 
Skip training and pass it directly: + +```python +audio = client.tts.convert(text="Hello again.", reference_id="") +``` + +## Which to choose + +| | One-shot | Persistent | +| --- | --- | --- | +| Setup | None | One `voices.create` call | +| Per request | Re-uploads reference bytes | Sends only `reference_id` | +| Stored server-side | No | Yes (manage with `voices.update` / `voices.delete`) | +| Best for | One-off or experimental clones | Voices reused many times or across services | + + + For either path, give the reference 10–30 s of clean speech and make the + transcript match the audio exactly (including punctuation) for the best prosody. + + +## Related + +- [Instant voice cloning](/developer-guide/sdk-guide/cookbook/instant-voice-cloning) +- [Voice Cloning guide](/features/voice-cloning) +- [Manage voices](/features/manage-voices) diff --git a/developer-guide/sdk-guide/cookbook/realtime-llm-to-speech.mdx b/developer-guide/sdk-guide/cookbook/realtime-llm-to-speech.mdx new file mode 100644 index 0000000..a877203 --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/realtime-llm-to-speech.mdx @@ -0,0 +1,75 @@ +--- +title: "Realtime: LLM tokens → speech" +description: "Pipe a streaming LLM response straight into speech over a WebSocket as tokens arrive" +icon: "bolt" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +[`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) takes an iterable of text chunks and yields audio chunks in real time. Feed it your LLM's token stream and play or forward the audio as it's produced. 
+ + +```python Synchronous +from fishaudio import FishAudio +from fishaudio.utils import play + +client = FishAudio() + +def llm_tokens(): + # Replace with your real streaming LLM call + for token in ["The ", "first ", "move ", "sets ", "everything ", "in ", "motion."]: + yield token + +audio_stream = client.tts.stream_websocket(llm_tokens(), reference_id="") +play(audio_stream) # or: for chunk in audio_stream: send_to_client(chunk) +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio + +async def llm_tokens(): + for token in ["The ", "first ", "move ", "sets ", "everything ", "in ", "motion."]: + yield token + +async def main(): + async with AsyncFishAudio() as client: + audio_stream = client.tts.stream_websocket(llm_tokens()) + with open("out.mp3", "wb") as f: + async for chunk in audio_stream: + f.write(chunk) + +asyncio.run(main()) +``` + + +## Force generation at a boundary + +By default the engine buffers text until it has enough for natural prosody. Yield a [`FlushEvent`](/api-reference/sdk/python/types#flushevent-objects) to force synthesis of what's buffered — useful for turn-taking in a conversation: + +```python +from fishaudio.types import TextEvent, FlushEvent + +def turns(): + yield TextEvent(text="Are you ready?") + yield FlushEvent() # speak the question now + yield TextEvent(text="Let's begin.") +``` + +The SDK sends the start/stop frames for you — you only supply text and optional flushes. + + + Errors mid-stream surface as `WebSocketError`. Reconnect with a fresh call + rather than retrying on the same socket. 
+ + +## Related + +- [Realtime WebSocket guide](/features/realtime-streaming) +- [Errors & Retries](/developer-guide/sdk-guide/python/errors) diff --git a/developer-guide/sdk-guide/cookbook/streaming-to-file.mdx b/developer-guide/sdk-guide/cookbook/streaming-to-file.mdx new file mode 100644 index 0000000..8b24e71 --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/streaming-to-file.mdx @@ -0,0 +1,80 @@ +--- +title: "Stream TTS to a file" +description: "Generate long audio and write it to disk chunk-by-chunk, without buffering it all in memory" +icon: "file-audio" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +For long text, use [`stream()`](/api-reference/sdk/python/resources#stream) and write each chunk as it arrives instead of holding the whole file in memory. + + +```python Synchronous +from fishaudio import FishAudio + +client = FishAudio() + +with open("output.mp3", "wb") as f: + for chunk in client.tts.stream(text="A very long passage of text..."): + f.write(chunk) +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio + +async def main(): + async with AsyncFishAudio() as client: + audio_stream = await client.tts.stream(text="A very long passage of text...") + with open("output.mp3", "wb") as f: + async for chunk in audio_stream: + f.write(chunk) + +asyncio.run(main()) +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { createWriteStream } from "fs"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// convert() returns a ReadableStream. Iterate it and write each +// chunk as it arrives, so you never hold the whole file in memory. 
+const stream = await client.textToSpeech.convert( + { text: "A very long passage of text...", format: "mp3" }, + "s2-pro" +); + +const file = createWriteStream("output.mp3"); +for await (const chunk of stream) { + file.write(Buffer.from(chunk)); +} +file.end(); +``` + + +## Collect instead of iterate + +If you just want the full bytes, call `.collect()`: + +```python +audio = client.tts.stream(text="Hello!").collect() # -> bytes +``` + + + `convert()` already returns the complete audio as `bytes` — reach for + `stream()` when you want to start writing/forwarding bytes before generation + finishes, or to avoid buffering large files. + + +## Related + +- [Text-to-Speech guide](/features/text-to-speech) +- [Realtime: LLM tokens → speech](/developer-guide/sdk-guide/cookbook/realtime-llm-to-speech) diff --git a/developer-guide/sdk-guide/cookbook/telephony-8khz-audio.mdx b/developer-guide/sdk-guide/cookbook/telephony-8khz-audio.mdx new file mode 100644 index 0000000..3d80f63 --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/telephony-8khz-audio.mdx @@ -0,0 +1,101 @@ +--- +title: "Telephony-grade audio (8 kHz) for IVR and phone" +description: "Generate 8 kHz mono WAV/PCM that matches the narrowband sample rate phone networks expect for IVR and call-center playback" +icon: "phone" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +Phone networks carry narrowband audio at 8 kHz. Generating at a higher rate just forces the carrier to downsample on the way through — wasting bandwidth and often softening the result. Synthesize at 8 kHz directly and the bytes are ready to hand to your IVR or SIP stack. + +Set the sample rate on [`TTSConfig`](/api-reference/sdk/python/types#ttsconfig-objects) (it is not a top-level argument) and write the WAV to disk. 
+ + +```python Synchronous +from fishaudio import FishAudio +from fishaudio.types import TTSConfig +from fishaudio.utils import save + +client = FishAudio() + +audio = client.tts.convert( + text="Thank you for calling. Press one to speak with an agent.", + config=TTSConfig(format="wav", sample_rate=8000), +) + +save(audio, "out.wav") +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio +from fishaudio.types import TTSConfig +from fishaudio.utils import save + +async def main(): + async with AsyncFishAudio() as client: + audio = await client.tts.convert( + text="Thank you for calling. Press one to speak with an agent.", + config=TTSConfig(format="wav", sample_rate=8000), + ) + save(audio, "out.wav") + +asyncio.run(main()) +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { writeFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +const stream = await client.textToSpeech.convert( + { + text: "Thank you for calling. Press one to speak with an agent.", + format: "wav", + sample_rate: 8000, + }, + "s2-pro" +); + +const chunks = []; +for await (const chunk of stream) chunks.push(Buffer.from(chunk)); + +await writeFile("out.wav", Buffer.concat(chunks)); +``` + + +The output is a mono 8 kHz WAV, matching the 8 kHz sample rate that G.711 telephony uses. For a headerless stream to feed straight into a SIP or RTP pipeline, switch to raw PCM with `format="pcm"`; the sample rate stays on `TTSConfig`. + +```python +audio = client.tts.convert( + text="Thank you for calling. Press one to speak with an agent.", + config=TTSConfig(format="pcm", sample_rate=8000), +) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ "text": "Thank you for calling. 
Press one to speak with an agent.", "format": "wav", "sample_rate": 8000 }' \ + --output out.wav +``` + + + 8 kHz discards everything above ~4 kHz, so plosives and sibilance lose detail. + Keep prompts short and articulate, and reserve higher sample rates (16/24 kHz) + for VoIP or recordings that never touch the legacy phone network. + + +## Related + +- [Text-to-Speech guide](/features/text-to-speech) +- [Stream TTS to a file](/developer-guide/sdk-guide/cookbook/streaming-to-file) diff --git a/developer-guide/sdk-guide/cookbook/transcribe-to-captions.mdx b/developer-guide/sdk-guide/cookbook/transcribe-to-captions.mdx new file mode 100644 index 0000000..a7fc18a --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/transcribe-to-captions.mdx @@ -0,0 +1,102 @@ +--- +title: "Transcribe audio to SRT/VTT captions" +description: "Transcribe audio with timestamps and write valid SRT and WebVTT caption files from the segments" +icon: "closed-captioning" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +Call [`asr.transcribe()`](/api-reference/sdk/python/resources#transcribe) with `include_timestamps=True`, then turn each [`ASRSegment`](/api-reference/sdk/python/types#asrsegment-objects) into a numbered cue. Segment `start` / `end` are in **seconds**, so the only real work is formatting them — SRT wants `HH:MM:SS,mmm` (comma), WebVTT wants `HH:MM:SS.mmm` (dot). 
+ + +```python Python +from fishaudio import FishAudio + +client = FishAudio() + + +def to_srt_timestamp(seconds: float) -> str: + """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm.""" + millis = round(seconds * 1000) + hours, millis = divmod(millis, 3_600_000) + minutes, millis = divmod(millis, 60_000) + secs, millis = divmod(millis, 1000) + return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}" + + +with open("speech.wav", "rb") as f: + result = client.asr.transcribe(audio=f.read(), include_timestamps=True) + +# SRT: 1-based index, comma decimal separator, blank line between cues. +with open("captions.srt", "w", encoding="utf-8") as srt: + for i, segment in enumerate(result.segments, start=1): + start = to_srt_timestamp(segment.start) + end = to_srt_timestamp(segment.end) + srt.write(f"{i}\n{start} --> {end}\n{segment.text.strip()}\n\n") + +# WebVTT: same cues, "WEBVTT" header, dot decimal separator. +with open("captions.vtt", "w", encoding="utf-8") as vtt: + vtt.write("WEBVTT\n\n") + for segment in result.segments: + start = to_srt_timestamp(segment.start).replace(",", ".") + end = to_srt_timestamp(segment.end).replace(",", ".") + vtt.write(f"{start} --> {end}\n{segment.text.strip()}\n\n") + +print(f"Wrote {len(result.segments)} cues to captions.srt and captions.vtt") +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { readFile, writeFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm. 
+function toSrtTimestamp(seconds) { + let millis = Math.round(seconds * 1000); + const hours = Math.floor(millis / 3_600_000); + millis -= hours * 3_600_000; + const minutes = Math.floor(millis / 60_000); + millis -= minutes * 60_000; + const secs = Math.floor(millis / 1000); + millis -= secs * 1000; + const pad = (n, width) => String(n).padStart(width, "0"); + return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(secs, 2)},${pad(millis, 3)}`; +} + +const result = await client.speechToText.convert({ + audio: new File([await readFile("speech.wav")], "speech.wav"), + language: "en", + ignore_timestamps: false, +}); + +// SRT: 1-based index, comma decimal separator, blank line between cues. +const cues = result.segments.map((segment, i) => { + const start = toSrtTimestamp(segment.start); + const end = toSrtTimestamp(segment.end); + return `${i + 1}\n${start} --> ${end}\n${segment.text.trim()}\n`; +}); +await writeFile("captions.srt", cues.join("\n"), "utf-8"); + +console.log(`Wrote ${result.segments.length} cues to captions.srt`); +``` + + +Both files share one timestamp helper — WebVTT is just the SRT formatting with `,` swapped for `.`, so there is no second formatter to keep in sync. + + + Pass `language=` (for example `"en"` or `"zh"`) when you know it — explicit + language selection sharpens segment boundaries, which keeps your cue timing + tight. 
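The timestamp rule is easy to sanity-check in isolation. The sketch below is a hypothetical variant of the recipe's helper (`format_timestamp` and its `separator` parameter are names introduced here for illustration) that shows the SRT and WebVTT forms side by side, plus one rounding edge case.

```python
def format_timestamp(seconds, separator=","):
    """Format seconds as HH:MM:SS<sep>mmm ("," for SRT, "." for WebVTT)."""
    # Round to whole milliseconds first, so a carry such as 59.9996 s
    # propagates into the seconds and minutes fields.
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


print(format_timestamp(3661.5))       # 01:01:01,500 (SRT)
print(format_timestamp(3661.5, "."))  # 01:01:01.500 (WebVTT)
print(format_timestamp(59.9996))      # 00:01:00,000 (carry into minutes)
```

Rounding to integer milliseconds before splitting into fields is the design choice worth keeping: formatting the fractional part separately could produce an invalid millisecond field when a value rounds up to the next second.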
+ + +## Related + +- [Speech-to-Text guide](/features/speech-to-text) +- [ASR Types Reference](/api-reference/sdk/python/types#asr) diff --git a/developer-guide/sdk-guide/cookbook/voice-agent-loop.mdx b/developer-guide/sdk-guide/cookbook/voice-agent-loop.mdx new file mode 100644 index 0000000..f87fdcf --- /dev/null +++ b/developer-guide/sdk-guide/cookbook/voice-agent-loop.mdx @@ -0,0 +1,123 @@ +--- +title: "Build a voice agent loop: speech in, reply, speech out" +description: "Transcribe an utterance, generate a reply with your own LLM, and stream that reply back out as speech" +icon: "comments" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Recipe + +A voice agent is three stages chained together: [`asr.transcribe()`](/api-reference/sdk/python/resources#transcribe) turns the caller's audio into text, your own LLM turns that text into a reply, and [`tts.stream()`](/api-reference/sdk/python/resources#stream) turns the reply back into speech. The transcript and the reply are just strings, so the only Fish Audio-specific parts are the first and last calls. Streaming the reply lets you start writing (or forwarding) audio before the whole sentence is synthesized. + + +```python Synchronous +from fishaudio import FishAudio +from fishaudio.utils import save + +client = FishAudio() + +def reply_from_llm(text: str) -> str: + # ---- PLACEHOLDER ---- + # Call your own LLM here and return its reply as a string. + # e.g. return openai_client.chat.completions.create(...).choices[0].message.content + return f"You said: {text}. How can I help?" 
+ +def voice_agent_turn(audio_path: str, out_path: str) -> str: + with open(audio_path, "rb") as f: + heard = client.asr.transcribe(audio=f.read()) + + reply = reply_from_llm(heard.text) + + audio_stream = client.tts.stream(text=reply, reference_id="") + save(audio_stream, out_path) # writes chunks as they arrive + return reply + +reply = voice_agent_turn("speech.wav", "reply.mp3") +print("Agent:", reply) +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio +from fishaudio.utils import save + +def reply_from_llm(text: str) -> str: + # ---- PLACEHOLDER ---- + # Call your own LLM here and return its reply as a string. + return f"You said: {text}. How can I help?" + +async def main(): + async with AsyncFishAudio() as client: + with open("speech.wav", "rb") as f: + heard = await client.asr.transcribe(audio=f.read()) + + reply = reply_from_llm(heard.text) + + audio_stream = await client.tts.stream(text=reply, reference_id="") + with open("reply.mp3", "wb") as out: + async for chunk in audio_stream: + out.write(chunk) + print("Agent:", reply) + +asyncio.run(main()) +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { readFile, writeFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +function replyFromLlm(text) { + // ---- PLACEHOLDER ---- + // Call your own LLM here and return its reply as a string. + // e.g. return openaiClient.chat.completions.create(...).choices[0].message.content + return `You said: ${text}. 
How can I help?`; +} + +async function voiceAgentTurn(audioPath, outPath) { + const heard = await client.speechToText.convert({ + audio: new File([await readFile(audioPath)], audioPath), + language: "en", + }); + + const reply = replyFromLlm(heard.text); + + const stream = await client.textToSpeech.convert( + { text: reply, reference_id: "", format: "mp3" }, + "s2-pro" + ); + const chunks = []; + for await (const chunk of stream) chunks.push(Buffer.from(chunk)); + await writeFile(outPath, Buffer.concat(chunks)); + return reply; +} + +const reply = await voiceAgentTurn("speech.wav", "reply.mp3"); +console.log("Agent:", reply); +``` + + +`heard` is an [`ASRResponse`](/api-reference/sdk/python/types#asrresponse-objects): `heard.text` is the full transcript and `heard.duration` is the clip length in seconds. Pass `language="en"` to `transcribe()` to skip auto-detection when you already know the input language. + + + For the lowest latency, feed your LLM's token stream straight into + [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) + instead of waiting for the full reply string — see + [Realtime: LLM tokens → speech](/developer-guide/sdk-guide/cookbook/realtime-llm-to-speech). + + +## Reply in the caller's voice + +`reference_id` points the reply at a saved voice. Drop it to use the default voice, or clone the caller's voice from the same clip you just transcribed by passing `references` instead — see [Instant voice cloning](/developer-guide/sdk-guide/cookbook/instant-voice-cloning). 
+ +## Related + +- [Speech-to-Text guide](/features/speech-to-text) +- [Realtime: LLM tokens → speech](/developer-guide/sdk-guide/cookbook/realtime-llm-to-speech) +- [Stream TTS to a file](/developer-guide/sdk-guide/cookbook/streaming-to-file) diff --git a/developer-guide/sdk-guide/javascript/authentication.mdx b/developer-guide/sdk-guide/javascript/authentication.mdx deleted file mode 100644 index 747c7e8..0000000 --- a/developer-guide/sdk-guide/javascript/authentication.mdx +++ /dev/null @@ -1,75 +0,0 @@ ---- -title: "Authentication" -description: "Manage API keys and client setup in the Fish Audio JavaScript SDK" -icon: "key" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 8dffb4526a7b17be3092274000e01772b599c37b7307a4f8404cb633dfca7c72 */} - - - - - -## Prerequisites - - - -## Client Initialization - -Initialize a `FishAudioClient` with your API key to start using the SDK: - -```typescript -import { FishAudioClient } from "fish-audio"; - -// Initialize with your API key -const fishAudio = new FishAudioClient({ apiKey: "your_api_key" }); -``` - -### Using Environment Variables - -For better security, store your API key in environment variables: - - - - Set the environment variable in your shell: - ```bash - export FISH_API_KEY=your_api_key_here - ``` - Then initialize immediately: - ```typescript - import { FishAudioClient } from "fish-audio"; - - const fishAudio = new FishAudioClient(); - ``` - - - ```typescript - import { config } from "dotenv"; - import { FishAudioClient } from "fish-audio"; - - // Load environment variables from .env file - config(); - - const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - ``` - - Create a `.env` file in your project root: - ```bash - FISH_API_KEY=your_api_key_here - ``` - - - -### Custom Endpoints - -If you need to use a proxy or custom endpoint: - -```typescript -const fishAudio = new 
FishAudioClient({ - apiKey: "your_api_key", - baseUrl: "https://your-proxy-domain.com", -}); -``` \ No newline at end of file diff --git a/developer-guide/sdk-guide/javascript/installation.mdx b/developer-guide/sdk-guide/javascript/installation.mdx deleted file mode 100644 index f2512d9..0000000 --- a/developer-guide/sdk-guide/javascript/installation.mdx +++ /dev/null @@ -1,45 +0,0 @@ ---- -title: "Installation" -description: "Install and set up the Fish Audio JavaScript SDK" -icon: "download" ---- - -import Support from '/snippets/support.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 8caf44030e141bbcc8d79ccffd18823edd74f0903979d6ad76b1866a1a5339a6 */} - - - - - -To use the Fish Audio API in server-side JavaScript environments like Node.js, Deno, or Bun, -you can use the official [Fish Audio SDK for TypeScript and JavaScript](https://www.npmjs.com/package/fish-audio). - -## Requirements - -- Node.js 18 or higher - -## Install - -Install the JavaScript SDK from npm. 
Choose your preferred package manager: - - - - ```bash - npm install fish-audio - ``` - - - ```bash - yarn add fish-audio - ``` - - - ```bash - pnpm add fish-audio - ``` - - - - \ No newline at end of file diff --git a/developer-guide/sdk-guide/javascript/speech-to-text.mdx b/developer-guide/sdk-guide/javascript/speech-to-text.mdx deleted file mode 100644 index 89e87fd..0000000 --- a/developer-guide/sdk-guide/javascript/speech-to-text.mdx +++ /dev/null @@ -1,168 +0,0 @@ ---- -title: "Speech to Text" -description: "Convert audio to text with Fish Audio JavaScript SDK" -icon: "microphone-lines" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: e413397f313f01574283793309833959152e13d765e0a0e0b1112957102fc8f9 */} - - - - - -## Prerequisites - - - -## Basic Usage - -Transcribe audio to text: - -```typescript -import { FishAudioClient } from "fish-audio"; -import { createReadStream } from "fs"; - -const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - -const result = await fishAudio.speechToText.convert({ - audio: createReadStream("audio.mp3"), -}); - -console.log(result.text); -console.log("Duration (s):", result.duration); -``` - -## Language Specification - -Improve accuracy by specifying the language: - -```typescript -// English transcription -await fishAudio.speechToText.convert({ - audio: createReadStream("audio.mp3"), - language: "en" -}); - -// Chinese transcription -await fishAudio.speechToText.convert({ - audio: createReadStream("audio.mp3"), - language: "zh" -}); -``` - -Common language codes: `en` (English), `zh` (Chinese), `es` (Spanish), `fr` (French), `de` (German), `ja` (Japanese), `ko` (Korean), `pt` (Portuguese) - - -Automatic language detection works well, but specifying the language improves accuracy and speed. 
- - -## Working with Segments - -Get detailed timing for each segment: - -```typescript -const response = await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3") }); - -// Full transcription -console.log(response.text); - -// Segment details -for (const seg of response.segments ?? []) { - console.log(`[${seg.start.toFixed(2)}s - ${seg.end.toFixed(2)}s] ${seg.text}`); -} -``` - -## Timestamps Control - -Control timestamp generation: - -```typescript -// Include timestamps (default) -await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), ignore_timestamps: false }); - -// Skip timestamp processing for faster results -await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), ignore_timestamps: true }); -``` - - -`ignore_timestamps: false` (default) includes segment timestamps. Set to `true` to skip timestamp processing for faster transcription when you only need the text. - - -## Audio Formats - -Supported audio formats: -- MP3 (recommended) -- WAV -- M4A -- OGG -- FLAC -- AAC - -File requirements: -- Maximum size: 20MB -- Maximum duration: 60 minutes -- Sample rate: 16kHz or higher recommended - -## Transcribing TTS Output - -Transcribe generated speech: - -```typescript -import { FishAudioClient } from "fish-audio"; - -const fishAudio = new FishAudioClient(); - -// Generate speech -const ttsAudio = await fishAudio.textToSpeech.convert({ text: "Hello, this is a test" }); - -// Transcribe it -const asr = await fishAudio.speechToText.convert({ audio: ttsAudio }); -console.log(asr.text); -``` - -## Error Handling - -Handle common errors: - -```typescript -try { - await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3") }); -} catch (e: any) { - const status = e?.status || e?.response?.status; - if (status === 413) console.error("Audio file too large (max 20MB)"); - else if (status === 400) console.error("Invalid audio format"); - else throw e; -} -``` - -## Response Structure - -The ASR 
response includes: - -| Field | Type | Description | -|------------|-------------------|--------------------------------| -| `text` | string | Complete transcription | -| `duration` | number | Audio duration (seconds) | -| `segments` | ASRSegment[] | Timestamped text segments | - -Segment structure: -| Field | Type | Description | -|---------|--------|--------------------------| -| `text` | string | Segment text | -| `start` | number | Start time (seconds) | -| `end` | number | End time (seconds) | - - -Note the timing units: `duration` and segment times are in seconds. - - -## Request Parameters - -| Parameter | Type | Description | Default | -|---------------------|----------------------------|----------------------------|--------------------| -| `audio` | File | Buffer | Readable stream | Audio to transcribe | Required | -| `language` | string | Language code (e.g., "en") | None (auto-detect) | -| `ignore_timestamps` | boolean | Skip timestamp processing | false | \ No newline at end of file diff --git a/developer-guide/sdk-guide/javascript/text-to-speech.mdx b/developer-guide/sdk-guide/javascript/text-to-speech.mdx deleted file mode 100644 index 3934cc0..0000000 --- a/developer-guide/sdk-guide/javascript/text-to-speech.mdx +++ /dev/null @@ -1,201 +0,0 @@ ---- -title: "Text to Speech" -description: "Convert text to natural speech with Fish Audio JavaScript SDK" -icon: "microphone" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: fc17eef01159e7d16739b323c5a330a562a917a11e29489e3ccfc436fe8c5841 */} - - - - - -## Prerequisites - - - -## Basic Usage - -Generate speech from text: - -```typescript -import { FishAudioClient, play } from "fish-audio"; - -const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - -const audio = await fishAudio.textToSpeech.convert({ - text: "Hello, world!", -}); - -await play(audio); -``` - -## Using Voice 
Models - -Specify a voice model for consistent voice generation: - -```typescript -import { FishAudioClient, play } from "fish-audio"; - -const fishAudio = new FishAudioClient(); - -const audio = await fishAudio.textToSpeech.convert({ - text: "This is my custom voice", - reference_id: "your_model_id", // Your model ID from fish.audio -}); - -await play(audio); -``` - -### Getting Model IDs - -The `reference_id` is the model ID from the URL when viewing a model on Fish Audio: - -- Model URL: `https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89` -- Reference ID: `802e3bc2b27e49c2995d23ef70e6ac89` - -You can also get model IDs programmatically: - -```typescript -// List your models -const results = await fishAudio.voices.search({ self: true }); -for (const model of results.items ?? []) { - console.log(`${model.title}: ${model._id}`); -} - -// Get specific model details -const model = await fishAudio.voices.get("your_model_id"); -console.log(`Model: ${model.title}, ID: ${model._id}`); -``` - -## Emotions - - -The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. - - -Add emotional expressions to your text: - -```typescript -import type { TTSRequest } from "fish-audio"; - -const text = ` -(happy) I'm excited to share this! -(sad) Unfortunately, it didn't work out. -(whispering) This is a secret. -`; - -const request: TTSRequest = { text, reference_id: "model_id" }; -``` - -Common emotions: `(happy)`, `(sad)`, `(angry)`, `(excited)`, `(calm)`, `(surprised)`, `(whispering)`, `(shouting)`, `(laughing)`, `(sighing)` - -For more advanced control over speech generation, including phoneme-level control and additional paralanguage features, see [Fine-grained Control](/developer-guide/core-features/fine-grained-control). 
- -## Audio Formats - -Choose output format based on your needs: - -```typescript -// MP3 (default) -await fishAudio.textToSpeech.convert({ text: "...", format: "mp3", mp3_bitrate: 192 }); - -// WAV - uncompressed -await fishAudio.textToSpeech.convert({ text: "...", format: "wav", sample_rate: 44100 }); - -// Opus - efficient for streaming -await fishAudio.textToSpeech.convert({ text: "...", format: "opus", opus_bitrate: 48 }); - -// PCM - raw audio data -await fishAudio.textToSpeech.convert({ text: "...", format: "pcm", sample_rate: 16000 }); -``` - -## Prosody Control - -Adjust speech speed and volume: - -```typescript -const audio = await fishAudio.textToSpeech.convert({ - text: "Adjusted speech", - prosody: { - speed: 1.2, // 0.5 - 2.0 - volume: 5, // -20 - 20 - }, -}); -``` - -## Advanced Parameters - -Fine-tune generation: - -```typescript -const audio = await fishAudio.textToSpeech.convert({ - text: "Your text here", - chunk_length: 200, // Characters per chunk (100-300) - normalize: true, // Normalize text - latency: "balanced", // "normal" or "balanced" - temperature: 0.7, // Randomness (0.0-1.0) - top_p: 0.7, // Token selection (0.0-1.0) -}); -``` - -## Choosing Backend - -Our state-of-the-art [S2-Pro model](/developer-guide/models-pricing/models-overview) -is the default backend model for TTS. Optionally specify the model via the second argument (`backend: Backends`). - -```typescript -const audio = await fishAudio.textToSpeech.convert({ - text: "Hello, world!", -}, "s2-pro"); -``` - -## Streaming - -For real-time streaming, see the [WebSocket guide](/developer-guide/sdk-guide/javascript/websocket). 
- -## Error Handling - -Handle common errors: - -```typescript -async function generateWithRetry(request: Record, maxRetries = 3) { - const fishAudio = new FishAudioClient(); - for (let attempt = 0; attempt < maxRetries; attempt++) { - try { - return await fishAudio.textToSpeech.convert(request); - } catch (e: any) { - const status = e?.status || e?.response?.status; - if (status === 429) await new Promise(r => setTimeout(r, 2 ** attempt * 1000)); - else if (status === 401) throw new Error("Invalid API key"); - else throw e; - } - } -} -``` - -## Request Parameters - -| Parameter | Type | Description | Default | -|----------------|----------|----------------------|------------| -| `text` | string | Text to convert | Required | -| `reference_id` | string | Voice model ID | None | -| `references` | object[] | Reference audio | [] | -| `format` | string | Audio format | "mp3" | -| `chunk_length` | number | Chunk size (100-300) | 200 | -| `normalize` | boolean | Normalize text | true | -| `latency` | string | Speed vs quality | "balanced" | -| `prosody` | object | Speed/volume | None | -| `temperature` | number | Randomness | 0.7 | -| `top_p` | number | Token selection | 0.7 | - -## Next Steps - -- [Fine-grained control](/developer-guide/core-features/fine-grained-control) for phoneme-level control and paralanguage -- [Voice cloning](/developer-guide/sdk-guide/javascript/voice-cloning) for custom voices -- [WebSocket streaming](/developer-guide/sdk-guide/javascript/websocket) for real-time apps -- [Guide and Best Practices](/developer-guide/core-features/text-to-speech) for production use -- [API reference](/api-reference/endpoint/openapi-v1/text-to-speech) for direct API calls \ No newline at end of file diff --git a/developer-guide/sdk-guide/javascript/voice-cloning.mdx b/developer-guide/sdk-guide/javascript/voice-cloning.mdx deleted file mode 100644 index a62a024..0000000 --- a/developer-guide/sdk-guide/javascript/voice-cloning.mdx +++ /dev/null @@ -1,220 +0,0 @@ 
---- -title: "Voice Cloning" -description: "Clone voices using reference audio with Fish Audio JavaScript SDK" -icon: "clone" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: df4ea781a900d8c8f6605b8782598ffba3879dfea424692438113630bae5d91c */} - - - - - -## Prerequisites - - - -## Overview - -Voice cloning allows you to generate speech that matches a specific voice using reference audio. Fish Audio supports two approaches: -- Using pre-trained voice models (reference_id) -- Providing reference audio directly in your request - - -Use `reference_id` when you'll reuse a voice multiple times - it's faster and more efficient. Use `references` for one-off voice cloning or testing different voices without creating models. - - -## Using Reference Audio - -Clone a voice by providing reference audio directly: - -```typescript -import { FishAudioClient } from "fish-audio"; -import type { TTSRequest, ReferenceAudio } from "fish-audio"; -import { readFile } from "fs/promises"; - -const fishAudio = new FishAudioClient(); - -const audioBuffer = await readFile("voice_sample.wav"); -const referenceFile = new File([audioBuffer], "voice_sample.wav"); - -const referenceAudio: ReferenceAudio = { - audio: referenceFile, - text: "Text spoken in the reference audio" -}; - -const request: TTSRequest = { - text: "Hello, world!", - references: [referenceAudio] -}; - -const audio = await fishAudio.textToSpeech.convert(request); -``` - -## Multiple References - -Improve voice quality by providing multiple reference samples: - -```typescript -import type { TTSRequest, ReferenceAudio } from "fish-audio"; -import { readFile } from "fs/promises"; - -const references = [] as ReferenceAudio[]; - -for (const i of [0, 1, 2]) { - const buf = await readFile(`sample_${i}.wav`); - references.push({ audio: new File([buf], `sample_${i}.wav`), text: `Text from sample ${i}` }); -} - -const request: 
TTSRequest = { - text: "Better voice quality with multiple references", - references, -}; -``` - -## Creating Voice Models - -For repeated use, create a persistent voice model: - -```typescript -import { FishAudioClient } from "fish-audio"; -import { createReadStream } from "fs"; - -const fishAudio = new FishAudioClient(); - -// Create a voice model from samples -const response = await fishAudio.voices.ivc.create({ - title: "My Custom Voice", - voices: [ - createReadStream("voice_0.wav"), - createReadStream("voice_1.wav"), - createReadStream("voice_2.wav"), - ], - cover_image: createReadStream("cover.png"), -}); - -console.log("Created model:", response._id); - -// Use the model -const audio = await fishAudio.textToSpeech.convert({ - text: "Using my saved voice model", - reference_id: response._id, -}); -``` - -## Best Practices - -### Audio Quality - -For best results, reference audio should: -- Be 10-30 seconds long per sample -- Have clear speech without background noise -- Match the language you'll generate -- Include varied intonation and emotion - -### Sample Text - -The text parameter in ReferenceAudio should: -- Match exactly what's spoken in the audio -- Include punctuation for proper prosody -- Be in the same language as generation - -### Performance Tips - -1. **Pre-upload models** for frequently used voices -2. **Use 2-3 reference samples** for optimal quality -3. **Keep samples under 30 seconds** each -4. 
**Normalize audio levels** before uploading - -## Audio Format Requirements - -Supported formats for reference audio: -- WAV (recommended) -- MP3 -- M4A -- Other common audio formats - -Sample rates: -- 16kHz minimum -- 44.1kHz recommended -- Mono or stereo (converted to mono) - -## Example: Voice Bank - -Build a library of cloned voices: - -```typescript -import { FishAudioClient } from "fish-audio"; - -const fishAudio = new FishAudioClient(); - -async function createVoiceBank() { - const voiceBank: Record = {}; - const models = await fishAudio.voices.search(); - for (const m of models.items ?? []) voiceBank[m.title] = m._id as string; - return voiceBank; -} - -async function generateWithVoice(text: string, voiceName: string) { - const bank = await createVoiceBank(); - const modelId = bank[voiceName]; - if (!modelId) throw new Error(`Voice '${voiceName}' not found`); - return fishAudio.textToSpeech.convert({ text, reference_id: modelId }); -} -``` - -## Combining with Emotions - -Add emotions to cloned voices: - -```typescript -// With a saved model -await fishAudio.textToSpeech.convert({ - text: "(happy) This is exciting news! 
(calm) Let me explain the details.", - reference_id: "your_model_id", -}); - -// Or with direct references -await fishAudio.textToSpeech.convert({ - text: "(excited) Amazing discovery!", - references: [referenceAudio], -}); -``` - -## Error Handling - -Common issues and solutions: - -```typescript -try { - await fishAudio.textToSpeech.convert({ text: "Test speech", references: [referenceAudio] }); -} catch (e: any) { - const msg = String(e?.message || e); - if (msg.includes("Invalid audio format")) console.error("Check audio format - use WAV or MP3"); - else if (msg.includes("Audio too short")) console.error("Reference audio should be at least 10 seconds"); - else throw e; -} -``` - -{/* -## Model Management - -Basic model operations: - -```typescript -// List/search -await fishAudio.voices.search(); - -// Get one -await fishAudio.voices.get("your_model_id"); - -// Update -await fishAudio.voices.update("your_model_id", { title: "new_title" }); - -// Delete -await fishAudio.voices.delete("your_model_id"); -``` -*/} diff --git a/developer-guide/sdk-guide/javascript/websocket.mdx b/developer-guide/sdk-guide/javascript/websocket.mdx deleted file mode 100644 index a9c16cf..0000000 --- a/developer-guide/sdk-guide/javascript/websocket.mdx +++ /dev/null @@ -1,176 +0,0 @@ ---- -title: "WebSocket" -description: "Real-time streaming with Fish Audio JavaScript SDK" -icon: "bolt" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: ad7ba016bd2904359639932e55d311b20f2b15ef5a61fdef2e2618da18ea8264 */} - - - - - -## Prerequisites - - - -## Overview - -WebSocket streaming enables real-time text-to-speech generation, perfect for conversational AI, live captioning, and streaming applications. 
- -## Basic Streaming - -Stream text and receive audio in real-time: - -```typescript -import { FishAudioClient, RealtimeEvents } from "fish-audio"; -import { writeFile } from "fs/promises"; -import path from "path"; - -// Simple async generator that yields text chunks -async function* makeTextStream() { - const chunks = [ - "Hello from Fish Audio! ", - "This is a realtime text-to-speech test. ", - "We are streaming multiple chunks over WebSocket.", - ]; - for (const chunk of chunks) { - yield chunk; - } -} - -const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); - -// For realtime, set text to "" and stream the content via makeTextStream -const request = { text: "" }; - -const connection = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream()); - -// Collect audio and write to a file when the stream ends -const chunks: Buffer[] = []; -connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened")); -connection.on(RealtimeEvents.AUDIO_CHUNK, (audio: unknown): void => { - if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) { - chunks.push(Buffer.from(audio)); - } -}); -connection.on(RealtimeEvents.ERROR, (err) => console.error("WebSocket error:", err)); -connection.on(RealtimeEvents.CLOSE, async () => { - const outPath = path.resolve(process.cwd(), "out.mp3"); - await writeFile(outPath, Buffer.concat(chunks)); - console.log("Saved to", outPath); -}); -``` - - -Set `text: ""` in the request when streaming. The actual text comes from your text stream generator. 
- - -## Using Voice Models - -Stream with a specific voice: - -```typescript -const request = { - text: "", // Empty for streaming - reference_id: "your_model_id", - format: "mp3", -}; - -const conn = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream()); -conn.on(RealtimeEvents.AUDIO_CHUNK, () => { /* handle audio */ }); -``` - -## Dynamic Text Generation - -Stream text as it's generated: - -```typescript -async function* generateText() { - const responses = [ - "Processing your request...", - "Here's what I found:", - "The answer is 42.", - ]; - for (const response of responses) { - for (const word of response.split(" ")) { - yield word + " "; - await new Promise(r => setTimeout(r, 20)); - } - } -} - -await fishAudio.textToSpeech.convertRealtime({ text: "" }, generateText()); -``` - -## Line-by-Line Processing - -Stream text line by line: - -```typescript -import { createReadStream } from "fs"; -import readline from "readline"; - -async function* readFileLines(filepath: string) { - const rl = readline.createInterface({ input: createReadStream(filepath) }); - for await (const line of rl) { - yield line.trim() + " "; - } -} - -await fishAudio.textToSpeech.convertRealtime({ text: "" }, readFileLines("story.txt")); -``` - -## Errors - -Handle connection errors via event listeners: - -```typescript -connection.on(RealtimeEvents.ERROR, (err) => { - console.error("WebSocket error:", err); - // Fallback to regular TTS or retry -}); -``` - -## Configuration/Choosing Backend - -Customize WebSocket behavior by configuring the client.
-Optionally specify the backend model to use. -Our state-of-the-art [S2-Pro model](/developer-guide/models-pricing/models-overview) is the default: - -```typescript -// Custom endpoint -const fishAudio = new FishAudioClient({ - apiKey: process.env.FISH_API_KEY, - baseUrl: "https://api.fish.audio", // Use a proxy/custom endpoint if needed -}); - -// Select backend model (third positional argument) -const conn = await fishAudio.textToSpeech.convertRealtime( - request, - makeTextStream(), - "s2-pro" -); -``` - -## Best Practices - -1. **Chunk Size**: Yield text in natural phrases for best prosody -2. **Buffer Management**: Process audio chunks immediately to avoid memory buildup -3. **Connection Reuse**: Keep WebSocket sessions alive for multiple streams -4. **Error Recovery**: Implement retry logic for connection failures -5. **Format Selection**: Use PCM for real-time playback, MP3 for storage - -## Events - -The connection emits these events: - -| Event | Description | -|-----------------------|--------------------------------------| -| `OPEN` | WebSocket connection established | -| `AUDIO_CHUNK` | Audio chunk received (Uint8Array) | -| `ERROR` | Error occurred on the connection | -| `CLOSE` | Connection closed | \ No newline at end of file diff --git a/developer-guide/sdk-guide/python/authentication.mdx b/developer-guide/sdk-guide/python/authentication.mdx index 33e67bb..03b9cb2 100644 --- a/developer-guide/sdk-guide/python/authentication.mdx +++ b/developer-guide/sdk-guide/python/authentication.mdx @@ -118,11 +118,11 @@ Handle [`AuthenticationError`](/api-reference/sdk/python/exceptions#authenticati ## Next Steps - + Generate speech with the authenticated client - + Clone voices and create custom models diff --git a/developer-guide/sdk-guide/python/errors.mdx b/developer-guide/sdk-guide/python/errors.mdx new file mode 100644 index 0000000..da52877 --- /dev/null +++ b/developer-guide/sdk-guide/python/errors.mdx @@ -0,0 +1,140 @@ +--- +title: "Errors & Retries" +description: 
"Exception types, retry strategy, and timeouts in the Fish Audio Python SDK" +icon: "triangle-exclamation" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +## Prerequisites + + + +## Exception hierarchy + +Every SDK error inherits from [`FishAudioError`](/api-reference/sdk/python/exceptions#fishaudioerror-objects). HTTP failures raise [`APIError`](/api-reference/sdk/python/exceptions#apierror-objects) or one of its subclasses, which expose `.status`, `.message`, and `.body`. + +| Exception | Raised when | Notes | +| --- | --- | --- | +| `AuthenticationError` | `401` | Missing or invalid API key | +| `PermissionError` | `403` | Key lacks permission for the resource | +| `NotFoundError` | `404` | Voice model id not found | +| `RateLimitError` | `429` | Rate limit / quota exceeded | +| `ServerError` | `5xx` | Transient server-side failure | +| `APIError` | any other non-2xx | Base for the above; `status == 422` for invalid parameters | +| `WebSocketError` | realtime stream failed mid-session | Reconnect rather than retrying the same socket | +| `DependencyError` | a required system tool is missing (e.g. ffmpeg for `play()`) | Carries `.dependency` and `.install_command` | + + + There is no separate `ValidationError` raised at runtime. Invalid request + parameters come back as an `APIError` with `status == 422` — catch `APIError`, + not `ValidationError`. 
+ + +## Handling errors + + +```python Synchronous +from fishaudio import FishAudio +from fishaudio.exceptions import ( + AuthenticationError, + RateLimitError, + NotFoundError, + APIError, + FishAudioError, +) + +client = FishAudio() + +try: + audio = client.tts.convert(text="Hello!", reference_id="maybe-missing") +except AuthenticationError: + print("Invalid API key") +except RateLimitError: + print("Rate limited — back off and retry") +except NotFoundError: + print("That voice model does not exist") +except APIError as e: + print(f"API error {e.status}: {e.message}") # includes 422 validation +except FishAudioError as e: + print(f"SDK error: {e}") # e.g. WebSocketError, DependencyError +``` + +```python Asynchronous +import asyncio +from fishaudio import AsyncFishAudio +from fishaudio.exceptions import RateLimitError, APIError, FishAudioError + +async def main(): + async with AsyncFishAudio() as client: + try: + audio = await client.tts.convert(text="Hello!") + except RateLimitError: + print("Rate limited — back off and retry") + except APIError as e: + print(f"API error {e.status}: {e.message}") + except FishAudioError as e: + print(f"SDK error: {e}") + +asyncio.run(main()) +``` + + +## Retries + +The Python client does **not** retry automatically — each call makes a single request and raises on failure. 
Add your own backoff where it matters, typically around `RateLimitError` and `ServerError`: + +```python +import time +from fishaudio import FishAudio +from fishaudio.exceptions import RateLimitError, ServerError + +client = FishAudio() + +def convert_with_retry(text: str, max_retries: int = 3) -> bytes: + for attempt in range(max_retries): + try: + return client.tts.convert(text=text) + except (RateLimitError, ServerError): + if attempt == max_retries - 1: + raise + time.sleep(2 ** attempt) # exponential backoff + raise RuntimeError("unreachable") +``` + + + `RequestOptions` accepts a `max_retries` field, but the current client does + not act on it — use an explicit loop like the one above. + + +## Timeouts + +The request timeout is set on the client (seconds; default `240`): + +```python +from fishaudio import FishAudio + +client = FishAudio(timeout=30.0) +``` + +Override headers or timeout for a single request with `request_options`: + +```python +from fishaudio.core.request_options import RequestOptions + +audio = client.tts.convert( + text="Hello!", + request_options=RequestOptions(timeout=15.0, additional_headers={"X-Trace": "abc"}), +) +``` + + + If you inject your own `httpx_client`, the SDK uses it as-is — the client-level + `timeout`, `base_url`, and the `Authorization` header are **not** applied to + it. Configure those on the client you pass in. 
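The backoff loop from the Retries section can be adapted for the async client. The following is a sketch rather than SDK functionality: `retry_async`, `make_call`, `retryable`, and `base_delay` are hypothetical names, and the helper uses only the standard library so it can wrap any awaitable call.

```python
import asyncio
import random


async def retry_async(make_call, retryable, max_retries=3, base_delay=1.0):
    """Await make_call() up to max_retries times, backing off between attempts.

    make_call: zero-argument callable that returns a fresh awaitable each time.
    retryable: tuple of exception classes worth retrying.
    Hypothetical helper for illustration; not part of the SDK.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return await make_call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
```

With `AsyncFishAudio`, a call might look like `await retry_async(lambda: client.tts.convert(text="Hello!"), (RateLimitError, ServerError))`. Passing a callable instead of a coroutine matters here, because a coroutine object can only be awaited once and each retry needs a new request.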
+ + +## Related + +- [Exceptions API reference](/api-reference/sdk/python/exceptions) +- [Real-time WebSocket](/features/realtime-streaming) — `WebSocketError` handling diff --git a/developer-guide/sdk-guide/python/overview.mdx b/developer-guide/sdk-guide/python/overview.mdx deleted file mode 100644 index caea451..0000000 --- a/developer-guide/sdk-guide/python/overview.mdx +++ /dev/null @@ -1,597 +0,0 @@ ---- -title: "Overview" -description: "The official Python library for the Fish Audio API" -icon: "python" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 76979ffd115d3a77f60fd200d42e49605cce2cc21a06c8f380cfafff92443525 */} - - - - - -This guide will walk you through installation, authentication, and core features. - - -If you're using the legacy Session-based API (`fish_audio_sdk`), see the [migration guide](/archive/python-sdk-legacy/migration-guide) to upgrade to the new SDK. 
- - -## Installation - - - -Install via pip (Python 3.9 or higher required): - -```bash -pip install fish-audio-sdk -``` - -For audio playback utilities, install with the `utils` extra: - -```bash -pip install fish-audio-sdk[utils] -``` - - - - - - - -Configure your API key using environment variables: - -```bash -export FISH_API_KEY=your_api_key_here -``` - -Or create a `.env` file in your project root: - -```bash -FISH_API_KEY=your_api_key_here -``` - - - -## Quick Start - -Get started with the [`FishAudio`](/api-reference/sdk/python/client#fishaudio-objects) client in less than a minute: - - -```python Synchronous -from fishaudio import FishAudio -from fishaudio.utils import play, save - -# Initialize client (reads from FISH_API_KEY environment variable) -client = FishAudio() - -# Generate and play audio -audio = client.tts.convert(text="Hello, playing from Fish Audio!") -play(audio) - -# Generate and save audio -audio = client.tts.convert(text="Saving this audio to a file!") -save(audio, "output.mp3") -``` - -```python Asynchronous -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play, save - -async def main(): - # Initialize async client - client = AsyncFishAudio() - - # Generate and play audio - audio = await client.tts.convert(text="Hello, playing from Fish Audio!") - play(audio) - - # Generate and save audio - audio = await client.tts.convert(text="Saving this audio to a file!") - save(audio, "output.mp3") - -asyncio.run(main()) -``` - - -## Core Features - -### Text-to-Speech - -Fully customizable text-to-speech generation: - - -```python Synchronous focus={6-10} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# With a specific voice -audio = client.tts.convert( - text="Custom voice", - reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian -) -play(audio) -``` - -```python Asynchronous focus={8-12} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils 
import play - -async def main(): - client = AsyncFishAudio() - - # With a specific voice - audio = await client.tts.convert( - text="Custom voice", - reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian - ) - play(audio) - -asyncio.run(main()) -``` - - - -```python Synchronous focus={6-10} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# With speed control -audio = client.tts.convert( - text="I'm talking pretty fast, is this still too slow?", - speed=1.5 # 1.5x speed -) -play(audio) -``` - -```python Asynchronous focus={8-12} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # With speed control - audio = await client.tts.convert( - text="I'm talking pretty fast, is this still too slow?", - speed=1.5 # 1.5x speed - ) - play(audio) - -asyncio.run(main()) -``` - - -Create reusable configurations with [`TTSConfig`](/api-reference/sdk/python/types#ttsconfig-objects). 
[`Prosody`](/api-reference/sdk/python/types#prosody-objects) controls speech characteristics like speed and volume: - - -```python Synchronous focus={7-18} -from fishaudio import FishAudio -from fishaudio.types import TTSConfig, Prosody -from fishaudio.utils import play - -client = FishAudio() - -# Define config once -my_config = TTSConfig( - prosody=Prosody(speed=1.2, volume=-5), - reference_id="933563129e564b19a115bedd57b7406a", # Sarah - format="wav", - latency="balanced" -) - -# Reuse across multiple generations -audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config) -audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config) -audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) - -play(audio1) -play(audio2) -play(audio3) -``` - -```python Asynchronous focus={9-20} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import TTSConfig, Prosody -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Define config once - my_config = TTSConfig( - prosody=Prosody(speed=1.2, volume=-5), - reference_id="933563129e564b19a115bedd57b7406a", # Sarah - format="wav", - latency="balanced" - ) - - # Reuse across multiple generations - audio1 = await client.tts.convert(text="Welcome to our product demonstration.", config=my_config) - audio2 = await client.tts.convert(text="Let me show you the key features.", config=my_config) - audio3 = await client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) - - play(audio1) - play(audio2) - play(audio3) - -asyncio.run(main()) -``` - - - -For chunk-by-chunk processing, use [`stream()`](/api-reference/sdk/python/resources#stream) which returns an `AudioStream` (iterable). For real-time streaming with dynamic text, see [Real-time Streaming](#real-time-streaming) below. 
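Because the stream is iterable, chunks can be handled one at a time instead of being held in memory together. The following is a minimal sketch of that pattern, not SDK code: `fake_audio_stream` is a hypothetical stand-in that yields byte chunks the way the note above describes an `AudioStream`, and `write_chunks` is an illustrative helper name.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def write_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each chunk to the sink as it arrives; return total bytes written."""
    total = 0
    for chunk in chunks:
        sink.write(chunk)
        total += len(chunk)
    return total


def fake_audio_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for an iterable audio stream (three small chunks)."""
    yield b"\x00\x01"
    yield b"\x02\x03\x04"
    yield b"\x05"


# An in-memory buffer keeps the sketch self-contained; a file opened
# with mode "wb" would work the same way.
buffer = io.BytesIO()
written = write_chunks(fake_audio_stream(), buffer)
print(written)
```

Only one chunk is referenced at a time inside the loop, which is what makes this approach suitable for long audio.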
- - -Learn more in the [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech). - -### Speech-to-Text - -Transcribe audio to text for various use cases: - - -```python Synchronous focus={5-16} -from fishaudio import FishAudio - -client = FishAudio() - -# Transcribe audio -with open("audio.wav", "rb") as f: - result = client.asr.transcribe( - audio=f.read(), - language="en" # Optional: specify language - ) - -print(result.text) - -# Access segments -for segment in result.segments: - print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}") -``` - -```python Asynchronous focus={7-18} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Transcribe audio - with open("audio.wav", "rb") as f: - result = await client.asr.transcribe( - audio=f.read(), - language="en" # Optional: specify language - ) - - print(result.text) - - # Access segments - for segment in result.segments: - print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}") - -asyncio.run(main()) -``` - - -Learn more in the [Speech-to-Text guide](/developer-guide/sdk-guide/python/speech-to-text). - -### Real-time Streaming - -Stream dynamically generated text for conversational AI and live applications. Perfect for integrating with LLM streaming responses, live captions, and chatbot interactions: - - -```python Synchronous focus={7-15} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# Stream dynamically generated text (e.g., from LLM) -def text_chunks(): - yield "Hello, " - yield "this is " - yield "streaming text!" 
- -audio_stream = client.tts.stream_websocket( - text_chunks(), - latency="balanced" -) - -play(audio_stream) -``` - -```python Asynchronous focus={9-17} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Stream dynamically generated text - async def text_chunks(): - yield "Hello, " - yield "this is " - yield "streaming text!" - - audio_stream = await client.tts.stream_websocket( - text_chunks(), - latency="balanced" - ) - - play(audio_stream) - -asyncio.run(main()) -``` - - -Learn more in the [WebSocket Streaming guide](/developer-guide/sdk-guide/python/websocket). - -### Voice Cloning - -**Instant voice cloning** - Clone a voice on-the-fly using [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects): - - -```python Synchronous focus={6-12} -from fishaudio import FishAudio -from fishaudio.types import ReferenceAudio - -client = FishAudio() - -# Instant voice cloning -with open("reference.wav", "rb") as f: - audio = client.tts.convert( - text="This will sound like the reference voice", - references=[ReferenceAudio( - audio=f.read(), - text="Text spoken in the reference audio" - )] - ) -``` - -```python Asynchronous focus={8-14} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import ReferenceAudio - -async def main(): - client = AsyncFishAudio() - - # Instant voice cloning - with open("reference.wav", "rb") as f: - audio = await client.tts.convert( - text="This will sound like the reference voice", - references=[ReferenceAudio( - audio=f.read(), - text="Text spoken in the reference audio" - )] - ) - -asyncio.run(main()) -``` - - -**Voice models** - Create persistent voice models for repeated use: - - -```python Synchronous focus={6-11} -from fishaudio import FishAudio - -client = FishAudio() - -# Create persistent voice model -with open("voice_sample.wav", "rb") as f: - voice = client.voices.create( - title="My Custom Voice", - 
voices=[f.read()], - description="Custom voice clone" - ) -print(f"Created voice: {voice.id}") -``` - -```python Asynchronous focus={8-13} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Create persistent voice model - with open("voice_sample.wav", "rb") as f: - voice = await client.voices.create( - title="My Custom Voice", - voices=[f.read()], - description="Custom voice clone" - ) - print(f"Created voice: {voice.id}") - -asyncio.run(main()) -``` - - -Learn more in the [Voice Cloning guide](/developer-guide/sdk-guide/python/voice-cloning). - -## Client Initialization - - - -The recommended approach using environment variables: - -```python -from fishaudio import FishAudio - -# Automatically reads from FISH_API_KEY environment variable -client = FishAudio() -``` - - - -Provide the API key directly: - -```python -from fishaudio import FishAudio - -client = FishAudio(api_key="your_api_key") -``` - - -Never commit API keys to version control. Use environment variables or secret management systems. - - - - -Configure a custom base URL: - -```python -from fishaudio import FishAudio - -client = FishAudio( - api_key="your_api_key", - base_url="https://your-proxy-domain.com" -) -``` - - - -## Sync vs Async - -The SDK provides both synchronous and asynchronous clients: - - -```python Synchronous -from fishaudio import FishAudio - -# For typical applications -client = FishAudio() -audio = client.tts.convert(text="Hello!") -``` - -```python Asynchronous -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - # For async applications (web servers, concurrent tasks) - client = AsyncFishAudio() - audio = await client.tts.convert(text="Hello!") - -asyncio.run(main()) -``` - - - -Use [`AsyncFishAudio`](/api-reference/sdk/python/client#asyncfishaudio-objects) when: -- Building async web applications (FastAPI, Sanic, etc.) 
-- Processing multiple requests concurrently -- Integrating with other async libraries -- You need maximum performance - - -## Resource Clients - -The SDK organizes functionality into resource clients: - -| Resource | Description | Key Methods | -|------------------|--------------------|-------------------------------------------------------| -| [`client.tts`](/api-reference/sdk/python/resources#ttsclient-objects) | Text-to-speech | `convert()`, `stream()`, `stream_websocket()` | -| [`client.asr`](/api-reference/sdk/python/resources#asrclient-objects) | Speech recognition | `transcribe()` | -| [`client.voices`](/api-reference/sdk/python/resources#voicesclient-objects) | Voice management | `list()`, `get()`, `create()`, `update()`, `delete()` | -| [`client.account`](/api-reference/sdk/python/resources#accountclient-objects) | Account info | `get_credits()`, `get_package()` | - -## Utility Functions - -The SDK includes helpful utilities (requires `utils` extra): - -```python -from fishaudio.utils import save, play, stream - -# Save audio to file -save(audio, "output.mp3") - -# Play audio (automatically detects environment) -play(audio) # Works in Jupyter, regular Python, etc. - -# Stream audio in real-time (requires mpv) -stream(audio_iterator) -``` - -Use [`play()`](/api-reference/sdk/python/utils#play) for playback and [`save()`](/api-reference/sdk/python/utils#save) for writing audio files. - -Learn more in the [API Reference - Utils](/api-reference/sdk/python/utils). - -## Error Handling - -The SDK provides a comprehensive exception hierarchy: - -```python -from fishaudio import FishAudio -from fishaudio.exceptions import ( - FishAudioError, - AuthenticationError, - RateLimitError, - ValidationError -) - -client = FishAudio() - -try: - audio = client.tts.convert(text="Hello!") -except AuthenticationError: - print("Invalid API key") -except RateLimitError: - print("Rate limit exceeded. 
Please wait before retrying.") -except ValidationError as e: - print(f"Invalid request: {e}") -except FishAudioError as e: - print(f"API error: {e}") -``` - -The SDK includes exceptions for [`AuthenticationError`](/api-reference/sdk/python/exceptions#authenticationerror-objects), [`RateLimitError`](/api-reference/sdk/python/exceptions#ratelimiterror-objects), [`ValidationError`](/api-reference/sdk/python/exceptions#validationerror-objects), and [`FishAudioError`](/api-reference/sdk/python/exceptions#fishaudioerror-objects) for common error scenarios. - -Learn more in the [API Reference - Exceptions](/api-reference/sdk/python/exceptions). - -## Next Steps - - - - Set up API keys and client configuration - - - - Generate natural-sounding speech - - - - Clone voices and manage voice models - - - - Transcribe audio to text - - - - Real-time audio streaming - - - - Complete API documentation - - - -## Resources - -- [GitHub Repository](https://github.com/fishaudio/fish-audio-python) -- [PyPI Package](https://pypi.org/project/fish-audio-sdk/) -- [Migration Guide](/archive/python-sdk-legacy/migration-guide) - Upgrade from legacy SDK -- [Best Practices](/developer-guide/best-practices/) - Production-ready tips -- [API Reference](/api-reference/sdk/python/) - Detailed documentation \ No newline at end of file diff --git a/developer-guide/sdk-guide/python/speech-to-text.mdx b/developer-guide/sdk-guide/python/speech-to-text.mdx deleted file mode 100644 index 776f135..0000000 --- a/developer-guide/sdk-guide/python/speech-to-text.mdx +++ /dev/null @@ -1,169 +0,0 @@ ---- -title: "Speech-to-Text" -description: "Transcribe audio to text with the Fish Audio Python SDK" -icon: "microphone-lines" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 7f58d0e3f5a776bec6496f58879f50737b57f27bc2e0fb54e0cd37598df71ac8 */} - - - - - -## Prerequisites - - - -## Basic Transcription - 
-Transcribe audio files to text with automatic language detection using [`asr.transcribe()`](/api-reference/sdk/python/resources#transcribe): - - -```python Synchronous focus={6-10} -from fishaudio import FishAudio - -client = FishAudio() - -# Transcribe audio -with open("audio.mp3", "rb") as f: - result = client.asr.transcribe(audio=f.read()) - -print(f"Transcription: {result.text}") -print(f"Duration: {result.duration}ms") -``` - -```python Asynchronous focus={8-12} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Transcribe audio - with open("audio.mp3", "rb") as f: - result = await client.asr.transcribe(audio=f.read()) - - print(f"Transcription: {result.text}") - print(f"Duration: {result.duration}ms") - -asyncio.run(main()) -``` - - -The [`ASRResponse`](/api-reference/sdk/python/types#asrresponse-objects) object contains the full transcription and segment details. - -## Language Specification - -Specify the language for more accurate transcription: - - -```python Synchronous focus={5-11} -from fishaudio import FishAudio - -client = FishAudio() - -# Specify language code -with open("chinese_audio.mp3", "rb") as f: - result = client.asr.transcribe( - audio=f.read(), - language="zh" # Chinese - ) - -print(result.text) -``` - -```python Asynchronous focus={7-13} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Specify language code - with open("chinese_audio.mp3", "rb") as f: - result = await client.asr.transcribe( - audio=f.read(), - language="zh" # Chinese - ) - - print(result.text) - -asyncio.run(main()) -``` - - - -Auto-detection works well for most cases, but specifying the language can improve accuracy, especially for languages with similar phonetics. 
- - -## Segment Timestamps - -Access word-level or phrase-level timestamps: - - -```python Synchronous focus={5-14} -from fishaudio import FishAudio - -client = FishAudio() - -# Transcribe with segments -with open("audio.mp3", "rb") as f: - result = client.asr.transcribe(audio=f.read()) - -# Access full text -print(f"Full text: {result.text}") - -# Iterate through segments -for segment in result.segments: - print(f"[{segment.start}ms - {segment.end}ms]: {segment.text}") -``` - -```python Asynchronous focus={7-16} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Transcribe with segments - with open("audio.mp3", "rb") as f: - result = await client.asr.transcribe(audio=f.read()) - - # Access full text - print(f"Full text: {result.text}") - - # Iterate through segments - for segment in result.segments: - print(f"[{segment.start}ms - {segment.end}ms]: {segment.text}") - -asyncio.run(main()) -``` - - -## Next Steps - - - - Convert transcribed text back to speech - - - - Use transcribed audio for voice cloning - - - - Complete ASR API documentation - - - - Production tips and optimization - - - -## Related Resources - -- [ASR Types Reference](/api-reference/sdk/python/types#asr) - ASR response data structures -- [Error Handling](/api-reference/sdk/python/exceptions) - Exception types and handling diff --git a/developer-guide/sdk-guide/python/text-to-speech.mdx b/developer-guide/sdk-guide/python/text-to-speech.mdx deleted file mode 100644 index 639d70d..0000000 --- a/developer-guide/sdk-guide/python/text-to-speech.mdx +++ /dev/null @@ -1,874 +0,0 @@ ---- -title: "Text-to-Speech" -description: "Generate natural-sounding speech with the Fish Audio Python SDK" -icon: "microphone" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 8449d7e9db3c781cf36b33bb4913e5b883353cb2fde48a7c5ac1670fbbd6feb5 */} - - - - - 
-## Prerequisites - - - -## Understanding TTS Methods - -The SDK provides three methods for text-to-speech generation, each optimized for different use cases: - -| Method | Returns | Best For | -|--------|---------|----------| -| [`convert()`](/api-reference/sdk/python/resources#convert) | Complete audio bytes | Most use cases - simple, gets full audio at once | -| [`stream()`](/api-reference/sdk/python/resources#stream) | `AudioStream` | Chunk-by-chunk processing, memory-efficient transfer | -| [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) | Audio bytes iterator | Real-time streaming with dynamic text (LLM responses, conversational AI) | - - -Use `convert()` for most use cases. Use `stream()` for memory efficiency when handling large files. Use `stream_websocket()` when text is generated dynamically in real-time. - - -## Basic Usage - -Generate speech from text with a single function call: - - -```python Synchronous focus={6-9} -from fishaudio import FishAudio -from fishaudio.utils import save, play - -client = FishAudio() - -# Generate speech (returns bytes) -audio = client.tts.convert(text="Hello, welcome to Fish Audio!") - -# Play or save the audio -play(audio) -save(audio, "output.mp3") -``` - -```python Asynchronous focus={8-11} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import save, play - -async def main(): - client = AsyncFishAudio() - - # Generate speech (returns bytes) - audio = await client.tts.convert(text="Hello, welcome to Fish Audio!") - - # Play or save the audio - play(audio) - save(audio, "output.mp3") - -asyncio.run(main()) -``` - - -## Using Voice Models - -Specify a voice model for consistent voice characteristics: - - -```python Synchronous focus={6-10} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# Use a specific voice -audio = client.tts.convert( - text="This uses a specific voice model", - 
reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian -) -play(audio) -``` - -```python Asynchronous focus={8-12} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Use a specific voice - audio = await client.tts.convert( - text="This uses a specific voice model", - reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian - ) - play(audio) - -asyncio.run(main()) -``` - - -### Finding Voice Models - -Get voice model IDs from the Fish Audio website or programmatically: - - -```python Synchronous focus={5-16} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# List available voices -voices = client.voices.list(language="en", tags="male") - -for voice in voices.items: - print(f"{voice.title}: {voice.id}") - -# Use a voice from the list -audio = client.tts.convert( - text="Generated with discovered voice", - reference_id=voices.items[0].id -) -play(audio) -``` - -```python Asynchronous focus={7-18} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # List available voices - voices = await client.voices.list(language="en", tags="male") - - for voice in voices.items: - print(f"{voice.title}: {voice.id}") - - # Use a voice from the list - audio = await client.tts.convert( - text="Generated with discovered voice", - reference_id=voices.items[0].id - ) - play(audio) - -asyncio.run(main()) -``` - - -Learn more in the [Voice Cloning guide](/developer-guide/sdk-guide/python/voice-cloning). - -## Emotions and Expressions - - -The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. 
- - -Add emotional expressions to make speech more natural: - - -```python Synchronous focus={5-16} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -text = """ -(happy) I'm excited to announce this! -(sad) Unfortunately, it didn't work out. -(angry) This is so frustrating! -(calm) Let me explain the details. -""" - -audio = client.tts.convert( - text=text, - reference_id="933563129e564b19a115bedd57b7406a" # Sarah -) -play(audio) -``` - -```python Asynchronous focus={7-18} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - text = """ - (happy) I'm excited to announce this! - (sad) Unfortunately, it didn't work out. - (angry) This is so frustrating! - (calm) Let me explain the details. - """ - - audio = await client.tts.convert( - text=text, - reference_id="933563129e564b19a115bedd57b7406a" # Sarah - ) - play(audio) - -asyncio.run(main()) -``` - - -See the [Emotion Reference](/api-reference/emotion-reference) for all available emotions and [Fine-grained Control](/developer-guide/core-features/fine-grained-control) for advanced usage. 
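When a script is assembled from data rather than typed by hand, the `(tag)` prefixes can be added programmatically. The sketch below assumes the S1 parenthesis syntax described above; `tag_lines` is a hypothetical helper, not part of the SDK.

```python
from typing import Iterable, Tuple


def tag_lines(lines: Iterable[Tuple[str, str]]) -> str:
    """Join (emotion, sentence) pairs into '(emotion) sentence' lines."""
    return "\n".join(f"({emotion}) {sentence}" for emotion, sentence in lines)


script = tag_lines([
    ("happy", "I'm excited to announce this!"),
    ("sad", "Unfortunately, it didn't work out."),
])
print(script)
```

The resulting string can be passed as the `text` argument in the same way as the hand-written examples above.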
- -## Audio Formats - -Choose the output format based on your needs: - - -```python Synchronous focus={5-21} -from fishaudio import FishAudio - -client = FishAudio() - -# MP3 (default) - good balance of quality and size -audio = client.tts.convert( - text="MP3 format", - format="mp3" -) - -# WAV - uncompressed, highest quality -audio = client.tts.convert( - text="WAV format", - format="wav" -) - -# PCM - raw audio data for streaming -audio = client.tts.convert( - text="PCM format", - format="pcm" -) -``` - -```python Asynchronous focus={7-23} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # MP3 (default) - good balance of quality and size - audio = await client.tts.convert( - text="MP3 format", - format="mp3" - ) - - # WAV - uncompressed, highest quality - audio = await client.tts.convert( - text="WAV format", - format="wav" - ) - - # PCM - raw audio data for streaming - audio = await client.tts.convert( - text="PCM format", - format="pcm" - ) - -asyncio.run(main()) -``` - - -## Prosody Control - -Adjust speech speed and volume for natural-sounding output: - - -```python Synchronous focus={6-10} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# Simple speed adjustment -audio = client.tts.convert( - text="This will be spoken faster", - speed=1.5 # 1.5x speed (range: 0.5-2.0) -) -play(audio) -``` - -```python Asynchronous focus={8-12} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Simple speed adjustment - audio = await client.tts.convert( - text="This will be spoken faster", - speed=1.5 # 1.5x speed (range: 0.5-2.0) - ) - play(audio) - -asyncio.run(main()) -``` - - -For combined speed and volume control, use [`TTSConfig`](/api-reference/sdk/python/types#ttsconfig-objects) with [`Prosody`](/api-reference/sdk/python/types#prosody-objects): - - -```python Synchronous 
focus={7-17} -from fishaudio import FishAudio -from fishaudio.types import TTSConfig, Prosody -from fishaudio.utils import play - -client = FishAudio() - -# Configure prosody with TTSConfig -audio = client.tts.convert( - text="Adjusted speech with custom speed and volume", - config=TTSConfig( - prosody=Prosody( - speed=1.2, # 20% faster - volume=5 # Louder (range: -20 to 20) - ) - ) -) -play(audio) -``` - -```python Asynchronous focus={9-19} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import TTSConfig, Prosody -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Configure prosody with TTSConfig - audio = await client.tts.convert( - text="Adjusted speech with custom speed and volume", - config=TTSConfig( - prosody=Prosody( - speed=1.2, # 20% faster - volume=5 # Louder (range: -20 to 20) - ) - ) - ) - play(audio) - -asyncio.run(main()) -``` - - -## Reusable TTS Configuration - -Create a configuration once and reuse it across multiple generations: - - -```python Synchronous focus={5-18} -from fishaudio import FishAudio -from fishaudio.types import TTSConfig, Prosody - -client = FishAudio() - -# Define config once -my_config = TTSConfig( - prosody=Prosody(speed=1.2, volume=-5), - reference_id="bf322df2096a46f18c579d0baa36f41d", # Adrian - format="wav", - latency="balanced" -) - -# Reuse across multiple generations -audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config) -audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config) -audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) -``` - -```python Asynchronous focus={7-20} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import TTSConfig, Prosody - -async def main(): - client = AsyncFishAudio() - - # Define config once - my_config = TTSConfig( - prosody=Prosody(speed=1.2, volume=-5), - 
reference_id="bf322df2096a46f18c579d0baa36f41d", # Adrian - format="wav", - latency="balanced" - ) - - # Reuse across multiple generations - audio1 = await client.tts.convert(text="Welcome to our product demonstration.", config=my_config) - audio2 = await client.tts.convert(text="Let me show you the key features.", config=my_config) - audio3 = await client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) - -asyncio.run(main()) -``` - - -## Chunk-by-Chunk Streaming - -Use `stream()` for memory-efficient transfer and progressive download. Chunks are network transmission units (not semantic audio segments): - - -```python Synchronous focus={5-8} -from fishaudio import FishAudio - -client = FishAudio() - -# Collect all chunks efficiently -audio_stream = client.tts.stream(text="Long text here") -audio = audio_stream.collect() # Returns complete audio as bytes -``` - -```python Asynchronous focus={7-10} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Collect all chunks efficiently - audio_stream = await client.tts.stream(text="Long text here") - audio = await audio_stream.collect() # Returns complete audio as bytes - -asyncio.run(main()) -``` - - -For streaming to files or network without buffering in memory: - - -```python Synchronous focus={5-9} -from fishaudio import FishAudio - -client = FishAudio() - -# Stream directly to file (memory efficient for large audio) -audio_stream = client.tts.stream(text="Very long text...") -with open("output.mp3", "wb") as f: - for chunk in audio_stream: - f.write(chunk) # Write each chunk as it arrives -``` - -```python Asynchronous focus={7-11} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Stream directly to file (memory efficient for large audio) - audio_stream = await client.tts.stream(text="Very long text...") - with open("output.mp3", "wb") as f: - async for chunk in audio_stream: 
- f.write(chunk) # Write each chunk as it arrives - -asyncio.run(main()) -``` - - - -Use `stream()` when you have complete text upfront. For real-time streaming with dynamically generated text (LLMs, live captions), use `stream_websocket()` instead. - - -## Real-time WebSocket Streaming - -For real-time applications where text is generated dynamically, use [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket). This is perfect for LLM integrations, conversational AI, and live captions: - -### Basic WebSocket Streaming - - -```python Synchronous focus={5-15} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# Stream dynamically generated text -def text_chunks(): - yield "Hello, " - yield "this is " - yield "streaming text!" - -audio_stream = client.tts.stream_websocket( - text_chunks(), - latency="balanced" -) - -play(audio_stream) -``` - -```python Asynchronous focus={7-16} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Stream dynamically generated text - async def text_chunks(): - yield "Hello, " - yield "this is " - yield "streaming text!" - - audio_stream = await client.tts.stream_websocket( - text_chunks(), - latency="balanced" - ) - - play(audio_stream) - -asyncio.run(main()) -``` - - -### Understanding `FlushEvent` - -The [`FlushEvent`](/api-reference/sdk/python/types#flushevent-objects) forces the TTS engine to immediately generate audio from the accumulated text buffer. This is useful when you want to ensure audio is generated at specific points, even if the buffer hasn't reached the optimal chunk size. - - -```python Synchronous focus={6-18} -from fishaudio import FishAudio -from fishaudio.types import FlushEvent - -client = FishAudio() - -# Use FlushEvent to force immediate generation -def text_with_flush(): - yield "This is the first sentence. " - yield "This is the second sentence. 
" - yield FlushEvent() # Force audio generation NOW - yield "This starts a new segment. " - yield "And continues here." - yield FlushEvent() # Force final generation - -audio_stream = client.tts.stream_websocket(text_with_flush()) - -# Process each audio chunk as it arrives -for chunk in audio_stream: - print(f"Received audio chunk: {len(chunk)} bytes") -``` - -```python Asynchronous focus={8-20} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import FlushEvent - -async def main(): - client = AsyncFishAudio() - - # Use FlushEvent to force immediate generation - async def text_with_flush(): - yield "This is the first sentence. " - yield "This is the second sentence. " - yield FlushEvent() # Force audio generation NOW - yield "This starts a new segment. " - yield "And continues here." - yield FlushEvent() # Force final generation - - audio_stream = await client.tts.stream_websocket(text_with_flush()) - - # Process each audio chunk as it arrives - async for chunk in audio_stream: - print(f"Received audio chunk: {len(chunk)} bytes") - -asyncio.run(main()) -``` - - - -Without `FlushEvent`, the engine automatically generates audio when the buffer reaches an optimal size. Use `FlushEvent` to control exactly when audio should be generated, which can reduce perceived latency in interactive applications. - - -### `TextEvent` vs Plain Strings - -You can yield plain strings (recommended for simplicity) or use [`TextEvent`](/api-reference/sdk/python/types#textevent-objects) for explicit control: - - -```python Synchronous focus={6-17} -from fishaudio import FishAudio -from fishaudio.types import TextEvent - -client = FishAudio() - -# Both approaches are equivalent -def text_as_strings(): - yield "Hello, " - yield "world!" 
- -def text_as_events(): - yield TextEvent(text="Hello, ") - yield TextEvent(text="world!") - -# Use whichever style you prefer -audio1 = client.tts.stream_websocket(text_as_strings()) -audio2 = client.tts.stream_websocket(text_as_events()) -``` - -```python Asynchronous focus={8-19} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import TextEvent - -async def main(): - client = AsyncFishAudio() - - # Both approaches are equivalent - async def text_as_strings(): - yield "Hello, " - yield "world!" - - async def text_as_events(): - yield TextEvent(text="Hello, ") - yield TextEvent(text="world!") - - # Use whichever style you prefer - audio1 = await client.tts.stream_websocket(text_as_strings()) - audio2 = await client.tts.stream_websocket(text_as_events()) - -asyncio.run(main()) -``` - - -### LLM Integration Pattern - -WebSocket streaming shines when integrating with LLM streaming responses. The TTS engine acts as an accumulator, buffering text until it has enough to generate natural-sounding audio: - - -```python Synchronous focus={5-19} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# Simulate streaming LLM response -def llm_stream(): - """Simulates text chunks from an LLM""" - tokens = [ - "The ", "weather ", "today ", "is ", "sunny ", - "with ", "clear ", "skies. ", "Perfect ", - "for ", "outdoor ", "activities!" - ] - for token in tokens: - yield token - -# Stream to speech in real-time -audio_stream = client.tts.stream_websocket(llm_stream()) -play(audio_stream) -``` - -```python Asynchronous focus={7-21} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Simulate streaming LLM response - async def llm_stream(): - """Simulates text chunks from an LLM""" - tokens = [ - "The ", "weather ", "today ", "is ", "sunny ", - "with ", "clear ", "skies. ", "Perfect ", - "for ", "outdoor ", "activities!" 
- ] - for token in tokens: - yield token - - # Stream to speech in real-time - audio_stream = await client.tts.stream_websocket(llm_stream()) - play(audio_stream) - -asyncio.run(main()) -``` - - - -The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don't need to manually batch tokens unless you want to force generation at specific points using `FlushEvent`. - - -Learn more in the [WebSocket Streaming guide](/developer-guide/sdk-guide/python/websocket). - -## Advanced Configuration - -Comprehensive `TTSConfig` with all available parameters: - -```python focus={3-24} -from fishaudio.types import TTSConfig, Prosody - -# All TTSConfig parameters -config = TTSConfig( - # Audio output settings - format="mp3", - sample_rate=44100, # Custom sample rate (optional) - mp3_bitrate=192, # 64, 128, or 192 kbps - opus_bitrate=64, # For Opus format: -1000, 24, 32, 48, or 64 - normalize=True, # Normalize audio levels - - # Generation settings - chunk_length=200, # Characters per chunk (100-300) - latency="balanced", # "normal" or "balanced" - - # Voice/style settings - reference_id="bf322df2096a46f18c579d0baa36f41d", # Adrian - prosody=Prosody(speed=1.1, volume=0), - # references=[ReferenceAudio(...)] # For instant cloning - - # Model parameters - temperature=0.7, # Randomness (0.0-1.0) - top_p=0.7 # Token selection (0.0-1.0) -) - -# Use with any client -audio = client.tts.convert(text="Your text here", config=config) -``` - - -`TTSConfig` works the same for both sync and async clients. See [TTSConfig API Reference](/api-reference/sdk/python/types#ttsconfig-objects) for detailed documentation on each parameter and their defaults. 
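The ranges noted in the comments of the configuration example above can be checked on the client before a request is sent. This is a rough sketch under stated assumptions: the limits are copied from those comments and may not match what the service enforces, and `check_tts_settings` is a hypothetical helper that works on a plain dictionary rather than a real `TTSConfig`.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]

# Limits copied from the comments in the example above (assumptions,
# not guarantees about server-side validation).
RANGES: Dict[str, Tuple[Number, Number]] = {
    "chunk_length": (100, 300),
    "temperature": (0.0, 1.0),
    "top_p": (0.0, 1.0),
}
MP3_BITRATES = {64, 128, 192}


def check_tts_settings(settings: Dict[str, Number]) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = settings.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside {low}-{high}")
    bitrate = settings.get("mp3_bitrate")
    if bitrate is not None and bitrate not in MP3_BITRATES:
        problems.append(f"mp3_bitrate={bitrate} is not one of {sorted(MP3_BITRATES)}")
    return problems


ok = check_tts_settings({"chunk_length": 200, "temperature": 0.7, "top_p": 0.7})
bad = check_tts_settings({"chunk_length": 500, "mp3_bitrate": 96})
print(ok)
print(bad)
```

Settings that are absent from the dictionary are skipped, so the check does not assume any defaults.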
- - -## Error Handling - -Handle common TTS errors gracefully: - -```python -from fishaudio import FishAudio -from fishaudio.exceptions import ( - RateLimitError, - ValidationError, - NotFoundError, - FishAudioError -) -import time - -client = FishAudio() - -try: - audio = client.tts.convert( - text="Your text here", - reference_id="voice_id" - ) -except RateLimitError: - print("Rate limit exceeded. Please wait before retrying.") - time.sleep(60) # Wait before retry -except NotFoundError: - print("Voice model not found. Check the reference_id") -except ValidationError as e: - print(f"Invalid request: {e}") -except FishAudioError as e: - print(f"API error: {e}") -``` - -Common exceptions include [`RateLimitError`](/api-reference/sdk/python/exceptions#ratelimiterror-objects), [`ValidationError`](/api-reference/sdk/python/exceptions#validationerror-objects), [`NotFoundError`](/api-reference/sdk/python/exceptions#notfounderror-objects), and [`FishAudioError`](/api-reference/sdk/python/exceptions#fishaudioerror-objects). 
- -## Best Practices - - - -For long texts, adjust `chunk_length` in `TTSConfig`: - -```python -from fishaudio import FishAudio -from fishaudio.types import TTSConfig - -client = FishAudio() - -audio = client.tts.convert( - text="Very long text...", - config=TTSConfig(chunk_length=250) # Larger chunks for efficiency -) -``` - - - -If you generate the same speech repeatedly, cache the results: - -```python -import os -from fishaudio import FishAudio -from fishaudio.utils import save - -client = FishAudio() - -def get_or_generate_speech(text, cache_file): - if os.path.exists(cache_file): - with open(cache_file, "rb") as f: - return f.read() - - audio = client.tts.convert(text=text) - save(audio, cache_file) - return audio -``` - - - -Implement exponential backoff for rate limits: - -```python -from fishaudio import FishAudio -from fishaudio.exceptions import RateLimitError -import time - -client = FishAudio() - -def generate_with_retry(text, max_retries=3): - for attempt in range(max_retries): - try: - return client.tts.convert(text=text) - except RateLimitError as e: - if attempt < max_retries - 1: - time.sleep(2 ** attempt) # Exponential backoff - else: - raise -``` - - - -Balance speed vs quality based on your use case: - -```python -from fishaudio import FishAudio - -client = FishAudio() - -# For real-time applications -audio = client.tts.convert(text="Fast response", latency="balanced") - -# For highest quality -audio = client.tts.convert(text="Best quality", latency="normal") -``` - - - -## Next Steps - - - - Create custom voice models - - - - Real-time audio streaming - - - - Phoneme-level control and paralanguage - - - - Production tips and optimization - - - -## Related Resources - -- [TTS API Reference](/api-reference/sdk/python/resources#tts) - Complete API documentation -- [Audio Formats Guide](/developer-guide/core-features/text-to-speech#audio-formats) - Format comparison -- [Emotion Reference](/api-reference/emotion-reference) - All available emotions 
-- [Utils Reference](/api-reference/sdk/python/utils) - Audio utilities \ No newline at end of file diff --git a/developer-guide/sdk-guide/python/voice-cloning.mdx b/developer-guide/sdk-guide/python/voice-cloning.mdx deleted file mode 100644 index 2ffbe28..0000000 --- a/developer-guide/sdk-guide/python/voice-cloning.mdx +++ /dev/null @@ -1,520 +0,0 @@ ---- -title: "Voice Cloning" -description: "Clone voices and create custom voice models with the Fish Audio Python SDK" -icon: "clone" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 7aeb79d08f833b97cf12bc561736fd35361a6a063f37f6f22671e24b097507cc */} - - - - - -## Prerequisites - - - -## Instant Voice Cloning - -Clone a voice on-the-fly without creating a persistent model using [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects): - - -```python Synchronous focus={6-15} -from fishaudio import FishAudio -from fishaudio.types import ReferenceAudio -from fishaudio.utils import play - -client = FishAudio() - -# Clone from reference audio -with open("reference_voice.wav", "rb") as f: - audio = client.tts.convert( - text="This will sound like the reference voice", - references=[ReferenceAudio( - audio=f.read(), - text="Text spoken in the reference audio" - )] - ) -play(audio) -``` - -```python Asynchronous focus={8-17} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import ReferenceAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Clone from reference audio - with open("reference_voice.wav", "rb") as f: - audio = await client.tts.convert( - text="This will sound like the reference voice", - references=[ReferenceAudio( - audio=f.read(), - text="Text spoken in the reference audio" - )] - ) - play(audio) - -asyncio.run(main()) -``` - - - -Instant voice cloning is perfect for one-time use cases. 
For repeated use of the same voice, create a persistent voice model instead. - - -## Multiple Reference Samples - -Improve voice quality by providing multiple reference samples: - - -```python Synchronous focus={6-21} -from fishaudio import FishAudio -from fishaudio.types import ReferenceAudio -from fishaudio.utils import play - -client = FishAudio() - -# Load multiple reference samples -references = [] -samples = [ - ("sample1.wav", "First sample transcript"), - ("sample2.wav", "Second sample transcript"), - ("sample3.wav", "Third sample transcript") -] - -for audio_file, transcript in samples: - with open(audio_file, "rb") as f: - references.append(ReferenceAudio( - audio=f.read(), - text=transcript - )) - -# Generate with multiple references -audio = client.tts.convert( - text="This voice is trained on multiple samples", - references=references -) -play(audio) -``` - -```python Asynchronous focus={8-23} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import ReferenceAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Load multiple reference samples - references = [] - samples = [ - ("sample1.wav", "First sample transcript"), - ("sample2.wav", "Second sample transcript"), - ("sample3.wav", "Third sample transcript") - ] - - for audio_file, transcript in samples: - with open(audio_file, "rb") as f: - references.append(ReferenceAudio( - audio=f.read(), - text=transcript - )) - - # Generate with multiple references - audio = await client.tts.convert( - text="This voice is trained on multiple samples", - references=references - ) - play(audio) - -asyncio.run(main()) -``` - - -## Creating Persistent Voice Models - -Create a reusable voice model for consistent voice characteristics using [`voices.create()`](/api-reference/sdk/python/resources#create): - - -```python Synchronous focus={5-20} -from fishaudio import FishAudio - -client = FishAudio() - -# Prepare voice samples -voice_samples = [] -with 
open("voice1.wav", "rb") as f1: - voice_samples.append(f1.read()) -with open("voice2.wav", "rb") as f2: - voice_samples.append(f2.read()) - -# Create voice model -voice = client.voices.create( - title="My Custom Voice", - voices=voice_samples, - description="A custom voice for my project", - tags=["custom", "english"], - visibility="private" -) - -print(f"Created voice: {voice.id}") -``` - -```python Asynchronous focus={7-22} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Prepare voice samples - voice_samples = [] - with open("voice1.wav", "rb") as f1: - voice_samples.append(f1.read()) - with open("voice2.wav", "rb") as f2: - voice_samples.append(f2.read()) - - # Create voice model - voice = await client.voices.create( - title="My Custom Voice", - voices=voice_samples, - description="A custom voice for my project", - tags=["custom", "english"], - visibility="private" - ) - - print(f"Created voice: {voice.id}") - -asyncio.run(main()) -``` - - -### With Transcripts - -Providing transcripts is faster and more accurate than automatic transcription. 
When you provide transcripts, the system skips running ASR (speech recognition), resulting in better performance and quality: - - -```python Synchronous focus={5-27} -from fishaudio import FishAudio - -client = FishAudio() - -# Voice samples with transcripts -samples = [ - ("voice1.wav", "This is the first sample"), - ("voice2.wav", "This is the second sample"), - ("voice3.wav", "This is the third sample") -] - -voices = [] -texts = [] - -for audio_file, transcript in samples: - with open(audio_file, "rb") as f: - voices.append(f.read()) - texts.append(transcript) - -# Create voice with transcripts -voice = client.voices.create( - title="High Quality Voice", - voices=voices, - texts=texts, - description="Voice with accurate transcripts", - enhance_audio_quality=True -) - -print(f"Created voice: {voice.id}") -``` - -```python Asynchronous focus={7-29} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Voice samples with transcripts - samples = [ - ("voice1.wav", "This is the first sample"), - ("voice2.wav", "This is the second sample"), - ("voice3.wav", "This is the third sample") - ] - - voices = [] - texts = [] - - for audio_file, transcript in samples: - with open(audio_file, "rb") as f: - voices.append(f.read()) - texts.append(transcript) - - # Create voice with transcripts - voice = await client.voices.create( - title="High Quality Voice", - voices=voices, - texts=texts, - description="Voice with accurate transcripts", - enhance_audio_quality=True - ) - - print(f"Created voice: {voice.id}") - -asyncio.run(main()) -``` - - -### Audio Quality Enhancement - -Enable automatic audio enhancement to clean up noisy reference audio: - -```python -voice = client.voices.create( - title="Enhanced Voice", - voices=voice_samples, - enhance_audio_quality=True # Clean up background noise and normalize levels -) -``` - - -Audio enhancement helps process noisy or lower-quality reference audio. 
If your audio is already clean and well-recorded, this may not provide additional benefit. - - -## Managing Voice Models - -### List Voices - -Discover available voices with filtering using [`voices.list()`](/api-reference/sdk/python/resources#list): - - -```python Synchronous focus={5-11} -from fishaudio import FishAudio - -client = FishAudio() - -# List all voices -voices = client.voices.list(page_size=20) -print(f"Total voices: {voices.total}") - -for voice in voices.items: - print(f"{voice.title}: {voice.id}") -``` - -```python Asynchronous focus={7-13} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # List all voices - voices = await client.voices.list(page_size=20) - print(f"Total voices: {voices.total}") - - for voice in voices.items: - print(f"{voice.title}: {voice.id}") - -asyncio.run(main()) -``` - - -### Filter by Tags and Language - - -```python Synchronous focus={5-21} -from fishaudio import FishAudio - -client = FishAudio() - -# Filter by tags -male_voices = client.voices.list( - tags=["male", "english"], - page_size=10 -) - -# Filter by language -chinese_voices = client.voices.list( - language="zh", - page_size=10 -) - -# Get only your own voices -my_voices = client.voices.list( - self_only=True, - page_size=20 -) -``` - -```python Asynchronous focus={7-23} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Filter by tags - male_voices = await client.voices.list( - tags=["male", "english"], - page_size=10 - ) - - # Filter by language - chinese_voices = await client.voices.list( - language="zh", - page_size=10 - ) - - # Get only your own voices - my_voices = await client.voices.list( - self_only=True, - page_size=20 - ) - -asyncio.run(main()) -``` - - -### Get Voice Details - -Use [`voices.get()`](/api-reference/sdk/python/resources#get) to retrieve voice details: - - -```python Synchronous focus={5-11} -from fishaudio import FishAudio 
- -client = FishAudio() - -# Get specific voice -voice = client.voices.get("bf322df2096a46f18c579d0baa36f41d") # Adrian - -print(f"Title: {voice.title}") -print(f"Description: {voice.description}") -print(f"Tags: {voice.tags}") -print(f"Languages: {voice.languages}") -``` - -```python Asynchronous focus={7-13} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Get specific voice - voice = await client.voices.get("bf322df2096a46f18c579d0baa36f41d") # Adrian - - print(f"Title: {voice.title}") - print(f"Description: {voice.description}") - print(f"Tags: {voice.tags}") - print(f"Languages: {voice.languages}") - -asyncio.run(main()) -``` - - -### Update Voice Metadata - -Update voice information using [`voices.update()`](/api-reference/sdk/python/resources#update): - - -```python Synchronous focus={5-11} -from fishaudio import FishAudio - -client = FishAudio() - -# Update voice information -client.voices.update( - "bf322df2096a46f18c579d0baa36f41d", # Adrian - title="Updated Voice Name", - description="Updated description", - visibility="public", # "public", "unlist", or "private" - tags=["updated", "english", "male"] -) -``` - -```python Asynchronous focus={7-13} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Update voice information - await client.voices.update( - "bf322df2096a46f18c579d0baa36f41d", # Adrian - title="Updated Voice Name", - description="Updated description", - visibility="public", # "public", "unlist", or "private" - tags=["updated", "english", "male"] - ) - -asyncio.run(main()) -``` - - -### Delete Voice - -Remove voice models using [`voices.delete()`](/api-reference/sdk/python/resources#delete): - - -```python Synchronous focus={5-7} -from fishaudio import FishAudio - -client = FishAudio() - -# Delete a voice model -client.voices.delete("bf322df2096a46f18c579d0baa36f41d") # Adrian -print("Voice deleted successfully") -``` - 
-```python Asynchronous focus={7-9} -import asyncio -from fishaudio import AsyncFishAudio - -async def main(): - client = AsyncFishAudio() - - # Delete a voice model - await client.voices.delete("bf322df2096a46f18c579d0baa36f41d") # Adrian - print("Voice deleted successfully") - -asyncio.run(main()) -``` - - - -Deleting a voice is permanent and cannot be undone. Make sure you have backups of any important voice models. - - -## Next Steps - - - - Use cloned voices for speech generation - - - - Stream audio with custom voices in real-time - - - - Complete voice management API documentation - - - - Production tips and optimization strategies - - - -## Related Resources - -- [Voice Types Reference](/api-reference/sdk/python/types#voices) - Voice model data structures -- [Audio Formats Guide](/developer-guide/core-features/text-to-speech#audio-formats) - Supported audio formats -- [Fine-grained Control](/developer-guide/core-features/fine-grained-control) - Advanced voice customization \ No newline at end of file diff --git a/developer-guide/sdk-guide/python/websocket.mdx b/developer-guide/sdk-guide/python/websocket.mdx deleted file mode 100644 index 97b3886..0000000 --- a/developer-guide/sdk-guide/python/websocket.mdx +++ /dev/null @@ -1,217 +0,0 @@ ---- -title: "WebSocket Streaming" -description: "Stream text-to-speech in real-time with WebSocket connections" -icon: "bolt" ---- - -import Prerequisites from '/snippets/prerequisites.mdx'; -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: e1fba201bbf1e887ba6b4783a3435cbdf78ebf9094f5ba29ac495725f19f09c7 */} - - - - - -## Prerequisites - - - -## Overview - -Use [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) for real-time text streaming with LLMs and live captions. The connection automatically buffers incoming text and generates audio as it becomes available. 
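To make the buffering idea in this overview concrete, here is a minimal local sketch of an accumulator that collects small text fragments and releases them at sentence boundaries. It is an illustration under stated assumptions, not how the service or the `fishaudio` package buffers text: the helper name `buffer_until_boundary` and its `min_chars` and `boundaries` parameters are hypothetical and appear nowhere in the SDK.

```python
from typing import Iterable, Iterator


def buffer_until_boundary(
    tokens: Iterable[str],
    min_chars: int = 20,
    boundaries: str = ".!?",
) -> Iterator[str]:
    """Accumulate small text fragments and emit them in larger pieces.

    Hypothetical helper for illustration only: it is not part of the
    fishaudio package and does not describe the service's own buffering.
    """
    pending = ""
    for token in tokens:
        pending += token
        stripped = pending.rstrip()
        # Emit once the buffer is long enough and ends on a sentence boundary.
        if len(stripped) >= min_chars and stripped.endswith(tuple(boundaries)):
            yield pending
            pending = ""
    # Emit whatever is left when the input ends.
    if pending:
        yield pending


# Simulated LLM output arriving a few characters at a time.
tokens = ["The ", "weather ", "today ", "is ", "sunny. ", "Perfect ", "for ", "a ", "walk!"]
chunks = list(buffer_until_boundary(tokens))
print(chunks)
```

The generator yields plain strings, so it could presumably wrap an LLM token stream before that stream is handed to `stream_websocket()`. Since the connection already buffers text on its own, a wrapper like this would only matter if you wanted local control over chunk boundaries.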
- -## Basic Usage - -Stream text chunks and receive audio in real-time: - - -```python Synchronous focus={5-17} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# Define text generator -def text_chunks(): - yield "Hello, " - yield "this is " - yield "real-time " - yield "streaming!" - -# Stream audio via WebSocket -audio_stream = client.tts.stream_websocket( - text_chunks(), - latency="balanced" # Use "balanced" for real-time, "normal" for quality -) - -# Play streamed audio -play(audio_stream) -``` - -```python Asynchronous focus={8-20} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Define async text generator - async def text_chunks(): - yield "Hello, " - yield "this is " - yield "real-time " - yield "streaming!" - - # Stream audio via WebSocket - audio_stream = await client.tts.stream_websocket( - text_chunks(), - latency="balanced" # Use "balanced" for real-time, "normal" for quality - ) - - # Play streamed audio - play(audio_stream) - -asyncio.run(main()) -``` - - - -For details on audio formats, voice selection, and advanced configuration options like `TTSConfig`, see the [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech). - - -## Using FlushEvent - -Force immediate audio generation to create pauses using [`FlushEvent`](/api-reference/sdk/python/types#flushevent-objects): - - -```python Synchronous focus={6-12} -from fishaudio import FishAudio -from fishaudio.types import FlushEvent - -client = FishAudio() - -def text_with_flush(): - yield "First sentence. " - yield "Second sentence. " - yield FlushEvent() # Forces generation NOW - yield "Third sentence." 
- -audio_stream = client.tts.stream_websocket(text_with_flush()) -``` - -```python Asynchronous focus={8-14} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.types import FlushEvent - -async def main(): - client = AsyncFishAudio() - - async def text_with_flush(): - yield "First sentence. " - yield "Second sentence. " - yield FlushEvent() # Forces generation NOW - yield "Third sentence." - - audio_stream = await client.tts.stream_websocket(text_with_flush()) - -asyncio.run(main()) -``` - - - -See [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech#understanding-flushevent) for detailed FlushEvent usage and advanced examples. - - -## LLM Integration - -WebSocket streaming is designed for integrating with LLM streaming responses. The TTS engine automatically buffers incoming text chunks and generates audio when it has enough context for natural speech: - - -```python Synchronous focus={5-21} -from fishaudio import FishAudio -from fishaudio.utils import play - -client = FishAudio() - -# Simulate streaming LLM response -def llm_stream(): - """Simulates text chunks from an LLM.""" - tokens = [ - "The ", "weather ", "today ", "is ", "sunny ", - "with ", "clear ", "skies. ", "Perfect ", - "for ", "outdoor ", "activities!" - ] - for token in tokens: - yield token - -# Stream to speech in real-time -audio_stream = client.tts.stream_websocket( - llm_stream(), - latency="balanced" -) -play(audio_stream) -``` - -```python Asynchronous focus={7-23} -import asyncio -from fishaudio import AsyncFishAudio -from fishaudio.utils import play - -async def main(): - client = AsyncFishAudio() - - # Simulate streaming LLM response - async def llm_stream(): - """Simulates text chunks from an LLM.""" - tokens = [ - "The ", "weather ", "today ", "is ", "sunny ", - "with ", "clear ", "skies. ", "Perfect ", - "for ", "outdoor ", "activities!" 
- ] - for token in tokens: - yield token - - # Stream to speech in real-time - audio_stream = await client.tts.stream_websocket( - llm_stream(), - latency="balanced" - ) - play(audio_stream) - -asyncio.run(main()) -``` - - - -The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don't need to manually batch tokens unless you want to force generation at specific points using `FlushEvent`. - - -## Next Steps - - - - Learn about non-streaming TTS options, audio formats, TextEvent vs plain strings, and advanced configuration - - - - Use custom voices in streams and learn about voice selection - - - - Complete streaming API documentation - - - - Production streaming optimization - - - -## Related Resources - -- [WebSocket Types](/api-reference/sdk/python/types#tts) - TextEvent, FlushEvent, and more -- [Utils Reference](/api-reference/sdk/python/utils) - Audio playback utilities -- [Error Handling](/api-reference/sdk/python/exceptions) - WebSocket exception handling -- [Fine-grained Control](/developer-guide/core-features/fine-grained-control) - Advanced speech control diff --git a/developer-guide/sdk-guide/quickstart.mdx b/developer-guide/sdk-guide/quickstart.mdx new file mode 100644 index 0000000..87688d3 --- /dev/null +++ b/developer-guide/sdk-guide/quickstart.mdx @@ -0,0 +1,98 @@ +--- +title: "SDK Quickstart" +description: "Install a Fish Audio SDK, authenticate, and generate your first audio in under a minute" +icon: "rocket" +--- + +import Prerequisites from "/snippets/prerequisites.mdx"; + +The fastest path from zero to playable audio with the official Fish Audio SDKs. By the end you'll have a script that turns text into an MP3. + + + The Python SDK is the recommended starting point and is fully covered below. + The JavaScript SDK is in early release — see the [JavaScript SDK + guide](/api-reference/sdk/javascript/api-reference) for its current + surface. + + +## 1. 
Install + + +```bash Python +pip install fish-audio-sdk + +# optional: local audio playback (needs ffmpeg) +pip install "fish-audio-sdk[utils]" +``` + +```bash JavaScript +npm install fish-audio +``` + + +## 2. Authenticate + + + +Both SDKs read your key from the `FISH_API_KEY` environment variable: + +```bash +export FISH_API_KEY=your_api_key_here +``` + +## 3. Generate your first audio + + +```python Python +from fishaudio import FishAudio +from fishaudio.utils import save + +client = FishAudio() # reads FISH_API_KEY + +audio = client.tts.convert(text="Hello from Fish Audio!") +save(audio, "output.mp3") +``` + +```typescript JavaScript +import { FishAudioClient, play } from "fish-audio"; + +const client = new FishAudioClient(); // reads FISH_API_KEY +const audio = await client.textToSpeech.convert({ text: "Hello from Fish Audio!" }); +await play(audio); +``` + + +Run it, and you'll have `output.mp3` (Python) or local playback (JavaScript). That's it — you're generating speech. + + + Want async in Python? Every method mirrors onto `AsyncFishAudio`: `async with + AsyncFishAudio() as client: audio = await client.tts.convert(text="...")`. + + +## Next steps + + + + Voices, formats, prosody, and model selection + + + + Instant cloning and persistent voice models + + + + Stream LLM tokens to speech as they arrive + + + + Exception types, retries, and timeouts + + + + Task-focused recipes + + + + Full Python SDK reference + + diff --git a/developer-guide/self-hosting/running-inference.mdx b/developer-guide/self-hosting/running-inference.mdx index fa2a28e..df0a412 100644 --- a/developer-guide/self-hosting/running-inference.mdx +++ b/developer-guide/self-hosting/running-inference.mdx @@ -371,7 +371,7 @@ python fish_speech/models/text2semantic/inference.py \ Emotion control is currently supported for English, Chinese, and Japanese. More languages coming soon! -For more details, see the [Emotion Reference](/api-reference/emotion-reference). 
+For more details, see the [Emotion Control guide](/developer-guide/core-features/emotions). ## Troubleshooting diff --git a/developer-guide/tutorials/tutorials.mdx b/developer-guide/tutorials/tutorials.mdx deleted file mode 100644 index 74cb5d9..0000000 --- a/developer-guide/tutorials/tutorials.mdx +++ /dev/null @@ -1,32 +0,0 @@ ---- -title: "Tutorials & Examples" -description: "Step-by-step guides and code examples for Fish Audio features" -icon: "book-open" ---- -import { AudioTranscript } from '/snippets/audio-transcript.jsx'; - -{/* speak-mintlify-hash: 51595017aedb8f8987dee17fe84b729ed2501d3b4b1fe65ef658e0cd2d9a4eaa */} - - - - - - -Coming soon! We're preparing comprehensive tutorials and examples to help you get the most out of Fish Audio. - - -We're working on tutorials for: -- Building your first TTS application -- Creating custom voice models -- Implementing real-time streaming -- Building interactive voice applications -- Advanced emotion and prosody control -- Multi-speaker conversations - -In the meantime, check out: -- [Quickstart Guide](/developer-guide/getting-started/quickstart) for getting started -- [Python SDK Examples](/developer-guide/sdk-guide/python/text-to-speech) for code samples -- [JavaScript SDK Examples](/developer-guide/sdk-guide/javascript/text-to-speech) for code samples -- [Guide and Best Practices](/developer-guide/core-features/text-to-speech) for optimization tips - -Join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates and community examples. 
\ No newline at end of file diff --git a/docs.json b/docs.json index 7b501ab..4254322 100644 --- a/docs.json +++ b/docs.json @@ -12,16 +12,51 @@ "navigation": { "tabs": [ { - "tab": "Docs", + "tab": "Overview", "groups": [ { - "group": "Getting Started", + "group": "Get Started", "pages": [ - "developer-guide/getting-started/introduction", + "overview/capabilities", + "developer-guide/getting-started/api-key", "developer-guide/getting-started/quickstart", + "developer-guide/resources/agent-quickstart", "developer-guide/getting-started/changelog" ] }, + { + "group": "Core Features", + "pages": [ + { + "group": "Text to Speech", + "icon": "microphone", + "pages": [ + "features/text-to-speech", + "developer-guide/core-features/emotions", + { + "group": "Fine-grained Control", + "icon": "sliders", + "pages": [ + "developer-guide/core-features/fine-grained-control", + "developer-guide/core-features/fine-grained-control/english", + "developer-guide/core-features/fine-grained-control/chinese", + "developer-guide/core-features/fine-grained-control/japanese" + ] + } + ] + }, + "features/speech-to-text", + "features/voice-cloning", + "features/realtime-streaming", + "features/manage-voices" + ] + }, + { + "group": "Platform (Web App)", + "pages": [ + "overview/platform" + ] + }, { "group": "Models & Pricing", "pages": [ @@ -30,51 +65,57 @@ "developer-guide/models-pricing/deprecations", "developer-guide/models-pricing/pricing-and-rate-limits" ] + } + ] + }, + { + "tab": "Resources", + "groups": [ + { + "group": "Set Up the SDK", + "pages": [ + "developer-guide/sdk-guide/quickstart", + "developer-guide/sdk-guide/python/authentication", + "developer-guide/sdk-guide/python/errors" + ] }, { - "group": "Core Features", + "group": "Cookbook", + "icon": "book", "pages": [ - "developer-guide/core-features/text-to-speech", - "developer-guide/core-features/emotions", { - "group": "Fine-grained Control", - "icon": "sliders", + "group": "Text to Speech", "pages": [ - 
"developer-guide/core-features/fine-grained-control", - "developer-guide/core-features/fine-grained-control/english", - "developer-guide/core-features/fine-grained-control/chinese", - "developer-guide/core-features/fine-grained-control/japanese" + "developer-guide/sdk-guide/cookbook/streaming-to-file", + "developer-guide/sdk-guide/cookbook/telephony-8khz-audio" ] }, - "developer-guide/core-features/creating-models", - "developer-guide/core-features/speech-to-text" - ] - }, - { - "group": "Developer SDKs", - "pages": [ { - "group": "Python SDK", - "icon": "python", + "group": "Speech to Text", "pages": [ - "developer-guide/sdk-guide/python/overview", - "developer-guide/sdk-guide/python/authentication", - "developer-guide/sdk-guide/python/text-to-speech", - "developer-guide/sdk-guide/python/voice-cloning", - "developer-guide/sdk-guide/python/speech-to-text", - "developer-guide/sdk-guide/python/websocket" + "developer-guide/sdk-guide/cookbook/transcribe-to-captions", + "developer-guide/sdk-guide/cookbook/batch-transcribe-with-language-hint" ] }, { - "group": "JavaScript SDK", - "icon": "js", + "group": "Voice Cloning", + "pages": [ + "developer-guide/sdk-guide/cookbook/instant-voice-cloning", + "developer-guide/sdk-guide/cookbook/clone-and-wait-until-ready", + "developer-guide/sdk-guide/cookbook/oneshot-vs-persistent-cloning" + ] + }, + { + "group": "Realtime Streaming", "pages": [ - "developer-guide/sdk-guide/javascript/installation", - "developer-guide/sdk-guide/javascript/authentication", - "developer-guide/sdk-guide/javascript/text-to-speech", - "developer-guide/sdk-guide/javascript/voice-cloning", - "developer-guide/sdk-guide/javascript/speech-to-text", - "developer-guide/sdk-guide/javascript/websocket" + "developer-guide/sdk-guide/cookbook/realtime-llm-to-speech", + "developer-guide/sdk-guide/cookbook/voice-agent-loop" + ] + }, + { + "group": "Manage Voices", + "pages": [ + "developer-guide/sdk-guide/cookbook/discover-library-voice" ] } ] @@ -88,11 +129,11 @@ ] 
}, { - "group": "Product Guides", + "group": "Integrations", "pages": [ - "developer-guide/products/tts", - "developer-guide/products/voice-cloning", - "developer-guide/products/story-studio" + "developer-guide/integrations/pipecat", + "developer-guide/integrations/livekit", + "developer-guide/integrations/n8n" ] }, { @@ -104,36 +145,9 @@ ] }, { - "group": "Advanced Features", - "pages": [] - }, - { - "group": "Integrations", - "pages": [ - "developer-guide/integrations/pipecat", - "developer-guide/integrations/livekit", - "developer-guide/integrations/n8n" - ] - }, - { - "group": "Best Practices", - "pages": [] - }, - { - "group": "Safety & Ethics", - "pages": [] - }, - { - "group": "Tutorials", - "pages": ["developer-guide/tutorials/tutorials"] - }, - { - "group": "Resources", + "group": "More", "pages": [ - "developer-guide/resources/migration", - "developer-guide/resources/agent-quickstart", "developer-guide/resources/coding-agents", - "developer-guide/resources/roadmap", "contributing", "developer-guide/resources/brand" ] @@ -147,7 +161,7 @@ "group": "API Reference", "pages": [ "api-reference/introduction", - "api-reference/emotion-reference" + "api-reference/errors" ] }, { @@ -185,29 +199,36 @@ ] }, { - "group": "Python SDK", - "icon": "python", + "group": "SDK Reference", "pages": [ - "api-reference/sdk/python/overview", { - "group": "Reference", - "icon": "book-open", + "group": "Python SDK", + "icon": "python", "pages": [ - "api-reference/sdk/python/client", - "api-reference/sdk/python/resources", - "api-reference/sdk/python/types", - "api-reference/sdk/python/core", - "api-reference/sdk/python/utils", - "api-reference/sdk/python/exceptions" + "api-reference/sdk/python/overview", + { + "group": "Reference", + "icon": "book-open", + "pages": [ + "api-reference/sdk/python/client", + "api-reference/sdk/python/resources", + "api-reference/sdk/python/types", + "api-reference/sdk/python/core", + "api-reference/sdk/python/utils", + 
"api-reference/sdk/python/exceptions" + ] + }, + "archive/python-sdk-legacy/index" ] }, - "archive/python-sdk-legacy/index" + { + "group": "JavaScript SDK", + "icon": "js", + "pages": [ + "api-reference/sdk/javascript/api-reference" + ] + } ] - }, - { - "group": "JavaScript SDK", - "icon": "js", - "pages": ["api-reference/sdk/javascript/api-reference"] } ] } @@ -271,7 +292,11 @@ }, { "source": "/roadmap", - "destination": "/developer-guide/resources/roadmap" + "destination": "/developer-guide/getting-started/changelog" + }, + { + "source": "/developer-guide/resources/roadmap", + "destination": "/developer-guide/getting-started/changelog" }, { "source": "/developer-guide/resources/contributing", @@ -303,11 +328,15 @@ }, { "source": "/developer-guide/migration", - "destination": "/developer-guide/resources/migration" + "destination": "/overview/capabilities" }, { "source": "/resources/migration", - "destination": "/developer-guide/resources/migration" + "destination": "/overview/capabilities" + }, + { + "source": "/developer-guide/resources/migration", + "destination": "/overview/capabilities" }, { "source": "/developer-guide/sdks-tools/coding-agents", @@ -335,11 +364,19 @@ }, { "source": "/resources/tutorials", - "destination": "/developer-guide/tutorials/tutorials" + "destination": "/developer-guide/sdk-guide/quickstart" + }, + { + "source": "/developer-guide/tutorials/tutorials", + "destination": "/developer-guide/sdk-guide/quickstart" }, { "source": "/resources/emotion-reference", - "destination": "/api-reference/emotion-reference" + "destination": "/developer-guide/core-features/emotions" + }, + { + "source": "/api-reference/emotion-reference", + "destination": "/developer-guide/core-features/emotions" }, { "source": "/resources/best-practices/text-to-speech", @@ -376,6 +413,78 @@ { "source": "/n8n", "destination": "/developer-guide/integrations/n8n" + }, + { + "source": "/developer-guide/products/tts", + "destination": "/features/text-to-speech" + }, + { + 
"source": "/developer-guide/products/voice-cloning", + "destination": "/features/voice-cloning" + }, + { + "source": "/developer-guide/products/story-studio", + "destination": "/overview/platform" + }, + { + "source": "/developer-guide/core-features/text-to-speech", + "destination": "/features/text-to-speech" + }, + { + "source": "/developer-guide/core-features/speech-to-text", + "destination": "/features/speech-to-text" + }, + { + "source": "/developer-guide/core-features/creating-models", + "destination": "/features/voice-cloning" + }, + { + "source": "/developer-guide/sdk-guide/python/text-to-speech", + "destination": "/features/text-to-speech" + }, + { + "source": "/developer-guide/sdk-guide/python/voice-cloning", + "destination": "/features/voice-cloning" + }, + { + "source": "/developer-guide/sdk-guide/python/speech-to-text", + "destination": "/features/speech-to-text" + }, + { + "source": "/developer-guide/sdk-guide/python/websocket", + "destination": "/features/realtime-streaming" + }, + { + "source": "/developer-guide/sdk-guide/python/overview", + "destination": "/api-reference/sdk/python/overview" + }, + { + "source": "/developer-guide/sdk-guide/javascript/installation", + "destination": "/api-reference/sdk/javascript/api-reference" + }, + { + "source": "/developer-guide/sdk-guide/javascript/authentication", + "destination": "/developer-guide/getting-started/api-key" + }, + { + "source": "/developer-guide/sdk-guide/javascript/text-to-speech", + "destination": "/features/text-to-speech" + }, + { + "source": "/developer-guide/sdk-guide/javascript/speech-to-text", + "destination": "/features/speech-to-text" + }, + { + "source": "/developer-guide/sdk-guide/javascript/voice-cloning", + "destination": "/features/voice-cloning" + }, + { + "source": "/developer-guide/sdk-guide/javascript/websocket", + "destination": "/features/realtime-streaming" + }, + { + "source": "/developer-guide/getting-started/introduction", + "destination": "/overview/capabilities" } ], 
"footer": { diff --git a/features/manage-voices.mdx b/features/manage-voices.mdx new file mode 100644 index 0000000..8915f1f --- /dev/null +++ b/features/manage-voices.mdx @@ -0,0 +1,155 @@ +--- +title: "Manage Voices" +description: "List, inspect, update, and delete your voice models" +icon: "sliders" +--- + +Every voice you [clone](/features/voice-cloning) becomes a model you own. List your library, look up a model's details, rename or re-share it, and delete what you no longer need — all from the API directly, the Python library, or JavaScript. + + + + No code — manage voices in the browser. + + + Every endpoint for voice models. + + + Search and reuse Library voices. + + + +## When to use it + + + + List the voices you've created or saved. + + + Look up the `reference_id` to use in [Text to Speech](/features/text-to-speech). + + + Rename, re-tag, or change a model's visibility. + + + Delete models you no longer use. + + + +## List your voices + +Page through your library. The response carries the `total` count and the `items` for the current page. + + +```python Python +from fishaudio import FishAudio + +client = FishAudio() # reads FISH_API_KEY + +page = client.voices.list(self_only=True, page_size=20) + +print(f"{page.total} voices") +for v in page.items: + print(v.id, v.title, v.state, v.visibility) +``` + +```bash API (curl) +curl "https://api.fish.audio/model?self=true&page_size=20" \ + --header "Authorization: Bearer $FISH_API_KEY" + +# Response: { "total": 42, "items": [ ... ], "has_more": true } +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +const page = await client.voices.search({ page_size: 20 }); + +console.log(`${page.total} voices`); +for (const v of page.items) { + console.log(v._id ?? 
v.id, v.title, v.state, v.visibility); +} +``` + + +## Get, update, and delete + +Use a voice **id** to inspect a single model, change its metadata, or remove it. + + +```python Python +# Inspect one model +voice = client.voices.get("YOUR_VOICE_ID") +print(voice.title, voice.state) + +# Update metadata (only the fields you pass change) +client.voices.update( + "YOUR_VOICE_ID", + title="Updated title", + visibility="unlist", +) + +# Delete +client.voices.delete("YOUR_VOICE_ID") +``` + +```bash API (curl) +# Inspect one model +curl https://api.fish.audio/model/YOUR_VOICE_ID \ + --header "Authorization: Bearer $FISH_API_KEY" + +# Update metadata +curl --request PATCH https://api.fish.audio/model/YOUR_VOICE_ID \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --data '{ "title": "Updated title", "visibility": "unlist" }' + +# Delete +curl --request DELETE https://api.fish.audio/model/YOUR_VOICE_ID \ + --header "Authorization: Bearer $FISH_API_KEY" +``` + + +## Implementation details + +### Filtering and pagination + +Narrow the list with `title`, `tags`, or `language`, and page with `page_size` and `page_number`. Omit `self_only` (API: `self`) to search the public [Voice Library](/overview/platform) instead of just your own models. + + +```python Python +page = client.voices.list( + self_only=True, + title="narration", + page_size=50, + page_number=2, +) +``` + +```bash API (curl) +curl "https://api.fish.audio/model?self=true&title=narration&page_size=50&page_number=2" \ + --header "Authorization: Bearer $FISH_API_KEY" +``` + + +### Visibility + +Switch a model between `private`, `unlist` (shareable link), and `public` (listed in the Voice Library) with `update`. Publishing a model lets anyone use it as a `reference_id`. + +## Going further + + + + Create a new model from audio samples. + + + Use any voice id as `reference_id`. + + + Every endpoint for listing and managing models. 
+ + + The full `voices` resource surface. + + diff --git a/features/realtime-streaming.mdx b/features/realtime-streaming.mdx new file mode 100644 index 0000000..2dd90bc --- /dev/null +++ b/features/realtime-streaming.mdx @@ -0,0 +1,243 @@ +--- +title: "Realtime Streaming" +description: "Stream audio as it generates for the lowest latency" +icon: "bolt" +--- + +Start playing audio before the whole clip is ready. Fish Audio streams speech in chunks, so your users hear the first words in a fraction of a second — essential for voice agents and live narration. Two modes: **HTTP streaming** for text you already have, and **WebSocket** for text that arrives incrementally (like LLM tokens). + + + + The live TTS WebSocket protocol. + + + LLM-to-speech and voice agents. + + + Tuning latency for production. + + + +## When to use it + + + + Conversational AI where time-to-first-audio matters. + + + Speak tokens as your model produces them — no waiting for the full reply. + + + Long-form content that should start playing immediately. + + + Anywhere a few hundred milliseconds of latency is noticeable. + + + +## Stream text you already have + +When you have the full string, stream the audio chunks as they generate and write or play them immediately. 
+ + +```python Python +from fishaudio import FishAudio + +client = FishAudio() # reads FISH_API_KEY + +with open("out.mp3", "wb") as f: + for chunk in client.tts.stream(text="Streaming keeps latency low."): + f.write(chunk) # or send to a speaker / socket as it arrives + +# Or collect the whole stream into one bytes object: +audio = client.tts.stream(text="Streaming keeps latency low.").collect() +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --no-buffer \ + --data '{ "text": "Streaming keeps latency low.", "format": "mp3" }' \ + --output out.mp3 +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { createWriteStream } from "fs"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// convert() returns a ReadableStream — write each chunk the +// moment it arrives instead of waiting for the whole clip. +const stream = await client.textToSpeech.convert( + { text: "Streaming keeps latency low.", format: "mp3" }, + "s2-pro" +); + +const file = createWriteStream("out.mp3"); +for await (const chunk of stream) { + file.write(Buffer.from(chunk)); // or forward to a speaker / socket as it arrives +} +file.end(); +``` + + +`--no-buffer` tells curl to write each chunk as it arrives instead of waiting for the full response. + +## Stream from an LLM + +When text arrives token by token, feed a generator to `stream_websocket`. It opens a WebSocket, sends text as you produce it, and yields audio chunks back — so speech keeps pace with your model. 
+ + +```python Python +from fishaudio import FishAudio +from fishaudio.utils import play + +client = FishAudio() + +def llm_tokens(): + # Replace with your real streaming LLM call + for token in ["The ", "first ", "move ", "sets ", "everything ", "in ", "motion."]: + yield token + +for chunk in client.tts.stream_websocket(llm_tokens(), reference_id="YOUR_VOICE_ID"): + play(chunk) # play each chunk the moment it arrives +``` + +```bash API (WebSocket) +# Token-level streaming uses the WebSocket endpoint, not curl. +# The Python SDK's stream_websocket() handles the protocol for you. +# To build it directly, see the WebSocket reference: +# /api-reference/endpoint/websocket/tts-live +``` + + +## Implementation details + +### Which mode to use + +- **HTTP streaming (`tts.stream`)** — you have the full text up front and want low time-to-first-audio. Simplest option. +- **WebSocket (`tts.stream_websocket`)** — text is still being produced (LLM output, live captions). Lets you start speaking before the sentence is finished. + +### Lower the latency further + +- Use a streaming-friendly format like `mp3` or `pcm`. +- Keep the connection warm for back-to-back generations. +- Pair with a cloned voice via `reference_id` — see [Voice Cloning](/features/voice-cloning). + +## Control where audio generates + +The WebSocket buffers incoming text and generates audio once it has enough context for natural-sounding speech, so you don't need to batch tokens yourself. When you *do* want a clean break — end of a sentence, a deliberate pause, or the end of a turn — yield a `FlushEvent` to force generation immediately. Wrap text in a `TextEvent` if you prefer explicit events over bare strings. + +```python +from fishaudio import FishAudio +from fishaudio.types import TextEvent, FlushEvent + +client = FishAudio() + +def script(): + yield TextEvent(text="First sentence. ") + yield "Second sentence. " + yield FlushEvent() # generate everything buffered so far, now + yield "Third sentence." 
+ +for chunk in client.tts.stream_websocket(script(), reference_id="YOUR_VOICE_ID"): + ... # play or forward each chunk +``` + +## Tune latency vs. quality + +Both streaming paths take a `latency` mode: + +- `latency="balanced"` (default) — lowest time-to-first-audio. Use it for voice agents and live LLM output. +- `latency="normal"` — slightly higher latency, best audio quality. Use it for narration where you can afford a beat. + +```python +for chunk in client.tts.stream_websocket(llm_tokens(), latency="balanced"): + ... +``` + +For finer control, pass a `TTSConfig` with chunk tuning. Smaller chunks emit audio sooner (lower latency); larger chunks give the model more context (smoother prosody): + +```python +from fishaudio.types import TTSConfig + +config = TTSConfig( + latency="balanced", + chunk_length=200, # target tokens per generated chunk + min_chunk_length=100, # don't emit a chunk shorter than this +) + +for chunk in client.tts.stream(text="...", config=config): + ... +``` + +## Stream asynchronously + +For asyncio apps, `AsyncFishAudio` exposes the same streaming methods. `stream_websocket` accepts an async generator, so you can pipe an async LLM client straight into speech. + +```python +import asyncio +from fishaudio import AsyncFishAudio + +async def main(): + client = AsyncFishAudio() + + async def llm_tokens(): + async for token in your_async_llm(): + yield token + + # stream_websocket is an async generator — iterate it, don't await the call + async for chunk in client.tts.stream_websocket( + llm_tokens(), reference_id="YOUR_VOICE_ID", latency="balanced" + ): + ... # play or forward each chunk + +asyncio.run(main()) +``` + +## Direct API (no SDK) + +Token-level streaming runs over the WebSocket endpoint — the SDK's `stream_websocket()` handles framing for you. 
To speak the protocol directly, send MessagePack frames over the socket; the same `application/msgpack` payload format also works for one-shot HTTP streaming, which is faster to serialize than JSON for large reference audio: + +```python +import os +import httpx +import ormsgpack + +payload = {"text": "Streaming keeps latency low.", "format": "mp3", "latency": "balanced"} + +with httpx.stream( + "POST", + "https://api.fish.audio/v1/tts", + headers={ + "Authorization": f"Bearer {os.environ['FISH_API_KEY']}", + "Content-Type": "application/msgpack", + "model": "s2-pro", + }, + content=ormsgpack.packb(payload), +) as r: + for chunk in r.iter_bytes(): + ... # write each chunk as it arrives +``` + +For the full WebSocket frame sequence, see the [live TTS protocol reference](/api-reference/endpoint/websocket/tts-live). + +## Going further + + + + Voices, formats, and prosody for every generation. + + + The live TTS protocol, message by message. + + + Tuning latency for production voice apps. + + + `tts.stream` and `tts.stream_websocket`. + + diff --git a/features/speech-to-text.mdx b/features/speech-to-text.mdx new file mode 100644 index 0000000..455f316 --- /dev/null +++ b/features/speech-to-text.mdx @@ -0,0 +1,212 @@ +--- +title: "Speech to Text" +description: "Transcribe audio to text with per-segment timestamps" +icon: "waveform" +--- + +Turn spoken audio into accurate text — with timed segments — using Fish Audio's ASR model. Send an audio file, get back the transcript, its duration, and timestamped segments. Works the same from the API directly, the Python library, or JavaScript. + + + + No code — upload audio, get a transcript. + + + Every parameter for `POST /v1/asr`. + + + Captions, batch transcription, and more. + + + +## When to use it + + + + Timed segments map straight to SRT/VTT cues. + + + Transcribe recordings for summaries and search. + + + Turn short utterances into text your app can act on. + + + Make audio and video content readable. 
+ + + +## Quick start + +Read an audio file, send the bytes, get the transcript. Choose your implementation: + + +```python Python +from fishaudio import FishAudio + +client = FishAudio() # reads FISH_API_KEY + +with open("speech.wav", "rb") as f: + result = client.asr.transcribe(audio=f.read(), language="en") + +print(result.text) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/asr \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --form audio=@speech.wav \ + --form language=en +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { readFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +const result = await client.speechToText.convert({ + audio: new File([await readFile("speech.wav")], "speech.wav"), + language: "en", +}); + +console.log(result.text); +``` + + +The response gives you the full `text`, the audio `duration` in seconds, and timed `segments`. + +## Read the timestamps + +Each segment carries `start` and `end` times in seconds — ideal for captions. With the API, ask for them explicitly with `ignore_timestamps=false`. + + +```python Python +result = client.asr.transcribe(audio=audio_bytes, language="en", include_timestamps=True) + +print(f"{result.duration:.1f}s total") +for seg in result.segments: + print(f"[{seg.start:6.2f} - {seg.end:6.2f}] {seg.text}") +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/asr \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --form audio=@speech.wav \ + --form language=en \ + --form ignore_timestamps=false | jq '.segments' + +# Each segment: { "text": "One", "start": 0.0, "end": 0.24 } +``` + + + + In the Python SDK, segment timestamps are **on by default** — pass `include_timestamps=False` to skip them. That's the *inverse* of the API/JavaScript flag `ignore_timestamps`. 
+ + +## Implementation details + +### Language + +`language` is optional — Fish Audio auto-detects it when you omit it. Pass an ISO code (`en`, `zh`, `ja`, …) to pin it and improve accuracy on short or noisy clips. + + +```python Python +# Auto-detect +result = client.asr.transcribe(audio=audio_bytes) + +# Pin the language +result = client.asr.transcribe(audio=audio_bytes, language="zh") +``` + +```bash API (curl) +# Omit the form field to auto-detect, or set it explicitly: +curl --request POST https://api.fish.audio/v1/asr \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --form audio=@speech.wav \ + --form language=zh +``` + + +### Input audio + +Common formats work directly — `wav`, `mp3`, `opus`, and more. Send the raw file bytes; no pre-processing required. The endpoint accepts `multipart/form-data` (shown above) or `application/msgpack`. + +### File limits + +One request transcribes one audio file. The endpoint accepts files up to **20 MB** and **60 minutes** long, with a minimum of **1 second** of audio. For longer recordings, split them into chunks and transcribe each, then stitch the segment timestamps back together (offset each chunk's `start`/`end` by where it began in the full recording). + +### Async transcription + +The Python SDK ships an async client with the same surface — useful when you're transcribing many files concurrently or already running inside an event loop. 
Use `AsyncFishAudio` and `await` the call: + +```python +import asyncio +from fishaudio import AsyncFishAudio + +async def main(): + client = AsyncFishAudio() # reads FISH_API_KEY + with open("speech.wav", "rb") as f: + result = await client.asr.transcribe(audio=f.read(), language="en") + print(result.text) + +asyncio.run(main()) +``` + +To run several files in parallel, gather the coroutines: + +```python +import asyncio +from pathlib import Path +from fishaudio import AsyncFishAudio + +async def transcribe_all(paths): + client = AsyncFishAudio() + clips = [Path(p).read_bytes() for p in paths] # read_bytes() closes each file after reading + return await asyncio.gather(*[ + client.asr.transcribe(audio=clip, language="en") for clip in clips + ]) + +for result in asyncio.run(transcribe_all(["speech.wav"])): + print(result.text) +``` + +### Direct API (MessagePack) + +`POST /v1/asr` also accepts a [MessagePack](https://msgpack.org) body instead of multipart form data, the same path the API reference links to for low-overhead, server-side calls. Pack the audio bytes and options into one payload and set `Content-Type: application/msgpack`: + +```python +import os +import httpx +import ormsgpack + +with open("speech.wav", "rb") as f: + audio = f.read() + +payload = {"audio": audio, "language": "en", "ignore_timestamps": False} + +resp = httpx.post( + "https://api.fish.audio/v1/asr", + content=ormsgpack.packb(payload), + headers={ + "Authorization": f"Bearer {os.environ['FISH_API_KEY']}", + "Content-Type": "application/msgpack", + }, +) +resp.raise_for_status() # fail loudly on a non-2xx response instead of parsing an error body +result = resp.json() +print(result["text"]) +``` + +The response shape is identical to the multipart path: `text`, `duration` (seconds), and `segments`. + +## Going further + + + + The reverse direction: text to lifelike audio. + + + Every field and the raw response schema. + + + `asr.transcribe` options and the `ASRResponse` type. 
+ + diff --git a/features/text-to-speech.mdx b/features/text-to-speech.mdx new file mode 100644 index 0000000..0d9a104 --- /dev/null +++ b/features/text-to-speech.mdx @@ -0,0 +1,304 @@ +--- +title: "Text to Speech" +sidebarTitle: "Overview" +description: "Turn text into lifelike speech — use it however you build" +icon: "microphone" +--- + +Generate natural speech from text with the `s2-pro` and `s1` models. Pick a voice, choose a format, and go — from the API directly, the Python library, or JavaScript. + + + + No code — type, pick a voice, generate. + + + Every parameter for `POST /v1/tts`. + + + Ready-made recipes: streaming, telephony, and more. + + + +## When to use it + + + + Audiobooks, explainers, ads, and video narration. + + + Speak an assistant's replies — pair with [streaming](/features/realtime-streaming) for low latency. + + + Read content aloud, phone menus, notifications. + + + Speak in a [cloned voice](/features/voice-cloning) you own. + + + +## Quick start + +Send text, get back audio. Choose your implementation: + + +```python Python +from fishaudio import FishAudio +from fishaudio.utils import save + +client = FishAudio() # reads FISH_API_KEY +audio = client.tts.convert(text="Hello from Fish Audio!") +save(audio, "out.mp3") +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ "text": "Hello from Fish Audio!", "format": "mp3" }' \ + --output out.mp3 +``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { writeFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +// `s2-pro` is passed explicitly (the SDK default is `s1`). +const stream = await client.textToSpeech.convert( + { text: "Hello from Fish Audio!" 
}, + "s2-pro", +); + +const chunks = []; +for await (const chunk of stream) chunks.push(Buffer.from(chunk)); +await writeFile("hello.mp3", Buffer.concat(chunks)); +``` + + +## Use a specific voice + +Pass a **voice model id** (`reference_id`). Find ids in the [Voice Library](/overview/platform) or create your own via [Voice Cloning](/features/voice-cloning). + + +```python Python +audio = client.tts.convert( + text="This uses a specific voice.", + reference_id="802e3bc2b27e49c2995d23ef70e6ac89", +) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ + "text": "This uses a specific voice.", + "reference_id": "802e3bc2b27e49c2995d23ef70e6ac89", + "format": "mp3" + }' \ + --output out.mp3 +``` + + +## Implementation details + +### Models + +- **`s2-pro`** (default) — highest quality, multi-speaker, natural-language expression control. +- **`s1`** — previous generation, `(parenthesis)` emotion tags. + +In the API, select with the `model` request header. In Python, pass `model="s2-pro"`. See [Choosing a Model](/developer-guide/models-pricing/choosing-a-model). + +### Output formats + +`mp3` (default), `wav`, `pcm`, `opus`. Set `format` (and optionally `mp3_bitrate`, `sample_rate`). + + +```python Python +from fishaudio.types import TTSConfig + +audio = client.tts.convert( + text="High quality", + config=TTSConfig(format="wav", sample_rate=44100), +) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ "text": "High quality", "format": "wav", "sample_rate": 44100 }' \ + --output out.wav +``` + + +### Speed & prosody + +Adjust speech speed (0.5–2.0) and volume. 
+ + +```python Python +audio = client.tts.convert(text="Speaking faster.", speed=1.5) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ "text": "Speaking faster.", "prosody": { "speed": 1.5 } }' \ + --output out.mp3 +``` + + +### Generation methods (Python) + +The Python SDK exposes three ways to generate, depending on whether you have the full text upfront and how you want to consume the audio: + +| Method | Returns | Use it for | +|---|---|---| +| `tts.convert()` | complete audio `bytes` | most cases — you have the text, you want the file | +| `tts.stream()` | `AudioStream` (iterate chunks, or `.collect()`) | memory-efficient transfer of large audio; write chunks to disk as they arrive | +| `tts.stream_websocket()` | iterator of audio `bytes` | text arriving in real time (LLM tokens, live captions) | + +```python +# Memory-efficient: write each chunk as it arrives instead of buffering +audio_stream = client.tts.stream(text="A very long passage...") +with open("out.mp3", "wb") as f: + for chunk in audio_stream: + f.write(chunk) +``` + +For real-time text streaming with `stream_websocket()`, see [Realtime Streaming](/features/realtime-streaming). + +### Instant voice cloning (reference audio) + +Instead of a saved `reference_id`, pass raw audio plus its transcript to clone a voice on the fly — no training step. Best with a clean 10–30s sample. + +```python +from fishaudio.types import ReferenceAudio + +with open("sample.wav", "rb") as f: + audio = client.tts.convert( + text="Spoken in the reference voice.", + references=[ReferenceAudio(audio=f.read(), text="Transcript of the sample.")], + ) +``` + +To reuse a voice across many requests, [clone it once](/features/voice-cloning) and pass the resulting `reference_id` instead. 
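One way to follow the "clone once, reuse" advice is to cache the voice id on disk so the model is only created on the first run. The sketch below is a minimal illustration and not part of the SDK: `load_or_create_voice_id`, the cache file path, and the `create_voice` callback are hypothetical names. In practice `create_voice` would presumably wrap a `client.voices.create(...)` call and return `voice.id`.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_create_voice_id(cache_path: str, create_voice: Callable[[], str]) -> str:
    """Return a cached voice id, creating the voice only when no cache exists.

    `create_voice` is a hypothetical callback supplied by the caller; it is
    assumed to create the voice model once and return its id as a string.
    """
    cache = Path(cache_path)
    if cache.exists():
        # Reuse the id saved by an earlier run instead of cloning again.
        return json.loads(cache.read_text())["voice_id"]

    # First run: create the voice, then remember its id for later runs.
    voice_id = create_voice()
    cache.write_text(json.dumps({"voice_id": voice_id}))
    return voice_id
```

The returned id can then be passed as `reference_id` to `client.tts.convert(...)` on each request, so the reference audio is uploaded only once.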
+ +### Format & bitrate + +Pick a format for your delivery channel, and tune bitrate to trade size against quality: + +| Format | Notes | +|---|---| +| `mp3` (default) | good size/quality balance; set `mp3_bitrate` to `64`, `128`, or `192` | +| `wav` | uncompressed, highest quality; set `sample_rate` (e.g. `44100`) | +| `pcm` | raw samples, no container; suited to low-latency playback and telephony pipelines | +| `opus` | efficient for streaming; bitrate is automatic by default (`opus_bitrate=-1000`), and the API also accepts `24000`, `32000`, `48000`, or `64000` bps | + +```python +from fishaudio.types import TTSConfig + +audio = client.tts.convert( + text="Smaller file, lower bitrate.", + config=TTSConfig(format="mp3", mp3_bitrate=64), +) +``` + +### Latency & chunk length + +`latency` trades stability for speed; `chunk_length` controls how much text the engine batches before it starts generating. + +- `latency="balanced"` (default): lower time-to-first-audio (~300ms). Good for interactive use. +- `latency="normal"`: most stable output, at slightly higher latency. +- `chunk_length` (`100`–`300`, default `200`): smaller chunks start audio sooner; larger chunks are more efficient for long text. + + +```python Python +from fishaudio.types import TTSConfig + +audio = client.tts.convert( + text="Quick, responsive output.", + config=TTSConfig(latency="balanced", chunk_length=150), +) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ "text": "Quick, responsive output.", "latency": "balanced", "chunk_length": 150 }' \ + --output out.mp3 +``` + + +### Direct API (MessagePack) + +`POST /v1/tts` also accepts a MessagePack body (`Content-Type: application/msgpack`), the path the [API reference](/api-reference/endpoint/openapi-v1/text-to-speech) is built around. Use it to send binary reference audio in the request without base64 overhead, or when you don't want the SDK. 
+ +```python +import os +import httpx +import ormsgpack + +payload = {"text": "Hello from the direct API.", "reference_id": "YOUR_VOICE_ID", "format": "mp3"} + +resp = httpx.post( + "https://api.fish.audio/v1/tts", + content=ormsgpack.packb(payload), + headers={ + "Authorization": f"Bearer {os.environ['FISH_API_KEY']}", + "Content-Type": "application/msgpack", + "model": "s2-pro", + }, +) +with open("out.mp3", "wb") as f: + f.write(resp.content) +``` + +The `model` header is required on every request. JSON and MessagePack accept the same fields. + +### Advanced generation tuning + +For finer control, `TTSConfig` exposes the model's sampling parameters. The defaults are well-tuned — reach for these only when you need to dial in determinism or curb artifacts. + +```python +from fishaudio.types import TTSConfig, Prosody + +config = TTSConfig( + prosody=Prosody(speed=1.1, volume=0), + temperature=0.7, # lower = more deterministic + top_p=0.7, + repetition_penalty=1.2, # >1.0 curbs repeated sounds + max_new_tokens=1024, # cap audio length per chunk + normalize=True, # expand numbers/dates for natural reading +) + +audio = client.tts.convert(text="Carefully tuned output.", config=config) +``` + +A `TTSConfig` is reusable — define it once and pass it to many `convert()` calls. See the [full field list](/api-reference/sdk/python/types#ttsconfig-objects) for every parameter and default. + +## Going further + + + + Lowest latency for conversational and live apps. + + + Direct delivery with tags and prosody. + + + Every field, type, and default. + + + `tts.convert` / `stream` / `stream_websocket`. 
+ + diff --git a/features/voice-cloning.mdx b/features/voice-cloning.mdx new file mode 100644 index 0000000..e4fc44f --- /dev/null +++ b/features/voice-cloning.mdx @@ -0,0 +1,221 @@ +--- +title: "Voice Cloning" +description: "Create a custom voice from audio samples, then speak with it" +icon: "clone" +--- + +Build a reusable voice model from your own audio, then use it anywhere you generate speech. You get back a voice **id** — pass it as `reference_id` to [Text to Speech](/features/text-to-speech) and every generation speaks in that voice. Works from the API directly, the Python library, or JavaScript. + + + + No code — clone a voice in the browser. + + + Every field for `POST /model`. + + + Instant clones, training, and reuse. + + + +## When to use it + + + + One consistent voice across product, ads, and IVR. + + + Clone your own voice for narration or assistants. + + + Distinct voices for games, stories, and dialogue. + + + Keep a speaker's identity across languages. + + + +## Quick start + +Send one or more audio samples, get back a voice model. Choose your implementation: + + +```python Python +from fishaudio import FishAudio + +client = FishAudio() # reads FISH_API_KEY + +with open("sample.wav", "rb") as f: + voice = client.voices.create( + title="My Voice", + voices=[f.read()], + description="Cloned from a studio sample", + visibility="private", + ) + +print(voice.id, voice.state) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/model \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --form type=tts \ + --form title="My Voice" \ + --form "description=Cloned from a studio sample" \ + --form visibility=private \ + --form train_mode=fast \ + --form voices=@sample.wav + +# Returns the new model, including its "_id" and "state". 
+``` + +```javascript JavaScript +import { FishAudioClient } from "fish-audio"; +import { readFile } from "fs/promises"; + +const client = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); + +const sample = await readFile("reference.wav"); + +const voice = await client.voices.ivc.create({ + title: "My Voice", + voices: [new File([sample], "reference.wav")], + description: "Cloned from a studio sample", + visibility: "private", +}); + +console.log(voice._id, voice.state); +``` + + +## Use your cloned voice + +Pass the voice **id** as `reference_id` to Text to Speech — exactly like any other voice. + + +```python Python +audio = client.tts.convert( + text="Now I speak in my cloned voice.", + reference_id=voice.id, +) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/v1/tts \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --header "Content-Type: application/json" \ + --header "model: s2-pro" \ + --data '{ "text": "Now I speak in my cloned voice.", "reference_id": "YOUR_VOICE_ID" }' \ + --output out.mp3 +``` + + +## Implementation details + +### Sample quality + +Clean, mono, single-speaker audio gives the best result. A short clip works for a quick clone; a minute or two of clear speech improves fidelity. Avoid background music, reverb, and overlapping voices. + +### Multiple samples + +Pass several clips to capture more range. You can also supply the matching transcripts as `texts` to sharpen pronunciation. + + +```python Python +voice = client.voices.create( + title="My Voice", + voices=[open("a.wav", "rb").read(), open("b.wav", "rb").read()], + texts=["Transcript of clip A.", "Transcript of clip B."], +) +``` + +```bash API (curl) +curl --request POST https://api.fish.audio/model \ + --header "Authorization: Bearer $FISH_API_KEY" \ + --form type=tts \ + --form title="My Voice" \ + --form voices=@a.wav \ + --form voices=@b.wav +``` + + +### Visibility + +Models are `private` by default. 
Set `unlist` for a shareable link, or `public` to publish to the [Voice Library](/overview/platform). You can change this later — see [Manage Voices](/features/manage-voices). + +## Instant vs. persistent clones + +There are two ways to clone: + +- **Persistent model** (above) — train once with `voices.create()`, get back a reusable `id`. Best when you'll use the same voice repeatedly. +- **Instant clone** — pass reference audio inline on each generation with no model to manage. Best for one-off or per-request voices. + +For an instant clone, send the reference audio (and its transcript) directly to Text to Speech via `references` instead of `reference_id`: + +```python Python +from fishaudio import FishAudio +from fishaudio.types import ReferenceAudio + +client = FishAudio() + +with open("reference.wav", "rb") as f: + audio = client.tts.convert( + text="This will sound like the reference voice.", + references=[ReferenceAudio( + audio=f.read(), + text="The exact words spoken in the reference clip.", + )], + ) +``` + +Pass several `ReferenceAudio` entries to capture more range, just as you would with multiple samples in a persistent model. The matching `text` for each clip sharpens pronunciation. + +## Sample audio requirements + +Samples can be `.wav`, `.mp3`, `.m4a`, or `.opus`. Aim for at least 10 seconds per clip; a minute or two of clear, single-speaker speech improves fidelity. + +`enhance_audio_quality` (on by default) removes background noise and normalizes levels before training: + +```python Python +voice = client.voices.create( + title="My Voice", + voices=[open("sample.wav", "rb").read()], + enhance_audio_quality=True, +) +``` + +Leave it on for noisy or lower-quality recordings. If your audio is already clean and studio-grade, turning it off (`enhance_audio_quality=False`) avoids any extra processing. + +## Model state + +A new model reports a `state` field that moves from `created` to `trained` (or `failed`). 
With `train_mode="fast"` (the default) the voice is usable almost immediately, so most clones return already `trained`. + +```python Python +voice = client.voices.create(title="My Voice", voices=[sample]) +print(voice.state) # "trained" +``` + +If a generation rejects the `reference_id`, re-fetch the model and confirm its state before using it in Text to Speech: + +```python Python +voice = client.voices.get(voice.id) +if voice.state == "trained": + audio = client.tts.convert(text="Hello.", reference_id=voice.id) +``` + +## Going further + + + + Use `reference_id` in any generation. + + + List, update, and delete your voice models. + + + Get the most natural results from your samples. + + + Every field for `POST /model`. + + diff --git a/overview/capabilities.mdx b/overview/capabilities.mdx new file mode 100644 index 0000000..57184e8 --- /dev/null +++ b/overview/capabilities.mdx @@ -0,0 +1,83 @@ +--- +title: "Overview" +sidebarTitle: "Overview" +description: "Everything Fish Audio can do — and how to build with it" +icon: "house" +--- + +Fish Audio is a voice AI platform. Every core feature is available three ways: in the [web app](/overview/platform) (no code), through the [REST API](/api-reference/introduction), and via the official [SDK](/developer-guide/sdk-guide/quickstart). + +## Core features + + + + Convert text into lifelike speech with the `s2-pro` and `s1` models. + + + + Transcribe audio to text with per-segment timestamps. + + + + Clone a voice instantly from a clip, or train a persistent model. + + + + Stream audio as it generates — for voice agents and live apps. + + + + List, inspect, update, and delete your voice models. + + + +## Also in the web app + +These run in the browser, no code required — see the [Platform guide](/overview/platform). + + + + Transform existing audio into a different voice. + + + + Produce multi-speaker, long-form audio — audiobooks and narration. + + + + Generate music and cinematic sound effects from a prompt. 
+ + + + Split audio into stems, and related processing utilities. + + + +## Models + +Two text-to-speech models power most capabilities: + +- **`s2-pro`** — the default, highest-quality model, with multi-speaker and natural-language expression control. +- **`s1`** — the previous generation, with `(parenthesis)` emotion tags. + +See [Models Overview](/developer-guide/models-pricing/models-overview) and [Choosing a Model](/developer-guide/models-pricing/choosing-a-model) for the full lineup, languages, and limits. + +## Pick your path + + + + No code — generate audio, clone voices, and produce projects in your browser. + + + + The Python library for your application. + + + + Raw REST and WebSocket endpoints for any language. + + + + Install the Fish Audio skill so your agent writes correct code. + + diff --git a/overview/platform.mdx b/overview/platform.mdx new file mode 100644 index 0000000..464844e --- /dev/null +++ b/overview/platform.mdx @@ -0,0 +1,98 @@ +--- +title: "Platform (Web App)" +description: "Use Fish Audio in your browser — no code required" +icon: "browser" +--- + +The [Fish Audio web app](https://fish.audio/app) gives you every capability without writing code: generate speech, clone voices, produce long-form projects, and manage your account. Sign in at [fish.audio/app](https://fish.audio/app). + + + This page is an orientation map of the web app. Detailed, step-by-step + walkthroughs for each tool are coming. To build with code instead, see the + [SDK Quickstart](/developer-guide/sdk-guide/quickstart) or the [API + Reference](/api-reference/introduction). + + +## Create audio + + + + Type or paste text, pick a voice and model, and generate speech. + + + + Upload audio and transform it into another voice. + + + + Upload audio to transcribe it, with timestamps. + + + + Generate cinematic sound effects from a text prompt. + + + + Generate music from a description. + + + + Split audio into stems (e.g. vocals and background). 
+ + + +## Voices + + + + Create a custom voice from your own audio samples. + + + + Browse and use thousands of community and official voices. + + + + Manage the voice models you've created or saved. + + + +## Produce projects + + + Assemble multi-speaker, long-form audio — audiobooks, dialogue, and narration + — in a project editor. + + +## Library & history + + + + Find, replay, and download everything you've generated. + + + + Your saved audio, models, and collections. + + + +## Account & billing + + + + Create and manage keys for the API and SDKs. + + + + Subscription, credits, and invoices. See [Pricing & + Rate Limits](/developer-guide/models-pricing/pricing-and-rate-limits). + + + + Track your consumption. + + + + Shared workspaces, members, and billing for organizations. + + diff --git a/snippets/support.mdx b/snippets/support.mdx index 31442e0..81d0047 100644 --- a/snippets/support.mdx +++ b/snippets/support.mdx @@ -5,6 +5,6 @@ Need help? Check out these resources: - [API Reference](/api-reference/introduction) - Complete API documentation - [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model - [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech -- [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming +- [Real-time Streaming](/features/realtime-streaming) - WebSocket for real-time streaming - [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community - [Support Email](mailto:support@fish.audio) - Contact our support team \ No newline at end of file diff --git a/tests/.gitignore b/tests/.gitignore new file mode 100644 index 0000000..d3c2c56 --- /dev/null +++ b/tests/.gitignore @@ -0,0 +1,8 @@ +# Python cookbook tests +__pycache__/ +.pytest_cache/ + +# JavaScript tests +js/node_modules/ +js/package-lock.json +js/_runs/ diff --git a/tests/cookbooks/README.md b/tests/cookbooks/README.md new file 
mode 100644 index 0000000..595c10a --- /dev/null +++ b/tests/cookbooks/README.md @@ -0,0 +1,50 @@ +# Cookbook end-to-end tests + +These tests run the **exact code published in the cookbook `.mdx` files** against the live +Fish Audio API, so a recipe can't pass review while showing broken or drifted code. + +How it works: for each recipe, the harness extracts the ` ```python ` block straight from +the `.mdx`, substitutes a few placeholders (``, `play()`, input filenames) with +test fixtures, runs it in an isolated working directory, and asserts it produced valid +audio (magic-byte check). Any voice models a recipe creates are deleted afterward. + +## Run + +```bash +pip install -r tests/cookbooks/requirements.txt +export FISH_API_KEY=... # or rely on a workspace .env / keyfile +pytest tests/cookbooks -v +``` + +If no key is found (env var, `.env`, or local keyfile), the whole suite **skips** rather +than fails — so it's safe in CI without secrets. + +## Add a recipe + +Append a spec to `specs.py` — no test code changes: + +```python +{ + "slug": "my-recipe", + "path": "developer-guide/sdk-guide/cookbook/my-recipe.mdx", + "cases": [ + {"name": "happy path", "block": 0, "file": ("out.mp3", "mp3")}, + ], +} +``` + +Per-case keys: +- `block` — index of the `python` code block in the page (document order). +- `subs` — `{placeholder: replacement}` string substitutions. +- `file` — `(filename, format)` the recipe should write (audio validated by magic bytes; `srt` by the presence of `-->` cues). +- `var` — `(variable, format)` an audio-bytes variable the recipe should define; `var_nonempty` names a bytes variable checked only for being non-empty, and `truthy` names any variable that must be truthy after the recipe runs. +- `consumed` — assert the injected `consume()` drained a non-empty stream. +- `postamble` — extra code run after the block (e.g. to drive a generator the block defines). + +## Tiers + +- **T1 (here):** pure Fish Audio recipes — run fully live. +- **T2:** integration seams (e.g. the `fish-tts` CLI, framework plugins) — test the + Fish-facing component live; the external framework is contract-checked. 
+- **T3:** full external round-trips (Telegram/Discord/Twilio) — staging only; see each + tutorial's manual checklist and any credential-guarded integration test. diff --git a/tests/cookbooks/conftest.py b/tests/cookbooks/conftest.py new file mode 100644 index 0000000..fed02a2 --- /dev/null +++ b/tests/cookbooks/conftest.py @@ -0,0 +1,77 @@ +"""Pytest fixtures: API key, client shims, a generated reference clip, and a clean +working directory seeded with the input files recipes expect. +""" +import os +import sys +from pathlib import Path + +import pytest + +sys.path.insert(0, str(Path(__file__).parent)) +import harness # noqa: E402 + + +@pytest.fixture(scope="session") +def api_key(): + key = harness.resolve_key() + if not key: + pytest.skip("FISH_API_KEY not available (env var, /tmp keyfile, or workspace .env)") + # Recipes import their own `FishAudio`, which reads FISH_API_KEY from the environment — + # export it so the verbatim, unmodified recipe code authenticates. + os.environ["FISH_API_KEY"] = key + return key + + +@pytest.fixture(scope="session") +def shims(api_key): + return harness.make_shims(api_key) + + +@pytest.fixture(scope="session") +def base_client(shims): + fish_audio, _ = shims + return fish_audio() + + +@pytest.fixture(scope="session") +def sample_wav(base_client): + # One short, clean reference clip, generated once and reused as recipe input. + return base_client.tts.convert( + text="Exact transcript of what is said in reference dot wav.", format="wav" + ) + + +@pytest.fixture(scope="session", autouse=True) +def _track_and_cleanup(api_key): + # Patch voices.create at the class level (sync AND async) so EVERY client — including a + # recipe's own verbatim FishAudio()/AsyncFishAudio() — records created models for cleanup. 
+ from fishaudio.resources.voices import AsyncVoicesClient, VoicesClient + + sync_orig = VoicesClient.create + async_orig = AsyncVoicesClient.create + + def sync_create(self, *a, **k): + voice = sync_orig(self, *a, **k) + harness._created_voice_ids.append(voice.id) + return voice + + async def async_create(self, *a, **k): + voice = await async_orig(self, *a, **k) + harness._created_voice_ids.append(voice.id) + return voice + + VoicesClient.create = sync_create + AsyncVoicesClient.create = async_create + yield + VoicesClient.create = sync_orig + AsyncVoicesClient.create = async_orig + harness.cleanup_created(api_key) + + +@pytest.fixture() +def work_cwd(tmp_path, sample_wav, monkeypatch): + # Recipes read/write relative paths; give them an isolated cwd with inputs present. + for name in ("reference.wav", "sample.wav", "speech.wav"): + (tmp_path / name).write_bytes(sample_wav) + monkeypatch.chdir(tmp_path) + return tmp_path diff --git a/tests/cookbooks/extract.py b/tests/cookbooks/extract.py new file mode 100644 index 0000000..df8fb83 --- /dev/null +++ b/tests/cookbooks/extract.py @@ -0,0 +1,36 @@ +"""Extract fenced code blocks from a Mintlify .mdx file. + +The cookbook tests run the *published* code, so we pull the exact ```python blocks +out of the .mdx rather than maintaining a separate copy that can drift. 
+""" +import re +from pathlib import Path + +# Column-0 fences: ```lang [optional label]\n \n``` +_FENCE = re.compile(r"^```([A-Za-z0-9_+-]*)([^\n]*)\n(.*?)^```", re.M | re.S) + + +def code_blocks(mdx_path, lang=None): + text = Path(mdx_path).read_text(encoding="utf-8") + out = [] + for m in _FENCE.finditer(text): + block_lang = m.group(1) + if lang and block_lang != lang: + continue + out.append({ + "lang": block_lang, + "label": m.group(2).strip(), + "code": m.group(3).rstrip("\n") + "\n", + }) + return out + + +def python_blocks(mdx_path): + return code_blocks(mdx_path, "python") + + +if __name__ == "__main__": + import sys + for i, b in enumerate(python_blocks(sys.argv[1])): + print(f"--- block {i} ({b['label'] or 'python'}) ---") + print(b["code"]) diff --git a/tests/cookbooks/harness.py b/tests/cookbooks/harness.py new file mode 100644 index 0000000..0b6c4d3 --- /dev/null +++ b/tests/cookbooks/harness.py @@ -0,0 +1,101 @@ +"""Shared machinery for running cookbook recipes verbatim against the live API. + +Key ideas: +- `make_shims()` returns FishAudio / AsyncFishAudio that inject the API key and (in the + sandbox) the required HTTP proxy, so recipe code that calls `FishAudio()` with no args + runs unchanged in CI, locally, and in the sandbox. +- voice models created by recipes are tracked and deleted in `cleanup_created()`. +""" +import os +from pathlib import Path + +import httpx +from fishaudio import AsyncFishAudio as _AsyncFishAudio +from fishaudio import FishAudio as _FishAudio + +BASE_URL = "https://api.fish.audio" +# Validated public voice ("Energetic Male"), used to fill placeholders. 
+PUBLIC_VOICE = "802e3bc2b27e49c2995d23ef70e6ac89" +_WORKSPACE_ENV = "/Users/shawnlai/project/fish-audio/.env" +_LOCAL_KEYFILE = "/tmp/claude/fishdoctest/fishkey" + +_created_voice_ids = [] + + +def resolve_key(): + k = os.environ.get("FISH_API_KEY") + if k: + return k.strip() + if os.path.isfile(_LOCAL_KEYFILE): + v = Path(_LOCAL_KEYFILE).read_text().strip() + if v: + return v + try: + from dotenv import dotenv_values + for p in (_WORKSPACE_ENV, str(Path.cwd() / ".env")): + v = dotenv_values(p).get("FISH_API_KEY") + if v: + return v.strip() + except Exception: + pass + return None + + +def _proxy(): + return os.environ.get("HTTPS_PROXY") or os.environ.get("HTTP_PROXY") or None + + +def _sync_httpx(): + return httpx.Client(trust_env=False, proxy=_proxy(), base_url=BASE_URL, timeout=240.0) + + +def _async_httpx(): + return httpx.AsyncClient(trust_env=False, proxy=_proxy(), base_url=BASE_URL, timeout=240.0) + + +def make_shims(key): + # Voice-model cleanup is handled by a class-level patch in conftest (so recipes that + # build their own FishAudio() are covered too), not per instance here. 
+ def FishAudio(*a, **k): + k.setdefault("api_key", key) + k.setdefault("httpx_client", _sync_httpx()) + return _FishAudio(*a, **k) + + def AsyncFishAudio(*a, **k): + k.setdefault("api_key", key) + k.setdefault("httpx_client", _async_httpx()) + return _AsyncFishAudio(*a, **k) + + return FishAudio, AsyncFishAudio + + +def cleanup_created(key): + if not _created_voice_ids: + return + c = _FishAudio(api_key=key, httpx_client=_sync_httpx()) + for vid in list(_created_voice_ids): + try: + c.voices.delete(vid) + except Exception: + pass + _created_voice_ids.clear() + + +def sniff(b): + if b[:3] == b"ID3" or (len(b) > 1 and b[0] == 0xFF and (b[1] & 0xE0) == 0xE0): + return "mp3" + if b[:4] == b"RIFF": + return "wav" + if b[:4] == b"OggS": + return "opus" + return "unknown" + + +class Consume: + """Stand-in for hardware playback (`play`) — drains a stream and remembers its size.""" + last_bytes = 0 + + def __call__(self, stream): + chunks = list(stream) + self.last_bytes = sum(len(c) for c in chunks) + return chunks diff --git a/tests/cookbooks/requirements.txt b/tests/cookbooks/requirements.txt new file mode 100644 index 0000000..3b17e31 --- /dev/null +++ b/tests/cookbooks/requirements.txt @@ -0,0 +1,4 @@ +fish-audio-sdk>=1.3.0 +httpx>=0.28 +python-dotenv>=1.0 +pytest>=8.0 diff --git a/tests/cookbooks/specs.py b/tests/cookbooks/specs.py new file mode 100644 index 0000000..3998a61 --- /dev/null +++ b/tests/cookbooks/specs.py @@ -0,0 +1,109 @@ +"""Per-recipe test specs. + +Each case names a code block (by index, in document order) from a cookbook .mdx, the +placeholder substitutions that make it runnable, and what to assert. Adding coverage for +a new cookbook is ~4 lines here — no test code to touch. 
+""" +from harness import PUBLIC_VOICE + +COOKBOOK = "developer-guide/sdk-guide/cookbook" + +SPECS = [ + { + "slug": "streaming-to-file", + "path": f"{COOKBOOK}/streaming-to-file.mdx", + "cases": [ + {"name": "sync stream to file", "block": 0, "file": ("output.mp3", "mp3")}, + {"name": "async stream to file", "block": 1, "file": ("output.mp3", "mp3")}, + {"name": "collect to bytes", "block": 2, "var": ("audio", "mp3")}, + ], + }, + { + "slug": "instant-voice-cloning", + "path": f"{COOKBOOK}/instant-voice-cloning.mdx", + "cases": [ + {"name": "sync ReferenceAudio clone", "block": 0, "file": ("cloned.mp3", "mp3")}, + {"name": "async ReferenceAudio clone", "block": 1, "file": ("cloned.mp3", "mp3")}, + {"name": "reuse via create + reference_id", "block": 2, "var": ("audio", "mp3")}, + ], + }, + { + "slug": "realtime-llm-to-speech", + "path": f"{COOKBOOK}/realtime-llm-to-speech.mdx", + "cases": [ + { + "name": "sync websocket + play", + "block": 0, + "subs": {"": PUBLIC_VOICE, "play(audio_stream)": "consume(audio_stream)"}, + "consumed": True, + }, + {"name": "async websocket to file", "block": 1, "file": ("out.mp3", "mp3")}, + { + "name": "FlushEvent boundary", + "block": 2, + "postamble": "consume(client.tts.stream_websocket(turns()))", + "consumed": True, + }, + ], + }, + # ---- recipes authored by the cookbook workflow (one live-tested primary block each) ---- + { + "slug": "transcribe-to-captions", + "path": f"{COOKBOOK}/transcribe-to-captions.mdx", + "cases": [{"name": "SRT/VTT captions", "block": 0, "file": ("captions.srt", "srt")}], + }, + { + "slug": "batch-transcribe-with-language-hint", + "path": f"{COOKBOOK}/batch-transcribe-with-language-hint.mdx", + "cases": [ + {"name": "batch transcribe (sync)", "block": 0, "truthy": "results"}, + {"name": "batch transcribe (async)", "block": 1}, # runs to completion = pass + ], + }, + { + "slug": "telephony-8khz-audio", + "path": f"{COOKBOOK}/telephony-8khz-audio.mdx", + "cases": [ + {"name": "8 kHz wav (sync)", 
"block": 0, "file": ("out.wav", "wav")}, + {"name": "8 kHz wav (async)", "block": 1, "file": ("out.wav", "wav")}, + {"name": "8 kHz raw pcm", "block": 2, "var_nonempty": "audio"}, + ], + }, + { + "slug": "clone-and-wait-until-ready", + "path": f"{COOKBOOK}/clone-and-wait-until-ready.mdx", + "cases": [ + {"name": "create + poll + synth (sync)", "block": 0, "file": ("out.mp3", "mp3"), + "subs": {"deadline = time.time() + 300 # 5-minute timeout": + "deadline = time.time() + 600 # extended timeout for test"}}, + {"name": "create + poll + synth (async)", "block": 1, "file": ("out.mp3", "mp3")}, + ], + }, + { + "slug": "oneshot-vs-persistent-cloning", + "path": f"{COOKBOOK}/oneshot-vs-persistent-cloning.mdx", + "cases": [ + {"name": "one-shot ReferenceAudio (sync)", "block": 0, "file": ("oneshot.mp3", "mp3")}, + {"name": "one-shot ReferenceAudio (async)", "block": 1, "file": ("oneshot.mp3", "mp3")}, + {"name": "persistent create + reuse", "block": 2, "file": ("persistent.mp3", "mp3")}, + {"name": "reuse known id", "block": 3, "var": ("audio", "mp3"), + "subs": {"": PUBLIC_VOICE}}, + ], + }, + { + "slug": "discover-library-voice", + "path": f"{COOKBOOK}/discover-library-voice.mdx", + "cases": [{"name": "library search + synth", "block": 0, "file": ("out.mp3", "mp3"), + "subs": {"": PUBLIC_VOICE}}], + }, + { + "slug": "voice-agent-loop", + "path": f"{COOKBOOK}/voice-agent-loop.mdx", + "cases": [ + {"name": "asr -> reply -> tts (sync)", "block": 0, "file": ("reply.mp3", "mp3"), + "subs": {'""': f'"{PUBLIC_VOICE}"'}}, + {"name": "asr -> reply -> tts (async)", "block": 1, "file": ("reply.mp3", "mp3"), + "subs": {'""': f'"{PUBLIC_VOICE}"'}}, + ], + }, +] diff --git a/tests/cookbooks/test_cookbooks.py b/tests/cookbooks/test_cookbooks.py new file mode 100644 index 0000000..c46b706 --- /dev/null +++ b/tests/cookbooks/test_cookbooks.py @@ -0,0 +1,76 @@ +"""Run every cookbook recipe verbatim against the live Fish Audio API. 
+ +For each case: extract the exact code block from the .mdx, apply the spec's placeholder +substitutions, execute it in an isolated cwd, and assert it produced valid audio. +""" +import sys +from pathlib import Path + +import pytest + +sys.path.insert(0, str(Path(__file__).parent)) +import harness # noqa: E402 +from extract import python_blocks # noqa: E402 +from specs import SPECS # noqa: E402 + +ROOT = Path(__file__).resolve().parents[2] # fish-docs/ + + +def _cases(): + for spec in SPECS: + for case in spec["cases"]: + yield pytest.param(spec, case, id=f"{spec['slug']}::{case['name']}") + + +@pytest.mark.parametrize("spec,case", list(_cases())) +def test_cookbook_recipe(spec, case, shims, base_client, work_cwd): + fish_audio, async_fish_audio = shims + blocks = python_blocks(ROOT / spec["path"]) + assert case["block"] < len(blocks), f"{spec['slug']}: block {case['block']} out of range" + + code = blocks[case["block"]]["code"] + for old, new in case.get("subs", {}).items(): + code = code.replace(old, new) + + # Continuation snippets assume types imported by an earlier block on the page; provide them. 
+ from fishaudio.types import ReferenceAudio, TTSConfig + from fishaudio.utils import save + + consume = harness.Consume() + ns = { + "__name__": "__cookbook__", + "FishAudio": fish_audio, + "AsyncFishAudio": async_fish_audio, + "client": base_client, + "consume": consume, + "TTSConfig": TTSConfig, + "ReferenceAudio": ReferenceAudio, + "save": save, + } + exec(compile(code, spec["path"], "exec"), ns) + if case.get("postamble"): + exec(compile(case["postamble"], "", "exec"), ns) + + if "file" in case: + name, fmt = case["file"] + path = work_cwd / name + assert path.exists(), f"{name} was not created" + if fmt == "srt": + text = path.read_text(encoding="utf-8") + assert "-->" in text, f"{name} has no SRT/VTT cues" + else: + data = path.read_bytes() + assert data, f"{name} is empty" + assert harness.sniff(data) == fmt, f"{name}: expected {fmt}, got {harness.sniff(data)}" + if "var" in case: + name, fmt = case["var"] + val = ns.get(name) + assert isinstance(val, (bytes, bytearray)) and val, f"`{name}` is not audio bytes" + assert harness.sniff(val) == fmt, f"`{name}`: expected {fmt}, got {harness.sniff(val)}" + if "var_nonempty" in case: + val = ns.get(case["var_nonempty"]) + assert isinstance(val, (bytes, bytearray)) and val, f"`{case['var_nonempty']}` is not non-empty bytes" + if "truthy" in case: + assert ns.get(case["truthy"]), f"`{case['truthy']}` is empty/falsy after the recipe" + if case.get("consumed"): + assert consume.last_bytes > 0, "no audio was produced by the stream" diff --git a/tests/js/package.json b/tests/js/package.json new file mode 100644 index 0000000..55b6ea2 --- /dev/null +++ b/tests/js/package.json @@ -0,0 +1,13 @@ +{ + "name": "fish-docs-js-tests", + "version": "1.0.0", + "private": true, + "type": "module", + "description": "End-to-end tests that run the published JavaScript examples from the docs against the live Fish Audio API.", + "scripts": { + "test": "node run.mjs" + }, + "dependencies": { + "fish-audio": "^0.1.0" + } +} diff --git 
a/tests/js/run.mjs b/tests/js/run.mjs new file mode 100644 index 0000000..732c46b --- /dev/null +++ b/tests/js/run.mjs @@ -0,0 +1,93 @@ +// End-to-end runner for the JavaScript examples in the docs. +// Extracts ```javascript blocks straight from the .mdx, substitutes placeholders, runs each +// verbatim in Node against the live Fish Audio API, and asserts it produced valid output. +// +// Key: $FISH_API_KEY, or the workspace .env. Skips (exit 0) if no key — safe for CI. +import { execFileSync } from "node:child_process"; +import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from "node:fs"; +import { dirname, join, resolve } from "node:path"; +import { fileURLToPath } from "node:url"; + +import { FishAudioClient } from "fish-audio"; +import { SPECS } from "./specs.mjs"; + +const HERE = dirname(fileURLToPath(import.meta.url)); +const FISH = resolve(HERE, "../.."); +const WS = resolve(FISH, ".."); +const PUBLIC_VOICE = "802e3bc2b27e49c2995d23ef70e6ac89"; + +function resolveKey() { + if (process.env.FISH_API_KEY) return process.env.FISH_API_KEY.trim(); + try { + const m = readFileSync(join(WS, ".env"), "utf8").match(/^\s*(?:export\s+)?FISH_API_KEY\s*=\s*(.+)$/m); + if (m) return m[1].trim().replace(/^["']|["']$/g, ""); + } catch {} + return null; +} +const KEY = resolveKey(); +if (!KEY) { console.log("SKIP: FISH_API_KEY not found (env or workspace .env)"); process.exit(0); } +process.env.FISH_API_KEY = KEY; + +const client = new FishAudioClient({ apiKey: KEY }); +const toBytes = async (s) => { const c = []; for await (const x of s) c.push(Buffer.from(x)); return Buffer.concat(c); }; +const sniff = (b) => + (b.slice(0, 3).toString() === "ID3" || (b[0] === 0xff && (b[1] & 0xe0) === 0xe0)) ? "mp3" + : b.slice(0, 4).toString() === "RIFF" ? "wav" + : b.slice(0, 4).toString() === "OggS" ? 
"opus" : "?"; + +function jsBlocks(mdxRel) { + const src = readFileSync(join(FISH, mdxRel), "utf8"); + const re = /```javascript[^\n]*\n([\s\S]*?)```/g; + const out = []; let m; + while ((m = re.exec(src))) out.push(m[1]); + return out; +} + +// One reference clip, reused as recipe input (speech.wav / reference.wav / sample.wav). +const sampleWav = await toBytes( + await client.textToSpeech.convert({ text: "A sample clip for testing.", format: "wav" }, "s2-pro") +); + +const RUNS = join(HERE, "_runs"); +rmSync(RUNS, { recursive: true, force: true }); +mkdirSync(RUNS, { recursive: true }); + +let pass = 0, fail = 0; +for (const spec of SPECS) { + const blocks = jsBlocks(spec.mdx); + for (const c of spec.cases) { + let code = blocks[c.block]; + if (code === undefined) { console.log(`FAIL ${spec.slug}::${c.name}\n block ${c.block} not found`); fail++; continue; } + for (const [k, v] of Object.entries(c.subs || {})) code = code.split(k).join(v.replace("", PUBLIC_VOICE)); + const dir = join(RUNS, `${spec.slug}-${c.block}`); + mkdirSync(dir, { recursive: true }); + for (const f of ["speech.wav", "reference.wav", "sample.wav"]) writeFileSync(join(dir, f), sampleWav); + writeFileSync(join(dir, "run.mjs"), code); + let ok = true, err = ""; + try { + execFileSync(process.execPath, ["run.mjs"], { cwd: dir, env: { ...process.env }, stdio: "pipe", timeout: 180000 }); + } catch (e) { ok = false; err = (e.stderr?.toString() || e.message).trim().split("\n").slice(-3).join(" | "); } + if (ok && c.file) { + const p = join(dir, c.file[0]); + if (!existsSync(p)) { ok = false; err = `${c.file[0]} not written`; } + else { + const b = readFileSync(p); + if (c.file[1] === "srt") { if (!b.toString().includes("-->")) { ok = false; err = `${c.file[0]} has no SRT cues`; } } + else if (c.file[1] !== "any" && sniff(b) !== c.file[1]) { ok = false; err = `${c.file[0]} is ${sniff(b)} not ${c.file[1]}`; } + } + } + console.log(`${ok ? "PASS" : "FAIL"} ${spec.slug}::${c.name}` + (ok ? 
"" : `\n ${err}`)); + ok ? pass++ : fail++; + } +} + +// Best-effort cleanup of any throwaway voices the examples created. +try { + const page = await client.voices.search({ self: true, page_size: 50 }); + for (const v of (page.items || [])) { + if (/zzz|my voice|my narrator/i.test(v.title || "")) { try { await client.voices.delete(v._id || v.id); } catch {} } + } +} catch {} + +console.log(`\n=== ${pass}/${pass + fail} JS blocks passed ===`); +process.exit(fail ? 1 : 0); diff --git a/tests/js/specs.mjs b/tests/js/specs.mjs new file mode 100644 index 0000000..10ab859 --- /dev/null +++ b/tests/js/specs.mjs @@ -0,0 +1,17 @@ +// Auto-generated by _integrate_js.py — per-page JS test specs. +export const SPECS = [ + { slug: "features/text-to-speech", mdx: "features/text-to-speech.mdx", cases: [{ name: "quick start", block: 0, file: ["hello.mp3", "mp3"] }] }, + { slug: "features/speech-to-text", mdx: "features/speech-to-text.mdx", cases: [{ name: "primary", block: 0 }] }, + { slug: "features/voice-cloning", mdx: "features/voice-cloning.mdx", cases: [{ name: "primary", block: 0 }] }, + { slug: "manage-voices", mdx: "features/manage-voices.mdx", cases: [{ name: "primary", block: 0 }] }, + { slug: "realtime-streaming", mdx: "features/realtime-streaming.mdx", cases: [{ name: "primary", block: 0, file: ["out.mp3", "mp3"] }] }, + { slug: "streaming-to-file", mdx: "developer-guide/sdk-guide/cookbook/streaming-to-file.mdx", cases: [{ name: "primary", block: 0, file: ["output.mp3", "mp3"] }] }, + { slug: "instant-voice-cloning", mdx: "developer-guide/sdk-guide/cookbook/instant-voice-cloning.mdx", cases: [{ name: "primary", block: 0, file: ["cloned.mp3", "mp3"] }] }, + { slug: "transcribe-to-captions", mdx: "developer-guide/sdk-guide/cookbook/transcribe-to-captions.mdx", cases: [{ name: "primary", block: 0, file: ["captions.srt", "srt"] }] }, + { slug: "batch-transcribe-with-language-hint", mdx: "developer-guide/sdk-guide/cookbook/batch-transcribe-with-language-hint.mdx", cases: 
[{ name: "primary", block: 0 }] }, + { slug: "telephony-8khz-audio", mdx: "developer-guide/sdk-guide/cookbook/telephony-8khz-audio.mdx", cases: [{ name: "primary", block: 0, file: ["out.wav", "wav"] }] }, + { slug: "developer-guide/sdk-guide/cookbook/clone-and-wait-until-ready", mdx: "developer-guide/sdk-guide/cookbook/clone-and-wait-until-ready.mdx", cases: [{ name: "primary", block: 0, file: ["out.mp3", "mp3"] }] }, + { slug: "oneshot-vs-persistent-cloning", mdx: "developer-guide/sdk-guide/cookbook/oneshot-vs-persistent-cloning.mdx", cases: [{ name: "primary", block: 0, file: ["oneshot.mp3", "mp3"] }] }, + { slug: "discover-library-voice", mdx: "developer-guide/sdk-guide/cookbook/discover-library-voice.mdx", cases: [{ name: "primary", block: 0, file: ["out.mp3", "mp3"], subs: { "": "" } }] }, + { slug: "voice-agent-loop", mdx: "developer-guide/sdk-guide/cookbook/voice-agent-loop.mdx", cases: [{ name: "primary", block: 0, file: ["reply.mp3", "mp3"], subs: { "": "" } }] }, +];
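The per-case fields in these specs (`subs`, `file`) can be read as two small steps: substitute placeholders in the extracted code, then check the written file's format. The sketch below illustrates those steps in isolation. The helper names `applySubs`, `sniffFormat`, and `checkCase` are hypothetical and do not exist in `run.mjs`; the byte signatures mirror the ones the runner's `sniff` uses, and treating an Ogg container as `opus` is an assumption carried over from it.

```javascript
// Hypothetical helpers, not part of tests/js/run.mjs: a sketch of the per-case
// steps (placeholder substitution, then a format check on the output bytes).

// Replace every occurrence of each placeholder key with its replacement string.
function applySubs(code, subs = {}) {
  let out = code;
  for (const [placeholder, replacement] of Object.entries(subs)) {
    // Skip empty keys: splitting on "" breaks the code into single characters.
    if (placeholder === "") continue;
    out = out.split(placeholder).join(replacement);
  }
  return out;
}

// Identify a format from the first bytes of a Uint8Array (or Buffer).
function sniffFormat(bytes) {
  const ascii = (n) => String.fromCharCode(...bytes.slice(0, n));
  if (ascii(3) === "ID3") return "mp3"; // ID3v2 tag at the start of the file
  if (bytes.length > 1 && bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0) {
    return "mp3"; // MPEG audio frame sync (11 set bits)
  }
  if (ascii(4) === "RIFF") return "wav";
  if (ascii(4) === "OggS") return "opus"; // Ogg container, assumed to carry Opus
  return "unknown";
}

// Check one case's output bytes against its declared `file` entry.
function checkCase(testCase, bytes) {
  if (!testCase.file) return true; // no file assertion: running to completion passes
  const [, format] = testCase.file;
  if (format === "any") return true; // file only has to exist
  if (format === "srt") return new TextDecoder().decode(bytes).includes("-->");
  return sniffFormat(bytes) === format;
}
```

A case such as `{ name: "primary", block: 0, file: ["out.wav", "wav"] }` would pass `checkCase` only when the bytes start with `RIFF`, while a case with no `file` key passes as long as the block ran without throwing.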