feat(web): unify AI providers behind OpenAI-compatible config (Ollama, Whisper, etc.) #1877
Open
kovashikawa wants to merge 7 commits into
Adds six optional server env vars:
- AI_BASE_URL / AI_API_KEY / AI_MODEL: chat completions provider
- STT_BASE_URL / STT_API_KEY / STT_MODEL: speech-to-text provider
All are .optional(), so existing self-hosters running Groq/OpenAI/Deepgram are unaffected. Subsequent commits wire them into the AI client and the transcription workflow.
Replaces the groq-sdk wrapper and the hand-rolled fetch-to-OpenAI fallback with a single `openai` SDK client in apps/web/lib/ai-provider.ts. The new getAiClient()/getAiModel() resolve, in order:
1. AI_BASE_URL + AI_API_KEY + AI_MODEL (any OpenAI-compatible provider: Ollama, vLLM, OpenRouter, LiteLLM, etc.)
2. GROQ_API_KEY (existing default, baseURL pinned to Groq, model preserved as openai/gpt-oss-120b)
3. OPENAI_API_KEY (default OpenAI endpoint, model preserved as gpt-4o-mini)
Call sites migrated:
- apps/web/workflows/generate-ai.ts: drops the duplicate callOpenAi raw fetch in favor of the unified client; signatures threaded
- apps/web/actions/videos/translate-transcript.ts
- apps/web/lib/messenger/agent.ts (Groq branch only; Anthropic/OpenAI fallbacks untouched, as they are a separate domain and kept out of scope)
Dependency change: -groq-sdk, +openai. Behavior for existing installs is unchanged because the Groq path now constructs an OpenAI client with baseURL = https://api.groq.com/openai/v1, which speaks the same wire protocol.
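The resolution order in this commit can be sketched as a pure function over the environment. This is a minimal illustration, not the contents of ai-provider.ts: `resolveChatProvider`, `ChatProvider`, and the `source` tag are hypothetical names, and the env is passed in as a plain record so the sketch has no dependencies. Only the env var names, the Groq base URL, and the two default model names come from the commit message.

```typescript
// Hypothetical shapes for illustration; the real ai-provider.ts may differ.
type Env = Record<string, string | undefined>;

interface ChatProvider {
  source: "custom" | "groq" | "openai";
  baseURL: string | undefined; // undefined means "use the SDK's default OpenAI endpoint"
  apiKey: string;
  model: string;
}

const GROQ_BASE_URL = "https://api.groq.com/openai/v1";

function resolveChatProvider(env: Env): ChatProvider | null {
  const { AI_BASE_URL, AI_API_KEY, AI_MODEL, GROQ_API_KEY, OPENAI_API_KEY } = env;

  // 1. Explicit OpenAI-compatible endpoint (Ollama, vLLM, OpenRouter, LiteLLM, ...).
  //    Assumption: only a complete triple selects this branch. Handling of a
  //    partial triple is not modelled in this sketch.
  if (AI_BASE_URL && AI_API_KEY && AI_MODEL) {
    return { source: "custom", baseURL: AI_BASE_URL, apiKey: AI_API_KEY, model: AI_MODEL };
  }

  // 2. Groq, reached through its OpenAI-compatible endpoint.
  if (GROQ_API_KEY) {
    return { source: "groq", baseURL: GROQ_BASE_URL, apiKey: GROQ_API_KEY, model: "openai/gpt-oss-120b" };
  }

  // 3. OpenAI on its default endpoint.
  if (OPENAI_API_KEY) {
    return { source: "openai", baseURL: undefined, apiKey: OPENAI_API_KEY, model: "gpt-4o-mini" };
  }

  return null; // no chat provider configured
}
```

The resolved `baseURL` and `apiKey` would then be handed to the `openai` SDK constructor, and `model` to each chat completion request.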
Transcription workflow:
- When STT_BASE_URL is set, transcribeAudio dispatches to a new transcribeViaSttProvider() that calls openai.audio.transcriptions.create with response_format: "vtt". The OpenAI Whisper API returns WebVTT directly, which is exactly what Cap's pipeline writes to S3, so the Deepgram-specific formatToWebVTT(DeepgramResult) adapter drops out on this path.
- Default behavior (STT_BASE_URL unset) still uses Deepgram. Existing installs are unaffected.
Trigger-gate widenings (these were the blockers preventing self-hosters on local providers from ever firing the workflow):
- apps/web/lib/transcribe.ts: accept STT_BASE_URL as a valid provider
- apps/web/lib/generate-ai.ts: accept AI_BASE_URL as a valid provider
- apps/web/actions/videos/get-status.ts: same widenings in the share-page auto-trigger paths for both transcription and AI generation
The default docker-compose.yml did not pass DEEPGRAM/GROQ/OPENAI env vars through to cap-web, which is part of why self-host AI was silently broken: even when users set the keys in .env, they never reached the container. This commit threads them through along with the new AI_*/STT_* triples in all four compose flavors:
- docker-compose.yml (default)
- docker-compose.template.yml
- docker-compose.coolify.yml
- docker-compose.coolify.env.example
…scheme
- AI_BASE_URL and STT_BASE_URL now refuse non-http(s) schemes (defense
against typos like `file://` or `gopher://`). Empty string still passes
for compose default `${VAR:-}`.
- Doc strings updated to make the requirement contract explicit
(AI_API_KEY and AI_MODEL are required when AI_BASE_URL is set; same for
STT_*). The previous wording said AI_API_KEY "falls back to GROQ_API_KEY
or OPENAI_API_KEY" — that fallback is removed in the next commit
because it could silently send a paid cloud key to an arbitrary URL.
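A minimal sketch of the scheme check described above, assuming plain TypeScript and the standard `URL` class rather than whatever schema library the env module actually uses. `isAcceptableBaseUrl` is a hypothetical name for illustration.

```typescript
// Hypothetical validator; the real env schema may express this differently.
function isAcceptableBaseUrl(value: string): boolean {
  // An empty string is what the compose default `${VAR:-}` produces when the
  // variable is unset, so it must pass and be treated as "not configured".
  if (value === "") return true;

  try {
    const { protocol } = new URL(value);
    // Only plain HTTP(S) endpoints are allowed; file://, gopher://, etc. are refused.
    return protocol === "http:" || protocol === "https:";
  } catch {
    // Not parseable as an absolute URL (for example, a bare hostname).
    return false;
  }
}
```

Note that a value such as `localhost:11434` (missing `http://`) is rejected too: `URL` parses `localhost:` as the scheme, which is one of the typos this kind of check is meant to surface early.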
…e gates
ai-provider.ts:
- Drop the module-level singleton cache. The OpenAI SDK is cheap to
construct and the cache made env changes / hot-reloads / tests carry
stale clients with no path to recreate.
- Drop the cross-provider apiKey fallback. Previously, setting AI_BASE_URL
without AI_API_KEY would silently send the configured GROQ_API_KEY or
OPENAI_API_KEY over the wire to the new endpoint. Now AI_API_KEY is
required explicitly when AI_BASE_URL is set; same for STT_API_KEY.
- Throw clear errors when AI_BASE_URL is set without the required
AI_API_KEY or AI_MODEL (and STT analogue). The previous code would
silently default AI_MODEL to "gpt-4o-mini" and let Ollama/vLLM return
an opaque 404 inside the workflow step.
- Set explicit timeouts (120s chat, 300s STT) and maxRetries: 2 on the
OpenAI client. The SDK default of 600s would hang workflow steps for
10 minutes on a stuck local inference call; the retry restores the
resilience that the previous Groq->OpenAI fallback used to provide.
- Add isAiConfigured() / isSttConfigured() helpers as the single source
of truth for "is any chat / STT provider available?" so the OR-chains
in trigger gates don't drift the next time a provider type lands.
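The ai-provider.ts changes listed above could look roughly like the following. This is a sketch under assumptions, not the PR's code: `requireTriple` and `clientOptionsFor` are hypothetical helpers, the env is passed in explicitly, and the gate helpers take an `env` argument here even though the commit names them as zero-argument `isAiConfigured()` / `isSttConfigured()`. The `timeout` (milliseconds) and `maxRetries` fields mirror the option names accepted by the `openai` SDK constructor.

```typescript
type Env = Record<string, string | undefined>;

interface Triple { baseURL: string; apiKey: string; model: string }
interface ClientOptions { baseURL: string; apiKey: string; timeout: number; maxRetries: number }

const CHAT_TIMEOUT_MS = 120_000; // 120s for chat completions
const STT_TIMEOUT_MS = 300_000;  // 300s for speech-to-text

// Returns null when <PREFIX>_BASE_URL is unset; throws when it is set without
// the rest of the triple, instead of borrowing another provider's key or model.
function requireTriple(env: Env, prefix: "AI" | "STT"): Triple | null {
  const baseURL = env[`${prefix}_BASE_URL`];
  if (!baseURL) return null;

  const apiKey = env[`${prefix}_API_KEY`];
  const model = env[`${prefix}_MODEL`];
  if (!apiKey || !model) {
    throw new Error(`${prefix}_BASE_URL is set, so ${prefix}_API_KEY and ${prefix}_MODEL are required`);
  }
  return { baseURL, apiKey, model };
}

// Options for a fresh client on every call: no module-level cache to go stale.
function clientOptionsFor(triple: Triple, kind: "chat" | "stt"): ClientOptions {
  return {
    baseURL: triple.baseURL,
    apiKey: triple.apiKey,
    timeout: kind === "chat" ? CHAT_TIMEOUT_MS : STT_TIMEOUT_MS,
    maxRetries: 2,
  };
}

// Single source of truth for the trigger gates.
function isAiConfigured(env: Env): boolean {
  return Boolean(env.AI_BASE_URL || env.GROQ_API_KEY || env.OPENAI_API_KEY);
}

function isSttConfigured(env: Env): boolean {
  return Boolean(env.STT_BASE_URL || env.DEEPGRAM_API_KEY);
}
```

Failing at resolution time keeps the error next to the misconfiguration, rather than surfacing later as an opaque 404 from the inference server inside a workflow step.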
workflows/transcribe.ts:
- Drop the `as unknown as string` cast on the OpenAI SDK transcription
response. With `response_format: "vtt" as const` the SDK's overload
narrows to string at compile time; the unsafe cast was hiding that.
- Strengthen the WebVTT smoke check from a substring search for
"WEBVTT" to a structural check (`/^WEBVTT/m` header line plus a cue
arrow `-->`). The substring form would both reject valid VTT without
the header and accept SRT or other formats that happened to contain
the word.
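The structural smoke check described above amounts to two tests on the response body. A minimal sketch, with `looksLikeWebVtt` as a hypothetical name:

```typescript
// Hypothetical helper; a smoke check, not a full WebVTT parser.
function looksLikeWebVtt(body: string): boolean {
  // A line that begins with the WEBVTT header (multiline flag: any line start).
  const hasHeaderLine = /^WEBVTT/m.test(body);
  // At least one cue timing arrow, e.g. "00:00:00.000 --> 00:00:02.000".
  const hasCueArrow = /-->/.test(body);
  return hasHeaderLine && hasCueArrow;
}
```

An SRT file that merely mentions "WEBVTT" in its subtitle text fails the header test, and a header with no cues fails the arrow test.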
workflows/generate-ai.ts:
- Request `response_format: { type: "json_object" }` on chat completions.
Every prompt already instructs "Return ONLY valid JSON"; modern
OpenAI-compatible providers (OpenAI, Groq, Ollama, vLLM, OpenRouter)
enforce that with this flag, which materially reduces parse failures
on smaller local models. A try/catch falls back to plain mode when
the underlying provider rejects the field, keeping niche gateways
compatible.
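The JSON-mode-with-fallback behavior can be sketched independently of any SDK by injecting the function that performs the request. Everything named here (`ChatRequest`, `SendChat`, `chatPreferringJsonMode`) is hypothetical, and the sketch assumes the fallback triggers on any error from the first attempt; the actual code may inspect the error before retrying.

```typescript
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  response_format?: { type: "json_object" };
}

// Stand-in for a call such as client.chat.completions.create(...), reduced to
// "request in, message text out" so the sketch stays dependency-free.
type SendChat = (request: ChatRequest) => Promise<string>;

async function chatPreferringJsonMode(send: SendChat, request: ChatRequest): Promise<string> {
  try {
    // First attempt: ask the provider to enforce JSON output.
    return await send({ ...request, response_format: { type: "json_object" } });
  } catch {
    // Assumption: any failure is treated as "provider rejected the field".
    // Retry once in plain mode, relying on the prompt's own JSON instruction.
    const { response_format: _dropped, ...plain } = request;
    return send(plain);
  }
}
```

The caller still has to `JSON.parse` the result defensively, since the plain-mode leg offers no structural guarantee.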
Trigger gates consolidated via isAiConfigured() / isSttConfigured():
- apps/web/actions/videos/get-status.ts (share-page auto-trigger, both
transcription and AI-generation paths)
- apps/web/lib/transcribe.ts (lib entry point)
- apps/web/lib/generate-ai.ts (lib entry point)
- apps/web/workflows/generate-ai.ts (validateAndSetProcessing step;
also aligns the second-check error message with the first)
- apps/web/workflows/transcribe.ts (validateVideo step)
The previous PR commit dropped the cross-provider failover and described maxRetries: 2 as a replacement; that was wrong. maxRetries only retries the same endpoint on transient errors; it does not preserve the prior Groq → OpenAI behavior for users who had both keys set as a true failover. This restores the prior semantics through the unified abstraction:
- New getAiFallbackClient() in ai-provider.ts returns an OpenAI client (with the same timeout / maxRetries settings) when both GROQ_API_KEY and OPENAI_API_KEY are set AND no AI_BASE_URL override is in effect. An explicit AI_BASE_URL means the user has chosen a specific provider; no implicit fallback is added in that case.
- callAiApi in workflows/generate-ai.ts wraps the primary call in a try/catch; on any primary failure, if a fallback client is available it retries once with OpenAI before propagating. JSON-mode handling is applied to both legs via a shared invokeChat helper.
- A console.warn surfaces the fallback so the failure is observable.
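The restored failover has two parts: deciding whether a fallback exists, and retrying once when the primary call fails. The sketch below illustrates both with hypothetical names (`hasFallbackProvider`, `callWithFailover`, `ChatCall`) and injected call functions in place of real SDK clients; the warning text is also made up for illustration.

```typescript
type Env = Record<string, string | undefined>;
type ChatCall = (prompt: string) => Promise<string>;

// A fallback exists only for the legacy dual-key setup. An explicit
// AI_BASE_URL means the user picked a provider, so nothing implicit is added.
function hasFallbackProvider(env: Env): boolean {
  return Boolean(env.GROQ_API_KEY && env.OPENAI_API_KEY && !env.AI_BASE_URL);
}

async function callWithFailover(
  primary: ChatCall,
  fallback: ChatCall | null,
  prompt: string,
): Promise<string> {
  try {
    return await primary(prompt);
  } catch (error) {
    // No fallback configured: propagate the primary failure unchanged.
    if (!fallback) throw error;
    // Make the failover observable in logs before retrying once.
    console.warn("primary AI provider failed; retrying once with fallback", error);
    return fallback(prompt);
  }
}
```

This differs from `maxRetries`, which only repeats the request against the same endpoint: here the second attempt goes to a different provider.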
Summary
Depends on #1874. That proxy fix is the prerequisite for any workflow to execute on a self-host. Without #1874 merged, this PR's transcription path queues but never runs (the workflow runtime's `/.well-known/workflow/v1/*` HTTP callbacks get 307→/login). Once #1874 lands, this PR rebases cleanly onto `main` with no further changes.
Root cause
Self-hosted Cap currently requires three paid third-party providers (Deepgram + Groq/OpenAI) to make the share-page AI features (summary, chapters, transcript) work. Even with paid keys, the three calls use three inconsistent code paths:
- `groq-sdk` chat completion: `apps/web/lib/groq-client.ts`
- `fetch` to `api.openai.com`: `apps/web/workflows/generate-ai.ts:391`
- `@deepgram/sdk` `transcribeFile`: `apps/web/workflows/transcribe.ts:342`
@xenova in #1356: "any interest in using a local model for speech transcription? 👀"
PR #1705 already ships local STT in the desktop app (Parakeet). Local models are part of Cap's stance — just not on the server side, yet.
Fix
Collapse the three call patterns into one OpenAI-compatible client abstraction, configured by env. The OpenAI API is the lingua franca that Groq, OpenAI, Ollama, vLLM, OpenRouter, LiteLLM, `faster-whisper-server`, and whisper.cpp's HTTP server all already speak.
Two concerns → two env triples (all optional, all default to existing behavior):
- `AI_BASE_URL` / `AI_API_KEY` / `AI_MODEL`: chat completions provider
- `STT_BASE_URL` / `STT_API_KEY` / `STT_MODEL`: speech-to-text provider
Key simplification on the STT path: OpenAI's `/v1/audio/transcriptions` natively returns WebVTT (`response_format: "vtt"`). That's exactly what Cap already writes to S3, so when `STT_BASE_URL` is set the Deepgram-specific `formatToWebVTT(DeepgramResult)` adapter drops out; the rest of the pipeline is unchanged.
Commits are split into 4 logical groups, each typechecking independently:
1. `feat(env): add AI_*/STT_* env vars for OpenAI-compatible providers`
2. `refactor(web): unify chat AI behind OpenAI-compatible client`: drops `groq-sdk`, adds `openai`, deletes `lib/groq-client.ts`, migrates 3 call sites
3. `feat(web): OpenAI-compatible STT + widen self-host AI gates`: new STT branch in transcribe workflow; widens 4 env-key trigger gates that were blocking self-hosters on local providers
4. `chore(docker): expose AI/STT env vars through compose files`: 4 compose flavors
End-state for a fully-local self-host (FYI, not part of this PR's required setup):
Backwards compatibility
If the new env vars are unset, behavior is identical to today:
`GROQ_API_KEY` → Groq path; `OPENAI_API_KEY` → existing OpenAI fallback; `DEEPGRAM_API_KEY` → Deepgram. The Groq path now constructs an `openai` SDK client with `baseURL = https://api.groq.com/openai/v1`: same wire protocol, no observable difference.
Verification
End-to-end test on a local Docker Compose self-host with Ollama (Gemma 3 12B) + `hwdsl2/whisper-server` (Whisper base), after applying both #1874 and this PR:
Before this PR (with #1874 applied alone; workflow runtime works)
- `AI_BASE_URL` / `STT_BASE_URL` set (no Groq/Deepgram keys): the share page renders, `transcribeVideo()` is called, but trigger gates checking for `DEEPGRAM_API_KEY` / `GROQ_API_KEY` short-circuit before the workflow starts. `transcriptionStatus` stays `NULL`.
After this PR
- `transcribeVideo()` → workflow runs end-to-end: `videos.transcriptionStatus = COMPLETE`; `transcription.vtt` written directly to S3 from Whisper's `response_format=vtt` output (no format conversion).
- `aiGenerationStatus = COMPLETE`; `metadata.summary` and `metadata.chapters` populated.
- … `GROQ_API_KEY`).
Gates clean for changed files:
`pnpm exec biome check --write`, `pnpm exec tsc -b`, `pnpm vitest run __tests__/unit/generate-ai-title.test.ts` (6/6).
Out of scope (intentional)
- `@deepgram/sdk` remains a dependency. Deepgram is the default STT when `STT_BASE_URL` is unset, so cap.so cloud is untouched. A follow-up could drop the SDK once `STT_BASE_URL` becomes the unified STT path.
- The Anthropic + OpenAI raw-fetch fallbacks in `apps/web/lib/messenger/agent.ts` were NOT migrated: that file is the support chatbot with its own Anthropic → OpenAI → Groq fallback chain, a separate domain.
Related
Design questions
- … `AI_*` + `STT_*` (what I shipped) vs. `LLM_*` + `STT_*` vs. some other shape you'd prefer?
- … `@deepgram/sdk`?
Greptile Summary
This PR replaces three divergent AI call patterns (Groq SDK, raw OpenAI fetch, Deepgram SDK) with a single OpenAI-compatible client abstraction in `lib/ai-provider.ts`, gated by six new optional env vars (`AI_BASE_URL/KEY/MODEL`, `STT_BASE_URL/KEY/MODEL`). Existing deployments using `GROQ_API_KEY` / `OPENAI_API_KEY` / `DEEPGRAM_API_KEY` are unaffected by default.
- `lib/ai-provider.ts`: new singleton factory for chat and STT clients; priority order is `AI_BASE_URL` → Groq → OpenAI.
- `workflows/generate-ai.ts`: removes the Groq→OpenAI automatic failover that previously recovered from Groq errors when `OPENAI_API_KEY` was also present.
- `workflows/transcribe.ts`: adds an OpenAI-compatible STT branch that posts audio and receives WebVTT directly, bypassing the Deepgram formatter.
- `lib/messenger/agent.ts`: the PR states this file was out of scope, but `callGroq` was replaced with `callAiProvider` using `getAiClient()`, meaning the support chatbot's last-resort fallback now routes through whatever provider `AI_BASE_URL` points to (e.g. a local Ollama instance).
Confidence Score: 3/5
Safe to merge for new self-hosted deployments; two unintended behavioral changes affect existing dual-key setups and the support chatbot routing.
Two issues need resolution before merge. First, workflows/generate-ai.ts silently drops the Groq→OpenAI failover — users with both keys set lose automatic recovery from Groq downtime. Second, lib/messenger/agent.ts was migrated despite the PR explicitly calling it out of scope, so the support chatbot's final fallback now routes through any AI_BASE_URL-configured provider (e.g. a local Ollama instance), which may produce unsuitable responses for a customer support context.
apps/web/workflows/generate-ai.ts (failover removal) and apps/web/lib/messenger/agent.ts (unintended migration of the support chatbot)
Important Files Changed
Comments Outside Diff (2)
apps/web/lib/messenger/agent.ts, line 168-191
The PR's "Out of scope" section explicitly states: "The Anthropic + OpenAI raw-fetch fallbacks in `apps/web/lib/messenger/agent.ts` were NOT migrated." But `callGroq` has been renamed to `callAiProvider` and now calls `getAiClient()`. For any self-hosted deployment that sets `AI_BASE_URL` (the stated target of this PR) but has neither `ANTHROPIC_API_KEY` nor `OPENAI_API_KEY`, the support chatbot's last resort will now be a local Ollama/vLLM instance. A local Gemma model answering customer support queries is likely not the intended behavior, and it contradicts the stated out-of-scope decision.
apps/web/workflows/transcribe.ts, line 536-541
… `unknown` to `string`
The OpenAI Node SDK v4 overloads `audio.transcriptions.create`: when `response_format` is `"vtt"`, `"srt"`, or `"text"` the runtime value is a plain string, but the TypeScript generic signature falls through to the `Transcription` type. The `as unknown as string` workaround is valid here, but leaving a comment explaining why the cast is needed would help the next reader avoid accidentally "fixing" it by removing the cast (which would cause a type error on `.includes("WEBVTT")`).
Reviews (1): Last reviewed commit: "chore(docker): expose AI/STT env vars th..."