feat(web): unify AI providers behind OpenAI-compatible config (Ollama, Whisper, etc.) by kovashikawa · Pull Request #1877 · CapSoftware/Cap

kovashikawa · 2026-05-31T15:33:43Z

Summary

Replace three inconsistent AI call patterns with a single OpenAI-compatible client abstraction.
Self-hosters can run AI features fully locally (Ollama for chat, faster-whisper-server for STT) with zero paid services.
Existing Groq / OpenAI / Deepgram installs are unchanged — all current behavior preserved by default.

Depends on #1874. That proxy fix is the prerequisite for any workflow to execute on a self-host. Without #1874 merged, this PR's transcription path queues but never runs (the workflow runtime's /.well-known/workflow/v1/* HTTP callbacks get 307→/login). Once #1874 lands, this PR rebases cleanly onto main with no further changes.

Root cause

Self-hosted Cap currently requires three paid third-party providers (Deepgram + Groq/OpenAI) to make the share-page AI features (summary, chapters, transcript) work. Even with paid keys, the three calls use three inconsistent code paths:

| Concern | Current SDK / call | File |
| --- | --- | --- |
| Summary (primary) | `groq-sdk` chat completion | `apps/web/lib/groq-client.ts` |
| Summary (fallback) | hand-rolled `fetch` to `api.openai.com` | `apps/web/workflows/generate-ai.ts:391` |
| Transcription | `@deepgram/sdk` `transcribeFile` | `apps/web/workflows/transcribe.ts:342` |

@xenova in #1356: "any interest in using a local model for speech transcription? 👀"

PR #1705 already ships local STT in the desktop app (Parakeet). Local models are part of Cap's stance — just not on the server side, yet.

Fix

Collapse the three call patterns into one OpenAI-compatible client abstraction, configured by env. The OpenAI API is the lingua franca that Groq, OpenAI, Ollama, vLLM, OpenRouter, LiteLLM, faster-whisper-server, and whisper.cpp's HTTP server all already speak.

Two concerns → two env triples (all optional, all default to existing behavior):

# Chat (summaries, chapters, titles)
AI_BASE_URL      # default: https://api.groq.com/openai/v1
AI_API_KEY       # default: $GROQ_API_KEY → $OPENAI_API_KEY
AI_MODEL         # default: openai/gpt-oss-120b

# Speech-to-text
STT_BASE_URL     # default unset → existing Deepgram path
STT_API_KEY
STT_MODEL        # used when STT_BASE_URL is set

Key simplification on the STT path: OpenAI's /v1/audio/transcriptions natively returns WebVTT (response_format: "vtt"). That's exactly what Cap already writes to S3, so when STT_BASE_URL is set the Deepgram-specific formatToWebVTT(DeepgramResult) adapter drops out — the rest of the pipeline is unchanged.

Commits are split into 4 logical groups, each typechecking independently:

feat(env): add AI_*/STT_* env vars for OpenAI-compatible providers
refactor(web): unify chat AI behind OpenAI-compatible client — drops groq-sdk, adds openai, deletes lib/groq-client.ts, migrates 3 call sites
feat(web): OpenAI-compatible STT + widen self-host AI gates — new STT branch in transcribe workflow; widens 4 env-key trigger gates that were blocking self-hosters on local providers
chore(docker): expose AI/STT env vars through compose files — 4 compose flavors

End-state for a fully-local self-host (FYI — not part of this PR's required setup):

# docker-compose.override.yml
services:
  cap-web:
    environment:
      AI_BASE_URL: http://host.docker.internal:11434/v1
      AI_API_KEY: ollama
      AI_MODEL: gemma3:12b
      STT_BASE_URL: http://faster-whisper:9000/v1
      STT_API_KEY: none
      STT_MODEL: large-v3-turbo

Backwards compatibility

If the new env vars are unset, behavior is identical to today: GROQ_API_KEY → Groq path; OPENAI_API_KEY → existing OpenAI fallback; DEEPGRAM_API_KEY → Deepgram. The Groq path now constructs an openai SDK client with baseURL = https://api.groq.com/openai/v1 — same wire protocol, no observable difference.

Verification

End-to-end test on a local Docker Compose self-host with Ollama (Gemma 3 12B) + hwdsl2/whisper-server (Whisper base), after applying both #1874 and this PR:

Before this PR (with #1874 applied alone — workflow runtime works)

Self-host with only AI_BASE_URL/STT_BASE_URL set (no Groq/Deepgram keys): the share page renders, transcribeVideo() is called, but trigger gates checking for DEEPGRAM_API_KEY/GROQ_API_KEY short-circuit before the workflow starts. transcriptionStatus stays NULL.

After this PR

Share-page render fires transcribeVideo() → workflow runs end-to-end:

[ShareVideoPage] Starting transcription for video: t7tkmqev8a8ecbk
[transcribeVideo] Triggering transcription workflow
[transcribe] Probe result: audioCodec=aac, videoCodec=h264
[transcribe] Extracted audio: 2126359 bytes
[whisper] POST /v1/audio/transcriptions HTTP/1.1 200 OK

videos.transcriptionStatus = COMPLETE; transcription.vtt written directly to S3 from Whisper's response_format=vtt output (no format conversion).
AI generation auto-queues; Gemma 3 12B produces title + paragraph summary + 4 timestamped chapters; aiGenerationStatus = COMPLETE; metadata.summary and metadata.chapters populated.
Existing Groq-only config still works (smoke-tested by reverting env to only GROQ_API_KEY).

Gates clean for changed files: pnpm exec biome check --write, pnpm exec tsc -b, pnpm vitest run __tests__/unit/generate-ai-title.test.ts (6/6).

Out of scope (intentional)

@deepgram/sdk remains a dependency. Deepgram is the default STT when STT_BASE_URL is unset, so cap.so cloud is untouched. A follow-up could drop the SDK once STT_BASE_URL becomes the unified STT path.
The Anthropic + OpenAI raw-fetch fallbacks in apps/web/lib/messenger/agent.ts were NOT migrated — that file is the support chatbot with its own Anthropic → OpenAI → Groq fallback chain, separate domain.

Related

Depends on #1874, "fix(proxy): allow /.well-known/* and /embed/* on self-hosted" (the proxy fix that enables workflows to execute on self-host at all)
Closes the AI-provider portion of #1356, "Transcription error on self-hosted Cap"
Addresses @xenova's local-model interest in #1356
May address #1550, "Transcription workflow fails on Docker image with Node 24", transitively once the workflow runtime side stabilizes

Design questions

Naming: AI_* + STT_* (what I shipped) vs. LLM_* + STT_* vs. some other shape you'd prefer?
Deepgram fate: keep as default branch indefinitely (current), or migrate cap.so cloud to OpenAI-compatible Whisper at some point and drop @deepgram/sdk?
OpenAI raw-fetch fallback in messenger: also unify that path in a follow-up, or leave as-is given the separate fallback semantics?

Greptile Summary

This PR replaces three divergent AI call patterns (Groq SDK, raw OpenAI fetch, Deepgram SDK) with a single OpenAI-compatible client abstraction in lib/ai-provider.ts, gated by six new optional env vars (AI_BASE_URL/KEY/MODEL, STT_BASE_URL/KEY/MODEL). Existing deployments using GROQ_API_KEY / OPENAI_API_KEY / DEEPGRAM_API_KEY are unaffected by default.

lib/ai-provider.ts — new singleton factory for chat and STT clients; priority order is AI_BASE_URL → Groq → OpenAI.
workflows/generate-ai.ts — removes the Groq→OpenAI automatic failover that previously recovered from Groq errors when OPENAI_API_KEY was also present.
workflows/transcribe.ts — adds an OpenAI-compatible STT branch that posts audio and receives WebVTT directly, bypassing the Deepgram formatter.
lib/messenger/agent.ts — the PR states this file was out of scope, but callGroq was replaced with callAiProvider using getAiClient(), meaning the support chatbot's last-resort fallback now routes through whatever provider AI_BASE_URL points to (e.g. a local Ollama instance).

Confidence Score: 3/5

Safe to merge for new self-hosted deployments; two unintended behavioral changes affect existing dual-key setups and the support chatbot routing.

Two issues need resolution before merge. First, workflows/generate-ai.ts silently drops the Groq→OpenAI failover — users with both keys set lose automatic recovery from Groq downtime. Second, lib/messenger/agent.ts was migrated despite the PR explicitly calling it out of scope, so the support chatbot's final fallback now routes through any AI_BASE_URL-configured provider (e.g. a local Ollama instance), which may produce unsuitable responses for a customer support context.

apps/web/workflows/generate-ai.ts (failover removal) and apps/web/lib/messenger/agent.ts (unintended migration of the support chatbot)

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/web/lib/ai-provider.ts | New unified OpenAI-compatible client abstraction; module-level singletons work for production but could silently retain a stale client across test/dev reloads. |
| apps/web/lib/messenger/agent.ts | callGroq renamed to callAiProvider and now routes through getAiClient(). This contradicts the PR's "out of scope" statement and silently routes the support chatbot through any configured AI_BASE_URL provider. |
| apps/web/workflows/generate-ai.ts | Migrated to unified client; removes the Groq→OpenAI automatic failover that existed for dual-key setups, making AI generation workflows non-resilient to provider errors. |
| apps/web/workflows/transcribe.ts | Adds OpenAI-compatible STT path via transcribeViaSttProvider; the as unknown as string cast and VTT validation are functional but the cast rationale is undocumented. |
| packages/env/server.ts | Adds 6 new optional env vars (AI_BASE_URL, AI_API_KEY, AI_MODEL, STT_BASE_URL, STT_API_KEY, STT_MODEL) with clear descriptions; all optional with no breaking schema changes. |
| apps/web/actions/videos/get-status.ts | Widens trigger gates to accept STT_BASE_URL and AI_BASE_URL alongside legacy keys; straightforward additive guard changes. |
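
The gate widening noted for get-status.ts amounts to an OR over the available provider settings. A minimal sketch, assuming a plain env object is passed in (the real code presumably reads its validated server env instead). The helper names `isAiConfigured` and `isSttConfigured` are the ones a later commit in this PR introduces, but the bodies and the `ProviderEnv` type here are illustrative assumptions:

```typescript
// Hypothetical env shape; field names mirror the env vars discussed in this PR.
interface ProviderEnv {
  AI_BASE_URL?: string;
  GROQ_API_KEY?: string;
  OPENAI_API_KEY?: string;
  STT_BASE_URL?: string;
  DEEPGRAM_API_KEY?: string;
}

// True when any chat provider (new override or legacy key) is available.
function isAiConfigured(env: ProviderEnv): boolean {
  return Boolean(env.AI_BASE_URL || env.GROQ_API_KEY || env.OPENAI_API_KEY);
}

// True when any speech-to-text provider is available.
function isSttConfigured(env: ProviderEnv): boolean {
  return Boolean(env.STT_BASE_URL || env.DEEPGRAM_API_KEY);
}

// A self-host with only local endpoints set passes both gates.
const localOnly: ProviderEnv = {
  AI_BASE_URL: "http://host.docker.internal:11434/v1",
  STT_BASE_URL: "http://faster-whisper:9000/v1",
};
console.log(isAiConfigured(localOnly), isSttConfigured(localOnly));
```

Keeping the check in one helper per concern means a new provider type only has to be added in one place rather than in every trigger gate.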

Comments Outside Diff (2)

apps/web/lib/messenger/agent.ts, line 168-191 (link)

Messenger agent was migrated — contradicts PR description

The PR's "Out of scope" section explicitly states: "The Anthropic + OpenAI raw-fetch fallbacks in apps/web/lib/messenger/agent.ts were NOT migrated." But callGroq has been renamed to callAiProvider and now calls getAiClient(). For any self-hosted deployment that sets AI_BASE_URL (the stated target of this PR) but has neither ANTHROPIC_API_KEY nor OPENAI_API_KEY, the support chatbot's last resort will now be a local Ollama/vLLM instance. A local Gemma model answering customer support queries is likely not the intended behavior, and it contradicts the stated out-of-scope decision.


apps/web/workflows/transcribe.ts, line 536-541 (link)

Double-cast through unknown to string

The OpenAI Node SDK v4 overloads audio.transcriptions.create — when response_format is "vtt", "srt", or "text" the runtime value is a plain string, but the TypeScript generic signature falls through to the Transcription type. The as unknown as string workaround is valid here, but leaving a comment explaining why the cast is needed would help the next reader avoid accidentally "fixing" it by removing the cast (which would cause a type error on .includes("WEBVTT")).
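
The `.includes("WEBVTT")` check mentioned in this comment is a plain substring test. A later commit in this PR describes replacing it with a structural check (a `/^WEBVTT/m` header match plus a `-->` cue arrow). A minimal sketch of that idea follows; the function name `looksLikeWebVtt` is hypothetical and the exact code in the PR may differ:

```typescript
// Structural WebVTT check: a line starting with "WEBVTT" plus at least one cue arrow.
// Note: /^WEBVTT/m matches at the start of any line, not only the first one.
function looksLikeWebVtt(body: string): boolean {
  return /^WEBVTT/m.test(body) && body.includes("-->");
}

const vtt = "WEBVTT\n\n00:00:00.000 --> 00:00:02.000\nHello\n";
const srt = "1\n00:00:00,000 --> 00:00:02,000\nHello\n";
const prose = "This file is not WEBVTT --> it only mentions the word";

console.log(looksLikeWebVtt(vtt)); // true
console.log(looksLikeWebVtt(srt)); // false: cue arrow but no header line
console.log(looksLikeWebVtt(prose)); // false: "WEBVTT" is not at the start of a line
console.log(prose.includes("WEBVTT")); // true: the substring test would accept it
```

This is a smoke check rather than a parser: it does not validate timestamps or cue ordering.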


Prompt To Fix All With AI

Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
apps/web/lib/messenger/agent.ts:168-191
**Messenger agent was migrated — contradicts PR description**

The PR's "Out of scope" section explicitly states: *"The Anthropic + OpenAI raw-fetch fallbacks in `apps/web/lib/messenger/agent.ts` were NOT migrated."* But `callGroq` has been renamed to `callAiProvider` and now calls `getAiClient()`. For any self-hosted deployment that sets `AI_BASE_URL` (the stated target of this PR) but has neither `ANTHROPIC_API_KEY` nor `OPENAI_API_KEY`, the support chatbot's last resort will now be a local Ollama/vLLM instance. A local Gemma model answering customer support queries is likely not the intended behavior, and it contradicts the stated out-of-scope decision.

### Issue 2 of 4
apps/web/workflows/generate-ai.ts:396-400
**Groq → OpenAI automatic failover silently dropped**

The old `callAiApi` tried Groq first and, on any exception, retried with OpenAI if `OPENAI_API_KEY` was also set. That resilience path is gone: if the configured provider (e.g. Groq) returns an error, the whole `generateAiWorkflow` now throws with no recovery. For users who had both `GROQ_API_KEY` and `OPENAI_API_KEY` set (the documented fallback setup that motivated the original dual-path code), a transient Groq outage will now fail the AI generation step entirely instead of falling back gracefully.

### Issue 3 of 4
apps/web/workflows/transcribe.ts:536-541
**Double-cast through `unknown` to `string`**

The OpenAI Node SDK v4 overloads `audio.transcriptions.create` — when `response_format` is `"vtt"`, `"srt"`, or `"text"` the runtime value is a plain string, but the TypeScript generic signature falls through to the `Transcription` type. The `as unknown as string` workaround is valid here, but leaving a comment explaining why the cast is needed would help the next reader avoid accidentally "fixing" it by removing the cast (which would cause a type error on `.includes("WEBVTT")`).

### Issue 4 of 4
apps/web/lib/ai-provider.ts:11-15
**Module-level singleton not reset between test runs**

`aiClient` and `sttClient` are module-level `let` variables that cache the first-constructed client for the lifetime of the process. In tests (and in hot-reload dev servers) where `serverEnv()` may differ between invocations or env vars are set after module load, `getAiClient()` will keep returning the first client even if the environment has changed. The old `groq-client.ts` had the same pattern — but with a unified provider this is now the single gate for both chat and STT. The existing unit test works because it mocks the module; this is fine for production, but worth a comment so the caching intent is explicit.

Reviews (1): Last reviewed commit: "chore(docker): expose AI/STT env vars th..."

Greptile also left 2 inline comments on this PR.

Adds six optional server env vars:

- AI_BASE_URL / AI_API_KEY / AI_MODEL: chat completions provider
- STT_BASE_URL / STT_API_KEY / STT_MODEL: speech-to-text provider

All are .optional() so existing self-hosters running Groq/OpenAI/Deepgram are unaffected. Subsequent commits wire them into the AI client and the transcription workflow.

Replaces the groq-sdk wrapper and the hand-rolled fetch-to-OpenAI fallback with a single `openai` SDK client in apps/web/lib/ai-provider.ts. The new getAiClient()/getAiModel() resolve, in order:

1. AI_BASE_URL + AI_API_KEY + AI_MODEL (any OpenAI-compatible provider: Ollama, vLLM, OpenRouter, LiteLLM, etc.)
2. GROQ_API_KEY (existing default, baseURL pinned to Groq, model preserved as openai/gpt-oss-120b)
3. OPENAI_API_KEY (default OpenAI endpoint, model preserved as gpt-4o-mini)

Call sites migrated:

- apps/web/workflows/generate-ai.ts: drops the duplicate callOpenAi raw fetch in favor of the unified client; signatures threaded
- apps/web/actions/videos/translate-transcript.ts
- apps/web/lib/messenger/agent.ts (Groq branch only; Anthropic/OpenAI fallbacks untouched: separate domain, kept out of scope)

Dependency change: -groq-sdk, +openai. Behavior for existing installs is unchanged because the Groq path now constructs an OpenAI client with baseURL = https://api.groq.com/openai/v1 (same wire protocol).
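
The resolution order in this commit message can be sketched as a pure function over the env. This is an illustration under assumptions, not the PR's actual ai-provider.ts: the `ChatEnv` and `ChatProvider` types and the name `resolveChatProvider` are hypothetical, the OpenAI base URL is the SDK's public default, and throwing on a partially filled AI_* triple is one possible policy (a later commit in this PR describes throwing clear errors in that case).

```typescript
// Hypothetical env shape; field names mirror the env vars in this PR.
interface ChatEnv {
  AI_BASE_URL?: string;
  AI_API_KEY?: string;
  AI_MODEL?: string;
  GROQ_API_KEY?: string;
  OPENAI_API_KEY?: string;
}

// Hypothetical result type: what an OpenAI-compatible client would need.
interface ChatProvider {
  baseURL: string;
  apiKey: string;
  model: string;
}

const GROQ_BASE_URL = "https://api.groq.com/openai/v1";
const OPENAI_BASE_URL = "https://api.openai.com/v1";

function resolveChatProvider(env: ChatEnv): ChatProvider | null {
  // 1. Explicit override: any OpenAI-compatible endpoint.
  if (env.AI_BASE_URL) {
    if (!env.AI_API_KEY || !env.AI_MODEL) {
      // Assumption: an incomplete triple is an error, not a silent default.
      throw new Error("AI_BASE_URL is set but AI_API_KEY or AI_MODEL is missing");
    }
    return { baseURL: env.AI_BASE_URL, apiKey: env.AI_API_KEY, model: env.AI_MODEL };
  }
  // 2. Existing default: Groq through its OpenAI-compatible endpoint.
  if (env.GROQ_API_KEY) {
    return { baseURL: GROQ_BASE_URL, apiKey: env.GROQ_API_KEY, model: "openai/gpt-oss-120b" };
  }
  // 3. OpenAI itself.
  if (env.OPENAI_API_KEY) {
    return { baseURL: OPENAI_BASE_URL, apiKey: env.OPENAI_API_KEY, model: "gpt-4o-mini" };
  }
  // No chat provider configured.
  return null;
}

// An explicit override wins even when a legacy key is also present.
console.log(
  resolveChatProvider({
    AI_BASE_URL: "http://host.docker.internal:11434/v1",
    AI_API_KEY: "ollama",
    AI_MODEL: "gemma3:12b",
    GROQ_API_KEY: "unused-here",
  }),
);
```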

Transcription workflow:

- When STT_BASE_URL is set, transcribeAudio dispatches to a new transcribeViaSttProvider() that calls openai.audio.transcriptions.create with response_format: "vtt". The OpenAI Whisper API returns WebVTT directly, which is exactly what Cap's pipeline writes to S3, so the Deepgram-specific formatToWebVTT(DeepgramResult) adapter drops out on this path.
- Default behavior (STT_BASE_URL unset) still uses Deepgram. Existing installs are unaffected.

Trigger-gate widenings (these were the blockers preventing self-hosters on local providers from ever firing the workflow):

- apps/web/lib/transcribe.ts: accept STT_BASE_URL as a valid provider
- apps/web/lib/generate-ai.ts: accept AI_BASE_URL as a valid provider
- apps/web/actions/videos/get-status.ts: same widenings in the share-page auto-trigger paths for both transcription and AI generation

The default docker-compose.yml did not pass DEEPGRAM/GROQ/OPENAI env vars through to cap-web, which is part of why self-host AI was silently broken: even when users set the keys in .env, they never reached the container. This commit threads them through along with the new AI_*/STT_* triples in all four compose flavors:

- docker-compose.yml (default)
- docker-compose.template.yml
- docker-compose.coolify.yml
- docker-compose.coolify.env.example

superagent-security

Superagent found 1 security concern(s).

…scheme

- AI_BASE_URL and STT_BASE_URL now refuse non-http(s) schemes (defense against typos like `file://` or `gopher://`). Empty string still passes for compose default `${VAR:-}`.
- Doc strings updated to make the requirement contract explicit (AI_API_KEY and AI_MODEL are required when AI_BASE_URL is set; same for STT_*). The previous wording said AI_API_KEY "falls back to GROQ_API_KEY or OPENAI_API_KEY"; that fallback is removed in the next commit because it could silently send a paid cloud key to an arbitrary URL.
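
The scheme restriction described in this commit can be illustrated with the standard `URL` class. This is a sketch only: the PR's env schema may implement the rule differently, and the function name `isAllowedBaseUrl` is hypothetical.

```typescript
// Accepts http(s) URLs and the empty string (what a compose default like ${VAR:-} yields).
// Anything unparsable or using another scheme is rejected.
function isAllowedBaseUrl(value: string): boolean {
  if (value === "") return true;
  let parsed: URL;
  try {
    parsed = new URL(value);
  } catch {
    return false; // not a parsable absolute URL
  }
  return parsed.protocol === "http:" || parsed.protocol === "https:";
}

console.log(isAllowedBaseUrl("http://host.docker.internal:11434/v1")); // true
console.log(isAllowedBaseUrl("file:///etc/passwd")); // false
console.log(isAllowedBaseUrl("")); // true
```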

…e gates

ai-provider.ts:

- Drop the module-level singleton cache. The OpenAI SDK is cheap to construct and the cache made env changes / hot-reloads / tests carry stale clients with no path to recreate.
- Drop the cross-provider apiKey fallback. Previously, setting AI_BASE_URL without AI_API_KEY would silently send the configured GROQ_API_KEY or OPENAI_API_KEY over the wire to the new endpoint. Now AI_API_KEY is required explicitly when AI_BASE_URL is set; same for STT_API_KEY.
- Throw clear errors when AI_BASE_URL is set without the required AI_API_KEY or AI_MODEL (and STT analogue). The previous code would silently default AI_MODEL to "gpt-4o-mini" and let Ollama/vLLM return an opaque 404 inside the workflow step.
- Set explicit timeouts (120s chat, 300s STT) and maxRetries: 2 on the OpenAI client. The SDK default of 600s would hang workflow steps for 10 minutes on a stuck local inference call; the retry restores the resilience that the previous Groq->OpenAI fallback used to provide.
- Add isAiConfigured() / isSttConfigured() helpers as the single source of truth for "is any chat / STT provider available?" so the OR-chains in trigger gates don't drift the next time a provider type lands.

workflows/transcribe.ts:

- Drop the `as unknown as string` cast on the OpenAI SDK transcription response. With `response_format: "vtt" as const` the SDK's overload narrows to string at compile time; the unsafe cast was hiding that.
- Strengthen the WebVTT smoke check from a substring search for "WEBVTT" to a structural check (`/^WEBVTT/m` header line plus a cue arrow `-->`). The substring form would both reject valid VTT without the header and accept SRT or other formats that happened to contain the word.

workflows/generate-ai.ts:

- Request `response_format: { type: "json_object" }` on chat completions. Every prompt already instructs "Return ONLY valid JSON"; modern OpenAI-compatible providers (OpenAI, Groq, Ollama, vLLM, OpenRouter) enforce that with this flag, which materially reduces parse failures on smaller local models. A try/catch falls back to plain mode when the underlying provider rejects the field, keeping niche gateways compatible.

Trigger gates consolidated via isAiConfigured() / isSttConfigured():

- apps/web/actions/videos/get-status.ts (share-page auto-trigger, both transcription and AI-generation paths)
- apps/web/lib/transcribe.ts (lib entry point)
- apps/web/lib/generate-ai.ts (lib entry point)
- apps/web/workflows/generate-ai.ts (validateAndSetProcessing step; also aligns the second-check error message with the first)
- apps/web/workflows/transcribe.ts (validateVideo step)
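
The JSON-mode request with a plain-mode fallback described for workflows/generate-ai.ts might look roughly like the sketch below. It uses a small injected `ChatFn` instead of the real SDK client so it runs without network access; `ChatRequest`, `ChatFn`, `chatPreferJson`, and `pickyGateway` are all hypothetical names, and treating every first-attempt error as "the provider rejected the field" is a simplifying assumption.

```typescript
// Minimal request type for the sketch; not the SDK's own types.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  response_format?: { type: "json_object" };
}
type ChatFn = (request: ChatRequest) => Promise<string>;

// Try JSON mode first; if that call fails, retry once without response_format.
async function chatPreferJson(chat: ChatFn, request: ChatRequest): Promise<string> {
  try {
    return await chat({ ...request, response_format: { type: "json_object" } });
  } catch {
    // Assumption: any error here is treated as "field not supported".
    return chat(request);
  }
}

// Fake gateway that rejects response_format, standing in for a niche provider.
const pickyGateway: ChatFn = async (request) => {
  if (request.response_format) throw new Error("unknown field: response_format");
  return '{"title":"Demo"}';
};

chatPreferJson(pickyGateway, {
  model: "gemma3:12b",
  messages: [{ role: "user", content: "Return ONLY valid JSON" }],
}).then((text) => console.log(JSON.parse(text).title));
```

One trade-off of catching every error is that a genuine outage costs a second request before failing; inspecting the error status before retrying would narrow that, at the price of provider-specific handling.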

The previous PR commit dropped the cross-provider failover and described maxRetries: 2 as a replacement; that was wrong. maxRetries only retries the same endpoint on transient errors; it does not preserve the prior Groq → OpenAI behavior for users who had both keys set as a true failover. This restores the prior semantics through the unified abstraction:

- New getAiFallbackClient() in ai-provider.ts returns an OpenAI client (with the same timeout / maxRetries settings) when both GROQ_API_KEY and OPENAI_API_KEY are set AND no AI_BASE_URL override is in effect. An explicit AI_BASE_URL means the user has chosen a specific provider; no implicit fallback is added in that case.
- callAiApi in workflows/generate-ai.ts wraps the primary call in a try/catch; on any primary failure, if a fallback client is available it retries once with OpenAI before propagating. JSON-mode handling is applied to both legs via a shared invokeChat helper.
- A console.warn surfaces the fallback so the failure is observable.
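
The restored failover can be sketched as two small pieces: a predicate for when an implicit fallback exists, and a wrapper that retries once on the fallback. The names `hasImplicitFallback` and `callWithFallback` are hypothetical, as is the `FallbackEnv` type; the PR's own entry points are getAiFallbackClient() and callAiApi, whose exact code is not shown here.

```typescript
// Hypothetical env shape for the fallback decision.
interface FallbackEnv {
  AI_BASE_URL?: string;
  GROQ_API_KEY?: string;
  OPENAI_API_KEY?: string;
}

// A fallback exists only when both legacy keys are set and no explicit override is in effect.
function hasImplicitFallback(env: FallbackEnv): boolean {
  return !env.AI_BASE_URL && Boolean(env.GROQ_API_KEY) && Boolean(env.OPENAI_API_KEY);
}

type Call<T> = () => Promise<T>;

// Run the primary call; on any failure, retry once on the fallback if one is available.
async function callWithFallback<T>(
  primary: Call<T>,
  fallback: Call<T> | null,
  warn: (...args: unknown[]) => void = console.warn,
): Promise<T> {
  try {
    return await primary();
  } catch (primaryError) {
    if (!fallback) throw primaryError;
    // Surface the failover so it is observable in logs.
    warn("primary AI provider failed, retrying once with fallback", primaryError);
    return fallback();
  }
}

// Example: the primary leg fails, the fallback leg answers.
callWithFallback<string>(
  () => Promise.reject(new Error("groq unavailable")),
  () => Promise.resolve("summary from fallback"),
).then((result) => console.log(result));
```

Unlike maxRetries, which repeats the same endpoint, this retries against a different provider, which is the distinction the commit message draws.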

kovashikawa added 4 commits May 31, 2026 11:32

superagent-security Bot added the pr:flagged label (PR flagged for review by security analysis) May 31, 2026

superagent-security Bot reviewed May 31, 2026

Comment thread apps/web/lib/ai-provider.ts

greptile-apps Bot reviewed May 31, 2026

Comment thread apps/web/workflows/generate-ai.ts

Comment thread apps/web/lib/ai-provider.ts Outdated

kovashikawa added 2 commits May 31, 2026 20:35

superagent-security Bot removed the pr:flagged label (PR flagged for review by security analysis) Jun 1, 2026

kovashikawa wants to merge 7 commits into CapSoftware:main from kovashikawa:feat/openai-compatible-ai-providers
