
feat(web): unify AI providers behind OpenAI-compatible config (Ollama, Whisper, etc.) #1877

Open
kovashikawa wants to merge 7 commits into CapSoftware:main from kovashikawa:feat/openai-compatible-ai-providers

@kovashikawa kovashikawa commented May 31, 2026

Summary

  • Replace three inconsistent AI call patterns with a single OpenAI-compatible client abstraction.
  • Self-hosters can run AI features fully locally (Ollama for chat, faster-whisper-server for STT) with zero paid services.
  • Existing Groq / OpenAI / Deepgram installs are unchanged — all current behavior preserved by default.

Depends on #1874. That proxy fix is the prerequisite for any workflow to execute on a self-host. Without #1874 merged, this PR's transcription path queues but never runs (the workflow runtime's /.well-known/workflow/v1/* HTTP callbacks get 307→/login). Once #1874 lands, this PR rebases cleanly onto main with no further changes.

Root cause

Self-hosted Cap currently requires three paid third-party providers (Deepgram + Groq/OpenAI) to make the share-page AI features (summary, chapters, transcript) work. Even with paid keys, the three calls use three inconsistent code paths:

| Concern | Current SDK / call | File |
| --- | --- | --- |
| Summary (primary) | groq-sdk chat completion | apps/web/lib/groq-client.ts |
| Summary (fallback) | hand-rolled fetch to api.openai.com | apps/web/workflows/generate-ai.ts:391 |
| Transcription | @deepgram/sdk transcribeFile | apps/web/workflows/transcribe.ts:342 |

@xenova in #1356: "any interest in using a local model for speech transcription? 👀"

PR #1705 already ships local STT in the desktop app (Parakeet). Local models are part of Cap's stance — just not on the server side, yet.

Fix

Collapse the three call patterns into one OpenAI-compatible client abstraction, configured by env. The OpenAI API is the lingua franca that Groq, OpenAI, Ollama, vLLM, OpenRouter, LiteLLM, faster-whisper-server, and whisper.cpp's HTTP server all already speak.

Two concerns → two env triples (all optional, all default to existing behavior):

# Chat (summaries, chapters, titles)
AI_BASE_URL      # default: https://api.groq.com/openai/v1
AI_API_KEY       # default: $GROQ_API_KEY → $OPENAI_API_KEY
AI_MODEL         # default: openai/gpt-oss-120b

# Speech-to-text
STT_BASE_URL     # default unset → existing Deepgram path
STT_API_KEY
STT_MODEL        # used when STT_BASE_URL is set

Key simplification on the STT path: OpenAI's /v1/audio/transcriptions natively returns WebVTT (response_format: "vtt"). That's exactly what Cap already writes to S3, so when STT_BASE_URL is set the Deepgram-specific formatToWebVTT(DeepgramResult) adapter drops out — the rest of the pipeline is unchanged.
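
To make the request shape concrete, here is a minimal sketch of what an OpenAI-compatible STT call targets. `buildSttRequest` and `SttRequest` are hypothetical names used only for illustration; the PR itself goes through the `openai` SDK's `audio.transcriptions.create` rather than assembling the request by hand.

```typescript
// Hypothetical helper, not code from this PR: describes the endpoint and
// form fields an OpenAI-compatible transcription server expects.
interface SttRequest {
  url: string;
  fields: { model: string; response_format: "vtt" };
}

function buildSttRequest(baseUrl: string, model: string): SttRequest {
  // Strip trailing slashes so "http://host/v1/" and "http://host/v1" agree.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/audio/transcriptions`,
    // "vtt" asks the server to return WebVTT text, which can be stored as-is.
    fields: { model, response_format: "vtt" },
  };
}

const sttReq = buildSttRequest("http://faster-whisper:9000/v1/", "large-v3-turbo");
console.log(sttReq.url); // http://faster-whisper:9000/v1/audio/transcriptions
```

Because the base URL already ends in `/v1`, only `/audio/transcriptions` is appended.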

Commits are split into 4 logical groups, each typechecking independently:

  1. feat(env): add AI_*/STT_* env vars for OpenAI-compatible providers
  2. refactor(web): unify chat AI behind OpenAI-compatible client — drops groq-sdk, adds openai, deletes lib/groq-client.ts, migrates 3 call sites
  3. feat(web): OpenAI-compatible STT + widen self-host AI gates — new STT branch in transcribe workflow; widens 4 env-key trigger gates that were blocking self-hosters on local providers
  4. chore(docker): expose AI/STT env vars through compose files — 4 compose flavors

End-state for a fully-local self-host (FYI — not part of this PR's required setup):

# docker-compose.override.yml
services:
  cap-web:
    environment:
      AI_BASE_URL: http://host.docker.internal:11434/v1
      AI_API_KEY: ollama
      AI_MODEL: gemma3:12b
      STT_BASE_URL: http://faster-whisper:9000/v1
      STT_API_KEY: none
      STT_MODEL: large-v3-turbo

Backwards compatibility

If the new env vars are unset, behavior is identical to today: GROQ_API_KEY → Groq path; OPENAI_API_KEY → existing OpenAI fallback; DEEPGRAM_API_KEY → Deepgram. The Groq path now constructs an openai SDK client with baseURL = https://api.groq.com/openai/v1 — same wire protocol, no observable difference.

Verification

End-to-end test on a local Docker Compose self-host with Ollama (Gemma 3 12B) + hwdsl2/whisper-server (Whisper base), after applying both #1874 and this PR:

Before this PR (with #1874 applied alone — workflow runtime works)

  • Self-host with only AI_BASE_URL/STT_BASE_URL set (no Groq/Deepgram keys): the share page renders, transcribeVideo() is called, but trigger gates checking for DEEPGRAM_API_KEY/GROQ_API_KEY short-circuit before the workflow starts. transcriptionStatus stays NULL.

After this PR

  • Share-page render fires transcribeVideo() → workflow runs end-to-end:
    [ShareVideoPage] Starting transcription for video: t7tkmqev8a8ecbk
    [transcribeVideo] Triggering transcription workflow
    [transcribe] Probe result: audioCodec=aac, videoCodec=h264
    [transcribe] Extracted audio: 2126359 bytes
    [whisper] POST /v1/audio/transcriptions HTTP/1.1 200 OK
    
  • videos.transcriptionStatus = COMPLETE; transcription.vtt written directly to S3 from Whisper's response_format=vtt output (no format conversion).
  • AI generation auto-queues; Gemma 3 12B produces title + paragraph summary + 4 timestamped chapters; aiGenerationStatus = COMPLETE; metadata.summary and metadata.chapters populated.
  • Existing Groq-only config still works (smoke-tested by reverting env to only GROQ_API_KEY).

Gates clean for changed files: pnpm exec biome check --write, pnpm exec tsc -b, pnpm vitest run __tests__/unit/generate-ai-title.test.ts (6/6).

Out of scope (intentional)

  • @deepgram/sdk remains a dependency. Deepgram is the default STT when STT_BASE_URL is unset, so cap.so cloud is untouched. A follow-up could drop the SDK once STT_BASE_URL becomes the unified STT path.
  • The Anthropic + OpenAI raw-fetch fallbacks in apps/web/lib/messenger/agent.ts were NOT migrated — that file is the support chatbot with its own Anthropic → OpenAI → Groq fallback chain, separate domain.

Related

Design questions

  1. Naming: AI_* + STT_* (what I shipped) vs. LLM_* + STT_* vs. some other shape you'd prefer?
  2. Deepgram fate: keep as default branch indefinitely (current), or migrate cap.so cloud to OpenAI-compatible Whisper at some point and drop @deepgram/sdk?
  3. OpenAI raw-fetch fallback in messenger: also unify that path in a follow-up, or leave as-is given the separate fallback semantics?

Greptile Summary

This PR replaces three divergent AI call patterns (Groq SDK, raw OpenAI fetch, Deepgram SDK) with a single OpenAI-compatible client abstraction in lib/ai-provider.ts, gated by six new optional env vars (AI_BASE_URL/KEY/MODEL, STT_BASE_URL/KEY/MODEL). Existing deployments using GROQ_API_KEY / OPENAI_API_KEY / DEEPGRAM_API_KEY are unaffected by default.

  • lib/ai-provider.ts — new singleton factory for chat and STT clients; priority order is AI_BASE_URL → Groq → OpenAI.
  • workflows/generate-ai.ts — removes the Groq→OpenAI automatic failover that previously recovered from Groq errors when OPENAI_API_KEY was also present.
  • workflows/transcribe.ts — adds an OpenAI-compatible STT branch that posts audio and receives WebVTT directly, bypassing the Deepgram formatter.
  • lib/messenger/agent.ts — the PR states this file was out of scope, but callGroq was replaced with callAiProvider using getAiClient(), meaning the support chatbot's last-resort fallback now routes through whatever provider AI_BASE_URL points to (e.g. a local Ollama instance).

Confidence Score: 3/5

Safe to merge for new self-hosted deployments; two unintended behavioral changes affect existing dual-key setups and the support chatbot routing.

Two issues need resolution before merge. First, workflows/generate-ai.ts silently drops the Groq→OpenAI failover — users with both keys set lose automatic recovery from Groq downtime. Second, lib/messenger/agent.ts was migrated despite the PR explicitly calling it out of scope, so the support chatbot's final fallback now routes through any AI_BASE_URL-configured provider (e.g. a local Ollama instance), which may produce unsuitable responses for a customer support context.

apps/web/workflows/generate-ai.ts (failover removal) and apps/web/lib/messenger/agent.ts (unintended migration of the support chatbot)

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/web/lib/ai-provider.ts | New unified OpenAI-compatible client abstraction; module-level singletons work for production but could silently retain a stale client across test/dev reloads. |
| apps/web/lib/messenger/agent.ts | callGroq renamed to callAiProvider and now routes through getAiClient(); this contradicts the PR's "out of scope" statement and silently routes the support chatbot through any configured AI_BASE_URL provider. |
| apps/web/workflows/generate-ai.ts | Migrated to unified client; removes the Groq→OpenAI automatic failover that existed for dual-key setups, making AI generation workflows non-resilient to provider errors. |
| apps/web/workflows/transcribe.ts | Adds OpenAI-compatible STT path via transcribeViaSttProvider; the as unknown as string cast and VTT validation are functional but the cast rationale is undocumented. |
| packages/env/server.ts | Adds 6 new optional env vars (AI_BASE_URL, AI_API_KEY, AI_MODEL, STT_BASE_URL, STT_API_KEY, STT_MODEL) with clear descriptions; all optional with no breaking schema changes. |
| apps/web/actions/videos/get-status.ts | Widens trigger gates to accept STT_BASE_URL and AI_BASE_URL alongside legacy keys; straightforward additive guard changes. |

Comments Outside Diff (2)

  1. apps/web/lib/messenger/agent.ts, line 168-191 (link)

    P1 Messenger agent was migrated — contradicts PR description

    The PR's "Out of scope" section explicitly states: "The Anthropic + OpenAI raw-fetch fallbacks in apps/web/lib/messenger/agent.ts were NOT migrated." But callGroq has been renamed to callAiProvider and now calls getAiClient(). For any self-hosted deployment that sets AI_BASE_URL (the stated target of this PR) but has neither ANTHROPIC_API_KEY nor OPENAI_API_KEY, the support chatbot's last resort will now be a local Ollama/vLLM instance. A local Gemma model answering customer support queries is likely not the intended behavior, and it contradicts the stated out-of-scope decision.

  2. apps/web/workflows/transcribe.ts, line 536-541 (link)

    P2 Double-cast through unknown to string

    The OpenAI Node SDK v4 overloads audio.transcriptions.create — when response_format is "vtt", "srt", or "text" the runtime value is a plain string, but the TypeScript generic signature falls through to the Transcription type. The as unknown as string workaround is valid here, but leaving a comment explaining why the cast is needed would help the next reader avoid accidentally "fixing" it by removing the cast (which would cause a type error on .includes("WEBVTT")).

Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
apps/web/lib/messenger/agent.ts:168-191
**Messenger agent was migrated — contradicts PR description**

The PR's "Out of scope" section explicitly states: *"The Anthropic + OpenAI raw-fetch fallbacks in `apps/web/lib/messenger/agent.ts` were NOT migrated."* But `callGroq` has been renamed to `callAiProvider` and now calls `getAiClient()`. For any self-hosted deployment that sets `AI_BASE_URL` (the stated target of this PR) but has neither `ANTHROPIC_API_KEY` nor `OPENAI_API_KEY`, the support chatbot's last resort will now be a local Ollama/vLLM instance. A local Gemma model answering customer support queries is likely not the intended behavior, and it contradicts the stated out-of-scope decision.

### Issue 2 of 4
apps/web/workflows/generate-ai.ts:396-400
**Groq → OpenAI automatic failover silently dropped**

The old `callAiApi` tried Groq first and, on any exception, retried with OpenAI if `OPENAI_API_KEY` was also set. That resilience path is gone: if the configured provider (e.g. Groq) returns an error, the whole `generateAiWorkflow` now throws with no recovery. For users who had both `GROQ_API_KEY` and `OPENAI_API_KEY` set (the documented fallback setup that motivated the original dual-path code), a transient Groq outage will now fail the AI generation step entirely instead of falling back gracefully.

### Issue 3 of 4
apps/web/workflows/transcribe.ts:536-541
**Double-cast through `unknown` to `string`**

The OpenAI Node SDK v4 overloads `audio.transcriptions.create` — when `response_format` is `"vtt"`, `"srt"`, or `"text"` the runtime value is a plain string, but the TypeScript generic signature falls through to the `Transcription` type. The `as unknown as string` workaround is valid here, but leaving a comment explaining why the cast is needed would help the next reader avoid accidentally "fixing" it by removing the cast (which would cause a type error on `.includes("WEBVTT")`).

### Issue 4 of 4
apps/web/lib/ai-provider.ts:11-15
**Module-level singleton not reset between test runs**

`aiClient` and `sttClient` are module-level `let` variables that cache the first-constructed client for the lifetime of the process. In tests (and in hot-reload dev servers) where `serverEnv()` may differ between invocations or env vars are set after module load, `getAiClient()` will keep returning the first client even if the environment has changed. The old `groq-client.ts` had the same pattern — but with a unified provider this is now the single gate for both chat and STT. The existing unit test works because it mocks the module; this is fine for production, but worth a comment so the caching intent is explicit.

Reviews (1): Last reviewed commit: "chore(docker): expose AI/STT env vars th..."

Greptile also left 2 inline comments on this PR.

Adds six optional server env vars:
- AI_BASE_URL / AI_API_KEY / AI_MODEL — chat completions provider
- STT_BASE_URL / STT_API_KEY / STT_MODEL — speech-to-text provider

All are .optional() so existing self-hosters running Groq/OpenAI/Deepgram
are unaffected. Subsequent commits wire them into the AI client and the
transcription workflow.

Replaces the groq-sdk wrapper and the hand-rolled fetch-to-OpenAI fallback
with a single `openai` SDK client in apps/web/lib/ai-provider.ts. The new
getAiClient()/getAiModel() resolve, in order:

1. AI_BASE_URL + AI_API_KEY + AI_MODEL (any OpenAI-compatible provider:
   Ollama, vLLM, OpenRouter, LiteLLM, etc.)
2. GROQ_API_KEY (existing default, baseURL pinned to Groq, model
   preserved as openai/gpt-oss-120b)
3. OPENAI_API_KEY (default OpenAI endpoint, model preserved as
   gpt-4o-mini)
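
The three-tier order above can be sketched as a pure function. This is an illustration only: `resolveChatProvider`, `ChatEnv`, and `ChatProvider` are hypothetical names, the env is passed in as a plain object rather than read from the real env module, and what happens when `AI_BASE_URL` is set without a key or model is deliberately left open here.

```typescript
// Illustrative sketch of the resolution order; not the PR's ai-provider.ts.
interface ChatEnv {
  AI_BASE_URL?: string;
  AI_API_KEY?: string;
  AI_MODEL?: string;
  GROQ_API_KEY?: string;
  OPENAI_API_KEY?: string;
}

interface ChatProvider {
  source: "override" | "groq" | "openai";
  baseURL: string | undefined; // undefined means the SDK's default OpenAI endpoint
  apiKey: string | undefined;
  model: string | undefined;
}

function resolveChatProvider(env: ChatEnv): ChatProvider | null {
  // 1. Explicit override: any OpenAI-compatible endpoint (Ollama, vLLM, ...).
  if (env.AI_BASE_URL) {
    return {
      source: "override",
      baseURL: env.AI_BASE_URL,
      apiKey: env.AI_API_KEY,
      model: env.AI_MODEL,
    };
  }
  // 2. Existing default: Groq, with baseURL pinned and the model preserved.
  if (env.GROQ_API_KEY) {
    return {
      source: "groq",
      baseURL: "https://api.groq.com/openai/v1",
      apiKey: env.GROQ_API_KEY,
      model: "openai/gpt-oss-120b",
    };
  }
  // 3. OpenAI on its default endpoint.
  if (env.OPENAI_API_KEY) {
    return {
      source: "openai",
      baseURL: undefined,
      apiKey: env.OPENAI_API_KEY,
      model: "gpt-4o-mini",
    };
  }
  return null; // no chat provider configured
}

console.log(resolveChatProvider({ GROQ_API_KEY: "g", OPENAI_API_KEY: "o" })?.source); // groq
```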

Call sites migrated:
- apps/web/workflows/generate-ai.ts: drops the duplicate callOpenAi raw
  fetch in favor of the unified client; signatures threaded
- apps/web/actions/videos/translate-transcript.ts
- apps/web/lib/messenger/agent.ts (Groq branch only; Anthropic/OpenAI
  fallbacks untouched — separate domain, kept out of scope)

Dependency change: -groq-sdk, +openai. Behavior for existing installs
is unchanged because the Groq path now constructs an OpenAI client with
baseURL = https://api.groq.com/openai/v1 — same wire protocol.

Transcription workflow:
- When STT_BASE_URL is set, transcribeAudio dispatches to a new
  transcribeViaSttProvider() that calls openai.audio.transcriptions.create
  with response_format: "vtt". The OpenAI Whisper API returns WebVTT
  directly, which is exactly what Cap's pipeline writes to S3, so the
  Deepgram-specific formatToWebVTT(DeepgramResult) adapter drops out
  on this path.
- Default behavior (STT_BASE_URL unset) still uses Deepgram. Existing
  installs are unaffected.

Trigger-gate widenings (these were the blockers preventing self-hosters
on local providers from ever firing the workflow):
- apps/web/lib/transcribe.ts: accept STT_BASE_URL as a valid provider
- apps/web/lib/generate-ai.ts: accept AI_BASE_URL as a valid provider
- apps/web/actions/videos/get-status.ts: same widenings in the
  share-page auto-trigger paths for both transcription and AI generation

The default docker-compose.yml did not pass DEEPGRAM/GROQ/OPENAI env
vars through to cap-web, which is part of why self-host AI was silently
broken — even when users set the keys in .env, they never reached the
container. This commit threads them through along with the new
AI_*/STT_* triples in all four compose flavors:

- docker-compose.yml (default)
- docker-compose.template.yml
- docker-compose.coolify.yml
- docker-compose.coolify.env.example
@superagent-security superagent-security Bot added the pr:flagged PR flagged for review by security analysis. label May 31, 2026

@superagent-security superagent-security Bot left a comment

Superagent found 1 security concern(s).

Comment thread apps/web/lib/ai-provider.ts
Comment thread apps/web/workflows/generate-ai.ts
Comment thread apps/web/lib/ai-provider.ts Outdated
…scheme

- AI_BASE_URL and STT_BASE_URL now refuse non-http(s) schemes (defense
  against typos like `file://` or `gopher://`). Empty string still passes
  for compose default `${VAR:-}`.
- Doc strings updated to make the requirement contract explicit
  (AI_API_KEY and AI_MODEL are required when AI_BASE_URL is set; same for
  STT_*). The previous wording said AI_API_KEY "falls back to GROQ_API_KEY
  or OPENAI_API_KEY" — that fallback is removed in the next commit
  because it could silently send a paid cloud key to an arbitrary URL.
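
As a rough sketch of the scheme rule described above: the function below accepts `http://` and `https://` URLs plus the empty string, and rejects everything else. `isAllowedBaseUrl` is a hypothetical name, and the regex approach is an assumption; how the PR wires the check into the env schema is not shown in this conversation.

```typescript
// Hypothetical validator for AI_BASE_URL / STT_BASE_URL values.
function isAllowedBaseUrl(value: string): boolean {
  // An empty string must pass so an unset compose default still validates.
  if (value === "") return true;
  // Only http and https are accepted; file://, gopher://, etc. are refused.
  return /^https?:\/\//i.test(value);
}

console.log(isAllowedBaseUrl("http://localhost:11434/v1")); // true
console.log(isAllowedBaseUrl("file:///etc/passwd"));        // false
```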
…e gates

ai-provider.ts:
- Drop the module-level singleton cache. The OpenAI SDK is cheap to
  construct and the cache made env changes / hot-reloads / tests carry
  stale clients with no path to recreate.
- Drop the cross-provider apiKey fallback. Previously, setting AI_BASE_URL
  without AI_API_KEY would silently send the configured GROQ_API_KEY or
  OPENAI_API_KEY over the wire to the new endpoint. Now AI_API_KEY is
  required explicitly when AI_BASE_URL is set; same for STT_API_KEY.
- Throw clear errors when AI_BASE_URL is set without the required
  AI_API_KEY or AI_MODEL (and STT analogue). The previous code would
  silently default AI_MODEL to "gpt-4o-mini" and let Ollama/vLLM return
  an opaque 404 inside the workflow step.
- Set explicit timeouts (120s chat, 300s STT) and maxRetries: 2 on the
  OpenAI client. The SDK default of 600s would hang workflow steps for
  10 minutes on a stuck local inference call; the retry restores the
  resilience that the previous Groq->OpenAI fallback used to provide.
- Add isAiConfigured() / isSttConfigured() helpers as the single source
  of truth for "is any chat / STT provider available?" so the OR-chains
  in trigger gates don't drift the next time a provider type lands.
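
The stricter contract and the gate helpers might look roughly like this. Treat it as a sketch under assumptions: the helpers here take the env as an argument for testability (the real ones presumably read the server env directly), `assertOverrideComplete` and the error wording are invented for illustration, and only the timeout and retry values come from the commit message.

```typescript
interface ProviderEnv {
  AI_BASE_URL?: string; AI_API_KEY?: string; AI_MODEL?: string;
  STT_BASE_URL?: string; STT_API_KEY?: string; STT_MODEL?: string;
  GROQ_API_KEY?: string; OPENAI_API_KEY?: string; DEEPGRAM_API_KEY?: string;
}

// Values stated in the commit message; `timeout` is in milliseconds.
const chatClientOptions = { timeout: 120_000, maxRetries: 2 };
const sttClientOptions = { timeout: 300_000, maxRetries: 2 };

// Hypothetical check: an override URL must come with its own key and model,
// so a cloud key is never silently reused against an arbitrary endpoint.
function assertOverrideComplete(env: ProviderEnv, prefix: "AI" | "STT"): void {
  const baseUrl = prefix === "AI" ? env.AI_BASE_URL : env.STT_BASE_URL;
  if (!baseUrl) return; // no override, nothing to validate
  const apiKey = prefix === "AI" ? env.AI_API_KEY : env.STT_API_KEY;
  const model = prefix === "AI" ? env.AI_MODEL : env.STT_MODEL;
  if (!apiKey) throw new Error(`${prefix}_API_KEY is required when ${prefix}_BASE_URL is set`);
  if (!model) throw new Error(`${prefix}_MODEL is required when ${prefix}_BASE_URL is set`);
}

// Single source of truth for "is any chat provider available?"
function isAiConfigured(env: ProviderEnv): boolean {
  return Boolean(env.AI_BASE_URL || env.GROQ_API_KEY || env.OPENAI_API_KEY);
}

// Single source of truth for "is any STT provider available?"
function isSttConfigured(env: ProviderEnv): boolean {
  return Boolean(env.STT_BASE_URL || env.DEEPGRAM_API_KEY);
}
```

Trigger gates can then call `isAiConfigured(env)` instead of repeating an OR-chain of individual keys.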

workflows/transcribe.ts:
- Drop the `as unknown as string` cast on the OpenAI SDK transcription
  response. With `response_format: "vtt" as const` the SDK's overload
  narrows to string at compile time; the unsafe cast was hiding that.
- Strengthen the WebVTT smoke check from a substring search for
  "WEBVTT" to a structural check (`/^WEBVTT/m` header line plus a cue
  arrow `-->`). The substring form would both reject valid VTT without
  the header and accept SRT or other formats that happened to contain
  the word.
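
A minimal version of the structural check described above, assuming it is nothing more than the two conditions named in the commit message. `looksLikeWebVtt` is a hypothetical name.

```typescript
// Sketch of the WebVTT smoke check: a line starting with "WEBVTT"
// plus at least one cue arrow.
function looksLikeWebVtt(text: string): boolean {
  const hasHeaderLine = /^WEBVTT/m.test(text);
  const hasCueArrow = text.includes("-->");
  return hasHeaderLine && hasCueArrow;
}

const vttSample = "WEBVTT\n\n00:00:00.000 --> 00:00:01.000\nhello";
const srtSample = "1\n00:00:00,000 --> 00:00:01,000\nThis mentions WEBVTT";

console.log(looksLikeWebVtt(vttSample)); // true
console.log(looksLikeWebVtt(srtSample)); // false
```

The SRT sample contains both the word and a cue arrow, so a plain substring search would have accepted it; the anchored header match does not.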

workflows/generate-ai.ts:
- Request `response_format: { type: "json_object" }` on chat completions.
  Every prompt already instructs "Return ONLY valid JSON"; modern
  OpenAI-compatible providers (OpenAI, Groq, Ollama, vLLM, OpenRouter)
  enforce that with this flag, which materially reduces parse failures
  on smaller local models. A try/catch falls back to plain mode when
  the underlying provider rejects the field, keeping niche gateways
  compatible.
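
The JSON-mode fallback could be sketched as follows. To keep the example self-contained, the provider call is injected as a function and the request is reduced to a single `prompt` string; `ChatParams`, `ChatFn`, and `chatPreferJson` are hypothetical names, and unlike the real code this sketch retries in plain mode on any error, not only when the provider rejects the field.

```typescript
type ChatParams = {
  model: string;
  prompt: string;
  response_format?: { type: "json_object" };
};

// Stand-in for a chat-completions call that resolves to the reply text.
type ChatFn = (params: ChatParams) => Promise<string>;

async function chatPreferJson(chat: ChatFn, model: string, prompt: string): Promise<string> {
  try {
    // First attempt: ask the provider to enforce JSON output.
    return await chat({ model, prompt, response_format: { type: "json_object" } });
  } catch {
    // Simplification: any failure triggers one retry without the field.
    return await chat({ model, prompt });
  }
}
```

A gateway that does not understand `response_format` therefore costs one extra request per call rather than a failed workflow step.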

Trigger gates consolidated via isAiConfigured() / isSttConfigured():
- apps/web/actions/videos/get-status.ts (share-page auto-trigger, both
  transcription and AI-generation paths)
- apps/web/lib/transcribe.ts (lib entry point)
- apps/web/lib/generate-ai.ts (lib entry point)
- apps/web/workflows/generate-ai.ts (validateAndSetProcessing step;
  also aligns the second-check error message with the first)
- apps/web/workflows/transcribe.ts (validateVideo step)
@superagent-security superagent-security Bot removed the pr:flagged PR flagged for review by security analysis. label Jun 1, 2026
The previous PR commit dropped the cross-provider failover and described
maxRetries: 2 as a replacement — that was wrong. maxRetries only retries
the same endpoint on transient errors; it does not preserve the prior
Groq → OpenAI behavior for users who had both keys set as a true failover.

This restores the prior semantics through the unified abstraction:
- New getAiFallbackClient() in ai-provider.ts returns an OpenAI client
  (with the same timeout / maxRetries settings) when both GROQ_API_KEY
  and OPENAI_API_KEY are set AND no AI_BASE_URL override is in effect.
  An explicit AI_BASE_URL means the user has chosen a specific provider;
  no implicit fallback is added in that case.
- callAiApi in workflows/generate-ai.ts wraps the primary call in a
  try/catch; on any primary failure, if a fallback client is available
  it retries once with OpenAI before propagating. JSON-mode handling is
  applied to both legs via a shared invokeChat helper.
- A console.warn surfaces the fallback so the failure is observable.
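
The restored failover semantics can be illustrated with two small functions. This is a sketch, not the PR's code: `hasImplicitFallback` stands in for the condition under which `getAiFallbackClient()` returns a client, `callWithFallback` stands in for the try/catch in `callAiApi`, and the warning text is invented.

```typescript
interface FailoverEnv {
  AI_BASE_URL?: string;
  GROQ_API_KEY?: string;
  OPENAI_API_KEY?: string;
}

// A fallback exists only when both legacy keys are set and the user has
// not chosen a specific provider via AI_BASE_URL.
function hasImplicitFallback(env: FailoverEnv): boolean {
  return !env.AI_BASE_URL && Boolean(env.GROQ_API_KEY) && Boolean(env.OPENAI_API_KEY);
}

async function callWithFallback<T>(
  primary: () => Promise<T>,
  fallback: (() => Promise<T>) | null,
): Promise<T> {
  try {
    return await primary();
  } catch (err) {
    if (!fallback) throw err; // nothing to fall back to: propagate
    console.warn("[ai] primary provider failed, retrying once with fallback");
    return await fallback(); // a failure here propagates to the caller
  }
}
```

This differs from `maxRetries`, which only repeats the request against the same endpoint.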