There's a gap between ephemeral prompt caching (5min/1h TTL) and fine-tuning. For apps with a large, stable system context (~50-100K tokens) and moderate but irregular traffic, neither option fits well:
- Prompt caching works during peak hours but the cache expires during off-hours, forcing repeated cold writes when traffic resumes
- Fine-tuning is overkill when the context is just static reference data, not behavioral changes
- Dedicated instances ($1K+/month) are way too expensive for small-to-medium apps
Example use case: A customer support agent for a SaaS product. The system context holds ~80K tokens of product docs, pricing rules, refund policies, and edge cases. Traffic is high during business hours (Europe), dead at night. Every morning the cache is cold again — full write cost, higher latency on first requests.
Proposal
A "persistent context slot" — a pre-loaded context that doesn't expire, for a flat monthly fee:
- Upload context once via API (or a dashboard)
- Choose which models to pre-load it on (Haiku, Sonnet, Opus — each separately)
- Send requests with just the user message plus a `context_id` — no need to re-send the context every time
- No TTL, no cache misses, no cold starts
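To make the proposed flow concrete, here is a minimal sketch of what the request shapes could look like. Everything in it is hypothetical: the endpoint semantics, field names (`context_id`, `context`), and the id format are invented for illustration and are not part of any real Anthropic API.

```python
import json

# Step 1 (one-time): upload the large static context and pre-load it
# on a chosen model. Hypothetical payload shape.
create_slot = {
    "model": "claude-sonnet",  # pre-loaded per model, chosen by the user
    "context": "<~80K tokens of product docs, pricing rules, policies>",
}

# The service would respond with a stable identifier (placeholder value):
context_id = "ctx_abc123"

# Step 2 (every request): send only the user message plus the id.
request_body = {
    "model": "claude-sonnet",
    "context_id": context_id,  # stands in for the full 80K-token context
    "messages": [
        {"role": "user", "content": "How do refunds work for annual plans?"}
    ],
}

# The per-request payload stays tiny regardless of context size.
print(len(json.dumps(request_body)), "bytes on the wire")
```

The point of the shape is that the expensive part (the context) crosses the wire exactly once, and every subsequent request is a few hundred bytes.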
Suggested pricing (as a user, this is what I'd be willing to pay)
| Context size | Haiku | Sonnet | Opus |
|---|---|---|---|
| 50K tokens | $5/mo | $13/mo | $20/mo |
| 100K tokens | $10/mo | $25/mo | $40/mo |
| 250K tokens | $25/mo | $60/mo | $95/mo |
Output tokens billed at normal per-request rates. The slot only covers the stored context.
Why this matters
Today, keeping 100K tokens warm on Opus with a 1-hour cache costs ~$36/month in refresh writes alone, plus ~$0.05 per request in cache-read charges. At 1,000 requests/day, that's ~$1,500/month in cache reads. Most small-to-medium apps can't justify that.
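The arithmetic behind the ~$1,500/month figure, using the post's own per-request numbers (the $0.05 cache-read cost and $40 slot fee are taken from above, not from any published price list):

```python
# Monthly cache-read spend under ephemeral caching
cache_read_per_request = 0.05   # $ per request (figure from the post)
requests_per_day = 1000
days_per_month = 30
monthly_reads = cache_read_per_request * requests_per_day * days_per_month
print(f"${monthly_reads:,.0f}/month in cache reads")  # prints $1,500/month

# Versus the proposed flat fee for a 100K-token slot on Opus
slot_fee = 40                   # $/month (from the pricing table above)
print(f"ratio: {monthly_reads / slot_fee:.1f}x")
```

Even ignoring the ~$36/month in cold refresh writes, the flat slot fee undercuts the per-request cache reads by well over an order of magnitude at this traffic level.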
A persistent slot at $10-40/month would unlock a whole category of apps that don't exist today because the economics don't work with ephemeral caching. It also creates predictable MRR (vs volatile pay-per-use) and natural lock-in (context becomes a hosted asset).
Happy to discuss further.