Commit 96cf4dd

waleedlatif1 and claude committed
fix(knowledge): match chunk tokenizer to KB embedding provider
Cursor bugbot: createChunk and updateChunk hardcoded the 'openai' tokenizer when computing the stored tokenCount. For KBs using gemini-embedding-001 the count was estimated with the wrong heuristic, leading to inaccurate stored counts (and any billing derived from them). Now derive the tokenizer from the KB's embedding model provider, matching the search route.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 57589cb commit 96cf4dd

1 file changed

Lines changed: 20 additions & 4 deletions

apps/sim/lib/knowledge/chunks/service.ts

@@ -11,9 +11,19 @@ import type {
   ChunkQueryResult,
   CreateChunkData,
 } from '@/lib/knowledge/chunks/types'
+import { getEmbeddingModelInfo } from '@/lib/knowledge/embedding-models'
 import { generateEmbeddings } from '@/lib/knowledge/embeddings'
 import { estimateTokenCount } from '@/lib/tokenization/estimators'
 
+/**
+ * Map embedding model provider → tokenization provider id used by
+ * `estimateTokenCount`. Keeps stored token counts (and any cost computed
+ * from them) consistent with how the embedding provider tokenizes.
+ */
+function tokenizerProviderForEmbeddingModel(model: string): 'openai' | 'google' {
+  return getEmbeddingModelInfo(model).provider === 'gemini' ? 'google' : 'openai'
+}
+
 const logger = createLogger('ChunksService')
 
 /**
@@ -126,8 +136,11 @@ export async function createChunk(
     workspaceId
   )
 
-  // Calculate accurate token count
-  const tokenCount = estimateTokenCount(chunkData.content, 'openai')
+  // Calculate accurate token count using the tokenizer matching the KB's embedding provider.
+  const tokenCount = estimateTokenCount(
+    chunkData.content,
+    tokenizerProviderForEmbeddingModel(kbEmbeddingModel)
+  )
 
   const chunkId = generateId()
   const now = new Date()
@@ -385,8 +398,11 @@ export async function updateChunk(
       }
       const { embeddings } = await generateEmbeddings([content], chunkEmbeddingModel, workspaceId)
 
-      // Calculate accurate token count
-      const tokenCount = estimateTokenCount(content, 'openai')
+      // Calculate accurate token count using the tokenizer matching the KB's embedding provider.
+      const tokenCount = estimateTokenCount(
+        content,
+        tokenizerProviderForEmbeddingModel(chunkEmbeddingModel)
+      )
 
       dbUpdateData.content = content
       dbUpdateData.contentLength = newContentLength
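
To illustrate why the hardcoded 'openai' argument mattered, here is a minimal self-contained sketch of the provider mapping. It does not use the real `getEmbeddingModelInfo` or `estimateTokenCount`. The registry, the fallback to 'openai' for unknown models, and the characters-per-token ratios are all hypothetical stand-ins chosen only to make the difference visible.

```typescript
type EmbeddingProvider = 'openai' | 'gemini'
type TokenizerProvider = 'openai' | 'google'

// Hypothetical registry standing in for the real embedding-model metadata.
const EMBEDDING_MODEL_PROVIDERS: Record<string, EmbeddingProvider> = {
  'text-embedding-3-small': 'openai',
  'gemini-embedding-001': 'gemini',
}

// Stand-in for getEmbeddingModelInfo(model).provider.
// Assumption: unknown models fall back to 'openai'.
function lookupEmbeddingProvider(model: string): EmbeddingProvider {
  return EMBEDDING_MODEL_PROVIDERS[model] ?? 'openai'
}

// Same shape as the mapping in the diff: 'gemini' embeddings use the
// 'google' tokenizer id, everything else uses 'openai'.
function tokenizerProviderFor(model: string): TokenizerProvider {
  return lookupEmbeddingProvider(model) === 'gemini' ? 'google' : 'openai'
}

// Stand-in estimator. The ratios are made up for illustration and are
// NOT the heuristics used by the real estimateTokenCount.
const CHARS_PER_TOKEN: Record<TokenizerProvider, number> = {
  openai: 4,
  google: 5,
}

function estimateTokens(text: string, provider: TokenizerProvider): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN[provider])
}

const content = 'a'.repeat(100)

// Before the fix: always 'openai', regardless of the KB's embedding model.
const countBefore = estimateTokens(content, 'openai')

// After the fix: the tokenizer id is derived from the embedding model.
const countAfter = estimateTokens(content, tokenizerProviderFor('gemini-embedding-001'))

console.log({ countBefore, countAfter })
```

Under these assumed ratios the two counts differ for the same content, which is the kind of drift in the stored tokenCount that the commit message describes for Gemini-backed KBs.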
