
Commit 220f8c8

feat(knowledge): add chunking strategies and regex strict boundaries (#4368)
* feat(knowledge): add chunking strategies and regex strict boundaries
  - Add Token, Sentence, Recursive, and Regex chunkers with strategy selection in create-base modal
  - Add opt-in strict boundaries mode for regex chunker so each match becomes its own chunk
  - Add chunking strategies docs page with industry references

* fix(chunkers): strip capturing groups and validate strictBoundaries scope
  - Convert capturing groups to non-capturing in regex chunker so split() doesn't surface delimiter text as spurious chunks
  - Reject strictBoundaries in chunkingConfigSchema when strategy is not regex

* fix(chunkers): also strip named capture groups in regex patterns
  Named groups (?<name>...) are still capturing groups, so split() interleaves their matched text. Convert them to non-capturing alongside plain ( groups.

* fix(chunkers): exclude lookbehind from named-group rewrite
  Tighten NAMED_GROUP_PREFIX with a negative lookahead so patterns like (?<=<tag>) are not misidentified as named capture groups.
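The group-rewrite described in the fix commits above can be sketched as follows. This is a hypothetical reconstruction, not the actual source: the real constant `NAMED_GROUP_PREFIX` and the exact rewrite live in Sim's chunker, and a production version would also need to skip escaped `\(` and parens inside character classes.

```typescript
// Named groups `(?<name>...)` are capturing and must be rewritten, but
// lookbehind `(?<=...)` / `(?<!...)` must not be touched, hence the negative
// lookahead on `=` and `!` after `(?<`. (Hypothetical sketch; does not handle
// escaped `\(` or parens inside character classes.)
const NAMED_GROUP = /\(\?<(?![=!])[^>]+>/g // `(?<` not followed by `=` or `!`
const PLAIN_GROUP = /\((?!\?)/g            // `(` not starting a `(?...)` construct

function stripCapturingGroups(pattern: string): string {
  return pattern.replace(NAMED_GROUP, '(?:').replace(PLAIN_GROUP, '(?:')
}
```

With this rewrite, `split()` no longer interleaves the delimiter text matched by the groups into its result.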
1 parent 124fe17 commit 220f8c8

10 files changed

Lines changed: 468 additions & 4 deletions


Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
---
title: Chunking Strategies
description: How Sim splits documents into searchable chunks, and which strategy to pick for your content
---

import { FAQ } from '@/components/ui/faq'

Sim splits every uploaded document into chunks before generating embeddings. The strategy controls *where* those splits happen.

## How chunking works

Every chunker follows a two-phase pattern:

1. **Split** — break the document at boundaries (paragraphs, sentences, tokens, or a custom regex)
2. **Pack** — merge adjacent splits up to the maximum chunk size

This is documented in [LangChain's text splitter guide](https://python.langchain.com/docs/concepts/text_splitters/), which states the principle: *"no resulting merged split should exceed the designated chunk size."* LlamaIndex, Chonkie, and Unstructured follow the same convention.

The packing step is what keeps chunks roughly uniform. It also means a chunk usually spans multiple splits — a precise split boundary is not the same as a chunk boundary. Most "why is my regex not producing one chunk per match" surprises trace back to this.
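The two-phase pattern can be sketched in a few lines. This is an illustrative sketch, not Sim's implementation; it measures size in characters rather than tokens and omits overlap and minimum-size handling.

```typescript
// Phase 1: split at every boundary match.
// Phase 2: greedily merge adjacent splits until adding the next one
// would push the current chunk past maxSize.
function splitThenPack(text: string, boundary: RegExp, maxSize: number): string[] {
  const splits = text.split(boundary).filter((s) => s.length > 0)

  const chunks: string[] = []
  let current = ''
  for (const piece of splits) {
    if (current && current.length + piece.length > maxSize) {
      chunks.push(current)
      current = piece
    } else {
      current += piece
    }
  }
  if (current) chunks.push(current)
  return chunks
}
```

Note how a precise boundary still yields multi-split chunks: `splitThenPack('aa.bb.cc.dd', /(?<=\.)/, 6)` splits at every period yet packs the small pieces back together, returning two chunks rather than four.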

## Configuration shared by all strategies

| Setting | Unit | Default | Range | Description |
|---------|------|---------|-------|-------------|
| Max Chunk Size | tokens | 1,024 | 100–4,000 | Upper bound on chunk size. 1 token ≈ 4 characters. |
| Min Chunk Size | characters | 100 | 100–2,000 | Tiny fragments below this are dropped. |
| Overlap | tokens | 200 | 0–500 | Tokens repeated between adjacent chunks to preserve context. |

[Pinecone's chunking guide](https://www.pinecone.io/learn/chunking-strategies/) covers the tradeoffs in size and overlap.

## Strategies

### Auto

Sim inspects the file and routes to the right chunker:

- `.json`, `.jsonl`, `.yaml`, `.yml` → structural chunking (records are never split mid-way; small records may still be batched together up to the chunk size)
- `.csv`, `.xlsx`, `.xls`, `.tsv` → grouped by row, with headers preserved
- Everything else (`.pdf`, `.docx`, `.txt`, `.md`, `.html`, `.pptx`, …) → Text strategy

Routing is based on detected MIME type and content shape, not just the extension — a `.txt` file containing valid JSON is still routed structurally.

Pick Auto unless you've confirmed it isn't producing the chunks you want.
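Extension-plus-content routing can be sketched as below. The `detectStrategy` helper is hypothetical; the real router also inspects the detected MIME type, and it only illustrates the idea that a `.txt` body that parses as JSON is routed structurally.

```typescript
type Route = 'structural' | 'tabular' | 'text'

// Hypothetical sketch: extension check first, then a content-shape check.
function detectStrategy(filename: string, content: string): Route {
  const ext = filename.toLowerCase().split('.').pop() ?? ''
  if (['json', 'jsonl', 'yaml', 'yml'].includes(ext)) return 'structural'
  if (['csv', 'xlsx', 'xls', 'tsv'].includes(ext)) return 'tabular'
  // Content-shape check: a body that parses as JSON is treated structurally
  // regardless of extension.
  try {
    JSON.parse(content)
    return 'structural'
  } catch {
    return 'text'
  }
}
```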

### Text

Hierarchical splitter that walks down a separator list: horizontal rules → markdown headings → paragraphs (`\n\n`) → lines (`\n`) → sentence punctuation (`. ! ?`) → clause punctuation (`; ,`) → spaces. It tries the largest separator first and falls back when a piece is still too large.

Same algorithm as LangChain's [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/concepts/text_splitters/#text-structured-based), the de facto standard for prose.

Use it for general prose.

### Recursive

Same algorithm as Text, but you supply your own separator hierarchy or pick a built-in recipe (`plain`, `markdown`, `code`).

The recipe pattern comes from [Chonkie](https://github.com/chonkie-inc/chonkie), which ships pre-built separator sets for common content types.

Use Recursive when your content has structural markers the default Text separators miss — splitting code on `\nclass `, `\nfunction `, then `\n\n`, for example.
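The descent through a separator hierarchy can be sketched like this (illustrative only: the separator lists are examples, and the pack phase would still merge the resulting pieces back up to the chunk size afterwards):

```typescript
// Try the first separator; any piece still over maxSize falls through to the
// next separator in the list. When the list is exhausted, hard-split.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize) return [text]
  const [sep, ...rest] = separators
  if (sep === undefined) {
    // No separators left: hard-split at maxSize.
    const out: string[] = []
    for (let i = 0; i < text.length; i += maxSize) out.push(text.slice(i, i + maxSize))
    return out
  }
  return text
    .split(sep)
    .filter((p) => p.length > 0)
    .flatMap((p) => recursiveSplit(p, rest, maxSize))
}
```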

### Sentence

Splits on sentence boundaries (`. `, `! `, `? `, with abbreviation handling) and packs whole sentences up to the chunk size. A sentence is never split mid-way unless it individually exceeds the limit.

This is the technique behind [LlamaIndex's `SentenceSplitter`](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/), which is the recommended default for prose in their stack.

Use it when sentence integrity matters — Q&A, legal text, or anything where mid-sentence cuts hurt comprehension.
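Boundary detection with an abbreviation guard can be sketched as below. The abbreviation list is a toy example and the whole function is illustrative; real sentence splitters carry much larger tables plus language-specific rules.

```typescript
// Toy abbreviation list (hypothetical); real splitters use far larger tables.
const ABBREVIATIONS = new Set(['dr', 'mr', 'mrs', 'prof', 'etc', 'e.g', 'i.e'])

function splitSentences(text: string): string[] {
  const sentences: string[] = []
  let start = 0
  for (const match of text.matchAll(/[.!?]\s+/g)) {
    const idx = match.index ?? 0
    const candidate = text.slice(start, idx + 1) // include the punctuation mark
    const lastWord = candidate.split(/\s+/).pop()?.replace(/\.$/, '').toLowerCase() ?? ''
    if (ABBREVIATIONS.has(lastWord)) continue // "Dr." is not a sentence end
    sentences.push(candidate.trim())
    start = idx + match[0].length
  }
  const tail = text.slice(start).trim()
  if (tail) sentences.push(tail)
  return sentences
}
```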

### Token

Fixed-size sliding window aligned to word boundaries. No awareness of paragraphs or sentences.

LlamaIndex provides the same behavior as `TokenTextSplitter`. Useful when downstream processing requires uniform chunk sizes; otherwise prefer Text or Sentence.
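The sliding window can be sketched over whitespace-delimited words standing in for tokenizer tokens (illustrative sizes; real token chunkers count tokenizer tokens, not words):

```typescript
// Window of `windowSize` words, advancing by windowSize - overlap each step,
// so adjacent chunks share `overlap` words.
function tokenChunks(text: string, windowSize: number, overlap: number): string[] {
  const step = windowSize - overlap
  if (step <= 0) throw new RangeError('overlap must be smaller than windowSize')
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + windowSize).join(' '))
    if (i + windowSize >= words.length) break // last window reached the end
  }
  return chunks
}
```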

### Regex

Splits on every match of a regex pattern you supply, then packs splits up to the chunk size by default — the same merge behavior as every other chunker. A precise boundary regex like `(?=\n\s*\{\s*"id"\s*:)` will still produce chunks containing multiple matches if those matches are small enough to fit together. This is standard across LangChain, LlamaIndex, Chonkie, and [Unstructured](https://docs.unstructured.io/api-reference/partition/chunking-documents).

Use Regex when your content has explicit delimiters that don't fit any other strategy.
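One subtlety worth knowing if you write your own patterns: `String.prototype.split` interleaves the text matched by capturing groups into its result, which is why Sim's chunker converts capturing groups to non-capturing (per the commit message). A quick demonstration:

```typescript
// A capturing group surfaces the matched delimiter as extra array entries.
const withCapture = 'intro\n# one\n# two'.split(/(\n# )/)
// five pieces: the delimiters '\n# ' appear between the content pieces

// The non-capturing form keeps only the content between delimiters.
const nonCapture = 'intro\n# one\n# two'.split(/(?:\n# )/)
// three pieces: 'intro', 'one', 'two'
```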

#### Strict boundaries

The regex strategy has an opt-in **"Each match is its own chunk (don't merge)"** checkbox. When enabled:

- Every regex match becomes its own chunk
- Adjacent splits are not packed together
- Overlap is disabled
- Splits that exceed the chunk size are still sub-split at word boundaries

This matches the `join=False` knob in [txtai](https://neuml.github.io/txtai/) and the `split_length=1` pattern in Haystack's `DocumentSplitter`. Most libraries don't expose this directly because they expect users to switch to a structural parser instead — see "One record per chunk" below.

Turn it on when each match is a discrete record (one QA pair, one log entry) and you need each one isolated for retrieval.
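The difference between the default and strict modes can be sketched side by side (illustrative sketch with character-based sizes, not Sim's implementation):

```typescript
// Three small JSON records split by a lookahead boundary before each record.
const text = '{"id":1}\n{"id":2}\n{"id":3}'
const pieces = text.split(/(?=\{"id")/)

// Default mode: pack adjacent pieces up to maxSize, so small records merge.
const maxSize = 20
const packed: string[] = []
let current = ''
for (const p of pieces) {
  if (current && current.length + p.length > maxSize) {
    packed.push(current)
    current = p
  } else {
    current += p
  }
}
if (current) packed.push(current)

// Strict boundaries: every piece is its own chunk, no packing, no overlap.
const strict = pieces.map((p) => p.trim())
```

Here `packed` holds two chunks (the first contains two records), while `strict` keeps all three records separate.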

## How to choose

Pick **Auto** unless you have a reason not to.

If Auto isn't right:

- Sentence integrity matters → **Sentence**
- Your content has structural markers Text doesn't know about → **Recursive**
- You need uniform chunk sizes → **Token**
- You have explicit delimiters → **Regex**
- Each record must be its own chunk → see below

## One record per chunk

Keeping each record (each QA pair, each log line, each row) as its own chunk is a structural chunking problem, not a regex problem. Two paths:

1. **Convert to JSONL** (one record per line) and upload. Sim's Auto strategy treats it as structured data and never splits a record mid-way. Small records may still be batched together up to the chunk size — to force one record per chunk, lower the max chunk size to roughly the size of one record. See [LlamaIndex's `JSONNodeParser`](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/) and [Unstructured's element-based chunking](https://docs.unstructured.io/api-reference/partition/chunking-documents).

2. **Use Regex with strict boundaries enabled** when you can't restructure the source.

Prefer option 1. Structural parsers handle nested records, escaped delimiters, and malformed entries that regex won't.
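A conversion along the lines of option 1 can be sketched as follows. The `Q:`/`A:` input shape and the `question`/`answer` field names are hypothetical; adapt them to your records.

```typescript
// Convert blank-line-separated "Q: ...\nA: ..." blocks to JSONL, one record
// per line, so a structural chunker can keep each record whole.
function qaPairsToJsonl(markdown: string): string {
  return markdown
    .split(/\n{2,}/)
    .map((block) => block.trim())
    .filter(Boolean)
    .map((block) => {
      const [q = '', ...rest] = block.split('\n')
      return JSON.stringify({
        question: q.replace(/^Q:\s*/, ''),
        answer: rest.join(' ').replace(/^A:\s*/, ''),
      })
    })
    .join('\n')
}
```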

## Further reading

- [LangChain — Text Splitters](https://python.langchain.com/docs/concepts/text_splitters/)
- [LlamaIndex — Node Parsers](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/)
- [Chonkie](https://github.com/chonkie-inc/chonkie)
- [Unstructured — Chunking](https://docs.unstructured.io/api-reference/partition/chunking-documents)
- [Pinecone — Chunking Strategies](https://www.pinecone.io/learn/chunking-strategies/)

## FAQ

<FAQ items={[
  { question: "Which strategy should I pick if I'm not sure?", answer: "Auto. JSON/JSONL/YAML go through structural chunking, CSVs are grouped by row, everything else uses Text. Only override Auto if you've confirmed it isn't producing the chunks you want." },
  { question: "Why does my chunk contain multiple records even though my regex is precise?", answer: "Every chunker follows split-then-pack: small adjacent splits are merged up to the chunk size to keep chunks roughly uniform. To preserve every match as its own chunk, enable 'Each match is its own chunk (don't merge)' under the Regex strategy, or convert your file to JSONL." },
  { question: "What's the difference between Text and Recursive?", answer: "Same algorithm. Text uses a built-in separator hierarchy for general prose. Recursive lets you supply your own separators or pick a recipe (plain, markdown, code) when the default doesn't capture your structure." },
  { question: "When should I use Sentence over Text?", answer: "When sentence integrity matters — Q&A, legal text, or anything where mid-sentence cuts hurt comprehension. Text may split mid-sentence at lower levels of its hierarchy; Sentence never does unless a single sentence exceeds the chunk size." },
  { question: "Does Token chunking respect any structure?", answer: "No. It's a fixed-size sliding window aligned to word boundaries. Use it only when downstream processing requires uniform chunk sizes." },
  { question: "What does overlap actually do?", answer: "Overlap repeats tokens from the end of one chunk at the start of the next, so a query spanning a chunk boundary can still match. Higher values increase storage and may surface duplicate hits in search." },
  { question: "How do I get one chunk per record?", answer: "Convert to JSONL and lower the max chunk size to roughly the size of one record — Auto handles the rest. If you can't restructure the source, use Regex with 'Each match is its own chunk' enabled." },
  { question: "Do larger chunks always retrieve better?", answer: "No. Larger chunks dilute relevance — the embedding represents the average of more content, so specific queries match worse. 256–1,024 tokens is a typical range; experiment for your data." },
  { question: "Can I change the chunking strategy on an existing knowledge base?", answer: "No. Chunking config is set at creation. To change it, create a new knowledge base and re-upload your documents." },
  { question: "Why isn't my regex producing any splits?", answer: "Sim normalizes content before splitting: \\r\\n becomes \\n, runs of three or more newlines collapse to \\n\\n, and tabs become spaces. Patterns that depend on those characters won't match. Also: in non-strict mode, content that fits within the chunk size returns as a single chunk regardless of matches — enable strict boundaries to force splits." },
]} />

apps/docs/content/docs/en/knowledgebase/index.mdx

Lines changed: 2 additions & 0 deletions
@@ -44,6 +44,8 @@ When creating a knowledge base, you can configure how documents are split into c
 | **Min Chunk Size** | characters | 100 | 100-2,000 | Minimum chunk size to avoid tiny fragments |
 | **Overlap** | tokens | 200 | 0-500 | Context overlap between consecutive chunks |

+You can also pick a chunking strategy (Auto, Text, Recursive, Sentence, Token, or Regex) to control where splits happen. See [Chunking Strategies](/docs/knowledgebase/chunking-strategies) for a breakdown of when to use each.
+
 - **Hierarchical splitting**: Respects document structure (sections, paragraphs, sentences)

 ### Editing Capabilities
Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 {
   "title": "Knowledge Base",
-  "pages": ["index", "connectors", "tags"]
+  "pages": ["index", "chunking-strategies", "connectors", "tags"]
 }

apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx

Lines changed: 31 additions & 1 deletion
@@ -9,6 +9,7 @@ import { useForm } from 'react-hook-form'
 import { z } from 'zod'
 import {
   Button,
+  Checkbox,
   Combobox,
   type ComboboxOption,
   Input,
@@ -75,6 +76,7 @@ const FormSchema = z
       .max(500, 'Overlap must be less than 500 tokens'),
     strategy: z.enum(['auto', 'text', 'regex', 'recursive', 'sentence', 'token']).default('auto'),
     regexPattern: z.string().optional(),
+    regexStrictBoundaries: z.boolean().default(false),
     customSeparators: z.string().optional(),
   })
   .refine(
@@ -175,13 +177,15 @@ export const CreateBaseModal = memo(function CreateBaseModal({
       overlapSize: 200,
       strategy: 'auto',
       regexPattern: '',
+      regexStrictBoundaries: false,
       customSeparators: '',
     },
     mode: 'onSubmit',
   })

   const nameValue = watch('name')
   const strategyValue = watch('strategy')
+  const regexStrictBoundariesValue = watch('regexStrictBoundaries')

   useEffect(() => {
     if (open) {
@@ -199,6 +203,7 @@
         overlapSize: 200,
         strategy: 'auto',
         regexPattern: '',
+        regexStrictBoundaries: false,
         customSeparators: '',
       })
     }
@@ -304,7 +309,10 @@
     try {
       const strategyOptions: StrategyOptions | undefined =
         data.strategy === 'regex' && data.regexPattern
-          ? { pattern: data.regexPattern }
+          ? {
+              pattern: data.regexPattern,
+              ...(data.regexStrictBoundaries && { strictBoundaries: true }),
+            }
           : data.strategy === 'recursive' && data.customSeparators?.trim()
             ? {
                 separators: data.customSeparators
@@ -495,6 +503,28 @@
               <p className='text-[var(--text-muted)] text-xs'>
                 Text will be split at each match of this regex pattern.
               </p>
+              <label
+                htmlFor='regexStrictBoundaries'
+                className='mt-1 flex cursor-pointer items-start gap-2'
+              >
+                <Checkbox
+                  id='regexStrictBoundaries'
+                  checked={regexStrictBoundariesValue}
+                  onCheckedChange={(checked) =>
+                    setValue('regexStrictBoundaries', checked === true)
+                  }
+                  className='mt-0.5'
+                />
+                <div className='flex flex-col gap-0.5'>
+                  <span className='text-[var(--text-primary)] text-sm'>
+                    Each match is its own chunk (don&apos;t merge)
+                  </span>
+                  <span className='text-[var(--text-muted)] text-xs'>
+                    Preserve boundaries exactly. Recommended when each match is a discrete
+                    record (e.g. one QA pair per chunk).
+                  </span>
+                </div>
+              </label>
             </div>
           )}


apps/sim/lib/api/contracts/knowledge/base.ts

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,7 @@ export const chunkingStrategyOptionsSchema = z
     pattern: z.string().max(500).optional(),
     separators: z.array(z.string()).optional(),
     recipe: z.enum(['plain', 'markdown', 'code']).optional(),
+    strictBoundaries: z.boolean().optional(),
   })
   .strict() satisfies z.ZodType<StrategyOptions>

@@ -44,6 +45,9 @@ export const chunkingConfigSchema = z
       message: 'Regex pattern is required when using the regex chunking strategy',
     }
   )
+  .refine((data) => data.strategy === 'regex' || data.strategyOptions?.strictBoundaries !== true, {
+    message: 'strictBoundaries is only valid for the regex chunking strategy',
+  })

 export const createKnowledgeBaseBodySchema = z.object({
   name: z.string().min(1, 'Name is required'),
