Skip to content

Commit 0238f3c

Browse files
authored
skill: add the silk-profiler skill to investigate backend bottlenecks (baserow#5093)
* skill: add the silk-profiler skill to investigate backend bottlenecks * address feedback
1 parent cf15a4a commit 0238f3c

1 file changed

Lines changed: 285 additions & 0 deletions

File tree

Lines changed: 285 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,285 @@
1+
---
2+
name: silk-profiler
3+
description: Investigate backend performance using Django Silk profiling data. Use when investigating a slow endpoint, a potential bottleneck, N+1 queries, or understanding query patterns for a specific request.
4+
---
5+
6+
# Investigate Backend Performance with Silk
7+
8+
Use this skill to investigate a slow endpoint or a potential bottleneck. Django Silk (enabled by default in dev via `BASEROW_ENABLE_SILK` in `dev.py`) records every HTTP request, its SQL queries, and full Python stack traces into PostgreSQL. The user may provide a Silk request URL, a request ID, or just describe which endpoint is slow.
9+
10+
## Prerequisites
11+
12+
- Silk must be enabled (default in dev: `BASEROW_ENABLE_SILK=on` in `dev.py`)
13+
- The slow operation must have been performed recently so Silk has captured it
14+
- The dev database must be accessible (usually via `docker exec` into the `baserow-db-1` container)
15+
16+
## Connecting to the Database
17+
18+
Read the `DATABASES` setting in `backend/src/baserow/config/settings/base.py` (and `dev.py` which imports it) to find the connection credentials. Env vars may override the defaults.
19+
20+
The database usually runs in a Docker container. Try `docker exec` first:
21+
22+
```bash
23+
docker exec <db-container> psql -U <user> -d <database> -c "<SQL>"
24+
```
25+
26+
To find the container name:
27+
28+
```bash
29+
docker ps --format '{{.Names}}' | grep -i "db\|postgres"
30+
```
31+
32+
If `psql` is available locally, connect directly using the host/port from the settings. **Try connecting first, only ask the user if it fails.**
33+
34+
## Handling User Input
35+
36+
The user may provide:
37+
38+
- **A Silk URL** like `http://localhost:8000/silk/request/<uuid>/sql/` — extract the UUID between `/request/` and the next `/` as the request ID.
39+
- **A request ID** (UUID) directly.
40+
- **An endpoint path** like `/api/database/tables/123/` — search for it in `silk_request`.
41+
- **A description** like "deleting a table is slow" — search by path pattern and sort by time.
42+
43+
## Workflow
44+
45+
### Step 1: Find the Request
46+
47+
If you already have a request ID, skip to Step 2.
48+
49+
```sql
50+
-- Search by path pattern
51+
SELECT id, path, method, round(time_taken::numeric, 0) AS ms,
52+
num_sql_queries AS queries, start_time
53+
FROM silk_request
54+
WHERE path LIKE '%/api/PATTERN/%'
55+
ORDER BY start_time DESC
56+
LIMIT 10;
57+
```
58+
59+
```sql
60+
-- Slowest requests overall
61+
SELECT id, path, method, round(time_taken::numeric, 0) AS ms,
62+
num_sql_queries AS queries, start_time
63+
FROM silk_request
64+
ORDER BY time_taken DESC
65+
LIMIT 10;
66+
```
67+
68+
```sql
69+
-- Requests with the most SQL queries (likely N+1)
70+
SELECT id, path, method, num_sql_queries AS queries,
71+
round(time_taken::numeric, 0) AS ms, start_time
72+
FROM silk_request
73+
ORDER BY num_sql_queries DESC
74+
LIMIT 10;
75+
```
76+
77+
### Step 2: Analyze Where Time Is Spent
78+
79+
`meta_time_spent_queries` is usually NULL, so compute query time from `silk_sqlquery`:
80+
81+
```sql
82+
SELECT r.path, r.method,
83+
round(r.time_taken::numeric, 0) AS total_ms,
84+
r.num_sql_queries,
85+
round(q.query_ms::numeric, 0) AS query_ms,
86+
round((r.time_taken - q.query_ms)::numeric, 0) AS python_ms
87+
FROM silk_request r
88+
JOIN (
89+
SELECT request_id, SUM(time_taken) AS query_ms
90+
FROM silk_sqlquery GROUP BY request_id
91+
) q ON q.request_id = r.id
92+
WHERE r.id = '<request_id>';
93+
```
94+
95+
If `query_ms` dominates, the problem is database queries. If `python_ms` is high, the bottleneck is in Python code.
96+
97+
### Step 3: Detect N+1 Query Patterns
98+
99+
Group by **normalized** query text (string literals and integer params replaced with `?`) to catch N+1 patterns where the same query runs with different IDs:
100+
101+
```sql
102+
SELECT COUNT(*) AS count,
103+
round(SUM(time_taken)::numeric, 1) AS total_ms,
104+
LEFT(regexp_replace(
105+
regexp_replace(query, '''[^'']*''', '''?''', 'g'),
106+
'= \d+', '= ?', 'g'
107+
), 200) AS normalized
108+
FROM silk_sqlquery
109+
WHERE request_id = '<request_id>'
110+
GROUP BY regexp_replace(
111+
regexp_replace(query, '''[^'']*''', '''?''', 'g'),
112+
'= \d+', '= ?', 'g'
113+
)
114+
ORDER BY count DESC
115+
LIMIT 15;
116+
```
117+
118+
Any query appearing more than a handful of times is likely an N+1 problem.
119+
120+
### Step 4: Examine Slow Individual Queries
121+
122+
```sql
123+
SELECT id, round(time_taken::numeric, 1) AS ms, LEFT(query, 300) AS query_preview
124+
FROM silk_sqlquery
125+
WHERE request_id = '<request_id>'
126+
ORDER BY time_taken DESC
127+
LIMIT 10;
128+
```
129+
130+
```sql
131+
-- Full query text for a specific slow query
132+
SELECT query FROM silk_sqlquery WHERE id = <query_id>;
133+
```
134+
135+
### Step 5: Analyze Query Plans
136+
137+
Silk runs `EXPLAIN` on every captured query and stores the plan in the `analysis` column of `silk_sqlquery`. This is always available — no extra configuration needed.
138+
139+
```sql
140+
SELECT id, round(time_taken::numeric, 1) AS ms, analysis
141+
FROM silk_sqlquery
142+
WHERE request_id = '<request_id>'
143+
ORDER BY time_taken DESC
144+
LIMIT 5;
145+
```
146+
147+
These are estimated plans (no actual execution stats). If you need actual row counts and timings, run `EXPLAIN ANALYZE` manually on a specific query:
148+
149+
```sql
150+
EXPLAIN ANALYZE <paste query from silk_sqlquery here>;
151+
```
152+
153+
**What to look for:**
154+
155+
- **Seq Scan on large tables** — a sequential scan on a table that could grow large (e.g. rows, fields, workspaces) means a missing or unused index. Small config tables are fine.
156+
- **Missing indexes** — if a WHERE or JOIN condition filters on a column without an index, the planner falls back to sequential scans. Fix with `db_index=True` on the model field or a migration with `AddIndex`.
157+
- **Existing indexes not being used** — an index may exist but the planner ignores it (wrong column order in composite index, type mismatch, function wrapping the column, etc.). Check whether the index is used by *any* query in the codebase — if not, it's dead weight and should be removed.
158+
- **Nested Loop with high row estimates** — often a sign of missing `select_related` or a missing index on the join column.
159+
160+
> **Do not enable `SILKY_ANALYZE_QUERIES`** (`BASEROW_DANGEROUS_SILKY_ANALYZE_QUERIES`) to upgrade to EXPLAIN ANALYZE globally. It re-executes every UPDATE and breaks data integrity. The default EXPLAIN plans in the `analysis` column are sufficient for most investigations — run `EXPLAIN ANALYZE` manually only on specific queries when needed.
161+
162+
### Step 6: Read Stack Traces
163+
164+
Stack traces are stored reversed: the **top** is ORM/Silk internals, the **middle** contains Baserow application frames (paths containing `baserow/src/baserow/`), and the **bottom** is Django server/threading boilerplate. Scan for the Baserow frames in the middle — those are the ones that matter.
165+
166+
```sql
167+
SELECT traceback FROM silk_sqlquery WHERE id = <query_id>;
168+
```
169+
170+
To get traces for the most repeated query pattern:
171+
172+
```sql
173+
SELECT id, traceback
174+
FROM silk_sqlquery
175+
WHERE request_id = '<request_id>'
176+
AND regexp_replace(
177+
regexp_replace(query, '''[^'']*''', '''?''', 'g'),
178+
'= \d+', '= ?', 'g'
179+
) = (
180+
SELECT regexp_replace(
181+
regexp_replace(query, '''[^'']*''', '''?''', 'g'),
182+
'= \d+', '= ?', 'g'
183+
)
184+
FROM silk_sqlquery
185+
WHERE request_id = '<request_id>'
186+
GROUP BY regexp_replace(
187+
regexp_replace(query, '''[^'']*''', '''?''', 'g'),
188+
'= \d+', '= ?', 'g'
189+
)
190+
ORDER BY COUNT(*) DESC LIMIT 1
191+
)
192+
LIMIT 3;
193+
```
194+
195+
### Step 7: Trace to Code and Propose Fixes
196+
197+
Once you've identified the problematic query and its origin in the stack trace:
198+
199+
1. Read the source file and function that triggered the query
200+
2. Understand the data access pattern
201+
3. Propose a fix based on the patterns below
202+
203+
## Common Patterns and Fixes
204+
205+
### N+1 Queries — Missing `select_related()` / `prefetch_related()`
206+
207+
**Symptom:** The same SELECT appears dozens/hundreds of times, each with a different WHERE id = value.
208+
209+
**Fix:** Add `select_related('relation')` for ForeignKey/OneToOne, or `prefetch_related('relation')` for reverse FK/M2M, on the queryset that feeds the loop or serializer.
210+
211+
### Queries in Loops That Could Be Batched
212+
213+
**Symptom:** Queries inside a Python for-loop that could be replaced with a single bulk operation.
214+
215+
**Fix:** Collect IDs first, do a single bulk query, then map results back. Common in Baserow: `break_dependencies_for_field` called per-field instead of batching, `FieldCache.reset_cache()` called inside loops invalidating cached data.
216+
217+
### `specific_iterator` or `specific_queryset` / `.specific` Overhead
218+
219+
**Symptom:** Many SELECT queries fetching from different field type tables (`database_textfield`, `database_numberfield`, etc.) one at a time.
220+
221+
**Fix:** Use `specific_iterator()` with a batch of fields rather than calling `.specific` per field in a loop. Or ensure `FieldCache` is populated before the loop and not reset mid-iteration.
222+
223+
### `get_model()` Called Repeatedly
224+
225+
**Symptom:** Repeated queries to `database_field` and `django_content_type` tables to rebuild the same table model.
226+
227+
**Fix:** Cache the model result or pass it as a parameter instead of calling `table.get_model()` multiple times.
228+
229+
### Duplicate Queries Across Serializer Fields
230+
231+
**Symptom:** Multiple serializer fields each trigger the same query independently.
232+
233+
**Fix:** Use `prefetch_related` with a `Prefetch` object on the view's queryset, or add a `@cached_property` on the model.
234+
235+
### Missing Database Indexes
236+
237+
**Symptom:** A single query takes a very long time (>100ms). The query filters on a column that isn't indexed.
238+
239+
**Fix:** Add `db_index=True` to the model field or create a migration with `AddIndex`.
240+
241+
## Silk Table Schema Reference
242+
243+
### silk_request
244+
| Column | Type | Notes |
245+
|---|---|---|
246+
| id | CharField(36) | UUID primary key |
247+
| path | CharField(190) | Request URL path |
248+
| method | CharField(10) | HTTP method |
249+
| start_time | DateTimeField | Request start |
250+
| time_taken | FloatField | Total time (ms) |
251+
| num_sql_queries | IntegerField | SQL query count |
252+
| meta_time_spent_queries | FloatField | **Usually NULL** — compute from silk_sqlquery instead |
253+
| view_name | CharField(190) | Django view name (usually populated) |
254+
255+
### silk_sqlquery
256+
| Column | Type | Notes |
257+
|---|---|---|
258+
| id | IntegerField | Auto PK |
259+
| query | TextField | Full SQL text |
260+
| start_time | DateTimeField | Query start |
261+
| end_time | DateTimeField | Query end |
262+
| time_taken | FloatField | Execution time (ms) |
263+
| request_id | FK | Links to silk_request |
264+
| traceback | TextField | Python stack trace (reversed — ORM at top, Baserow in middle, server at bottom) |
265+
| analysis | TextField | EXPLAIN output (always populated). With `SILKY_ANALYZE_QUERIES` enabled, contains EXPLAIN ANALYZE instead. |
266+
267+
### silk_profile
268+
| Column | Type | Notes |
269+
|---|---|---|
270+
| id | IntegerField | Auto PK |
271+
| name | CharField(300) | Profile name |
272+
| time_taken | FloatField | Duration (ms) |
273+
| request_id | FK | Links to silk_request |
274+
| file_path | CharField(300) | Source file |
275+
| line_num | IntegerField | Start line |
276+
| func_name | CharField(300) | Function name |
277+
278+
## Guardrails
279+
280+
- **Use read-only queries only.** Never run INSERT, UPDATE, DELETE, or DDL on any table.
281+
- **Never enable `SILKY_ANALYZE_QUERIES`** (`BASEROW_DANGEROUS_SILKY_ANALYZE_QUERIES`). It runs every UPDATE twice and breaks data integrity.
282+
- **Do not truncate Silk tables** without asking the user first.
283+
- **Ground every claim in data.** Always quote the specific evidence (EXPLAIN output, stack trace lines, source code with file path and line number) when making a diagnosis. Never state that a query is slow, an index is missing, or a `select_related` is needed without showing the data that supports it.
284+
- When proposing fixes, always read the actual source code first — don't guess from the query text alone.
285+
- **Confirm findings with the user.** When a user describes a problem without a Silk URL or request ID, confirm the matching request was found and provide its Silk URL (e.g. `http://localhost:8000/silk/request/<uuid>/`) so the user can verify before proceeding.

0 commit comments

Comments
 (0)