Skip to content

Commit 905ee8d

Browse files
committed
Add plan for position-aware
1 parent e7c00d9 commit 905ee8d

2 files changed

Lines changed: 234 additions & 16 deletions

File tree

docs/rfcs/arbitrary-boolean-expressions-with-subqueries.md

Lines changed: 36 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ contributors:
77
- ilia
88
- kev
99
created: 2026-01-27
10-
last_updated: 2026-01-27
10+
last_updated: 2026-02-10
1111
prd: N/A
1212
prd_version: N/A
1313
---
@@ -376,6 +376,40 @@ This handles all race conditions:
376376
| Value enters in txn T, exits in txn T+1 while query running | Same — `moved_out_tags` records hash from T+1's move-out; T's query results filtered |
377377
| Same hash at different positions (`x IN sq1 OR x IN sq2`) | Position-aware filtering only checks the triggering position |
378378
| Partial exit from multi-value query ([A, B] queried, A exits) | Per-row filtering: row with A skipped, row with B kept |
379+
| `find_position_for_sublink` re-derivation picks wrong position (same subquery in both IN and NOT IN) | Pre-existing TODO, low practical risk — requires identical subquery text in both positive and negated form. Fix: thread already-known position through instead of re-deriving. |
380+
381+
##### Dual tag format: snapshot files vs wire/API
382+
383+
Snapshot file tags (used for `moved_out_tags` filtering) use a **different format** from the
384+
wire/API tags sent to clients:
385+
386+
- **Wire format** (in JSON headers, sent to clients): slash-delimited strings per disjunct —
387+
e.g. `"hash1/hash2/"`, `"//hash3"`. One string per disjunct, positions within a disjunct
388+
separated by `/`. Unchanged from the design above. Clients use these for their own
389+
tag-based inclusion tracking.
390+
- **Snapshot file format** (internal, used for filtering): flat list of per-position hashes —
391+
one hash per DNF position, no delimiters. `Enum.at(tags, position)` directly yields the
392+
hash for that position. These tags never leave the server.
393+
394+
**Why the flat format is sufficient**: since `moved_out_tags` is now position-aware (tracking
395+
`{position, tags_to_skip}` instead of just `tags_to_skip`), filtering only needs to check
396+
the hash at a single position index. No string splitting or multi-position comparison is
397+
needed. The flat list is effectively a position-indexed array.
398+
399+
**Worked example** — shape `WHERE x IN sq1 OR x IN sq2`, tag_structure `[[x], [x]]`:
400+
401+
| Format | Positions | Row with x=a |
402+
|--------|-----------|--------------|
403+
| Wire (per-disjunct, slash-delimited) | disjunct 0: `[x]`, disjunct 1: `[x]` | `["hash(a)/", "/hash(a)"]` |
404+
| Snapshot (per-position, flat) | position 0: sq1, position 1: sq2 | `["hash(a)", "hash(a)"]` |
405+
406+
When value `a` exits sq1, `moved_out_tags` records `{position: 0, tags: {"hash(a)"}}`.
407+
At filter time, `Enum.at(tags, 0)` = `"hash(a)"` → match → row skipped for sq1.
408+
But a different move-in query for sq2 would check `Enum.at(tags, 1)` — position 0's
409+
move-out doesn't affect position 1's filtering.
410+
411+
Storage version is bumped (1 → 2 in `pure_file_storage.ex`) to invalidate old-format
412+
snapshots on upgrade. This is a server-internal change only — clients are unaffected.
379413

380414
### Consistency Model for Shapes with Subqueries
381415

@@ -549,6 +583,7 @@ Questions resolved during RFC development:
549583
| Version | Date | Author | Changes |
550584
|---------|------|--------|---------|
551585
| 1.0 | 2026-01-27 | rob | Initial version |
586+
| 1.1 | 2026-02-10 | rob | Add dual tag format (snapshot vs wire), `find_position_for_sublink` limitation |
552587

553588
---
554589

packages/sync-service/IMPLEMENTATION_PLAN.md

Lines changed: 198 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1208,21 +1208,204 @@ Before enabling `dnf_subqueries` feature flag in production:
12081208

12091209
### Remaining Work
12101210

1211-
1. **Phase 11: Position-aware `moved_out_tags` filtering**
1212-
- Main's `moved_out_tags` tracks bare hash values. For multi-disjunct shapes where the
1213-
same column appears at different positions (e.g. `x IN sq1 OR x IN sq2`), the same
1214-
hash appears at both positions, causing incorrect filtering.
1215-
- Fix: track `{position, hash}` tuples instead of bare hashes. Filter at the
1216-
**triggering position** — the position that caused the move-in query.
1217-
- Code changes:
1218-
- `MoveIns.move_out_happened`: accept `{position, hash}` tuples
1219-
- `MoveIns.add_waiting`: track triggering position alongside query name
1220-
- `do_legacy_move_out` (move_handling.ex): pass position info to `move_out_happened`
1221-
- Consumer or storage: filter rows at write time using position-aware lookup
1222-
- Edge cases handled:
1223-
- Same hash at different positions: only triggering position checked
1224-
- Partial exit from multi-value query: per-row filtering (A skipped, B kept)
1225-
- Within-txn and cross-txn races: both handled by `moved_out_tags`
1211+
1. **Phase 11: Position-aware `moved_out_tags` filtering** ← next
1212+
1213+
#### Problem
1214+
1215+
`moved_out_tags` compares bare hashes against full slash-delimited tag strings — never
1216+
matches for multi-disjunct shapes.
1217+
1218+
Consider `WHERE x IN sq1 OR x IN sq2` with tag_structure `[[x], [x]]` (two disjuncts,
1219+
both on column `x`). When value `a` exits sq1, the move-out control message records
1220+
`hash(a)` as a bare string. But the tags stored in the snapshot file are slash-delimited
1221+
per-disjunct: `["hash(a)/", "/hash(a)"]`. The filtering check in
1222+
`all_parents_moved_out?/2` does:
1223+
1224+
```elixir
1225+
# pure_file_storage.ex — current filtering
1226+
defp all_parents_moved_out?(tags, tags_to_skip) do
1227+
tags != [] and Enum.all?(tags, &MapSet.member?(tags_to_skip, &1))
1228+
end
1229+
```
1230+
1231+
`tags_to_skip` contains `"hash(a)"` (bare), but `tags` contains `"hash(a)/"` and
1232+
`"/hash(a)"` (slash-delimited) — no match is ever found.
1233+
1234+
Even if we fixed the string format, position-unaware filtering would be wrong: value `a`
1235+
exits sq1 (position 0) but is still valid for sq2 (position 1). A bare-hash match would
1236+
incorrectly skip the row for both positions.
1237+
1238+
#### Solution — two changes
1239+
1240+
**Change 1: Snapshot file tags become per-position flat hashes.**
1241+
1242+
Currently `make_tags/3` in `querying.ex` generates SQL that produces one slash-delimited
1243+
string per disjunct:
1244+
1245+
```elixir
1246+
# Current: make_tags returns SQL for slash-delimited strings per disjunct
1247+
# tag_structure: [[x, y], [nil, z]] → SQL producing ["hash(x)/hash(y)", "/hash(z)"]
1248+
Enum.map(pattern, fn ... end) |> Enum.join(" || '/' || ")
1249+
```
1250+
1251+
Add a second function `make_snapshot_tags/3` that produces one hash per DNF position
1252+
(flat, no slashes):
1253+
1254+
```elixir
1255+
# New: make_snapshot_tags returns SQL for one hash per position
1256+
# tag_structure: [[x, y], [nil, z]]
1257+
# positions: 0 1 1 2 (flattened across disjuncts)
1258+
#
1259+
# → SQL producing ["hash(x)", "hash(y)", "hash(z)"]
1260+
# (nils are included as empty strings so indices stay aligned)
1261+
defp make_snapshot_tags(%Shape{tag_structure: tag_structure}, stack_id, shape_handle) do
1262+
escaped_prefix = escape_sql_string(to_string(stack_id) <> to_string(shape_handle))
1263+
1264+
tag_structure
1265+
|> List.flatten()
1266+
|> Enum.map(fn
1267+
nil -> "''"
1268+
column_name when is_binary(column_name) ->
1269+
col = pg_cast_column_to_text(column_name)
1270+
~s[md5('#{escaped_prefix}' || #{pg_namespace_value_sql(col)})]
1271+
{:hash_together, columns} ->
1272+
# ... same as make_tags ...
1273+
end)
1274+
end
1275+
```
1276+
1277+
`query_move_in/5` uses `make_snapshot_tags` instead of `make_tags`:
1278+
1279+
```elixir
1280+
# querying.ex — query_move_in uses flat tags for snapshot storage
1281+
tag_select = make_snapshot_tags(shape, stack_id, shape_handle) |> Enum.join(", ")
1282+
~s|SELECT #{key_select}, ARRAY[#{tag_select}]::text[], #{json_like_select} FROM ...|
1283+
```
1284+
1285+
API tags (in JSON headers from `stream_initial_data`) remain slash-delimited — unchanged.
1286+
1287+
**Change 2: `moved_out_tags` becomes position-aware.**
1288+
1289+
Currently `move_out_happened/2` unions bare hashes into a flat `MapSet`:
1290+
1291+
```elixir
1292+
# Current: move_ins.ex
1293+
@type t() :: %__MODULE__{
1294+
moved_out_tags: %{move_in_name() => MapSet.t(String.t())}
1295+
}
1296+
1297+
def move_out_happened(state, new_tags) do
1298+
moved_out_tags =
1299+
Map.new(state.moved_out_tags, fn {name, tags} ->
1300+
{name, MapSet.union(tags, new_tags)}
1301+
end)
1302+
%{state | moved_out_tags: moved_out_tags}
1303+
end
1304+
```
1305+
1306+
Change to accept `{position, tags}` and store per-position:
1307+
1308+
```elixir
1309+
# New: move_ins.ex — position-aware moved_out_tags
1310+
@type t() :: %__MODULE__{
1311+
moved_out_tags: %{move_in_name() => {non_neg_integer(), MapSet.t(String.t())}}
1312+
}
1313+
1314+
def move_out_happened(state, position, new_tags) do
1315+
moved_out_tags =
1316+
Map.new(state.moved_out_tags, fn {name, {pos, tags}} ->
1317+
if pos == position do
1318+
{name, {pos, MapSet.union(tags, new_tags)}}
1319+
else
1320+
{name, {pos, tags}}
1321+
end
1322+
end)
1323+
%{state | moved_out_tags: moved_out_tags}
1324+
end
1325+
```
1326+
1327+
The caller in `do_legacy_move_out` derives position from `dep_handle`:
1328+
1329+
```elixir
1330+
# move_handling.ex — pass position to move_out_happened
1331+
positions = DnfContext.get_positions_for_dependency(state.dnf_context, dep_handle)
1332+
position = List.first(positions) # see Known Limitation below
1333+
1334+
move_handling_state =
1335+
MoveIns.move_out_happened(
1336+
state.move_handling_state,
1337+
position,
1338+
MapSet.new(message.headers.patterns |> Enum.map(& &1[:value]))
1339+
)
1340+
```
1341+
1342+
Filtering in storage becomes position-aware — check only the hash at the triggering
1343+
position index:
1344+
1345+
```elixir
1346+
# pure_file_storage.ex — new filtering
1347+
defp should_skip_for_moved_out?(tags, {position, tags_to_skip}) do
1348+
case Enum.at(tags, position) do
1349+
nil -> false
1350+
"" -> false
1351+
hash -> MapSet.member?(tags_to_skip, hash)
1352+
end
1353+
end
1354+
```
1355+
1356+
This is why the flat format is sufficient: since `moved_out_tags` now knows which
1357+
position triggered the move-out, we only need to check the hash at that position index.
1358+
We never need to parse or split strings — `Enum.at(tags, position)` directly yields the
1359+
hash. The flat list is effectively a position-indexed array.
1360+
1361+
#### Worked example
1362+
1363+
Shape: `WHERE x IN sq1 OR x IN sq2`
1364+
Tag structure: `[[x], [x]]` → positions: 0 (sq1), 1 (sq2)
1365+
1366+
1. Value `a` enters sq1 → move-in query fires, snapshot rows stored with flat tags
1367+
`["hash(a)", ""]` (position 0 has hash, position 1 empty for this disjunct)
1368+
2. While query is in flight, value `a` exits sq1 → `move_out_happened(state, 0, MapSet.new(["hash(a)"]))`
1369+
records `{0, MapSet["hash(a)"]}` for the in-flight query
1370+
3. Query completes → filtering checks `Enum.at(["hash(a)", ""], 0)` = `"hash(a)"`
1371+
in `tags_to_skip` → row skipped ✓
1372+
4. Meanwhile, value `a` is still valid for sq2 (position 1). If sq2 also has an in-flight
1373+
query with tags `["", "hash(a)"]`, filtering checks `Enum.at(["", "hash(a)"], 0)` = `""`
1374+
not in `tags_to_skip` → row kept ✓
1375+
1376+
#### Files changed
1377+
1378+
- `querying.ex` — add `make_snapshot_tags/3`, use it in `query_move_in/5`
1379+
- `move_ins.ex` — change `moved_out_tags` type, update `move_out_happened/3` to accept position
1380+
- `move_handling.ex` — derive position from `dnf_context`, pass to `move_out_happened`
1381+
- `storage.ex` — update `moved_out_tags` type in behaviour callback specs
1382+
- `pure_file_storage.ex` — replace `all_parents_moved_out?` with `should_skip_for_moved_out?`,
1383+
bump `@version` 1 → 2
1384+
- `in_memory_storage.ex` — same filtering change for in-memory store
1385+
- `crashing_file_storage.ex` — delegate to file storage (no logic change)
1386+
- `test_storage.ex` — update test helpers for new type
1387+
- `storage_implementations_test.exs` — test position-aware filtering
1388+
1389+
#### Storage version bump
1390+
1391+
`@version` in `pure_file_storage.ex` from 1 → 2 to force clean slate. Old-format
1392+
snapshots (slash-delimited tags) are invalidated on startup — shapes re-snapshot with
1393+
the new flat format.
1394+
1395+
#### Edge cases handled
1396+
1397+
- Same hash at different positions: only triggering position checked (see worked example)
1398+
- Partial exit from multi-value query: per-row filtering (A skipped, B kept)
1399+
- Within-txn and cross-txn races: both handled by `moved_out_tags`
1400+
1401+
#### Known limitation
1402+
1403+
`do_move_out_for_positions` / `do_move_out_for_positions_with_check` discard
1404+
`_positions` and re-derive via `find_position_for_sublink`, which could pick the wrong
1405+
position if the same `dep_handle` maps to both a positive and negated position (same
1406+
subquery with both IN and NOT IN). Pre-existing TODO, low practical risk (requires
1407+
identical subquery text in both positive and negated form). Fix would be straightforward:
1408+
thread the already-known position through instead of re-deriving.
12261409

12271410
2. **Protocol Version Validation** (Optional)
12281411
- Add protocol version check to reject complex shapes for v1 clients

0 commit comments

Comments
 (0)