
release-25.2: sql: fix partial index data loss / phantom rows during update #166322

Merged
rail merged 1 commit into cockroachdb:release-25.2 from
fqazi:blathers/backport-release-25.2-166123
Mar 23, 2026

Conversation

@fqazi
Collaborator

@fqazi fqazi commented Mar 20, 2026

Backport 1/1 commits from #166123 on behalf of @fqazi.


sql: fix partial index data loss / phantom rows during update

This commit fixes a bug on tables with multiple column families where a
concurrent update that does not overlap with a partial index's column
family could cause the partial index to write a NULL instead of the
actual data, or incorrectly add phantom rows to a temporary index
during a schema change backfill.

This bug was previously masked on default (single-column-family) tables
because an update to any column causes the optimizer to conservatively
fetch all columns in that family. However, with multiple column families,
two normalization rules in the optimizer caused issues:

  1. PruneMutationFetchCols: If an update does not change any column
    associated with an index, the optimizer avoids fetching those
    columns. This causes the execution layer to see NULLs for the
    unfetched columns.
  2. SimplifyPartialIndexProjections: If an update does not change
    any column associated with a partial index, the optimizer
    simplifies the partial index predicate evaluation to FALSE.
    This causes the execution layer to skip writes to the index.
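
The two rules above can be sketched as a toy model. This is not the optimizer's actual code: the `index` type and the helpers `touches`, `fetchCols`, and `predicateForcedFalse` are hypothetical names invented for illustration, and each index is reduced to the list of non-key columns it depends on.

```go
package main

import (
	"fmt"
	"sort"
)

// index is a hypothetical, simplified stand-in for an index descriptor.
type index struct {
	name    string
	cols    []string // non-key columns the index stores or tests in its predicate
	partial bool     // true if the index has a WHERE predicate
}

// touches reports whether the update modifies any column of idx.
func touches(idx index, updated map[string]bool) bool {
	for _, c := range idx.cols {
		if updated[c] {
			return true
		}
	}
	return false
}

// fetchCols models Rule 1: columns of an index the update does not touch
// are left out of the fetch set, so the execution layer sees them as NULL.
func fetchCols(indexes []index, updated map[string]bool) []string {
	seen := map[string]bool{}
	for _, idx := range indexes {
		if !touches(idx, updated) {
			continue // pruned: nothing fetched for this index
		}
		for _, c := range idx.cols {
			seen[c] = true
		}
	}
	out := make([]string, 0, len(seen))
	for c := range seen {
		out = append(out, c)
	}
	sort.Strings(out)
	return out
}

// predicateForcedFalse models Rule 2: an untouched partial index has its
// predicate replaced by FALSE, so no write to that index is issued.
func predicateForcedFalse(idx index, updated map[string]bool) bool {
	return idx.partial && !touches(idx, updated)
}

func main() {
	indexes := []index{
		{name: "primary_fam_a", cols: []string{"a"}},
		{name: "idx_val", cols: []string{"val"}, partial: true},
	}
	updated := map[string]bool{"a": true} // an UPDATE that sets only column a

	fmt.Println("fetched:", fetchCols(indexes, updated))
	fmt.Println("idx_val predicate forced FALSE:", predicateForcedFalse(indexes[1], updated))
}
```

In this model an update that sets only `a` fetches nothing for `idx_val` and also suppresses the write to it. That is harmless for a fully built index but, per the description, not for an index that is still being backfilled.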

During a schema change backfill, the execution layer's updater must
unconditionally write complete index entries to temporary (mutating)
indexes for any concurrent update, even if the index's columns are
unchanged. This ensures the backfill merger has a complete snapshot
to correctly reconcile the final index. If columns are pruned (Rule 1)
or writes are simplified away (Rule 2), the temporary index receives
incomplete entries (NULLs) or misses the update entirely.

Furthermore, missing columns can lead to phantom rows if the partial
index predicate evaluates to TRUE when given a NULL value (e.g.,
WHERE val IS NULL).
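
A minimal sketch of the phantom-row case, assuming a hypothetical table with columns `a` and `val` and a partial index predicate `val IS NULL`. The `row` type and the helpers `isNull` and `project` are illustrative only; an absent key stands in for the NULL the execution layer sees when a column was not fetched.

```go
package main

import "fmt"

// row maps a column name to its value; a missing key stands for SQL NULL.
type row map[string]int

// isNull reports whether col is NULL (absent) in r. It plays the role of
// the partial index predicate "col IS NULL".
func isNull(r row, col string) bool {
	_, ok := r[col]
	return !ok
}

// project keeps only the fetched columns, as a pruned fetch would.
func project(full row, fetched []string) row {
	out := row{}
	for _, c := range fetched {
		if v, ok := full[c]; ok {
			out[c] = v
		}
	}
	return out
}

func main() {
	stored := row{"a": 1, "val": 5}        // the row as stored: val is not NULL
	seen := project(stored, []string{"a"}) // the update fetched only column a

	fmt.Println("predicate on stored row:", isNull(stored, "val"))
	fmt.Println("predicate on pruned row:", isNull(seen, "val"))
}
```

The predicate is false for the stored row but true for the pruned view of it, so an index entry would be written for a row that does not belong in the index.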

This change ensures the optimizer always fetches the required
columns and avoids simplifying predicate evaluation if the index
is a mutation index, correctly propagating the full row state to the
execution layer.
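
The shape of the fix can be sketched as an extra guard on both decisions. This is an assumption-laden illustration, not the code in the commit: the `mutation` field and the helpers `canPruneFetch` and `canSimplifyPredicate` are hypothetical names.

```go
package main

import "fmt"

// index is a hypothetical, simplified stand-in for an index descriptor.
type index struct {
	cols     []string // non-key columns the index depends on
	partial  bool     // true if the index has a WHERE predicate
	mutation bool     // true while the index is still being built
}

// touches reports whether the update modifies any column of idx.
func touches(idx index, updated map[string]bool) bool {
	for _, c := range idx.cols {
		if updated[c] {
			return true
		}
	}
	return false
}

// canPruneFetch reports whether the fetch of idx's columns may be skipped.
// Mutation indexes always need the full row state, so they are never pruned.
func canPruneFetch(idx index, updated map[string]bool) bool {
	return !idx.mutation && !touches(idx, updated)
}

// canSimplifyPredicate reports whether a partial index predicate may be
// replaced by FALSE. Mutation indexes keep their real predicate.
func canSimplifyPredicate(idx index, updated map[string]bool) bool {
	return idx.partial && !idx.mutation && !touches(idx, updated)
}

func main() {
	updated := map[string]bool{"a": true} // an UPDATE that sets only column a
	public := index{cols: []string{"val"}, partial: true}
	building := index{cols: []string{"val"}, partial: true, mutation: true}

	fmt.Println("public index:  ", canPruneFetch(public, updated), canSimplifyPredicate(public, updated))
	fmt.Println("mutation index:", canPruneFetch(building, updated), canSimplifyPredicate(building, updated))
}
```

Under these assumptions a fully built index keeps both optimizations, while an index under construction opts out of both.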

Fixes: #166122

Release note (bug fix): Fixed a bug where concurrent updates to a table
using multiple column families during a partial index creation could
result in data loss, incorrect NULL values, or validation failures in
the resulting index.


Release justification: addresses a bug that can lead to partial indexes that are corrupt or fail to construct in the face of concurrent updates.

@fqazi fqazi requested a review from a team as a code owner March 20, 2026 17:44
@fqazi fqazi requested review from yuzefovich and removed request for a team March 20, 2026 17:44
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Mar 20, 2026
@blathers-crl

blathers-crl bot commented Mar 20, 2026

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes OR fixes for serious issues. Non-production includes test-only changes, build system changes, etc. Serious issues are defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to. Reference the approved ENGREQ ticket in the PR body (e.g., "Fixes ENGREQ-123").

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl bot requested review from a team and michae2 March 20, 2026 17:44
@blathers-crl blathers-crl bot added backport Label PR's that are backports to older release branches T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Mar 20, 2026
@cockroach-teamcity
Member

This change is Reviewable

@blathers-crl

blathers-crl bot commented Mar 20, 2026

✅ PR #166322 is compliant with backport policy

Confidence: high
Critical bug criteria met: [Data corruption/loss; Bugs that can cause the DB to return incorrect results or result in suboptimal performance]
Backward compatible: true
Explanation: The PR fixes a critical bug involving data corruption or loss arising during concurrent updates when multiple column families are used, particularly during partial index creation which could lead to data loss, incorrect NULL values, or validation failures. The changes are made to critical optimization and handling functions within the SQL layer, accurately addressing issues like incorrect propagation of changes to indices due to optimization rules which overlook necessary data fetches and writes in certain scenarios. In terms of policy compliance, the PR's body explicitly mentions 'Release justification:' which includes a clear description of the bug addressed and qualifies it as a critical bug fix. The changes are straightforward, mainly affecting the handling and optimization functions directly linked to the bug. There's no removal of version gates, and the changes only introduce handling and checks around index states, thus preserving backward compatibility.

ENGREQ Check Passed: No ENGREQ required (non-production code or serious issues).

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Collaborator

@michae2 michae2 left a comment

:lgtm:

@michae2 reviewed 4 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on yuzefovich).

@fqazi fqazi force-pushed the blathers/backport-release-25.2-166123 branch from 9285d0f to 4b359db Compare March 23, 2026 12:19
@fqazi
Collaborator Author

fqazi commented Mar 23, 2026

@michae2 TFTR!

/trunk merge

@trunk-io
Contributor

trunk-io bot commented Mar 23, 2026

😎 Merged manually by @rail - details.

@rail rail merged commit cca811e into cockroachdb:release-25.2 Mar 23, 2026
19 checks passed

Labels

  • backport: Label PR's that are backports to older release branches
  • blathers-backport: This is a backport that Blathers created automatically.
  • O-robot: Originated from a bot.
  • T-sql-foundations: SQL Foundations Team (formerly SQL Schema + SQL Sessions)
  • target-release-25.2.17


4 participants