Skip to content

roachtest: retry transient errors in cdc/multiple-schema-changes#166338

Merged
trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
rharding6373:20260320-multiple-schema-changes-failure
Mar 27, 2026
Merged

roachtest: retry transient errors in cdc/multiple-schema-changes#166338
trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
rharding6373:20260320-multiple-schema-changes-failure

Conversation

@rharding6373
Copy link
Copy Markdown
Collaborator

The cdc/multiple-schema-changes roachtest used ExecMultiple to run ALTER TABLE DROP col statements, which fatals on any error with no retry. A transient "connection refused" error on any ALTER causes the test to never reach the highwater check, leading to a confusing timeout.

Replace ExecMultiple with individual SucceedsSoon retry loops for each ALTER statement. Also handle the UndefinedColumn error case where a previous attempt succeeded at the KV level but the connection dropped before the response was returned.

Fixes: #155830
Fixes: #163154

Co-Authored-By: roachdev-claude roachdev-claude-bot@cockroachlabs.com

Release note: None

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io bot commented Mar 20, 2026

😎 Merged successfully - details.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@rharding6373 rharding6373 marked this pull request as ready for review March 20, 2026 23:17
@rharding6373 rharding6373 requested a review from a team as a code owner March 20, 2026 23:17
@rharding6373 rharding6373 requested review from aerfrei and removed request for a team March 20, 2026 23:17
Comment thread pkg/cmd/roachtest/tests/cdc.go Outdated
alterStmts := []string{"SET sql_safe_updates = false"}
for _, tableName := range tableNames {
alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName))
alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we are going to retry the alter, we can use an idempotent sql statement which avoids the need to examine the error.

ALTER TABLE %s DROP COLUMN IF EXISTS col;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread pkg/cmd/roachtest/tests/cdc.go Outdated
alterStmts := []string{"SET sql_safe_updates = false"}
for _, tableName := range tableNames {
alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName))
alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The artifacts have gc'd so I can't dig into this, but I'm skeptical of fixing this with a retry. Most of the sql statements we make to the test cluster are not wrapped in retry loops. So even if this is the right fix, its not sufficient to fix the issue.

Furthermore, I don't think 'CONNECTION REFUSED' is a transient error in the context of a roachprod test. I would expect it in the context of rolling restarts, but not in a stable cluster. If we see this in a roachtest it suggests either:

  1. We failed to wait for the cluster to be ready.
  2. One of the nodes crashed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right. From the little info we have on the issue at this time, it looks like 7 ALTERs succeeded, and the test hung while the 8th ALTER was trying to connect. This suggests something happened to the cluster.

One benefit of this implementation is that if it gets stuck while the ALTERs are executing, it won't wait 1h for the test to time out, but it's possible it could get stuck after and it would still take 1h to time out, so that benefit seems marginal.

Happy to close this PR if you don't think it's useful and just close the issues as not enough info, and wait to see if it happens again to reinvestigate.

Comment thread pkg/cmd/roachtest/tests/cdc.go Outdated
for _, tableName := range tableNames {
alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName))
alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)
testutils.SucceedsSoon(t, func() error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the weirdest thing about the failures we were seeing is that we don't see the errors we (presumably) got while executing these statements in ExecMultiple, instead we just see the test timeout eventually. SucceedsSoon fatals so I would think this shouldn't be an issue, but so does ExecMultiple. Any thoughts on why we saw this as a timeout?

In the event that this retry loop still fails, I'd like to be sure we'd see a more descriptive error.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good observation and relates to Jeff's point, too. I think the error you found in #153695 (comment) might have been a red herring. The exec hung because it was trying to connect, then failed when the test timed out, returning "connection refused". So the real issue is what was happening that caused the connection to hang.

@rharding6373 rharding6373 force-pushed the 20260320-multiple-schema-changes-failure branch from e6b54e2 to 233291d Compare March 26, 2026 03:46
Copy link
Copy Markdown
Collaborator Author

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LMK if you'd prefer to close this PR (and the related issues) as not enough info.

Comment thread pkg/cmd/roachtest/tests/cdc.go Outdated
alterStmts := []string{"SET sql_safe_updates = false"}
for _, tableName := range tableNames {
alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName))
alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread pkg/cmd/roachtest/tests/cdc.go Outdated
for _, tableName := range tableNames {
alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName))
alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)
testutils.SucceedsSoon(t, func() error {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good observation and relates to Jeff's point, too. I think the error you found in #153695 (comment) might have been a red herring. The exec hung because it was trying to connect, then failed when the test timed out, returning "connection refused". So the real issue is what was happening that caused the connection to hang.

Comment thread pkg/cmd/roachtest/tests/cdc.go Outdated
alterStmts := []string{"SET sql_safe_updates = false"}
for _, tableName := range tableNames {
alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName))
alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right. From the little info we have on the issue at this time, it looks like 7 ALTERs succeeded, and the test hung while the 8th ALTER was trying to connect. This suggests something happened to the cluster.

One benefit of this implementation is that if it gets stuck while the ALTERs are executing, it won't wait 1h for the test to time out, but it's possible it could get stuck after and it would still take 1h to time out, so that benefit seems marginal.

Happy to close this PR if you don't think it's useful and just close the issues as not enough info, and wait to see if it happens again to reinvestigate.

@jeffswenson
Copy link
Copy Markdown
Collaborator

I think splitting up the statements is an improvement, but I think we should remove the retry loop until we understand why the statements are failing.

@rharding6373 rharding6373 force-pushed the 20260320-multiple-schema-changes-failure branch from 233291d to 01d3ce6 Compare March 26, 2026 17:27
…nges

The cdc/multiple-schema-changes roachtest used ExecMultiple to run ALTER
TABLE DROP col statements. ExecMultiple uses context.Background(), so
if a query hangs (e.g. due to a temporarily unavailable node), it blocks
indefinitely and cannot be cancelled. The test's 1-hour timeout fires on
the runner goroutine while the test goroutine is still stuck, so the
failure is reported as a timeout with no useful error. The actual error
(e.g. "connection refused") only surfaces later as a suppressed
failure_2 log entry.

Replace ExecMultiple with individual sqlDB.Exec calls so each ALTER
TABLE statement is executed separately. Use DROP COLUMN IF EXISTS for
idempotency.

Fixes: cockroachdb#155830
Fixes: cockroachdb#163154

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@rharding6373 rharding6373 force-pushed the 20260320-multiple-schema-changes-failure branch from 01d3ce6 to 00ebc41 Compare March 26, 2026 21:01
Copy link
Copy Markdown
Collaborator

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rharding6373
Copy link
Copy Markdown
Collaborator Author

TFTR!

@trunk-io trunk-io bot merged commit 5155d48 into cockroachdb:master Mar 27, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: cdc/multiple-schema-changes failed roachtest: cdc/multiple-schema-changes failed

4 participants