roachtest: retry transient errors in cdc/multiple-schema-changes#166338
Conversation
|
😎 Merged successfully - details. |
| alterStmts := []string{"SET sql_safe_updates = false"} | ||
| for _, tableName := range tableNames { | ||
| alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)) | ||
| alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName) |
There was a problem hiding this comment.
nit: if we are going to retry the alter, we can use an idempotent sql statement which avoids the need to examine the error.
ALTER TABLE %s DROP COLUMN IF EXISTS col;
| alterStmts := []string{"SET sql_safe_updates = false"} | ||
| for _, tableName := range tableNames { | ||
| alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)) | ||
| alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName) |
There was a problem hiding this comment.
The artifacts have gc'd so I can't dig into this, but I'm skeptical of fixing this with a retry. Most of the sql statements we make to the test cluster are not wrapped in retry loops. So even if this is the right fix, its not sufficient to fix the issue.
Furthermore, I don't think 'CONNECTION REFUSED' is a transient error in the context of a roachprod test. I would expect it in the context of rolling restarts, but not in a stable cluster. If we see this in a roachtest it suggests either:
- We failed to wait for the cluster to be ready.
- One of the nodes crashed.
There was a problem hiding this comment.
I think you're right. From the little info we have on the issue at this time, it looks like 7 ALTERs succeeded, and the test hung while the 8th ALTER was trying to connect. This suggests something happened to the cluster.
One benefit of this implementation is that if it gets stuck while the ALTERs are executing, it won't wait 1h for the test to time out, but it's possible it could get stuck after and it would still take 1h to time out, so that benefit seems marginal.
Happy to close this PR if you don't think it's useful and just close the issues as not enough info, and wait to see if it happens again to reinvestigate.
| for _, tableName := range tableNames { | ||
| alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)) | ||
| alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName) | ||
| testutils.SucceedsSoon(t, func() error { |
There was a problem hiding this comment.
To me the weirdest thing about the failures we were seeing is that we don't see the errors we (presumably) got while executing these statements in ExecMultiple, instead we just see the test timeout eventually. SucceedsSoon fatals so I would think this shouldn't be an issue, but so does ExecMultiple. Any thoughts on why we saw this as a timeout?
In the event that this retry loop still fails, I'd like to be sure we'd see a more descriptive error.
There was a problem hiding this comment.
This is a good observation and relates to Jeff's point, too. I think the error you found in #153695 (comment) might have been a red herring. The exec hung because it was trying to connect, then failed when the test timed out, returning "connection refused". So the real issue is what was happening that caused the connection to hang.
e6b54e2 to
233291d
Compare
rharding6373
left a comment
There was a problem hiding this comment.
LMK if you'd prefer to close this PR (and the related issues) as not enough info.
| alterStmts := []string{"SET sql_safe_updates = false"} | ||
| for _, tableName := range tableNames { | ||
| alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)) | ||
| alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName) |
| for _, tableName := range tableNames { | ||
| alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)) | ||
| alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName) | ||
| testutils.SucceedsSoon(t, func() error { |
There was a problem hiding this comment.
This is a good observation and relates to Jeff's point, too. I think the error you found in #153695 (comment) might have been a red herring. The exec hung because it was trying to connect, then failed when the test timed out, returning "connection refused". So the real issue is what was happening that caused the connection to hang.
| alterStmts := []string{"SET sql_safe_updates = false"} | ||
| for _, tableName := range tableNames { | ||
| alterStmts = append(alterStmts, fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName)) | ||
| alterStmt := fmt.Sprintf(`ALTER TABLE %s DROP col`, tableName) |
There was a problem hiding this comment.
I think you're right. From the little info we have on the issue at this time, it looks like 7 ALTERs succeeded, and the test hung while the 8th ALTER was trying to connect. This suggests something happened to the cluster.
One benefit of this implementation is that if it gets stuck while the ALTERs are executing, it won't wait 1h for the test to time out, but it's possible it could get stuck after and it would still take 1h to time out, so that benefit seems marginal.
Happy to close this PR if you don't think it's useful and just close the issues as not enough info, and wait to see if it happens again to reinvestigate.
|
I think splitting up the statements is an improvement, but I think we should remove the retry loop until we understand why the statements are failing. |
233291d to
01d3ce6
Compare
…nges The cdc/multiple-schema-changes roachtest used ExecMultiple to run ALTER TABLE DROP col statements. ExecMultiple uses context.Background(), so if a query hangs (e.g. due to a temporarily unavailable node), it blocks indefinitely and cannot be cancelled. The test's 1-hour timeout fires on the runner goroutine while the test goroutine is still stuck, so the failure is reported as a timeout with no useful error. The actual error (e.g. "connection refused") only surfaces later as a suppressed failure_2 log entry. Replace ExecMultiple with individual sqlDB.Exec calls so each ALTER TABLE statement is executed separately. Use DROP COLUMN IF EXISTS for idempotency. Fixes: cockroachdb#155830 Fixes: cockroachdb#163154 Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
01d3ce6 to
00ebc41
Compare
|
TFTR! |
The cdc/multiple-schema-changes roachtest used ExecMultiple to run ALTER TABLE DROP col statements, which fatals on any error with no retry. A transient "connection refused" error on any ALTER causes the test to never reach the highwater check, leading to a confusing timeout.
Replace ExecMultiple with individual SucceedsSoon retry loops for each ALTER statement. Also handle the UndefinedColumn error case where a previous attempt succeeded at the KV level but the connection dropped before the response was returned.
Fixes: #155830
Fixes: #163154
Co-Authored-By: roachdev-claude roachdev-claude-bot@cockroachlabs.com
Release note: None