- Linux `test` job on PR [#184](https://github.com/kernel/hypeman/pull/184) failed while the other checks passed.
- Observed failures from the GitHub Actions log:
  - `TestQEMUStandbyRestoreCompressionScenarios`
  - `TestQEMUStandbyAndRestore`
  - `TestBasicEndToEnd`
  - `TestForkCloudHypervisorFromRunningNetwork`
- Failure shapes were integration stalls, not deterministic assertion failures:
  - `instance ... did not reach Running within 20s (last state: Initializing)`
  - `rpc error: code = DeadlineExceeded desc = stream terminated by RST_STREAM with error code: CANCEL`
### Investigation
- Initial stopgap of removing `t.Parallel()` from the new restart-recovery tests was rejected; that was the wrong direction and was not kept.
- Reproduced the branch on `deft-kernel-dev` using the CI-like Linux/root flow with correct prewarm env:
  - `go mod download`
  - `make oapi-generate`
  - `make build`
  - `go run ./cmd/test-prewarm`
  - `sudo env ... go test -count=1 -tags containers_image_openpgp -timeout=20m ./...`
- Tight loop on the exact CI-failing tests did not reproduce a flake once the command shape matched CI and prewarm settings were correct.

### Root cause and fix
- The new standby compression recovery tests were unit-style tests but used `setupTestManager`, which pulls in much heavier integration-style manager setup than needed.
- That extra setup was unnecessary for these tests and added avoidable load to an already heavy `lib/instances` package.
- Fix:
  - Added a lightweight `newSnapshotCompressionTestManager` helper in `lib/instances/snapshot_compression_test.go`
  - Moved the new delayed-job and restart-recovery tests to that lightweight fixture
  - Restored `t.Parallel()` on the new recovery tests and subtests
- This keeps coverage and parallelism intact while removing needless setup cost.
### Validation
- Targeted stress loop after the fixture change:
  - `go test -count=20 -run '^(TestRecoverPendingStandbyCompressionJobs|TestStartCompressionJobDelayedCancellationRecordsSkipped)$' ./lib/instances`
  - Result: pass
- Deft full fresh-cache CI-like runs after the fix:
  - Run 1: pass (`lib/instances` 193.279s)
  - Run 2: pass (`lib/instances` 261.633s)
  - Run 3: pass (`lib/instances` 173.573s)
- `exec-agent not ready for instance ... within 15s (last state: Initializing)`

### Additional flakes reproduced during Deft full-suite verification