## Description

Move Altinity/clickhouse-regression#124

Affected test:

- `/swarms/feature/node failure/check restart clickhouse on swarm node`

Affected files:

- `swarms/tests/node_failure.py`
- `swarms/tests/steps/swarm_node_actions.py`
The `check restart clickhouse on swarm node` test fails consistently on ClickHouse Antalya 26.1 (100% fail rate) while passing on Antalya 25.8. The test verifies that a swarm cluster query fails with `exitcode=32` (`Attempt to read after eof`) when a ClickHouse process is killed on a swarm node during query execution.

On 26.1, the query either succeeds when it shouldn't (with the SEGV signal) or hangs indefinitely (with the KILL signal), instead of failing quickly with an EOF error as it does on 25.8.
## Analysis

### Test behavior by version and signal
| Version | Signal | Behavior | Test result |
|---|---|---|---|
| 25.8 | KILL | Query fails in ~5s with `Code: 32. DB::Exception: Attempt to read after eof` | OK (consistent) |
| 26.1 | SEGV (current main branch) | Query completes successfully — tasks are redistributed to the surviving node, returns `100 clickhouse2` with exitcode 0 | Fail (AssertionError — exitcode 0 ≠ 32) |
| 26.1 | KILL (PR 1520 / newer regression commits) | Query hangs for 600s until bash timeout | Error (ExpectTimeoutError) |
### Two separate issues

#### 1. SEGV on 26.1: query survives node failure (AssertionError) — historical, test already updated
With SEGV, the ClickHouse process takes time to die (core dump generation). During this window, the TaskDistributor detects the node going down and redistributes its pending tasks to the surviving node. The query completes successfully with only one node's results. The test code was updated in commit f1827080b to use KILL instead of SEGV, so this failure mode will no longer occur once the regression commit hash is updated on the main branch.
#### 2. KILL on 26.1: query hangs instead of failing (ExpectTimeoutError) — potential ClickHouse bug
With KILL, the process dies instantly (no cleanup). On 25.8, this causes an immediate Attempt to read after eof error (~5s). On 26.1, the initiator does not detect the broken connection and the query hangs for the full 600s bash timeout. This is a behavioral regression in 26.1.
### What changed in 26.1
All Altinity swarm-specific PRs (#780, #866, #1014, #1042, #1201, etc.) are labeled antalya-25.8 or earlier; the swarm code is already present in 25.8, where the test passes. PRs #1395 and #1414, which forward-ported this code to 26.1, contain only code that was already in 25.8.
The difference is in the upstream ClickHouse base code (26.1 vs. ~25.5). Key files that differ between the two branches:

- `src/Storages/ObjectStorage/StorageObjectStorageCluster.cpp`: new `parallel_replicas_for_cluster_engines` and `cluster_table_function_split_granularity`, changed `distributed_processing` derivation
- `src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp`: removed `has_concurrent_next()` check, reordered `handleReplicaLoss()` operations, changed file identifier resolution
- `src/Storages/IStorageCluster.cpp`: metadata handling, join detection changes
- `src/Storages/ObjectStorage/StorageObjectStorageSource.cpp`/`.h`: structural changes
Core connection handling code is identical between versions: `RemoteQueryExecutor.cpp`, `MultiplexedConnections.cpp`, `ReadBufferFromPocoSocket.cpp`, `PacketReceiver.cpp`, `ConnectionTimeouts.cpp`/`.h`.
### Database evidence

```sql
-- 26.1: 100% failure rate
SELECT result, count() FROM `gh-data`.clickhouse_regression_results
WHERE test_name = '/swarms/feature/node failure/check restart clickhouse on swarm node'
  AND clickhouse_version LIKE '26.1%'
GROUP BY result
-- Fail: ~25, Error: ~8

-- 25.8: passes consistently (rare flaky failure)
SELECT result, count() FROM `gh-data`.clickhouse_regression_results
WHERE test_name = '/swarms/feature/node failure/check restart clickhouse on swarm node'
  AND clickhouse_version LIKE '25.8%'
GROUP BY result
-- OK: ~15, Fail: 1
```

### Commit context
The test code was updated in commit f1827080b (2026-03-10, "Update node failure tests to use kill instead segv") to use `signal="KILL"`, `delay=30`, `delay_before_execution=5`. The main branch antalya-26.1 still uses the old regression commit a54216bbc, which has `signal="SEGV"`, `delay=0`. PR 1520 is updating the regression commit hash for antalya-26.1, and the failures with KILL (ExpectTimeoutError) can be seen in that PR's CI runs.
## Next Steps

- Validate manually that KILL on 26.1 causes a hang (reproduce the ExpectTimeoutError scenario)
- Investigate why the upstream 26.1 base code doesn't propagate EOF from a killed node's connection
- Potentially file a bug in Altinity/ClickHouse if the KILL hang is confirmed as a regression
## References

- Failing run: https://github.com/Altinity/ClickHouse/actions/runs/23546722566/job/68549375510
- Commit that changed test signal from SEGV to KILL: Altinity/clickhouse-regression@f1827080b
## Confirmed defects

- High: Replica-loss reschedule uses inconsistent file identifiers, causing stalled object-storage cluster queries.
  - Impact: Long-running object-storage cluster queries can stall until the client timeout (300s in the swarms harness) after a swarm node is killed/restarted, instead of failing fast with EOF (Code 32).
  - Anchor: `src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp`, function `StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica`.
  - Trigger: Run `/swarms/feature/node failure/check restart clickhouse on swarm node` with `signal=KILL`, `delay_before_execution=5`, `delay=30` on a 26.1 build.
  - Why defect: The reschedule path computes and enqueues with `file->getPath()`, while the normal task assignment and dequeue paths use a different identity, `getAbsolutePathFromObjectInfo(...).value_or(getIdentifier())`. This breaks queue identity invariants after replica loss.
  - Transition: ConnectionLost -> `RemoteQueryExecutor::processPacket` -> `task_iterator->rescheduleTasksFromReplica()` -> requeue under a non-canonical key -> downstream task distribution cannot reliably reconcile the same object identity -> the initiator-side query waits and times out.
  - Proof sketch: In the 26.1 code, `rescheduleTasksFromReplica` uses `file->getPath()` for both `getReplicaForFile(...)` and `unprocessed_files.emplace(...)`, while `getPreQueuedFile` and `getMatchingFileFromIterator` use absolute-path- and identifier-based keys. In the 25.8 working baseline, reschedule used an absolute-path-aware identity, `getAbsolutePath().value_or(getPath())`, preserving key consistency.
  - Root-cause PR: Introduced by PR #1414 ("26.1 Antalya port - improvements for cluster requests", commit cc2dea7...) in the new 26.1 reschedule implementation. Not fixed by PR #1568 ("Antalya 26.1: Fix rescheduleTasksFromReplica"), which only addresses erase-order and UAF safety.
  - Bug code location:
    - File: `src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp`
    - Function: `StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica`
    - Problematic lines:

      ```cpp
      for (const auto & file : processed_file_list_ptr->second)
      {
          auto file_replica_idx = getReplicaForFile(file->getPath());
          unprocessed_files.emplace(file->getPath(), std::make_pair(file, file_replica_idx));
          connection_to_files[file_replica_idx].push_back(file);
      }
      ```

    - Conflicting canonical key logic in the same class:

      ```cpp
      auto file_identifier = send_over_whole_archive
          ? next_file->getPathOrPathToArchiveIfArchive()
          : getAbsolutePathFromObjectInfo(next_file).value_or(next_file->getIdentifier());
      ```
  - Smallest logical repro:
    - Start a long query with `object_storage_cluster='static_swarm_cluster'` on the initiator.
    - Kill one swarm replica process with signal KILL during active remote reads.
    - Observe the ConnectionLost path invoke the reschedule.
    - The query does not fail fast with EOF and hangs until the harness timeout (`ExpectTimeoutError: Timeout 300s`).
  - Fix direction (short): Backport the identifier-consistency changes from PR #1493 ("Fix file identifier in rescheduleTasksFromReplica"), i.e. a `getFileIdentifier`-style unified key derivation, in addition to PR #1568 ("Antalya 26.1: Fix rescheduleTasksFromReplica").