Export Partition - release the part lock when the query is cancelled#1593
Export Partition - release the part lock when the query is cancelled#1593arthurpassos wants to merge 4 commits intoantalya-26.1from
Conversation
|
@codex review |
There was a problem hiding this comment.
NIT: May be move this log record below exit without increment to avoid 'will now increment' - 'oh, sorry, will not'? I think second line would be enough to understand what is going on.
There was a problem hiding this comment.
I removed the "will now increment counters" part
|
AI audit note: This review comment was generated by AI (gpt-5.3-codex). Audit update for PR #1593 (Export Partition - release part lock when query is cancelled): Confirmed defectsMedium: Cancellation path ignores lock-release failure and can leave export stuckImpact: A cancelled part export can remain locked by the same replica if ZooKeeper remove fails transiently; subsequent scheduler cycles skip the part as "locked", so export may remain pending until ZooKeeper session expiry/manual intervention. Anchor: Trigger: Affected transition: Why defect: The new cancellation branch performs a best-effort Affected subsystem and blast radius: Replicated MergeTree export-partition scheduler state in ZooKeeper ( Smallest logical reproduction steps: Start Logical fault-injection mapping: Inject fault category "coordination write failure on cancellation transition" at Fix direction (short): Check Regression test direction (short): Add integration/fault-injection test that forces Code evidence: if (exception->code() == ErrorCodes::QUERY_WAS_CANCELLED)
{
zk->tryRemove(export_path / "locks" / part_name, locked_by_stat.version);
LOG_INFO(storage.log, "ExportPartition scheduler task: Part {} export was cancelled, skipping error handling", part_name);
return;
}Coverage summaryScope reviewed: Full PR delta for Categories failed: Cancellation-transition durability under coordination failures (lock release failure handling). Categories passed: Scheduler stop/start gating ( Assumptions/limits: Static deep audit only (no runtime execution/CI); fault-injection outcomes were reasoned from code paths and ZooKeeper API contracts. |
|
Added a test that covers this for export partition: https://github.com/Altinity/clickhouse-regression/blob/6ca30d8082b91c4629f29d57cc917083d5b70f23/s3/tests/export_partition/network.py#L483 It is passing with this build locally: Running as: |
PR #1593 CI Triage — Export Partition: release the part lock when the query is cancelled
PR DescriptionBug fix for export partition: releases the part lock when an export task is cancelled (e.g. via Changed files (2):
Summary
Verdict: No failures are caused by this PR. All CI failures are pre-existing or infrastructure-related. PR-targeted regression suites all passed: s3_export_partition (x86_64 + aarch64), s3_export_part (x86_64 + aarch64), iceberg (x86_64 + aarch64), parquet (x86_64 + aarch64) — all green. CI Jobs Status
Detailed Failure Analysis1.
|
| Suite | x86_64 | aarch64 |
|---|---|---|
| s3_export_partition | ✅ 4 features, 7 scenarios — all OK (5m 18s) | ✅ 4 features, 7 scenarios — all OK (4m 57s) |
| s3_export_part | ✅ 17 features, 87 scenarios — all OK (1h 45m) | ✅ 17 features, 87 scenarios — all OK (1h 44m) |
| iceberg_1 | ✅ 42 features, 487 scenarios (1h 6m) | ✅ 42 features, 487 scenarios (1h 24m) |
| iceberg_2 | ✅ 29 features, 1576 scenarios (1h 21m) | ✅ 29 features, 1576 scenarios (1h 41m) |
| parquet | ✅ 39 features (57m) | ✅ 39 features (1h 15m) |
Additionally, the PR author confirmed local pass of the export partition network test (/s3/minio/export tests/export partition/network/export resumes after stop start moves).
Recommendations
- Merge-ready from CI perspective — No PR-caused regressions detected. All targeted regression suites pass.
- Pre-existing issues to track separately:
- Swarms node failure: #1601
- FULL OUTER JOIN in swarms: ClickHouse#89996
- Alpine Docker CVE: CVE-2026-2673 — requires base image update (affects all PRs)
- Audit note from @Selfeer: An AI audit identified a medium-severity concern about cancellation path ignoring lock-release failure if ZooKeeper
tryRemovefails transiently. This is a code-quality consideration, not a CI regression.
During export partition, parts are locked by replicas for exports. This PR introduces a change that releases these locks when an export task is cancelled. Previously, it would not release the lock. We did not catch this error before because the only cases an export task was cancelled we tested were
KILL EXPORT PARTITIONandDROP TABLE. In those cases, the entire task is cancelled, so it does not matter if a replica does not release its lock.But a query can also be cancelled with 'SYSTEM STOP MOVES', and in that case, it is a local operation. The lock must be released so other replicas can continue.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: