Export Partition: cancellation path ignores lock-release failure, can leave export stuck #1602

@Selfeer

Summary

During export partition, when a task is cancelled (e.g. via SYSTEM STOP MOVES), the cancellation handler performs a best-effort zk->tryRemove to release the part lock but does not check the result or handle failure. If ZooKeeper tryRemove fails transiently (network blip, version mismatch race), the lock remains in ZooKeeper and subsequent scheduler cycles skip the part as "locked", leaving the export stuck until ZooKeeper session expiry or manual intervention.

Severity

Medium

Affected code

src/Storages/MergeTree/ExportPartitionTaskScheduler.cpp, in ExportPartitionTaskScheduler::handlePartExportFailure

if (exception->code() == ErrorCodes::QUERY_WAS_CANCELLED)
{
    zk->tryRemove(export_path / "locks" / part_name, locked_by_stat.version);
    LOG_INFO(storage.log, "ExportPartition scheduler task: Part {} export was cancelled, skipping error handling", part_name);
    return;
}

Affected subsystem

Replicated MergeTree export-partition scheduler state in ZooKeeper (exports/<key>/locks/<part>). Impacts recovery/progress of pending partition exports on the affected replica.

Steps to reproduce

  1. Start EXPORT PARTITION on a replicated MergeTree table
  2. Issue SYSTEM STOP MOVES to trigger cancellation
  3. Inject a transient ZooKeeper failure (or version mismatch race) during the tryRemove call for exports/<key>/locks/<part>
  4. Issue SYSTEM START MOVES
  5. Observe: the part remains skipped by the scheduler due to the stale lock; export does not resume for that part
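
The failure mode in these steps can be illustrated with a toy in-memory model. This is only a sketch: FakeLocks, fail_next_remove and schedulable are hypothetical names invented for illustration, and none of this is ClickHouse or ZooKeeper code. It shows how a single ignored remove failure keeps a part out of every later scheduling pass.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the exports/<key>/locks/ directory.
struct FakeLocks
{
    std::set<std::string> locked;
    bool fail_next_remove = false;  // injected transient failure

    // Mimics a best-effort remove: returns false when the injected failure fires,
    // leaving the lock in place.
    bool tryRemove(const std::string & part)
    {
        if (fail_next_remove)
        {
            fail_next_remove = false;
            return false;
        }
        locked.erase(part);
        return true;
    }
};

// One scheduler pass in this model: locked parts are skipped, the rest are picked.
std::vector<std::string> schedulable(const FakeLocks & locks, const std::vector<std::string> & parts)
{
    std::vector<std::string> out;
    for (const auto & part : parts)
        if (locks.locked.count(part) == 0)
            out.push_back(part);
    return out;
}
```

If the caller discards the return value of tryRemove while the injected failure is active, the part stays in locked and schedulable keeps omitting it until something else removes the entry.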

Expected behavior

The lock should be reliably released on cancellation. If tryRemove fails, the failure should be detected and remediated (retry, backoff, or fallback cleanup) so that the part becomes schedulable again after SYSTEM START MOVES.
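
One possible shape for the remediation is a bounded retry with backoff that inspects the result of each attempt. The sketch below is an assumption, not the project's implementation: RemoveResult and releaseLockWithRetry are hypothetical names, the result codes are only loosely modelled on ZooKeeper error classes, and the remove call is passed in as a callback so the snippet stays self-contained.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical result codes, loosely modelled on ZooKeeper error classes.
enum class RemoveResult { Ok, NoNode, BadVersion, ConnectionLoss };

// Returns true when the lock is known to be gone.
// Returns false when the caller must fall back to other cleanup.
bool releaseLockWithRetry(
    const std::function<RemoveResult()> & try_remove,
    int max_attempts = 3,
    std::chrono::milliseconds initial_backoff = std::chrono::milliseconds(10))
{
    auto backoff = initial_backoff;
    for (int attempt = 1; attempt <= max_attempts; ++attempt)
    {
        switch (try_remove())
        {
            case RemoveResult::Ok:
            case RemoveResult::NoNode:
                return true;   // released now, or already absent
            case RemoveResult::BadVersion:
                return false;  // node changed; retrying with the same version cannot succeed
            case RemoveResult::ConnectionLoss:
                break;         // treated as transient: retry below
        }
        if (attempt < max_attempts)
        {
            std::this_thread::sleep_for(backoff);
            backoff *= 2;      // exponential backoff between attempts
        }
    }
    return false;              // retries exhausted
}
```

In the cancellation branch, the callback would wrap the existing remove call, and a false return would be logged and handed to whatever fallback cleanup is chosen. Whether a version mismatch should trigger a re-read of the lock node is a design decision this sketch leaves to the caller.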

Actual behavior

tryRemove result is ignored. If the remove fails, the stale lock persists in ZooKeeper and blocks rescheduling of the affected part until ZooKeeper session expiry or manual cleanup.

References

Origin

Identified via static audit of PR #1593 ("Export Partition - release the part lock when the query is cancelled"). The defect is in the new cancellation branch introduced by the PR.
