Skip to content

fix(compute): cancel backup longjobs before migration when backupTaskLongJobUuids present#3530

Open
zstack-robot-1 wants to merge 2 commits into5.5.12from
sync/ye.zou/fix/ZSTAC-82195-migrate-cancel-backup
Open

fix(compute): cancel backup longjobs before migration when backupTaskLongJobUuids present#3530
zstack-robot-1 wants to merge 2 commits into5.5.12from
sync/ye.zou/fix/ZSTAC-82195-migrate-cancel-backup

Conversation

@zstack-robot-1
Copy link
Collaborator

ZSTAC-82195: 备份任务取消后热迁移仍失败

Root Cause

UI 提交迁移 LongJob 时在 jobData JSON 里带了 backupTaskLongJobUuids,期望 MN 先取消备份再迁移。但 MigrateVmLongJob.start() 把 jobData 反序列化为 MigrateVmInnerMsg,该类没有 backupTaskLongJobUuids 字段,GSON 静默丢弃 → MN 直接执行迁移 → volume chain task 还在 → 失败。

Solution

MigrateVmLongJob.start() 中:

  1. 从 raw jobData (LinkedHashMap) 额外解析 backupTaskLongJobUuids
  2. 如果存在,逐个发送 CancelLongJobMsg 取消备份 longjob
  3. 取消完成后,轮询 GetVolumeTaskMsg 等待 volume chain task 退出(最多 30s)
  4. chain task 清除后再发 MigrateVmInnerMsg 执行迁移
  5. backupTaskLongJobUuids 时保持原有逻辑不变

Test

  • 新增 BackupCancelMigrateCase.groovy(premium test-premium)
  • 覆盖:带 cancel 迁移、不带 cancel 迁移、多个 backup UUID 取消

Files Changed

File Change
compute/.../MigrateVmLongJob.java 在 start() 中加 cancel backup + wait chain task 逻辑
premium/test-premium/.../BackupCancelMigrateCase.groovy 新增测试 case

sync from gitlab !9389

…ackupTaskLongJobUuids

ZSTAC-82195: UI submits migrate LongJob with backupTaskLongJobUuids in
jobData, expecting MN to cancel backups first. But MigrateVmInnerMsg
lacks this field, so GSON drops it — backup never cancelled, volume
chain task blocks migration.

Parse backupTaskLongJobUuids from raw jobData, cancel each via
CancelLongJobMsg, poll volume chain tasks until clear, then migrate.

Co-Authored-By: ye.zou <ye.zou@zstack.io>
@coderabbitai
Copy link

coderabbitai bot commented Mar 18, 2026

Walkthrough

在 VM 迁移长任务中加入可选的备份长任务预取消流程:提取 backupTaskLongJobUuids,递归发送取消消息、等待卷上备份链任务退出(带重试/超时),然后再执行迁移;新增调度与重试常量与日志记录。

Changes

Cohort / File(s) Summary
VM 迁移任务增强
compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java
新增预取消相关备份长任务的控制流(递归 CancelLongJobMsg)、实现等待卷备份链任务退出的轮询与重试逻辑、注入 ThreadFacade、添加私有 logger 与重试常数,扩展相关导入。

Sequence Diagram(s)

sequenceDiagram
    participant User as 用户
    participant MigrateJob as VM迁移长任务
    participant BackupSvc as 备份服务
    participant VolumeSvc as 卷服务
    participant ThreadFacade as 线程调度器

    User->>MigrateJob: 启动迁移请求
    MigrateJob->>MigrateJob: 提取 backupTaskLongJobUuids

    alt 存在备份长任务
        MigrateJob->>BackupSvc: 发送 CancelLongJobMsg(递归多个 UUID)
        BackupSvc-->>MigrateJob: 返回取消结果

        loop 轮询直到卷备份链任务退出或超时
            MigrateJob->>VolumeSvc: 发送 GetVolumeTaskMsg 查询 runningTask
            VolumeSvc-->>MigrateJob: 返回任务状态
            alt 还有运行任务
                MigrateJob->>ThreadFacade: 调度延迟重试(间隔常数)
                ThreadFacade-->>MigrateJob: 触发下一次查询
            else 无运行任务
                MigrateJob->>MigrateJob: 继续执行 doMigrate(...)
            end
        end
    else 无备份长任务
        MigrateJob->>MigrateJob: 直接执行 doMigrate(...)
    end

    MigrateJob-->>User: 迁移完成或失败
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 迁移前兔子忙,先把备份点亮熄。
等到链条全退出,轻轻一跃入新地。
递归消息稳又巧,重试间隔不慌张。


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Title check ❌ Error PR title does not conform to the required format constraint of 72 characters or less; it is 89 characters long. Shorten the title to 72 characters or less while maintaining the [scope]: format, e.g., 'fix(compute): cancel backup longjobs before migration'.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed PR description clearly explains the root cause, solution, and test coverage related to the backup job cancellation before VM migration changes in the codebase.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/ye.zou/fix/ZSTAC-82195-migrate-cancel-backup
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java (2)

151-153: GetVolumeTaskReply 结果建议增加空值保护。

这里直接访问 gr.getResults().values()info.getRunningTask(),在异常返回或边界数据下可能触发 NPE,导致长任务中断。建议加上 null 防御判断。

🔧 建议修改
- boolean hasRunningTasks = gr.getResults().values().stream()
-         .anyMatch(info -> !info.getRunningTask().isEmpty());
+ boolean hasRunningTasks = gr.getResults() != null && gr.getResults().values().stream()
+         .anyMatch(info -> info != null && info.getRunningTask() != null && !info.getRunningTask().isEmpty());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 151 - 153, GetVolumeTaskReply handling can NPE if gr or its maps/lists are
null; update the code around reply.castReply() and the hasRunningTasks
computation to null-check gr, gr.getResults(), each Map.Entry value, and
info.getRunningTask() before streaming/inspecting; specifically guard the result
of reply.castReply() (GetVolumeTaskReply gr), ensure gr.getResults() != null,
skip null values, and treat null/empty runningTask as empty when computing the
boolean hasRunningTasks so the long job won't NPE on abnormal/empty replies.

98-103: cancelBackupLongJobsThenMigrate()fail 分支当前不可达。

cancelBackupLongJobs() 内部无论单个取消成功或失败都会继续递归,最终只会走 completion.success(),不会触发 completion.fail()。建议删除该 fail 分支,或在达到失败阈值时显式回调 fail,避免误导后续维护。

Also applies to: 107-127

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 98 - 103, The fail branch inside cancelBackupLongJobsThenMigrate() is
unreachable because cancelBackupLongJobs() always ultimately calls
completion.success(); either remove the unreachable fail handler in
MigrateVmLongJob (the anonymous Callback.fail implementation that logs and calls
doMigrate(msg, completion)) to avoid confusion, or change cancelBackupLongJobs()
to track failure count and invoke the provided Callback.fail when a configured
failure threshold is exceeded so the fail path becomes reachable; update
references in MigrateVmLongJob.doMigrate / cancelBackupLongJobsThenMigrate to
match the chosen approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java`:
- Around line 78-89: The async chain in start() (specifically in
cancelBackupLongJobsThenMigrate -> cancelBackupLongJobs callback ->
waitForVolumeChainTasksExit callback -> doMigrate) lacks a check for the LongJob
being canceled, so a user cancel can still allow doMigrate() to run; update each
asynchronous continuation (the anonymous Completion.success() handlers in
cancelBackupLongJobsThenMigrate, the similar callbacks at the other locations
(around lines 166-171 and 197-212)) to first query the long job status (via the
job UUID accessible in this class) and if the status is CANCELED/TERMINATED,
short-circuit by calling completion.success()/fail() as appropriate and do not
call doMigrate(); also ensure cancel() (which sends CancelMigrateVmMsg) sets the
job status so that these checks observe cancellation before proceeding.
- Around line 69-75: getBackupTaskLongJobUuids currently unsafely casts
raw.get("backupTaskLongJobUuids") to (List<String>) and returns it without
per-element validation; change the logic to read it as List<?> (or Object),
iterate each element in getBackupTaskLongJobUuids, convert non-null elements to
String, trim them, filter out null/blank entries and non-string types, collect
into a List<String> and return that (or an empty list/null if none valid) so
callers no longer receive invalid or blank UUIDs.

---

Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java`:
- Around line 151-153: GetVolumeTaskReply handling can NPE if gr or its
maps/lists are null; update the code around reply.castReply() and the
hasRunningTasks computation to null-check gr, gr.getResults(), each Map.Entry
value, and info.getRunningTask() before streaming/inspecting; specifically guard
the result of reply.castReply() (GetVolumeTaskReply gr), ensure gr.getResults()
!= null, skip null values, and treat null/empty runningTask as empty when
computing the boolean hasRunningTasks so the long job won't NPE on
abnormal/empty replies.
- Around line 98-103: The fail branch inside cancelBackupLongJobsThenMigrate()
is unreachable because cancelBackupLongJobs() always ultimately calls
completion.success(); either remove the unreachable fail handler in
MigrateVmLongJob (the anonymous Callback.fail implementation that logs and calls
doMigrate(msg, completion)) to avoid confusion, or change cancelBackupLongJobs()
to track failure count and invoke the provided Callback.fail when a configured
failure threshold is exceeded so the fail path becomes reachable; update
references in MigrateVmLongJob.doMigrate / cancelBackupLongJobsThenMigrate to
match the chosen approach.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 2b68d15f-941f-4d88-9f59-154086783749

📥 Commits

Reviewing files that changed from the base of the PR and between e0a8246 and 531859e.

📒 Files selected for processing (1)
  • compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java

Comment on lines +78 to +89
private void cancelBackupLongJobsThenMigrate(List<String> backupTaskLongJobUuids,
MigrateVmInnerMsg msg,
ReturnValueCompletion<APIEvent> completion) {
cancelBackupLongJobs(backupTaskLongJobUuids.iterator(), new Completion(completion) {
@Override
public void success() {
waitForVolumeChainTasksExit(msg.getVmInstanceUuid(), WAIT_CHAIN_TASK_EXIT_MAX_RETRIES,
new Completion(completion) {
@Override
public void success() {
doMigrate(msg, completion);
}
Copy link

@coderabbitai coderabbitai bot Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

新增预处理阶段缺少“已取消”门禁,存在取消后仍迁移的风险。

现在 start() 会先执行“取消备份 + 轮询等待”再 doMigrate(),但这条异步链路里没有任何 longjob 取消态检查;cancel() 仅发送 CancelMigrateVmMsg。如果用户在预处理阶段点击取消,后续仍可能进入 doMigrate()。建议在每次异步回调继续前检查 longjob 状态,已取消则立即终止。

Also applies to: 166-171, 197-212

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 78 - 89, The async chain in start() (specifically in
cancelBackupLongJobsThenMigrate -> cancelBackupLongJobs callback ->
waitForVolumeChainTasksExit callback -> doMigrate) lacks a check for the LongJob
being canceled, so a user cancel can still allow doMigrate() to run; update each
asynchronous continuation (the anonymous Completion.success() handlers in
cancelBackupLongJobsThenMigrate, the similar callbacks at the other locations
(around lines 166-171 and 197-212)) to first query the long job status (via the
job UUID accessible in this class) and if the status is CANCELED/TERMINATED,
short-circuit by calling completion.success()/fail() as appropriate and do not
call doMigrate(); also ensure cancel() (which sends CancelMigrateVmMsg) sets the
job status so that these checks observe cancellation before proceeding.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from ye.zou:

这是 LongJob 框架层面的通用缺陷——start() 和 cancel() 跑在不同线程/节点,框架没有提供取消信号传递机制。在单个 LongJob bugfix MR 里加 volatile 标志不是正确的修法,且多节点场景下内存标志也不够。

已开 ZSTAC-83470 跟踪框架改进:由框架提供 isCancelled(jobUuid) 查询(走 DB 保证多节点一致),所有 LongJob 实现在异步回调 checkpoint 统一调用。

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zstack-robot-1,感谢 ye.zou 的说明。理解了——start()cancel() 运行于不同线程/节点,框架层面缺乏统一的取消信号传递机制,在单个 bugfix PR 里添加内存标志(volatile)并不是正确的解法,多节点场景下也不具备一致性保证。在 ZSTAC-83470 中由框架统一提供基于 DB 的 isCancelled(jobUuid) 查询、并在所有 LongJob 实现的异步回调 checkpoint 处统一调用,是更合理的系统性修复方向。此 PR 范围内无需处理,感谢跟踪。


✏️ Learnings added
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 3530
File: compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java:78-89
Timestamp: 2026-03-19T07:19:58.791Z
Learning: In the ZStack codebase (MigrateVmLongJob.java and LongJob framework in general), start() and cancel() run on different threads/nodes. The framework provides no built-in cancellation signal propagation mechanism between them. Adding an in-memory volatile flag in a single LongJob implementation is insufficient for multi-node scenarios. The correct fix is a framework-level isCancelled(jobUuid) query backed by the DB (for multi-node consistency), to be called at async callback checkpoints in all LongJob implementations. This is tracked under ZSTAC-83470.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

… deserialization

Use List<?> iteration with null/blank filtering instead of unchecked
(List<String>) cast in getBackupTaskLongJobUuids.

Co-Authored-By: ye.zou <ye.zou@zstack.io>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java`:
- Around line 161-178: The current check using GetVolumeTaskReply gr ->
getResults().values().stream().anyMatch(info ->
!info.getRunningTask().isEmpty()) treats any running volume task as blocking;
narrow the predicate to only consider the backup-chain tasks relevant to this
cancel operation (inspect the running task entries returned by
GetVolumeTaskReply for a task type/name or context field and match that to the
backup-chain identifier or context used by the cancel operation), e.g. replace
the anyMatch condition to detect only tasks whose type/name/context indicates
the backup chain for this VM (use symbols like GetVolumeTaskReply,
info.getRunningTask(), vmUuid and the surrounding retry logic using
WAIT_CHAIN_TASK_EXIT_INTERVAL_SECS/retriesLeft/completion.fail/success to ensure
only backup-chain tasks trigger the retry/fail path).
- Around line 117-135: The cancelBackupLongJobs recursion currently swallows
individual CancelLongJobMsg failures (in the CloudBusCallBack for
cancelBackupLongJobs) by only logging and continuing, which prevents the caller
from seeing cancellation errors; modify cancelBackupLongJobs (and its
CloudBusCallBack) so that on a non-successful reply you propagate the error to
the provided Completion (e.g., call completion.fail(reply.getError()) or
otherwise record the failure and stop further recursion), and only call
completion.success() when all cancels succeed; reference cancelBackupLongJobs,
CancelLongJobMsg, CloudBusCallBack, and the completion parameter when making
this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 99605c44-2962-42ab-bbe6-999c828a7461

📥 Commits

Reviewing files that changed from the base of the PR and between 531859e and 8d1a8d9.

📒 Files selected for processing (1)
  • compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java

Comment on lines +117 to +135
private void cancelBackupLongJobs(Iterator<String> it, Completion completion) {
if (!it.hasNext()) {
completion.success();
return;
}

String longJobUuid = it.next();
CancelLongJobMsg cmsg = new CancelLongJobMsg();
cmsg.setUuid(longJobUuid);
bus.makeLocalServiceId(cmsg, LongJobConstants.SERVICE_ID);
bus.send(cmsg, new CloudBusCallBack(completion) {
@Override
public void run(MessageReply reply) {
if (!reply.isSuccess()) {
logger.warn(String.format("failed to cancel backup longjob[uuid:%s]: %s",
longJobUuid, reply.getError()));
}
cancelBackupLongJobs(it, completion);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

单个备份 LongJob 的取消失败现在被吞掉了。

Line 130-135 只记日志后继续递归,所以只要迭代器走完,外层就一定会进入后续等待流程。这样一来,单个取消失败不会走到 Line 109-113 的“继续迁移”兜底,反而更可能在等待阶段超时并把迁移直接打失败。建议把任一取消失败汇总后显式回传给调用方,让外层按既有兜底路径处理。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 117 - 135, The cancelBackupLongJobs recursion currently swallows
individual CancelLongJobMsg failures (in the CloudBusCallBack for
cancelBackupLongJobs) by only logging and continuing, which prevents the caller
from seeing cancellation errors; modify cancelBackupLongJobs (and its
CloudBusCallBack) so that on a non-successful reply you propagate the error to
the provided Completion (e.g., call completion.fail(reply.getError()) or
otherwise record the failure and stop further recursion), and only call
completion.success() when all cancels succeed; reference cancelBackupLongJobs,
CancelLongJobMsg, CloudBusCallBack, and the completion parameter when making
this change.

Comment on lines +161 to +178
GetVolumeTaskReply gr = reply.castReply();
boolean hasRunningTasks = gr.getResults().values().stream()
.anyMatch(info -> !info.getRunningTask().isEmpty());

if (!hasRunningTasks) {
completion.success();
return;
}

if (retriesLeft <= 0) {
completion.fail(operr(
"timeout waiting for volume backup chain tasks to exit for vm[uuid:%s]", vmUuid));
return;
}

logger.debug(String.format(
"volumes of vm[uuid:%s] still have running tasks, retry in %ds (retries left: %d)",
vmUuid, WAIT_CHAIN_TASK_EXIT_INTERVAL_SECS, retriesLeft));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

这里等待的是“任意卷任务”,不是“backup chain task”。

Line 162-163 只判断 runningTask 是否非空,没有按任务类型或上下文筛选。只要 VM 上还有其他合法卷任务,这条新链路就会额外阻塞 30 秒,甚至把原本可迁移的请求直接判成超时失败。建议把判断范围收窄到这次取消动作关联的 backup chain task。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 161 - 178, The current check using GetVolumeTaskReply gr ->
getResults().values().stream().anyMatch(info ->
!info.getRunningTask().isEmpty()) treats any running volume task as blocking;
narrow the predicate to only consider the backup-chain tasks relevant to this
cancel operation (inspect the running task entries returned by
GetVolumeTaskReply for a task type/name or context field and match that to the
backup-chain identifier or context used by the cancel operation), e.g. replace
the anyMatch condition to detect only tasks whose type/name/context indicates
the backup chain for this VM (use symbols like GetVolumeTaskReply,
info.getRunningTask(), vmUuid and the surrounding retry logic using
WAIT_CHAIN_TASK_EXIT_INTERVAL_SECS/retriesLeft/completion.fail/success to ensure
only backup-chain tasks trigger the retry/fail path).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants