fix(compute): cancel backup longjobs before migration when backupTaskLongJobUuids present#3530
fix(compute): cancel backup longjobs before migration when backupTaskLongJobUuids present#3530zstack-robot-1 wants to merge 2 commits into5.5.12from
Conversation
…ackupTaskLongJobUuids ZSTAC-82195: UI submits migrate LongJob with backupTaskLongJobUuids in jobData, expecting MN to cancel backups first. But MigrateVmInnerMsg lacks this field, so GSON drops it — backup never cancelled, volume chain task blocks migration. Parse backupTaskLongJobUuids from raw jobData, cancel each via CancelLongJobMsg, poll volume chain tasks until clear, then migrate. Co-Authored-By: ye.zou <ye.zou@zstack.io>
Walkthrough在 VM 迁移长任务中加入可选的备份长任务预取消流程:提取 Changes
Sequence Diagram(s)sequenceDiagram
participant User as 用户
participant MigrateJob as VM迁移长任务
participant BackupSvc as 备份服务
participant VolumeSvc as 卷服务
participant ThreadFacade as 线程调度器
User->>MigrateJob: 启动迁移请求
MigrateJob->>MigrateJob: 提取 backupTaskLongJobUuids
alt 存在备份长任务
MigrateJob->>BackupSvc: 发送 CancelLongJobMsg(递归多个 UUID)
BackupSvc-->>MigrateJob: 返回取消结果
loop 轮询直到卷备份链任务退出或超时
MigrateJob->>VolumeSvc: 发送 GetVolumeTaskMsg 查询 runningTask
VolumeSvc-->>MigrateJob: 返回任务状态
alt 还有运行任务
MigrateJob->>ThreadFacade: 调度延迟重试(间隔常数)
ThreadFacade-->>MigrateJob: 触发下一次查询
else 无运行任务
MigrateJob->>MigrateJob: 继续执行 doMigrate(...)
end
end
else 无备份长任务
MigrateJob->>MigrateJob: 直接执行 doMigrate(...)
end
MigrateJob-->>User: 迁移完成或失败
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (1 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java (2)
151-153:GetVolumeTaskReply结果建议增加空值保护。这里直接访问
gr.getResults().values()和info.getRunningTask(),在异常返回或边界数据下可能触发 NPE,导致长任务中断。建议加上null防御判断。🔧 建议修改
- boolean hasRunningTasks = gr.getResults().values().stream() - .anyMatch(info -> !info.getRunningTask().isEmpty()); + boolean hasRunningTasks = gr.getResults() != null && gr.getResults().values().stream() + .anyMatch(info -> info != null && info.getRunningTask() != null && !info.getRunningTask().isEmpty());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around lines 151 - 153, GetVolumeTaskReply handling can NPE if gr or its maps/lists are null; update the code around reply.castReply() and the hasRunningTasks computation to null-check gr, gr.getResults(), each Map.Entry value, and info.getRunningTask() before streaming/inspecting; specifically guard the result of reply.castReply() (GetVolumeTaskReply gr), ensure gr.getResults() != null, skip null values, and treat null/empty runningTask as empty when computing the boolean hasRunningTasks so the long job won't NPE on abnormal/empty replies.
98-103:cancelBackupLongJobsThenMigrate()的fail分支当前不可达。
cancelBackupLongJobs()内部无论单个取消成功或失败都会继续递归,最终只会走completion.success(),不会触发completion.fail()。建议删除该fail分支,或在达到失败阈值时显式回调fail,避免误导后续维护。Also applies to: 107-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around lines 98 - 103, The fail branch inside cancelBackupLongJobsThenMigrate() is unreachable because cancelBackupLongJobs() always ultimately calls completion.success(); either remove the unreachable fail handler in MigrateVmLongJob (the anonymous Callback.fail implementation that logs and calls doMigrate(msg, completion)) to avoid confusion, or change cancelBackupLongJobs() to track failure count and invoke the provided Callback.fail when a configured failure threshold is exceeded so the fail path becomes reachable; update references in MigrateVmLongJob.doMigrate / cancelBackupLongJobsThenMigrate to match the chosen approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java`:
- Around line 78-89: The async chain in start() (specifically in
cancelBackupLongJobsThenMigrate -> cancelBackupLongJobs callback ->
waitForVolumeChainTasksExit callback -> doMigrate) lacks a check for the LongJob
being canceled, so a user cancel can still allow doMigrate() to run; update each
asynchronous continuation (the anonymous Completion.success() handlers in
cancelBackupLongJobsThenMigrate, the similar callbacks at the other locations
(around lines 166-171 and 197-212)) to first query the long job status (via the
job UUID accessible in this class) and if the status is CANCELED/TERMINATED,
short-circuit by calling completion.success()/fail() as appropriate and do not
call doMigrate(); also ensure cancel() (which sends CancelMigrateVmMsg) sets the
job status so that these checks observe cancellation before proceeding.
- Around line 69-75: getBackupTaskLongJobUuids currently unsafely casts
raw.get("backupTaskLongJobUuids") to (List<String>) and returns it without
per-element validation; change the logic to read it as List<?> (or Object),
iterate each element in getBackupTaskLongJobUuids, convert non-null elements to
String, trim them, filter out null/blank entries and non-string types, collect
into a List<String> and return that (or an empty list/null if none valid) so
callers no longer receive invalid or blank UUIDs.
---
Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java`:
- Around line 151-153: GetVolumeTaskReply handling can NPE if gr or its
maps/lists are null; update the code around reply.castReply() and the
hasRunningTasks computation to null-check gr, gr.getResults(), each Map.Entry
value, and info.getRunningTask() before streaming/inspecting; specifically guard
the result of reply.castReply() (GetVolumeTaskReply gr), ensure gr.getResults()
!= null, skip null values, and treat null/empty runningTask as empty when
computing the boolean hasRunningTasks so the long job won't NPE on
abnormal/empty replies.
- Around line 98-103: The fail branch inside cancelBackupLongJobsThenMigrate()
is unreachable because cancelBackupLongJobs() always ultimately calls
completion.success(); either remove the unreachable fail handler in
MigrateVmLongJob (the anonymous Callback.fail implementation that logs and calls
doMigrate(msg, completion)) to avoid confusion, or change cancelBackupLongJobs()
to track failure count and invoke the provided Callback.fail when a configured
failure threshold is exceeded so the fail path becomes reachable; update
references in MigrateVmLongJob.doMigrate / cancelBackupLongJobsThenMigrate to
match the chosen approach.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 2b68d15f-941f-4d88-9f59-154086783749
📒 Files selected for processing (1)
compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java
compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java
Outdated
Show resolved
Hide resolved
| private void cancelBackupLongJobsThenMigrate(List<String> backupTaskLongJobUuids, | ||
| MigrateVmInnerMsg msg, | ||
| ReturnValueCompletion<APIEvent> completion) { | ||
| cancelBackupLongJobs(backupTaskLongJobUuids.iterator(), new Completion(completion) { | ||
| @Override | ||
| public void success() { | ||
| waitForVolumeChainTasksExit(msg.getVmInstanceUuid(), WAIT_CHAIN_TASK_EXIT_MAX_RETRIES, | ||
| new Completion(completion) { | ||
| @Override | ||
| public void success() { | ||
| doMigrate(msg, completion); | ||
| } |
There was a problem hiding this comment.
新增预处理阶段缺少“已取消”门禁,存在取消后仍迁移的风险。
现在 start() 会先执行“取消备份 + 轮询等待”再 doMigrate(),但这条异步链路里没有任何 longjob 取消态检查;cancel() 仅发送 CancelMigrateVmMsg。如果用户在预处理阶段点击取消,后续仍可能进入 doMigrate()。建议在每次异步回调继续前检查 longjob 状态,已取消则立即终止。
Also applies to: 166-171, 197-212
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 78 - 89, The async chain in start() (specifically in
cancelBackupLongJobsThenMigrate -> cancelBackupLongJobs callback ->
waitForVolumeChainTasksExit callback -> doMigrate) lacks a check for the LongJob
being canceled, so a user cancel can still allow doMigrate() to run; update each
asynchronous continuation (the anonymous Completion.success() handlers in
cancelBackupLongJobsThenMigrate, the similar callbacks at the other locations
(around lines 166-171 and 197-212)) to first query the long job status (via the
job UUID accessible in this class) and if the status is CANCELED/TERMINATED,
short-circuit by calling completion.success()/fail() as appropriate and do not
call doMigrate(); also ensure cancel() (which sends CancelMigrateVmMsg) sets the
job status so that these checks observe cancellation before proceeding.
There was a problem hiding this comment.
Comment from ye.zou:
这是 LongJob 框架层面的通用缺陷——start() 和 cancel() 跑在不同线程/节点,框架没有提供取消信号传递机制。在单个 LongJob bugfix MR 里加 volatile 标志不是正确的修法,且多节点场景下内存标志也不够。
已开 ZSTAC-83470 跟踪框架改进:由框架提供 isCancelled(jobUuid) 查询(走 DB 保证多节点一致),所有 LongJob 实现在异步回调 checkpoint 统一调用。
There was a problem hiding this comment.
@zstack-robot-1,感谢 ye.zou 的说明。理解了——start() 与 cancel() 运行于不同线程/节点,框架层面缺乏统一的取消信号传递机制,在单个 bugfix PR 里添加内存标志(volatile)并不是正确的解法,多节点场景下也不具备一致性保证。在 ZSTAC-83470 中由框架统一提供基于 DB 的 isCancelled(jobUuid) 查询、并在所有 LongJob 实现的异步回调 checkpoint 处统一调用,是更合理的系统性修复方向。此 PR 范围内无需处理,感谢跟踪。
✏️ Learnings added
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 3530
File: compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java:78-89
Timestamp: 2026-03-19T07:19:58.791Z
Learning: In the ZStack codebase (MigrateVmLongJob.java and LongJob framework in general), start() and cancel() run on different threads/nodes. The framework provides no built-in cancellation signal propagation mechanism between them. Adding an in-memory volatile flag in a single LongJob implementation is insufficient for multi-node scenarios. The correct fix is a framework-level isCancelled(jobUuid) query backed by the DB (for multi-node consistency), to be called at async callback checkpoints in all LongJob implementations. This is tracked under ZSTAC-83470.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
… deserialization Use List<?> iteration with null/blank filtering instead of unchecked (List<String>) cast in getBackupTaskLongJobUuids. Co-Authored-By: ye.zou <ye.zou@zstack.io>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java`:
- Around line 161-178: The current check using GetVolumeTaskReply gr ->
getResults().values().stream().anyMatch(info ->
!info.getRunningTask().isEmpty()) treats any running volume task as blocking;
narrow the predicate to only consider the backup-chain tasks relevant to this
cancel operation (inspect the running task entries returned by
GetVolumeTaskReply for a task type/name or context field and match that to the
backup-chain identifier or context used by the cancel operation), e.g. replace
the anyMatch condition to detect only tasks whose type/name/context indicates
the backup chain for this VM (use symbols like GetVolumeTaskReply,
info.getRunningTask(), vmUuid and the surrounding retry logic using
WAIT_CHAIN_TASK_EXIT_INTERVAL_SECS/retriesLeft/completion.fail/success to ensure
only backup-chain tasks trigger the retry/fail path).
- Around line 117-135: The cancelBackupLongJobs recursion currently swallows
individual CancelLongJobMsg failures (in the CloudBusCallBack for
cancelBackupLongJobs) by only logging and continuing, which prevents the caller
from seeing cancellation errors; modify cancelBackupLongJobs (and its
CloudBusCallBack) so that on a non-successful reply you propagate the error to
the provided Completion (e.g., call completion.fail(reply.getError()) or
otherwise record the failure and stop further recursion), and only call
completion.success() when all cancels succeed; reference cancelBackupLongJobs,
CancelLongJobMsg, CloudBusCallBack, and the completion parameter when making
this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 99605c44-2962-42ab-bbe6-999c828a7461
📒 Files selected for processing (1)
compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java
| private void cancelBackupLongJobs(Iterator<String> it, Completion completion) { | ||
| if (!it.hasNext()) { | ||
| completion.success(); | ||
| return; | ||
| } | ||
|
|
||
| String longJobUuid = it.next(); | ||
| CancelLongJobMsg cmsg = new CancelLongJobMsg(); | ||
| cmsg.setUuid(longJobUuid); | ||
| bus.makeLocalServiceId(cmsg, LongJobConstants.SERVICE_ID); | ||
| bus.send(cmsg, new CloudBusCallBack(completion) { | ||
| @Override | ||
| public void run(MessageReply reply) { | ||
| if (!reply.isSuccess()) { | ||
| logger.warn(String.format("failed to cancel backup longjob[uuid:%s]: %s", | ||
| longJobUuid, reply.getError())); | ||
| } | ||
| cancelBackupLongJobs(it, completion); | ||
| } |
There was a problem hiding this comment.
单个备份 LongJob 的取消失败现在被吞掉了。
Line 130-135 只记日志后继续递归,所以只要迭代器走完,外层就一定会进入后续等待流程。这样一来,单个取消失败不会走到 Line 109-113 的“继续迁移”兜底,反而更可能在等待阶段超时并把迁移直接打失败。建议把任一取消失败汇总后显式回传给调用方,让外层按既有兜底路径处理。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 117 - 135, The cancelBackupLongJobs recursion currently swallows
individual CancelLongJobMsg failures (in the CloudBusCallBack for
cancelBackupLongJobs) by only logging and continuing, which prevents the caller
from seeing cancellation errors; modify cancelBackupLongJobs (and its
CloudBusCallBack) so that on a non-successful reply you propagate the error to
the provided Completion (e.g., call completion.fail(reply.getError()) or
otherwise record the failure and stop further recursion), and only call
completion.success() when all cancels succeed; reference cancelBackupLongJobs,
CancelLongJobMsg, CloudBusCallBack, and the completion parameter when making
this change.
| GetVolumeTaskReply gr = reply.castReply(); | ||
| boolean hasRunningTasks = gr.getResults().values().stream() | ||
| .anyMatch(info -> !info.getRunningTask().isEmpty()); | ||
|
|
||
| if (!hasRunningTasks) { | ||
| completion.success(); | ||
| return; | ||
| } | ||
|
|
||
| if (retriesLeft <= 0) { | ||
| completion.fail(operr( | ||
| "timeout waiting for volume backup chain tasks to exit for vm[uuid:%s]", vmUuid)); | ||
| return; | ||
| } | ||
|
|
||
| logger.debug(String.format( | ||
| "volumes of vm[uuid:%s] still have running tasks, retry in %ds (retries left: %d)", | ||
| vmUuid, WAIT_CHAIN_TASK_EXIT_INTERVAL_SECS, retriesLeft)); |
There was a problem hiding this comment.
这里等待的是“任意卷任务”,不是“backup chain task”。
Line 162-163 只判断 runningTask 是否非空,没有按任务类型或上下文筛选。只要 VM 上还有其他合法卷任务,这条新链路就会额外阻塞 30 秒,甚至把原本可迁移的请求直接判成超时失败。建议把判断范围收窄到这次取消动作关联的 backup chain task。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@compute/src/main/java/org/zstack/compute/vm/MigrateVmLongJob.java` around
lines 161 - 178, The current check using GetVolumeTaskReply gr ->
getResults().values().stream().anyMatch(info ->
!info.getRunningTask().isEmpty()) treats any running volume task as blocking;
narrow the predicate to only consider the backup-chain tasks relevant to this
cancel operation (inspect the running task entries returned by
GetVolumeTaskReply for a task type/name or context field and match that to the
backup-chain identifier or context used by the cancel operation), e.g. replace
the anyMatch condition to detect only tasks whose type/name/context indicates
the backup chain for this VM (use symbols like GetVolumeTaskReply,
info.getRunningTask(), vmUuid and the surrounding retry logic using
WAIT_CHAIN_TASK_EXIT_INTERVAL_SECS/retriesLeft/completion.fail/success to ensure
only backup-chain tasks trigger the retry/fail path).
ZSTAC-82195: 备份任务取消后热迁移仍失败
Root Cause
UI 提交迁移 LongJob 时在
jobDataJSON 里带了backupTaskLongJobUuids,期望 MN 先取消备份再迁移。但MigrateVmLongJob.start()把 jobData 反序列化为MigrateVmInnerMsg,该类没有backupTaskLongJobUuids字段,GSON 静默丢弃 → MN 直接执行迁移 → volume chain task 还在 → 失败。Solution
在
MigrateVmLongJob.start()中:backupTaskLongJobUuidsCancelLongJobMsg取消备份 longjobGetVolumeTaskMsg等待 volume chain task 退出(最多 30s)MigrateVmInnerMsg执行迁移backupTaskLongJobUuids时保持原有逻辑不变Test
BackupCancelMigrateCase.groovy(premium test-premium)Files Changed
compute/.../MigrateVmLongJob.javapremium/test-premium/.../BackupCancelMigrateCase.groovysync from gitlab !9389