feat(failover): Implement Graceful Failover Feature by yuzegao · Pull Request #3295 · apache/kvrocks

yuzegao · 2025-12-14T04:20:34Z

Summary

This PR implements graceful failover for Kvrocks clusters, allowing a master node to hand control to a slave node safely while preserving data consistency and minimizing service disruption.

Background

Based on GitHub Discussion #3218, this feature enables controlled master-to-slave failover with:

Data consistency guarantee (waits for replication sync)
Write blocking during critical phases
Configurable timeout
State machine-based async execution

Implementation

Architecture

Independent module: ClusterFailover class, parallel to SlotMigrator
Async execution: Dedicated background thread for failover process
State machine: 8 states (none → started → check_slave → pause_write → wait_sync → switching → success/failed)
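
The state progression listed above can be sketched as a small enum plus a transition check. This is only an illustration, not the PR's code: `FailoverState`, its enumerator names, `StateName`, and `CanTransition` are hypothetical, and the sketch assumes that any in-progress state may drop to `failed` and that a finished run may be started again.

```cpp
#include <cassert>
#include <string>

// Hypothetical names; the PR description only lists the state labels.
enum class FailoverState {
  kNone, kStarted, kCheckSlave, kPauseWrite, kWaitSync, kSwitching, kSuccess, kFailed
};

// Label as it might be reported in the cluster_failover_state field.
inline std::string StateName(FailoverState s) {
  switch (s) {
    case FailoverState::kNone:       return "none";
    case FailoverState::kStarted:    return "started";
    case FailoverState::kCheckSlave: return "check_slave";
    case FailoverState::kPauseWrite: return "pause_write";
    case FailoverState::kWaitSync:   return "wait_sync";
    case FailoverState::kSwitching:  return "switching";
    case FailoverState::kSuccess:    return "success";
    case FailoverState::kFailed:     return "failed";
  }
  return "unknown";
}

// True when `to` is a legal successor of `from`.
// Assumption: every in-progress state may also move to kFailed, and a
// new run may start from none, success, or failed.
inline bool CanTransition(FailoverState from, FailoverState to) {
  switch (from) {
    case FailoverState::kNone:
    case FailoverState::kSuccess:
    case FailoverState::kFailed:
      return to == FailoverState::kStarted;
    case FailoverState::kStarted:
      return to == FailoverState::kCheckSlave || to == FailoverState::kFailed;
    case FailoverState::kCheckSlave:
      return to == FailoverState::kPauseWrite || to == FailoverState::kFailed;
    case FailoverState::kPauseWrite:
      return to == FailoverState::kWaitSync || to == FailoverState::kFailed;
    case FailoverState::kWaitSync:
      return to == FailoverState::kSwitching || to == FailoverState::kFailed;
    case FailoverState::kSwitching:
      return to == FailoverState::kSuccess || to == FailoverState::kFailed;
  }
  return false;
}
```

Keeping the legal transitions in one function makes it harder for the background thread to skip a phase such as `wait_sync` by accident.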

Key Features

Slave Validation: Checks connection status, replication speed, and lag before proceeding
Write Blocking: Blocks write requests during pause_write, wait_sync, and switching states (returns TRYAGAIN)
Replication Sync: Waits for slave to catch up to target sequence number
Takeover: Sends CLUSTERX TAKEOVER command to slave with authentication support
Slot Redirection: Marks all slots as migrated, redirects clients via MOVED errors
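
The write-blocking rule described above (writes rejected with TRYAGAIN during `pause_write`, `wait_sync`, and `switching`, reads unaffected) might look roughly like the following. The enum, the function names, and the exact error text are assumptions made for this sketch; only the three blocked states and the TRYAGAIN error kind come from the description.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum mirroring the state labels in the description.
enum class FailoverState {
  kNone, kStarted, kCheckSlave, kPauseWrite, kWaitSync, kSwitching, kSuccess, kFailed
};

// Writes are blocked only in the three critical phases.
inline bool IsWriteBlocked(FailoverState s) {
  return s == FailoverState::kPauseWrite || s == FailoverState::kWaitSync ||
         s == FailoverState::kSwitching;
}

// Returns an error reply for a command that must be rejected, or an empty
// string when the command may proceed. The message text is illustrative.
inline std::string CheckCommand(FailoverState s, bool is_write_cmd) {
  if (is_write_cmd && IsWriteBlocked(s)) {
    return "-TRYAGAIN failover in progress\r\n";
  }
  return "";
}
```

Reads pass through in every state, which matches the "write blocked, read not blocked" test cases listed under Testing.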

Commands

CLUSTERX FAILOVER <slave-node-id> [timeout] - Initiate failover (default timeout: 1000ms)
CLUSTER INFO - Now includes cluster_failover_state:<state>
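
As a rough illustration of the `CLUSTERX FAILOVER <slave-node-id> [timeout]` syntax, argument handling could be sketched as below. `ParseFailoverArgs` and `kDefaultFailoverTimeoutMs` are hypothetical names, and the validation rules (non-negative integer, no trailing characters) are assumptions; only the 1000 ms default is taken from the description.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>
#include <vector>

constexpr int kDefaultFailoverTimeoutMs = 1000;  // default stated in the PR description

// args: {"CLUSTERX", "FAILOVER", "<slave-node-id>", optional "<timeout>"}
// Returns false on a malformed command; otherwise fills node_id and timeout_ms.
inline bool ParseFailoverArgs(const std::vector<std::string>& args,
                              std::string* node_id, int* timeout_ms) {
  if (args.size() < 3 || args.size() > 4) return false;
  *node_id = args[2];
  *timeout_ms = kDefaultFailoverTimeoutMs;
  if (args.size() == 4) {
    const std::string& raw = args[3];
    if (raw.empty()) return false;
    errno = 0;
    char* end = nullptr;
    long value = std::strtol(raw.c_str(), &end, 10);
    // Assumed policy: reject overflow, trailing junk, and negative values.
    if (errno != 0 || *end != '\0' || value < 0 || value > INT_MAX) return false;
    *timeout_ms = static_cast<int>(value);
  }
  return true;
}
```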

Files Changed

New Files:

src/cluster/cluster_failover.h / cluster_failover.cc - Core implementation (325 lines)
tests/gocase/integration/failover/failover_test.go - Test suite (926 lines)
GRACEFUL_FAILOVER_DESIGN.md - Design document

Modified Files:

src/server/server.{h,cc} - Added ClusterFailover member and GetSlaveReplicationOffset()
src/cluster/cluster.{h,cc} - Write blocking check, SetMySlotsMigrated(), OnTakeOver(), state reset
src/commands/cmd_cluster.cc - FAILOVER and TAKEOVER command handlers

Testing

Comprehensive test suite with 20 sub-test cases (100% pass rate):

Normal flow (basic, custom timeout, authentication)
Failure scenarios (non-existent node, non-slave, invalid timeout, lag timeout, auth failure)
Concurrency (cannot start when in progress, restart after failure)
Write blocking (write blocked, read not blocked)
State query and transitions
Integration (data consistency, state reset after SETNODES)

Compatibility

✅ Backward compatible: New feature, no breaking changes. Only active when cluster-enabled=yes. Existing clusters unaffected.

Copilot

Pull request overview

This PR implements a graceful failover mechanism for Kvrocks cluster, enabling controlled master-to-slave role transitions while maintaining data consistency. The implementation follows a state machine pattern with dedicated background thread execution, similar to the existing SlotMigrator module architecture.

Key Changes:

Introduces a new ClusterFailover module with an 8-state state machine for controlled failover execution
Implements write blocking during critical failover phases (pause_write, wait_sync, switching) to ensure data consistency
Adds CLUSTERX FAILOVER and CLUSTERX TAKEOVER commands for failover initiation and slave promotion

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `tests/gocase/go.mod` | Updated Go version directive (contains critical issue - see comments) |
| `tests/gocase/integration/failover/failover_test.go` | Comprehensive test suite with 20 test cases covering normal flow, failure scenarios, concurrency, write blocking, authentication, and data consistency |
| `tests/gocase/integration/failover/TEST_CASES.md` | Chinese-language test documentation describing test coverage goals |
| `src/cluster/cluster_failover.h` | Header defining `ClusterFailover` class with state machine, failover control methods, and thread management |
| `src/cluster/cluster_failover.cc` | Implementation of failover logic including slave validation, lag checking, sync waiting, and takeover command sending |
| `src/cluster/cluster.h` | Added methods for failover support: `OnTakeOver()`, `GetNodeIPPort()`, `SetMySlotsMigrated()`, `IsSlotImported()` |
| `src/cluster/cluster.cc` | Implements cluster-level failover operations, write blocking checks, failover state reset, and slot redirection |
| `src/commands/cmd_cluster.cc` | Command handlers for `FAILOVER` and `TAKEOVER` subcommands with timeout parsing |
| `src/server/server.h` | Declares `ClusterFailover` member and `GetSlaveReplicationOffset()` method |
| `src/server/server.cc` | Initializes `ClusterFailover` and implements slave offset retrieval for sync verification |
| `src/server/redis_connection.cc` | Allows writes to imported slots during failover (before topology update) |
| `src/storage/scripting.cc` | Allows Lua script writes to imported slots during failover |

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Copilot · 2025-12-16T12:19:56Z

src/cluster/cluster_failover.cc

+  if (s_resp.GetValue().substr(0, 3) != "+OK") {
+    return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()};
+  }


Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.

auto s_resp = util::SockReadLine(fd);
close(fd);

if (!s_resp.IsOK()) {
  return {Status::NotOK, "Failed to read TAKEOVER response: " + s_resp.Msg()};
}

if (s_resp.GetValue().substr(0, 3) != "+OK") {
  return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()};
}

The fd is always closed before any of the if conditions are evaluated.
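
On the exception concern raised in this thread: `std::string::substr(pos, count)` throws `std::out_of_range` only when `pos > size()`, and it clamps `count` to the remaining length. With `pos == 0`, a response shorter than three characters therefore yields a shorter string instead of an exception. A minimal check of that standard-library behavior follows; the helper name `IsOkReply` is hypothetical and is not part of the PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the "+OK" prefix check in the snippet above.
inline bool IsOkReply(const std::string& line) {
  // pos == 0 can never exceed size(), so substr cannot throw out_of_range here;
  // for short input it simply returns fewer than three characters.
  return line.substr(0, 3) == "+OK";
}
```

An empty or truncated reply simply fails the comparison and takes the error branch.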

Copilot · 2025-12-16T12:19:57Z

src/commands/cmd_cluster.cc

+    }
+
+    return {Status::RedisParseErr,
+            "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"};


The help message in the error does not include "TAKEOVER" which is a valid subcommand added in this PR. The message should be updated to include TAKEOVER in the list of valid subcommands.

Suggested change

-            "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"};
+            "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER|TAKEOVER"};

TAKEOVER is an internal command, so it does not need to be listed here.

Copilot · 2025-12-16T12:19:58Z

src/cluster/cluster_failover.cc

+    auto s_line = util::SockReadLine(fd);
+    if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") {
+      close(fd);
+      return {Status::NotOK, "AUTH failed"};
+    }


Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.

There is no resource leak risk, because every path calls close(fd).

if (!pass.empty()) {
  std::string auth_cmd = redis::ArrayOfBulkStrings({"AUTH", pass});
  auto s_auth = util::SockSend(fd, auth_cmd);
  if (!s_auth.IsOK()) {
    close(fd);
    return {Status::NotOK, "Failed to send AUTH: " + s_auth.Msg()};
  }
  auto s_line = util::SockReadLine(fd);
  if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") {
    close(fd);
    return {Status::NotOK, "AUTH failed"};
  }
}

std::string cmd = redis::ArrayOfBulkStrings({"CLUSTERX", "TAKEOVER"});
auto s_send = util::SockSend(fd, cmd);
if (!s_send.IsOK()) {
  close(fd);
  return {Status::NotOK, "Failed to send TAKEOVER: " + s_send.Msg()};
}

auto s_resp = util::SockReadLine(fd);
close(fd);
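
For comparison, the RAII approach Copilot suggested could be sketched as follows. `ScopedFd` and `IsOpen` are hypothetical names written for this illustration, not types from the PR or from Kvrocks. The point is that the descriptor is closed on every exit path, including early returns and exceptions, without a `close(fd)` call in each branch.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical RAII wrapper: closes the descriptor when it goes out of scope.
class ScopedFd {
 public:
  explicit ScopedFd(int fd) : fd_(fd) {}
  ~ScopedFd() {
    if (fd_ >= 0) ::close(fd_);
  }
  // Non-copyable so the descriptor is never closed twice.
  ScopedFd(const ScopedFd&) = delete;
  ScopedFd& operator=(const ScopedFd&) = delete;
  // Movable: the source gives up ownership.
  ScopedFd(ScopedFd&& other) noexcept : fd_(other.fd_) { other.fd_ = -1; }

  int Get() const { return fd_; }

 private:
  int fd_;
};

// Returns true if `fd` still refers to an open descriptor.
inline bool IsOpen(int fd) { return ::fcntl(fd, F_GETFD) != -1; }
```

With such a guard, the function would construct `ScopedFd guard(fd);` right after connecting and drop the explicit `close(fd)` calls.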

git-hulk

@yuzegao, I'm not sure if you expected to use the migrated/imported slot to do the redirection while doing the failover. If yes, it's too tricky to do that. You can use the failover state to allow/disallow writing in the new/master node instead of mixing the migration behavior with the failover.

Another question is: What if the migration is ongoing?

src/cluster/cluster_failover.cc

yuzegao · 2025-12-22T07:08:21Z

> @yuzegao, I'm not sure if you expected to use the migrated/imported slot to do the redirection while doing the failover. If yes, it's too tricky to do that. You can use the failover state to allow/disallow writing in the new/master node instead of mixing the migration behavior with the failover.
>
> Another question is: What if the migration is ongoing?

Hi @git-hulk, thank you very much for your feedback. There is indeed a serious risk here.
Failover and migration should not be allowed to run at the same time. Semantically speaking:
Migration: smoothly transfers slots from one master to another (horizontal scaling up/down).
Failover: urgently transfers the master's responsibilities to the slave.
These two operations have completely different purposes, semantics, and time windows. Running them simultaneously will lead to unpredictable cluster states and data corruption.

Could you evaluate whether the following modifications are reasonable?

Add mutual exclusion checks to the migration and failover startup paths so that the two operations cannot run simultaneously.
Instead of using the migrated/imported variables to control redirection, define an independent state for failover control, which makes the semantics clearer.
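
The mutual exclusion proposed above could be sketched with a single atomic "current operation" slot that both startup paths must claim. This is one possible shape under stated assumptions, not the change that was made in the PR: `ClusterOp` and `ClusterOpGate` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical tag for the long-running cluster operation currently in flight.
enum class ClusterOp { kNone, kMigration, kFailover };

class ClusterOpGate {
 public:
  // Claims the gate for `op`; fails if any operation already holds it.
  bool TryBegin(ClusterOp op) {
    if (op == ClusterOp::kNone) return false;
    ClusterOp expected = ClusterOp::kNone;
    return current_.compare_exchange_strong(expected, op);
  }

  // Releases the gate only if `op` is the current holder.
  bool End(ClusterOp op) {
    if (op == ClusterOp::kNone) return false;
    ClusterOp expected = op;
    return current_.compare_exchange_strong(expected, ClusterOp::kNone);
  }

  ClusterOp Current() const { return current_.load(); }

 private:
  std::atomic<ClusterOp> current_{ClusterOp::kNone};
};
```

Both the migrate and the failover command handlers would call `TryBegin` before starting their background work and return an error if it fails. A compare-and-swap keeps the check and the claim atomic, so two concurrent requests cannot both pass.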

yuzegao added 2 commits December 14, 2025 11:45

cluster: implement graceful failover feature

d16f56e

Merge branch 'unstable' into feature/gracefully_failover

38d3089

yuzegao changed the title ~~Implement Graceful Failover Feature~~ feat(failover): Implement Graceful Failover Feature Dec 14, 2025

PragmaTwice reviewed Dec 15, 2025

tests/gocase/integration/failover/TEST_CASES.md (outdated)

PragmaTwice requested review from caipengbo, Copilot and git-hulk December 15, 2025 02:49

Copilot started reviewing on behalf of PragmaTwice December 15, 2025 02:50

PragmaTwice reviewed Dec 15, 2025

src/cluster/cluster_failover.h

PragmaTwice reviewed Dec 15, 2025

src/cluster/cluster_failover.cc

Copilot AI reviewed Dec 15, 2025

add License header and fix Copilot report issues.

0dbbc19

yuzegao requested review from PragmaTwice and Copilot December 16, 2025 11:50

Copilot started reviewing on behalf of yuzegao December 16, 2025 12:12

Copilot AI reviewed Dec 16, 2025

git-hulk reviewed Dec 16, 2025

src/cluster/cluster_failover.cc

src/cluster/cluster_failover.cc

Paragrf mentioned this pull request Feb 28, 2026

Add support for the CLIENT PAUSE/UNPAUSE commands to provide a lightweight temporary solution for failover(takeover) #3377

Open

2 tasks

4 participants
