Trigger leader balance on long-term WAL write blocking #17731

Open
zerolbsony wants to merge 1 commit into apache:master from zerolbsony:fix-triger-region-leader-balance-after-wal-write-delay
@zerolbsony

Mark DataNode as ReadOnly(WALBlocked) when WAL write blocking persists, and let ConfigNode move Region leaders away from the blocked DataNode. Add UT and IT coverage for WAL block status and leader balance behavior.

Copilot AI left a comment

Pull request overview

This PR introduces a mechanism for DataNodes to automatically enter ReadOnly(WALBlocked) when WAL write blocking persists (due to WAL disk throttling or WAL buffer queue memory pressure), enabling ConfigNode to balance/migrate Region leaders away from the affected DataNode. It also adds unit/integration test coverage for the new WAL-blocked status behavior and its impact on leader balancing.
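The leader-movement idea in the overview can be illustrated with a minimal sketch. It is not the ConfigNode balancer from this PR: `LeaderPickSketch`, its `NodeStatus` enum, and `pickLeader` are hypothetical names, and the "first RUNNING replica" rule is an assumption chosen only to keep the example short.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of moving a Region leader off a ReadOnly node; not IoTDB code. */
final class LeaderPickSketch {
  enum NodeStatus { RUNNING, READ_ONLY, UNKNOWN }

  /**
   * Keeps the current leader if its node is RUNNING. Otherwise returns the first
   * replica whose node is RUNNING, or empty when no replica qualifies.
   */
  static Optional<Integer> pickLeader(
      int currentLeader, List<Integer> replicas, Map<Integer, NodeStatus> statuses) {
    if (statuses.get(currentLeader) == NodeStatus.RUNNING) {
      return Optional.of(currentLeader);
    }
    return replicas.stream()
        .filter(id -> statuses.get(id) == NodeStatus.RUNNING)
        .findFirst();
  }
}
```

With replicas on nodes 1, 2, and 3 and node 1 reported as READ_ONLY, the sketch returns node 2. A real balancer would presumably also weigh how many leaders each node already holds.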

Changes:

  • Add WALWriteBlockStatus to update CommonConfig NodeStatus/Reason based on long-term WAL write blocking.
  • Extend WALManager to detect long-term WAL throttling and WAL buffer queue blocking, and track blocked durations.
  • Add UT/IT coverage, including a new cluster IT validating leader movement away from ReadOnly(WALBlocked) nodes.
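One way to picture the "track blocked durations" change is a small tracker that remembers when blocking started and reports whether it has lasted past a limit. This is a sketch under assumptions, not the `WALManager` code in the PR: the class name, the method names, and passing the clock value as a parameter (assumed to be a non-negative millisecond timestamp) are all illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of long-term block detection; not the PR's WALManager. */
final class WalBlockTrackerSketch {
  private static final long NOT_BLOCKED = -1L;

  // Timestamp (ms) at which the current blocking episode began, or NOT_BLOCKED.
  private final AtomicLong blockedSinceMs = new AtomicLong(NOT_BLOCKED);
  private final long maxBlockedMs;

  WalBlockTrackerSketch(long maxBlockedMs) {
    this.maxBlockedMs = maxBlockedMs;
  }

  /** Records the start of blocking; repeated calls keep the earliest start time. */
  void markBlocked(long nowMs) {
    blockedSinceMs.compareAndSet(NOT_BLOCKED, nowMs);
  }

  /** Clears the blocked state once writers can proceed again. */
  void markUnblocked() {
    blockedSinceMs.set(NOT_BLOCKED);
  }

  /** True when blocking has lasted at least maxBlockedMs without being cleared. */
  boolean isLongTermBlocked(long nowMs) {
    long since = blockedSinceMs.get();
    return since != NOT_BLOCKED && nowMs - since >= maxBlockedMs;
  }
}
```

Taking the time as an argument instead of calling `System.currentTimeMillis()` inside the class keeps the sketch deterministic in unit tests.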

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALWriteBlockStatus.java | New helper to toggle DataNode ReadOnly status with WALBlocked reason and recover when unblocked. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALManager.java | Track WAL throttle/buffer-queue block start times and expose isLongTermWriteBlocked() for heartbeat-driven status updates. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/storageengine/dataregion/wal/utils/MemoryControlledWALEntryQueue.java | Mark/clear WAL buffer-queue blocked state when writers are forced to wait on WAL queue memory allocation. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/protocol/thrift/impl/DataNodeInternalRPCServiceImpl.java | Update WAL-blocked status during DataNode heartbeat response generation. |
| iotdb-core/datanode/src/test/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALWriteBlockStatusTest.java | New unit tests validating status transitions for WALBlocked and non-WAL read-only reasons. |
| iotdb-core/datanode/src/test/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALManagerTest.java | New unit tests covering long-term WAL throttle and WAL buffer-queue blocking detection. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/load/IoTDBRegionGroupLeaderBalanceWithWALBlockIT.java | New cluster IT validating leader balancing away from a WALBlocked (ReadOnly) DataNode. |
| integration-test/src/main/java/org/apache/iotdb/itbase/env/CommonConfig.java | Add IT env setters for check_period_when_insert_blocked and max_waiting_time_when_insert_blocked. |
| integration-test/src/main/java/org/apache/iotdb/it/env/remote/config/RemoteCommonConfig.java | Stub implementations for the new CommonConfig setters in remote IT env config. |
| integration-test/src/main/java/org/apache/iotdb/it/env/cluster/config/MppSharedCommonConfig.java | Wire the new “insert blocked” timing configs to both CN and DN in shared cluster IT env. |
| integration-test/src/main/java/org/apache/iotdb/it/env/cluster/config/MppCommonConfig.java | Persist the new “insert blocked” timing configs to IT property files for cluster env. |
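The status-transition behavior described for `WALWriteBlockStatus` and its tests might look roughly like the following. This is a guess at the semantics from the descriptions above, not the PR's code: the class, the `update` and `setReadOnly` methods, and the rule that a non-WAL read-only reason is never overwritten or cleared are assumptions.

```java
/** Hypothetical sketch of WALBlocked status transitions; not the PR's WALWriteBlockStatus. */
final class WalBlockStatusSketch {
  enum Status { RUNNING, READ_ONLY }

  static final String WAL_BLOCKED = "WALBlocked";

  private Status status = Status.RUNNING;
  private String reason = null;

  /** Applies the latest long-term-blocked observation, e.g. once per heartbeat. */
  synchronized void update(boolean longTermBlocked) {
    if (longTermBlocked) {
      // Only a RUNNING node is switched; an existing read-only reason is left alone.
      if (status == Status.RUNNING) {
        status = Status.READ_ONLY;
        reason = WAL_BLOCKED;
      }
    } else if (status == Status.READ_ONLY && WAL_BLOCKED.equals(reason)) {
      // Recover only if this sketch was the one that set the read-only state.
      status = Status.RUNNING;
      reason = null;
    }
  }

  /** Simulates another subsystem making the node read-only for a different reason. */
  synchronized void setReadOnly(String otherReason) {
    status = Status.READ_ONLY;
    reason = otherReason;
  }

  synchronized Status getStatus() {
    return status;
  }

  synchronized String getReason() {
    return reason;
  }
}
```

Guarding recovery on the stored reason means that a node made read-only for some other cause (the hypothetical `"DiskFull"` in the test, for instance) stays read-only even after WAL blocking ends.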

Comment on lines 285 to 287

    public long getThrottleThreshold() {
      return (long) (config.getThrottleThreshold() * 0.8);
    }

Comment on lines +106 to +108

    TRegionInfo targetLeader = findAnyDataRegionLeader(client);
    triggerLongTermWalBlockingOnDataNode(client, targetLeader.getDataNodeId());
