Trigger leader balance on long-term WAL write blocking #17731
Open
zerolbsony wants to merge 1 commit into
Mark DataNode as ReadOnly(WALBlocked) when WAL write blocking persists, and let ConfigNode move Region leaders away from the blocked DataNode. Add UT and IT coverage for WAL block status and leader balance behavior.
Pull request overview
This PR introduces a mechanism for DataNodes to automatically enter ReadOnly(WALBlocked) when WAL write blocking persists (due to WAL disk throttling or WAL buffer queue memory pressure), enabling ConfigNode to balance/migrate Region leaders away from the affected DataNode. It also adds unit/integration test coverage for the new WAL-blocked status behavior and its impact on leader balancing.
Changes:
- Add `WALWriteBlockStatus` to update `CommonConfig` NodeStatus/Reason based on long-term WAL write blocking.
- Extend `WALManager` to detect long-term WAL throttling and WAL buffer queue blocking, and track blocked durations.
- Add UT/IT coverage, including a new cluster IT validating leader movement away from `ReadOnly(WALBlocked)` nodes.
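
To illustrate the detection step described above, here is a minimal sketch of how a block start time can be tracked and compared against a maximum waiting time. It is not the PR's `WALManager` code: the class name `WalBlockTracker`, the injected clock, and the choice of `max_waiting_time_when_insert_blocked` as the threshold are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch: tracks how long WAL writes have been continuously blocked. */
final class WalBlockTracker {
  private static final long NOT_BLOCKED = -1L;

  private final LongSupplier clockMs;   // injected clock (assumption, eases testing)
  private final long maxWaitingTimeMs;  // assumed to mirror max_waiting_time_when_insert_blocked
  private long blockStartMs = NOT_BLOCKED;

  WalBlockTracker(LongSupplier clockMs, long maxWaitingTimeMs) {
    this.clockMs = clockMs;
    this.maxWaitingTimeMs = maxWaitingTimeMs;
  }

  /** Record the start of a blocked period; repeated calls keep the earliest start. */
  synchronized void markBlocked() {
    if (blockStartMs == NOT_BLOCKED) {
      blockStartMs = clockMs.getAsLong();
    }
  }

  /** Clear the blocked state once writers can proceed again. */
  synchronized void clearBlocked() {
    blockStartMs = NOT_BLOCKED;
  }

  /** Milliseconds spent in the current blocked period, or 0 when not blocked. */
  synchronized long blockedDurationMs() {
    return blockStartMs == NOT_BLOCKED ? 0L : clockMs.getAsLong() - blockStartMs;
  }

  /** True only when the current blocked period has lasted at least the maximum waiting time. */
  synchronized boolean isLongTermWriteBlocked() {
    return blockStartMs != NOT_BLOCKED && blockedDurationMs() >= maxWaitingTimeMs;
  }
}
```

Keeping only the earliest start time means a writer that re-enters the wait loop does not reset the measured duration, so short retries cannot hide a long stall.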
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALWriteBlockStatus.java | New helper to toggle DataNode ReadOnly status with WALBlocked reason and recover when unblocked. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALManager.java | Track WAL throttle/buffer-queue block start times and expose isLongTermWriteBlocked() for heartbeat-driven status updates. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/storageengine/dataregion/wal/utils/MemoryControlledWALEntryQueue.java | Mark/clear WAL buffer-queue blocked state when writers are forced to wait on WAL queue memory allocation. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/protocol/thrift/impl/DataNodeInternalRPCServiceImpl.java | Update WAL-blocked status during DataNode heartbeat response generation. |
| iotdb-core/datanode/src/test/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALWriteBlockStatusTest.java | New unit tests validating status transitions for WALBlocked and non-WAL read-only reasons. |
| iotdb-core/datanode/src/test/java/org/apache/iotdb/db/storageengine/dataregion/wal/WALManagerTest.java | New unit tests covering long-term WAL throttle and WAL buffer-queue blocking detection. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/load/IoTDBRegionGroupLeaderBalanceWithWALBlockIT.java | New cluster IT validating leader balancing away from a WALBlocked (ReadOnly) DataNode. |
| integration-test/src/main/java/org/apache/iotdb/itbase/env/CommonConfig.java | Add IT env setters for check_period_when_insert_blocked and max_waiting_time_when_insert_blocked. |
| integration-test/src/main/java/org/apache/iotdb/it/env/remote/config/RemoteCommonConfig.java | Stub implementations for the new CommonConfig setters in remote IT env config. |
| integration-test/src/main/java/org/apache/iotdb/it/env/cluster/config/MppSharedCommonConfig.java | Wire the new “insert blocked” timing configs to both CN and DN in shared cluster IT env. |
| integration-test/src/main/java/org/apache/iotdb/it/env/cluster/config/MppCommonConfig.java | Persist the new “insert blocked” timing configs to IT property files for cluster env. |
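
The status transitions that `WALWriteBlockStatus` and its unit tests are described as covering can be sketched roughly as follows. This is a hypothetical model, not the PR's implementation: the class, enum, and method names are invented, and leaving a node untouched when it is already read-only for a non-WAL reason is an assumption inferred from the test description.

```java
/** Hypothetical sketch of the ReadOnly(WALBlocked) toggle; all names are illustrative. */
final class WalBlockStatusSketch {
  enum NodeStatus { RUNNING, READ_ONLY }

  static final String WAL_BLOCKED = "WALBlocked";

  private NodeStatus status = NodeStatus.RUNNING;
  private String reason = null;

  /** Force read-only with an arbitrary reason, e.g. a disk-full condition. */
  synchronized void setReadOnly(String newReason) {
    status = NodeStatus.READ_ONLY;
    reason = newReason;
  }

  /** Called periodically (for example on each heartbeat) with the latest WAL state. */
  synchronized void update(boolean longTermWriteBlocked) {
    if (longTermWriteBlocked) {
      // Only a running node is switched; an existing read-only reason is kept.
      if (status == NodeStatus.RUNNING) {
        setReadOnly(WAL_BLOCKED);
      }
    } else if (status == NodeStatus.READ_ONLY && WAL_BLOCKED.equals(reason)) {
      // Recover only from the read-only state that this mechanism set itself.
      status = NodeStatus.RUNNING;
      reason = null;
    }
  }

  synchronized NodeStatus status() {
    return status;
  }

  synchronized String reason() {
    return reason;
  }
}
```

Guarding recovery on the reason avoids clearing a read-only state that some other component set for an unrelated problem.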
Comment on lines 285 to 287:

    public long getThrottleThreshold() {
      return (long) (config.getThrottleThreshold() * 0.8);
    }
Comment on lines +106 to +108:

    TRegionInfo targetLeader = findAnyDataRegionLeader(client);
    triggerLongTermWalBlockingOnDataNode(client, targetLeader.getDataNodeId());
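
The behavior this IT is described as validating, that a Region leader moves off a `ReadOnly(WALBlocked)` DataNode, can be pictured with the simplified sketch below. It is not ConfigNode's actual leader balancing algorithm: `LeaderPickSketch`, `pickLeader`, and the first-running-replica rule are assumptions chosen only to show the expected outcome.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical sketch: choose a Region leader while avoiding read-only DataNodes. */
final class LeaderPickSketch {
  enum NodeStatus { RUNNING, READ_ONLY }

  /**
   * Keeps the current leader if its node is running; otherwise returns the first
   * replica on a running node, or empty when no replica is on a running node.
   */
  static OptionalInt pickLeader(
      int currentLeader, List<Integer> replicaNodeIds, Map<Integer, NodeStatus> statusByNode) {
    if (statusByNode.get(currentLeader) == NodeStatus.RUNNING) {
      return OptionalInt.of(currentLeader);
    }
    for (int nodeId : replicaNodeIds) {
      if (statusByNode.get(nodeId) == NodeStatus.RUNNING) {
        return OptionalInt.of(nodeId);
      }
    }
    return OptionalInt.empty();
  }
}
```

In this model, a test like the one above would block the leader's DataNode, wait for its status to become read-only, and then check that the leader reported for the Region is a different DataNode.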