feat(net): reduce sync memory via lazy parsing and throttling #6717
xxo1shine wants to merge 1 commit into tronprotocol:develop from
Conversation
  private final int maxPendingBlockNum = Args.getInstance().getMaxPendingBlockNum();

  private static volatile long maxRequestedBlockNum = 0;
[MUST] This should be an instance field, not static.
Every other piece of sync state on this class (blockWaitToProcess, blockJustReceived, maxPendingBlockNum) is instance-scoped — only this one is static, with no apparent reason. Any scenario that recreates the Spring context (integration tests, hot reload, future multi-instance use) will leak state across instances.
Direct evidence: SyncServiceTest has to reset it via reflection (SyncServiceTest.java:185-187) to avoid pollution between test cases.
Suggested change:
private volatile long maxRequestedBlockNum = 0;
The reflection reset in the test can then be removed as well. If only the single-threaded fetchExecutor writes to it after this change, volatile can also be dropped.
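As a small illustration of the leak (the class and field names below are hypothetical, not from the codebase), a static high-water mark survives the creation of a fresh instance, while an instance field starts clean:

```java
// Hypothetical demo class: contrasts static vs. instance-scoped sync state.
class SyncStateDemo {
  static long staticHighWater = 0;  // shared by every instance, like the static field above
  long instanceHighWater = 0;       // scoped to one instance, like the other sync state

  void record(long num) {
    staticHighWater = Math.max(staticHighWater, num);
    instanceHighWater = Math.max(instanceHighWater, num);
  }
}
```

A second instance (as a recreated Spring context would produce) still observes the first instance's static value, which is exactly why the test has to reset it via reflection.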
I accept the suggested changes; removing the static field is also fine.
    processSyncBlock(msg.getBlockCapsule(), peerConnection);
    peerConnection.getSyncBlockInProcess().remove(msg.getBlockId());
    BlockCapsule block;
    try {
[MUST] On failure this branch only logs and returns, which leaves three problems:
- `unparsedBlock` stays in `blockWaitToProcess`: the next tick re-parses, re-fails, and the entry permanently consumes one `maxPendingBlockNum` slot.
- `peer.getSyncBlockInProcess()` is not cleared for this `blockId`.
- `invalid(blockId, peerConnection)` is not called, so the peer is not penalised and the block is not rescheduled to another peer.
The trigger probability is very low (bytes were already parsed once in the BlockMessage ctor), but the cleanup should mirror the isDisconnect() branch directly above:
} catch (Exception e) {
  logger.warn("Deserialize block {} failed", blockId.getString(), e);
  blockWaitToProcess.remove(unparsedBlock);
  peerConnection.getSyncBlockInProcess().remove(blockId);
  invalid(blockId, peerConnection);
  return;
}
The deserialization was already performed once before, so it cannot fail here. The catch is just there to handle the exception and doesn't require any additional logic.
    });
  }

  private synchronized void handleSyncBlock() {
[SHOULD] Missing wake-up safety net when the throttle saturates.
When the budget is full, startFetchSyncBlock returns having issued zero requests, and fetchFlag has already been consumed (set to false). Re-arming it depends on a later event from processBlock / invalid. If no new block arrives and no peer disconnects in that window, the next fetch has to wait for the cron tick and an external event.
Since handleSyncBlock is exactly the place that frees budget, the cheapest fix is to set fetchFlag = true; at the end of the if (isFound[0]) { ... } block (after processSyncBlock). Negligible cost, removes the corner case.
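A minimal sketch of the suggested re-arm, using a simplified flag-and-budget model (`ThrottledFetcher`, `startFetch`, and `handleBlock` are illustrative names, not the PR's actual code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the wake-up safety net: the processing side is the point
// where budget is freed, so it re-arms the fetch flag after applying a
// block, guaranteeing the fetcher re-runs even with no new network event.
class ThrottledFetcher {
  final AtomicBoolean fetchFlag = new AtomicBoolean(false);
  int pending;        // blocks currently in flight
  final int budget;   // stand-in for maxPendingBlockNum

  ThrottledFetcher(int budget) { this.budget = budget; }

  // Fetch side: consumes the flag; may issue zero requests when saturated.
  int startFetch(int wanted) {
    if (!fetchFlag.compareAndSet(true, false)) return 0;
    int issued = Math.min(wanted, budget - pending);
    pending += issued;
    return issued;
  }

  // Processing side: frees one budget slot and re-arms the flag.
  void handleBlock() {
    if (pending > 0) {
      pending--;
      fetchFlag.set(true);
    }
  }
}
```

With the re-arm in place, a fetch that was starved by a saturated budget fires again as soon as one block is processed, without waiting for a cron tick.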
processBlock already sets fetchFlag=true whenever a new block arrives, so the wake-up frequency during an active sync is already sufficient. Adding another fetchFlag=true here would cause startFetchSyncBlock to repeatedly walk activePeers for nothing once the sync is near catch-up and peers have nothing left in syncBlockToFetch, i.e. wasted fetch cycles. Leaving it as-is for now.
    }
    method.invoke(service);
    // highBlockId must NOT be requested: remainNum <= 0 and num > maxRequestedBlockNum
    Assert.assertNull(peer.getSyncBlockRequested().get(highBlockId));
[SHOULD] Missing test for the maxRequestedBlockNum exemption path.
Current coverage only asserts the hard-cap case (budget exhausted + num > maxRequestedBlockNum -> block not requested). The PR's deadlock-avoidance design specifically allows budget exhausted + num <= maxRequestedBlockNum to go through as a retry, and that path is untested - a future refactor could break it silently.
Suggestion: add a symmetric case where maxRequestedBlockNum is set to e.g. 100, blockWaitToProcess is filled to maxPendingBlockNum, the target blockId.num = 50, and assert peer.getSyncBlockRequested().get(blockId) != null.
Okay, the test has been added.
  private final int maxPendingBlockNum = Args.getInstance().getMaxPendingBlockNum();

  private static volatile long maxRequestedBlockNum = 0;
[SHOULD] These two variable names are poorly chosen and hurt readability. Based on the code, maxPendingBlockNum refers to the length of a block list, while maxRequestedBlockNum refers to an absolute block number; the meanings are quite different, yet the names are almost identical. It is recommended to rename maxPendingBlockNum to maxPendingBlockSize.
The variable names have been modified as suggested.
    HashMap<PeerConnection, List<BlockId>> send = new HashMap<>();
    tronNetDelegate.getActivePeer().stream()
    int[] cnt = {0};
[SHOULD] The name cnt conveys no meaning. Give the variable a name that clearly states its purpose, such as fetchingBlockSize.
The variable names have been modified as suggested.
    peerConnection.getSyncBlockInProcess().remove(msg.getBlockId());
    BlockCapsule block;
    try {
      block = new BlockCapsule(unparsedBlock.getData());
[MUST] The purpose of replacing `Block` with `UnparsedBlock` is to save memory. But have you evaluated how much memory is actually saved? Encoding the data in binary (proto) form doesn't reduce memory usage significantly, so the benefit is minimal, while introducing several downsides:
- It increases the cognitive burden for users.
- If the data cannot be constructed into a `Block`, the peer should be disconnected as early as possible. However, nothing is handled here, which is inconsistent with the previous logic.
This will result in a significant improvement. Since the deserialization was already performed once before, it cannot fail here; the catch simply handles the exception and doesn't require any additional logic.
Force-pushed 76412e7 to e59b6af ("… and throttling in-flight requests"), then force-pushed e59b6af to 3c49cb2.
What does this PR do?
Cap memory usage during block sync by deferring block deserialization and bounding in-flight block requests.
- New `UnparsedBlock` carrier — lightweight `(BlockId, byte[])` pair (`framework/.../net/service/sync/UnparsedBlock.java`). Holds the parsed block id (cheap) plus the raw protobuf bytes; the block body is not deserialized until the worker thread is ready to process it. Implements `equals`/`hashCode` on `blockId` so it works as a map key.
- Defer deserialization in sync queues — in `SyncService`: `blockWaitToProcess` and `blockJustReceived` change from `Map<BlockMessage, PeerConnection>` to `Map<UnparsedBlock, PeerConnection>`; `processBlock(peer, blockMessage)` builds the `UnparsedBlock` from `blockMessage.getData()` instead of inserting the fully-parsed message; `handleSyncBlock` parses the bytes back into a `BlockMessage` only when the block is about to be applied.
- Throttle in-flight block requests via `node.maxPendingBlockNum` — new config controlling the total budget of `requested + justReceived + waitToProcess` blocks across all peers: default `500`, clamped to `[50, 2000]` in `NodeConfig.postProcess()`; plumbed through `NodeConfig.maxPendingBlockNum` → `Args.applyNodeConfig` → `CommonParameter.maxPendingBlockNum`; defaulted in `reference.conf`; example documented in `config.conf`. `startFetchSyncBlock` becomes budget-aware — it computes `remainNum = maxPendingBlockNum - requested - justReceived - waitToProcess`; once the budget is exhausted it stops requesting new heights, with one exception: blocks at or below `maxRequestedBlockNum` (the previously-requested ceiling) are still allowed through, so a slow peer can be retried without stalling the whole sync.
- Tests — `SyncServiceTest` updated to construct `UnparsedBlock` carriers in place of raw `BlockMessage` puts, plus boundary-condition coverage for the throttle path.
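A self-contained sketch of the carrier described above, with a hypothetical `SimpleBlockId` stand-in for tron's `BlockId` so the example compiles on its own (the real class lives in `framework/.../net/service/sync/UnparsedBlock.java`):

```java
import java.util.Objects;

// Hypothetical stand-in for org.tron.core.capsule.BlockCapsule.BlockId,
// used only so this sketch compiles without the tron codebase.
final class SimpleBlockId {
  final long num;
  final String hash;
  SimpleBlockId(long num, String hash) { this.num = num; this.hash = hash; }
  @Override public boolean equals(Object o) {
    if (!(o instanceof SimpleBlockId)) return false;
    SimpleBlockId other = (SimpleBlockId) o;
    return num == other.num && hash.equals(other.hash);
  }
  @Override public int hashCode() { return Objects.hash(num, hash); }
}

// Lightweight (BlockId, byte[]) carrier: keeps the raw protobuf bytes and
// defers full deserialization; equals/hashCode delegate to blockId so the
// object can serve as a map key regardless of the byte payload.
final class UnparsedBlock {
  private final SimpleBlockId blockId;
  private final byte[] data;
  UnparsedBlock(SimpleBlockId blockId, byte[] data) {
    this.blockId = blockId;
    this.data = data;
  }
  SimpleBlockId getBlockId() { return blockId; }
  byte[] getData() { return data; }
  @Override public boolean equals(Object o) {
    return o instanceof UnparsedBlock
        && blockId.equals(((UnparsedBlock) o).getBlockId());
  }
  @Override public int hashCode() { return blockId.hashCode(); }
}
```

Because identity is keyed on `blockId` alone, two carriers for the same height and hash collide in `blockWaitToProcess` even if their byte payloads differ.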
Why are these changes required?
- Sync-time heap was unbounded. Every received block was parsed eagerly into a `BlockMessage` and held in two maps until processing finished. Each `BlockMessage` keeps the parsed `Block` proto with every transaction expanded into Java objects, which inflates the on-heap footprint several times over the wire size. During catch-up sync from many peers, this routinely caused multi-GB heap growth and GC pressure, with documented OOM incidents on smaller-RAM nodes.
- `UnparsedBlock` collapses the worst case. Raw bytes are 1–2 MB per block; the parsed in-memory representation is significantly larger. Keeping the bytes raw until processing defers (and amortizes) the parse cost to the worker that actually needs it, and lets the in-flight buffer grow proportional to bytes, not to the parsed object graph.
- Concurrent peer fetching had no global cap. A peer aggressively shipping inventory could push the `blockJustReceived` map far past safe levels, because the receiver only tracked the per-peer `MAX_BLOCK_FETCH_PER_PEER`. Bounding `requested + justReceived + waitToProcess` makes the worst-case in-flight memory predictable: `maxPendingBlockNum × avg block size`.
- The `maxRequestedBlockNum` exemption avoids deadlock. A pure hard cap could stall sync if every peer holding a needed block goes idle while the budget is full — the only way to make progress is to retry the height, which requires re-requesting a block already in the budget. Allowing retries within the existing ceiling keeps sync moving forward under transient peer flakiness.
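The cap-plus-exemption rule condenses to a single predicate; a hedged sketch under that reading (`FetchBudget` and its method names are illustrative, not the PR's actual code):

```java
// Sketch of the budget check described above: a block at height `num` may
// be requested when the in-flight budget still has room, OR when the
// height is at or below the previously-requested ceiling
// (num <= maxRequestedBlockNum), so retries cannot deadlock the sync.
final class FetchBudget {
  private final int maxPendingBlockNum;   // total in-flight budget
  private long maxRequestedBlockNum = 0;  // highest height ever requested

  FetchBudget(int maxPendingBlockNum) {
    this.maxPendingBlockNum = maxPendingBlockNum;
  }

  boolean canRequest(long num, int requested, int justReceived, int waitToProcess) {
    long remainNum = (long) maxPendingBlockNum - requested - justReceived - waitToProcess;
    return remainNum > 0 || num <= maxRequestedBlockNum;
  }

  void onRequested(long num) {
    maxRequestedBlockNum = Math.max(maxRequestedBlockNum, num);
  }
}
```

With a full budget, a brand-new height is rejected while a retry of an already-covered height still goes through, which is the deadlock-avoidance behavior described above.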
Operators need the knob. The default of `500` is conservative for nodes with ≤ 8 GB heap; large/dedicated nodes benefit from raising it for higher sync throughput, and constrained environments can lower it. Surfacing it as `node.maxPendingBlockNum` (rather than a constant) lets operators tune it without a release.

This PR has been tested by:
Follow up
Extra details
Fixes #6685