Performance improvements and fixes #1502
Draft
MegaRedHand wants to merge 92 commits into
Add a new Rust NIF (hash_beacon_state_cached_rs) that hashes BeaconState fields individually and accepts pre-computed field hashes for unchanged fields. On non-epoch blocks, 5 expensive fields (validators, inactivity scores, sync committees, historical roots) are cached, skipping ~70% of hashing work.
Results (32-block benchmark):
- Merkleization: ~7.2s -> ~2.4s per non-epoch block (-67%)
- Non-epoch avg: ~12.0s -> ~7.0s (-42%)
- Total: ~398s -> ~242s (-39%)
Move Ssz.to_ssz(state) out of the block processing critical path. The SSZ-encoded binary is only needed for DB storage, which happens asynchronously. By deferring encoding to the async task, we save ~2s per block.
Results (32-block benchmark):
- Non-epoch avg: 7.0s -> 5.8s (-17.6%)
- Total: 242s -> 202s (-16.4%)
- Epoch block: 24.2s -> 22.6s (-6.6%)
Block attestations are already BLS-verified during state transition (in process_attestation_batch). The fork-choice on_attestation handler was verifying them again, doubling the BLS cost per block. Skip check_valid_indexed_attestation when is_from_block=true. Attestation processing: ~916ms -> ~213ms per block (-77%). Total: 196s -> 181s (-7.4%).
Replace zip_with + filter + reduce (3 passes, creates intermediate filtered vector) with zip_with + foldl (2 passes, no intermediate allocation). Produces 0 for non-participating validators instead of filtering them out. This is called by compute_pulled_up_tip on every block (~656ms per block untimed), plus once during epoch processing.
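A minimal sketch of the two shapes described above, using plain lists and `Enum` instead of `Aja.Vector`. The module and function names are hypothetical, and the inputs stand in for validator balances and participation flags.

```elixir
defmodule BalanceSumSketch do
  # Three-pass shape: zip, filter, reduce. The filter step allocates an
  # intermediate list holding only the participating entries.
  def total_filtered(balances, participating) do
    balances
    |> Enum.zip_with(participating, fn balance, flag -> {balance, flag} end)
    |> Enum.filter(fn {_balance, flag} -> flag end)
    |> Enum.reduce(0, fn {balance, _flag}, acc -> acc + balance end)
  end

  # Two-pass shape: non-participating validators contribute 0 instead of
  # being filtered out, so the zipped values can be folded directly.
  def total_folded(balances, participating) do
    balances
    |> Enum.zip_with(participating, fn balance, flag -> if flag, do: balance, else: 0 end)
    |> List.foldl(0, &Kernel.+/2)
  end
end

balances = [32, 31, 30]
participating = [true, false, true]
filtered_total = BalanceSumSketch.total_filtered(balances, participating)
folded_total = BalanceSumSketch.total_folded(balances, participating)
```

Both functions return the same sum; only the number of passes and the intermediate allocation differ.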
The incremental merkleization cache was incorrectly caching the validators field hash (field 11) on ALL non-epoch blocks, but block operations like slashings, exits, BLS-to-execution changes, consolidation requests, and withdrawal requests can modify the validators vector. This would produce incorrect state roots. Now checks the block body for validator-modifying operations and only caches field 11 when none are present. Zero performance impact since these operations are extremely rare on mainnet.
Cache additional stable fields on non-epoch blocks:
- 14 (slashings), 17-20 (justification/checkpoints), 27 (historical_summaries), 37 (proposer_lookahead)
Also corrects the safety analysis:
- Field 15 (previous_epoch_participation) is NOT cacheable: attestation processing updates it on every block for previous-epoch attestations
- Field 21 (inactivity_scores) must be excluded when validator-modifying operations are present, since add_validator_to_registry appends to it
Extract helper functions to reduce cyclomatic complexity and nesting depth in state_transition.ex, epoch_processing.ex, bench/blocks.ex, and bench/download.ex. Use Enum.map_join instead of Enum.map |> Enum.join.
…lock The sync committee has 512 members and only rotates every 256 epochs. Previously, get_sync_committee_indices scanned all 2.2M validators on every block to resolve pubkeys. Now cached in ETS keyed by epoch+root. Also replaced Stream with Aja.Vector.foldl to avoid materialization. Results: Total 419.1s -> 398.6s (-4.9%), non-epoch avg -5.3%.
Replace 32 O(V) linear validator scans with 1 O(V) scan + 16-entry map. For each epoch's pending deposits (up to 16), the old code did two full validator set scans per deposit (find + find_index). The new approach scans validators once to resolve only the needed pubkeys into a small map, then uses O(1) lookups. Benchmark: epoch.pending_deposits 4.2s -> 1.9s (-55%) Epoch boundary block: 27.0s -> 23.7s (-12.2%)
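The single-scan approach can be sketched as follows. This is an illustration under assumptions, not the PR's code: validators are modeled as plain maps with a `pubkey` key, and `DepositLookupSketch` is a hypothetical name.

```elixir
defmodule DepositLookupSketch do
  # Walk the validator list once and record the index of every validator
  # whose pubkey is referenced by a pending deposit. Later lookups are
  # plain map reads instead of full scans.
  def index_by_pubkey(validators, needed_pubkeys) do
    needed = MapSet.new(needed_pubkeys)

    {lookup, _count} =
      Enum.reduce(validators, {%{}, 0}, fn validator, {acc, index} ->
        acc =
          if MapSet.member?(needed, validator.pubkey),
            do: Map.put_new(acc, validator.pubkey, index),
            else: acc

        {acc, index + 1}
      end)

    lookup
  end
end

validators = [%{pubkey: "a"}, %{pubkey: "b"}, %{pubkey: "c"}]
lookup = DepositLookupSketch.index_by_pubkey(validators, ["c", "a", "zz"])
```

`Map.put_new/3` keeps the first index seen for a pubkey, which mirrors what a `find_index` scan would have returned. Pubkeys with no matching validator are simply absent from the map.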
When the exact start_key doesn't exist in LevelDB (e.g., a skipped slot at the finalization boundary), iterator_move returns the next available key instead. Previously this was treated as an error, causing pruning to fail entirely and leaking the iterator handle. Now fold_keys accepts the inexact position and iterates from there — accumulate/4 already validates the prefix for each key, so we stay within the correct table's key space. The iterator is always closed.
… invalid
During catch-up sync, data columns may not be downloaded yet when ForkChoice.on_block runs the data availability check. Previously this permanently invalidated the block and cascade-invalidated all descendants (~216 blocks in this incident).
Now:
1. "data not available" moves block back to :download_columns for retry
2. add_block allows re-processing of previously :invalid blocks so Optimistic Sync can recover them after restart
…s transient
During catch-up sync, data columns may not be downloaded yet when ForkChoice.on_block runs the data availability check. Previously this permanently invalidated the block and cascade-invalidated all descendants (~216 blocks in this incident).
Now:
1. "data not available" moves block back to :download_columns for retry
2. add_block allows re-processing of previously :invalid blocks
3. On startup, recover_invalid_blocks resets blocks with signed_block data from :invalid to :download_columns so they can be re-evaluated
Without scheduling :retry_download_columns after recovery, recovered blocks sit in :download_columns status with no timer to check if their columns are already present in the DB.
retry_download_columns only re-requested missing columns but never moved blocks to :pending when all columns were already present. This left recovered blocks stuck in :download_columns status permanently.
Process at most 5 blocks per invocation and schedule a quick follow-up for the remainder. This yields the GenServer between batches, allowing GC to reclaim BeaconState objects (~300MB each) and preventing message queue buildup that caused OOM kills at 57GB+ RSS.
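A rough sketch of the batching pattern, assuming the pending blocks are held in a list. `BatchSketch`, the `:process_blocks` message, and the 10 ms follow-up delay are illustrative choices, not values taken from the PR.

```elixir
defmodule BatchSketch do
  @max_per_invocation 5
  @follow_up_ms 10

  # Handle at most @max_per_invocation items, then ask the calling process
  # to come back for the rest. Returning between batches lets the process
  # serve other queued messages and gives the GC a chance to run.
  def process_some(pending, handler) do
    {batch, rest} = Enum.split(pending, @max_per_invocation)
    Enum.each(batch, handler)

    if rest != [] do
      Process.send_after(self(), :process_blocks, @follow_up_ms)
    end

    rest
  end
end

remaining = BatchSketch.process_some(Enum.to_list(1..12), fn _block -> :ok end)
```

In a GenServer, a `handle_info(:process_blocks, state)` clause would call the same function again with the remaining items.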
Status requests (status/1, status/2) called StoreDb.fetch_store() which deserializes the entire Store from LevelDB for each peer request. During sync with many peers, this caused the GenServer to stall deserializing the store repeatedly while the message queue grew to 57K+. Pass the store from the GenServer state directly to the request handler, eliminating redundant LevelDB reads and SSZ deserialization.
Tree.update_root! crashes the GenServer when the finalized root is not in the in-memory tree (common after restart/recovery). Replace with Tree.update_root and rebuild the tree from the finalized root on :not_found, matching the non-raising pattern used elsewhere.
… restart After restart/recovery, rebuild_tree adds blocks to the tree but doesn't populate unrealized_justifications. When get_voting_source looks up a block root from a prior epoch, it gets nil and crashes with BadMapError on voting_source.epoch. Similarly, filter_leaf_block crashes accessing unrealized_justifications[block_root].epoch. Fix by falling back to the block's state justified checkpoint when the unrealized justification is missing, and guarding against nil in the pull-up check.
When KZG verification fails for a block whose custody columns are all present, the columns were likely corrupted during download. Previously this marked the block as permanently invalid, cascading to ALL children and effectively killing the chain. Now we delete the stored columns and move the block back to :download_columns so fresh copies are fetched. This prevents the cascade invalidation that made the node unable to process any new blocks on Hoodi testnet. Also adds DataColumnDb.delete_columns_for_block/2 for targeted column purging.
When BlockRootBySlot.get returns :not_found during state pruning, the error branch logged the error but didn't return the accumulator. This caused Logger.error's :ok return value to become the new acc, leading to ArithmeticError: :erlang.+(:ok, 1) on the next iteration.
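The bug class is easy to reproduce in isolation. In the sketch below, `lookup` is a hypothetical stand-in for `BlockRootBySlot.get`; the important detail is that the error branch ends with `acc` rather than with the `:ok` returned by `Logger.error/1`.

```elixir
defmodule PruneSketch do
  require Logger

  # Count the slots for which a block root was found. Every branch of the
  # reducer must return the accumulator; ending the error branch with the
  # Logger call would make :ok the accumulator for the next iteration.
  def count_pruned(slots, lookup) do
    Enum.reduce(slots, 0, fn slot, acc ->
      case lookup.(slot) do
        {:ok, _root} ->
          acc + 1

        :not_found ->
          Logger.error("block root not found for slot #{slot}")
          acc
      end
    end)
  end
end

lookup = fn slot -> if rem(slot, 2) == 0, do: {:ok, <<slot>>}, else: :not_found end
pruned = PruneSketch.count_pruned([1, 2, 3, 4], lookup)
```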
When update_tree detects the finalized root isn't in the tree, it creates Tree.new(finalized_root) with only the finalized root. Then Tree.add_block fails for every subsequent block because the parent isn't in the minimal tree. This leaves the tree permanently stuck at 1 node, preventing LMD-GHOST head selection from advancing. Add repair_tree_chain/3 which walks the parent chain from the new block's parent back to the finalized root using Blocks.get_block_info, collecting intermediate block roots. These are then added to the tree in order, filling the gap. This triggers whenever add_block fails (parent not in tree), self-repairing the tree on every finalization advance. Observed: head stuck at slot 2596640 (later 2596672) while processing blocks up to 2596711+, with "Block not found in tree during get_children" warning on every block. After fix, tree repaired with 61 blocks and head immediately advanced to each new block.
The BlockStates LRU cache was configured with max_entries=128, allowing up to 128 BeaconStates (~460MB each) to accumulate in ETS. During catch-up sync, this grew to 120 entries (55GB), exceeding the machine's 62GB RAM and causing severe swap thrashing. recompute_head went from <500ms to 47+ seconds. Reduced max_entries to 16 (~7.4GB) and batch_prune_size to 4 (to avoid over-pruning with the smaller cache). States are backed by LevelDB via StateDb, so evicted entries can be re-fetched when needed.
get_ancestor/3 used Blocks.get_block! which raises RuntimeError when a block has been pruned from LevelDB. This crashed the Libp2pPort GenServer when get_weight (LMD-GHOST) tried to compute ancestors for validator votes referencing old pruned blocks. Changed to Blocks.get_block (returns nil) and return the root as-is when the block is not found. This means stale votes are safely discounted (ancestor won't match any candidate), and finalized_check correctly filters out blocks whose chain can't be verified.
After GenServer restart, the store's time may not have been advanced by on_tick yet, causing valid blocks to be rejected as "from the future". Previously, these blocks were permanently marked as :invalid, triggering cascade invalidation of all descendant blocks. Now treated as a transient timing error: block stays as :pending and a retry is scheduled after 12 seconds (one slot), allowing on_tick to advance the store time before the block is re-evaluated.
When the Libp2pPort GenServer falls behind processing blocks, incoming gossip messages (attestations, column sidecars, sync committees) pile up in the mailbox faster than they can be processed. This creates a feedback loop: the queue grows → process memory balloons → node falls further behind → more messages → OOM. Observed: queue reached 34,654 messages (19.2 GB process memory) growing at ~885 msgs/sec before manual intervention was required. The fix checks message_queue_len on each non-essential message. When the queue exceeds 2000 messages, gossip, incoming requests, peer notifications, and tracer messages are dropped. Responses and results (replies to our own block/column download requests) are always processed to maintain catch-up capability. The shed count is logged periodically and reset when the queue drains below the threshold.
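The shedding decision can be sketched as a small predicate. The threshold and the always-process list follow the description above; the module name and the message-kind atoms other than `:response` and `:result` are assumptions made for illustration.

```elixir
defmodule ShedSketch do
  @threshold 2000
  @always_process [:response, :result]

  # Drop a message only when the mailbox is over the threshold and the
  # message is not a reply to one of our own outbound requests.
  def shed?(kind, queue_len) do
    queue_len > @threshold and kind not in @always_process
  end

  # Mailbox length of the calling process, as reported by the runtime.
  def current_queue_len do
    {:message_queue_len, len} = Process.info(self(), :message_queue_len)
    len
  end
end

drop_gossip = ShedSketch.shed?(:gossip, 34_654)
drop_response = ShedSketch.shed?(:response, 34_654)
```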
The initial load shedding implementation dropped all non-response/result port messages when the queue exceeded the threshold. This included new_peer notifications, which carry the discv5 node_id needed by the Peerbook for PeerDAS custody column routing. Without node_ids, DataColumnDownloader reported :no_peers for column downloads, leaving all blocks stuck in :download_columns status indefinitely. Blocks could not advance past the checkpoint sync anchor. Fix: add :new_peer to the always-process list alongside :response and :result. This ensures PeerDAS routing data is always up-to-date while still shedding high-volume gossip (attestations, column sidecars, etc.) during overload.
During catch-up, LevelDB writes were skipped as an optimization. States only existed in the 16-entry ETS LRU cache. When evicted from ETS, they were permanently lost — LevelDB fetch returned :not_found, causing "parent state not found" cascade failures. The LRU cache already falls back to LevelDB on cache miss (BlockStates.get_state_info → LRUCache.get → fetch_state → StateDb.get_state_by_block_root), but this only works if the state was written to LevelDB in the first place. The async Task.Supervisor write doesn't block block processing, so removing the catch-up skip has minimal performance impact.
The cache cleanup match spec used `{{x, _}}` which matches 1-tuples,
but ETS stores records as 2-tuples `{key, value}`. This meant the
cleanup spec never matched any records, causing beacon_committee and
active_validator_indices tables to grow without bound (~0.5 GB/hour).
After 26 hours of operation, these tables consumed 11.8 GB:
- beacon_committee: 834K entries / 7.5 GB (should be ~12K / 120 MB)
- active_validator_indices: 408 entries / 4.2 GB (should be ~8 / 85 MB)
Fix:
1. Change match spec from `{{x, _}}` to `{{x, _}, _}` to correctly
match ETS record format {key, value}
2. Add cleanup trigger to Cache.set/3 (used by maybe_prefetch_committees)
which previously bypassed cleanup entirely
Result: Cache tables dropped from 11.8 GB to 207 MB (98% reduction).
Total BEAM memory dropped from 26.6 GB to 14.6 GB (45% reduction).
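The difference between the two match specs can be seen on a throwaway table. This is a sketch under assumptions: keys are modeled as `{epoch, root}` tuples, and the table name and cutoff are made up for the demonstration.

```elixir
table = :ets.new(:committee_cache_demo, [:set, :public])

# Records are 2-tuples: {key, value}, where the key is itself {epoch, root}.
for epoch <- 1..10, do: :ets.insert(table, {{epoch, <<epoch>>}, :value})

cutoff = 6

# Match head is a 1-tuple wrapping the key pattern, so it never matches
# a stored {key, value} record.
broken_spec = [{{{:"$1", :_}}, [{:<, :"$1", cutoff}], [true]}]

# Match head is a 2-tuple: key pattern plus a wildcard for the value.
fixed_spec = [{{{:"$1", :_}, :_}, [{:<, :"$1", cutoff}], [true]}]

deleted_by_broken = :ets.select_delete(table, broken_spec)
deleted_by_fixed = :ets.select_delete(table, fixed_spec)
remaining = :ets.info(table, :size)
```

The broken spec deletes nothing and raises no error, which is why the growth went unnoticed until the tables reached several GB.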
On mainnet, StoreDb.persist_store/1 was called synchronously in on_block, on_attestation, on_attester_slashing, and on_tick. This performs :erlang.term_to_binary with compression followed by an eleveldb.put call, blocking the Libp2pPort GenServer.
Problems observed:
1. LevelDB compaction during catch-up caused multi-minute write stalls, growing the message queue to 54K+ messages
2. Deep-copying the Store struct (1.2M latest_messages) to a new process via spawn takes 1-9 seconds per block
3. The combination of spawn overhead + epoch processing caused OOM on a 62 GB system
Fix: Skip persist_store entirely during catch-up sync (when head_slot is >2 slots behind wall clock). Once caught up, persist only on epoch boundaries (every 32 slots) via spawn. The store can be recovered from checkpoint + replay if the node crashes during sync. init_store remains synchronous since it only runs once at startup.
On mainnet, spawning a process with the Store struct deep-copies 1.2M latest_messages entries, taking 15s and 3-5 GB extra memory, causing OOM on 62 GB systems even at mid-epoch. New approach: serialize (term_to_binary) in the calling process, then spawn only for the LevelDB write. The serialized binary is a BEAM refc binary shared between processes without copying. This eliminates the deep-copy overhead entirely. Persist fires at mid-epoch (slot mod 32 == 16) when caught up, and is skipped during catch-up sync. The serialization (~7-9s with compression) blocks the Libp2pPort for one slot per epoch (~6.4 min), which is acceptable — it only misses one slot out of 32.
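A minimal sketch of the serialize-then-spawn pattern. `PersistSketch` and `write_fun` are hypothetical; `write_fun` stands in for the LevelDB write so the example runs without a database.

```elixir
defmodule PersistSketch do
  # Serialize in the calling process, then hand only the resulting binary
  # to a new process. Large binaries are reference-counted on the BEAM, so
  # the closure shares the binary instead of deep-copying the store term.
  def persist_async(store, write_fun) do
    binary = :erlang.term_to_binary(store, [:compressed])
    Task.start(fn -> write_fun.(binary) end)
  end
end

parent = self()
store = %{latest_messages: Map.new(1..1000, fn i -> {i, {i, <<i::32>>}} end)}

{:ok, _pid} =
  PersistSketch.persist_async(store, fn binary -> send(parent, {:written, binary}) end)

restored =
  receive do
    {:written, binary} -> :erlang.binary_to_term(binary)
  after
    1000 -> :timeout
  end
```

The trade-off matches the one stated above: serialization cost is paid on the caller, in exchange for not copying the whole term into another process heap.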
Three issues caused the node to stall for minutes when data columns were unavailable:
1. Partial responses silently dropped: When a peer returned 2 of 4 requested columns, process_data_columns stored them but did NOT immediately re-request the remaining columns. The node waited 30-60s for the retry timer, falling behind the chain. Fix: Immediately re-request missing columns on partial response.
2. Missing peer fallback in request_columns_by_root: Unlike request_columns_by_range, the by-root path did not fall back to get_some_peer() when no PeerDAS-capable peer was found, causing immediate :no_peers failures. Fix: Add get_some_peer() fallback.
3. Retry delays too slow: 30s error retry and 60s heartbeat meant the node fell 2-5 slots behind per retry cycle. Fix: Reduce to 5s error retry and 12s heartbeat (one slot).
…verload
The peerbook previously had no hard limit: peers accumulated unboundedly, with only a soft 128 target and gentle 5% challenge-based pruning. This caused the Libp2pPort GenServer to process messages from hundreds of peers, leading to message queue buildup and load shedding.
Changes:
- Hard cap at 100 peers: new peers above this limit trigger immediate eviction of the lowest-scoring non-PeerDAS peer
- Soft target lowered to 80: pruning starts earlier
- Aggressive eviction when well above target: lowest-scoring non-PeerDAS peers are immediately removed (not just challenged)
- PeerDAS peers (with custody_group_count set) are protected from eviction since they're needed for data availability
- Prune percentage increased from 5% to 10%, max prune from 8 to 10
- New peer handling updates node_id for existing peers (useful when discovery provides node_id for a peer first seen via gossip)
The Go libp2p host accepted unlimited incoming connections, causing hundreds of peers to accumulate. Each peer generates gossip messages that flood the Elixir Libp2pPort GenServer, leading to 500K+ message queue buildup within 1-2 hours and eventual stall. Adds a ConnManager with LowWater=60 and HighWater=80 peers. When peer count exceeds HighWater, libp2p automatically prunes connections down to LowWater, keeping the message volume manageable. New peers get a 1-minute grace period before being eligible for pruning. This complements the Elixir-side peerbook limit (max 100) which only controlled peer selection, not actual Go-level connections.
Each BeaconState on mainnet is ~775 MB (1.2M validators). The original 16 entries consumed ~12.4 GB, causing OOM during epoch transitions. 6 entries (4.6 GB) caused frequent cache misses that triggered 30s+ LevelDB reads blocking the Libp2pPort GenServer. 10 entries (7.7 GB) balances memory usage with cache hit rate, reducing expensive LevelDB state fetches while leaving ~20 GB headroom on a 62 GB system.
Before this fix, prefetch_states in the ForkChoice GenServer would fetch checkpoint states from LevelDB when they weren't in the ETS LRU cache. On mainnet, each BeaconState is ~775MB, and LevelDB deserialization takes 28-35 seconds per state. With multiple checkpoint targets during epoch transitions, this caused blocks to take 7-92 seconds to process, making the node oscillate between head-tracking and falling 16+ slots behind. The fix adds cache-only variants of state lookup functions that check the in-memory store maps and ETS LRU cache but never fall through to LevelDB. The fetch_checkpoint_state function (used by prefetch_states) now uses these cached-only lookups. If a checkpoint state isn't in cache, that attestation target is gracefully skipped in fork choice weight calculation rather than blocking for 28-85 seconds. This does NOT affect block validity (state transitions still use full lookups). It only affects LMD-GHOST fork choice weight for attestations referencing uncached checkpoint states - equivalent to what other clients do when dropping late attestations. Result: blocks now process in 2-5s consistently (was 7-92s with spikes). Spec tests pass (15113 tests, 0 failures).
The parent state ETS touch was only done when NOT in catch-up mode. During catch-up, rapid sequential block processing fills the 10-entry LRU cache, and parent states get evicted since they aren't refreshed. The next block then falls through to LevelDB (775MB state read taking 30s-10min+ on mainnet due to compaction contention in a 23GB database). Now always touch the parent state regardless of catch-up status. The touch is a lightweight GenServer.cast (no blocking) and prevents the common pattern of: near-head → fall behind → catch-up mode → parent evicted → 10+ minute LevelDB stall.
Previously every block's state (~775MB on mainnet) was written to LevelDB asynchronously. This created continuous compaction storms (448MB SST tables) that blocked concurrent LevelDB reads for 6-12+ minutes when the ETS cache missed and needed a state from disk. Now only persist every 4th block and at epoch boundaries, reducing write volume by ~75%. The ETS LRU cache (10 entries) remains the primary fast-path; LevelDB is the fallback for rare cache misses. Epoch boundary states always persist since they're needed for checkpoint computation. Combined with the prefetch_states cache-only fix and parent state touch fix, this should significantly reduce the frequency of LevelDB compaction storms that block the ForkChoice/Libp2pPort GenServer.
LevelDB writes of 775MB mainnet BeaconStates cause compaction storms (448MB SST tables) that block concurrent reads for 5-10+ minutes. Even writing every 4th block generated enough compaction to stall the node after ~12 minutes at head. Now only persist to LevelDB at epoch boundaries (~every 6.4 min) and only when at head (not during catch-up). This reduces writes from 32 per epoch (every block) to 1 per epoch (97% reduction). The ETS LRU cache (10 entries) is the primary fast-path storage; LevelDB is only the crash recovery fallback to the nearest epoch boundary. During catch-up, zero LevelDB writes ensures the catch-up phase completes without any compaction-induced stalls.
…B stalls The Libp2pPort GenServer was stalling for 10+ minutes on eleveldb.get/3 reads of 775MB mainnet BeaconStates. Pattern: every ~3 hours of operation, the node would go silent with 47K-64K queued messages while Libp2pPort was blocked inside the LevelDB NIF. Root cause: Handlers.on_block called Store.get_state(store, block.parent_root) which falls through to LevelDB on ETS cache miss. Once triggered, the NIF blocks the BEAM scheduler and no other messages can be processed. Fix: use Store.get_state_cached/2 which returns nil on ETS miss. The existing nil handling drops the block with "parent state not found". Optimistic sync will re-pull blocks in sequence (12-slot drift threshold) and each parent will be freshly cached from the previous block's processing. Verified: 37/37 fork_choice + 95/95 sanity spec tests pass.
…lDB stalls
After the handlers.ex fix (227b973), Libp2pPort was still stalling at ~9h uptime in eleveldb.get/3 via a different hot path: Head.get_head → get_filtered_block_tree → filter_leaf_block → justified_check → get_voting_source → Store.get_state!
Three more LevelDB fallthrough paths fixed:
1. Head.get_head: Store.get_checkpoint_state → get_checkpoint_state_cached. On cache miss, return the previous head_root instead of recomputing.
2. Head.get_voting_source: Store.get_state! → get_state_cached. On miss, fall back to voting_source_fallback (which also handles nil).
3. Head.voting_source_fallback: Store.get_state → get_state_cached. Existing nil handling returns store.justified_checkpoint.
All fallbacks are conservative: they either reuse previous head info or defer to the justified checkpoint (canonical chain). LMD-GHOST weight computation is skipped for this block; next block will retry with a warm cache. Optimistic sync handles any drift.
Verified: 37/37 fork_choice + 95/95 sanity spec tests pass.
Run 29 observation: stalled at 9h13m with Libp2pPort in eleveldb.get (93K queue). This fix addresses the remaining hot-path reads discovered there.
…lashing
After fixing on_block (227b973) and Head.get_head (86bbe5f), Libp2pPort was still stalling in eleveldb.get after ~2h uptime (run 30 observed). Two more LevelDB fallthrough paths remained in the attestation pipeline:
1. on_attestation line 193: Store.get_checkpoint_state → get_checkpoint_state_cached. Called for every block attestation and every gossip attestation, so this is an extremely hot path. Existing nil handling skips the attestation (fork choice best-effort).
2. on_attester_slashing line 249: Store.get_state! → get_state_cached. Returns error on cache miss, skipping the slashing (rare event, can be re-processed later when state is cached).
The nil/error handling matches existing patterns (e.g., the Lighthouse best-effort comment already in on_attestation). Attestations and slashings that reference un-cached states are simply dropped from fork choice weight calculation, which is correct behavior since we cannot validate them without the state.
Verified: 37/37 fork_choice + 95/95 sanity = 132/132 spec tests pass.
Run 30 observation: stalled at ~2h13m in eleveldb.get (45K queue) after previous fixes addressed on_block and get_head paths. This completes the hot-path fixes for synchronous block processing.
Run 31 (after 3 prior LevelDB hot-path fixes) still stalled at ~3h15m in
eleveldb.get. Remaining LevelDB fallthrough in block lookups during
check_attestation_valid: Blocks.get_block(beacon_block_root) and
Blocks.get_block(target.root).
With 512-entry LRU, most blocks are cached, but attestations can reference
old blocks that have been evicted. Reading a 200KB block shouldn't normally
block long, but under LevelDB compaction pressure (from 775MB state writes
every epoch), reads can queue for minutes.
Added:
- Blocks.get_block_info_cached/1 — ETS-only lookup
- Blocks.get_block_cached/1 — ETS-only convenience
Used them in check_attestation_valid. Attestations referencing uncached
blocks are returned as {:unknown_block, root}, which existing error
handlers treat as "defer for later" — no fork choice impact since we
re-receive the block via sync/gossip and retry.
Verified: 132/132 fork_choice + sanity spec tests pass.
After 4 prior LevelDB state fixes, runs still stalled every ~2-3h because block reads (Blocks.get_block!) in the fork choice hot path also trigger eleveldb.get during LevelDB compaction. Under compaction pressure from 775MB epoch state writes, even 200KB block reads queue for minutes.
Converted ALL remaining LevelDB-hitting paths to cache-only:
Store:
- get_ancestor: Blocks.get_block → get_block_cached (nil = return root as-is)
- get_children: Blocks.get_block! → get_block_cached (filter out uncached)
- update_head_info: Blocks.get_block! → get_block_cached (fallback prev slot)
Head:
- get_weight: Blocks.get_block! → get_block_cached (nil = return 0 weight)
- get_filtered_block_tree: try cached first, fallback to DB only for justified root
- get_voting_source: Blocks.get_block! → get_block_cached (nil = justified_checkpoint)
Blocks:
- Added get_block_cached/1 and get_block_info_cached/1 (ETS-only, no LevelDB)
All fallbacks are conservative: uncached blocks get 0 weight in LMD-GHOST, uncached children are filtered out of the fork tree, and uncached ancestors return the root as-is (same as pruned blocks). The node self-corrects via optimistic sync if head selection is briefly inaccurate.
Verified: 132/132 fork_choice + sanity spec tests pass.
Run 34 still stalled at ~2h40m despite 5 prior fixes. Comprehensive audit found additional LevelDB reads still on the Libp2pPort hot path:
- fork_choice.ex: recompute_head → Blocks.get_block!(head_root)
- handlers.ex: notify_forkchoice_update → Blocks.get_block!(finalized_root)
- handlers.ex: get_safe_execution_payload_hash → Blocks.get_block!(safe_root)
- head.ex: get_filtered_block_tree → Blocks.get_block!(justified_root)
- store.ex: collect_parent_chain → Blocks.get_block_info(current_root)
All converted to cache-only with graceful degradation:
- recompute_head: skip EL notification if head block uncached
- notify_forkchoice_update: return error if finalized block uncached
- get_filtered_block_tree: return empty tree if justified block uncached
- collect_parent_chain: stop walking at uncached blocks
This is the 6th commit in the LevelDB stall prevention series. Goal is to ensure NO synchronous LevelDB read ever runs on the Libp2pPort process.
Verified: 132/132 fork_choice + sanity spec tests pass.
Run 35 stalled at ~1h47m in eleveldb.get despite 6 prior cache-only fixes.
Root cause: IncomingRequestsHandler serves peer sync requests (BlocksByRange and BlocksByRoot) synchronously on the Libp2pPort process. BlocksByRange (line 139) called BlockDb.get_block_info_by_slot/1 DIRECTLY to LevelDB, not even through the ETS cache! Reading 32-64 blocks per request, any one read can block during LevelDB compaction.
Fixes:
- BlocksByRange: spawn Task.async for LevelDB reads with 5s timeout. If reads take too long, return empty response and kill the task. This keeps Libp2pPort responsive while still serving peers when fast.
- BlocksByRoot: use Blocks.get_block_info_cached/1 (ETS-only). Uncached blocks return :skip (peers try other nodes).
This is the 7th commit in the LevelDB stall prevention series.
Verified: 132/132 fork_choice + sanity spec tests pass.
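The BlocksByRange timeout pattern can be sketched with `Task.yield/2` and `Task.shutdown/2`. `RangeReadSketch` and `read_fun` are hypothetical; `read_fun` stands in for the LevelDB range read, and the short timeouts are only there to keep the demonstration quick.

```elixir
defmodule RangeReadSketch do
  # Run a potentially slow read in a separate task and wait a bounded time.
  # On timeout the task is killed and an empty response is returned, so the
  # caller never blocks longer than `timeout_ms`.
  def read_with_timeout(read_fun, timeout_ms) do
    task = Task.async(read_fun)

    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, blocks} -> blocks
      _timeout_or_exit -> []
    end
  end
end

fast = RangeReadSketch.read_with_timeout(fn -> [:block_a, :block_b] end, 100)

slow =
  RangeReadSketch.read_with_timeout(
    fn ->
      Process.sleep(500)
      [:block_c]
    end,
    20
  )
```

The `||` handles the case where the reply arrives between the yield timing out and the shutdown call, in which case `Task.shutdown/2` returns the `{:ok, reply}` tuple. Note that `Task.async/1` links the task to the caller, so a crash inside `read_fun` would also take down the caller unless it traps exits.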
…ndler
Run 36 survived 4h28m (longest yet with all fixes) but still stalled in eleveldb.get. PendingBlocks.process_blocks runs on Libp2pPort and had 5 calls to Blocks.get_block_info/1 that fall through to LevelDB.
Converted all to Blocks.get_block_info_cached/1:
- pending_blocks.ex lines 68, 224, 256, 272, 388
- On cache miss, blocks stay in download queue for retry (correct behavior)
This is the 8th commit in the LevelDB stall prevention series.
Verified: 132/132 fork_choice + sanity spec tests pass.
Symptom: On mainnet, a node whose head was 11-65 slots behind wall clock kept processing fresh gossip blocks via the full prefetch_states path, because `catching_up?` only checked the arriving block's slot distance from wall clock, not the store head's distance. Each fresh gossip block cost 30-45 s in `prefetch_states_and_committees/2`, which tore down the NIF's incremental merkle cache (via process_slots evicting the parent state from the 10-entry LRU), after which every subsequent block did full merkleization (4,300 ms) forever. This cascade grew the gap by ~2.9 slots/min and eventually froze the node for 19 h (observed run at head=14,114,704 frozen 2026-04-14T21:05 → 2026-04-15T16:24).
Root cause: `wall_slot - block_slot > 4` has per-block semantics. A fresh gossip block at tip passes this check even when our store's head is far behind, so the node keeps paying prefetch_states costs it can't benefit from (LMD-GHOST is already short-circuited when `wall_slot - block_slot > 1` in `recompute_head/3`).
Fix: widen `catching_up?` to also fire when `store.head_slot` is >4 slots behind wall clock. This gives the safety valve state-wide semantics instead of per-block semantics.
Confirmed live on mainnet: after the fix, no `prefetch_states=` entries appear in `[on_block]` log lines during catch-up, and per-block processing stays under the 12 s slot cadence.
Note: pre-existing spec-test / lint failures in other files are not related to this change.
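The widened predicate can be sketched as below. The 4-slot threshold comes from the description above; the module name and the three-argument shape are assumptions, since the real function reads the head slot from the store.

```elixir
defmodule CatchUpSketch do
  @max_lag 4

  # Per-block check (old behavior): only the arriving block's distance
  # from the wall clock is considered.
  def block_behind?(wall_slot, block_slot), do: wall_slot - block_slot > @max_lag

  # Widened check: also fires when the store head itself is lagging, even
  # if the arriving block is fresh.
  def catching_up?(wall_slot, block_slot, head_slot) do
    block_behind?(wall_slot, block_slot) or wall_slot - head_slot > @max_lag
  end
end

# A fresh gossip block at the tip while the head is 40 slots behind.
old_answer = CatchUpSketch.block_behind?(100, 100)
new_answer = CatchUpSketch.catching_up?(100, 100, 60)
```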
Symptom: Twice during a ~3 hour mainnet run (2026-04-15 20:51:56 at slot 14,121,856, and 22:15:35 at slot 14,122,274), Libp2pPort stopped processing blocks. Mailbox grew to 30-70k messages. Block head stopped advancing but beam stayed alive.
Stack via `Process.info(Libp2pPort, [:current_stacktrace])`:
  :eleveldb.get/3 ← blocking sync LevelDB read
  Peerbook.get/1
  Peerbook.db_span/2
  :telemetry.span/3
  Peerbook.fetch_peerbook!/0
  Peerbook.handle_new_peer/2
  Libp2pPort.handle_notification/2
  Libp2pPort.batch_drain_port_messages/3 ← inside shed-drain loop
Root cause: `new_peer` was in the shed keep-list (both `shed_load?/1` at line 764 and the inner list in `batch_drain_port_messages/3` at line 746). The rationale for keeping it was PeerDAS routing needs node_ids. But `Peerbook.handle_new_peer/2` does a read-modify-write against the `peerbook` KvSchema, which hits `eleveldb:get/3` on the hot Libp2pPort GenServer path.
Trigger sequence: (a) `prune_old_states` advances finalized checkpoint, (b) LevelDB compaction starts in the background, (c) an epoch-boundary block with 10-16 s processing piles gossip past the 2000 shed threshold, (d) shed-drain loop processes each queued new_peer sync-blocked on eleveldb during compaction. The drain itself stalls for minutes.
Fix: remove `:new_peer` from both the `shed_load?/1` exemption and the kept-list inside `batch_drain_port_messages/3`. During overload, new_peer events are now dropped along with gossip and req/resp-inbound. The Go-side libp2p port still tracks connected peers; only the Elixir-side Peerbook bookkeeping misses the notification. Subsequent discovery events and AddPeer calls re-populate Peerbook when load clears. Responses and results (replies to our own outbound requests) remain in the keep-list for correctness.
Risk: during sustained overload, Peerbook score/metadata/custody-group tracking will miss new peers. That's a graceful degradation vs. a full node stall.
Observed prior to this fix: two stalls within 4 hours of the performance-improvements-2-fixes branch. Companion fix to f09e5fa.
…leak
Symptom: After 17-23h of mainnet operation, gossip blocks stop arriving.
The node continues serving peers (DataColumnsByRoot, slot transitions)
but Libp2pPort is idle (mbox=0) with no incoming gossip. Observed on
runs 5 and 6 at slots 14,129,232 and 14,134,340 respectively.
Root cause: When gossip messages are dropped during load shedding (both
in the top-level `shed_load?` path at handle_info/2 and inside
`batch_drain_port_messages/3`), no validation response is sent back to
the Go port. On the Go side (subscriptions.go), each gossip message
spawns a validator goroutine that blocks on `return <-ch`. Without a
validation response (:accept/:reject/:ignore), the goroutine blocks
forever. These leaked goroutines exhaust go-libp2p-pubsub's validation
queue (`WithValidateQueueSize(600)`). Once all 600 slots are consumed
by leaked goroutines, no new gossip messages can be validated, and the
subscription is functionally dead.
Fix: Add `maybe_ignore_gossip/3` that sends `validate_message(:ignore)`
directly to the port (via `send_data/2`) for every gossip message
dropped during shedding. This unblocks the Go-side goroutine so the
validation slot is returned to the pool. Non-gossip dropped messages
(requests, tracer, new_peer) are not affected — they don't have
validator goroutines.
Also fixes a secondary bug in `handle_cast({:error_downloading_chunk})`
where a failed sync range request never decremented `blocks_remaining`,
leaving the node stuck in "syncing" state with no retry mechanism.
Risk: Sending :ignore for shed gossip means those messages won't be
re-propagated by our node. This is the correct behavior — we're under
load and can't validate them anyway. The alternative (goroutine leak
leading to gossip death) is strictly worse.
Companion to f09e5fa (catching_up? widening) and 2a13d1c (new_peer
shedding). Together these three fixes address all observed mainnet
stall patterns on this branch.
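
The leak described above can be reproduced in miniature. The following Go sketch does not use go-libp2p-pubsub; it models the validation queue as a bounded channel of slots, and the type and function names (`queue`, `submit`, `simulate`) as well as the queue size of 3 are assumptions chosen for illustration. It shows that a shed message with no verdict pins a slot forever, while replying `ignore` returns the slot to the pool.

```go
package main

import "fmt"

type verdict int

const (
	accept verdict = iota
	reject
	ignore
)

// queue models a bounded validation queue: a slot is held from the moment
// a message enters validation until a verdict arrives for it.
type queue struct {
	slots chan struct{}
}

func newQueue(size int) *queue {
	return &queue{slots: make(chan struct{}, size)}
}

// submit tries to start validating one message. On success it returns the
// channel the verdict must be sent on and a channel that is closed once the
// slot has been released. It returns ok=false when every slot is taken.
func (q *queue) submit() (chan<- verdict, <-chan struct{}, bool) {
	select {
	case q.slots <- struct{}{}:
	default:
		return nil, nil, false
	}
	ch := make(chan verdict)
	done := make(chan struct{})
	go func() {
		<-ch      // validator goroutine blocks until a verdict is sent
		<-q.slots // only then is the slot returned to the pool
		close(done)
	}()
	return ch, done, true
}

// simulate sheds `messages` gossip messages and reports how many of them
// managed to enter validation at all.
func simulate(size, messages int, replyOnShed bool) int {
	q := newQueue(size)
	entered := 0
	for i := 0; i < messages; i++ {
		ch, done, ok := q.submit()
		if !ok {
			continue // queue exhausted: the message is never validated
		}
		entered++
		if replyOnShed {
			ch <- ignore // the fix: always answer, even for dropped messages
			<-done
		}
		// Without a reply the goroutine above stays blocked and keeps its slot.
	}
	return entered
}

func main() {
	fmt.Println("no reply on shed:", simulate(3, 10, false)) // 3
	fmt.Println("ignore on shed:  ", simulate(3, 10, true))  // 10
}
```

Without a reply, only as many messages as there are slots ever enter validation; with an `ignore` verdict for each shed message, all of them do.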
The /metrics endpoint on mainnet returns ~96k lines and takes ~2s to serve. The previous 1s scrape_interval caused every scrape to time out (scrape_timeout defaults to scrape_interval), so Prometheus never ingested any samples and Grafana dashboards showed no data. Raised to scrape_interval: 15s / scrape_timeout: 10s. Target now reports up=1 with scrape duration ~1.83s.
After a multi-minute prefetch_states stall on 2026-04-20 22:30, all peers timed out and disconnected. The subsequent :check_pending_blocks tick hit BlockDownloader.get_some_peer/0, which raised RuntimeError "No peers available to request blocks from." That raise escaped all the way up to Libp2pPort's handle_continue/2 callback, killing the entire GenServer. The supervisor restarted it every ~4s, and the new GenServer hit the same :check_pending_blocks → same raise, crash-looping indefinitely (no [on_block] for 20+ minutes).
Fix:
- BlockDownloader.get_some_peer/0 returns :no_peers instead of raising. Resolves TODO #1317.
- BlockDownloader.request_blocks_by_root/3 and request_blocks_by_range/4 handle :no_peers by logging and returning :ok; callers leave the pending block in the download queue for the next tick to retry.
Also added defensive filters in PendingBlocks.process_blocks/1 and PendingBlocks.retry_download_columns/1 to skip any :pending or :download_columns block whose signed_block is nil. The crash loop left the store in that corrupted state, and the resulting BadMapError accounted for the second and third crashes that followed the initial one.
Remove the mailbox-queue-length shedder introduced in 629b9f4 (and its follow-up fixes in 77c809f, 3abb428, ba6f1f4). The fixes in the rest of the performance-improvements-2-fixes branch (cache-only block/state lookups, prefetch_states offloading, minimized LevelDB persistence) have materially reduced the steady-state pressure on Libp2pPort, and we want to measure whether the shedder is still load-bearing. If mailbox growth / OOM pressure returns under the current workload, revert this commit. Otherwise the simpler one-message-per-callback path stays.
Removes:
- @max_queue_before_shedding / @shed_log_interval constants
- shed_load?/1, batch_drain_port_messages/3, maybe_ignore_gossip/3
- shed-count tracking and recovery log in the :on_tick handler
- the shed branch in handle_info({port, {:data, _}}, state)
Motivation
This branch includes some optimizations and fixes over the added Fulu support.
Description
Closes #issue_number