Performance improvements and fixes #1502

Draft: MegaRedHand wants to merge 92 commits into fulu-support from performance-improvements-2-fixes

Motivation

This branch adds optimizations and fixes on top of the recently added Fulu support.

Description

Closes #issue_number

Add a new Rust NIF (hash_beacon_state_cached_rs) that hashes BeaconState
fields individually and accepts pre-computed field hashes for unchanged
fields. On non-epoch blocks, 5 expensive fields (validators, inactivity
scores, both sync committees, historical roots) are cached, skipping ~70% of
hashing work.

Results (32-block benchmark):
- Merkleization: ~7.2s -> ~2.4s per non-epoch block (-67%)
- Non-epoch avg: ~12.0s -> ~7.0s (-42%)
- Total: ~398s -> ~242s (-39%)
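
The caching idea can be illustrated with a small language-neutral sketch in Python. It is not the Rust NIF: the field names, `hash_field`, and the single flat hash over the field roots are simplifications assumed for illustration (real SSZ merkleization builds a binary tree over 32-byte chunks).

```python
import hashlib

def hash_field(value: bytes) -> bytes:
    """Stand-in for the expensive per-field hash_tree_root."""
    return hashlib.sha256(value).digest()

def hash_state_cached(fields, cached_roots):
    """Hash a state field by field, reusing roots supplied by the caller.

    fields: list of (name, serialized_value) pairs in declaration order.
    cached_roots: dict name -> root for fields known to be unchanged.
    Returns (state_root, field_roots, names_recomputed).
    """
    field_roots = {}
    recomputed = []
    for name, value in fields:
        if name in cached_roots:
            field_roots[name] = cached_roots[name]
        else:
            field_roots[name] = hash_field(value)
            recomputed.append(name)
    # Simplification: combine the field roots with one flat hash.
    combined = hashlib.sha256(b"".join(field_roots[n] for n, _ in fields)).digest()
    return combined, field_roots, recomputed

state = [("slot", b"\x01"), ("validators", b"v" * 1000), ("balances", b"b" * 1000)]
full_root, roots, _ = hash_state_cached(state, {})

# Next block: slot and balances changed, validators did not.
next_state = [("slot", b"\x02"), ("validators", b"v" * 1000), ("balances", b"c" * 1000)]
cached = {"validators": roots["validators"]}
next_root, _, recomputed = hash_state_cached(next_state, cached)
```

The cached call yields the same root as a full rehash while skipping the unchanged field, which is only safe if the caller is right about which fields are unchanged.
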
Move Ssz.to_ssz(state) out of the block processing critical path.
The SSZ-encoded binary is only needed for DB storage, which happens
asynchronously. By deferring encoding to the async task, we save ~2s
per block.

Results (32-block benchmark):
- Non-epoch avg: 7.0s -> 5.8s (-17.6%)
- Total: 242s -> 202s (-16.4%)
- Epoch block: 24.2s -> 22.6s (-6.6%)

Block attestations are already BLS-verified during state transition
(in process_attestation_batch). The fork-choice on_attestation handler
was verifying them again, doubling the BLS cost per block.

Skip check_valid_indexed_attestation when is_from_block=true.
Attestation processing: ~916ms -> ~213ms per block (-77%).
Total: 196s -> 181s (-7.4%).

Replace zip_with + filter + reduce (3 passes, creates intermediate
filtered vector) with zip_with + foldl (2 passes, no intermediate
allocation). Produces 0 for non-participating validators instead of
filtering them out.

This is called by compute_pulled_up_tip on every block (~656ms per
block untimed), plus once during epoch processing.

The incremental merkleization cache was incorrectly caching the
validators field hash (field 11) on ALL non-epoch blocks, but block
operations like slashings, exits, BLS-to-execution changes,
consolidation requests, and withdrawal requests can modify the
validators vector. This would produce incorrect state roots.

Now checks the block body for validator-modifying operations and only
caches field 11 when none are present. Zero performance impact since
these operations are extremely rare on mainnet.

Cache additional stable fields on non-epoch blocks:
- 14 (slashings), 17-20 (justification/checkpoints), 27 (historical_summaries),
  37 (proposer_lookahead)

Also corrects the safety analysis:
- Field 15 (previous_epoch_participation) is NOT cacheable — attestation
  processing updates it on every block for previous-epoch attestations
- Field 21 (inactivity_scores) must be excluded when validator-modifying
  operations are present, since add_validator_to_registry appends to it
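
The resulting cacheability rules can be summarised in a short Python sketch. The field indices are the ones quoted above; the body keys in `VALIDATOR_MODIFYING_OPS` and the rule that nothing is cached on epoch blocks are assumptions made for illustration, not the project's actual code.

```python
# Field indices quoted in the commit messages above.
VALIDATORS = 11
INACTIVITY_SCORES = 21
ALWAYS_STABLE = {14, 17, 18, 19, 20, 27, 37}

# Hypothetical block-body keys for operations that can modify the registry.
VALIDATOR_MODIFYING_OPS = (
    "proposer_slashings", "attester_slashings", "voluntary_exits",
    "bls_to_execution_changes", "consolidation_requests", "withdrawal_requests",
)

def cacheable_fields(is_epoch_block, body):
    """Return the set of field indices whose cached hash may be reused."""
    if is_epoch_block:
        return set()  # assumption: epoch processing may touch any field
    fields = set(ALWAYS_STABLE)
    if not any(body.get(op) for op in VALIDATOR_MODIFYING_OPS):
        # Only safe when no operation can change or append validators.
        fields |= {VALIDATORS, INACTIVITY_SCORES}
    return fields
```

Field 15 (previous_epoch_participation) never appears in the result, matching the corrected safety analysis.
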
Extract helper functions to reduce cyclomatic complexity and nesting
depth in state_transition.ex, epoch_processing.ex, bench/blocks.ex,
and bench/download.ex. Use Enum.map_join instead of Enum.map |> Enum.join.

…lock

The sync committee has 512 members and only rotates every 256 epochs.
Previously, get_sync_committee_indices scanned all 2.2M validators on
every block to resolve pubkeys. Now cached in ETS keyed by epoch+root.
Also replaced Stream with Aja.Vector.foldl to avoid materialization.

Results: Total 419.1s -> 398.6s (-4.9%), non-epoch avg -5.3%.

Replace 32 O(V) linear validator scans with 1 O(V) scan + 16-entry map.
For each epoch's pending deposits (up to 16), the old code did two full
validator set scans per deposit (find + find_index). The new approach
scans validators once to resolve only the needed pubkeys into a small
map, then uses O(1) lookups.

Benchmark: epoch.pending_deposits 4.2s -> 1.9s (-55%)
Epoch boundary block: 27.0s -> 23.7s (-12.2%)
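
A minimal Python sketch of the single-scan lookup follows. The function name and the use of plain strings for pubkeys are illustrative assumptions.

```python
def resolve_deposit_indices(validator_pubkeys, deposit_pubkeys):
    """Map each needed pubkey to its validator index with a single scan."""
    needed = set(deposit_pubkeys)
    found = {}
    for index, pubkey in enumerate(validator_pubkeys):
        if pubkey in needed and pubkey not in found:
            found[pubkey] = index
    return found

validators = [f"pk{i}" for i in range(1000)]
deposits = ["pk7", "pk999", "new-validator"]
index_by_pubkey = resolve_deposit_indices(validators, deposits)
```

Each deposit is then resolved with one dictionary lookup; a pubkey missing from the map belongs to a validator that is not yet in the registry.
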
When the exact start_key doesn't exist in LevelDB (e.g., a skipped slot
at the finalization boundary), iterator_move returns the next available
key instead. Previously this was treated as an error, causing pruning to
fail entirely and leaking the iterator handle.

Now fold_keys accepts the inexact position and iterates from there —
accumulate/4 already validates the prefix for each key, so we stay
within the correct table's key space. The iterator is always closed.
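
The behaviour can be sketched in Python with a toy iterator. `Iterator` is a hypothetical stand-in for a LevelDB iterator and the key layout is invented for the example; only the two properties described above are modelled: an inexact seek is accepted, and the iterator is closed on every path.

```python
import bisect

class Iterator:
    """Toy stand-in for a LevelDB iterator over sorted keys."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.pos = 0
        self.closed = False

    def seek(self, start_key):
        # Like LevelDB, land on the first key >= start_key.
        self.pos = bisect.bisect_left(self.keys, start_key)

    def next(self):
        if self.pos >= len(self.keys):
            return None
        key = self.keys[self.pos]
        self.pos += 1
        return key

    def close(self):
        self.closed = True

def fold_keys(it, start_key, prefix, acc, fun):
    try:
        it.seek(start_key)  # may land past start_key; that is acceptable
        while (key := it.next()) is not None and key.startswith(prefix):
            acc = fun(key, acc)
        return acc
    finally:
        it.close()  # released on every path, including errors

it = Iterator(["blocks|0003", "blocks|0005", "states|0001"])
# "blocks|0004" is a skipped slot, so the seek lands on "blocks|0005".
visited = fold_keys(it, "blocks|0004", "blocks|", [], lambda k, acc: acc + [k])
```

The prefix check is what keeps the fold inside one table even though the starting position is inexact.
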
… invalid

During catch-up sync, data columns may not be downloaded yet when
ForkChoice.on_block runs the data availability check. Previously this
permanently invalidated the block and cascade-invalidated all descendants
(~216 blocks in this incident). Now:

1. "data not available" moves block back to :download_columns for retry
2. add_block allows re-processing of previously :invalid blocks so
   Optimistic Sync can recover them after restart

…s transient

During catch-up sync, data columns may not be downloaded yet when
ForkChoice.on_block runs the data availability check. Previously this
permanently invalidated the block and cascade-invalidated all descendants
(~216 blocks in this incident). Now:

1. "data not available" moves block back to :download_columns for retry
2. add_block allows re-processing of previously :invalid blocks
3. On startup, recover_invalid_blocks resets blocks with signed_block data
   from :invalid to :download_columns so they can be re-evaluated

Without scheduling :retry_download_columns after recovery, recovered
blocks sit in :download_columns status with no timer to check if their
columns are already present in the DB.

retry_download_columns only re-requested missing columns but never moved
blocks to :pending when all columns were already present. This left
recovered blocks stuck in :download_columns status permanently.

Process at most 5 blocks per invocation and schedule a quick follow-up
for the remainder. This yields the GenServer between batches, allowing
GC to reclaim BeaconState objects (~300MB each) and preventing message
queue buildup that caused OOM kills at 57GB+ RSS.
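
The batching pattern can be sketched in Python as follows. `process_block` and `schedule_follow_up` are hypothetical callbacks standing in for the real block handler and the GenServer timer.

```python
from collections import deque

BATCH_SIZE = 5  # "at most 5 blocks per invocation"

def process_batch(pending, process_block, schedule_follow_up):
    """Process up to BATCH_SIZE blocks, then yield back to the caller."""
    for _ in range(min(BATCH_SIZE, len(pending))):
        process_block(pending.popleft())
    if pending:
        schedule_follow_up()  # the remainder is handled in a later invocation

pending = deque(range(12))
done, follow_ups = [], []
process_batch(pending, done.append, lambda: follow_ups.append("retry"))
```

Returning between batches is what gives other queued messages, and the garbage collector, a chance to run.
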
Status requests (status/1, status/2) called StoreDb.fetch_store() which
deserializes the entire Store from LevelDB for each peer request. During
sync with many peers, this caused the GenServer to stall deserializing
the store repeatedly while the message queue grew to 57K+.

Pass the store from the GenServer state directly to the request handler,
eliminating redundant LevelDB reads and SSZ deserialization.

Tree.update_root! crashes the GenServer when the finalized root is not
in the in-memory tree (common after restart/recovery). Replace with
Tree.update_root and rebuild the tree from the finalized root on
:not_found, matching the non-raising pattern used elsewhere.

… restart

After restart/recovery, rebuild_tree adds blocks to the tree but doesn't
populate unrealized_justifications. When get_voting_source looks up a
block root from a prior epoch, it gets nil and crashes with BadMapError
on voting_source.epoch. Similarly, filter_leaf_block crashes accessing
unrealized_justifications[block_root].epoch.

Fix by falling back to the block's state justified checkpoint when the
unrealized justification is missing, and guarding against nil in the
pull-up check.

When KZG verification fails for a block whose custody columns are all
present, the columns were likely corrupted during download. Previously
this marked the block as permanently invalid, cascading to ALL children
and effectively killing the chain.

Now we delete the stored columns and move the block back to
:download_columns so fresh copies are fetched. This prevents the
cascade invalidation that made the node unable to process any new
blocks on Hoodi testnet.

Also adds DataColumnDb.delete_columns_for_block/2 for targeted
column purging.

When BlockRootBySlot.get returns :not_found during state pruning,
the error branch logged the error but didn't return the accumulator.
This caused Logger.error's :ok return value to become the new acc,
leading to ArithmeticError: :erlang.+(:ok, 1) on the next iteration.
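
The same class of bug can be reproduced in Python, where a logging call returns `None` rather than `:ok`. This is an analogy with hypothetical function names, not the project's pruning code.

```python
import logging
from functools import reduce

def count_pruned_buggy(slots, roots_by_slot):
    def step(acc, slot):
        if slot in roots_by_slot:
            return acc + 1
        # Bug: the logging call's return value (None) becomes the accumulator.
        return logging.error("root for slot %s not found", slot)
    return reduce(step, slots, 0)

def count_pruned_fixed(slots, roots_by_slot):
    def step(acc, slot):
        if slot in roots_by_slot:
            return acc + 1
        logging.error("root for slot %s not found", slot)
        return acc  # keep the accumulator on the error branch
    return reduce(step, slots, 0)
```

With slots `[1, 2, 3]` and a missing root for slot 2, the buggy version fails on slot 3 when it tries to add 1 to `None`, while the fixed version simply skips the missing slot.
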
When update_tree detects the finalized root isn't in the tree, it
creates Tree.new(finalized_root) with only the finalized root. Then
Tree.add_block fails for every subsequent block because the parent
isn't in the minimal tree. This leaves the tree permanently stuck at
1 node, preventing LMD-GHOST head selection from advancing.

Add repair_tree_chain/3 which walks the parent chain from the new
block's parent back to the finalized root using Blocks.get_block_info,
collecting intermediate block roots. These are then added to the tree
in order, filling the gap. This triggers whenever add_block fails
(parent not in tree), self-repairing the tree on every finalization
advance.

Observed: head stuck at slot 2596640 (later 2596672) while processing
blocks up to 2596711+, with "Block not found in tree during
get_children" warning on every block. After fix, tree repaired with
61 blocks and head immediately advanced to each new block.
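
A minimal Python sketch of the parent-chain walk follows. The `parent_of` dictionary stands in for `Blocks.get_block_info`, and returning `None` for an unknown block is an assumption about error handling made for this example.

```python
def repair_tree_chain(tree, parent_of, new_parent_root, finalized_root):
    """Return the roots missing between finalized_root and new_parent_root.

    tree: set of roots already in the fork-choice tree.
    parent_of: dict root -> parent root (stand-in for a block lookup).
    The result is ordered oldest first so parents are added before children.
    Returns None if the walk reaches a block that is unknown locally.
    """
    missing = []
    root = new_parent_root
    while root != finalized_root and root not in tree:
        missing.append(root)
        root = parent_of.get(root)
        if root is None:
            return None
    missing.reverse()
    return missing

parent_of = {"B": "A", "C": "B", "D": "C"}
tree = {"A"}  # freshly rebuilt tree holding only the finalized root
gap = repair_tree_chain(tree, parent_of, "C", "A")  # parent chain of new block "D"
```

Adding the returned roots in order fills the gap, after which the new block's parent is present and the block itself can be added.
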
The BlockStates LRU cache was configured with max_entries=128, allowing up
to 128 BeaconStates (~460MB each) to accumulate in ETS. During catch-up sync,
this grew to 120 entries (55GB), exceeding the machine's 62GB RAM and causing
severe swap thrashing. recompute_head went from <500ms to 47+ seconds.

Reduced max_entries to 16 (~7.4GB) and batch_prune_size to 4 (to avoid
over-pruning with the smaller cache). States are backed by LevelDB via
StateDb, so evicted entries can be re-fetched when needed.

get_ancestor/3 used Blocks.get_block! which raises RuntimeError when a
block has been pruned from LevelDB. This crashed the Libp2pPort GenServer
when get_weight (LMD-GHOST) tried to compute ancestors for validator votes
referencing old pruned blocks.

Changed to Blocks.get_block (returns nil) and return the root as-is when
the block is not found. This means stale votes are safely discounted
(ancestor won't match any candidate), and finalized_check correctly
filters out blocks whose chain can't be verified.

After GenServer restart, the store's time may not have been advanced by
on_tick yet, causing valid blocks to be rejected as "from the future".
Previously, these blocks were permanently marked as :invalid, triggering
cascade invalidation of all descendant blocks.

Now treated as a transient timing error: block stays as :pending and a
retry is scheduled after 12 seconds (one slot), allowing on_tick to
advance the store time before the block is re-evaluated.

When the Libp2pPort GenServer falls behind processing blocks, incoming
gossip messages (attestations, column sidecars, sync committees) pile up
in the mailbox faster than they can be processed. This creates a feedback
loop: the queue grows → process memory balloons → node falls further
behind → more messages → OOM.

Observed: queue reached 34,654 messages (19.2 GB process memory) growing
at ~885 msgs/sec before manual intervention was required.

The fix checks message_queue_len on each non-essential message. When the
queue exceeds 2000 messages, gossip, incoming requests, peer notifications,
and tracer messages are dropped. Responses and results (replies to our own
block/column download requests) are always processed to maintain catch-up
capability. The shed count is logged periodically and reset when the queue
drains below the threshold.
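
The shedding decision described above can be sketched in Python. The class and message-type strings are illustrative assumptions; the threshold and the always-process set follow the description in this commit message.

```python
SHED_THRESHOLD = 2000
ALWAYS_PROCESS = {"response", "result"}  # replies to our own requests

class Shedder:
    def __init__(self):
        self.shed_count = 0

    def should_process(self, msg_type, queue_len):
        if queue_len <= SHED_THRESHOLD:
            self.shed_count = 0  # queue drained: reset the counter
            return True
        if msg_type in ALWAYS_PROCESS:
            return True
        self.shed_count += 1  # dropped; the count is logged periodically
        return False
```

Everything outside `ALWAYS_PROCESS` is treated as non-essential once the queue is over the threshold, so the contents of that set decide what the node can still do under overload.
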
The initial load shedding implementation dropped all non-response/result
port messages when the queue exceeded the threshold. This included
new_peer notifications, which carry the discv5 node_id needed by the
Peerbook for PeerDAS custody column routing.

Without node_ids, DataColumnDownloader reported :no_peers for column
downloads, leaving all blocks stuck in :download_columns status
indefinitely. Blocks could not advance past the checkpoint sync anchor.

Fix: add :new_peer to the always-process list alongside :response and
:result. This ensures PeerDAS routing data is always up-to-date while
still shedding high-volume gossip (attestations, column sidecars, etc.)
during overload.

During catch-up, LevelDB writes were skipped as an optimization.
States only existed in the 16-entry ETS LRU cache. When evicted
from ETS, they were permanently lost — LevelDB fetch returned
:not_found, causing "parent state not found" cascade failures.

The LRU cache already falls back to LevelDB on cache miss
(BlockStates.get_state_info → LRUCache.get → fetch_state →
StateDb.get_state_by_block_root), but this only works if the
state was written to LevelDB in the first place.

The async Task.Supervisor write doesn't block block processing,
so removing the catch-up skip has minimal performance impact.

The cache cleanup match spec used `{{x, _}}` which matches 1-tuples,
but ETS stores records as 2-tuples `{key, value}`. This meant the
cleanup spec never matched any records, causing beacon_committee and
active_validator_indices tables to grow without bound (~0.5 GB/hour).

After 26 hours of operation, these tables consumed 11.8 GB:
- beacon_committee: 834K entries / 7.5 GB (should be ~12K / 120 MB)
- active_validator_indices: 408 entries / 4.2 GB (should be ~8 / 85 MB)

Fix:
1. Change match spec from `{{x, _}}` to `{{x, _}, _}` to correctly
   match ETS record format {key, value}
2. Add cleanup trigger to Cache.set/3 (used by maybe_prefetch_committees)
   which previously bypassed cleanup entirely

Result: Cache tables dropped from 11.8 GB to 207 MB (98% reduction).
Total BEAM memory dropped from 26.6 GB to 14.6 GB (45% reduction).

On mainnet, StoreDb.persist_store/1 was called synchronously in
on_block, on_attestation, on_attester_slashing, and on_tick. This
performs :erlang.term_to_binary with compression followed by an
eleveldb.put call, blocking the Libp2pPort GenServer.

Problems observed:
1. LevelDB compaction during catch-up caused multi-minute write stalls,
   growing the message queue to 54K+ messages
2. Deep-copying the Store struct (1.2M latest_messages) to a new
   process via spawn takes 1-9 seconds per block
3. The combination of spawn overhead + epoch processing caused OOM
   on a 62 GB system

Fix: Skip persist_store entirely during catch-up sync (when head_slot
is >2 slots behind wall clock). Once caught up, persist only on epoch
boundaries (every 32 slots) via spawn. The store can be recovered
from checkpoint + replay if the node crashes during sync.
init_store remains synchronous since it only runs once at startup.

On mainnet, spawning a process with the Store struct deep-copies 1.2M
latest_messages entries, taking 15s and 3-5 GB extra memory, causing
OOM on 62 GB systems even at mid-epoch.

New approach: serialize (term_to_binary) in the calling process, then
spawn only for the LevelDB write. The serialized binary is a BEAM refc
binary shared between processes without copying. This eliminates the
deep-copy overhead entirely.

Persist fires at mid-epoch (slot mod 32 == 16) when caught up, and is
skipped during catch-up sync. The serialization (~7-9s with compression)
blocks the Libp2pPort for one slot per epoch (~6.4 min), which is
acceptable — it only misses one slot out of 32.
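
The ordering of the new approach (serialize first, hand only the bytes to a writer) and the mid-epoch trigger can be sketched in Python. This is an analogy: Python threads share memory, so the deep-copy cost does not exist here, and `pickle`, the dictionary used as a database, and the function names are all assumptions for illustration.

```python
import pickle
import threading

SLOTS_PER_EPOCH = 32

def should_persist(slot, caught_up):
    """Persist once per epoch, at mid-epoch, and only when caught up."""
    return caught_up and slot % SLOTS_PER_EPOCH == 16

def persist_store(store, db):
    # Serialize in the calling thread so the writer only ever receives
    # the resulting immutable bytes, never the large structure itself.
    blob = pickle.dumps(store)
    writer = threading.Thread(target=db.__setitem__, args=("store", blob))
    writer.start()
    return writer
```

The caller pays the serialization cost up front, and only the write itself runs in the background.
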
Three issues caused the node to stall for minutes when data columns
were unavailable:

1. Partial responses silently dropped: When a peer returned 2 of 4
   requested columns, process_data_columns stored them but did NOT
   immediately re-request the remaining columns. The node waited
   30-60s for the retry timer, falling behind the chain.
   Fix: Immediately re-request missing columns on partial response.

2. Missing peer fallback in request_columns_by_root: Unlike
   request_columns_by_range, the by-root path did not fall back to
   get_some_peer() when no PeerDAS-capable peer was found, causing
   immediate :no_peers failures.
   Fix: Add get_some_peer() fallback.

3. Retry delays too slow: 30s error retry and 60s heartbeat meant
   the node fell 2-5 slots behind per retry cycle.
   Fix: Reduce to 5s error retry and 12s heartbeat (one slot).

…verload

The peerbook previously had no hard limit — peers accumulated
unboundedly, with only a soft 128 target and gentle 5% challenge-based
pruning. This caused the Libp2pPort GenServer to process messages from
hundreds of peers, leading to message queue buildup and load shedding.

Changes:
- Hard cap at 100 peers: new peers above this limit trigger immediate
  eviction of the lowest-scoring non-PeerDAS peer
- Soft target lowered to 80: pruning starts earlier
- Aggressive eviction when well above target: lowest-scoring non-PeerDAS
  peers are immediately removed (not just challenged)
- PeerDAS peers (with custody_group_count set) are protected from
  eviction since they're needed for data availability
- Prune percentage increased from 5% to 10%, max prune from 8 to 10
- New peer handling updates node_id for existing peers (useful when
  discovery provides node_id for a peer first seen via gossip)
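
The hard-cap eviction rule can be sketched in Python. The peer record layout and function name are hypothetical, and the sketch covers only the hard cap, not the soft target or challenge-based pruning.

```python
MAX_PEERS = 100

def add_peer(peers, peer_id, score=0, custody_group_count=None):
    """Insert a peer, evicting the lowest-scoring non-PeerDAS peer if over the cap.

    peers: dict peer_id -> {"score": int, "custody_group_count": int or None}
    Returns the evicted peer id, or None if nobody was evicted.
    """
    peers[peer_id] = {"score": score, "custody_group_count": custody_group_count}
    if len(peers) <= MAX_PEERS:
        return None
    # Peers with custody_group_count set are PeerDAS peers and are protected.
    candidates = [p for p, info in peers.items()
                  if info["custody_group_count"] is None]
    if not candidates:
        return None  # assumption: with only PeerDAS peers left, nothing is evicted
    victim = min(candidates, key=lambda p: peers[p]["score"])
    del peers[victim]
    return victim
```

Note that the newly added peer is itself an eviction candidate if it has the lowest score and no custody group count.
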
The Go libp2p host accepted unlimited incoming connections, causing
hundreds of peers to accumulate. Each peer generates gossip messages
that flood the Elixir Libp2pPort GenServer, leading to 500K+ message
queue buildup within 1-2 hours and eventual stall.

Adds a ConnManager with LowWater=60 and HighWater=80 peers. When peer
count exceeds HighWater, libp2p automatically prunes connections down
to LowWater, keeping the message volume manageable. New peers get a
1-minute grace period before being eligible for pruning.

This complements the Elixir-side peerbook limit (max 100) which only
controlled peer selection, not actual Go-level connections.

Each BeaconState on mainnet is ~775 MB (1.2M validators). The original
16 entries consumed ~12.4 GB, causing OOM during epoch transitions.

6 entries (4.6 GB) caused frequent cache misses that triggered 30s+
LevelDB reads blocking the Libp2pPort GenServer. 10 entries (7.7 GB)
balances memory usage with cache hit rate, reducing expensive LevelDB
state fetches while leaving ~20 GB headroom on a 62 GB system.

Before this fix, prefetch_states in the ForkChoice GenServer would fetch
checkpoint states from LevelDB when they weren't in the ETS LRU cache.
On mainnet, each BeaconState is ~775MB, and LevelDB deserialization takes
28-35 seconds per state. With multiple checkpoint targets during epoch
transitions, this caused blocks to take 7-92 seconds to process, making
the node oscillate between head-tracking and falling 16+ slots behind.

The fix adds cache-only variants of state lookup functions that check
the in-memory store maps and ETS LRU cache but never fall through to
LevelDB. The fetch_checkpoint_state function (used by prefetch_states)
now uses these cached-only lookups. If a checkpoint state isn't in cache,
that attestation target is gracefully skipped in fork choice weight
calculation rather than blocking for 28-85 seconds.

This does NOT affect block validity (state transitions still use full
lookups). It only affects LMD-GHOST fork choice weight for attestations
referencing uncached checkpoint states - equivalent to what other clients
do when dropping late attestations.

Result: blocks now process in 2-5s consistently (was 7-92s with spikes).
Spec tests pass (15113 tests, 0 failures).
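
The difference between a full lookup and a cache-only lookup can be sketched in Python. `StateStore` and `attestation_weight` are hypothetical names, and plain dictionaries stand in for the ETS cache and LevelDB.

```python
class StateStore:
    def __init__(self, cache, db):
        self.cache = cache   # in-memory, fast
        self.db = db         # stand-in for the on-disk database, slow
        self.db_reads = 0

    def get_state(self, root):
        """Full lookup: falls through to the database on a cache miss."""
        if root in self.cache:
            return self.cache[root]
        self.db_reads += 1
        state = self.db.get(root)
        if state is not None:
            self.cache[root] = state
        return state

    def get_state_cached(self, root):
        """Cache-only lookup: returns None on a miss, never touches the database."""
        return self.cache.get(root)

def attestation_weight(store, attestations):
    """Sum weights, skipping targets whose checkpoint state is not cached."""
    total = 0
    for target_root, weight in attestations:
        if store.get_state_cached(target_root) is not None:
            total += weight
    return total
```

The hot path trades a bounded loss of fork-choice weight for a guarantee that it never waits on disk; paths that determine block validity keep using the full lookup.
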
The parent state ETS touch was only done when NOT in catch-up mode.
During catch-up, rapid sequential block processing fills the 10-entry
LRU cache, and parent states get evicted since they aren't refreshed.
The next block then falls through to LevelDB (775MB state read taking
30s-10min+ on mainnet due to compaction contention in a 23GB database).

Now always touch the parent state regardless of catch-up status.
The touch is a lightweight GenServer.cast (no blocking) and prevents
the common pattern of: near-head → fall behind → catch-up mode →
parent evicted → 10+ minute LevelDB stall.

Previously every block's state (~775MB on mainnet) was written to LevelDB
asynchronously. This created continuous compaction storms (448MB SST tables)
that blocked concurrent LevelDB reads for 6-12+ minutes when the ETS
cache missed and needed a state from disk.

Now only persist every 4th block and at epoch boundaries, reducing write
volume by ~75%. The ETS LRU cache (10 entries) remains the primary
fast-path; LevelDB is the fallback for rare cache misses. Epoch boundary
states always persist since they're needed for checkpoint computation.

Combined with the prefetch_states cache-only fix and parent state touch
fix, this should significantly reduce the frequency of LevelDB compaction
storms that block the ForkChoice/Libp2pPort GenServer.

LevelDB writes of 775MB mainnet BeaconStates cause compaction storms
(448MB SST tables) that block concurrent reads for 5-10+ minutes. Even
writing every 4th block generated enough compaction to stall the node
after ~12 minutes at head.

Now only persist to LevelDB at epoch boundaries (~every 6.4 min) and
only when at head (not during catch-up). This reduces writes from 32
per epoch (every block) to 1 per epoch (97% reduction). The ETS LRU
cache (10 entries) is the primary fast-path storage; LevelDB is only
the crash recovery fallback to the nearest epoch boundary.

During catch-up, skipping LevelDB writes entirely ensures the catch-up phase
completes without any compaction-induced stalls.

…B stalls

The Libp2pPort GenServer was stalling for 10+ minutes on eleveldb.get/3
reads of 775MB mainnet BeaconStates. Pattern: every ~3 hours of operation,
the node would go silent with 47K-64K queued messages while Libp2pPort
was blocked inside the LevelDB NIF.

Root cause: Handlers.on_block called Store.get_state(store, block.parent_root)
which falls through to LevelDB on ETS cache miss. Once triggered, the NIF
blocks the BEAM scheduler and no other messages can be processed.

Fix: use Store.get_state_cached/2 which returns nil on ETS miss. The existing
nil handling drops the block with "parent state not found". Optimistic sync
will re-pull blocks in sequence (12-slot drift threshold) and each parent
will be freshly cached from the previous block's processing.

Verified: 37/37 fork_choice + 95/95 sanity spec tests pass.

…lDB stalls

After the handlers.ex fix (227b973), Libp2pPort was still stalling at ~9h
uptime in eleveldb.get/3 via a different hot path:
  Head.get_head → get_filtered_block_tree → filter_leaf_block →
  justified_check → get_voting_source → Store.get_state!

Three more LevelDB fallthrough paths fixed:

1. Head.get_head: Store.get_checkpoint_state → get_checkpoint_state_cached.
   On cache miss, return the previous head_root instead of recomputing.

2. Head.get_voting_source: Store.get_state! → get_state_cached. On miss,
   fall back to voting_source_fallback (which also handles nil).

3. Head.voting_source_fallback: Store.get_state → get_state_cached.
   Existing nil handling returns store.justified_checkpoint.

All fallbacks are conservative — they either reuse previous head info or
defer to the justified checkpoint (canonical chain). LMD-GHOST weight
computation is skipped for this block; next block will retry with a warm
cache. Optimistic sync handles any drift.

Verified: 37/37 fork_choice + 95/95 sanity spec tests pass.

Run 29 observation: stalled at 9h13m with Libp2pPort in eleveldb.get (93K
queue). This fix addresses the remaining hot-path reads discovered there.

…lashing

After fixing on_block (227b973) and Head.get_head (86bbe5f), Libp2pPort
was still stalling in eleveldb.get after ~2h uptime (run 30 observed).

Two more LevelDB fallthrough paths remained in the attestation pipeline:

1. on_attestation line 193: Store.get_checkpoint_state → get_checkpoint_state_cached.
   Called for every block attestation and every gossip attestation — extremely
   hot path. Existing nil handling skips the attestation (fork choice best-effort).

2. on_attester_slashing line 249: Store.get_state! → get_state_cached.
   Returns error on cache miss, skipping the slashing (rare event, can be
   re-processed later when state is cached).

The nil/error handling matches existing patterns (e.g., the Lighthouse best-effort
comment already in on_attestation). Attestations and slashings that reference
un-cached states are simply dropped from fork choice weight calculation — a
correct behavior since we cannot validate them without the state.

Verified: 37/37 fork_choice + 95/95 sanity = 132/132 spec tests pass.

Run 30 observation: stalled at ~2h13m in eleveldb.get (45K queue) after previous
fixes addressed on_block and get_head paths. This completes the hot-path fixes
for synchronous block processing.

Run 31 (after 3 prior LevelDB hot-path fixes) still stalled at ~3h15m in
eleveldb.get. Remaining LevelDB fallthrough in block lookups during
check_attestation_valid: Blocks.get_block(beacon_block_root) and
Blocks.get_block(target.root).

With 512-entry LRU, most blocks are cached, but attestations can reference
old blocks that have been evicted. Reading a 200KB block shouldn't normally
block long, but under LevelDB compaction pressure (from 775MB state writes
every epoch), reads can queue for minutes.

Added:
- Blocks.get_block_info_cached/1 — ETS-only lookup
- Blocks.get_block_cached/1 — ETS-only convenience

Used them in check_attestation_valid. Attestations referencing uncached
blocks are returned as {:unknown_block, root}, which existing error
handlers treat as "defer for later" — no fork choice impact since we
re-receive the block via sync/gossip and retry.

Verified: 132/132 fork_choice + sanity spec tests pass.

After 4 prior LevelDB state fixes, runs still stalled every ~2-3h because
block reads (Blocks.get_block!) in the fork choice hot path also trigger
eleveldb.get during LevelDB compaction. Under compaction pressure from
775MB epoch state writes, even 200KB block reads queue for minutes.

Converted ALL remaining LevelDB-hitting paths to cache-only:

Store:
- get_ancestor: Blocks.get_block → get_block_cached (nil = return root as-is)
- get_children: Blocks.get_block! → get_block_cached (filter out uncached)
- update_head_info: Blocks.get_block! → get_block_cached (fallback prev slot)

Head:
- get_weight: Blocks.get_block! → get_block_cached (nil = return 0 weight)
- get_filtered_block_tree: try cached first, fallback to DB only for justified root
- get_voting_source: Blocks.get_block! → get_block_cached (nil = justified_checkpoint)

Blocks:
- Added get_block_cached/1 and get_block_info_cached/1 (ETS-only, no LevelDB)

All fallbacks are conservative: uncached blocks get 0 weight in LMD-GHOST,
uncached children are filtered out of the fork tree, and uncached ancestors
return the root as-is (same as pruned blocks). The node self-corrects via
optimistic sync if head selection is briefly inaccurate.

Verified: 132/132 fork_choice + sanity spec tests pass.

Run 34 still stalled at ~2h40m despite 5 prior fixes. Comprehensive audit
found additional LevelDB reads still on the Libp2pPort hot path:

- fork_choice.ex: recompute_head → Blocks.get_block!(head_root)
- handlers.ex: notify_forkchoice_update → Blocks.get_block!(finalized_root)
- handlers.ex: get_safe_execution_payload_hash → Blocks.get_block!(safe_root)
- head.ex: get_filtered_block_tree → Blocks.get_block!(justified_root)
- store.ex: collect_parent_chain → Blocks.get_block_info(current_root)

All converted to cache-only with graceful degradation:
- recompute_head: skip EL notification if head block uncached
- notify_forkchoice_update: return error if finalized block uncached
- get_filtered_block_tree: return empty tree if justified block uncached
- collect_parent_chain: stop walking at uncached blocks

This is the 6th commit in the LevelDB stall prevention series. Goal is to
ensure NO synchronous LevelDB read ever runs on the Libp2pPort process.

Verified: 132/132 fork_choice + sanity spec tests pass.

Run 35 stalled at ~1h47m in eleveldb.get despite 6 prior cache-only fixes.
Root cause: IncomingRequestsHandler serves peer sync requests (BlocksByRange
and BlocksByRoot) synchronously on the Libp2pPort process.

BlocksByRange (line 139) called BlockDb.get_block_info_by_slot/1, which
reads DIRECTLY from LevelDB and bypasses the ETS cache entirely. With 32-64
blocks read per request, any single read can block during LevelDB compaction.

Fixes:
- BlocksByRange: spawn Task.async for LevelDB reads with 5s timeout.
  If reads take too long, return empty response and kill the task.
  This keeps Libp2pPort responsive while still serving peers when fast.

- BlocksByRoot: use Blocks.get_block_info_cached/1 (ETS-only).
  Uncached blocks return :skip (peers try other nodes).

This is the 7th commit in the LevelDB stall prevention series.

Verified: 132/132 fork_choice + sanity spec tests pass.
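
The timeout pattern used for BlocksByRange can be approximated in Python with a thread pool. This is an analogy to `Task.async` with a timeout, using hypothetical names; unlike an Elixir task, a running Python thread cannot be killed, so the sketch only stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def read_blocks_with_timeout(read_fn, slots, timeout_s, executor):
    """Run a possibly slow read off the caller's thread; give up after timeout_s."""
    future = executor.submit(lambda: [read_fn(s) for s in slots])
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        future.cancel()  # best effort; has no effect once the task is running
        return []        # empty response keeps the caller responsive

executor = ThreadPoolExecutor(max_workers=2)
fast = read_blocks_with_timeout(lambda s: f"block-{s}", [1, 2], 1.0, executor)
slow = read_blocks_with_timeout(lambda s: time.sleep(0.3) or s, [1], 0.05, executor)
executor.shutdown(wait=False)
```

The caller's worst-case latency is bounded by the timeout rather than by the slowest database read.
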
…ndler

Run 36 survived 4h28m (longest yet with all fixes) but still stalled in
eleveldb.get. PendingBlocks.process_blocks runs on Libp2pPort and had
5 calls to Blocks.get_block_info/1 that fall through to LevelDB.

Converted all to Blocks.get_block_info_cached/1:
- pending_blocks.ex lines 68, 224, 256, 272, 388
- On cache miss, blocks stay in download queue for retry (correct behavior)

This is the 8th commit in the LevelDB stall prevention series.

Verified: 132/132 fork_choice + sanity spec tests pass.

Symptom: On mainnet, a node whose head was 11-65 slots behind wall clock
kept processing fresh gossip blocks via the full prefetch_states path,
because `catching_up?` only checked the arriving block's slot distance
from wall clock — not the store head's distance. Each fresh gossip block
cost 30-45 s in `prefetch_states_and_committees/2`, which tore down the
NIF's incremental merkle cache (via process_slots evicting the parent
state from the 10-entry LRU), after which every subsequent block did
full merkleization (4,300 ms) indefinitely. This cascade grew the gap by
~2.9 slots/min and eventually froze the node for 19 h (observed run at
head=14,114,704, frozen 2026-04-14T21:05 → 2026-04-15T16:24).

Root cause: `wall_slot - block_slot > 4` has per-block semantics.
A fresh gossip block at tip passes this check even when our store's
head is far behind, so the node keeps paying prefetch_states costs it
can't benefit from (LMD-GHOST is already short-circuited when
`wall_slot - block_slot > 1` in `recompute_head/3`).

Fix: widen `catching_up?` to also fire when `store.head_slot` is >4
slots behind wall clock. This gives the safety valve state-wide rather
than per-block semantics. Confirmed live on mainnet: after the fix, no
`prefetch_states=` entries appear in `[on_block]` log lines during
catch-up, and per-block processing stays under the 12 s slot cadence.
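
The change in semantics can be shown with two small Python predicates. The function names are illustrative; the threshold of 4 slots is the one quoted above.

```python
CATCH_UP_THRESHOLD = 4  # slots

def catching_up_old(wall_slot, block_slot):
    """Per-block semantics: only the arriving block's age is considered."""
    return wall_slot - block_slot > CATCH_UP_THRESHOLD

def catching_up(wall_slot, block_slot, head_slot):
    """State-wide semantics: also fire when the store's head lags."""
    return (wall_slot - block_slot > CATCH_UP_THRESHOLD
            or wall_slot - head_slot > CATCH_UP_THRESHOLD)
```

For a fresh gossip block at slot 1000 arriving while the head is at slot 980, the old predicate reports that the node is not catching up, while the widened one reports that it is.
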

Note: the pre-existing spec-test and lint failures in other files are
unrelated to this change.

Symptom: Twice during a ~3 hour mainnet run (2026-04-15 20:51:56 at
slot 14,121,856, and 22:15:35 at slot 14,122,274), Libp2pPort stopped
processing blocks. The mailbox grew to 30-70k messages. The head
stopped advancing, but the BEAM stayed alive.

Stack via `Process.info(Libp2pPort, [:current_stacktrace])`:
  :eleveldb.get/3              ← blocking sync LevelDB read
  Peerbook.get/1
  Peerbook.db_span/2
  :telemetry.span/3
  Peerbook.fetch_peerbook!/0
  Peerbook.handle_new_peer/2
  Libp2pPort.handle_notification/2
  Libp2pPort.batch_drain_port_messages/3   ← inside shed-drain loop

Root cause: `new_peer` was in the shed keep-list (both `shed_load?/1`
at line 764 and the inner list in `batch_drain_port_messages/3` at
line 746). The rationale for keeping it was that PeerDAS routing needs
node_ids. But `Peerbook.handle_new_peer/2` does a read-modify-write
against the `peerbook` KvSchema, which hits `:eleveldb.get/3` on the
hot Libp2pPort GenServer path.

Trigger sequence: (a) `prune_old_states` advances finalized checkpoint,
(b) LevelDB compaction starts in the background, (c) an epoch-boundary
block with 10-16 s processing piles gossip past the 2000 shed threshold,
(d) shed-drain loop processes each queued new_peer sync-blocked on
eleveldb during compaction. The drain itself stalls for minutes.

Fix: remove `:new_peer` from both the `shed_load?/1` exemption and
the kept-list inside `batch_drain_port_messages/3`. During overload,
new_peer events are now dropped along with gossip and req/resp-inbound.
The Go-side libp2p port still tracks connected peers; only the
Elixir-side Peerbook bookkeeping misses the notification. Subsequent
discovery events and AddPeer calls re-populate Peerbook when load
clears. Responses and results (replies to our own outbound requests)
remain in the keep-list for correctness.
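
The resulting keep-list can be sketched as follows. This is an illustration under stated assumptions, not the actual handler: messages are modeled as `{tag, payload}` tuples, the tag atoms are hypothetical, and the threshold is taken to be a strict "more than 2000" comparison.

```elixir
defmodule ShedSketch do
  # Value taken from the trigger sequence (the 2000 shed threshold).
  @max_queue_before_shedding 2000

  # Only replies to our own outbound requests survive shedding.
  # :new_peer is deliberately absent after this fix.
  @keep_under_load [:response, :result]

  def overloaded?(queue_len), do: queue_len > @max_queue_before_shedding

  # Messages are modeled as {tag, payload} tuples for illustration.
  def keep?({tag, _payload}), do: tag in @keep_under_load

  # Splits a drained batch into messages to process and messages to drop.
  def partition(messages), do: Enum.split_with(messages, &keep?/1)
end
```

Keeping the list this short means nothing on the drain path touches LevelDB, which is the property the fix is after.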

Risk: during sustained overload, Peerbook score/metadata/custody-group
tracking will miss new peers. That's a graceful degradation vs. a full
node stall.

Observed prior to this fix: two stalls within 4 hours of the
performance-improvements-2-fixes branch. Companion fix to f09e5fa.
…leak

Symptom: After 17-23h of mainnet operation, gossip blocks stop arriving.
The node continues serving peers (DataColumnsByRoot, slot transitions)
but Libp2pPort is idle (mbox=0) with no incoming gossip. Observed on
runs 5 and 6 at slots 14,129,232 and 14,134,340 respectively.

Root cause: When gossip messages are dropped during load shedding (both
in the top-level `shed_load?` path at handle_info/2 and inside
`batch_drain_port_messages/3`), no validation response is sent back to
the Go port. On the Go side (subscriptions.go), each gossip message
spawns a validator goroutine that blocks on `return <-ch`. Without a
validation response (:accept/:reject/:ignore), the goroutine blocks
forever. These leaked goroutines exhaust go-libp2p-pubsub's validation
queue (`WithValidateQueueSize(600)`). Once all 600 slots are consumed
by leaked goroutines, no new gossip messages can be validated, and the
subscription is functionally dead.

Fix: Add `maybe_ignore_gossip/3` that sends `validate_message(:ignore)`
directly to the port (via `send_data/2`) for every gossip message
dropped during shedding. This unblocks the Go-side goroutine so the
validation slot is returned to the pool. Non-gossip dropped messages
(requests, tracer, new_peer) are not affected — they don't have
validator goroutines.
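
The idea can be sketched as below, assuming gossip messages carry a message id and are modeled as `{:gossip, msg_id, payload}` tuples while everything else is a `{tag, payload}` tuple. The tuple shapes and the `send_fun` callback (standing in for the port write) are hypothetical; the real `maybe_ignore_gossip/3` may look quite different.

```elixir
defmodule IgnoreOnShedSketch do
  # Drops every non-kept message, but answers each dropped gossip message
  # with an :ignore verdict so the Go-side validator goroutine can return.
  # `send_fun` stands in for the port write (send_data/2 in the description).
  def shed(messages, send_fun) when is_function(send_fun, 1) do
    Enum.filter(messages, fn
      {:gossip, msg_id, _payload} ->
        send_fun.({:validate_message, msg_id, :ignore})
        false

      {tag, _payload} ->
        tag in [:response, :result]
    end)
  end
end
```

The invariant worth preserving is that every gossip message received from the port gets exactly one verdict, whether or not it was processed.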

Also fixes a secondary bug in `handle_cast({:error_downloading_chunk})`
where a failed sync range request never decremented `blocks_remaining`,
leaving the node stuck in "syncing" state with no retry mechanism.

Risk: Sending :ignore for shed gossip means those messages won't be
re-propagated by our node. This is the correct behavior — we're under
load and can't validate them anyway. The alternative (goroutine leak
leading to gossip death) is strictly worse.

Companion to f09e5fa (catching_up? widening) and 2a13d1c (new_peer
shedding). Together these three fixes address all observed mainnet
stall patterns on this branch.
The /metrics endpoint on mainnet returns ~96k lines and takes ~2s to
serve. The previous 1s scrape_interval caused every scrape to time out
(an unset scrape_timeout defaults to 10s but is capped at
scrape_interval), so Prometheus never ingested any samples and Grafana
dashboards showed no data.

Raised to scrape_interval: 15s / scrape_timeout: 10s. Target now
reports up=1 with scrape duration ~1.83s.
After a multi-minute prefetch_states stall on 2026-04-20 22:30,
all peers timed out and disconnected. The subsequent
:check_pending_blocks tick hit BlockDownloader.get_some_peer/0,
which raised RuntimeError "No peers available to request blocks
from." That raise escaped all the way up to Libp2pPort's
handle_continue/2 callback, killing the entire GenServer.
The supervisor restarted it every ~4s, and each new GenServer hit the
same :check_pending_blocks tick and the same raise, crash-looping
indefinitely (no [on_block] for 20+ minutes).

Fix:
- BlockDownloader.get_some_peer/0 returns :no_peers instead of
  raising. Resolves TODO #1317.
- BlockDownloader.request_blocks_by_root/3 and request_blocks_by_range/4
  handle :no_peers by logging and returning :ok; callers leave the
  pending block in the download queue for the next tick to retry.
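
A minimal sketch of the non-raising shape described in the list above. It is an assumption-laden illustration: the peer list is passed in explicitly (the real `get_some_peer/0` takes no arguments), and `request_fun` is a hypothetical stand-in for the actual network call.

```elixir
defmodule NoPeersSketch do
  # Returns a peer id, or :no_peers instead of raising when none are known.
  def get_some_peer([]), do: :no_peers
  def get_some_peer(peers), do: Enum.random(peers)

  # Logs and returns :ok when there is nobody to ask, so the caller keeps
  # the block in its download queue and retries on the next tick.
  def request_blocks_by_root(peers, roots, request_fun) do
    case get_some_peer(peers) do
      :no_peers ->
        # Real code would use Logger; IO.puts keeps the sketch dependency-free.
        IO.puts(:stderr, "No peers available; deferring #{length(roots)} request(s)")
        :ok

      peer_id ->
        request_fun.(peer_id, roots)
    end
  end
end
```

Returning a tagged atom keeps a transient "no peers" condition from escaping into the GenServer callback, where an exception would restart the whole process.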

Also added defensive filters in PendingBlocks.process_blocks/1 and
PendingBlocks.retry_download_columns/1 to skip any :pending or
:download_columns block whose signed_block is nil. The crash loop
left the store in that corrupted state, and the resulting
BadMapError caused the second and third crashes that followed the
first one.
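
The defensive filter can be sketched as follows, assuming each pending entry is a map with `:status` and `:signed_block` keys. The map shape and module name are illustrative assumptions, not the actual PendingBlocks data structure.

```elixir
defmodule PendingFilterSketch do
  # Statuses whose processing needs the block body to be present.
  @needs_block [:pending, :download_columns]

  # Keeps only entries that are safe to process: anything in a status that
  # requires the block body must actually carry a non-nil signed_block.
  def processable(block_infos) do
    Enum.reject(block_infos, fn
      %{status: status, signed_block: nil} -> status in @needs_block
      _other -> false
    end)
  end
end
```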
Remove the mailbox-queue-length shedder introduced in 629b9f4 (and its
follow-up fixes in 77c809f, 3abb428, ba6f1f4). The fixes in the rest of
the performance-improvements-2-fixes branch (cache-only block/state
lookups, prefetch_states offloading, minimized LevelDB persistence)
have materially reduced the steady-state pressure on Libp2pPort, and
we want to measure whether the shedder is still load-bearing.

If mailbox growth / OOM pressure returns under the current workload,
revert this commit. Otherwise the simpler one-message-per-callback
path stays.

Removes:
- @max_queue_before_shedding / @shed_log_interval constants
- shed_load?/1, batch_drain_port_messages/3, maybe_ignore_gossip/3
- shed-count tracking and recovery log in the :on_tick handler
- the shed branch in handle_info({port, {:data, _}}, state)