Fix double handoff for unloaded small messages#84

Open
breakertt wants to merge 1 commit into PlatformLab:main from breakertt:fix-handoff-twice

Conversation

@breakertt
Contributor

Fix double handoff for unloaded small messages

Resolves issue #77: move the ACK block to after homa_add_packet + homa_rpc_handoff. The unlock is still there (homa_rpc_acked's lock ordering still requires it), but now the skb is on the queue before the unlock window opens: anyone who grabs the lock during the window finds data, so no clearing happens. The unit test is modified correspondingly.

Impact on a CloudLab xl170 pair (25 GbE, Linux 6.17.8): at unloaded 64 B, server-side handoff_count / requests_received drops from 1.148 to 1.000 (5/5 trials; race closed). Loaded throughput across w1..w5 is unchanged within noise (Δ kops swings from -4.4% to +3.5% with no consistent sign across workloads). More details below.

Root cause

homa_rpc_alloc_server sets RPC_PKTS_READY and fires homa_rpc_handoff for the first packet of a new server RPC, before anyone has actually put the skb on msgin.packets.

homa_data_pkt drops the bucket lock to call homa_rpc_acked() for any piggy-backed ACK. That happens before homa_add_packet. The unlock is mandatory: homa_rpc_acked needs to grab other RPCs' locks, so we can't hold this one.

```
softirq (holds bucket lock)        recvmsg side (woken up, blocked on lock)
-------------------------          ----------------------------------------
alloc_server: set RPC_PKTS_READY
              homa_rpc_handoff --wake-->
                                   wait_shared -> pulls rpc -> tries lock
homa_data_pkt:
   ack: rpc_unlock           --------->  gets the lock
        homa_rpc_acked              homa_copy_to_user
                                      skb_peek == NULL  <-- empty queue
                                      clear_bit(RPC_PKTS_READY)
                                      break
                                    rpc_unlock -> goes back to sleep
        rpc_lock              <----
   homa_add_packet (skb finally on the queue)
   set RPC_PKTS_READY (was just cleared)
   homa_rpc_handoff       --wake--> (second wake; this one delivers)
```

More details on fix measurement

Two CloudLab xl170 nodes (E5-2640 v4 @ 2.40 GHz, 20 logical cores, 25 GbE Mellanox), small-lan profile, both on Linux 6.17.8 mainline (the version the upstream README says works).

For each branch (fix-handoff-twice-reproduce for baseline metrics; fix-handoff-twice with the metric overlay for the fix):

  1. Patch cloudlab/bin/config's VLAN regex inet 10\.0\.1\. -> inet 10\.10\.1\. (current small-lan uses the latter).
  2. make all && cd util && make cp_node.
  3. For each cell:
    • Unloaded 64 B: client --ports 1 --port-receivers 0 --client-max 1 --workload 64; server --ports 1 --port-threads 1. 5 x ~10 s.
    • Loaded w1..w5: cperf 25 Gbps defaults, client --ports 3 --port-receivers 3 --client-max 200 --gbps 0; server --ports 3 --port-threads 3. 5 x 30 s.
  4. Reload homa.ko + run cloudlab/bin/config homa <ko> nic power rps between every trial. Without that, server-side state accumulates and contaminates loaded numbers (tested it; the variance is wild without per-trial reset).
  5. Read /proc/net/homa_metrics after each trial, divide.

The probe is one INC_METRIC call at the top of homa_rpc_handoff plus a u64 field. ~16 LoC across 3 files + a 10-line shell helper. See fix-handoff-twice-reproduce.

For single-packet messages, each RPC should see exactly one handoff, so a ratio > 1 is the race signal. Loaded ratios aren't reported below: for messages larger than one MTU, each packet that lands after the receiver has drained the queue legitimately fires its own wake-up, so the metric stops measuring the race.

Unloaded 64 B

|          | kops  | P50 µs | P99 µs | ratio |
|----------|-------|--------|--------|-------|
| baseline | 50.27 | 17.96  | 38.81  | 1.148 |
| fix      | 49.99 | 18.46  | 37.91  | 1.000 |

Race closed. Latency / throughput delta is within the per-trial jitter (~0.5 µs, ~3% kops trial-to-trial).

Loaded, cperf 25 Gbps defaults

| workload | baseline kops | fix kops | Δ kops | baseline P50 | fix P50 | baseline P99 | fix P99 |
|----------|---------------|----------|--------|--------------|---------|--------------|---------|
| w1 | 404.50 | 386.56 | -4.4 % | 449     | 492     | 540    | 581    |
| w2 | 378.76 | 386.57 | +2.1 % | 476     | 468     | 612    | 608    |
| w3 | 309.07 | 319.97 | +3.5 % | 568     | 529     | 866    | 812    |
| w4 | 32.94  | 32.45  | -1.5 % | 104     | 104     | 197 ms | 203 ms |
| w5 | 4.81   | 4.68   | -2.7 % | 23.8 ms | 23.7 ms | 136 ms | 150 ms |

(P50/P99 in µs unless marked.) The Δ swings from -4.4 % to +3.5 % with no consistent sign across workloads; that's noise. Within-variant variance is comparable: baseline-w1's 5 trials span 385.93-413.60 kops (~7%), fix-w1 spans 370.52-411.32 (~11%). The cross-variant deltas are smaller than that.
