Conversation
Summary of ChangesHello @nirs, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical network stability and performance issue on macOS when using Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly handles the ENOBUFS error on macOS for unixgram sockets by mapping it to WriteError::NothingWritten, which allows the caller to retry sending the frame. This prevents a fatal error when the kernel socket buffer is full. I've added one suggestion to also handle EAGAIN and EWOULDBLOCK in the same way to improve portability and make the error handling more robust across different Unix-like systems.
|
@mtjhrc PTAL |
|
The change LGTM, but what did you mean by this in the commit message:
Wouldn't we get notified via kqueue the socket is writable again? (same as EAGAIN on Linux...) |
I could not find any evidence that we can use non-blocking io to detect if the socket is writable or not. The only info I could find was in FreeBSD mailing list thread explaining that the only way to recover is retrying. Looking at early experiments with vmnet-helper I did not try non-blocking io, or maybe I did not commit this since it did not work. |
|
Hmm, but we are already using a non-blocking socket here. Notably we also use What I am wondering is that, If receiving an |
I'm not sure that edge triggered behavior works for datagram socket. The socket is probably always writable since we never get EAGAIN. But this is a separate issue to investigate, retrying ENOBUFS works. It can be interesting to count retries and be able to get stats so we have some visibility on this issue. Maybe add debug log in this case? |
|
I mean this is definitely good, even if it likely isn't a full fix (like discussed above) - the code change LGTM. It's just the commit message seems misleading: "Map ENOBUFS to WriteError::NothingWritten so the caller retries the frame instead of treating it as a fatal error." - the caller doesn't retry on |
Right, "caller retries" is not a good description. The write is retried when the socket is considered writable (need to test if this happen) or the guest kicks us. I'll try to experiment more and have a more correct description how this is handled. |
|
I feel like we should be getting the EV_WRITE events. I made a simple test and it seemed to work: |
3236fe9 to
c1efb00
Compare
Interesting, but there are no timestamps so it is not clear if we got the event immediately after the read. |
|
@mtjhrc I'm trying the EV_WRITE way, it seems like a simple change. |
When running iperf3 with gvproxy or vmnet-helper, krunkit breaks
randomly with:
[2026-02-19T02:53:41Z ERROR devices::virtio::net::worker] Failed to process rx:
Backend(Internal(ENOBUFS)) (triggered by backend socket readable)
macOS returns ENOBUFS when the kernel socket buffer is full, rather
than blocking or returning EAGAIN on non-blocking sockets. This is
handled by gvproxy and vmnet-helper by retrying the write.
Map ENOBUFS to WriteError::NothingWritten so the write is retried
instead of treating it as a fatal error.
When write_frame() returns NothingWritten, process_tx() pushes the descriptor
back onto the virtqueue (undo_pop) and breaks out of the inner loop, returning
Ok(()). process_tx_loop() retries if we did not finish to process all
entries, or the driver added new entries to the available ring.
This creates a busy retry loop. Adding debug logs shows:
[17:33:13.308027Z] kev: { ident: 77, data: 8 }
[17:33:13.308034Z] write_frame: ENOBUFS
[17:33:13.308038Z] write_frame: ENOBUFS
[17:33:13.308042Z] write_frame: ENOBUFS
...
[17:33:13.308260Z] write_frame: ENOBUFS
[17:33:13.308264Z] write_frame: ENOBUFS
[17:33:13.308268Z] write_frame: ENOBUFS
[17:33:13.308280Z] Written frame size=1514, written=1514
To simulate ENOBUFS in this example I added a 10 milliseconds sleep in
vmnet-helper read loop. The spin lasted 234 microseconds (64 retries at
~3.6us per attempt) before the buffer drained and the write succeeded.
With a slower reader the spin can be much longer.
Assisted-by: Cursor/Claude Opus 4.6
Signed-off-by: Nir Soffer <nirsof@gmail.com>
If write_from() returns WrittenNothing, propagate the error to process_tx_loop() and return to the event loop. The event loop will wake us the when the socket become writable. Previously when the backend could not write anything (EAGAIN, ENOBUFS) we enabled notifications and retry process_tx(), creating a busy loop ending when the write complete. Testing shows that ENOBUFS events are rare and happen only in the bidirectional test with 8 parallel streams. During 600 seconds run we logged 124 ENOBUFS events, retrying up to 1624 times and 12.7 milliseconds. Signed-off-by: Nir Soffer <nirsof@gmail.com>
The default second-resolution timestamps make it impossible to understand the timing of events. Use format_timestamp_micros() in both logger init paths (krun_set_log_level and krun_init_log) so log output shows microsecond precision. Assisted-by: Cursor/Claude Opus 4.6 Signed-off-by: Nir Soffer <nirsof@gmail.com>
c1efb00 to
3a1c999
Compare
|
@mtjhrc Current version should be a complete fix. Tested with vment-helper and podman/gvproxy. I will post updated results later. |
When running iperf3 with gvproxy or vmnet-helper, krunkit breaks randomly with:
macOS returns ENOBUFS when the kernel socket buffer is full, rather than blocking or returning EAGAIN on non-blocking sockets. This is handled by gvproxy and vmnet-helper by retrying the write.
Changes
Retry on ENOBUFS — Map ENOBUFS to
WriteError::NothingWrittenso the write is retried instead of treated as a fatal error.Stop tx loop on WrittenNothing — When a backend cannot write, stop the tx loop and turn to the event loop. The event loop will wakes when the socket is writable.
Microsecond timestamps — Use
format_timestamp_micros()in both logger init paths so timing of events can be understood from the logs.Test results
Tested 600 seconds bidirectional iperf3 runs with 8 streams:
Configuration:
macOS 26.3
iperf3: 3.20
Fixes #555