@SherrinZhou

The perftest framework makes extensive use of alarm() to control test duration (--duration) and to schedule periodic tasks.

Functions such as run_iter_bw(), run_iter_lat_send(), and run_iter_bi() install a handler via signal(SIGALRM, catch_alarm) when the -D option is used, and then set an alarm.

In run_iter_bw_server() and run_iter_bi(), a watchdog is also installed in iterations mode via signal(SIGALRM, check_alive) followed by alarm(60) to detect stalled tests.
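The watchdog pattern described above can be sketched roughly as follows. This is only an illustration, not perftest's code: `check_alive_stub`, `watchdog_ticks`, and `arm_watchdog` are hypothetical names, and the logic of the real `check_alive` handler is not reproduced here.

```c
#include <signal.h>
#include <unistd.h>

/* Hypothetical counter. A volatile sig_atomic_t is the portable type
 * for a flag or counter that a signal handler modifies. */
static volatile sig_atomic_t watchdog_ticks = 0;

/* Hypothetical stand-in for perftest's check_alive handler. */
static void check_alive_stub(int signo)
{
    (void)signo;
    watchdog_ticks++;
}

/* Install the handler and schedule one SIGALRM after `seconds`.
 * Returns 0 on success, -1 if the handler could not be installed. */
static int arm_watchdog(unsigned int seconds)
{
    if (signal(SIGALRM, check_alive_stub) == SIG_ERR)
        return -1;
    alarm(seconds);
    return 0;
}
```

Calling `arm_watchdog(60)` mirrors the `signal(SIGALRM, check_alive)` plus `alarm(60)` sequence; any blocking system call in progress when the alarm expires can then be interrupted.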

In the problematic case, run_iter_bi() with the -e option invokes ctx_notify_send_recv_events(), which performs a select() on two file descriptors:

ctx->recv_channel->fd — CQ receive completion channel

ctx->send_channel->fd — CQ send completion channel

When a completion event is generated, the kernel marks the corresponding file descriptor readable and select() returns.

However, on some NICs with low processing speed, no completion event is generated within 60 seconds (for example, when the test has not yet finished under a high-pressure workload). The watchdog alarm() then fires and delivers SIGALRM, which interrupts the blocking select() call: select() returns -1 with errno set to EINTR. The function treats this as a failure and exits with an error instead of retrying.

This behavior exposes a robustness issue in perftest: SIGALRM in this context is meant only as a check-alive signal, not as a fatal condition. A select() call interrupted by SIGALRM should be restarted rather than causing an unexpected termination.

This patch updates perftest to properly handle EINTR by retrying select() when it is interrupted by SIGALRM, ensuring correct behavior even under slow device processing conditions.

Signed-off-by: Ruizhe Zhou <zhouruizhe@resnics.com>