[RV64_DYNAREC] Add BOX64_DYNAREC_CALLN_RESTORE option for call_n a0/a1 preservation#3576
zqb-all wants to merge 2 commits into ptitSeb:main
Conversation
|
I still don't understand the issue here. All those registers that you restored are caller-saved in the System V ABI. They are not supposed to be preserved when calling a function. Are you sure this is the fix for the coroutine issue? This seems very strange to me. |
|
Yes, they are caller-saved in the x86 System V ABI. But the issue is on the RISC-V side, not the x86 side. In RV64, the x86 registers (RDI/RSI/RDX/RCX/R8/R9/RAX) are mapped directly to RISC-V a0-a6, which are caller-saved in the RISC-V ABI. After the JALR in call_n, a0-a6 are clobbered by the native call, but the dynarec continues using these RISC-V registers as if they still held valid x86 state. call_c already handles this correctly: box64/src/dynarec/rv64/dynarec_rv64_helper.c Lines 633 to 641 in dee9a74. call_n was missing the LOAD_REG part. |
|
But that was the point of mapping the x64 ABI call registers to the RISC-V ABI call registers: so they don't have to be saved when calling a native function. Again, those RV64 regs are exclusively used to map x64 regs here, so them being clobbered should be of no consequence, because all those x64 regs are caller-saved. Do you have a concrete example of where the clobbering is an issue? |
|
You're right, I was thinking about it incorrectly. From an ABI perspective there's no need to restore those registers. This patch did avoid the WeMeet crash, but I need to reinvestigate to find the actual root cause, since it may have just masked the real issue. I'll report back with a concrete trace. |
|
Check, using BOX64_SHOWSEGV=1, if you see a regular signal, like 24 or something similar, that could be linked to a timer. If the coroutines are preemptively multitasked using a combination of a timer signal and longjmp, the issue might be much more complex! |
|
Thanks for the advice. I tried that, and no timer signals were involved. After some tracing, I think the root cause is that WeMeet violates the x86-64 System V ABI: xRDI is a caller-saved register, so its value is undefined after a function returns. This is not a Box64 bug, but could we add a workaround? For example, an environment variable to optionally restore xRDI after call_n? |
|
Sure, why not. I guess it can be useful for every platform that wants to run WeMeet. |
|
A quick analysis with Claude's help shows (disassembly snippets elided):
- At the call site, the coroutine function pointer is hardcoded.
- co_init_context sets up the coroutine stack.
- The coroutine function 0x92ccc0 ends with a sequence in which xRDI depends on 0x92ab30.
- Finally, xRDI depends on ~lock_guard() -> mutex.unlock().

That last function doesn't preserve RDI: the initial value of RDI is saved at rbp-8 but never restored at return.
…1 preservation

On RV64, xRDI is mapped to a0 and xRSI to a1 (the RISC-V return-value registers). After call_n's JALR, a0/a1 contain the native function's return value, silently overwriting xRDI/xRSI. Per the x86-64 ABI this is correct (RDI/RSI are caller-saved), but some programs (e.g. WeMeet's coroutine library) violate the ABI by depending on caller-saved registers surviving function calls.

On native x86-64, RAX and RDI are separate physical registers, so a function returning 0 in RAX doesn't affect RDI. On RV64, the aliasing (xRDI = a0 = return value) causes these ABI-violating programs to crash.

BOX64_DYNAREC_CALLN_RESTORE=1 saves/restores xRDI (a0) and xRSI (a1) around native calls in call_n. Off by default. Enable per-app via box64rc.
|
I am still not convinced this is the right fix for the issue, as the code you show is not doing what you said it's doing. |
|
Do you mean that from this code, it's not clear why the x86 version works? My understanding is that the existing x86 implementation is actually unsafe; it doesn't restore RDI, but relies on the fact that after the last function uses RDI, its value happens to be non-zero. |
From this code, your explanation doesn't add up. I see no evidence that RDI should be preserved for some external call. This is absolutely not proof at all. I still think your change is just a lucky one that happens to fix the startup case on your machine, but there is no guarantee it will work on other machines, or that it actually fixes anything at all.
|
Restoring RDI to a non-zero value has the direct effect of preventing the co_jump_to_link function from calling exit(1) due to detecting RDI as zero. However, I haven't thoroughly researched what the coroutine function actually does internally, or what role RDI plays after passing the checkpoint in co_jump_to_link. |
|
I also feel this approach isn't ideal. This patch essentially restores an earlier RDI value, so if the x86 callee modifies RDI, we won't be able to stay consistent with it. I just haven't thought of a better way to handle this problem for now. |
RV64 maps x86 registers directly to RISC-V a0-a6, which are caller-saved and clobbered by native function calls. Unlike ARM64 where x86 state lives in callee-saved regs (x9+), RV64's call_n was not restoring these registers after JALR, leaving garbage values that propagated through subsequent dynarec blocks.
This caused crashes in coroutine-heavy code (e.g. WeMeet) where a wrapped function's return value (0) in a0/xRDI was misinterpreted as a null context pointer by co_jump_to_link, triggering exit(1).