
Speed up host-arm workflow via native execution on Arm64 runners#318

Open
DrXiao wants to merge 2 commits into sysprog21:master from DrXiao:workflow/use-arm64-runner-native

Conversation

@DrXiao
Collaborator

@DrXiao DrXiao commented Jan 24, 2026

The proposed changes modify the build system to allow native execution on AArch64 when targeting Arm32, and update the workflow definitions for host-arm to use Arm64 runners and perform native bootstrapping and other validations.

Advantages

The original host-arm workflow uses run-on-arch-action to validate that bootstrapping and the test cases run successfully. However, this job often takes a long time (6-8 minutes) because it relies on QEMU (*) to perform the build and run the test cases.

(*) If I understand correctly, run-on-arch-action actually uses qemu-system to execute the specified commands, which makes the emulation slow.

After applying the proposed changes, the job can be sped up via native execution on Arm64 runners. According to the results, host-arm can complete bootstrapping and test case validations within 2 minutes.

As a result, the workflow is noticeably faster.


Summary by cubic

Run the host-arm workflow natively on Arm64 runners by enabling Arm32 execution on AArch64. This removes QEMU emulation and cuts the job from ~6–8 minutes to ~2 minutes.

  • Refactors
    • Allow native Arm32 execution on AArch64 via machine model detection; only use qemu-arm when the host model is not allowed.
    • Switch host-arm job to ubuntu-24.04-arm and install armhf libc and required tools.
    • Replace run-on-arch-action with direct make, check-sanitizer, and unit test steps.

Written for commit 30b1d7a. Summary will update on new commits.

cubic-dev-ai[bot]

This comment was marked as resolved.

@DrXiao
Collaborator Author

DrXiao commented Jan 24, 2026

By the way, I have another branch, workflow/use-arm64-runner, to test whether using qemu-user with Arm64 runners improves performance.

The commit and the test results are provided as follows:
commit 992d1a1
host-arm (static) - 2m 41s
host-arm (dynamic) - 2m 2s

In summary, although using qemu-user with Arm64 runners also improves performance, native execution is faster.

mk/arm.mk Outdated
@@ -1,4 +1,4 @@
-ARCH_NAME = armv7l
+ARCH_NAME = armv7l aarch64
Collaborator

@jserv jserv Jan 25, 2026


This is confusing. Not all Arm64 machines support the A32 ISA; therefore, the supported instruction sets must be validated explicitly, without assuming that Arm64 implies Arm32 compatibility.

Collaborator Author


After further investigation, it appears that there is no way to validate whether a machine supports the A32 ISA using shell commands.

Therefore, I would like to propose an alternative approach: introducing a new variable, FORCE_NATIVE, to control whether native execution should be enforced. For example:

$ make                 # Let the build system automatically determine the proper way to run
$ make FORCE_NATIVE=1  # Enforce native execution

By default, the build system performs native execution only when the machine architecture is armv7l; otherwise, QEMU is used to run the generated executables. However, if a user has confirmed that their machine can use native execution, they can manually set FORCE_NATIVE=1 to enforce it.
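As a sketch, the logic could look like the following Makefile fragment (TARGET_EXEC is a hypothetical variable standing in for whatever prefix the real mk/arm.mk uses to launch the generated executables):

```make
# Hypothetical mk/arm.mk fragment: choose how to run Arm32 executables.
HOST_ARCH := $(shell uname -m)

ifeq ($(FORCE_NATIVE),1)
    # User asserts the host CPU can execute A32 binaries natively.
    TARGET_EXEC :=
else ifeq ($(HOST_ARCH),armv7l)
    # Native Arm32 host: no emulator needed.
    TARGET_EXEC :=
else
    # Any other host: fall back to user-mode emulation.
    TARGET_EXEC := qemu-arm
endif
```

With this shape, `make FORCE_NATIVE=1` on an Arm64 runner would skip qemu-arm entirely, while a plain `make` keeps the current conservative behavior.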


For Arm64 runners, I found a reference on the Arm Developer website. Although GitHub's official documentation does not describe the hardware details of Arm64 runners, the referenced page states:

Arm-hosted runners are powered by Cobalt 100 processors, based on the Arm Neoverse N2. The free runners have 4 vCPUs and Armv9-A features including Scalable Vector Extension 2 (SVE2).

From this information, we can conclude that GitHub's Arm64 runners use Arm Neoverse N2 processors to run workflows on VMs. According to the Arm Neoverse N2 specifications, the supported ISAs explicitly include A32.

Thus, if the proposed FORCE_NATIVE approach is acceptable, we can directly add a workflow step to run make FORCE_NATIVE=1, allowing native execution on Arm64 runners. As a result, the speedup can still be achieved.

Collaborator Author


The above description represents my initial view. I can research further and provide more information if necessary.

Collaborator


We can introduce an allow list of machines with verified A32 support, eliminating the need for the manual FORCE_NATIVE option.

The original build system automatically uses qemu-arm to run the
compiler when targeting Arm32 if the native environment is not armv7l.

However, modern Linux distributions typically enable CONFIG_COMPAT,
which allows a 64-bit system to execute 32-bit binaries directly without
any emulator.

Therefore, this commit updates the build system to detect the user's
machine model and enable native execution when it is supported.
@DrXiao DrXiao force-pushed the workflow/use-arm64-runner-native branch from 47f6100 to 8417f89 on February 1, 2026, 14:02
Since the build system has been updated to allow native execution on
AArch64 when targeting Arm32, this commit improves the workflow
definitions by running the "host-arm" job on Arm64 runners to perform
validation.

As a result of these changes, the "host-arm" job no longer uses
"run-on-arch-action".
@DrXiao DrXiao force-pushed the workflow/use-arm64-runner-native branch from 8417f89 to 30b1d7a on February 1, 2026, 14:16
@DrXiao
Collaborator Author

DrXiao commented Feb 1, 2026

The build system has been updated to detect the user's machine using lscpu and allow native execution if the user's machine model is in the allow list.

I have verified the allow list approach on the following hardware platforms:

  • BeagleBone Black (Cortex-A8)
  • Raspberry Pi 5 (Cortex-A76)

After adopting the new approach, the above platforms are still able to perform native execution.
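The decision can be sketched in POSIX shell (a sketch only: the function name is illustrative, the allowlist mirrors the one in mk/arm.mk, and the model string is passed in explicitly so the logic can be shown without real hardware; LC_ALL=C pins the lscpu output format across locales):

```shell
#!/bin/sh
# Sketch: map a CPU model (as reported by `LC_ALL=C lscpu`) to either
# native execution or qemu-arm emulation.
ALLOW_MACHINES="Cortex-A8 Cortex-A53 Cortex-A72 Cortex-A76 Neoverse-N2"

pick_runner() {
    model="$1"
    for m in $ALLOW_MACHINES; do
        case "$model" in
            *"$m"*) echo "native"; return ;;
        esac
    done
    echo "qemu-arm"
}

# On a real system the model would come from:
#   LC_ALL=C lscpu | sed -n 's/^Model name:[[:space:]]*//p'
RUNNER=$(pick_runner "Neoverse-N2")
echo "$RUNNER"
```

On GitHub's Arm-hosted runners this would select native execution, while an unlisted model (say, Cortex-A57) falls back to qemu-arm.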

@DrXiao DrXiao requested a review from jserv February 1, 2026 15:24
# | Neoverse-N2 | ARMv9-A | | GitHub's |
# | | | | Arm-hosted runners |
# +-------------+--------------+-----------------+--------------------+
ALLOW_MACHINES = Cortex-A8 Cortex-A53 Cortex-A72 Cortex-A76 Neoverse-N2
Collaborator


We usually refer to Raspberry Pi as a machine, while Arm Cortex-A72 is treated as an architecture. In this context, machine denotes a concrete hardware platform, distinguishing it from architectural descriptions or CPU cores (e.g. Cortex-A53) and their supported hardware features.

Collaborator Author


If we hope to define the allowlist in terms of concrete machines (e.g., ALLOW_MACHINES = "Raspberry Pi 5" "BeagleBone Black"), the only way to identify the running machine appears to be reading /proc/device-tree/model.

# BeagleBone Black
$ cat /proc/device-tree/model 
TI AM335x BeagleBone Black

# Raspberry Pi 5
$ cat /proc/device-tree/model 
Raspberry Pi 5 Model B Rev 1.0

# On desktop computers or Arm-hosted runners, /proc/device-tree/model does not exist

Then, we can use /proc/device-tree/model to determine whether the running machine is in the allowlist.

Although this approach may enable Raspberry Pi and BeagleBone boards to perform native execution, it does not work for Arm-hosted runners because a reliable machine name cannot be obtained on those runners.

(It seems difficult to find a consistent way to obtain the running machine name across different hardware platforms.)
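For reference, the check could be sketched as a small shell helper (the function name and allowlist entries are illustrative; on real hardware the argument would be /proc/device-tree/model):

```shell
#!/bin/sh
# Sketch: print 1 if the device-tree model file names an allowlisted
# board, 0 otherwise (including when the file is absent, as on desktops
# and Arm-hosted runners).
board_native() {
    file="$1"
    if [ -r "$file" ]; then
        # The device-tree model string is NUL-terminated; strip the NUL.
        board=$(tr -d '\000' < "$file")
    else
        board="unknown"
    fi
    case "$board" in
        "Raspberry Pi 5"*|"TI AM335x BeagleBone Black"*) echo 1 ;;
        *) echo 0 ;;
    esac
}
```

For example, `board_native /proc/device-tree/model` would print 1 on a Raspberry Pi 5 and 0 wherever the file does not exist, which illustrates the limitation above: Arm-hosted runners always end up on the 0 path.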

Collaborator

@jserv jserv Feb 2, 2026


If we hope that the allowlist is defined in terms of concrete machines (e.g.: ALLOW_MACHINES = "Raspberry Pi 5" "BeagleBone Black"), the only way to identify the running machine appears to be reading /proc/device-tree/model.
...
On desktop computers or Arm-hosted runners, /proc/device-tree/model does not exist

Instead, you can simply install a prebuilt fastfetch executable to determine machine names/types.

@juice928

juice928 commented Feb 4, 2026

👋 Hi, I'm an automated AI code review bot. I ran some checks on this PR and found 2 points that might be worth attention (could be false positives, please use your judgment):

  1. The hardware detection logic might be sensitive to the system's language settings

    • Location: Makefile:L42
    • Impact: On systems set to non-English locales, the build might fail to identify the host model, potentially triggering unnecessary emulation.
    • Suggestion: Consider prepending LC_ALL=C to the lscpu command to ensure the output format remains consistent across different environments.
  2. The hardware allowlist approach could be expanded to support a wider range of compatible CPUs

    • Location: mk/arm.mk:L19
    • Impact: Compatible hardware not explicitly listed (like Cortex-A57 or A73) might be forced to use QEMU, which could impact execution performance.
    • Suggestion: You might find it more flexible to use general architecture detection via uname -m to support all compatible hosts generically.

If you find these suggestions disruptive, you can reply "stop" , and I'll automatically skip this repository in the future.
