procd: add support for OCI runtime spec 1.3.0 by dangowrt · Pull Request #37 · openwrt/procd

dangowrt · 2026-05-08T14:14:10Z

This updates procd-jail and uxc to OCI runtime spec 1.3.0, and also (with the help of Claude) improves seccomp support (new RISCV64 arch and cleanup of tracing, needs testing!)

Should fix #35

dangowrt · 2026-05-08T14:17:23Z

@dhewg would appreciate your review!

dhewg · 2026-05-09T07:14:33Z

Nice, will take a look!
But before I test drive that on a franken setup, mind enlighten me in #38?

Use size_t instead of int, in order to avoid garbage values being used inside calloc_a. Signed-off-by: Felix Fietkau <nbd@nbd.name>

Juff-Ma · 2026-05-22T20:01:24Z

@dangowrt While this PR does indeed fix part of my issue in #35, the part I included in the edit does this happen.

When actually trying to run a OCI container UXC will throw
user.err : jail: parsing of OCI JSON spec has failed: Operation not permitted (1)
on the log
uxc error: No such file or directory
on the command line and
[ ] 12b5d8f454bd3a9ea29ddae6e6523c96e3d2cb5f19f0a898005b37feb365a32a stopped exitcode: 1 (Operation not permitted)
when running uxc list

The directory in the uxc json file is present and filled with files:

tree /data/podman/storage/overlay-containers/12b5d8f454bd3a9ea29ddae6e6523c96e3d2cb5f19f0a898005b37feb365a32a/userdata/
├── artifacts
├── attach
├── config.json
├── ctl
├── ctr.log
├── secrets
├── shm
└── winsz

The command is just podman run quay.io/podman/hello
I've attached the config.json it generates. I changed nothing, only formatting it for readability.

_add_mount() returned 1 (which happens to equal EPERM) for any duplicate target in the mounts AVL tree. The loops in parseOCIlinux() that register linux.readonlyPaths and linux.maskedPaths treat any non-zero return as fatal, so a config.json with a repeated entry (e.g. podman emits /sys/devices/virtual/powercap in its default maskedPaths *and* appends it again as the CVE-2020-8694 mitigation) aborted parseOCI() with the bogus diagnostic parsing of OCI JSON spec has failed: Operation not permitted (1) Compare the incoming registration against the existing entry instead. If source, filesystemtype, optstr, mountflags, propflags, error and inner all match, the call is idempotent and returns 0. Any divergence returns EEXIST so a conflicting re-registration at the same target is rejected loudly rather than silently shadowed by whichever entry happened to land in the AVL first. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The version check rejected anything that did not start with "1.0", blocking every bundle authored against spec 1.1.0 or later. OpenWrt is a monolithic build where every container package must conform to the distribution's current spec, so there is no need to keep separate code paths for the 1.0.x semantics that 1.1.0 changed (e.g. poststart hook failures becoming fatal, prestart's role overlapping createRuntime). Tighten the prefix match to require any "1.x" with x >= 1. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

dangowrt · 2026-05-23T00:49:57Z

@Juff-Ma I think I've found the issue, just analyzing config.json and tracing where EPERM could come from. Please retry (fix it in 6cd2985 )

Juff-Ma · 2026-05-23T10:01:26Z

@Juff-Ma I think I've found the issue, just analyzing config.json and tracing where EPERM could come from. Please retry (fix it in 6cd2985 )

@dangowrt Your fix does indeed resolve the EPERM issue however it still does not work. Instead of error 1 EPERM I now get error 22 Invalid Argument. UXC stills ays No such file or sirectory with this on the log:

Sat May 23 11:19:15 2026 user.err : jail: parsing of OCI JSON spec has failed: Invalid argument (22)

and this in the listing:

[ ] f72077ecdaa5d6724a1a7118e4f952e14619ded9f17d804a6ebe1946e5e142cc stopped exitcode: 22 (Invalid argument)

config file/command is the same. Note that this also happens when running uxc start manually, so it's not an invalid argument to the uxc command itself (which I originally thought)

The issue for this happening seems to be seccomp. Since when disabling seccomp the error disappears when using UXC directly. Now it can run the container. However when trying to attach via uxc start --console NAME I get ubus request failed. (When trying to attach via -it I get Error: OCI runtime error: uxc: /sbin/uxc: unrecognized option: console-socket)

Another point is, it really only works when using uxc directly. When it is execut by podman I get the following error still:

> uxc error: No such file or directory
> ERRO[0000] Removing container 501a88cc497d6c1bb784f679a228cc50d1676a1e3e2a542a03400492174cef97 from runtime after creation failed
> Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

Also note that despite what's being said, the container is NOT removed from uxc.

Edit: here's the full podman log:

podman run --rm --security-opt seccomp=unconfined --log-level=debug e7db/geekbench
INFO[0000] podman filtering at log level debug
DEBU[0000] Called run.PersistentPreRunE(podman run --rm --security-opt seccomp=unconfined --log-level=debug e7db/geekbench)
INFO[0000] Setting parallel job count to 7
DEBU[0000] Using conmon: "/usr/bin/conmon"
INFO[0000] Using sqlite as database backend
DEBU[0000] Using graph driver overlay
DEBU[0000] Using graph root /data/podman/storage
DEBU[0000] Using run root /tmp/run/containers/storage
DEBU[0000] Using static dir /data/podman/static
DEBU[0000] Using tmp dir /var/run/libpod
DEBU[0000] Using volume path /data/podman/volumes
DEBU[0000] Using transient store: false
DEBU[0000] [graphdriver] trying provided driver "overlay"
DEBU[0000] Cached value indicated that overlay is supported
DEBU[0000] Cached value indicated that overlay is supported
DEBU[0000] Cached value indicated that metacopy is not being used
DEBU[0000] Cached value indicated that native-diff is usable
DEBU[0000] backingFs=extfs, projectQuotaSupported=false, useNativeDiff=true, usingMetacopy=false
DEBU[0000] Initializing event backend none
DEBU[0000] Configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] Configured OCI runtime runj initialization failed: no valid executable found for OCI runtime runj: invalid argument
DEBU[0000] Configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] Configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] Configured OCI runtime youki initialization failed: no valid executable found for OCI runtime youki: invalid argument
DEBU[0000] Configured OCI runtime krun initialization failed: no valid executable found for OCI runtime krun: invalid argument
DEBU[0000] Configured OCI runtime ocijail initialization failed: no valid executable found for OCI runtime ocijail: invalid argument
DEBU[0000] Configured OCI runtime crun initialization failed: no valid executable found for OCI runtime crun: invalid argument
DEBU[0000] Configured OCI runtime crun-vm initialization failed: no valid executable found for OCI runtime crun-vm: invalid argument
DEBU[0000] Configured OCI runtime crun-wasm initialization failed: no valid executable found for OCI runtime crun-wasm: invalid argument
DEBU[0000] Using OCI runtime "/sbin/uxc"
DEBU[0000] Pulling image e7db/geekbench (policy: missing)
DEBU[0000] Looking up image "e7db/geekbench" in local containers storage
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux  [] }
DEBU[0000] Loading registries configuration "/etc/containers/registries.conf"
DEBU[0000] Trying "localhost/e7db/geekbench:latest" ...
DEBU[0000] reference "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]localhost/e7db/geekbench:latest" does not resolve to an image ID
DEBU[0000] Trying "docker.io/e7db/geekbench:latest" ...
DEBU[0000] parsed reference into "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage ([overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e)
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Looking up image "e7db/geekbench" in local containers storage
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux  [] }
DEBU[0000] Trying "localhost/e7db/geekbench:latest" ...
DEBU[0000] reference "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]localhost/e7db/geekbench:latest" does not resolve to an image ID
DEBU[0000] Trying "docker.io/e7db/geekbench:latest" ...
DEBU[0000] parsed reference into "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage ([overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e)
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] using systemd mode: false
DEBU[0000] No hostname set; container's hostname will default to runtime default
DEBU[0000] Successfully loaded network podman: &{podman 5ef894788befd4d42498314b6e66282ca730aa2e1e82f9b9597bf4d1725ca074 bridge podman0 2026-05-20 16:26:39.519216229 +0000 UTC [{{{172.17.0.0 ffff0000}} 172.17.0.1 <nil>}] [] false false true [] map[] map[] map[driver:host-local]}
DEBU[0000] Successfully loaded 1 networks
DEBU[0000] Allocated lock 3 for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Cached value indicated that idmapped mounts for overlay are supported
DEBU[0000] Created container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c"
DEBU[0000] Container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c" has work directory "/data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata"
DEBU[0000] Container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c" has run directory "/tmp/run/containers/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata"
DEBU[0000] Not attaching to stdin
INFO[0000] Received shutdown.Stop(), terminating!        PID=22063
DEBU[0000] Enabling signal proxying
DEBU[0000] Cached value indicated that volatile is being used
DEBU[0000] overlay: mount_data=lowerdir=/data/podman/storage/overlay/l/CVIWIVQJHOTYEVPQEJXPDF274I:/data/podman/storage/overlay/l/5NGBTSZRZWVJIPCS2O5DUC4KPW:/data/podman/storage/overlay/l/XNKYBFRIFIOYSMOMF4WDXIIA77,upperdir=/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/diff,workdir=/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/work,nodev,volatile
DEBU[0000] Mounted container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c" at "/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/merged"
DEBU[0000] Created root filesystem for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c at /data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/merged
DEBU[0000] Made network namespace at /run/netns/netns-3b795775-1b32-3863-6bff-7e3457bd7d8e for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
[DEBUG netavark::network::validation] "Validating network namespace..."
[DEBUG netavark::commands::setup] "Setting up..."
[INFO  netavark::firewall] Using iptables firewall driver
[DEBUG netavark::network::bridge] Setup network podman
[DEBUG netavark::network::bridge] Container interface name: eth0 with IP addresses [172.17.0.7/16]
[DEBUG netavark::network::bridge] Bridge name: podman0 with IP addresses [172.17.0.1/16]
[DEBUG netavark::network::core_utils] Setting sysctl value for net.ipv4.ip_forward to 1
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv4/conf/podman0/rp_filter to 2
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv6/conf/eth0/autoconf to 0
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv4/conf/eth0/arp_notify to 1
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv4/conf/eth0/rp_filter to 2
[INFO  netavark::network::netlink] Adding route (dest: 0.0.0.0/0 ,gw: 172.17.0.1, metric 100)
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-1D8721804F16F created on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_2 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_2 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_3 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_3 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_INPUT exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_INPUT exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_FORWARD exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_FORWARD exists on table filter
[DEBUG netavark::firewall::varktables::helpers] rule -d 172.17.0.0/16 -j ACCEPT created on table nat and chain NETAVARK-1D8721804F16F
[DEBUG netavark::firewall::varktables::helpers] rule ! -d 224.0.0.0/4 -j MASQUERADE created on table nat and chain NETAVARK-1D8721804F16F
[DEBUG netavark::firewall::varktables::helpers] rule -s 172.17.0.0/16 -j NETAVARK-1D8721804F16F created on table nat and chain POSTROUTING
[DEBUG netavark::firewall::varktables::helpers] rule -p udp -s 172.17.0.0/16 --dport 53 -j ACCEPT created on table filter and chain NETAVARK_INPUT
[DEBUG netavark::firewall::varktables::helpers] rule -p tcp -s 172.17.0.0/16 --dport 53 -j ACCEPT created on table filter and chain NETAVARK_INPUT
[DEBUG netavark::firewall::varktables::helpers] rule -m conntrack --ctstate INVALID -j DROP exists on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] rule -d 172.17.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT created on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] rule -s 172.17.0.0/16 -j ACCEPT created on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-SETMARK exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-SETMARK exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-MASQ exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-MASQ exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-DNAT exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-DNAT exists on table nat
[DEBUG netavark::firewall::varktables::helpers] rule -j MARK  --set-xmark 0x2000/0x2000 exists on table nat and chain NETAVARK-HOSTPORT-SETMARK
[DEBUG netavark::firewall::varktables::helpers] rule -j MASQUERADE -m comment --comment 'netavark portfw masq mark' -m mark --mark 0x2000/0x2000 exists on table nat and chain NETAVARK-HOSTPORT-MASQ
[DEBUG netavark::firewall::varktables::helpers] rule -j NETAVARK-HOSTPORT-DNAT -m addrtype --dst-type LOCAL exists on table nat and chain PREROUTING
[DEBUG netavark::firewall::varktables::helpers] rule -j NETAVARK-HOSTPORT-DNAT -m addrtype --dst-type LOCAL exists on table nat and chain OUTPUT
[DEBUG netavark::dns::aardvark] Spawning aardvark server
[DEBUG netavark::dns::aardvark] start aardvark-dns: ["/usr/lib/podman/aardvark-dns", "--config", "/run/containers/networks/aardvark-dns", "-p", "53", "run"]
[DEBUG netavark::commands::setup] {
        "podman": StatusBlock {
            dns_search_domains: Some(
                [
                    "dns.podman",
                ],
            ),
            dns_server_ips: Some(
                [
                    172.17.0.1,
                ],
            ),
            interfaces: Some(
                {
                    "eth0": NetInterface {
                        mac_address: "b2:70:f0:b5:a8:f1",
                        subnets: Some(
                            [
                                NetAddress {
                                    gateway: Some(
                                        172.17.0.1,
                                    ),
                                    ipnet: 172.17.0.7/16,
                                },
                            ],
                        ),
                    },
                },
            ),
        },
    }
[DEBUG netavark::commands::setup] "Setup complete"
DEBU[0000] Not modifying container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c /etc/passwd
DEBU[0000] Modifying container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c /etc/group
DEBU[0000] Setting Cgroup path for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c to /libpod_parent/libpod-6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] reading hooks from /usr/share/containers/oci/hooks.d
DEBU[0000] Workdir "/opt/geekbench" resolved to host path "/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/merged/opt/geekbench"
DEBU[0000] Created OCI spec for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c at /data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/config.json
DEBU[0000] /usr/bin/conmon messages will be logged to syslog
DEBU[0000] running conmon: /usr/bin/conmon               args="[--api-version 1 -c 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c -u 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c -r /sbin/uxc -b /data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata -p /tmp/run/containers/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/pidfile -n modest_hoover --exit-dir /var/run/libpod/exits --persist-dir /var/run/libpod/persist/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c --full-attach -l k8s-file:/data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/ctr.log --log-level debug --syslog --conmon-pidfile /tmp/run/containers/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /data/podman/storage --exit-command-arg --runroot --exit-command-arg /tmp/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --network-config-dir --exit-command-arg /etc/containers/networks/ --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /data/podman/volumes --exit-command-arg --db-backend --exit-command-arg sqlite --exit-command-arg --transient-store=false --exit-command-arg --hooks-dir --exit-command-arg /usr/share/containers/oci/hooks.d --exit-command-arg --runtime --exit-command-arg uxc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev --exit-command-arg --events-backend --exit-command-arg none --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --stopped-only --exit-command-arg --rm --exit-command-arg 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c]"
uxc error: No such file or directory
ERRO[0000] Removing container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c from runtime after creation failed
DEBU[0000] Cleaning up container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Tearing down network namespace at /run/netns/netns-3b795775-1b32-3863-6bff-7e3457bd7d8e for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
[DEBUG netavark::commands::teardown] "Tearing down.."
[INFO  netavark::firewall] Using iptables firewall driver
[INFO  netavark::network::bridge] removing bridge podman0
[DEBUG netavark::commands::teardown] "Teardown complete"
DEBU[0000] Unmounted container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c"
DEBU[0000] Removing container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Cleaning up container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] Container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c storage is already unmounted, skipping...
DEBU[0000] Removing all exec sessions for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c storage is already unmounted, skipping...
DEBU[0000] Temp dir already cleaned up
DEBU[0000] ExitCode msg: "container create failed (no logs from conmon): conmon bytes \"\": readobjectstart: expect { or n, but found \x00, error found in #0 byte of ...||..., bigger context ...||..."
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines

Edit 2: After recompiling OpenWRT with seccomp support and rerunning the error still persists but I get a new error message:

user.err : jail: can't open /sys/fs/cgroup/libpod_parent/libpod-85650516cfe8ea8fa4a9e153b075b7c07cb7f97b75370d59d293ff80a4ffac5c/memory.swap_max: No such file or directory

However inspecting it shows that it is indeed present just with a slightly different name

-rw-r--r-- 1 root root 0 May 23 13:40 memory.swap.max

It looks like a) podman is giving out bas paths that UXC can't find or b) UXC just uses a bad name. Could this be a cgroup v1/v2 thing? According to strings and some googling ujail searches for memory.swap_max which is cgroup v1 only yet the cgroup version that's running is v2 which uses memory.swap.max instead.

dangowrt · 2026-05-26T13:17:38Z

@Juff-Ma I was wrongly assuming that podman, runc and crun would implement the CLI as defined in the spec, but turns out that part of the spec is abandonned and it become more of just a convention. uxc now tries to behave more like runc or crun, eventhough that's not what the spec says (but the spec-compliant behavior also still works). Also many features added which should get it going as a runtime in recent podman.

Now regarding the seccomp problem you were seeing: I've reworked the seccomp support for now also work with "foreign" libc containers (ie. OpenWrt host being musl, container being eg. glibc or bionic). This needs a small change to the procd package Makefile as well, see below

diff --git a/package/system/procd/Makefile b/package/system/procd/Makefile
index cf730a3c0a..c508a447c6 100644
--- a/package/system/procd/Makefile
+++ b/package/system/procd/Makefile
@@ -30,12 +30,14 @@ PKG_CONFIG_DEPENDS:= \
 
 include $(INCLUDE_DIR)/package.mk
 include $(INCLUDE_DIR)/cmake.mk
+include $(INCLUDE_DIR)/kernel.mk
 
 ifeq ($(DUMP),)
   STAMP_CONFIGURED:=$(strip $(STAMP_CONFIGURED))_$(shell echo $(CONFIG_TARGET_INIT_PATH) | $(MKHASH) md5)
 endif
 
-CMAKE_OPTIONS += -DEARLY_PATH="$(TARGET_INIT_PATH)"
+CMAKE_OPTIONS += -DEARLY_PATH="$(TARGET_INIT_PATH)" \
+		 -DNOLIBC_INCLUDE_DIR="$(LINUX_DIR)/tools/include/nolibc"
 
 define Package/procd/Default
   SECTION:=base

Juff-Ma · 2026-05-26T14:50:33Z

@dangowrt I've tested your build and it has the same problem I reported in Edit 2 of my previous message. The seccomp error (while I appreciate it being fixed) was easily solvable by just compiling OpenWRT with seccomp support.

Through a bit of grepping and googling I eventually made this simple one line patch but still got an error. I could try combining my change with your patchset and see if that fixes anything.

The OCI 1.0.0 spec field process.consoleSize was never parsed, so bundles requesting a specific initial pty size silently got the kernel default. Add the field to the process policy and call ioctl(TIOCSWINSZ) on the master fd after unlockpt when both height and width are set. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The OCI 1.0.0 spec field linux.personality was never parsed, so bundles asking for LINUX32 on a 64-bit kernel ran with the host personality instead of being rejected. Parse the field and accept only when 'domain' matches the kernel's native personality (read via personality(0xFFFFFFFF)); return ENOTSUP otherwise, since OpenWrt does not support cross-personality execution. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The OCI 1.1.0 spec field process.scheduler was unsupported, so bundles selecting SCHED_FIFO/RR/DEADLINE/etc. fell back to the inherited host policy. Parse the field and apply via sched_setattr just before execve; direct syscall used since musl/glibc lack a wrapper. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The OCI 1.1.0 spec field process.ioPriority was unsupported, so block I/O priority defaulted to whatever the parent inherits. Parse class and priority and apply via ioprio_set just before execve, using a direct syscall. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Juff-Ma · 2026-05-26T15:28:39Z

@dangowrt I've tested your build and it has the same problem I reported in Edit 2 of my previous message. The seccomp error (while I appreciate it being fixed) was easily solvable by just compiling OpenWRT with seccomp support.

Through a bit of grepping and googling I eventually made this simple one line patch but still got an error. I could try combining my change with your patchset and see if that fixes anything.

I can not report success. The error about the missing cgroup file is gone but (once again) I still receive a uxc error: no such file or directory without any more info anywhere. The containers are being created though and are shown as such in the list. When I run. The relevant part of podman log has not changed

DEBU[0000] Setting Cgroup path for container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a to /libpod_parent/libpod-a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a
DEBU[0000] reading hooks from /usr/share/containers/oci/hooks.d
DEBU[0000] Workdir "/" resolved to host path "/data/podman/storage/overlay/eea63b9351612b00eeb274fcb81e8a8136008aed4555daec2ff2f3bffdee72c4/merged"
DEBU[0000] Created OCI spec for container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a at /data/podman/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/config.json
DEBU[0000] /usr/bin/conmon messages will be logged to syslog
DEBU[0000] running conmon: /usr/bin/conmon               args="[--api-version 1 -c a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a -u a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a -r /sbin/uxc -b /data/podman/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata -p /tmp/run/containers/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/pidfile -n clever_merkle --exit-dir /var/run/libpod/exits --persist-dir /var/run/libpod/persist/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a --full-attach -l k8s-file:/data/podman/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/ctr.log --log-level debug --syslog --conmon-pidfile /tmp/run/containers/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /data/podman/storage --exit-command-arg --runroot --exit-command-arg /tmp/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --network-config-dir --exit-command-arg /etc/containers/networks/ --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /data/podman/volumes --exit-command-arg --db-backend --exit-command-arg sqlite --exit-command-arg --transient-store=false --exit-command-arg --hooks-dir --exit-command-arg /usr/share/containers/oci/hooks.d --exit-command-arg --runtime --exit-command-arg uxc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --stopped-only --exit-command-arg a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a]"
uxc error: No such file or directory
ERRO[0000] Removing container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a from runtime after creation failed
DEBU[0000] Cleaning up container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a
DEBU[0000] Tearing down network namespace at /run/netns/netns-262c7ca9-dba3-3001-9f44-d565a71f4cc8 for container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a
[DEBUG netavark::commands::teardown] "Tearing down.."
[DEBUG netavark::firewall] Forcibly using firewall driver none
[INFO  netavark::firewall] Not using firewall
[INFO  netavark::network::bridge] removing bridge podman0
[DEBUG netavark::commands::teardown] "Teardown complete"
DEBU[0000] Unmounted container "a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a"
DEBU[0000] ExitCode msg: "container create failed (no logs from conmon): conmon bytes \"\": readobjectstart: expect { or n, but found \x00, error found in #0 byte of ...||..., bigger context ...||..."
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines

The OCI 1.1.0 spec field domainname was unsupported, so containers ran with the host's NIS domain name regardless of bundle configuration. Parse the field and apply via setdomainname after sethostname inside the new UTS namespace. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The OCI 1.1.0 spec field linux.timeOffsets was unsupported, so the container saw the host's CLOCK_MONOTONIC and CLOCK_BOOTTIME values even when a time namespace was created. Parse the per-clock offsets and write them to /proc/self/timens_offsets early in the cloned child, while it is still the only process in the new time namespace. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Bundles using the deprecated prestart hook had it parsed but never run, so any setup it performed was silently dropped. Per OCI spec 1.1.0 prestart and createRuntime fire at the same point; parse it into its own hooklist and run it before createRuntime via a new post_prestart callback in the existing chain. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Per OCI runtime spec 1.1.0 a poststart hook failure now MUST cause the runtime to stop the container and proceed to the delete phase, instead of the 1.0.x "log a warning and continue" behaviour. Track aggregate hook chain failure across run_hooks invocations and, in the poststart return callback, skip the running phase and tear the container down. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The OCI 1.1.0 cgroup-v2 field linux.resources.cpu.burst was unsupported, so bundles requesting CFS bandwidth burst accumulation got no burst. Add the field to the cpu policy and write the value to cpu.max.burst. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The OCI 1.1.0 cgroup-v2 field linux.resources.cpu.idle was unsupported. Add the field to the cpu policy and write the value to cpu.idle, which schedules tasks in the cgroup as SCHED_IDLE. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The OCI 1.1.0 field linux.resources.memory.checkBeforeUpdate was silently dropped through the unknown-attr path. The check it gates against ("reject a new limit lower than current usage") only matters for an OCI 'update' operation, which the runtime spec does not define and which uxc does not implement. Add the field to the memory policy so blobmsg_parse recognises and discards it explicitly; it would be applied if uxc ever grew an update operation. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The pids parser rejected absent limit with EINVAL and treated the value as unsigned, so OCI 1.3.0 bundles relying on optional pids.limit or using -1 for unlimited failed at create. Skip silently when absent and write "max" to pids.max for any negative value. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The seccomp violation handler used PTRACE_POKEUSER (and PTRACE_SET_SYSCALL on arm) to overwrite the tracee's syscall number register with -1 on x86_64, i386, arm, mips and ppc, causing the kernel to skip the offending syscall with -ENOSYS. aarch64 and loongarch64 already lacked this poke and let the syscall proceed transparently. The behaviour was therefore inconsistent across architectures, and the tool's purpose -- collecting the full set of syscalls a program issues so an allowlist can be derived afterwards -- is defeated by the cancellation: the first non-whitelisted call returns ENOSYS, the program typically falls into an error path or exits, and the recorded profile is incomplete. Drop the cancellation on all architectures so SECCOMP_TRACE mode just observes and logs. After the change every supported arch behaves like aarch64/loongarch64 already did. The remaining UTRACE mode (PTRACE_SYSCALL path) was already a pure observer and is unchanged. Why removing the poke is safe (verified against Linux mainline): seccomp(2) defines SCMP_ACT_TRACE / SECCOMP_RET_TRACE such that "if the tracer does not change the system call number, the system call will proceed normally." All four arches that we used to poke implement exactly that contract; the cancellation only worked because procd actively wrote -1 into the syscall register. Concretely: - x86_64: arch/x86/entry/syscall_64.c:do_syscall_64() invokes the syscall via do_syscall_x64(); with nr unchanged the table dispatch runs as usual. The skip path is gated on "nr == -1". - i386: arch/x86/entry/syscall_32.c:do_syscall_32_irqs_on() has the same "else if (nr != -1)" structure. - arm: arch/arm/kernel/entry-common.S after syscall_trace_enter does "cmp scno, #-1; bne 2b" -- only -1 skips, anything else dispatches. - mips: arch/mips/kernel/scall*-*.S has "bltz v0, 1f" after syscall_trace_enter; without a negative override, the syscall is invoked normally. - powerpc: arch/powerpc/kernel/syscall.c uses the trace_enter return value as the syscall number; if unmodified, the dispatcher runs. No arch re-reads or invalidates the syscall number on its own when SECCOMP_RET_TRACE fires, so leaving the register alone is the documented "let it proceed" path on every supported target. Real seccomp enforcement (SCMP_ACT_KILL / SCMP_ACT_ERRNO) is performed inside the kernel by the BPF program and is unaffected by this change; seccomp-trace is and now consistently is a pure profiling tool. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

REG_SYSCALL was carried over from the Chromium seccomp example header, where it is consumed by a SIGSYS handler that reads the offending syscall's register state from the signal context. procd never installed such a handler and never referenced the macro: a tree-wide search across the entire git history finds zero uses outside of the per-arch definitions in this header itself. The seccomp BPF program reads the syscall number from struct seccomp_data, and trace.c uses its own arch-specific reg_syscall_nr offsets (or PTRACE_GET_SYSCALL_INFO). Remove the macro from every arch branch and from the fallback. No functional change. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Without an __riscv branch the header fell into the generic #else block, emitted "Platform does not support seccomp filter yet" and set ARCH_NR to 0. parseOCIlinuxseccomp() then either rejected the bundle (when the container declared SCMP_ARCH_RISCV64 -- AUDIT_ARCH_RISCV64 never matched 0) or, with no architectures listed, generated a BPF program whose arch gate compared against 0 and killed every syscall. seccomp filters could not be applied on RISC-V at all. Map __riscv (xlen == 64) to AUDIT_ARCH_RISCV64 so the existing SCMP_ARCH_RISCV64 mapping in seccomp-oci.c can take effect and the generated BPF program identifies the running architecture correctly. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

The seccomp arch resolver did not know SCMP_ARCH_RISCV64, so OCI 1.1.0 bundles targeting RISC-V 64-bit were rejected with "unknown seccomp architecture". Map it to AUDIT_ARCH_RISCV64. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

container_handle_exec previously fork()ed the grandchild and then, back in the ujail parent, wrote the grandchild's host pid into cgroup.procs via cgroups_attach_pid(). The grandchild was in the container's namespaces from the first instruction but in ujail's own cgroup for the sub-millisecond window between fork() and the parent's write - a noted residual gap. Open the container's cgroup directory once with O_PATH | O_DIRECTORY | O_CLOEXEC via cgroups_open_dir(), let it inherit into the exec_child intermediate, and replace exec_child's inner fork() with clone3(CLONE_INTO_CGROUP) carrying the cgroup fd in clone_args.cgroup. The kernel places the grandchild in the container's cgroup atomically with its creation; the post-fork cgroups_attach_pid() call goes away. cgroups.{c,h} grows cgroups_open_dir() that returns open(cgroup_path, O_PATH | O_DIRECTORY | O_CLOEXEC) or -1 when cgroup_path is unset. Cleanup paths close the fd in both the ujail parent's success-after-fork path and the 'out:' error label. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

cgroups_create() now runs ahead of clone3, so the cgroup tree (mkdir_p + subtree_control writes) exists before any task is created. clone3 then takes CLONE_INTO_CGROUP plus a cgroup directory fd via cgroups_open_dir(), and the kernel places the container init in the destination cgroup atomically with its creation. The post-fork sequence becomes cgroups_configure() (controller value writes + ebpf attach) followed by cgroups_attach_pid() which is now a no-op for the INTO_CGROUP path - the init is already a member - and serves as the fallback when cgroups_open_dir() returned -1 (a deployment where the cgroup tree was unreachable at clone3 time and clone3 ran without INTO_CGROUP). cgroups_apply() stays available for container_handle_update which legitimately wants the full create + configure + attach sequence against an existing container. Kernel 5.7+ for CLONE_INTO_CGROUP, well below the 6.18 target. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

OCI does not standardise a key for Memory-Deny-Write-Execute. Read 'org.openwrt.ujail.mdwe' from the bundle annotations, expect a comma-separated value of refuse_exec_gain[,no_inherit], and call prctl(PR_SET_MDWE, flags) on both the container init (after PR_SET_NO_NEW_PRIVS and the capability drop in post_start_hook) and the exec grandchild (right after its NNP step). The container init prctl is fatal on failure; the exec grandchild prctl exits 127 like the surrounding privilege-drop steps. A workload that legitimately mmaps RWX (JITs) should leave the annotation unset, or pass refuse_exec_gain,no_inherit so the bit is dropped across exec into trusted helper binaries that need W+X. The MDWE bit propagates across fork unconditionally. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

New self-contained jail/landlock.{c,h} wraps the three Landlock syscalls (landlock_create_ruleset, landlock_add_rule, landlock_restrict_self) and exposes a small struct landlock_config that callers grow path-by-path. landlock_available() caches the ABI probe; landlock_apply() builds and installs the ruleset. Activation is opt-in via three bundle annotations: org.openwrt.ujail.landlock.ro = '<path>[:<path>]*' org.openwrt.ujail.landlock.rx = '<path>[:<path>]*' org.openwrt.ujail.landlock.rw = '<path>[:<path>]*' Each adds a LANDLOCK_RULE_PATH_BENEATH rule with the corresponding access mask (READ_FILE | READ_DIR; the same plus EXECUTE; the read/write/create/remove set). With no annotation present landlock_apply() is skipped, preserving runc-compatible allow-all behaviour. As soon as any rule is registered the runtime forces opts.no_new_privs = 1, because landlock_restrict_self requires NNP. The apply call sits in post_start_hook after pivot_root (paths must resolve against the post-pivot view) and before applyOCIlinuxseccomp, so denied paths fail before the workload runs and the seccomp filter still loads on top. CMakeLists.txt links jail/landlock.c into ujail. Reverse-DNS prefix follows the org.openwrt.* convention used by other ujail-specific bundle annotations (mdwe, systemd-cgroup hint). Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Copy libpreload-seccomp.so into a memfd backed by MFD_EXEC | MFD_ALLOW_SEALING, apply F_SEAL_WRITE|GROW|SHRINK, and point LD_PRELOAD at /proc/self/fd/<N>. The workload sees a tamper-evident, read-only copy of the helper. The on-disk .so no longer needs to be visible inside the jail rootfs; only its NEEDED shared-library dependencies (libc, libubox, libblobmsg_json, libdl) are bind-mounted, which preload_load_deps() resolves by walking the dynamic section of the on-host copy. This works for both OCI containers and classic OpenWrt procd jails since both flows construct the environment via build_envp(); the former always mounts procfs, and the latter does so whenever the jail is launched with namespace separation. The memfd is created with MFD_EXEC (not MFD_NOEXEC_SEAL) because ld.so maps the file with PROT_EXEC; likewise F_SEAL_EXEC is not applied for the same reason. close_range() before execve would otherwise close the memfd via CLOSE_RANGE_CLOEXEC, so the descriptor flag is cleared just before the exec. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Compile the seccomp JSON into a cBPF program inside ujail (and inside seccomp-trace), then write the struct sock_filter[] array into a sealed memfd created with MFD_ALLOW_SEALING and locked down with F_SEAL_WRITE | F_SEAL_GROW | F_SEAL_SHRINK. The descriptor number is passed to the workload via the SECCOMP_BPF_FD environment variable; close_range() before execve marks it CLOEXEC, so the flag is cleared just like the preload memfd itself. libpreload-seccomp.so shrinks to a single constructor that mmaps the memfd, calls prctl(PR_SET_NO_NEW_PRIVS) and seccomp(SET_MODE_FILTER), and returns. The __libc_start_main / __uClibc_main interception trick is gone; the constructor mechanism is universal across glibc, musl, uClibc and bionic, so the preload is no longer wedded to the libc the helper itself was built against. Side effects of moving JSON parsing out of the workload: - the .so no longer needs libubox, libblomsg_json or libdl, so those bind-mounts vanish from the jail rootfs entirely (libc is still required, as the workload uses it), - SECCOMP_FILE and SECCOMP_DEBUG are retired from the workload-side env contract; the BPF blob is now the only thing the workload sees, - jail/seccomp.c and jail/seccomp.h are removed since install_syscall_filter() lived solely to bridge the old contract, - seccomp-trace gains a dependency on jail/seccomp-oci.c so it can compile the JSON itself; the contract between procd and seccomp-trace (SECCOMP_FILE env var or -f arg) is unchanged. The filter is still armed inside the workloads address space and only after ld.so has finished its work, which preserves the property that the cBPF program does not have to whitelist execve, the dynamic linker bootstrap, or libc init. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Add a NOLIBC_INCLUDE_DIR CMake variable that the OpenWrt procd recipe can point at the kernel tree (typically $(LINUX_DIR)/tools/include/nolibc). When the variable is set, libpreload-seccomp.so is compiled with -nostdlib -ffreestanding -fno-builtin -include nolibc.h and linked with -nostdlib, leaving the resulting shared object with no DT_NEEDED entries at all. The system libc and libdl no longer have to be visible to the helper, which closes the only remaining "two libcs in one address space" trap when a workload built against a different libc than ujail itself is run under classic procd-jail with -S. When the variable is not set the helper continues to link against the system libc exactly as before; the same preload.c builds in both modes. A #ifndef NOLIBC guard around the system-header block prevents clashes between nolibcs static-inline definitions and the system declarations of getenv / mmap / prctl / etc. The unsetenv call that previously cleared SECCOMP_BPF_FD from the workloads environment is dropped, both because nolibc does not provide unsetenv and because the now-closed fd number is harmless to anything that does not already know about the SECCOMP_BPF_FD contract. A comprehensive top-of-file comment in jail/preload.c documents the contract, the constructor-over-__libc_start_main rationale, the sealed memfd handoff, and the two build modes, so future maintainers and auditors do not have to reverse-engineer the design from the build system. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Broadcast two new ubus events from ujail for OCI lifecycle visibility: instance.ready payload: { service, instance } emitted from post_create_runtime once jail_oci_state has transitioned to OCI_STATE_CREATED (i.e. the OCI pidfile has been written and the container is parked at the start barrier). instance.running payload: { service, instance } emitted from exec_ack_cb after the kernel closes the init's O_CLOEXEC end of a new exec_ack pipe on a successful execve. uxc can wait on this event to return synchronously from "uxc start". The exec_ack pipe is opened in post_main with pipe2(O_CLOEXEC) so the init inherits the CLOEXEC write end. The parent closes its copy immediately after clone3 and registers a uloop_fd callback on the read end; EOF (the kernel auto-closing the CLOEXEC fd at execve) is the success signal. Also fix a latent initialiser bug: jail_oci_state was statically initialised to OCI_STATE_CREATED, which would falsely satisfy handle_start's "must be created to start" guard if any caller raced the post_main / post_create_runtime sequence. Initialise to OCI_STATE_CREATING instead. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

uxc_create, uxc_start and uxc_delete now block until the relevant lifecycle event arrives on ubusd, rather than returning as soon as the triggering ubus.invoke replies: create arms instance.ready watcher for the container, invokes container.add, waits up to 30 s for ujail to broadcast instance.ready (i.e. pidfile written, OCI state CREATED). start arms instance.running watcher, invokes container.<name>.start, waits up to 30 s for ujail to broadcast instance.running after the container init's execve. delete arms a ubus.object.remove watcher for the container.<name> path, invokes container.<name>.delete, waits up to 30 s for ubusd to broadcast the object's removal (ujail's exit unregisters it). The event handler is registered BEFORE the triggering ubus.invoke to avoid losing a fast-arriving event. procd is otherwise uninvolved in this synchronisation; existing ubus consumers see no change. This closes the Podman race ("pidfile not yet created" when uxc returned) without procd having to grow any new state machinery. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

signals_init() installed jail_handle_signal on every catchable signal except SIGCHLD/SIGPIPE/SIGSEGV/SIGSTOP/SIGKILL. When the kernel delivered a synchronous fault signal (SIGBUS, SIGFPE, SIGILL, SIGSYS, SIGABRT, SIGTRAP) the handler forwarded it to the contained process and returned. The kernel then restarted the faulting instruction, re-raised the same signal, and the loop continued forever; the bug manifested as ujail pinning a CPU at 100% with rt_sigreturn / SI_KERNEL SIGBUS in a tight loop. Treat the synchronous fault signals the same way SIGSEGV is already treated: leave them at the default disposition so the kernel terminates the process (and produces a core dump if enabled). Forwarding a synchronous fault to the contained process is meaningless anyway since the fault originated in ujail itself, not in the contained workload. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Both functions declared a single struct pointer and stepped it with free(cur++), as if the underlying buffer were an array of structs. The data structure is actually a NULL-terminated array of POINTERS to struct (parseOCIhook calloc's hooklist as sizeof(struct hook_execvpe *) slots, and parseOCIlinuxsysctl does the same for sysctl). After freeing the first entry the iterator advanced by sizeof(struct) into garbage and either crashed there or, more commonly, returned a non-NULL pointer that fed garbage into free_oci_envp where the actual segfault (or x86 #SS stack-segment trap) manifested. The bug was latent until the new "abort container creation on failed prestart hook" path started triggering free_opts -> free_hooklist with a non-empty hook chain. The OCI runtime-tools hooks.t test reproduces it deterministically. Mirror free_devices, which already iterates correctly via struct mknod_args ** with *(cur++). Signed-off-by: Daniel Golle <daniel@makrotopia.org>

…fsets BPF jt/jf are uint8_t (max forward jump 255 instructions). The previous filter generation emitted one LD syscall_nr followed by N JEQs and a single RET action per OCI syscalls[] group. With more than ~254 names the JEQs at the front of the group needed a jt larger than 255 to reach the trailing RET; the offset silently wrapped modulo 256, sending matched syscalls into the middle of the next group instead of to the ALLOW return. podman's default seccomp profile has one ALLOW group with 370 names; "execve" sat at filter index [061] with jt=16 (truncated from 272), landing on another group's JEQ instead of the RET. The kernel then fell through to the default RET ERRNO(38) and "execve" came back as ENOSYS, leaving the container init spinning on the same ENOSYS-blocked exit(2) syscall. Split each group into chunks of at most SECCOMP_CHUNK_NAMES (240) names. Each chunk emits its own LD syscall_nr + JEQs + (group's args, duplicated) + RET action, so the maximum forward jump stays comfortably below 255 even for the worst case (full chunk with several args). Also drop the stray "assert(idx = start_rule_idx)" assignment-bug that masked size mismatches; the rewritten loop uses correct equality asserts on the per-chunk landmarks. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

OCI runtime-spec >= 1.0.2 mandates that a missing entry in process.capabilities (bounding, effective, inheritable, permitted, ambient) be equivalent to an empty list, i.e. no capabilities. parseOCIcap was returning JAIL_CAP_ALL on a NULL blob, which combined with applyOCIcapabilities's "if (set != JAIL_CAP_ALL)" short-circuit caused the entire drop / set step to be skipped for that capset, so the container kept the runtime's full inherited set. The visible symptom: "podman run --cap-drop ALL" silently granted the full bounding+permitted+effective set because podman sends the OCI shorthand `"capabilities": {}` for the drop-all case and relies on the spec's "missing = empty" semantics. Tested explicitly with `grep ^Cap /proc/self/status` inside containers running each of: default, --cap-drop ALL, --cap-drop ALL --cap-add NET_BIND_SERVICE, and --cap-add SYS_ADMIN; all four now match crun's output. The same parser is used for the legacy procd `-C <file>` path. The OCI contract has always applied: every /etc/capabilities/*.json shipped by OpenWrt base and packages-feed (wpad, ntpd, simple-captive-portal, go2rtc, adguardhome, dnsproxy, rsyncd, snort3, dbus) already lists all five sets explicitly. Permissive handling of missing sets was a bug, not a feature. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Juff-Ma · 2026-05-28T14:06:43Z

I'm making one small step after the other. Upgrading all podman dependencies helped a bit. I needed to patch a build error in conmon but that was easily fixable. (I'm still using the 1s delay for UXC to work around the race condition)

Using the latest tools I now am able to start a container and it actually runs. Well the container does. The app inside not so much.

When trying to get logs I get the error

jail: exec-ing /docker-entrypoint.sh
jail: failed to execve /docker-entrypoint.sh: Function not implemented

Note that /docker-entrypoint.sh is not static. It is the entrypoint specified by the Dockerfile and therefore changes with the container used.

Note that I am not able to recover from this error. My entire shell session freezes and I can't use killall to end podman or conmon.

dangowrt · 2026-05-28T14:09:50Z

There is an architectural problem with how seccomp is applied which break foreign-libc containers (ie. OpenWrt with musl and container eg. with glibc). I've resolved that and tons of other issues, currently last mile of testing, going to push to this branch in the next hours with validated podman working. Meanwhile, please open a PR to update podman in case you didn't do that already and there isn't any existing PR for that (I've opened the PR to update conmon earlier this morning)

Juff-Ma · 2026-05-28T14:15:51Z

There is an architectural problem with how seccomp is applied which break foreign-libc containers (ie. OpenWrt with musl and container eg. with glibc). I've resolved that and tons of other issues, currently last mile of testing, going to push to this branch in the next hours with validated podman working. Meanwhile, please open a PR to update podman in case you didn't do that already and there isn't any existing PR for that (I've opened the PR to update conmon earlier this morning)

I see. Glad to hear that. I haven't actually updated Podman itself. It is up to date. In addition to conmon I've update the netavark/aardvark-dns stack, catatonit and crun (since it was also out of date and I was at it)

I'll look into opening PRs

P.S.: Your PR does not include a second patch conmon required in my build env at least. I got a GCC false positive on -Wstringop-overread in ctr_logging.c. If your build works though that may have been an issue on my side, but I wouldn't know which.

dangowrt · 2026-05-28T17:39:13Z

@Juff-Ma Please retry with the changes I've pushed now. Don't forget to also apply the patch for procd's Makefile I've posted in #37 (comment)

Juff-Ma · 2026-05-28T18:45:15Z

@Juff-Ma Please retry with the changes I've pushed now. Don't forget to also apply the patch for procd's Makefile I've posted in #37 (comment)

@dangowrt IT IS ALIVE. Oh my god this is amazing. I still included my cgroups patch just to be sure but it absolutely works now. I can run containers and their services are reachable.

The work you've done to get this working is incredible.

I found 2 little issues (that do not impact day to day operations, at least for me, but I still wanna name them).

When using -it the console (at least my SSH connection) still hangs up. I haven't found a way to disconnect from the container
UXC still seems to report a file not found error sometimes (when podman deletes containers maybe?) but I have not found any actual problems from this behaviour. Maybe it is a case of too much verbosity by default in a recoverable state.

P.S.: The error really does occur when killing or deleting containers. I don't know why though. Killing never works and rm-ing will error out the first time. Still not an issue for me.

Until now any container whose OCI spec listed CLONE_NEWNET caused ujail to spawn a per-jail ubusd and netifd inside /var/containers/. That is the OpenWrt-native uxc UX (jail-side wiring from /etc/config/network) but actively fights OCI orchestrators: podman, docker and friends set up the netns and addressing themselves via netavark / CNI / CDI, then expect a fresh empty netns inside the container. The parallel netifd inside the jail was both unwanted and a permanent leak — its lifecycle hooked into poststop() only, while every error path through free_and_exit() left ubusd and netifd as orphans of procd. Two new OCI annotations make the behaviour explicit: org.openwrt.procd.ubus = "true" -> spawn jail-private ubusd org.openwrt.procd.netifd = "true" -> spawn jail-private ubusd + netifd (netifd implies ubus) Both default false, so OCI orchestrators get the netns podman/CNI/CDI prepared, undisturbed. Native uxc rc-script flows that want the old behaviour set the annotations when generating the bundle. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

cgroups_free() walked the cgvals AVL tree and freed cgroup_path but did not clear the `initialized` flag, leaving cgroup_path and avl entries as dangling pointers. The parent exit path calls cgroups_free() twice: free_and_exit() // jail.c cgroups_free() // first call: real teardown free_opts(parent=true) cgroups_free() // second call: re-walks freed tree -> double-free -> mallocng's get_meta() faults reading "erged" bytes from a recycled chunk's metadata field -> SIGSEGV at libc.so+0x45bc4 in the ujail process Reproduces with any OCI container (alpine /bin/true, /bin/ls, etc.). ASAN's heap reshuffling masked the bug because the recycled chunk no longer overlapped the avl bookkeeping; UBSAN did not mask it because it does not touch the allocator. ptrace attach also masked it. The fix: early-return if not initialized, then NULL cgroup_path and clear `initialized` after the actual teardown, so a second call is a no-op. With the fix, /bin/true under uxc runs cleanly across 30 consecutive runs, capability and seccomp behaviour vs crun stays 8/8 MATCH, and no per-jail ubusd/netifd leaks. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

When podman runs with -t, conmon binds a SOCK_STREAM listener at /var/tmp/conmon-term.XXX and execs the runtime with --console-socket pointing at it. conmon closes the listener as soon as it sees the runtime process exit, so the runtime must connect *during* its own lifetime. For crun/runc this is fine: the runtime does posix_openpt + connect + sendmsg within milliseconds. For uxc the chain is uxc -> ubus to procd -> spawn ujail -> clone3 child -> build_jail_fs -> create_dev_console; by the time ujail's child reaches send_console_fd the listener is already gone and connect() returns ECONNREFUSED, causing podman to fail with "Runtime did not set up terminal". Connect in the parent ujail process immediately after CLI parsing (before parseOCI and clone3), then pass the fd number to the child through opts.console_socket. parse_inherited_console_fd in open_console_sock already handles a numeric spec, so the existing sendmsg path in the child just reuses the inherited fd. The fd is CLOEXEC-cleared so it survives clone3. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Two related cosmetic-but-visible fixes for podman -t containers under uxc: 1. create_dev_console opened the PTY slave and dup2'd it to 0/1/2 but never made it the controlling terminal. Bash inside the container printed "cannot set terminal process group (-1): Inappropriate ioctl for device" and "no job control in this shell". Call setsid (best effort; PID 1 in a fresh CLONE_NEWPID is already a session leader) and ioctl(TIOCSCTTY) so the slave becomes the workload's controlling terminal. 2. Two INFO() prints fired AFTER stdio had been dup'd to the PTY slave, so "jail: exec-ing bash" and "using guest console ..." ended up on the user's terminal before the workload could write its prompt. Demote both to DEBUG so they only surface with -d > 0. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Two related fixes that together let podman -t containers run cleanly under uxc, including normal exit detection. 1) Allocate the PTY in the parent ujail (host mount NS) instead of the child (newinstance devpts NS). conmon receives the master fd via SCM_RIGHTS and the slave name as iovec data; it then opens the slave by name in its own mount NS. When ujail allocated the PTY inside the child after mounting newinstance devpts, the slave name was only valid in the child's NS, so conmon couldn't open it. Moved posix_openpt + grantpt + unlockpt + slave open + send_console_fd into the parent path right before clone3, and inherit the slave to the child via CLOEXEC-cleared console_slave_fd. create_dev_console in the child is now just the bind-mount of the inherited slave at /dev/console (via /proc/self/fd/N to survive pivot_root) plus setsid + TIOCSCTTY + dup2 on the inherited fd. 2) Send SIGCHLD to conmon when the parent ujail exits. conmon's exit-detection is driven by SIGCHLD: on signal it runs waitpid, then if ECHILD it probes kill(container_pid, 0). For runc/crun the runtime is conmon's child so SIGCHLD arrives naturally when the container init dies. For uxc the runtime (and ujail) are spawned by procd, not by conmon, so conmon never sees SIGCHLD and the exit file never gets written; podman sits at "Up Xs" forever. Read conmon's pidfile (sibling of opts.pidfile) at free_and_exit and kill(conmon, SIGCHLD); conmon's existing kill-probe path then detects the dead container_pid and writes the exit file. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

After commit c85ae2e (PTY allocated in parent, slave opened in child early in build_jail_fs), the post-clone3 init reaches post_jail_fs right after create_dev_console has installed the PTY slave as the controlling terminal via setsid + TIOCSCTTY. The resulting session change triggers a kernel-delivered signal that the catch-all handler installed by signals_init intercepts, and because that handler is registered without SA_RESTART the blocking read on pipes[2] returns EINTR. The old code treated that as an error, the init died with 'can't read from parent', the parent's free_and_exit then removed container.<name>, and the subsequent uxc start lookup returned ENOENT (visible to podman as 'uxc error: No such file or directory' and '/sbin/uxc start ... failed: exit status 254'). Wrap both pipe sync reads in EINTR retry loops so the init blocks until the parent actually writes 'O' / '!'. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Refactor uxc_attach to a single goto-out cleanup path so each new error branch can't drift out of sync with the resource-release list. Map the UBUS_STATUS_NOT_SUPPORTED reply from procd's container_handle_console (returned when the container was created without process.terminal=true) to a specific message telling the user to recreate with console=true, instead of the opaque "ubus request failed". Drop the two stray setup_tios() calls on the PTY master/server fds (raw termios on a PTY endpoint is meaningless; only the local /dev/tty needs it) and gate the tcsetattr() restore on whether setup actually took effect. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

dangowrt · 2026-05-31T19:22:49Z

@Juff-Ma console issues with podman should now also be fixed, please retry and report if you find the time. Thank you!

Juff-Ma · 2026-05-31T19:29:07Z

@Juff-Ma console issues with podman should now also be fixed, please retry and report if you find the time. Thank you!

If you mean the commits up to 280a8bb I wasn't really able to test them (I already tried when they were pushed) since I've been encountering a new issue. I though that I had left a comment but it seems it didn't go through (why github?)

On some containers I get a jail: failed to execve /docker-entrypoint.sh: File not found error. I don't know why it happens.

Here are the two containers I used:

nginxdemos/hello:latest - worked
coturn/coturn:4 or coturn/coturn:4-alpine - did not work

execvpe() in musl performs its PATH search with getenv("PATH") against the calling process's own environment, not the envp argument handed to it. ujail runs with procd's PATH (the build-time EARLY_PATH, /usr/sbin:/usr/bin:/sbin:/bin), so a relative entrypoint that lives only in a directory present in the container's PATH but not ujail's - e.g. coturn's docker-entrypoint.sh in /usr/local/bin - is never found and the exec fails with ENOENT ("failed to execve docker-entrypoint.sh: No such file or directory"). Images whose entrypoint resolves to /bin or /usr/bin happened to work because those dirs are in both paths. Point environ at the freshly built container envp right before execvpe() so the PATH lookup honours the container's PATH. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

service: fix calloc_a argument

74bfbee

Use size_t instead of int, in order to avoid garbage values being used inside calloc_a. Signed-off-by: Felix Fietkau <nbd@nbd.name>

dangowrt added 2 commits May 23, 2026 01:48

dangowrt force-pushed the oci-1.3.0 branch from 05d8c18 to b28dd18 Compare May 23, 2026 00:48

dangowrt force-pushed the oci-1.3.0 branch 2 times, most recently from ed1ba35 to 14c4b4a Compare May 26, 2026 13:01

dangowrt added 4 commits May 26, 2026 15:57

dangowrt added 12 commits May 26, 2026 16:48

cgroups: support cpu.idle

e196e56

The OCI 1.1.0 cgroup-v2 field linux.resources.cpu.idle was unsupported. Add the field to the cpu policy and write the value to cpu.idle, which schedules tasks in the cgroup as SCHED_IDLE. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

jail/seccomp-oci: add RISCV64 architecture

09be6f1

The seccomp arch resolver did not know SCMP_ARCH_RISCV64, so OCI 1.1.0 bundles targeting RISC-V 64-bit were rejected with "unknown seccomp architecture". Map it to AUDIT_ARCH_RISCV64. Signed-off-by: Daniel Golle <daniel@makrotopia.org>

dangowrt added 13 commits May 27, 2026 02:35

dangowrt force-pushed the oci-1.3.0 branch from 14c4b4a to 38356bd Compare May 28, 2026 17:22

dangowrt added 7 commits May 29, 2026 03:40

dangowrt force-pushed the oci-1.3.0 branch from 38356bd to 280a8bb Compare May 29, 2026 02:41

Conversation

dangowrt commented May 8, 2026

Uh oh!

dangowrt commented May 8, 2026

Uh oh!

dhewg commented May 9, 2026

Uh oh!

Juff-Ma commented May 22, 2026

Uh oh!

dangowrt commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Juff-Ma commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dangowrt commented May 26, 2026

Uh oh!

Juff-Ma commented May 26, 2026

Uh oh!

Juff-Ma commented May 26, 2026

Uh oh!

Juff-Ma commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dangowrt commented May 28, 2026

Uh oh!

Juff-Ma commented May 28, 2026

Uh oh!

dangowrt commented May 28, 2026

Uh oh!

Juff-Ma commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dangowrt commented May 31, 2026

Uh oh!

Juff-Ma commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dangowrt commented May 23, 2026 •

edited

Loading

Juff-Ma commented May 23, 2026 •

edited

Loading

Juff-Ma commented May 28, 2026 •

edited

Loading

Juff-Ma commented May 28, 2026 •

edited

Loading