Skip to content

procd: add support for OCI runtime spec 1.3.0#37

Open
dangowrt wants to merge 79 commits into
openwrt:mainfrom
dangowrt:oci-1.3.0
Open

procd: add support for OCI runtime spec 1.3.0#37
dangowrt wants to merge 79 commits into
openwrt:mainfrom
dangowrt:oci-1.3.0

Conversation

@dangowrt
Copy link
Copy Markdown
Member

@dangowrt dangowrt commented May 8, 2026

This updates procd-jail and uxc to OCI runtime spec 1.3.0, and also (with the help of Claude) improves seccomp support (new RISCV64 arch and cleanup of tracing, needs testing!)

Should fix #35

@dangowrt
Copy link
Copy Markdown
Member Author

dangowrt commented May 8, 2026

@dhewg would appreciate your review!

@dhewg
Copy link
Copy Markdown
Contributor

dhewg commented May 9, 2026

Nice, will take a look!
But before I test drive that on a franken setup, mind enlighten me in #38?

Use size_t instead of int, in order to avoid garbage values being used
inside calloc_a.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 22, 2026

@dangowrt While this PR does indeed fix part of my issue in #35, the part I included in the edit does this happen.

When actually trying to run a OCI container UXC will throw
user.err : jail: parsing of OCI JSON spec has failed: Operation not permitted (1)
on the log
uxc error: No such file or directory
on the command line and
[ ] 12b5d8f454bd3a9ea29ddae6e6523c96e3d2cb5f19f0a898005b37feb365a32a stopped exitcode: 1 (Operation not permitted)
when running uxc list

The directory in the uxc json file is present and filled with files:

tree /data/podman/storage/overlay-containers/12b5d8f454bd3a9ea29ddae6e6523c96e3d2cb5f19f0a898005b37feb365a32a/userdata/
├── artifacts
├── attach
├── config.json
├── ctl
├── ctr.log
├── secrets
├── shm
└── winsz

The command is just podman run quay.io/podman/hello
I've attached the config.json it generates. I changed nothing, only formatting it for readability.

dangowrt added 2 commits May 23, 2026 01:48
_add_mount() returned 1 (which happens to equal EPERM) for any
duplicate target in the mounts AVL tree. The loops in parseOCIlinux()
that register linux.readonlyPaths and linux.maskedPaths treat any
non-zero return as fatal, so a config.json with a repeated entry
(e.g. podman emits /sys/devices/virtual/powercap in its default
maskedPaths *and* appends it again as the CVE-2020-8694 mitigation)
aborted parseOCI() with the bogus diagnostic

    parsing of OCI JSON spec has failed: Operation not permitted (1)

Compare the incoming registration against the existing entry instead.
If source, filesystemtype, optstr, mountflags, propflags, error and
inner all match, the call is idempotent and returns 0. Any divergence
returns EEXIST so a conflicting re-registration at the same target is
rejected loudly rather than silently shadowed by whichever entry
happened to land in the AVL first.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The version check rejected anything that did not start with "1.0",
blocking every bundle authored against spec 1.1.0 or later. OpenWrt is
a monolithic build where every container package must conform to the
distribution's current spec, so there is no need to keep separate code
paths for the 1.0.x semantics that 1.1.0 changed (e.g. poststart hook
failures becoming fatal, prestart's role overlapping createRuntime).
Tighten the prefix match to require any "1.x" with x >= 1.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
@dangowrt
Copy link
Copy Markdown
Member Author

dangowrt commented May 23, 2026

@Juff-Ma I think I've found the issue, just analyzing config.json and tracing where EPERM could come from. Please retry (fix it in 6cd2985 )

@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 23, 2026

@Juff-Ma I think I've found the issue, just analyzing config.json and tracing where EPERM could come from. Please retry (fix it in 6cd2985 )

@dangowrt Your fix does indeed resolve the EPERM issue however it still does not work. Instead of error 1 EPERM I now get error 22 Invalid Argument. UXC stills ays No such file or sirectory with this on the log:

Sat May 23 11:19:15 2026 user.err : jail: parsing of OCI JSON spec has failed: Invalid argument (22)

and this in the listing:

[ ] f72077ecdaa5d6724a1a7118e4f952e14619ded9f17d804a6ebe1946e5e142cc stopped exitcode: 22 (Invalid argument)

config file/command is the same. Note that this also happens when running uxc start manually, so it's not an invalid argument to the uxc command itself (which I originally thought)

The issue for this happening seems to be seccomp. Since when disabling seccomp the error disappears when using UXC directly. Now it can run the container. However when trying to attach via uxc start --console NAME I get ubus request failed. (When trying to attach via -it I get Error: OCI runtime error: uxc: /sbin/uxc: unrecognized option: console-socket)

Another point is, it really only works when using uxc directly. When it is execut by podman I get the following error still:

> uxc error: No such file or directory
> ERRO[0000] Removing container 501a88cc497d6c1bb784f679a228cc50d1676a1e3e2a542a03400492174cef97 from runtime after creation failed
> Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

Also note that despite what's being said, the container is NOT removed from uxc.

Edit: here's the full podman log:
podman run --rm --security-opt seccomp=unconfined --log-level=debug e7db/geekbench
INFO[0000] podman filtering at log level debug
DEBU[0000] Called run.PersistentPreRunE(podman run --rm --security-opt seccomp=unconfined --log-level=debug e7db/geekbench)
INFO[0000] Setting parallel job count to 7
DEBU[0000] Using conmon: "/usr/bin/conmon"
INFO[0000] Using sqlite as database backend
DEBU[0000] Using graph driver overlay
DEBU[0000] Using graph root /data/podman/storage
DEBU[0000] Using run root /tmp/run/containers/storage
DEBU[0000] Using static dir /data/podman/static
DEBU[0000] Using tmp dir /var/run/libpod
DEBU[0000] Using volume path /data/podman/volumes
DEBU[0000] Using transient store: false
DEBU[0000] [graphdriver] trying provided driver "overlay"
DEBU[0000] Cached value indicated that overlay is supported
DEBU[0000] Cached value indicated that overlay is supported
DEBU[0000] Cached value indicated that metacopy is not being used
DEBU[0000] Cached value indicated that native-diff is usable
DEBU[0000] backingFs=extfs, projectQuotaSupported=false, useNativeDiff=true, usingMetacopy=false
DEBU[0000] Initializing event backend none
DEBU[0000] Configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] Configured OCI runtime runj initialization failed: no valid executable found for OCI runtime runj: invalid argument
DEBU[0000] Configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] Configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] Configured OCI runtime youki initialization failed: no valid executable found for OCI runtime youki: invalid argument
DEBU[0000] Configured OCI runtime krun initialization failed: no valid executable found for OCI runtime krun: invalid argument
DEBU[0000] Configured OCI runtime ocijail initialization failed: no valid executable found for OCI runtime ocijail: invalid argument
DEBU[0000] Configured OCI runtime crun initialization failed: no valid executable found for OCI runtime crun: invalid argument
DEBU[0000] Configured OCI runtime crun-vm initialization failed: no valid executable found for OCI runtime crun-vm: invalid argument
DEBU[0000] Configured OCI runtime crun-wasm initialization failed: no valid executable found for OCI runtime crun-wasm: invalid argument
DEBU[0000] Using OCI runtime "/sbin/uxc"
DEBU[0000] Pulling image e7db/geekbench (policy: missing)
DEBU[0000] Looking up image "e7db/geekbench" in local containers storage
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux  [] }
DEBU[0000] Loading registries configuration "/etc/containers/registries.conf"
DEBU[0000] Trying "localhost/e7db/geekbench:latest" ...
DEBU[0000] reference "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]localhost/e7db/geekbench:latest" does not resolve to an image ID
DEBU[0000] Trying "docker.io/e7db/geekbench:latest" ...
DEBU[0000] parsed reference into "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage ([overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e)
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Looking up image "e7db/geekbench" in local containers storage
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux  [] }
DEBU[0000] Trying "localhost/e7db/geekbench:latest" ...
DEBU[0000] reference "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]localhost/e7db/geekbench:latest" does not resolve to an image ID
DEBU[0000] Trying "docker.io/e7db/geekbench:latest" ...
DEBU[0000] parsed reference into "[overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage
DEBU[0000] Found image "e7db/geekbench" as "docker.io/e7db/geekbench:latest" in local containers storage ([overlay@/data/podman/storage+/tmp/run/containers/storage:overlay.mountopt=nodev]@ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e)
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] Inspecting image ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e
DEBU[0000] using systemd mode: false
DEBU[0000] No hostname set; container's hostname will default to runtime default
DEBU[0000] Successfully loaded network podman: &{podman 5ef894788befd4d42498314b6e66282ca730aa2e1e82f9b9597bf4d1725ca074 bridge podman0 2026-05-20 16:26:39.519216229 +0000 UTC [{{{172.17.0.0 ffff0000}} 172.17.0.1 <nil>}] [] false false true [] map[] map[] map[driver:host-local]}
DEBU[0000] Successfully loaded 1 networks
DEBU[0000] Allocated lock 3 for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] exporting opaque data as blob "sha256:ecb5e3625e43b91d070b53dce80ca47b93f06a6f8926f48efb7dee49a9e3413e"
DEBU[0000] Cached value indicated that idmapped mounts for overlay are supported
DEBU[0000] Created container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c"
DEBU[0000] Container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c" has work directory "/data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata"
DEBU[0000] Container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c" has run directory "/tmp/run/containers/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata"
DEBU[0000] Not attaching to stdin
INFO[0000] Received shutdown.Stop(), terminating!        PID=22063
DEBU[0000] Enabling signal proxying
DEBU[0000] Cached value indicated that volatile is being used
DEBU[0000] overlay: mount_data=lowerdir=/data/podman/storage/overlay/l/CVIWIVQJHOTYEVPQEJXPDF274I:/data/podman/storage/overlay/l/5NGBTSZRZWVJIPCS2O5DUC4KPW:/data/podman/storage/overlay/l/XNKYBFRIFIOYSMOMF4WDXIIA77,upperdir=/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/diff,workdir=/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/work,nodev,volatile
DEBU[0000] Mounted container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c" at "/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/merged"
DEBU[0000] Created root filesystem for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c at /data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/merged
DEBU[0000] Made network namespace at /run/netns/netns-3b795775-1b32-3863-6bff-7e3457bd7d8e for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
[DEBUG netavark::network::validation] "Validating network namespace..."
[DEBUG netavark::commands::setup] "Setting up..."
[INFO  netavark::firewall] Using iptables firewall driver
[DEBUG netavark::network::bridge] Setup network podman
[DEBUG netavark::network::bridge] Container interface name: eth0 with IP addresses [172.17.0.7/16]
[DEBUG netavark::network::bridge] Bridge name: podman0 with IP addresses [172.17.0.1/16]
[DEBUG netavark::network::core_utils] Setting sysctl value for net.ipv4.ip_forward to 1
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv4/conf/podman0/rp_filter to 2
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv6/conf/eth0/autoconf to 0
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv4/conf/eth0/arp_notify to 1
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv4/conf/eth0/rp_filter to 2
[INFO  netavark::network::netlink] Adding route (dest: 0.0.0.0/0 ,gw: 172.17.0.1, metric 100)
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-1D8721804F16F created on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_2 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_2 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_3 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_ISOLATION_3 exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_INPUT exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_INPUT exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_FORWARD exists on table filter
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_FORWARD exists on table filter
[DEBUG netavark::firewall::varktables::helpers] rule -d 172.17.0.0/16 -j ACCEPT created on table nat and chain NETAVARK-1D8721804F16F
[DEBUG netavark::firewall::varktables::helpers] rule ! -d 224.0.0.0/4 -j MASQUERADE created on table nat and chain NETAVARK-1D8721804F16F
[DEBUG netavark::firewall::varktables::helpers] rule -s 172.17.0.0/16 -j NETAVARK-1D8721804F16F created on table nat and chain POSTROUTING
[DEBUG netavark::firewall::varktables::helpers] rule -p udp -s 172.17.0.0/16 --dport 53 -j ACCEPT created on table filter and chain NETAVARK_INPUT
[DEBUG netavark::firewall::varktables::helpers] rule -p tcp -s 172.17.0.0/16 --dport 53 -j ACCEPT created on table filter and chain NETAVARK_INPUT
[DEBUG netavark::firewall::varktables::helpers] rule -m conntrack --ctstate INVALID -j DROP exists on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] rule -d 172.17.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT created on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] rule -s 172.17.0.0/16 -j ACCEPT created on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-SETMARK exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-SETMARK exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-MASQ exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-MASQ exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-DNAT exists on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-DNAT exists on table nat
[DEBUG netavark::firewall::varktables::helpers] rule -j MARK  --set-xmark 0x2000/0x2000 exists on table nat and chain NETAVARK-HOSTPORT-SETMARK
[DEBUG netavark::firewall::varktables::helpers] rule -j MASQUERADE -m comment --comment 'netavark portfw masq mark' -m mark --mark 0x2000/0x2000 exists on table nat and chain NETAVARK-HOSTPORT-MASQ
[DEBUG netavark::firewall::varktables::helpers] rule -j NETAVARK-HOSTPORT-DNAT -m addrtype --dst-type LOCAL exists on table nat and chain PREROUTING
[DEBUG netavark::firewall::varktables::helpers] rule -j NETAVARK-HOSTPORT-DNAT -m addrtype --dst-type LOCAL exists on table nat and chain OUTPUT
[DEBUG netavark::dns::aardvark] Spawning aardvark server
[DEBUG netavark::dns::aardvark] start aardvark-dns: ["/usr/lib/podman/aardvark-dns", "--config", "/run/containers/networks/aardvark-dns", "-p", "53", "run"]
[DEBUG netavark::commands::setup] {
        "podman": StatusBlock {
            dns_search_domains: Some(
                [
                    "dns.podman",
                ],
            ),
            dns_server_ips: Some(
                [
                    172.17.0.1,
                ],
            ),
            interfaces: Some(
                {
                    "eth0": NetInterface {
                        mac_address: "b2:70:f0:b5:a8:f1",
                        subnets: Some(
                            [
                                NetAddress {
                                    gateway: Some(
                                        172.17.0.1,
                                    ),
                                    ipnet: 172.17.0.7/16,
                                },
                            ],
                        ),
                    },
                },
            ),
        },
    }
[DEBUG netavark::commands::setup] "Setup complete"
DEBU[0000] Not modifying container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c /etc/passwd
DEBU[0000] Modifying container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c /etc/group
DEBU[0000] Setting Cgroup path for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c to /libpod_parent/libpod-6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] reading hooks from /usr/share/containers/oci/hooks.d
DEBU[0000] Workdir "/opt/geekbench" resolved to host path "/data/podman/storage/overlay/9cb324b82ec58a3e47df09f5c829b29ede9192380f15d6ace2b2e2c76c9e0653/merged/opt/geekbench"
DEBU[0000] Created OCI spec for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c at /data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/config.json
DEBU[0000] /usr/bin/conmon messages will be logged to syslog
DEBU[0000] running conmon: /usr/bin/conmon               args="[--api-version 1 -c 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c -u 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c -r /sbin/uxc -b /data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata -p /tmp/run/containers/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/pidfile -n modest_hoover --exit-dir /var/run/libpod/exits --persist-dir /var/run/libpod/persist/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c --full-attach -l k8s-file:/data/podman/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/ctr.log --log-level debug --syslog --conmon-pidfile /tmp/run/containers/storage/overlay-containers/6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /data/podman/storage --exit-command-arg --runroot --exit-command-arg /tmp/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --network-config-dir --exit-command-arg /etc/containers/networks/ --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /data/podman/volumes --exit-command-arg --db-backend --exit-command-arg sqlite --exit-command-arg --transient-store=false --exit-command-arg --hooks-dir --exit-command-arg /usr/share/containers/oci/hooks.d --exit-command-arg --runtime --exit-command-arg uxc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev --exit-command-arg --events-backend --exit-command-arg none --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --stopped-only --exit-command-arg --rm --exit-command-arg 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c]"
uxc error: No such file or directory
ERRO[0000] Removing container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c from runtime after creation failed
DEBU[0000] Cleaning up container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Tearing down network namespace at /run/netns/netns-3b795775-1b32-3863-6bff-7e3457bd7d8e for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
[DEBUG netavark::commands::teardown] "Tearing down.."
[INFO  netavark::firewall] Using iptables firewall driver
[INFO  netavark::network::bridge] removing bridge podman0
[DEBUG netavark::commands::teardown] "Teardown complete"
DEBU[0000] Unmounted container "6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c"
DEBU[0000] Removing container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Cleaning up container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] Container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c storage is already unmounted, skipping...
DEBU[0000] Removing all exec sessions for container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c
DEBU[0000] Container 6298bf193de794b9eb70faaae79e98ab517da83ce3174f0b630e2d2eb978062c storage is already unmounted, skipping...
DEBU[0000] Temp dir already cleaned up
DEBU[0000] ExitCode msg: "container create failed (no logs from conmon): conmon bytes \"\": readobjectstart: expect { or n, but found \x00, error found in #0 byte of ...||..., bigger context ...||..."
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines

Edit 2: After recompiling OpenWRT with seccomp support and rerunning the error still persists but I get a new error message:

user.err : jail: can't open /sys/fs/cgroup/libpod_parent/libpod-85650516cfe8ea8fa4a9e153b075b7c07cb7f97b75370d59d293ff80a4ffac5c/memory.swap_max: No such file or directory

However inspecting it shows that it is indeed present just with a slightly different name

-rw-r--r-- 1 root root 0 May 23 13:40 memory.swap.max

It looks like a) podman is giving out bas paths that UXC can't find or b) UXC just uses a bad name. Could this be a cgroup v1/v2 thing? According to strings and some googling ujail searches for memory.swap_max which is cgroup v1 only yet the cgroup version that's running is v2 which uses memory.swap.max instead.

@dangowrt dangowrt force-pushed the oci-1.3.0 branch 2 times, most recently from ed1ba35 to 14c4b4a Compare May 26, 2026 13:01
@dangowrt
Copy link
Copy Markdown
Member Author

@Juff-Ma I was wrongly assuming that podman, runc and crun would implement the CLI as defined in the spec, but turns out that part of the spec is abandonned and it become more of just a convention. uxc now tries to behave more like runc or crun, eventhough that's not what the spec says (but the spec-compliant behavior also still works). Also many features added which should get it going as a runtime in recent podman.

Now regarding the seccomp problem you were seeing: I've reworked the seccomp support for now also work with "foreign" libc containers (ie. OpenWrt host being musl, container being eg. glibc or bionic). This needs a small change to the procd package Makefile as well, see below

diff --git a/package/system/procd/Makefile b/package/system/procd/Makefile
index cf730a3c0a..c508a447c6 100644
--- a/package/system/procd/Makefile
+++ b/package/system/procd/Makefile
@@ -30,12 +30,14 @@ PKG_CONFIG_DEPENDS:= \
 
 include $(INCLUDE_DIR)/package.mk
 include $(INCLUDE_DIR)/cmake.mk
+include $(INCLUDE_DIR)/kernel.mk
 
 ifeq ($(DUMP),)
   STAMP_CONFIGURED:=$(strip $(STAMP_CONFIGURED))_$(shell echo $(CONFIG_TARGET_INIT_PATH) | $(MKHASH) md5)
 endif
 
-CMAKE_OPTIONS += -DEARLY_PATH="$(TARGET_INIT_PATH)"
+CMAKE_OPTIONS += -DEARLY_PATH="$(TARGET_INIT_PATH)" \
+		 -DNOLIBC_INCLUDE_DIR="$(LINUX_DIR)/tools/include/nolibc"
 
 define Package/procd/Default
   SECTION:=base

@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 26, 2026

@dangowrt I've tested your build and it has the same problem I reported in Edit 2 of my previous message. The seccomp error (while I appreciate it being fixed) was easily solvable by just compiling OpenWRT with seccomp support.

Through a bit of grepping and googling I eventually made this simple one line patch but still got an error. I could try combining my change with your patchset and see if that fixes anything.

dangowrt added 4 commits May 26, 2026 15:57
The OCI 1.0.0 spec field process.consoleSize was never parsed, so
bundles requesting a specific initial pty size silently got the kernel
default. Add the field to the process policy and call ioctl(TIOCSWINSZ)
on the master fd after unlockpt when both height and width are set.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.0.0 spec field linux.personality was never parsed, so bundles
asking for LINUX32 on a 64-bit kernel ran with the host personality
instead of being rejected. Parse the field and accept only when 'domain'
matches the kernel's native personality (read via personality(0xFFFFFFFF));
return ENOTSUP otherwise, since OpenWrt does not support cross-personality
execution.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 spec field process.scheduler was unsupported, so bundles
selecting SCHED_FIFO/RR/DEADLINE/etc. fell back to the inherited host
policy. Parse the field and apply via sched_setattr just before execve;
direct syscall used since musl/glibc lack a wrapper.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 spec field process.ioPriority was unsupported, so block
I/O priority defaulted to whatever the parent inherits. Parse class and
priority and apply via ioprio_set just before execve, using a direct
syscall.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 26, 2026

@dangowrt I've tested your build and it has the same problem I reported in Edit 2 of my previous message. The seccomp error (while I appreciate it being fixed) was easily solvable by just compiling OpenWRT with seccomp support.

Through a bit of grepping and googling I eventually made this simple one line patch but still got an error. I could try combining my change with your patchset and see if that fixes anything.

I can not report success. The error about the missing cgroup file is gone but (once again) I still receive a uxc error: no such file or directory without any more info anywhere. The containers are being created though and are shown as such in the list. When I run. The relevant part of podman log has not changed

DEBU[0000] Setting Cgroup path for container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a to /libpod_parent/libpod-a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a
DEBU[0000] reading hooks from /usr/share/containers/oci/hooks.d
DEBU[0000] Workdir "/" resolved to host path "/data/podman/storage/overlay/eea63b9351612b00eeb274fcb81e8a8136008aed4555daec2ff2f3bffdee72c4/merged"
DEBU[0000] Created OCI spec for container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a at /data/podman/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/config.json
DEBU[0000] /usr/bin/conmon messages will be logged to syslog
DEBU[0000] running conmon: /usr/bin/conmon               args="[--api-version 1 -c a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a -u a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a -r /sbin/uxc -b /data/podman/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata -p /tmp/run/containers/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/pidfile -n clever_merkle --exit-dir /var/run/libpod/exits --persist-dir /var/run/libpod/persist/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a --full-attach -l k8s-file:/data/podman/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/ctr.log --log-level debug --syslog --conmon-pidfile /tmp/run/containers/storage/overlay-containers/a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /data/podman/storage --exit-command-arg --runroot --exit-command-arg /tmp/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --network-config-dir --exit-command-arg /etc/containers/networks/ --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /data/podman/volumes --exit-command-arg --db-backend --exit-command-arg sqlite --exit-command-arg --transient-store=false --exit-command-arg --hooks-dir --exit-command-arg /usr/share/containers/oci/hooks.d --exit-command-arg --runtime --exit-command-arg uxc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --stopped-only --exit-command-arg a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a]"
uxc error: No such file or directory
ERRO[0000] Removing container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a from runtime after creation failed
DEBU[0000] Cleaning up container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a
DEBU[0000] Tearing down network namespace at /run/netns/netns-262c7ca9-dba3-3001-9f44-d565a71f4cc8 for container a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a
[DEBUG netavark::commands::teardown] "Tearing down.."
[DEBUG netavark::firewall] Forcibly using firewall driver none
[INFO  netavark::firewall] Not using firewall
[INFO  netavark::network::bridge] removing bridge podman0
[DEBUG netavark::commands::teardown] "Teardown complete"
DEBU[0000] Unmounted container "a68ff245b935214d44cbf8c535ad5b55cdc450e9aac0ee0d4aba2f378c1b0b5a"
DEBU[0000] ExitCode msg: "container create failed (no logs from conmon): conmon bytes \"\": readobjectstart: expect { or n, but found \x00, error found in #0 byte of ...||..., bigger context ...||..."
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines

dangowrt added 12 commits May 26, 2026 16:48
The OCI 1.1.0 spec field domainname was unsupported, so containers ran
with the host's NIS domain name regardless of bundle configuration.
Parse the field and apply via setdomainname after sethostname inside
the new UTS namespace.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 spec field linux.timeOffsets was unsupported, so the
container saw the host's CLOCK_MONOTONIC and CLOCK_BOOTTIME values even
when a time namespace was created. Parse the per-clock offsets and
write them to /proc/self/timens_offsets early in the cloned child,
while it is still the only process in the new time namespace.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Bundles using the deprecated prestart hook had it parsed but never run,
so any setup it performed was silently dropped. Per OCI spec 1.1.0
prestart and createRuntime fire at the same point; parse it into its
own hooklist and run it before createRuntime via a new post_prestart
callback in the existing chain.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Per OCI runtime spec 1.1.0 a poststart hook failure now MUST cause the
runtime to stop the container and proceed to the delete phase, instead
of the 1.0.x "log a warning and continue" behaviour. Track aggregate
hook chain failure across run_hooks invocations and, in the poststart
return callback, skip the running phase and tear the container down.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 cgroup-v2 field linux.resources.cpu.burst was unsupported,
so bundles requesting CFS bandwidth burst accumulation got no burst.
Add the field to the cpu policy and write the value to cpu.max.burst.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 cgroup-v2 field linux.resources.cpu.idle was unsupported.
Add the field to the cpu policy and write the value to cpu.idle, which
schedules tasks in the cgroup as SCHED_IDLE.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 field linux.resources.memory.checkBeforeUpdate was
silently dropped through the unknown-attr path. The check it gates
against ("reject a new limit lower than current usage") only matters
for an OCI 'update' operation, which the runtime spec does not define
and which uxc does not implement. Add the field to the memory policy
so blobmsg_parse recognises and discards it explicitly; it would be
applied if uxc ever grew an update operation.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The pids parser rejected absent limit with EINVAL and treated the value
as unsigned, so OCI 1.3.0 bundles relying on optional pids.limit or
using -1 for unlimited failed at create. Skip silently when absent and
write "max" to pids.max for any negative value.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The seccomp violation handler used PTRACE_POKEUSER (and PTRACE_SET_SYSCALL
on arm) to overwrite the tracee's syscall number register with -1 on
x86_64, i386, arm, mips and ppc, causing the kernel to skip the offending
syscall with -ENOSYS. aarch64 and loongarch64 already lacked this poke
and let the syscall proceed transparently. The behaviour was therefore
inconsistent across architectures, and the tool's purpose -- collecting
the full set of syscalls a program issues so an allowlist can be derived
afterwards -- is defeated by the cancellation: the first non-whitelisted
call returns ENOSYS, the program typically falls into an error path or
exits, and the recorded profile is incomplete.

Drop the cancellation on all architectures so SECCOMP_TRACE mode just
observes and logs. After the change every supported arch behaves like
aarch64/loongarch64 already did.

The remaining UTRACE mode (PTRACE_SYSCALL path) was already a pure
observer and is unchanged.

Why removing the poke is safe (verified against Linux mainline):

  seccomp(2) defines SCMP_ACT_TRACE / SECCOMP_RET_TRACE such that "if
  the tracer does not change the system call number, the system call
  will proceed normally." All four arches that we used to poke implement
  exactly that contract; the cancellation only worked because procd
  actively wrote -1 into the syscall register. Concretely:

  - x86_64: arch/x86/entry/syscall_64.c:do_syscall_64() invokes the
    syscall via do_syscall_x64(); with nr unchanged the table dispatch
    runs as usual. The skip path is gated on "nr == -1".
  - i386: arch/x86/entry/syscall_32.c:do_syscall_32_irqs_on() has the
    same "else if (nr != -1)" structure.
  - arm: arch/arm/kernel/entry-common.S after syscall_trace_enter does
    "cmp scno, #-1; bne 2b" -- only -1 skips, anything else dispatches.
  - mips: arch/mips/kernel/scall*-*.S has "bltz v0, 1f" after
    syscall_trace_enter; without a negative override, the syscall is
    invoked normally.
  - powerpc: arch/powerpc/kernel/syscall.c uses the trace_enter return
    value as the syscall number; if unmodified, the dispatcher runs.

  No arch re-reads or invalidates the syscall number on its own when
  SECCOMP_RET_TRACE fires, so leaving the register alone is the
  documented "let it proceed" path on every supported target.

Real seccomp enforcement (SCMP_ACT_KILL / SCMP_ACT_ERRNO) is performed
inside the kernel by the BPF program and is unaffected by this change;
seccomp-trace is and now consistently is a pure profiling tool.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
REG_SYSCALL was carried over from the Chromium seccomp example header,
where it is consumed by a SIGSYS handler that reads the offending
syscall's register state from the signal context. procd never installed
such a handler and never referenced the macro: a tree-wide search across
the entire git history finds zero uses outside of the per-arch
definitions in this header itself. The seccomp BPF program reads the
syscall number from struct seccomp_data, and trace.c uses its own
arch-specific reg_syscall_nr offsets (or PTRACE_GET_SYSCALL_INFO).

Remove the macro from every arch branch and from the fallback. No
functional change.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Without an __riscv branch the header fell into the generic #else block,
emitted "Platform does not support seccomp filter yet" and set ARCH_NR
to 0. parseOCIlinuxseccomp() then either rejected the bundle (when the
container declared SCMP_ARCH_RISCV64 -- AUDIT_ARCH_RISCV64 never matched
0) or, with no architectures listed, generated a BPF program whose arch
gate compared against 0 and killed every syscall. seccomp filters could
not be applied on RISC-V at all.

Map __riscv (xlen == 64) to AUDIT_ARCH_RISCV64 so the existing
SCMP_ARCH_RISCV64 mapping in seccomp-oci.c can take effect and the
generated BPF program identifies the running architecture correctly.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The seccomp arch resolver did not know SCMP_ARCH_RISCV64, so OCI 1.1.0
bundles targeting RISC-V 64-bit were rejected with "unknown seccomp
architecture". Map it to AUDIT_ARCH_RISCV64.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
dangowrt added 13 commits May 27, 2026 02:35
container_handle_exec previously fork()ed the grandchild and then,
back in the ujail parent, wrote the grandchild's host pid into
cgroup.procs via cgroups_attach_pid(). The grandchild was in the
container's namespaces from the first instruction but in ujail's
own cgroup for the sub-millisecond window between fork() and the
parent's write - a noted residual gap.

Open the container's cgroup directory once with O_PATH |
O_DIRECTORY | O_CLOEXEC via cgroups_open_dir(), let it inherit
into the exec_child intermediate, and replace exec_child's inner
fork() with clone3(CLONE_INTO_CGROUP) carrying the cgroup fd in
clone_args.cgroup. The kernel places the grandchild in the
container's cgroup atomically with its creation; the post-fork
cgroups_attach_pid() call goes away.

cgroups.{c,h} grows cgroups_open_dir() that returns
open(cgroup_path, O_PATH | O_DIRECTORY | O_CLOEXEC) or -1 when
cgroup_path is unset. Cleanup paths close the fd in both the
ujail parent's success-after-fork path and the 'out:' error
label.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
cgroups_create() now runs ahead of clone3, so the cgroup tree
(mkdir_p + subtree_control writes) exists before any task is
created. clone3 then takes CLONE_INTO_CGROUP plus a cgroup
directory fd via cgroups_open_dir(), and the kernel places the
container init in the destination cgroup atomically with its
creation.

The post-fork sequence becomes cgroups_configure() (controller
value writes + ebpf attach) followed by cgroups_attach_pid()
which is now a no-op for the INTO_CGROUP path - the init is
already a member - and serves as the fallback when
cgroups_open_dir() returned -1 (a deployment where the cgroup
tree was unreachable at clone3 time and clone3 ran without
INTO_CGROUP). cgroups_apply() stays available for
container_handle_update which legitimately wants the full
create + configure + attach sequence against an existing
container.

Kernel 5.7+ for CLONE_INTO_CGROUP, well below the 6.18 target.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
OCI does not standardise a key for Memory-Deny-Write-Execute. Read
'org.openwrt.ujail.mdwe' from the bundle annotations, expect a
comma-separated value of refuse_exec_gain[,no_inherit], and call
prctl(PR_SET_MDWE, flags) on both the container init (after
PR_SET_NO_NEW_PRIVS and the capability drop in post_start_hook)
and the exec grandchild (right after its NNP step).

The container init prctl is fatal on failure; the exec grandchild
prctl exits 127 like the surrounding privilege-drop steps.

A workload that legitimately mmaps RWX (JITs) should leave the
annotation unset, or pass refuse_exec_gain,no_inherit so the bit
is dropped across exec into trusted helper binaries that need
W+X. The MDWE bit propagates across fork unconditionally.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
New self-contained jail/landlock.{c,h} wraps the three Landlock
syscalls (landlock_create_ruleset, landlock_add_rule,
landlock_restrict_self) and exposes a small struct
landlock_config that callers grow path-by-path. landlock_available()
caches the ABI probe; landlock_apply() builds and installs the
ruleset.

Activation is opt-in via three bundle annotations:

  org.openwrt.ujail.landlock.ro = '<path>[:<path>]*'
  org.openwrt.ujail.landlock.rx = '<path>[:<path>]*'
  org.openwrt.ujail.landlock.rw = '<path>[:<path>]*'

Each adds a LANDLOCK_RULE_PATH_BENEATH rule with the corresponding
access mask (READ_FILE | READ_DIR; the same plus EXECUTE; the
read/write/create/remove set). With no annotation present
landlock_apply() is skipped, preserving runc-compatible
allow-all behaviour. As soon as any rule is registered the runtime
forces opts.no_new_privs = 1, because landlock_restrict_self
requires NNP.

The apply call sits in post_start_hook after pivot_root (paths
must resolve against the post-pivot view) and before
applyOCIlinuxseccomp, so denied paths fail before the workload
runs and the seccomp filter still loads on top. CMakeLists.txt
links jail/landlock.c into ujail.

Reverse-DNS prefix follows the org.openwrt.* convention used by
other ujail-specific bundle annotations (mdwe, systemd-cgroup
hint).

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Copy libpreload-seccomp.so into a memfd backed by MFD_EXEC |
MFD_ALLOW_SEALING, apply F_SEAL_WRITE|GROW|SHRINK, and point
LD_PRELOAD at /proc/self/fd/<N>. The workload sees a tamper-evident,
read-only copy of the helper. The on-disk .so no longer needs to be
visible inside the jail rootfs; only its NEEDED shared-library
dependencies (libc, libubox, libblobmsg_json, libdl) are bind-mounted,
which preload_load_deps() resolves by walking the dynamic section of
the on-host copy.

This works for both OCI containers and classic OpenWrt procd jails
since both flows construct the environment via build_envp(); the
former always mounts procfs, and the latter does so whenever the jail
is launched with namespace separation.

The memfd is created with MFD_EXEC (not MFD_NOEXEC_SEAL) because ld.so
maps the file with PROT_EXEC; likewise F_SEAL_EXEC is not applied for
the same reason. close_range() before execve would otherwise close the
memfd via CLOSE_RANGE_CLOEXEC, so the descriptor flag is cleared just
before the exec.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Compile the seccomp JSON into a cBPF program inside ujail (and inside
seccomp-trace), then write the struct sock_filter[] array into a
sealed memfd created with MFD_ALLOW_SEALING and locked down with
F_SEAL_WRITE | F_SEAL_GROW | F_SEAL_SHRINK. The descriptor number is
passed to the workload via the SECCOMP_BPF_FD environment variable;
close_range() before execve marks it CLOEXEC, so the flag is cleared
just like the preload memfd itself.

libpreload-seccomp.so shrinks to a single constructor that mmaps the
memfd, calls prctl(PR_SET_NO_NEW_PRIVS) and seccomp(SET_MODE_FILTER),
and returns. The __libc_start_main / __uClibc_main interception trick
is gone; the constructor mechanism is universal across glibc, musl,
uClibc and bionic, so the preload is no longer wedded to the libc the
helper itself was built against.

Side effects of moving JSON parsing out of the workload:

  - the .so no longer needs libubox, libblomsg_json or libdl, so
    those bind-mounts vanish from the jail rootfs entirely (libc is
    still required, as the workload uses it),
  - SECCOMP_FILE and SECCOMP_DEBUG are retired from the workload-side
    env contract; the BPF blob is now the only thing the workload
    sees,
  - jail/seccomp.c and jail/seccomp.h are removed since
    install_syscall_filter() lived solely to bridge the old contract,
  - seccomp-trace gains a dependency on jail/seccomp-oci.c so it can
    compile the JSON itself; the contract between procd and
    seccomp-trace (SECCOMP_FILE env var or -f arg) is unchanged.

The filter is still armed inside the workloads address space and only
after ld.so has finished its work, which preserves the property that
the cBPF program does not have to whitelist execve, the dynamic
linker bootstrap, or libc init.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Add a NOLIBC_INCLUDE_DIR CMake variable that the OpenWrt procd recipe
can point at the kernel tree (typically $(LINUX_DIR)/tools/include/nolibc).
When the variable is set, libpreload-seccomp.so is compiled with
  -nostdlib -ffreestanding -fno-builtin -include nolibc.h
and linked with -nostdlib, leaving the resulting shared object with no
DT_NEEDED entries at all. The system libc and libdl no longer have to
be visible to the helper, which closes the only remaining "two libcs
in one address space" trap when a workload built against a different
libc than ujail itself is run under classic procd-jail with -S.

When the variable is not set the helper continues to link against the
system libc exactly as before; the same preload.c builds in both
modes. A #ifndef NOLIBC guard around the system-header block prevents
clashes between nolibcs static-inline definitions and the system
declarations of getenv / mmap / prctl / etc.

The unsetenv call that previously cleared SECCOMP_BPF_FD from the
workloads environment is dropped, both because nolibc does not provide
unsetenv and because the now-closed fd number is harmless to anything
that does not already know about the SECCOMP_BPF_FD contract.

A comprehensive top-of-file comment in jail/preload.c documents the
contract, the constructor-over-__libc_start_main rationale, the sealed
memfd handoff, and the two build modes, so future maintainers and
auditors do not have to reverse-engineer the design from the build
system.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Broadcast two new ubus events from ujail for OCI lifecycle visibility:

  instance.ready    payload: { service, instance }
                    emitted from post_create_runtime once
                    jail_oci_state has transitioned to OCI_STATE_CREATED
                    (i.e. the OCI pidfile has been written and the
                    container is parked at the start barrier).

  instance.running  payload: { service, instance }
                    emitted from exec_ack_cb after the kernel closes
                    the init's O_CLOEXEC end of a new exec_ack pipe on
                    a successful execve. uxc can wait on this event to
                    return synchronously from "uxc start".

The exec_ack pipe is opened in post_main with pipe2(O_CLOEXEC) so the
init inherits the CLOEXEC write end. The parent closes its copy
immediately after clone3 and registers a uloop_fd callback on the
read end; EOF (the kernel auto-closing the CLOEXEC fd at execve) is
the success signal.

Also fix a latent initialiser bug: jail_oci_state was statically
initialised to OCI_STATE_CREATED, which would falsely satisfy
handle_start's "must be created to start" guard if any caller raced
the post_main / post_create_runtime sequence. Initialise to
OCI_STATE_CREATING instead.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
uxc_create, uxc_start and uxc_delete now block until the relevant
lifecycle event arrives on ubusd, rather than returning as soon as
the triggering ubus.invoke replies:

  create  arms instance.ready watcher for the container, invokes
          container.add, waits up to 30 s for ujail to broadcast
          instance.ready (i.e. pidfile written, OCI state CREATED).

  start   arms instance.running watcher, invokes container.<name>.start,
          waits up to 30 s for ujail to broadcast instance.running
          after the container init's execve.

  delete  arms a ubus.object.remove watcher for the container.<name>
          path, invokes container.<name>.delete, waits up to 30 s for
          ubusd to broadcast the object's removal (ujail's exit
          unregisters it).

The event handler is registered BEFORE the triggering ubus.invoke to
avoid losing a fast-arriving event. procd is otherwise uninvolved in
this synchronisation; existing ubus consumers see no change.

This closes the Podman race ("pidfile not yet created" when uxc
returned) without procd having to grow any new state machinery.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
signals_init() installed jail_handle_signal on every catchable signal
except SIGCHLD/SIGPIPE/SIGSEGV/SIGSTOP/SIGKILL. When the kernel
delivered a synchronous fault signal (SIGBUS, SIGFPE, SIGILL, SIGSYS,
SIGABRT, SIGTRAP) the handler forwarded it to the contained process
and returned. The kernel then restarted the faulting instruction,
re-raised the same signal, and the loop continued forever; the bug
manifested as ujail pinning a CPU at 100% with rt_sigreturn /
SI_KERNEL SIGBUS in a tight loop.

Treat the synchronous fault signals the same way SIGSEGV is already
treated: leave them at the default disposition so the kernel terminates
the process (and produces a core dump if enabled). Forwarding a
synchronous fault to the contained process is meaningless anyway since
the fault originated in ujail itself, not in the contained workload.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Both functions declared a single struct pointer and stepped it with
free(cur++), as if the underlying buffer were an array of structs.
The data structure is actually a NULL-terminated array of POINTERS to
struct (parseOCIhook calloc's hooklist as sizeof(struct hook_execvpe *)
slots, and parseOCIlinuxsysctl does the same for sysctl). After freeing
the first entry the iterator advanced by sizeof(struct) into garbage and
either crashed there or, more commonly, returned a non-NULL pointer that
fed garbage into free_oci_envp where the actual segfault (or x86 #SS
stack-segment trap) manifested.

The bug was latent until the new "abort container creation on failed
prestart hook" path started triggering free_opts -> free_hooklist with
a non-empty hook chain. The OCI runtime-tools hooks.t test reproduces
it deterministically.

Mirror free_devices, which already iterates correctly via struct
mknod_args ** with *(cur++).

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
…fsets

BPF jt/jf are uint8_t (max forward jump 255 instructions). The previous
filter generation emitted one LD syscall_nr followed by N JEQs and a
single RET action per OCI syscalls[] group. With more than ~254 names
the JEQs at the front of the group needed a jt larger than 255 to
reach the trailing RET; the offset silently wrapped modulo 256, sending
matched syscalls into the middle of the next group instead of to the
ALLOW return.

podman's default seccomp profile has one ALLOW group with 370 names;
"execve" sat at filter index [061] with jt=16 (truncated from 272),
landing on another group's JEQ instead of the RET. The kernel then
fell through to the default RET ERRNO(38) and "execve" came back as
ENOSYS, leaving the container init spinning on the same ENOSYS-blocked
exit(2) syscall.

Split each group into chunks of at most SECCOMP_CHUNK_NAMES (240)
names. Each chunk emits its own LD syscall_nr + JEQs + (group's args,
duplicated) + RET action, so the maximum forward jump stays comfortably
below 255 even for the worst case (full chunk with several args).

Also drop the stray "assert(idx = start_rule_idx)" assignment-bug that
masked size mismatches; the rewritten loop uses correct equality
asserts on the per-chunk landmarks.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
OCI runtime-spec >= 1.0.2 mandates that a missing entry in
process.capabilities (bounding, effective, inheritable, permitted,
ambient) be equivalent to an empty list, i.e. no capabilities.
parseOCIcap was returning JAIL_CAP_ALL on a NULL blob, which combined
with applyOCIcapabilities's "if (set != JAIL_CAP_ALL)" short-circuit
caused the entire drop / set step to be skipped for that capset, so
the container kept the runtime's full inherited set.

The visible symptom: "podman run --cap-drop ALL" silently granted the
full bounding+permitted+effective set because podman sends the OCI
shorthand `"capabilities": {}` for the drop-all case and relies on the
spec's "missing = empty" semantics. Tested explicitly with
`grep ^Cap /proc/self/status` inside containers running each of:
default, --cap-drop ALL, --cap-drop ALL --cap-add NET_BIND_SERVICE,
and --cap-add SYS_ADMIN; all four now match crun's output.

The same parser is used for the legacy procd `-C <file>` path. The OCI
contract has always applied: every /etc/capabilities/*.json shipped by
OpenWrt base and packages-feed (wpad, ntpd, simple-captive-portal,
go2rtc, adguardhome, dnsproxy, rsyncd, snort3, dbus) already lists all
five sets explicitly. Permissive handling of missing sets was a bug,
not a feature.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 28, 2026

I'm making one small step after the other. Upgrading all podman dependencies helped a bit. I needed to patch a build error in conmon but that was easily fixable. (I'm still using the 1s delay for UXC to work around the race condition)

Using the latest tools I now am able to start a container and it actually runs. Well the container does. The app inside not so much.

When trying to get logs I get the error

jail: exec-ing /docker-entrypoint.sh
jail: failed to execve /docker-entrypoint.sh: Function not implemented

Note that /docker-entrypoint.sh is not static. It is the entrypoint specified by the Dockerfile and therefore changes with the container used.

Note that I am not able to recover from this error. My entire shell session freezes and I can't use killall to end podman or conmon.

@dangowrt
Copy link
Copy Markdown
Member Author

There is an architectural problem with how seccomp is applied which break foreign-libc containers (ie. OpenWrt with musl and container eg. with glibc). I've resolved that and tons of other issues, currently last mile of testing, going to push to this branch in the next hours with validated podman working. Meanwhile, please open a PR to update podman in case you didn't do that already and there isn't any existing PR for that (I've opened the PR to update conmon earlier this morning)

@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 28, 2026

There is an architectural problem with how seccomp is applied which break foreign-libc containers (ie. OpenWrt with musl and container eg. with glibc). I've resolved that and tons of other issues, currently last mile of testing, going to push to this branch in the next hours with validated podman working. Meanwhile, please open a PR to update podman in case you didn't do that already and there isn't any existing PR for that (I've opened the PR to update conmon earlier this morning)

I see. Glad to hear that. I haven't actually updated Podman itself. It is up to date. In addition to conmon I've update the netavark/aardvark-dns stack, catatonit and crun (since it was also out of date and I was at it)

I'll look into opening PRs

P.S.: Your PR does not include a second patch conmon required in my build env at least. I got a GCC false positive on -Wstringop-overread in ctr_logging.c. If your build works though that may have been an issue on my side, but I wouldn't know which.

@dangowrt
Copy link
Copy Markdown
Member Author

@Juff-Ma Please retry with the changes I've pushed now. Don't forget to also apply the patch for procd's Makefile I've posted in #37 (comment)

@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 28, 2026

@Juff-Ma Please retry with the changes I've pushed now. Don't forget to also apply the patch for procd's Makefile I've posted in #37 (comment)

@dangowrt IT IS ALIVE. Oh my god this is amazing. I still included my cgroups patch just to be sure but it absolutely works now. I can run containers and their services are reachable.

The work you've done to get this working is incredible.

I found 2 little issues (that do not impact day to day operations, at least for me, but I still wanna name them).

  • When using -it the console (at least my SSH connection) still hangs up. I haven't found a way to disconnect from the container
  • UXC still seems to report a file not found error sometimes (when podman deletes containers maybe?) but I have not found any actual problems from this behaviour. Maybe it is a case of too much verbosity by default in a recoverable state.

P.S.: The error really does occur when killing or deleting containers. I don't know why though. Killing never works and rm-ing will error out the first time. Still not an issue for me.

dangowrt added 7 commits May 29, 2026 03:40
Until now any container whose OCI spec listed CLONE_NEWNET caused ujail
to spawn a per-jail ubusd and netifd inside /var/containers/. That is
the OpenWrt-native uxc UX (jail-side wiring from /etc/config/network)
but actively fights OCI orchestrators: podman, docker and friends set
up the netns and addressing themselves via netavark / CNI / CDI, then
expect a fresh empty netns inside the container. The parallel netifd
inside the jail was both unwanted and a permanent leak — its lifecycle
hooked into poststop() only, while every error path through
free_and_exit() left ubusd and netifd as orphans of procd.

Two new OCI annotations make the behaviour explicit:

  org.openwrt.procd.ubus    = "true"  -> spawn jail-private ubusd
  org.openwrt.procd.netifd  = "true"  -> spawn jail-private ubusd + netifd
                                         (netifd implies ubus)

Both default false, so OCI orchestrators get the netns podman/CNI/CDI
prepared, undisturbed. Native uxc rc-script flows that want the old
behaviour set the annotations when generating the bundle.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
cgroups_free() walked the cgvals AVL tree and freed cgroup_path but did
not clear the `initialized` flag, leaving cgroup_path and avl entries
as dangling pointers. The parent exit path calls cgroups_free() twice:

  free_and_exit()                     // jail.c
      cgroups_free()                  // first call: real teardown
      free_opts(parent=true)
          cgroups_free()              // second call: re-walks freed tree
              -> double-free
              -> mallocng's get_meta() faults reading "erged" bytes from
                 a recycled chunk's metadata field
              -> SIGSEGV at libc.so+0x45bc4 in the ujail process

Reproduces with any OCI container (alpine /bin/true, /bin/ls, etc.).
ASAN's heap reshuffling masked the bug because the recycled chunk no
longer overlapped the avl bookkeeping; UBSAN did not mask it because it
does not touch the allocator. ptrace attach also masked it.

The fix: early-return if not initialized, then NULL cgroup_path and
clear `initialized` after the actual teardown, so a second call is a
no-op. With the fix, /bin/true under uxc runs cleanly across 30
consecutive runs, capability and seccomp behaviour vs crun stays 8/8
MATCH, and no per-jail ubusd/netifd leaks.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
When podman runs with -t, conmon binds a SOCK_STREAM listener at
/var/tmp/conmon-term.XXX and execs the runtime with --console-socket
pointing at it. conmon closes the listener as soon as it sees the runtime
process exit, so the runtime must connect *during* its own lifetime.

For crun/runc this is fine: the runtime does posix_openpt + connect +
sendmsg within milliseconds. For uxc the chain is uxc -> ubus to procd
-> spawn ujail -> clone3 child -> build_jail_fs -> create_dev_console;
by the time ujail's child reaches send_console_fd the listener is
already gone and connect() returns ECONNREFUSED, causing podman to fail
with "Runtime did not set up terminal".

Connect in the parent ujail process immediately after CLI parsing
(before parseOCI and clone3), then pass the fd number to the child
through opts.console_socket. parse_inherited_console_fd in
open_console_sock already handles a numeric spec, so the existing
sendmsg path in the child just reuses the inherited fd. The fd is
CLOEXEC-cleared so it survives clone3.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Two related cosmetic-but-visible fixes for podman -t containers under uxc:

  1. create_dev_console opened the PTY slave and dup2'd it to 0/1/2 but
     never made it the controlling terminal. Bash inside the container
     printed "cannot set terminal process group (-1): Inappropriate
     ioctl for device" and "no job control in this shell". Call setsid
     (best effort; PID 1 in a fresh CLONE_NEWPID is already a session
     leader) and ioctl(TIOCSCTTY) so the slave becomes the workload's
     controlling terminal.

  2. Two INFO() prints fired AFTER stdio had been dup'd to the PTY
     slave, so "jail: exec-ing bash" and "using guest console ..."
     ended up on the user's terminal before the workload could write
     its prompt. Demote both to DEBUG so they only surface with -d > 0.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Two related fixes that together let podman -t containers run cleanly
under uxc, including normal exit detection.

1) Allocate the PTY in the parent ujail (host mount NS) instead of the
   child (newinstance devpts NS).

   conmon receives the master fd via SCM_RIGHTS and the slave name as
   iovec data; it then opens the slave by name in its own mount NS. When
   ujail allocated the PTY inside the child after mounting newinstance
   devpts, the slave name was only valid in the child's NS, so conmon
   couldn't open it. Moved posix_openpt + grantpt + unlockpt + slave
   open + send_console_fd into the parent path right before clone3, and
   inherit the slave to the child via CLOEXEC-cleared console_slave_fd.

   create_dev_console in the child is now just the bind-mount of the
   inherited slave at /dev/console (via /proc/self/fd/N to survive
   pivot_root) plus setsid + TIOCSCTTY + dup2 on the inherited fd.

2) Send SIGCHLD to conmon when the parent ujail exits.

   conmon's exit-detection is driven by SIGCHLD: on signal it runs
   waitpid, then if ECHILD it probes kill(container_pid, 0). For
   runc/crun the runtime is conmon's child so SIGCHLD arrives naturally
   when the container init dies. For uxc the runtime (and ujail) are
   spawned by procd, not by conmon, so conmon never sees SIGCHLD and
   the exit file never gets written; podman sits at "Up Xs" forever.

   Read conmon's pidfile (sibling of opts.pidfile) at free_and_exit and
   kill(conmon, SIGCHLD); conmon's existing kill-probe path then
   detects the dead container_pid and writes the exit file.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
After commit c85ae2e (PTY allocated in parent, slave opened in child
early in build_jail_fs), the post-clone3 init reaches post_jail_fs
right after create_dev_console has installed the PTY slave as the
controlling terminal via setsid + TIOCSCTTY. The resulting session
change triggers a kernel-delivered signal that the catch-all handler
installed by signals_init intercepts, and because that handler is
registered without SA_RESTART the blocking read on pipes[2] returns
EINTR. The old code treated that as an error, the init died with
'can't read from parent', the parent's free_and_exit then removed
container.<name>, and the subsequent uxc start lookup returned ENOENT
(visible to podman as 'uxc error: No such file or directory' and
'/sbin/uxc start ... failed: exit status 254').

Wrap both pipe sync reads in EINTR retry loops so the init blocks
until the parent actually writes 'O' / '!'.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Refactor uxc_attach to a single goto-out cleanup path so each new
error branch can't drift out of sync with the resource-release list.
Map the UBUS_STATUS_NOT_SUPPORTED reply from procd's
container_handle_console (returned when the container was created
without process.terminal=true) to a specific message telling the
user to recreate with console=true, instead of the opaque
"ubus request failed".

Drop the two stray setup_tios() calls on the PTY master/server fds
(raw termios on a PTY endpoint is meaningless; only the local
/dev/tty needs it) and gate the tcsetattr() restore on whether
setup actually took effect.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
@dangowrt
Copy link
Copy Markdown
Member Author

@Juff-Ma console issues with podman should now also be fixed, please retry and report if you find the time. Thank you!

@Juff-Ma
Copy link
Copy Markdown

Juff-Ma commented May 31, 2026

@Juff-Ma console issues with podman should now also be fixed, please retry and report if you find the time. Thank you!

If you mean the commits up to 280a8bb I wasn't really able to test them (I already tried when they were pushed) since I've been encountering a new issue. I though that I had left a comment but it seems it didn't go through (why github?)

On some containers I get a jail: failed to execve /docker-entrypoint.sh: File not found error. I don't know why it happens.

Here are the two containers I used:

  • nginxdemos/hello:latest - worked
  • coturn/coturn:4 or coturn/coturn:4-alpine - did not work

execvpe() in musl performs its PATH search with getenv("PATH") against
the calling process's own environment, not the envp argument handed to
it. ujail runs with procd's PATH (the build-time EARLY_PATH,
/usr/sbin:/usr/bin:/sbin:/bin), so a relative entrypoint that lives only
in a directory present in the container's PATH but not ujail's - e.g.
coturn's docker-entrypoint.sh in /usr/local/bin - is never found and the
exec fails with ENOENT ("failed to execve docker-entrypoint.sh: No such
file or directory"). Images whose entrypoint resolves to /bin or
/usr/bin happened to work because those dirs are in both paths.

Point environ at the freshly built container envp right before
execvpe() so the PATH lookup honours the container's PATH.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UXC: OCI version mismatch error

4 participants