procd: add support for OCI runtime spec 1.3.0#37
Conversation
|
@dhewg would appreciate your review! |
|
Nice, will take a look! |
Use size_t instead of int, in order to avoid garbage values being used inside calloc_a. Signed-off-by: Felix Fietkau <nbd@nbd.name>
|
@dangowrt While this PR does indeed fix part of my issue in #35, the part I included in the edit does this happen. When actually trying to run a OCI container UXC will throw The directory in the uxc json file is present and filled with files: The command is just |
_add_mount() returned 1 (which happens to equal EPERM) for any duplicate target in the mounts AVL tree. The loops in parseOCIlinux() that register linux.readonlyPaths and linux.maskedPaths treat any non-zero return as fatal, so a config.json with a repeated entry (e.g. podman emits /sys/devices/virtual/powercap in its default maskedPaths *and* appends it again as the CVE-2020-8694 mitigation) aborted parseOCI() with the bogus diagnostic parsing of OCI JSON spec has failed: Operation not permitted (1) Compare the incoming registration against the existing entry instead. If source, filesystemtype, optstr, mountflags, propflags, error and inner all match, the call is idempotent and returns 0. Any divergence returns EEXIST so a conflicting re-registration at the same target is rejected loudly rather than silently shadowed by whichever entry happened to land in the AVL first. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The version check rejected anything that did not start with "1.0", blocking every bundle authored against spec 1.1.0 or later. OpenWrt is a monolithic build where every container package must conform to the distribution's current spec, so there is no need to keep separate code paths for the 1.0.x semantics that 1.1.0 changed (e.g. poststart hook failures becoming fatal, prestart's role overlapping createRuntime). Tighten the prefix match to require any "1.x" with x >= 1. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
@dangowrt Your fix does indeed resolve the EPERM issue however it still does not work. Instead of error 1 EPERM I now get error 22 Invalid Argument. UXC stills ays
and this in the listing:
config file/command is the same. Note that this also happens when running uxc start manually, so it's not an invalid argument to the uxc command itself (which I originally thought) The issue for this happening seems to be seccomp. Since when disabling seccomp the error disappears when using UXC directly. Now it can run the container. However when trying to attach via Another point is, it really only works when using uxc directly. When it is execut by podman I get the following error still: Also note that despite what's being said, the container is NOT removed from uxc. Edit: here's the full podman log:Edit 2: After recompiling OpenWRT with seccomp support and rerunning the error still persists but I get a new error message:
However inspecting it shows that it is indeed present just with a slightly different name
It looks like a) podman is giving out bas paths that UXC can't find or b) UXC just uses a bad name. Could this be a cgroup v1/v2 thing? According to strings and some googling ujail searches for memory.swap_max which is cgroup v1 only yet the cgroup version that's running is v2 which uses memory.swap.max instead. |
ed1ba35 to
14c4b4a
Compare
|
@Juff-Ma I was wrongly assuming that podman, runc and crun would implement the CLI as defined in the spec, but turns out that part of the spec is abandonned and it become more of just a convention. Now regarding the seccomp problem you were seeing: I've reworked the seccomp support for now also work with "foreign" libc containers (ie. OpenWrt host being musl, container being eg. glibc or bionic). This needs a small change to the procd package Makefile as well, see below diff --git a/package/system/procd/Makefile b/package/system/procd/Makefile
index cf730a3c0a..c508a447c6 100644
--- a/package/system/procd/Makefile
+++ b/package/system/procd/Makefile
@@ -30,12 +30,14 @@ PKG_CONFIG_DEPENDS:= \
include $(INCLUDE_DIR)/package.mk
include $(INCLUDE_DIR)/cmake.mk
+include $(INCLUDE_DIR)/kernel.mk
ifeq ($(DUMP),)
STAMP_CONFIGURED:=$(strip $(STAMP_CONFIGURED))_$(shell echo $(CONFIG_TARGET_INIT_PATH) | $(MKHASH) md5)
endif
-CMAKE_OPTIONS += -DEARLY_PATH="$(TARGET_INIT_PATH)"
+CMAKE_OPTIONS += -DEARLY_PATH="$(TARGET_INIT_PATH)" \
+ -DNOLIBC_INCLUDE_DIR="$(LINUX_DIR)/tools/include/nolibc"
define Package/procd/Default
SECTION:=base |
|
@dangowrt I've tested your build and it has the same problem I reported in Edit 2 of my previous message. The seccomp error (while I appreciate it being fixed) was easily solvable by just compiling OpenWRT with seccomp support. Through a bit of grepping and googling I eventually made this simple one line patch but still got an error. I could try combining my change with your patchset and see if that fixes anything. |
The OCI 1.0.0 spec field process.consoleSize was never parsed, so bundles requesting a specific initial pty size silently got the kernel default. Add the field to the process policy and call ioctl(TIOCSWINSZ) on the master fd after unlockpt when both height and width are set. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.0.0 spec field linux.personality was never parsed, so bundles asking for LINUX32 on a 64-bit kernel ran with the host personality instead of being rejected. Parse the field and accept only when 'domain' matches the kernel's native personality (read via personality(0xFFFFFFFF)); return ENOTSUP otherwise, since OpenWrt does not support cross-personality execution. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 spec field process.scheduler was unsupported, so bundles selecting SCHED_FIFO/RR/DEADLINE/etc. fell back to the inherited host policy. Parse the field and apply via sched_setattr just before execve; direct syscall used since musl/glibc lack a wrapper. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 spec field process.ioPriority was unsupported, so block I/O priority defaulted to whatever the parent inherits. Parse class and priority and apply via ioprio_set just before execve, using a direct syscall. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
I can not report success. The error about the missing cgroup file is gone but (once again) I still receive a |
The OCI 1.1.0 spec field domainname was unsupported, so containers ran with the host's NIS domain name regardless of bundle configuration. Parse the field and apply via setdomainname after sethostname inside the new UTS namespace. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 spec field linux.timeOffsets was unsupported, so the container saw the host's CLOCK_MONOTONIC and CLOCK_BOOTTIME values even when a time namespace was created. Parse the per-clock offsets and write them to /proc/self/timens_offsets early in the cloned child, while it is still the only process in the new time namespace. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Bundles using the deprecated prestart hook had it parsed but never run, so any setup it performed was silently dropped. Per OCI spec 1.1.0 prestart and createRuntime fire at the same point; parse it into its own hooklist and run it before createRuntime via a new post_prestart callback in the existing chain. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Per OCI runtime spec 1.1.0 a poststart hook failure now MUST cause the runtime to stop the container and proceed to the delete phase, instead of the 1.0.x "log a warning and continue" behaviour. Track aggregate hook chain failure across run_hooks invocations and, in the poststart return callback, skip the running phase and tear the container down. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 cgroup-v2 field linux.resources.cpu.burst was unsupported, so bundles requesting CFS bandwidth burst accumulation got no burst. Add the field to the cpu policy and write the value to cpu.max.burst. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 cgroup-v2 field linux.resources.cpu.idle was unsupported. Add the field to the cpu policy and write the value to cpu.idle, which schedules tasks in the cgroup as SCHED_IDLE. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The OCI 1.1.0 field linux.resources.memory.checkBeforeUpdate was
silently dropped through the unknown-attr path. The check it gates
against ("reject a new limit lower than current usage") only matters
for an OCI 'update' operation, which the runtime spec does not define
and which uxc does not implement. Add the field to the memory policy
so blobmsg_parse recognises and discards it explicitly; it would be
applied if uxc ever grew an update operation.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The pids parser rejected absent limit with EINVAL and treated the value as unsigned, so OCI 1.3.0 bundles relying on optional pids.limit or using -1 for unlimited failed at create. Skip silently when absent and write "max" to pids.max for any negative value. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The seccomp violation handler used PTRACE_POKEUSER (and PTRACE_SET_SYSCALL
on arm) to overwrite the tracee's syscall number register with -1 on
x86_64, i386, arm, mips and ppc, causing the kernel to skip the offending
syscall with -ENOSYS. aarch64 and loongarch64 already lacked this poke
and let the syscall proceed transparently. The behaviour was therefore
inconsistent across architectures, and the tool's purpose -- collecting
the full set of syscalls a program issues so an allowlist can be derived
afterwards -- is defeated by the cancellation: the first non-whitelisted
call returns ENOSYS, the program typically falls into an error path or
exits, and the recorded profile is incomplete.
Drop the cancellation on all architectures so SECCOMP_TRACE mode just
observes and logs. After the change every supported arch behaves like
aarch64/loongarch64 already did.
The remaining UTRACE mode (PTRACE_SYSCALL path) was already a pure
observer and is unchanged.
Why removing the poke is safe (verified against Linux mainline):
seccomp(2) defines SCMP_ACT_TRACE / SECCOMP_RET_TRACE such that "if
the tracer does not change the system call number, the system call
will proceed normally." All four arches that we used to poke implement
exactly that contract; the cancellation only worked because procd
actively wrote -1 into the syscall register. Concretely:
- x86_64: arch/x86/entry/syscall_64.c:do_syscall_64() invokes the
syscall via do_syscall_x64(); with nr unchanged the table dispatch
runs as usual. The skip path is gated on "nr == -1".
- i386: arch/x86/entry/syscall_32.c:do_syscall_32_irqs_on() has the
same "else if (nr != -1)" structure.
- arm: arch/arm/kernel/entry-common.S after syscall_trace_enter does
"cmp scno, #-1; bne 2b" -- only -1 skips, anything else dispatches.
- mips: arch/mips/kernel/scall*-*.S has "bltz v0, 1f" after
syscall_trace_enter; without a negative override, the syscall is
invoked normally.
- powerpc: arch/powerpc/kernel/syscall.c uses the trace_enter return
value as the syscall number; if unmodified, the dispatcher runs.
No arch re-reads or invalidates the syscall number on its own when
SECCOMP_RET_TRACE fires, so leaving the register alone is the
documented "let it proceed" path on every supported target.
Real seccomp enforcement (SCMP_ACT_KILL / SCMP_ACT_ERRNO) is performed
inside the kernel by the BPF program and is unaffected by this change;
seccomp-trace is and now consistently is a pure profiling tool.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
REG_SYSCALL was carried over from the Chromium seccomp example header, where it is consumed by a SIGSYS handler that reads the offending syscall's register state from the signal context. procd never installed such a handler and never referenced the macro: a tree-wide search across the entire git history finds zero uses outside of the per-arch definitions in this header itself. The seccomp BPF program reads the syscall number from struct seccomp_data, and trace.c uses its own arch-specific reg_syscall_nr offsets (or PTRACE_GET_SYSCALL_INFO). Remove the macro from every arch branch and from the fallback. No functional change. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Without an __riscv branch the header fell into the generic #else block, emitted "Platform does not support seccomp filter yet" and set ARCH_NR to 0. parseOCIlinuxseccomp() then either rejected the bundle (when the container declared SCMP_ARCH_RISCV64 -- AUDIT_ARCH_RISCV64 never matched 0) or, with no architectures listed, generated a BPF program whose arch gate compared against 0 and killed every syscall. seccomp filters could not be applied on RISC-V at all. Map __riscv (xlen == 64) to AUDIT_ARCH_RISCV64 so the existing SCMP_ARCH_RISCV64 mapping in seccomp-oci.c can take effect and the generated BPF program identifies the running architecture correctly. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
The seccomp arch resolver did not know SCMP_ARCH_RISCV64, so OCI 1.1.0 bundles targeting RISC-V 64-bit were rejected with "unknown seccomp architecture". Map it to AUDIT_ARCH_RISCV64. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
container_handle_exec previously fork()ed the grandchild and then,
back in the ujail parent, wrote the grandchild's host pid into
cgroup.procs via cgroups_attach_pid(). The grandchild was in the
container's namespaces from the first instruction but in ujail's
own cgroup for the sub-millisecond window between fork() and the
parent's write - a noted residual gap.
Open the container's cgroup directory once with O_PATH |
O_DIRECTORY | O_CLOEXEC via cgroups_open_dir(), let it inherit
into the exec_child intermediate, and replace exec_child's inner
fork() with clone3(CLONE_INTO_CGROUP) carrying the cgroup fd in
clone_args.cgroup. The kernel places the grandchild in the
container's cgroup atomically with its creation; the post-fork
cgroups_attach_pid() call goes away.
cgroups.{c,h} grows cgroups_open_dir() that returns
open(cgroup_path, O_PATH | O_DIRECTORY | O_CLOEXEC) or -1 when
cgroup_path is unset. Cleanup paths close the fd in both the
ujail parent's success-after-fork path and the 'out:' error
label.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
cgroups_create() now runs ahead of clone3, so the cgroup tree (mkdir_p + subtree_control writes) exists before any task is created. clone3 then takes CLONE_INTO_CGROUP plus a cgroup directory fd via cgroups_open_dir(), and the kernel places the container init in the destination cgroup atomically with its creation. The post-fork sequence becomes cgroups_configure() (controller value writes + ebpf attach) followed by cgroups_attach_pid() which is now a no-op for the INTO_CGROUP path - the init is already a member - and serves as the fallback when cgroups_open_dir() returned -1 (a deployment where the cgroup tree was unreachable at clone3 time and clone3 ran without INTO_CGROUP). cgroups_apply() stays available for container_handle_update which legitimately wants the full create + configure + attach sequence against an existing container. Kernel 5.7+ for CLONE_INTO_CGROUP, well below the 6.18 target. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
OCI does not standardise a key for Memory-Deny-Write-Execute. Read 'org.openwrt.ujail.mdwe' from the bundle annotations, expect a comma-separated value of refuse_exec_gain[,no_inherit], and call prctl(PR_SET_MDWE, flags) on both the container init (after PR_SET_NO_NEW_PRIVS and the capability drop in post_start_hook) and the exec grandchild (right after its NNP step). The container init prctl is fatal on failure; the exec grandchild prctl exits 127 like the surrounding privilege-drop steps. A workload that legitimately mmaps RWX (JITs) should leave the annotation unset, or pass refuse_exec_gain,no_inherit so the bit is dropped across exec into trusted helper binaries that need W+X. The MDWE bit propagates across fork unconditionally. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
New self-contained jail/landlock.{c,h} wraps the three Landlock
syscalls (landlock_create_ruleset, landlock_add_rule,
landlock_restrict_self) and exposes a small struct
landlock_config that callers grow path-by-path. landlock_available()
caches the ABI probe; landlock_apply() builds and installs the
ruleset.
Activation is opt-in via three bundle annotations:
org.openwrt.ujail.landlock.ro = '<path>[:<path>]*'
org.openwrt.ujail.landlock.rx = '<path>[:<path>]*'
org.openwrt.ujail.landlock.rw = '<path>[:<path>]*'
Each adds a LANDLOCK_RULE_PATH_BENEATH rule with the corresponding
access mask (READ_FILE | READ_DIR; the same plus EXECUTE; the
read/write/create/remove set). With no annotation present
landlock_apply() is skipped, preserving runc-compatible
allow-all behaviour. As soon as any rule is registered the runtime
forces opts.no_new_privs = 1, because landlock_restrict_self
requires NNP.
The apply call sits in post_start_hook after pivot_root (paths
must resolve against the post-pivot view) and before
applyOCIlinuxseccomp, so denied paths fail before the workload
runs and the seccomp filter still loads on top. CMakeLists.txt
links jail/landlock.c into ujail.
Reverse-DNS prefix follows the org.openwrt.* convention used by
other ujail-specific bundle annotations (mdwe, systemd-cgroup
hint).
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Copy libpreload-seccomp.so into a memfd backed by MFD_EXEC | MFD_ALLOW_SEALING, apply F_SEAL_WRITE|GROW|SHRINK, and point LD_PRELOAD at /proc/self/fd/<N>. The workload sees a tamper-evident, read-only copy of the helper. The on-disk .so no longer needs to be visible inside the jail rootfs; only its NEEDED shared-library dependencies (libc, libubox, libblobmsg_json, libdl) are bind-mounted, which preload_load_deps() resolves by walking the dynamic section of the on-host copy. This works for both OCI containers and classic OpenWrt procd jails since both flows construct the environment via build_envp(); the former always mounts procfs, and the latter does so whenever the jail is launched with namespace separation. The memfd is created with MFD_EXEC (not MFD_NOEXEC_SEAL) because ld.so maps the file with PROT_EXEC; likewise F_SEAL_EXEC is not applied for the same reason. close_range() before execve would otherwise close the memfd via CLOSE_RANGE_CLOEXEC, so the descriptor flag is cleared just before the exec. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Compile the seccomp JSON into a cBPF program inside ujail (and inside
seccomp-trace), then write the struct sock_filter[] array into a
sealed memfd created with MFD_ALLOW_SEALING and locked down with
F_SEAL_WRITE | F_SEAL_GROW | F_SEAL_SHRINK. The descriptor number is
passed to the workload via the SECCOMP_BPF_FD environment variable;
close_range() before execve marks it CLOEXEC, so the flag is cleared
just like the preload memfd itself.
libpreload-seccomp.so shrinks to a single constructor that mmaps the
memfd, calls prctl(PR_SET_NO_NEW_PRIVS) and seccomp(SET_MODE_FILTER),
and returns. The __libc_start_main / __uClibc_main interception trick
is gone; the constructor mechanism is universal across glibc, musl,
uClibc and bionic, so the preload is no longer wedded to the libc the
helper itself was built against.
Side effects of moving JSON parsing out of the workload:
- the .so no longer needs libubox, libblomsg_json or libdl, so
those bind-mounts vanish from the jail rootfs entirely (libc is
still required, as the workload uses it),
- SECCOMP_FILE and SECCOMP_DEBUG are retired from the workload-side
env contract; the BPF blob is now the only thing the workload
sees,
- jail/seccomp.c and jail/seccomp.h are removed since
install_syscall_filter() lived solely to bridge the old contract,
- seccomp-trace gains a dependency on jail/seccomp-oci.c so it can
compile the JSON itself; the contract between procd and
seccomp-trace (SECCOMP_FILE env var or -f arg) is unchanged.
The filter is still armed inside the workloads address space and only
after ld.so has finished its work, which preserves the property that
the cBPF program does not have to whitelist execve, the dynamic
linker bootstrap, or libc init.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Add a NOLIBC_INCLUDE_DIR CMake variable that the OpenWrt procd recipe can point at the kernel tree (typically $(LINUX_DIR)/tools/include/nolibc). When the variable is set, libpreload-seccomp.so is compiled with -nostdlib -ffreestanding -fno-builtin -include nolibc.h and linked with -nostdlib, leaving the resulting shared object with no DT_NEEDED entries at all. The system libc and libdl no longer have to be visible to the helper, which closes the only remaining "two libcs in one address space" trap when a workload built against a different libc than ujail itself is run under classic procd-jail with -S. When the variable is not set the helper continues to link against the system libc exactly as before; the same preload.c builds in both modes. A #ifndef NOLIBC guard around the system-header block prevents clashes between nolibcs static-inline definitions and the system declarations of getenv / mmap / prctl / etc. The unsetenv call that previously cleared SECCOMP_BPF_FD from the workloads environment is dropped, both because nolibc does not provide unsetenv and because the now-closed fd number is harmless to anything that does not already know about the SECCOMP_BPF_FD contract. A comprehensive top-of-file comment in jail/preload.c documents the contract, the constructor-over-__libc_start_main rationale, the sealed memfd handoff, and the two build modes, so future maintainers and auditors do not have to reverse-engineer the design from the build system. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Broadcast two new ubus events from ujail for OCI lifecycle visibility:
instance.ready payload: { service, instance }
emitted from post_create_runtime once
jail_oci_state has transitioned to OCI_STATE_CREATED
(i.e. the OCI pidfile has been written and the
container is parked at the start barrier).
instance.running payload: { service, instance }
emitted from exec_ack_cb after the kernel closes
the init's O_CLOEXEC end of a new exec_ack pipe on
a successful execve. uxc can wait on this event to
return synchronously from "uxc start".
The exec_ack pipe is opened in post_main with pipe2(O_CLOEXEC) so the
init inherits the CLOEXEC write end. The parent closes its copy
immediately after clone3 and registers a uloop_fd callback on the
read end; EOF (the kernel auto-closing the CLOEXEC fd at execve) is
the success signal.
Also fix a latent initialiser bug: jail_oci_state was statically
initialised to OCI_STATE_CREATED, which would falsely satisfy
handle_start's "must be created to start" guard if any caller raced
the post_main / post_create_runtime sequence. Initialise to
OCI_STATE_CREATING instead.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
uxc_create, uxc_start and uxc_delete now block until the relevant
lifecycle event arrives on ubusd, rather than returning as soon as
the triggering ubus.invoke replies:
create arms instance.ready watcher for the container, invokes
container.add, waits up to 30 s for ujail to broadcast
instance.ready (i.e. pidfile written, OCI state CREATED).
start arms instance.running watcher, invokes container.<name>.start,
waits up to 30 s for ujail to broadcast instance.running
after the container init's execve.
delete arms a ubus.object.remove watcher for the container.<name>
path, invokes container.<name>.delete, waits up to 30 s for
ubusd to broadcast the object's removal (ujail's exit
unregisters it).
The event handler is registered BEFORE the triggering ubus.invoke to
avoid losing a fast-arriving event. procd is otherwise uninvolved in
this synchronisation; existing ubus consumers see no change.
This closes the Podman race ("pidfile not yet created" when uxc
returned) without procd having to grow any new state machinery.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
signals_init() installed jail_handle_signal on every catchable signal except SIGCHLD/SIGPIPE/SIGSEGV/SIGSTOP/SIGKILL. When the kernel delivered a synchronous fault signal (SIGBUS, SIGFPE, SIGILL, SIGSYS, SIGABRT, SIGTRAP) the handler forwarded it to the contained process and returned. The kernel then restarted the faulting instruction, re-raised the same signal, and the loop continued forever; the bug manifested as ujail pinning a CPU at 100% with rt_sigreturn / SI_KERNEL SIGBUS in a tight loop. Treat the synchronous fault signals the same way SIGSEGV is already treated: leave them at the default disposition so the kernel terminates the process (and produces a core dump if enabled). Forwarding a synchronous fault to the contained process is meaningless anyway since the fault originated in ujail itself, not in the contained workload. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Both functions declared a single struct pointer and stepped it with free(cur++), as if the underlying buffer were an array of structs. The data structure is actually a NULL-terminated array of POINTERS to struct (parseOCIhook calloc's hooklist as sizeof(struct hook_execvpe *) slots, and parseOCIlinuxsysctl does the same for sysctl). After freeing the first entry the iterator advanced by sizeof(struct) into garbage and either crashed there or, more commonly, returned a non-NULL pointer that fed garbage into free_oci_envp where the actual segfault (or x86 #SS stack-segment trap) manifested. The bug was latent until the new "abort container creation on failed prestart hook" path started triggering free_opts -> free_hooklist with a non-empty hook chain. The OCI runtime-tools hooks.t test reproduces it deterministically. Mirror free_devices, which already iterates correctly via struct mknod_args ** with *(cur++). Signed-off-by: Daniel Golle <daniel@makrotopia.org>
…fsets BPF jt/jf are uint8_t (max forward jump 255 instructions). The previous filter generation emitted one LD syscall_nr followed by N JEQs and a single RET action per OCI syscalls[] group. With more than ~254 names the JEQs at the front of the group needed a jt larger than 255 to reach the trailing RET; the offset silently wrapped modulo 256, sending matched syscalls into the middle of the next group instead of to the ALLOW return. podman's default seccomp profile has one ALLOW group with 370 names; "execve" sat at filter index [061] with jt=16 (truncated from 272), landing on another group's JEQ instead of the RET. The kernel then fell through to the default RET ERRNO(38) and "execve" came back as ENOSYS, leaving the container init spinning on the same ENOSYS-blocked exit(2) syscall. Split each group into chunks of at most SECCOMP_CHUNK_NAMES (240) names. Each chunk emits its own LD syscall_nr + JEQs + (group's args, duplicated) + RET action, so the maximum forward jump stays comfortably below 255 even for the worst case (full chunk with several args). Also drop the stray "assert(idx = start_rule_idx)" assignment-bug that masked size mismatches; the rewritten loop uses correct equality asserts on the per-chunk landmarks. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
OCI runtime-spec >= 1.0.2 mandates that a missing entry in
process.capabilities (bounding, effective, inheritable, permitted,
ambient) be equivalent to an empty list, i.e. no capabilities.
parseOCIcap was returning JAIL_CAP_ALL on a NULL blob, which combined
with applyOCIcapabilities's "if (set != JAIL_CAP_ALL)" short-circuit
caused the entire drop / set step to be skipped for that capset, so
the container kept the runtime's full inherited set.
The visible symptom: "podman run --cap-drop ALL" silently granted the
full bounding+permitted+effective set because podman sends the OCI
shorthand `"capabilities": {}` for the drop-all case and relies on the
spec's "missing = empty" semantics. Tested explicitly with
`grep ^Cap /proc/self/status` inside containers running each of:
default, --cap-drop ALL, --cap-drop ALL --cap-add NET_BIND_SERVICE,
and --cap-add SYS_ADMIN; all four now match crun's output.
The same parser is used for the legacy procd `-C <file>` path. The OCI
contract has always applied: every /etc/capabilities/*.json shipped by
OpenWrt base and packages-feed (wpad, ntpd, simple-captive-portal,
go2rtc, adguardhome, dnsproxy, rsyncd, snort3, dbus) already lists all
five sets explicitly. Permissive handling of missing sets was a bug,
not a feature.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
|
I'm making one small step after the other. Upgrading all podman dependencies helped a bit. I needed to patch a build error in conmon but that was easily fixable. (I'm still using the 1s delay for UXC to work around the race condition) Using the latest tools I now am able to start a container and it actually runs. Well the container does. The app inside not so much. When trying to get logs I get the error Note that /docker-entrypoint.sh is not static. It is the entrypoint specified by the Dockerfile and therefore changes with the container used. Note that I am not able to recover from this error. My entire shell session freezes and I can't use |
|
There is an architectural problem with how seccomp is applied which break foreign-libc containers (ie. OpenWrt with musl and container eg. with glibc). I've resolved that and tons of other issues, currently last mile of testing, going to push to this branch in the next hours with validated podman working. Meanwhile, please open a PR to update podman in case you didn't do that already and there isn't any existing PR for that (I've opened the PR to update conmon earlier this morning) |
I see. Glad to hear that. I haven't actually updated Podman itself. It is up to date. In addition to conmon I've update the netavark/aardvark-dns stack, catatonit and crun (since it was also out of date and I was at it) I'll look into opening PRs P.S.: Your PR does not include a second patch conmon required in my build env at least. I got a GCC false positive on |
|
@Juff-Ma Please retry with the changes I've pushed now. Don't forget to also apply the patch for procd's Makefile I've posted in #37 (comment) |
@dangowrt IT IS ALIVE. Oh my god this is amazing. I still included my cgroups patch just to be sure but it absolutely works now. I can run containers and their services are reachable. The work you've done to get this working is incredible. I found 2 little issues (that do not impact day to day operations, at least for me, but I still wanna name them).
P.S.: The error really does occur when killing or deleting containers. I don't know why though. Killing never works and rm-ing will error out the first time. Still not an issue for me. |
Until now any container whose OCI spec listed CLONE_NEWNET caused ujail
to spawn a per-jail ubusd and netifd inside /var/containers/. That is
the OpenWrt-native uxc UX (jail-side wiring from /etc/config/network)
but actively fights OCI orchestrators: podman, docker and friends set
up the netns and addressing themselves via netavark / CNI / CDI, then
expect a fresh empty netns inside the container. The parallel netifd
inside the jail was both unwanted and a permanent leak — its lifecycle
hooked into poststop() only, while every error path through
free_and_exit() left ubusd and netifd as orphans of procd.
Two new OCI annotations make the behaviour explicit:
org.openwrt.procd.ubus = "true" -> spawn jail-private ubusd
org.openwrt.procd.netifd = "true" -> spawn jail-private ubusd + netifd
(netifd implies ubus)
Both default false, so OCI orchestrators get the netns podman/CNI/CDI
prepared, undisturbed. Native uxc rc-script flows that want the old
behaviour set the annotations when generating the bundle.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
cgroups_free() walked the cgvals AVL tree and freed cgroup_path but did
not clear the `initialized` flag, leaving cgroup_path and avl entries
as dangling pointers. The parent exit path calls cgroups_free() twice:
free_and_exit() // jail.c
cgroups_free() // first call: real teardown
free_opts(parent=true)
cgroups_free() // second call: re-walks freed tree
-> double-free
-> mallocng's get_meta() faults reading "erged" bytes from
a recycled chunk's metadata field
-> SIGSEGV at libc.so+0x45bc4 in the ujail process
Reproduces with any OCI container (alpine /bin/true, /bin/ls, etc.).
ASAN's heap reshuffling masked the bug because the recycled chunk no
longer overlapped the avl bookkeeping; UBSAN did not mask it because it
does not touch the allocator. ptrace attach also masked it.
The fix: early-return if not initialized, then NULL cgroup_path and
clear `initialized` after the actual teardown, so a second call is a
no-op. With the fix, /bin/true under uxc runs cleanly across 30
consecutive runs, capability and seccomp behaviour vs crun stays 8/8
MATCH, and no per-jail ubusd/netifd leaks.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
When podman runs with -t, conmon binds a SOCK_STREAM listener at /var/tmp/conmon-term.XXX and execs the runtime with --console-socket pointing at it. conmon closes the listener as soon as it sees the runtime process exit, so the runtime must connect *during* its own lifetime. For crun/runc this is fine: the runtime does posix_openpt + connect + sendmsg within milliseconds. For uxc the chain is uxc -> ubus to procd -> spawn ujail -> clone3 child -> build_jail_fs -> create_dev_console; by the time ujail's child reaches send_console_fd the listener is already gone and connect() returns ECONNREFUSED, causing podman to fail with "Runtime did not set up terminal". Connect in the parent ujail process immediately after CLI parsing (before parseOCI and clone3), then pass the fd number to the child through opts.console_socket. parse_inherited_console_fd in open_console_sock already handles a numeric spec, so the existing sendmsg path in the child just reuses the inherited fd. The fd is CLOEXEC-cleared so it survives clone3. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Two related cosmetic-but-visible fixes for podman -t containers under uxc:
1. create_dev_console opened the PTY slave and dup2'd it to 0/1/2 but
never made it the controlling terminal. Bash inside the container
printed "cannot set terminal process group (-1): Inappropriate
ioctl for device" and "no job control in this shell". Call setsid
(best effort; PID 1 in a fresh CLONE_NEWPID is already a session
leader) and ioctl(TIOCSCTTY) so the slave becomes the workload's
controlling terminal.
2. Two INFO() prints fired AFTER stdio had been dup'd to the PTY
slave, so "jail: exec-ing bash" and "using guest console ..."
ended up on the user's terminal before the workload could write
its prompt. Demote both to DEBUG so they only surface with -d > 0.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Two related fixes that together let podman -t containers run cleanly under uxc, including normal exit detection. 1) Allocate the PTY in the parent ujail (host mount NS) instead of the child (newinstance devpts NS). conmon receives the master fd via SCM_RIGHTS and the slave name as iovec data; it then opens the slave by name in its own mount NS. When ujail allocated the PTY inside the child after mounting newinstance devpts, the slave name was only valid in the child's NS, so conmon couldn't open it. Moved posix_openpt + grantpt + unlockpt + slave open + send_console_fd into the parent path right before clone3, and inherit the slave to the child via CLOEXEC-cleared console_slave_fd. create_dev_console in the child is now just the bind-mount of the inherited slave at /dev/console (via /proc/self/fd/N to survive pivot_root) plus setsid + TIOCSCTTY + dup2 on the inherited fd. 2) Send SIGCHLD to conmon when the parent ujail exits. conmon's exit-detection is driven by SIGCHLD: on signal it runs waitpid, then if ECHILD it probes kill(container_pid, 0). For runc/crun the runtime is conmon's child so SIGCHLD arrives naturally when the container init dies. For uxc the runtime (and ujail) are spawned by procd, not by conmon, so conmon never sees SIGCHLD and the exit file never gets written; podman sits at "Up Xs" forever. Read conmon's pidfile (sibling of opts.pidfile) at free_and_exit and kill(conmon, SIGCHLD); conmon's existing kill-probe path then detects the dead container_pid and writes the exit file. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
After commit c85ae2e (PTY allocated in parent, slave opened in child early in build_jail_fs), the post-clone3 init reaches post_jail_fs right after create_dev_console has installed the PTY slave as the controlling terminal via setsid + TIOCSCTTY. The resulting session change triggers a kernel-delivered signal that the catch-all handler installed by signals_init intercepts, and because that handler is registered without SA_RESTART the blocking read on pipes[2] returns EINTR. The old code treated that as an error, the init died with 'can't read from parent', the parent's free_and_exit then removed container.<name>, and the subsequent uxc start lookup returned ENOENT (visible to podman as 'uxc error: No such file or directory' and '/sbin/uxc start ... failed: exit status 254'). Wrap both pipe sync reads in EINTR retry loops so the init blocks until the parent actually writes 'O' / '!'. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Refactor uxc_attach to a single goto-out cleanup path so each new error branch can't drift out of sync with the resource-release list. Map the UBUS_STATUS_NOT_SUPPORTED reply from procd's container_handle_console (returned when the container was created without process.terminal=true) to a specific message telling the user to recreate with console=true, instead of the opaque "ubus request failed". Drop the two stray setup_tios() calls on the PTY master/server fds (raw termios on a PTY endpoint is meaningless; only the local /dev/tty needs it) and gate the tcsetattr() restore on whether setup actually took effect. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
|
@Juff-Ma console issues with |
If you mean the commits up to 280a8bb I wasn't really able to test them (I already tried when they were pushed) since I've been encountering a new issue. I though that I had left a comment but it seems it didn't go through (why github?) On some containers I get a Here are the two containers I used:
|
execvpe() in musl performs its PATH search with getenv("PATH") against
the calling process's own environment, not the envp argument handed to
it. ujail runs with procd's PATH (the build-time EARLY_PATH,
/usr/sbin:/usr/bin:/sbin:/bin), so a relative entrypoint that lives only
in a directory present in the container's PATH but not ujail's - e.g.
coturn's docker-entrypoint.sh in /usr/local/bin - is never found and the
exec fails with ENOENT ("failed to execve docker-entrypoint.sh: No such
file or directory"). Images whose entrypoint resolves to /bin or
/usr/bin happened to work because those dirs are in both paths.
Point environ at the freshly built container envp right before
execvpe() so the PATH lookup honours the container's PATH.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
This updates
procd-jailanduxcto OCI runtime spec 1.3.0, and also (with the help of Claude) improves seccomp support (new RISCV64 arch and cleanup of tracing, needs testing!)Should fix #35