
Add missing Linux capability checks for SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority #12872

Open
petrmarinec wants to merge 1 commit into google:master from petrmarinec:fix/missing-capability-checks

Conversation

@petrmarinec

@petrmarinec petrmarinec commented Apr 5, 2026

Summary

This patch adds capability and permission checks that the Linux kernel enforces but gVisor currently omits. Each fix was verified against native Linux behavior using bazel test on both native and runsc_ptrace platforms.

Changes

1. SO_BINDTODEVICE: Add CAP_NET_RAW check

File: pkg/sentry/socket/netstack/netstack.go
Linux reference: net/core/sock.c:sock_setsockopt() checks ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW)
Evidence this is unintended: gVisor's own test suite asserts "CAP_NET_RAW is required to use SO_BINDTODEVICE" (test/syscalls/linux/socket_bind_to_device.cc:52), and SO_RCVBUFFORCE in the same file already correctly checks CAP_NET_ADMIN.

2. mknod(S_IFBLK/S_IFCHR): Add CAP_MKNOD check

File: pkg/sentry/syscalls/linux/sys_file.go
Linux reference: fs/namei.c:vfs_mknod() checks capable(CAP_MKNOD) for block/char device creation
Evidence this is unintended: CAP_MKNOD is defined (pkg/abi/linux/capability.go:56), parsed from OCI specs (runsc/specutils/specutils.go:491), and has strace formatting, but it is never checked: no HasCapability call for it exists in the codebase.

3. sched_setaffinity: Add UID match / CAP_SYS_NICE check

File: pkg/sentry/syscalls/linux/sys_thread.go
Linux reference: kernel/sched/core.c:check_same_owner() requires EUID match or CAP_SYS_NICE
Impact: Without this check, any unprivileged process could modify another process's CPU affinity mask.

4. setpriority: Add UID match / CAP_SYS_NICE check

File: pkg/sentry/syscalls/linux/sys_thread.go
Linux reference: kernel/sys.c:set_one_prio() requires UID match or CAP_SYS_NICE
Impact: Without this check, any unprivileged process could change another process's scheduling priority.
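The ownership rule shared by changes 3 and 4 can be sketched in isolation. This is a minimal model, not the code in sys_thread.go: the `creds` type and `mayActOn` function are hypothetical, and comparing the caller's EUID against both the target's EUID and real UID is an assumption based on my reading of the Linux helpers referenced above.

```go
package main

import "fmt"

// creds is a hypothetical, simplified credential set. It is not gVisor's
// credentials type.
type creds struct {
	ruid, euid uint32
	capSysNice bool // whether the holder has CAP_SYS_NICE
}

// mayActOn reports whether caller may change target's affinity or priority:
// either the caller's EUID matches the target's EUID or real UID (assumed
// comparison), or the caller holds CAP_SYS_NICE.
func mayActOn(caller, target creds) bool {
	if caller.euid == target.euid || caller.euid == target.ruid {
		return true
	}
	return caller.capSysNice
}

func main() {
	alice := creds{ruid: 1000, euid: 1000}
	bob := creds{ruid: 2000, euid: 2000}

	fmt.Println(mayActOn(alice, alice)) // true: same owner
	fmt.Println(mayActOn(alice, bob))   // false: the syscall would fail with EPERM

	alice.capSysNice = true
	fmt.Println(mayActOn(alice, bob)) // true: CAP_SYS_NICE overrides the UID mismatch
}
```

A syscall handler built on this would return EPERM whenever the function reports false.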

Testing

Tests added in test/syscalls/linux/capability_checks.cc, verified on both native Linux and gVisor:

bazel test //test/syscalls:capability_checks_test_native        → 6/6 passed
bazel test //test/syscalls:capability_checks_test_runsc_ptrace  → 4 passed, 2 skipped

The 2 skipped tests are the mknod positive cases (creating device nodes with CAP_MKNOD), which are skipped on gVisor because the sandbox does not permit device node creation regardless of capabilities.

| Test | What it verifies |
|------|-----------------|
| `SoBindToDeviceCapTest.RequiresCapNetRaw` | `EPERM` without `CAP_NET_RAW` |
| `MknodCapTest.CharDevRequiresCapMknod` | `EPERM` for `S_IFCHR` without `CAP_MKNOD` (native only) |
| `MknodCapTest.BlockDevRequiresCapMknod` | `EPERM` for `S_IFBLK` without `CAP_MKNOD` (native only) |
| `MknodCapTest.FifoDoesNotRequireCapMknod` | `S_IFIFO` succeeds without `CAP_MKNOD` |
| `SchedSetaffinityCapTest.OtherUidRequiresCapSysNice` | `EPERM` without UID match or `CAP_SYS_NICE` |
| `SetpriorityCapTest.OtherUidRequiresCapSysNice` | `EPERM` without UID match or `CAP_SYS_NICE` |

Assisted-by: Codex

@google-cla

google-cla bot commented Apr 5, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@petrmarinec force-pushed the fix/missing-capability-checks branch from bfa76ad to 9cab654 on April 5, 2026 06:55
Collaborator

@EtiennePerot EtiennePerot left a comment


Can you add syscall tests under test/syscalls/linux to exercise these and to ensure consistency with Linux?

@petrmarinec changed the title from "Add missing Linux capability checks across multiple subsystems" to "Add missing Linux capability checks for SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority" on Apr 6, 2026
@petrmarinec
Author

Tests added in test/syscalls/linux/capability_checks.cc. I also verified the changes against native Linux and gVisor (runsc_ptrace) using Bazel:

bazel test //test/syscalls:capability_checks_test_native        → 6/6 passed
bazel test //test/syscalls:capability_checks_test_runsc_ptrace  → 4 passed, 2 skipped

During testing I found and fixed a few issues:

  1. Build fix: s.HasCapability() doesn't exist on the socket.Socket interface in SetSockOptSocket. Changed to t.HasCapabilityIn(linux.CAP_NET_RAW, t.NetworkNamespace().UserNamespace()).

  2. Removed /proc/sys capability checks: After testing on native Linux, writing to tcp_sack, tcp_recovery, tcp_rmem/wmem, ip_local_port_range, and nr_open succeeded even after dropping CAP_NET_ADMIN/CAP_SYS_ADMIN. These checks did not match actual Linux behavior, so I removed them to keep the PR aligned with its goal.

  3. mknod tests: Added IsRunningOnGvisor() skip for the positive device creation cases, since the sandbox blocks device node creation regardless of capabilities. The negative tests (EPERM without CAP_MKNOD) still run on both platforms.

The PR now covers 4 verified fixes: SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority.

Comment thread test/syscalls/linux/capability_checks.cc Outdated
Comment thread test/syscalls/linux/capability_checks.cc Outdated
Comment thread pkg/sentry/syscalls/linux/sys_file.go
Comment thread test/syscalls/linux/capability_checks.cc Outdated
@ayushr2 ayushr2 requested a review from EtiennePerot April 13, 2026 18:16
@ayushr2
Collaborator

ayushr2 commented Apr 13, 2026

Could you also sign the CLA?

@petrmarinec force-pushed the fix/missing-capability-checks branch from bba2f3a to e0dd022 on April 13, 2026 18:32
Comment thread test/syscalls/linux/socket_inet_loopback_isolated.cc Outdated
Comment thread test/syscalls/linux/socket_inet_loopback_isolated.cc Outdated
Comment thread test/syscalls/linux/socket_inet_loopback_isolated.cc Outdated
@ayushr2
Collaborator

ayushr2 commented Apr 13, 2026

Could you verify that these newly added tests fail without your fixes?

@petrmarinec
Author

Done. I addressed the SO_BINDTODEVICE test comments and verified the new tests against a scratch build with the implementation checks removed. They fail there because the operations succeed instead of returning EPERM, and pass again with this PR applied.

I also changed the mknod tests to mount an explicit tmpfs so they do not depend on /tmp already being tmpfs.

@ayushr2
Collaborator

ayushr2 commented Apr 14, 2026

Could you rebase and resolve conflicts?

Comment thread pkg/sentry/syscalls/linux/sys_thread.go Outdated
@petrmarinec force-pushed the fix/missing-capability-checks branch from 7aafaff to f9f3716 on April 14, 2026 19:13
@petrmarinec
Author

Rebased and resolved the conflict. I also addressed the setpriority nit.

@ayushr2
Collaborator

ayushr2 commented Apr 14, 2026

@ayushr2 ayushr2 dismissed EtiennePerot’s stale review April 14, 2026 19:28

Review was addressed

@petrmarinec force-pushed the fix/missing-capability-checks branch from f9f3716 to 70b4e0f on April 14, 2026 19:37
@petrmarinec
Author

Done, squashed into a single commit.

copybara-service bot pushed a commit that referenced this pull request Apr 14, 2026
…_setaffinity, and setpriority

## Summary

This patch adds capability and permission checks that the Linux kernel enforces but gVisor currently omits. Each fix was verified against native Linux behavior using `bazel test` on both native and `runsc_ptrace` platforms.

## Changes

### 1. `SO_BINDTODEVICE`: Add `CAP_NET_RAW` check
**File:** `pkg/sentry/socket/netstack/netstack.go`
**Linux reference:** `net/core/sock.c:sock_setsockopt()` checks `ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW)`
**Evidence this is unintended:** gVisor's own test suite asserts `"CAP_NET_RAW is required to use SO_BINDTODEVICE"` (`test/syscalls/linux/socket_bind_to_device.cc:52`), and `SO_RCVBUFFORCE` in the same file already correctly checks `CAP_NET_ADMIN`.

### 2. `mknod(S_IFBLK/S_IFCHR)`: Add `CAP_MKNOD` check
**File:** `pkg/sentry/syscalls/linux/sys_file.go`
**Linux reference:** `fs/namei.c:vfs_mknod()` checks `capable(CAP_MKNOD)` for block/char device creation
**Evidence this is unintended:** `CAP_MKNOD` is defined (`pkg/abi/linux/capability.go:56`), parsed from OCI specs (`runsc/specutils/specutils.go:491`), and has strace formatting — but is never checked anywhere. Zero `HasCapability` calls for it exist in the codebase.

### 3. `sched_setaffinity`: Add UID match / `CAP_SYS_NICE` check
**File:** `pkg/sentry/syscalls/linux/sys_thread.go`
**Linux reference:** `kernel/sched/core.c:check_same_owner()` requires EUID match or `CAP_SYS_NICE`
**Impact:** Without this check, any unprivileged process could modify another process's CPU affinity mask.

### 4. `setpriority`: Add UID match / `CAP_SYS_NICE` check
**File:** `pkg/sentry/syscalls/linux/sys_thread.go`
**Linux reference:** `kernel/sys.c:set_one_prio()` requires UID match or `CAP_SYS_NICE`
**Impact:** Without this check, any unprivileged process could change another process's scheduling priority.

## Testing

Tests added in `test/syscalls/linux/capability_checks.cc`, verified on both native Linux and gVisor:

```
bazel test //test/syscalls:capability_checks_test_native        → 6/6 passed
bazel test //test/syscalls:capability_checks_test_runsc_ptrace  → 4 passed, 2 skipped
```

The 2 skipped tests are the mknod positive cases (creating device nodes with `CAP_MKNOD`), which are skipped on gVisor because the sandbox does not permit device node creation regardless of capabilities.

| Test | What it verifies |
|------|-----------------|
| `SoBindToDeviceCapTest.RequiresCapNetRaw` | `EPERM` without `CAP_NET_RAW` |
| `MknodCapTest.CharDevRequiresCapMknod` | `EPERM` for `S_IFCHR` without `CAP_MKNOD` (native only) |
| `MknodCapTest.BlockDevRequiresCapMknod` | `EPERM` for `S_IFBLK` without `CAP_MKNOD` (native only) |
| `MknodCapTest.FifoDoesNotRequireCapMknod` | `S_IFIFO` succeeds without `CAP_MKNOD` |
| `SchedSetaffinityCapTest.OtherUidRequiresCapSysNice` | `EPERM` without UID match or `CAP_SYS_NICE` |
| `SetpriorityCapTest.OtherUidRequiresCapSysNice` | `EPERM` without UID match or `CAP_SYS_NICE` |

Assisted-by: Codex
FUTURE_COPYBARA_INTEGRATE_REVIEW=#12872 from petrmarinec:fix/missing-capability-checks 70b4e0f
PiperOrigin-RevId: 899726669
copybara-service bot pushed a commit that referenced this pull request Apr 16, 2026
…_setaffinity, and setpriority

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12872 from petrmarinec:fix/missing-capability-checks 70b4e0f
PiperOrigin-RevId: 900522695
Comment thread pkg/sentry/socket/netstack/netstack.go Outdated
Comment thread pkg/sentry/syscalls/linux/sys_file.go Outdated
Comment thread pkg/sentry/syscalls/linux/sys_thread.go Outdated
Comment thread pkg/sentry/syscalls/linux/sys_thread.go Outdated
Comment thread pkg/sentry/syscalls/linux/sys_thread.go Outdated
@petrmarinec force-pushed the fix/missing-capability-checks branch from 70b4e0f to 4946e8c on April 16, 2026 22:14
@petrmarinec
Author

SO_BINDTODEVICE now only requires CAP_NET_RAW when overwriting an existing binding, and the test now covers first bind without CAP_NET_RAW succeeding plus overwrite without CAP_NET_RAW failing.
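A minimal sketch of that SO_BINDTODEVICE rule, using a hypothetical `sock` type and `bindToDevice` method rather than the netstack code: an interface index of 0 is assumed to mean "not bound", and the capability is passed in as a plain boolean.

```go
package main

import (
	"errors"
	"fmt"
)

// errPerm stands in for EPERM in this sketch.
var errPerm = errors.New("EPERM")

// sock is a hypothetical socket model. boundIf == 0 means the socket is not
// bound to any device.
type sock struct {
	boundIf int
}

// bindToDevice allows the first bind unconditionally and requires
// CAP_NET_RAW only when an existing binding would be overwritten.
func (s *sock) bindToDevice(ifindex int, hasCapNetRaw bool) error {
	if s.boundIf != 0 && !hasCapNetRaw {
		return errPerm
	}
	s.boundIf = ifindex
	return nil
}

func main() {
	s := &sock{}
	fmt.Println(s.bindToDevice(2, false)) // <nil>: first bind needs no capability
	fmt.Println(s.bindToDevice(3, false)) // EPERM: overwrite without CAP_NET_RAW
	fmt.Println(s.bindToDevice(3, true))  // <nil>: overwrite with CAP_NET_RAW
}
```

The two unprivileged calls correspond to the two cases the updated test covers.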

mknod now skips whiteouts and checks CAP_MKNOD in the init user namespace.

sched_setaffinity now references kernel/sched/syscalls.c:sched_setaffinity().

For setpriority, I moved the target lookup and permission check into the who != 0 branch. I also rechecked this path against current Linux kernel/sys.c:set_one_prio_perm(); setpriority uses ns_capable(pcred->user_ns, CAP_SYS_NICE), so this remains checked against the target task's user namespace.

@ayushr2
Collaborator

ayushr2 commented Apr 16, 2026

> I also rechecked this path against current Linux kernel/sys.c:set_one_prio_perm(); setpriority uses ns_capable(pcred->user_ns, CAP_SYS_NICE), so this remains checked against the target task's user namespace.

Could you confirm which Linux source version you are looking at? At least since Linux 6.11, the capability check is done in the init userns: https://github.com/torvalds/linux/blob/3cd8b194bf3428dfa53120fee47e827a7c495815/kernel/sched/syscalls.c#L487-L488

@petrmarinec
Author

I was looking at the same torvalds/linux commit you linked, but at the direct setpriority(2) path in kernel/sys.c.

SYSCALL_DEFINE3(setpriority) calls set_one_prio(), and set_one_prio() calls set_one_prio_perm():
https://github.com/torvalds/linux/blob/3cd8b194bf3428dfa53120fee47e827a7c495815/kernel/sys.c#L259-L286
https://github.com/torvalds/linux/blob/3cd8b194bf3428dfa53120fee47e827a7c495815/kernel/sys.c#L219-L228

That helper checks ns_capable(pcred->user_ns, CAP_SYS_NICE), so I kept the cross-UID setpriority permission check against the target task's user namespace.

The line you linked in kernel/sched/syscalls.c is the sched_setscheduler path, and I see that uses capable(CAP_SYS_NICE). If you prefer gVisor's simplified Setpriority implementation to model that scheduler-side check instead, I can change it.
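To make the difference between the two checks concrete, here is a simplified model of a namespace-scoped capability check. The `userNS` and `task` types and the `hasCapSysNiceIn` function are hypothetical, and the model deliberately omits Linux's rule that the owner of a user namespace holds all capabilities in it. Checking against the root namespace plays the role of capable(), and checking against the target's namespace plays the role of ns_capable().

```go
package main

import "fmt"

// userNS is a hypothetical user namespace; the root namespace has a nil parent.
type userNS struct {
	parent *userNS
}

// task is a hypothetical task whose CAP_SYS_NICE bit applies in its own
// user namespace.
type task struct {
	ns         *userNS
	capSysNice bool
}

// hasCapSysNiceIn reports whether t holds CAP_SYS_NICE with respect to the
// target namespace. In this simplified model that is true only if t's
// namespace is the target or one of its ancestors and t has the capability.
func hasCapSysNiceIn(t task, target *userNS) bool {
	for ns := target; ns != nil; ns = ns.parent {
		if ns == t.ns {
			return t.capSysNice
		}
	}
	return false
}

func main() {
	root := &userNS{}
	child := &userNS{parent: root}

	sandboxed := task{ns: child, capSysNice: true}
	fmt.Println(hasCapSysNiceIn(sandboxed, child)) // true: check scoped to the target's namespace
	fmt.Println(hasCapSysNiceIn(sandboxed, root))  // false: check against the init namespace

	privileged := task{ns: root, capSysNice: true}
	fmt.Println(hasCapSysNiceIn(privileged, child)) // true: an ancestor namespace covers its descendants
}
```

A task that is privileged only inside a nested user namespace passes the namespace-scoped check but fails the init-namespace one, which is why the choice between the two matters for setpriority.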

Collaborator

@ayushr2 ayushr2 left a comment


Thanks for the updates! Just nits.

Comment thread pkg/sentry/socket/netstack/netstack.go Outdated
// Since Linux 5.7, CAP_NET_RAW is only required to overwrite an
// existing SO_BINDTODEVICE binding. See
// net/core/sock.c:sock_bindtoindex_locked() and upstream commit
// c427bfec18f2 ("net: core: add sock_bindtoindex").
Collaborator


Commit title seems incorrect: c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users")

Comment thread pkg/sentry/syscalls/linux/sys_file.go Outdated
// block or character device nodes, except for whiteouts (S_IFCHR
// with device number WHITEOUT_DEV). See fs/namei.c:vfs_mknod().
isWhiteout := mode.FileType() == linux.ModeCharacterDevice && dev == linux.WHITEOUT_DEV
if !isWhiteout && !t.HasCapabilityIn(linux.CAP_MKNOD, t.Kernel().RootUserNamespace()) {
Collaborator


t.HasCapabilityIn(linux.CAP_MKNOD, t.Kernel().RootUserNamespace()) can be simplified to t.HasRootCapability(linux.CAP_MKNOD)

Comment thread pkg/sentry/syscalls/linux/sys_file.go Outdated

// "Zero file type is equivalent to type S_IFREG." - mknod(2)
if mode.FileType() == 0 {
switch mode.FileType() {
Collaborator


switch ft := mode.FileType(); ft {
}

So you can use ft below instead of calling mode.FileType() again.
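As a standalone illustration of that switch form together with the mknod rules discussed in this thread, the sketch below binds the file type once and reuses it. It does not use gVisor's types: the constants are the Linux ABI file-type bits, `checkMknod` is a hypothetical function, and `whiteoutDev = 0` is an assumption meant to mirror Linux's WHITEOUT_DEV.

```go
package main

import "fmt"

// File-type bits from the Linux ABI (octal).
const (
	sIFMT  = 0o170000
	sIFCHR = 0o020000
	sIFBLK = 0o060000
	sIFREG = 0o100000

	whiteoutDev = 0 // assumed value of Linux's WHITEOUT_DEV
)

// checkMknod returns the effective file type and whether the node may be
// created. Only the capability rule is modeled; other permission checks are
// out of scope.
func checkMknod(mode, dev uint32, hasCapMknod bool) (uint32, bool) {
	// Binding ft in the switch's init statement avoids recomputing the type.
	switch ft := mode & sIFMT; ft {
	case 0:
		// "Zero file type is equivalent to type S_IFREG." - mknod(2)
		return sIFREG, true
	case sIFCHR, sIFBLK:
		// Whiteouts (character device with the whiteout device number) are
		// exempt from the CAP_MKNOD requirement.
		isWhiteout := ft == sIFCHR && dev == whiteoutDev
		return ft, isWhiteout || hasCapMknod
	default:
		return ft, true
	}
}

func main() {
	ft, ok := checkMknod(0o644, 0, false)
	fmt.Println(ft == sIFREG, ok) // true true: zero type is treated as a regular file

	_, ok = checkMknod(sIFCHR|0o600, whiteoutDev, false)
	fmt.Println(ok) // true: whiteout needs no capability

	_, ok = checkMknod(sIFCHR|0o600, 0x0103, false)
	fmt.Println(ok) // false: character device without CAP_MKNOD

	_, ok = checkMknod(sIFBLK|0o600, 0x0800, true)
	fmt.Println(ok) // true: block device with CAP_MKNOD
}
```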

Add Linux-compatible capability enforcement for SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority.

Add syscall tests covering each capability check.
@petrmarinec force-pushed the fix/missing-capability-checks branch from 4946e8c to 0231fa9 on April 18, 2026 00:39