feat(sandbox): integrate openshell-ocsf crate with dual-file output, SandboxConfig proto, and log site migration #393

@johntmyers

Problem Statement

The openshell-ocsf crate is built and tested (see #392). It now needs to be wired into the sandbox supervisor. That means replacing all 93 file-level log sites with typed OCSF events, setting up dual-file output (openshell.log shorthand + openshell-ocsf.log JSONL), and adding the SandboxConfig proto mechanism so the gateway can toggle OCSF output with hot-reload.

Proposed Design

This issue covers all integration work: proto changes, gateway config plumbing, sandbox subscriber wiring, config poll hot-reload, migration of every log site from ad-hoc info!()/warn!() to builder → ocsf_emit!(), OCSF profile enrichment, and E2E verification.

The full design is documented in .opencode/plans/ocsf-log-export.md. Key sections: "SandboxConfig: Gateway → Sandbox Operational Config", "Tracing Layer Integration", "Implementation Plan", "Delivery Plan — Part 2".

Architecture

tracing event
    │
    ▼
OcsfEvent struct (built in openshell-sandbox using openshell-ocsf builders)
    │
    ├──► OcsfShorthandLayer ──► /var/log/openshell.log  (always on)
    │    (openshell-ocsf)         └──► gRPC log push to gateway
    │
    └──► OcsfJsonlLayer     ──► /var/log/openshell-ocsf.log  (toggle via SandboxConfig)
         (openshell-ocsf)        wrapped in reload::Layer for hot-reload

SandboxConfig Proto

message SandboxConfig {
  uint64 config_revision = 1;
  LoggingConfig logging = 2;
  // Future: DiagnosticsConfig, FeatureFlags, BroadcastMessage
}

message LoggingConfig {
  bool ocsf_enabled = 1;
  // Future: log_level_override, rotation
}

SandboxConfig is added as optional SandboxConfig sandbox_config = 4 on GetSandboxPolicyResponse. Its config_revision is tracked independently of the policy version. The message is designed as a general-purpose gateway → sandbox operational config channel.
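
As a rough illustration of how the optional field might look from the sandbox side, the sketch below uses hand-written Rust structs as stand-ins for the generated proto types. The struct layout and the helper name ocsf_flag are assumptions made for this example; the real generated code may differ.

```rust
// Hypothetical hand-written stand-ins for the generated proto types.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct LoggingConfig {
    pub ocsf_enabled: bool,
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct SandboxConfig {
    pub config_revision: u64,
    // Message-typed fields are modeled as Option, as proto3 codegen usually does.
    pub logging: Option<LoggingConfig>,
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct GetSandboxPolicyResponse {
    // Other response fields are omitted; only sandbox_config (field 4) is modeled.
    pub sandbox_config: Option<SandboxConfig>,
}

/// Returns Some(flag) when the gateway sent a SandboxConfig, and None when the
/// field is absent (for example, an older gateway). A present config with no
/// logging section is treated as ocsf_enabled = false.
pub fn ocsf_flag(resp: &GetSandboxPolicyResponse) -> Option<bool> {
    resp.sandbox_config
        .as_ref()
        .map(|cfg| cfg.logging.as_ref().map_or(false, |l| l.ocsf_enabled))
}
```

Returning an Option keeps "gateway did not send a config" distinct from "gateway sent ocsf_enabled = false", which matters for the backward-compatibility requirement.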

Log Site Migration Scope

93 file-level log sites across 18 source files:

Event Class                        Count  Primary Files
Network Activity [4001]            19     proxy.rs, bypass_monitor.rs
HTTP Activity [4002]               7      proxy.rs, l7/relay.rs
SSH Activity [4007]                10     ssh.rs
Process Activity [1007]            4      lib.rs, process.rs
Detection Finding [2004]           9      ssh.rs, opa.rs, l7/relay.rs, proxy.rs, bypass_monitor.rs
Application Lifecycle [6002]       18     main.rs, lib.rs, netns.rs
Device Config State Change [5019]  24     lib.rs
Base Event [0]                     20     netns.rs, mechanistic_mapper.rs, proxy.rs, bypass_monitor.rs

Dual-emit events: BYPASS_DETECT (Network Activity + Detection Finding), NSSH1 nonce replay (SSH Activity + Detection Finding).

Dependencies

Order of Battle

Each step depends on prior steps unless noted.

Step 1: Proto changes (~0.5 day)

  • Add SandboxConfig and LoggingConfig messages to proto/sandbox.proto:
    message SandboxConfig {
      uint64 config_revision = 1;
      LoggingConfig logging = 2;
    }
    
    message LoggingConfig {
      bool ocsf_enabled = 1;
    }
  • Add optional SandboxConfig sandbox_config = 4 to GetSandboxPolicyResponse
  • Regenerate Rust proto bindings (mise run proto:gen or equivalent)
  • Verify proto compilation succeeds and generated Rust types are accessible
  • Verify existing proto tests still pass

Done when: Proto compiles. SandboxConfig and LoggingConfig types exist in generated Rust code. GetSandboxPolicyResponse has an optional sandbox_config field. Existing tests pass.

Step 2: Gateway config plumbing (~1-1.5 days)

  • In openshell-server, read ocsf_logging_enabled from gateway config sources (priority: YAML > CLI flag > env var OPENSHELL_OCSF_LOGGING)
  • Maintain config_revision: u64 counter in gateway, initialized to 1 on startup, incremented on any config change
  • Populate SandboxConfig in every GetSandboxPolicyResponse with current config_revision and ocsf_enabled value
  • When ocsf_logging_enabled is not configured, default to false
  • Add unit tests: config parsing from each source, revision counter increments, response population

Done when: Gateway populates SandboxConfig in every GetSandboxPolicyResponse. Config revision starts at 1 and increments correctly. Tests verify config source priority and default behavior.

Depends on: Step 1.
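
The source priority and revision counter from Step 2 could be sketched roughly as follows. All names here (resolve_ocsf_enabled, parse_env_flag, GatewayConfigState) are hypothetical, the accepted env var spellings are an assumption, and the sketch assumes the revision is bumped only when the value actually changes.

```rust
/// Resolve the flag with priority YAML > CLI flag > env var, defaulting to false
/// when no source sets it.
pub fn resolve_ocsf_enabled(yaml: Option<bool>, cli: Option<bool>, env: Option<bool>) -> bool {
    yaml.or(cli).or(env).unwrap_or(false)
}

/// Parse the raw text of OPENSHELL_OCSF_LOGGING.
/// The accepted spellings are an assumption for this sketch.
pub fn parse_env_flag(raw: Option<&str>) -> Option<bool> {
    match raw?.trim().to_ascii_lowercase().as_str() {
        "1" | "true" => Some(true),
        "0" | "false" => Some(false),
        _ => None,
    }
}

/// Hypothetical gateway-side state backing SandboxConfig.
pub struct GatewayConfigState {
    config_revision: u64,
    ocsf_enabled: bool,
}

impl GatewayConfigState {
    /// The revision starts at 1 on gateway startup.
    pub fn new(ocsf_enabled: bool) -> Self {
        Self { config_revision: 1, ocsf_enabled }
    }

    /// Update the flag; bump the revision only on an actual change.
    /// Returns true when the revision was incremented.
    pub fn set_ocsf_enabled(&mut self, value: bool) -> bool {
        if self.ocsf_enabled == value {
            return false;
        }
        self.ocsf_enabled = value;
        self.config_revision += 1;
        true
    }

    /// The (config_revision, ocsf_enabled) pair to place in each response.
    pub fn snapshot(&self) -> (u64, bool) {
        (self.config_revision, self.ocsf_enabled)
    }
}
```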

Step 3: Sandbox subscriber wiring (~1-1.5 days)

  • Add openshell-ocsf as dependency of openshell-sandbox in Cargo.toml
  • In main.rs, replace existing fmt::Full file layer for /var/log/openshell.log with OcsfShorthandLayer
  • Set up OcsfJsonlLayer for /var/log/openshell-ocsf.log, wrapped in tracing_subscriber::reload::Layer
  • Initialize JSONL layer as None (disabled) or Some(...) (enabled) based on initial SandboxConfig from first GetSandboxPolicy response
  • Create SandboxContext from sandbox config values (sandbox ID, name, image, hostname, proxy address) and store for use by all log sites
  • Store reload::Handle for JSONL layer for use in config poll loop
  • Wire subscriber: registry().with(shorthand_layer).with(jsonl_reload_layer).with(stdout_layer).with(log_push_layer)
  • Add integration tests: subscriber setup with OCSF on/off, verify shorthand always active, JSONL conditional

Done when: Sandbox starts with new subscriber stack. OcsfShorthandLayer writes to openshell.log. OcsfJsonlLayer writes to openshell-ocsf.log only when enabled. SandboxContext created and accessible. Existing stdout and log push layers remain functional.

Depends on: Steps 1, 2.
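
The intended fan-out behaviour (shorthand always written, JSONL written only when enabled) can be modeled with a std-only sketch. This is not the tracing subscriber wiring itself: LogSinks is a hypothetical type, and in-memory vectors stand in for the two log files.

```rust
/// Std-only model of the two file sinks. In the real wiring these would be
/// tracing layers; here plain vectors stand in for the files.
#[derive(Debug, Default)]
pub struct LogSinks {
    /// Stands in for /var/log/openshell.log (always on).
    pub shorthand: Vec<String>,
    /// Stands in for /var/log/openshell-ocsf.log; None means the layer is disabled.
    pub jsonl: Option<Vec<String>>,
}

impl LogSinks {
    /// Build the sinks from the initial ocsf_enabled value.
    pub fn new(ocsf_enabled: bool) -> Self {
        Self {
            shorthand: Vec::new(),
            jsonl: ocsf_enabled.then(Vec::new),
        }
    }

    /// Record one event: the shorthand line is always kept, the JSON line
    /// only when the JSONL sink is enabled.
    pub fn emit(&mut self, shorthand_line: &str, json_line: &str) {
        self.shorthand.push(shorthand_line.to_owned());
        if let Some(out) = self.jsonl.as_mut() {
            out.push(json_line.to_owned());
        }
    }
}
```

Modeling the JSONL sink as an Option mirrors the None / Some(...) initialization described in the step, so a disabled layer costs nothing per event beyond the check.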

Step 4: Config poll integration (~0.5-1 day)

  • Add SandboxConfig handling to existing policy poll loop in lib.rs
  • Track current_config_revision: u64 in sandbox state (initialized to 0)
  • On each poll response: if sandbox_config present and config_revision > current_config_revision, apply changes
  • For OCSF toggle: use jsonl_reload_handle.modify(|layer| ...) to hot-reload
    • Enable: create JSONL file + non-blocking writer + Some(OcsfJsonlLayer)
    • Disable: None
  • Emit CONFIG:UPDATED event (via ConfigStateChangeBuilder) recording toggle change
  • Handle absent sandbox_config (older gateway) gracefully: no-op, keep current config
  • Add unit tests: revision tracking, toggle on/off via reload handle, backward compat with absent config

Done when: Config poll loop correctly tracks revisions and hot-reloads the JSONL layer. Toggling ocsf_enabled from gateway config creates/removes JSONL file at runtime without sandbox restart. CONFIG:UPDATED event emitted on change. Older gateway (no sandbox_config) doesn't break sandbox.

Depends on: Steps 2, 3.
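
The revision gating in Step 4 might look like the following sketch. PolledConfig, PollOutcome, and ConfigTracker are hypothetical names, and the config is flattened to two fields for brevity; the actual reload of the JSONL layer is left out.

```rust
/// Flattened stand-in for the SandboxConfig carried in a poll response.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PolledConfig {
    pub config_revision: u64,
    pub ocsf_enabled: bool,
}

#[derive(Debug, PartialEq)]
pub enum PollOutcome {
    /// sandbox_config absent (older gateway): keep the current config.
    NoConfig,
    /// Revision not newer than the one already applied: ignore.
    Stale,
    /// Revision accepted; `toggled` says whether ocsf_enabled changed,
    /// which is when a CONFIG:UPDATED event would be emitted.
    Applied { toggled: bool },
}

pub struct ConfigTracker {
    current_config_revision: u64,
    ocsf_enabled: bool,
}

impl ConfigTracker {
    /// The tracked revision starts at 0, so the gateway's first revision (1) applies.
    pub fn new(initial_ocsf_enabled: bool) -> Self {
        Self { current_config_revision: 0, ocsf_enabled: initial_ocsf_enabled }
    }

    pub fn ocsf_enabled(&self) -> bool {
        self.ocsf_enabled
    }

    /// Apply one poll response according to the revision rule.
    pub fn on_poll(&mut self, cfg: Option<PolledConfig>) -> PollOutcome {
        let cfg = match cfg {
            Some(cfg) => cfg,
            None => return PollOutcome::NoConfig,
        };
        if cfg.config_revision <= self.current_config_revision {
            return PollOutcome::Stale;
        }
        self.current_config_revision = cfg.config_revision;
        let toggled = cfg.ocsf_enabled != self.ocsf_enabled;
        self.ocsf_enabled = cfg.ocsf_enabled;
        PollOutcome::Applied { toggled }
    }
}
```

One consequence of a strict "greater than" rule is that a gateway which restarts and resets its revision to 1 would look stale to a sandbox that has already applied a higher revision, so that case may need separate handling.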

Step 5: Log site migration — Network + HTTP events (25 sites) (~1.5-2 days)

  • Refactor all 19 Network Activity [4001] log sites in proxy.rs and bypass_monitor.rs:
    • CONNECT allow (L4-only), CONNECT_L7 (allow, L7 follows), CONNECT deny → NetworkActivityBuilder
    • BYPASS_DETECT network event → NetworkActivityBuilder (with observation_point_id=3)
    • Proxy listen, connection errors, relay errors → NetworkActivityBuilder
    • FORWARD parse/reject/upstream errors → NetworkActivityBuilder
    • SSRF blocks (allowed_ips failed, invalid config, internal IP) → NetworkActivityBuilder
  • Refactor all 7 HTTP Activity [4002] log sites in proxy.rs and l7/relay.rs:
    • FORWARD allow/deny → HttpActivityBuilder with HTTP method → activity_id mapping
    • L7_REQUEST → HttpActivityBuilder
    • SSRF blocks → HttpActivityBuilder
    • Non-inference request at inference.local → HttpActivityBuilder
  • Each refactored site: construct builder → ocsf_emit!(). Remove old info!()/warn!() call
  • Verify shorthand output matches expected patterns for proxy events

Done when: All 25 Network + HTTP log sites use builder → ocsf_emit!(). No ad-hoc info!()/warn!() calls remain for these event types. cargo test passes.

Depends on: Steps 3, 4 (subscriber wired, SandboxContext available).
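
For the HTTP method → activity_id mapping mentioned above, a minimal sketch could look like this. The function name is hypothetical, and the numeric values reflect my understanding of the OCSF HTTP Activity class (with Patch present only in newer schema versions); they should be checked against the schema version the crate targets.

```rust
/// Map an HTTP method to an OCSF HTTP Activity activity_id.
/// Values are assumed from the OCSF schema and should be verified:
/// 0 = Unknown, 1 = Connect ... 9 = Patch, 99 = Other.
pub fn http_activity_id(method: &str) -> u8 {
    match method.trim().to_ascii_uppercase().as_str() {
        "" => 0, // Unknown: no method captured
        "CONNECT" => 1,
        "DELETE" => 2,
        "GET" => 3,
        "HEAD" => 4,
        "OPTIONS" => 5,
        "POST" => 6,
        "PUT" => 7,
        "TRACE" => 8,
        "PATCH" => 9, // assumed; only in newer schema versions
        _ => 99,      // Other: any method without a dedicated id
    }
}
```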

Step 6: Log site migration — SSH + Process + Finding events (22 sites) (~1.5-2 days)

  • Refactor 10 SSH Activity [4007] log sites in ssh.rs:
    • SSH listen, handshake read/verify/accepted/failed → SshActivityBuilder
    • NSSH1 nonce replay (SSH side of dual-emit) → SshActivityBuilder
    • direct-tcpip refuse/fail, unsupported subsystem → SshActivityBuilder
  • Refactor 4 Process Activity [1007] log sites in lib.rs and process.rs:
    • Process started → ProcessActivityBuilder with launch_type_id=1 (Spawn)
    • Process exited → ProcessActivityBuilder with exit_code
    • Process timed out → ProcessActivityBuilder with forced kill
    • SIGTERM failed → ProcessActivityBuilder
  • Refactor 9 Detection Finding [2004] log sites:
    • NSSH1 nonce replay finding (dual-emit with SSH) → DetectionFindingBuilder with MITRE T1550/TA0008
    • BYPASS_DETECT finding (dual-emit with Network) → DetectionFindingBuilder with MITRE T1090.003/TA0011, remediation.desc from hint
    • Unsafe disk policy → DetectionFindingBuilder
    • L7 policy validation warnings (2x) → DetectionFindingBuilder
    • SQL L7 not implemented, HTTP parse error → DetectionFindingBuilder
    • Inference interception, upstream chunk error → DetectionFindingBuilder
  • Verify dual-emit: NSSH1 replay produces 1 SSH Activity + 1 Detection Finding
  • Verify dual-emit: BYPASS_DETECT produces 1 Network Activity (step 5) + 1 Detection Finding

Done when: All 22 log sites use builder → ocsf_emit!(). Dual-emit sites produce exactly 2 events each. No ad-hoc calls remain for these event types. cargo test passes.

Depends on: Step 5 (BYPASS_DETECT network event already migrated; finding side added here).
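
The dual-emit requirement (one site, exactly two events) can be illustrated with a small sketch for the NSSH1 nonce replay case. SketchEvent and nssh1_nonce_replay are hypothetical and far simpler than the real builders; only the class UIDs and MITRE identifiers come from this issue.

```rust
pub const SSH_ACTIVITY: u32 = 4007;
pub const DETECTION_FINDING: u32 = 2004;

/// Hypothetical, heavily simplified event used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchEvent {
    pub class_uid: u32,
    pub message: String,
    pub mitre_technique: Option<&'static str>,
    pub mitre_tactic: Option<&'static str>,
}

/// Build both halves of the dual-emit for a nonce replay from `peer`:
/// an SSH Activity event and a Detection Finding carrying the MITRE mapping.
pub fn nssh1_nonce_replay(peer: &str) -> [SketchEvent; 2] {
    let message = format!("NSSH1 nonce replay from {}", peer);
    [
        SketchEvent {
            class_uid: SSH_ACTIVITY,
            message: message.clone(),
            mitre_technique: None,
            mitre_tactic: None,
        },
        SketchEvent {
            class_uid: DETECTION_FINDING,
            message,
            mitre_technique: Some("T1550"),
            mitre_tactic: Some("TA0008"),
        },
    ]
}
```

Returning a fixed-size array makes the "exactly 2 events" property part of the type, which keeps the dual-emit verification simple.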

Step 7: Log site migration — Lifecycle + Config + Base events (46 sites) (~1.5-2 days)

  • Refactor 18 Application Lifecycle [6002] log sites in main.rs, lib.rs, netns.rs:
    • Sandbox start, log file fallback → AppLifecycleBuilder
    • SSH server ready/failed → AppLifecycleBuilder
    • Provider env fetch success/failure → AppLifecycleBuilder
    • TLS init success/failure → AppLifecycleBuilder
    • SIGCHLD handler, image validation, platform warning → AppLifecycleBuilder
    • Config validation (zero/invalid interval) → AppLifecycleBuilder
    • Bypass detection setup: installing rules, rules installed, iptables not found, install failed → AppLifecycleBuilder
  • Refactor 24+ Device Config State Change [5019] log sites in lib.rs:
    • Policy: load/fetch/reload/fallback/enrichment/poll/report → ConfigStateChangeBuilder
    • Inference routes: file load/gateway fetch/update/bundle status → ConfigStateChangeBuilder
  • Refactor 20 Base Event [0] log sites in netns.rs, mechanistic_mapper.rs, proxy.rs, bypass_monitor.rs:
    • Netns create/cleanup (6 events) → BaseEventBuilder
    • Denial flush events → BaseEventBuilder
    • DNS resolution failures → BaseEventBuilder
    • Proxy operational errors → BaseEventBuilder
    • Bypass detection operational events (rule failures, dmesg failures) → BaseEventBuilder

Done when: All remaining 46 log sites use builder → ocsf_emit!(). Every file-level log statement in the sandbox (93 total) now goes through OCSF builders. No ad-hoc info!()/warn!() calls remain for events that reach the file layer. cargo test passes.

Depends on: Steps 5, 6.

Step 8: Profile enrichment (~1 day)

  • Verify and apply OCSF profiles to all events:
    • Container profile: All events include container from SandboxContext::container()
    • Network Proxy profile: Network Activity and HTTP Activity include proxy_endpoint from SandboxContext::proxy_endpoint()
    • Security Control profile: Network, HTTP, SSH, Detection Finding include action_id, disposition_id, firewall_rule where applicable
    • Host profile: All events include device from SandboxContext::device()
  • Populate v1.6.0+ fields:
    • observation_point_id=2 (Destination) on proxy network events
    • observation_point_id=3 (Inline) on BYPASS_DETECT network events
    • is_src_dst_assignment_known=true on all network events
  • Verify metadata.profiles array is correctly populated per event class
  • Schema validation tests confirm profile fields are present

Done when: Every event includes correct profile fields. metadata.profiles lists profiles applied. Schema validation confirms profile fields present.

Depends on: Steps 5, 6, 7.
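
The per-class profile rules above could be checked with a helper along these lines. The function name is hypothetical, and the profile name strings are assumed to match the OCSF profile identifiers.

```rust
/// Expected metadata.profiles for an event class, following the rules in this step:
/// Container and Host on all events, Network Proxy on Network/HTTP Activity,
/// Security Control on Network, HTTP, SSH, and Detection Finding.
pub fn profiles_for(class_uid: u32) -> Vec<&'static str> {
    let mut profiles = vec!["container", "host"];
    if matches!(class_uid, 4001 | 4002) {
        profiles.push("network_proxy");
    }
    if matches!(class_uid, 4001 | 4002 | 4007 | 2004) {
        profiles.push("security_control");
    }
    profiles
}
```

For example, a Process Activity event [1007] would be expected to list only the container and host profiles, while a Network Activity event [4001] would list all four.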

Step 9: Log push channel update (~0.5 day)

  • Ensure gRPC log push layer receives shorthand text from OcsfShorthandLayer output
  • Verify message field in SandboxLogLine proto contains the shorthand line
  • Verify openshell sandbox logs CLI command displays shorthand output correctly
  • Verify TUI log panel renders shorthand lines
  • No proto changes needed — same SandboxLogLine message, better formatted content

Done when: Log push delivers shorthand text to gateway. openshell sandbox logs shows shorthand lines. TUI displays correctly. No regressions.

Depends on: Steps 3, 5-7.

Step 10: E2E verification (~1 day)

  • Run mise run e2e and verify all existing tests pass (update assertions as needed for shorthand format)
  • Verify dual-file output:
    • /var/log/openshell.log contains shorthand text lines (not JSON, not old format)
    • /var/log/openshell-ocsf.log (when enabled) contains OCSF JSONL with correct schema structure
    • Line counts match between files (accounting for dual-emit events)
  • Verify config toggle: enable/disable OCSF via gateway config, observe JSONL file creation/cessation without sandbox restart
  • Update test_sandbox_policy.py assertions for shorthand patterns
  • Verify log push and TUI end-to-end

Done when: mise run e2e passes. Manual verification of dual-file output, log push, TUI display, and config toggle all succeed.

Depends on: All prior steps.
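
A rough helper for the dual-file check might compare the two files' contents as below. check_dual_files is a hypothetical name, and the sketch assumes OCSF was enabled for the whole run so that every event produced one line in each file. It only looks at line shape and does not validate the JSON itself.

```rust
/// Compare shorthand and JSONL contents. Returns the shared line count on
/// success, or a description of the first problem found.
pub fn check_dual_files(shorthand: &str, jsonl: &str) -> Result<usize, String> {
    // Ignore blank lines in both inputs.
    let short: Vec<&str> = shorthand.lines().filter(|l| !l.trim().is_empty()).collect();
    let json: Vec<&str> = jsonl.lines().filter(|l| !l.trim().is_empty()).collect();

    // Shorthand lines must not look like JSON objects.
    if let Some(bad) = short.iter().find(|l| l.trim_start().starts_with('{')) {
        return Err(format!("shorthand file contains a JSON-looking line: {}", bad));
    }
    // Every JSONL line must at least be brace-delimited.
    if let Some(bad) = json.iter().find(|l| {
        let t = l.trim();
        !(t.starts_with('{') && t.ends_with('}'))
    }) {
        return Err(format!("JSONL file contains a non-object line: {}", bad));
    }
    // One line per event in each file, so the counts should agree.
    if short.len() != json.len() {
        return Err(format!(
            "line count mismatch: {} shorthand vs {} JSONL",
            short.len(),
            json.len()
        ));
    }
    Ok(short.len())
}
```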

Step 11: Cleanup (~0.5 day)

  • Remove all replaced ad-hoc info!()/warn!() calls superseded by ocsf_emit!()
  • Verify no OCSF formatting or serialization code exists in openshell-sandbox — all lives in openshell-ocsf
  • Verify openshell-sandbox only contains: builder call sites, subscriber wiring in main.rs, SandboxContext creation, config poll integration
  • Run mise run pre-commit and fix any issues
  • Run cargo clippy across the workspace — zero warnings
  • Update architecture/ docs to describe new logging architecture (dual-file output, OCSF adoption, SandboxConfig)
  • Update docs/ for any user-facing log format changes

Done when: No dead code remains. mise run pre-commit passes. cargo clippy zero warnings. Architecture docs updated. No OCSF logic in sandbox outside call sites and wiring.

Depends on: Steps 5-10.

Acceptance Criteria

  1. All 93 file-level log sites in openshell-sandbox emit typed OCSF events via builder → ocsf_emit!()
  2. /var/log/openshell.log contains shorthand-formatted text (not old tracing::fmt::Full output)
  3. /var/log/openshell-ocsf.log is created only when SandboxConfig.logging.ocsf_enabled=true and contains valid OCSF JSONL
  4. OCSF JSONL toggle is hot-reloadable via SandboxConfig.config_revision — no sandbox restart needed
  5. Backward compatible: sandbox works correctly with a gateway that does not populate sandbox_config
  6. gRPC log push delivers shorthand text. TUI and openshell sandbox logs display correctly
  7. Dual-emit events (BYPASS_DETECT, NSSH1 replay) produce exactly 2 OCSF events each
  8. All OCSF profiles (Container, Network Proxy, Security Control, Host) correctly applied
  9. mise run e2e passes with updated assertions
  10. mise run pre-commit and cargo clippy pass with zero issues
  11. Architecture docs updated to describe dual-file logging and OCSF adoption

Estimated Effort

~9-12 days (after Part 1 is complete)

References
