Status: active
The fleet/sensors/ directory at extensions/system/server/app/services/system/fleet/sensors/ holds 21 files:
- one BaseSensor abstract class;
- 17 sensors registered for the live tick loop via FleetAutonomyService::SENSORS;
- 3 unregistered fleet sensors (PackageDriftSensor, StorageAssignmentDriftSensor, SdwanCredentialExpirySensor) that run via separate invocation paths and emit signals into the same pipeline.

Two further CVE sensors live under cve_ops/sensors/. They are owned by the CVE Responder agent (see its section) and are not counted here. Each sensor inspects a slice of fleet state on a recurring tick and emits typed FleetEvent signals when thresholds trip. Those signals feed the autonomy DecisionEngine, which gates remediation actions according to intervention policy.
The 17 registered sensors, in SENSORS order: InstanceStatusSensor, InstanceStateDriftSensor, ModuleDriftSensor, CertificateExpirySensor, CertExpirySensor, ModulePromotionSensor, ConfigDriftSensor, SloViolationSensor, HoneypotAccessSensor, TradingPressureSensor, SdwanDriftSensor, SdwanReachabilitySensor, SdwanBgpSessionHealthSensor, SdwanVipReachabilitySensor, GitopsDriftSensor, ProjectSloSensor, FederationPeerLivenessSensor.
The Fleet Autonomy reconciler runs every 60s (configurable via autonomy_config.interval_seconds on the Fleet Autonomy agent; with the 2026-05-10 7-agent split, CVE / SDWAN / Disk Image / Runtime Manager agents each carry their own interval_seconds for their respective scopes). Each tick:
- The 17 sensors in `FleetAutonomyService::SENSORS` run in series (cheap; per-sensor work is bounded by the data it inspects). `PackageDriftSensor`, `StorageAssignmentDriftSensor`, and `SdwanCredentialExpirySensor` run on the same cadence via their owning services and emit signals into the same pipeline.
- Each sensor emits zero or more `FleetEvent` signals with `kind`, `severity`, `payload`, and `correlation_id`.
- The DecisionEngine maps signals → action categories → intervention policy lookup:
  - Policy = `auto_approve` → executor runs immediately.
  - Policy = `notify_and_proceed` → executor runs and the operator is notified.
  - Policy = `require_approval` → ApprovalRequest queued; executor blocked until an operator clicks Approve.
  - Policy = `blocked` → action dropped; the executor never runs.
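The tick described above can be modeled in a few lines of plain Ruby. This is a minimal sketch, not the `FleetAutonomyService` implementation. Its assumptions:
- `FleetEvent` is a `Struct` standing in for the real model.
- `QuietSensor`, `SilentInstanceSensor` and `run_tick` are hypothetical names.
- Each sensor only needs to respond to `tick(account:)`.

```ruby
# Minimal, hypothetical model of one Fleet Autonomy tick (plain Ruby, no Rails).
FleetEvent = Struct.new(:kind, :severity, :payload, :correlation_id, keyword_init: true)

# Toy sensor that never trips.
class QuietSensor
  def tick(account:)
    []
  end
end

# Toy sensor that always reports one silent instance for the account.
class SilentInstanceSensor
  def tick(account:)
    [FleetEvent.new(kind: "instance.silent", severity: :warning,
                    payload: { account: account, instance_id: "i-1" },
                    correlation_id: "#{account}:i-1")]
  end
end

# Run sensors in series and collect every event; a nil result counts as "no events".
def run_tick(sensors, account:)
  sensors.flat_map { |sensor| sensor.tick(account: account) || [] }
end

events = run_tick([QuietSensor.new, SilentInstanceSensor.new], account: "acct-1")
puts events.map(&:kind).inspect # prints ["instance.silent"]
```

Running the sensors in series keeps the model trivially ordered. A real reconciler would also need per-sensor error isolation, which this sketch omits.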
```mermaid
flowchart LR
    subgraph Sensors["20 fleet sensors (17 registered for 60s tick, 3 via separate paths)"]
        S1[instance_status]
        S2[module_drift]
        S3[module_promotion]
        S4[certificate_expiry]
        S4b[cert_expiry / ACME]
        S5[config_drift]
        S6[instance_state_drift]
        S7[sdwan_reachability]
        S8[sdwan_drift]
        S9[sdwan_bgp_session_health]
        S10[sdwan_vip_reachability]
        S11[sdwan_credential_expiry*]
        S12[honeypot_access]
        S13[slo_violation]
        S14[project_slo]
        S15[gitops_drift]
        S16[trading_pressure]
        S17[federation_peer_liveness]
        S18[package_drift*]
        S19[storage_assignment_drift*]
    end
    subgraph Signals["FleetEvent signal kinds"]
        Sig[instance.* / module.* / cert.* / config.* / gitops.*<br/>sdwan.* / honeypot.* / slo.* / project.* / storage.* / system.* / fleet.trading_*]
    end
    subgraph Executors["Skill executors (representative; see SKILL_EXECUTOR_CATALOG.md for all 48)"]
        E1[drift_remediate]
        E2[cve_response / cve_remediation_orchestration]
        E3[rolling_module_upgrade]
        E4[sdwan_peer_remediate]
        E5[sdwan_vip_failover]
        E6[sdwan_bgp_session_remediate]
        E7[attribute_failure]
        E8[package_module_refresh]
        E9[architecture_create / update / delete / propose]
    end
    Sensors --> Signals
    Signals --> DE[DecisionEngine]
    DE --> FA[FleetAutonomyService<br/>gate_action!]
    FA --> Executors
```
\* = not in `FleetAutonomyService::SENSORS`; invoked on the same cadence by its owning service. The 17 unstarred nodes are the registered set.
Source: instance_status_sensor.rb
Watches: System::NodeInstance.last_heartbeat_at
Threshold: Configurable per-template; default 5 minutes silent → instance.silent signal
Signals: instance.silent, instance.recovered
Recommended remediation: attribute_failure (skill) for diagnostics, then operator-initiated reprovision.
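A sketch of the silence check, assuming the sensor compares `last_heartbeat_at` against either the 5-minute default or a per-template override. `silent?` and `DEFAULT_SILENT_MINUTES` are illustrative names, not the sensor's actual constants. Treating a missing heartbeat as silent is also an assumption.

```ruby
# Hypothetical sketch of the InstanceStatusSensor threshold check.
DEFAULT_SILENT_MINUTES = 5 # documented default; a per-template value is passed as threshold_minutes

# True when the last heartbeat is older than the threshold.
# Assumption for this sketch: an instance with no heartbeat at all counts as silent.
def silent?(last_heartbeat_at, now:, threshold_minutes: DEFAULT_SILENT_MINUTES)
  return true if last_heartbeat_at.nil?

  (now - last_heartbeat_at) > threshold_minutes * 60
end

now = Time.utc(2026, 6, 3, 12, 0, 0)
puts silent?(now - 4 * 60, now: now)                        # false
puts silent?(now - 6 * 60, now: now)                        # true
puts silent?(now - 6 * 60, now: now, threshold_minutes: 10) # false
```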
Source: module_drift_sensor.rb
Watches: NodeInstance.running_module_digests vs assigned module digests
Threshold: Any digest mismatch → module.drift_detected signal
Signals: module.drift_detected, module.drift_resolved
Recommended remediation: drift_remediate skill (Fleet Autonomy auto-runs with notify_and_proceed).
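The digest comparison can be sketched as a keyed diff. This assumes both sides are available as a hash of module name → digest, which is an illustrative shape rather than the real column layout. The sketch counts a module that is assigned but not running, or running but no longer assigned, as a mismatch; that is also an assumption.

```ruby
# Hypothetical sketch of ModuleDriftSensor's comparison. Both arguments map
# module name => digest string; that shape is assumed for illustration.
def module_drift(running, assigned)
  (running.keys | assigned.keys).sort.filter_map do |name|
    next if running[name] == assigned[name] # matching digests: no drift

    { module: name, running: running[name], assigned: assigned[name] }
  end
end

running  = { "nginx" => "sha256:aaa", "agent" => "sha256:bbb" }
assigned = { "nginx" => "sha256:aaa", "agent" => "sha256:ccc", "exporter" => "sha256:ddd" }
module_drift(running, assigned).each { |d| puts d[:module] } # agent, then exporter
```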
Source: module_promotion_sensor.rb
Watches: NodeModuleVersion.lifecycle_state transitions (staging → blessed)
Threshold: Module spends >24h in staging without operator promotion → module.promotion_stalled signal
Signals: module.promotion_ready, module.promotion_stalled
Recommended remediation: None automated — operator promotes via UI or system_promote_module_version MCP action.
Source: certificate_expiry_sensor.rb
Watches: NodeCertificate.not_after (mTLS instance certs from InternalCaService)
Threshold: Cert expires within 14 days → cert.expiring signal; already expired → cert.expired signal
Signals: cert.expiring, cert.expired, cert.rotated
Recommended remediation: Auto-rotate via system.cert_rotate action (Fleet Autonomy auto_approve policy). 90-day default lifetime.
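A sketch of the expiry classification using the 14-day window and 90-day lifetime stated above. `cert_signal` is a hypothetical helper. Whether the boundary is inclusive is an assumption.

```ruby
# Hypothetical sketch of CertificateExpirySensor's classification.
SECONDS_PER_DAY = 24 * 60 * 60
EXPIRING_WINDOW = 14 * SECONDS_PER_DAY # 14-day window from the threshold above

# Returns the signal kind for a certificate's not_after, or nil if none is due.
def cert_signal(not_after, now:)
  return "cert.expired" if not_after <= now
  return "cert.expiring" if (not_after - now) <= EXPIRING_WINDOW

  nil
end

issued = Time.utc(2026, 1, 1)
not_after = issued + 90 * SECONDS_PER_DAY # 90-day default lifetime
puts cert_signal(not_after, now: issued + 75 * SECONDS_PER_DAY).inspect # nil
puts cert_signal(not_after, now: issued + 76 * SECONDS_PER_DAY).inspect # "cert.expiring"
puts cert_signal(not_after, now: not_after).inspect                      # "cert.expired"
```

With a 90-day lifetime and a 14-day window, a certificate first trips `cert.expiring` on day 76 of its life in this model (90 − 14).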
Source: config_drift_sensor.rb
Watches: Agent-reported config hash vs platform-computed config hash
Threshold: Hash mismatch → config.drift_detected signal
Signals: config.drift_detected, config.drift_resolved
Recommended remediation: drift_remediate skill (same as module drift).
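The platform-side hash is only useful if key order cannot change it. Below is a minimal sketch of one way to get that: SHA-256 over canonical JSON with recursively sorted keys. This hashing scheme is an assumption for illustration, not the documented algorithm. `canonicalize`, `config_hash` and `config_drift?` are hypothetical names.

```ruby
require "digest"
require "json"

# Hypothetical: canonicalize a config so key order cannot change the digest.
# Arrays keep their order; only hash keys are sorted.
def canonicalize(value)
  case value
  when Hash  then value.keys.sort_by(&:to_s).to_h { |k| [k.to_s, canonicalize(value[k])] }
  when Array then value.map { |v| canonicalize(v) }
  else value
  end
end

def config_hash(config)
  Digest::SHA256.hexdigest(JSON.generate(canonicalize(config)))
end

# Drift whenever the agent-reported hash differs from the platform-computed one.
def config_drift?(agent_reported_hash, desired_config)
  agent_reported_hash != config_hash(desired_config)
end

desired   = { "listen" => 443, "tls" => { "min_version" => "1.2", "ciphers" => ["a", "b"] } }
reordered = { "tls" => { "ciphers" => ["a", "b"], "min_version" => "1.2" }, "listen" => 443 }
puts config_drift?(config_hash(reordered), desired) # false: key order does not matter
```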
Source: sdwan_reachability_sensor.rb
Watches: Sdwan::Peer.last_handshake_at for hub peers
Threshold: No handshake in 5 minutes from hub → sdwan.hub_unreachable signal
Signals: sdwan.hub_unreachable, sdwan.hub_recovered
Recommended remediation: sdwan_failover skill (planning-only in v1; operator manually flips publicly_reachable).
Source: sdwan_drift_sensor.rb
Watches: Agent-reported wg interface state vs platform desired config
Threshold: Interface missing or wrong AllowedIPs → sdwan.peer_drift_detected signal
Signals: sdwan.peer_drift_detected, sdwan.peer_drift_resolved
Recommended remediation: sdwan_peer_remediate skill — rotate keys + force tunnel re-establish.
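A sketch of the interface comparison, assuming the agent report and desired config can each be reduced to a hash of WireGuard interface name → AllowedIPs CIDR strings. That shape and `sdwan_peer_drift` are hypothetical. AllowedIPs are compared as sets, so ordering differences alone do not count as drift.

```ruby
require "set"

# Hypothetical sketch of SdwanDriftSensor's comparison. Both sides map a
# WireGuard interface name => AllowedIPs CIDR strings; order is ignored.
def sdwan_peer_drift(reported, desired)
  desired.each_with_object([]) do |(iface, allowed_ips), drift|
    actual = reported[iface]
    if actual.nil?
      drift << { interface: iface, reason: :missing }
    elsif actual.to_set != allowed_ips.to_set
      drift << { interface: iface, reason: :allowed_ips_mismatch,
                 missing: allowed_ips - actual, extra: actual - allowed_ips }
    end
  end
end

desired  = { "wg0" => ["10.0.0.0/24", "10.0.1.0/24"], "wg1" => ["10.1.0.0/24"] }
reported = { "wg0" => ["10.0.1.0/24", "10.0.0.0/24"] }
p sdwan_peer_drift(reported, desired) # reports only wg1, as missing
```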
Source: sdwan_bgp_session_health_sensor.rb
Watches: Sdwan::BgpSession.state (Idle/Connect/Active/OpenSent/OpenConfirm/Established)
Threshold: Session non-Established for >10 minutes → sdwan.bgp_unhealthy signal
Signals: sdwan.bgp_unhealthy, sdwan.bgp_recovered
Recommended remediation: sdwan_bgp_session_remediate skill (planning-only; operator runs vtysh recommendation).
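A sketch of the 10-minute rule. It assumes the session row exposes a timestamp for when it entered its current state. `state_since` and `bgp_unhealthy?` are hypothetical names.

```ruby
# Hypothetical sketch of SdwanBgpSessionHealthSensor's threshold.
# `state_since` (when the session entered its current state) is an assumed field.
BGP_UNHEALTHY_AFTER = 10 * 60 # seconds

def bgp_unhealthy?(state, state_since, now:)
  return false if state == "Established"

  (now - state_since) > BGP_UNHEALTHY_AFTER
end

now = Time.utc(2026, 6, 3, 12, 0, 0)
puts bgp_unhealthy?("Active", now - 5 * 60, now: now) # false: still inside the 10-minute grace
puts bgp_unhealthy?("Idle", now - 11 * 60, now: now)  # true
```

Timing from the current state means a session cycling between Idle, Connect and Active keeps resetting the clock. Timing from the last time the session was Established would be the stricter alternative.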
Source: sdwan_vip_reachability_sensor.rb
Watches: Sdwan::VirtualIp.holder_peer_ids against peer handshake health
Threshold: Single-holder VIP's holder is silent → sdwan.vip_holder_silent signal
Signals: sdwan.vip_holder_silent, sdwan.vip_holder_recovered
Recommended remediation: sdwan_vip_failover skill — promotes the next failover candidate.
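A sketch of the single-holder check and of picking a failover target. It assumes that:
- `healthy_peer_ids` is the set of peers with a recent handshake;
- the candidate list is already in failover-priority order.

Both helper names are hypothetical.

```ruby
# Hypothetical sketch of SdwanVipReachabilitySensor's check. `healthy_peer_ids`
# stands in for peers with a recent handshake; the names are illustrative.
def vip_holder_silent?(holder_peer_ids, healthy_peer_ids)
  return false unless holder_peer_ids.size == 1 # only single-holder VIPs are in scope

  !healthy_peer_ids.include?(holder_peer_ids.first)
end

# Illustrative failover pick: first healthy peer in an assumed priority order.
def next_failover_candidate(candidate_peer_ids, healthy_peer_ids)
  candidate_peer_ids.find { |id| healthy_peer_ids.include?(id) }
end

healthy = [2, 3]
puts vip_holder_silent?([1], healthy)            # true
puts next_failover_candidate([1, 3, 2], healthy) # 3
```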
Source: honeypot_access_sensor.rb
Watches: CanaryModuleService access logs on canary modules placed in the catalog
Threshold: Any access attempt → honeypot.access_attempted signal (high severity)
Signals: honeypot.access_attempted, honeypot.access_blocked
Recommended remediation: None automated — escalates to operator + governance pipeline.
Source: slo_violation_sensor.rb
Watches: Slo::Definition rolling-window metrics
Threshold: SLO breach → slo.violated signal
Signals: slo.violated, slo.recovered
Recommended remediation: None automated — surfaces in operator dashboard for manual investigation.
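A sketch of a rolling-window breach test. It assumes an SLO reduces to a success-ratio objective over timestamped good/bad samples. `Slo::Definition`'s real fields are not documented here, so the sample shape and `slo_violated?` are hypothetical.

```ruby
# Hypothetical rolling-window SLO check. Samples are [timestamp, ok] pairs and
# the objective is a success ratio such as 0.99.
def slo_violated?(samples, objective:, window_seconds:, now:)
  recent = samples.select { |at, _ok| at > now - window_seconds && at <= now }
  return false if recent.empty? # assumption: no traffic in the window is not a breach

  good = recent.count { |_at, ok| ok }
  good.fdiv(recent.size) < objective
end

now = Time.utc(2026, 6, 3, 12, 0, 0)
samples = Array.new(9) { |i| [now - i * 60, true] } + [[now - 600, false]]
puts slo_violated?(samples, objective: 0.99, window_seconds: 3600, now: now) # true (9/10 = 0.9)
```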
Source: trading_pressure_sensor.rb (class TradingPressureSensor)
Watches: Stigmergic pressure signals emitted by sibling extensions on the platform-wide signal bus
Threshold: Trading-aggregate pressure ≥1.0 → fleet.trading_pressure_high signal; Fleet Autonomy then defers non-critical actions
Signals: fleet.trading_pressure_high, fleet.trading_pressure_normal
Recommended remediation: Internal — no executor; the TradingAwareThrottle consults this signal to defer Fleet Autonomy actions when trading is under load.
Naming: The sensor + throttle consume trading-domain signals specifically. A broader cross-domain refactor (renaming to ExternalPressureSensor / ExternalAwareThrottle and accepting any sibling extension's pressure feed) is contemplated but not in scope today.
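The throttle decision reduces to a threshold test plus an exemption for critical work. The 1.0 threshold comes from the sensor description above. The `defer_action?` name and the `critical:` flag are hypothetical stand-ins for however `TradingAwareThrottle` classifies actions.

```ruby
# Hypothetical sketch of the TradingAwareThrottle decision; the method name and
# `critical:` flag are illustrative. The 1.0 threshold is the documented one.
TRADING_PRESSURE_THRESHOLD = 1.0

def defer_action?(trading_pressure, critical:)
  return false if critical # critical actions are never deferred in this sketch

  trading_pressure >= TRADING_PRESSURE_THRESHOLD
end

puts defer_action?(1.2, critical: false) # true
puts defer_action?(1.2, critical: true)  # false
puts defer_action?(0.4, critical: false) # false
```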
Source: instance_state_drift_sensor.rb
Watches: NodeInstance rows whose model status disagrees with provider truth (e.g., DB says running, provider says stopped).
Threshold: Any mismatch outside the in-flight task window → system.instance_state_drift signal
Signals: system.instance_state_drift
Recommended remediation: Reconcile — operator-acknowledged correction or notify_and_proceed reassertion.
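A sketch of the mismatch rule, modeling the in-flight task window as "a lifecycle task touched the instance within the last N seconds". That modeling choice, the 5-minute window value and every name below are assumptions. The real window is not documented in this section.

```ruby
# Hypothetical sketch of InstanceStateDriftSensor's rule. The window value and
# the `last_task_at` field are illustrative, not documented values.
IN_FLIGHT_WINDOW_SECONDS = 5 * 60

def instance_state_drift?(db_status, provider_status, last_task_at:, now:)
  return false if db_status == provider_status
  # A recent task may legitimately explain the mismatch (e.g. a stop still completing).
  return false if last_task_at && (now - last_task_at) < IN_FLIGHT_WINDOW_SECONDS

  true
end

now = Time.utc(2026, 6, 3, 12, 0, 0)
puts instance_state_drift?("running", "stopped", last_task_at: now - 60, now: now)   # false: task in flight
puts instance_state_drift?("running", "stopped", last_task_at: now - 3600, now: now) # true
```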
Source: gitops_drift_sensor.rb (Phase 6c GitOps reconciler integration)
Watches: fleet.yaml-declared state vs effective fleet (assignments / templates / instances).
Threshold: Diff present → gitops.drift_detected signal with the proposal payload
Signals: gitops.drift_detected, gitops.drift_resolved
Recommended remediation: Gitops::ApplyService proposes a reconcile change via Ai::AgentProposal (operator approval required for apply).
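A sketch of the declared-vs-effective diff that could back the proposal payload. It assumes both sides can be keyed by a resource identifier such as `"template/web"`. It also treats anything not declared in fleet.yaml as a deletion candidate; that is an assumption, and the real reconciler may scope deletions differently. Method names are hypothetical.

```ruby
# Hypothetical sketch of the declared-vs-effective diff. Both sides map a
# resource key (e.g. "template/web") to its attribute hash.
def gitops_diff(declared, effective)
  {
    create: (declared.keys - effective.keys).sort,
    delete: (effective.keys - declared.keys).sort,
    update: (declared.keys & effective.keys).reject { |k| declared[k] == effective[k] }.sort
  }
end

# nil means "no drift" (no signal); otherwise the diff becomes the proposal payload.
def gitops_drift_payload(declared, effective)
  diff = gitops_diff(declared, effective)
  diff.values.all?(&:empty?) ? nil : diff
end

declared  = { "template/web" => { "size" => "m" }, "assignment/db" => { "module" => "pg" } }
effective = { "template/web" => { "size" => "s" }, "instance/old" => {} }
p gitops_drift_payload(declared, effective)
```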
Source: package_drift_sensor.rb
Watches: PackageRepository freshness windows + drift between manifests and registered NodeModules.
Threshold: Stale repository sync OR manifest divergence → system.package_drift_pressure signal
Signals: system.package_drift_pressure
Recommended remediation: package_repository_sync or package_module_refresh (Fleet Autonomy auto_approve for sync, notify_and_proceed for refresh).
Source: project_slo_sensor.rb
Watches: Project-scoped rolling-window metrics (latency, error rate, cost guardrail).
Threshold: Per-project SLO breach OR cost guardrail trip → typed signal (project.slo_violation, project.drift, project.cost_breach).
Signals: project.slo_violation, project.drift, project.cost_breach
Recommended remediation: None automated — feeds the project dashboard for operator review.
Source: sdwan_credential_expiry_sensor.rb
Watches: WireGuard pre-shared keys, IPSec material, peer credentials with TTL ≤ 5 minutes / 15 minutes.
Threshold: Per-key advisory/urgent windows → sdwan.credential_expiring / sdwan.credential_expired signals
Signals: sdwan.credential_expiring, sdwan.credential_expired, sdwan.credential_rotated
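Recommended remediation: sdwan_key_rotate (auto_approve). Key rotation is one of the AUTONOMOUS system.sdwan_* remediations owned by Fleet Autonomy, not the SDWAN Manager.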
Source: storage_assignment_drift_sensor.rb
Watches: Volume / NFS export assignment freshness; 5-minute stale window.
Threshold: Stale assignment data → system.storage_assignment_drift signal
Signals: system.storage_assignment_drift
Recommended remediation: attach_storage / detach_storage (operator-approved).
```mermaid
flowchart TD
    Tick[Sensor tick 60s] --> Emit[Emit FleetEvent]
    Emit --> Eval[DecisionEngine.evaluate event]
    Eval --> Lookup{Lookup<br/>InterventionPolicy<br/>action_category}
    Lookup -->|auto_approve| AutoExec[Execute immediately<br/>e.g. cert_rotate]
    Lookup -->|notify_and_proceed| NotifyExec[Execute + push notification<br/>e.g. drift_remediate]
    Lookup -->|require_approval| Queue[Queue ApprovalRequest<br/>e.g. cve_remediate]
    Lookup -->|blocked| Drop[Drop<br/>refuse to execute]
    Queue --> OpApprove{Operator<br/>approves?}
    OpApprove -->|yes| Exec2[Execute]
    OpApprove -->|no / timeout| Reject[Rejected]
    AutoExec --> Audit[Audit + FleetEvent + ActionCable broadcast]
    NotifyExec --> Audit
    Exec2 --> Audit
    Drop --> Audit
    Reject --> Audit
```
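The branches in the decision flowchart can be sketched as a single dispatch. This is a simplified, hypothetical model, not the DecisionEngine API:
- The method names and outcome symbols are illustrative.
- The asynchronous approval wait is collapsed into an `approved:` argument.

```ruby
# Hypothetical sketch of the policy branches in the decision flowchart.
# approved: true = operator approved, false = rejected, nil = timed out.
def dispatch(policy, approved: nil)
  case policy
  when "auto_approve"       then :executed
  when "notify_and_proceed" then :executed_and_notified
  when "require_approval"   then approved == true ? :executed_after_approval : :rejected
  when "blocked"            then :dropped
  else raise ArgumentError, "unknown policy: #{policy}"
  end
end

# Every branch, including drops and rejections, ends in an audit record.
def decide(action_category, policies, audit:, approved: nil)
  outcome = dispatch(policies.fetch(action_category), approved: approved)
  audit << { action_category: action_category, outcome: outcome }
  outcome
end

policies = { "system.cert_rotate" => "auto_approve", "system.instance_terminate" => "require_approval" }
audit = []
decide("system.cert_rotate", policies, audit: audit)        # :executed
decide("system.instance_terminate", policies, audit: audit) # :rejected (no decision = timeout)
```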
Action executors live at `extensions/system/server/app/services/system/ai/skills/*_executor.rb`.
Sensors read thresholds from `Fleet::SensorConfig` records (account-scoped). Operator-tunable via:
```javascript
// ⚠️ Sensor config MCP actions are aspirational; edit Fleet::SensorConfig via Rails console or REST today
// platform.system_get_sensor_config({ sensor: "instance_status" })  // aspirational
// platform.system_update_sensor_config({                            // aspirational
//   sensor: "instance_status",
//   silent_threshold_minutes: 10                                    // default 5
// })
```
Until those MCP wrappers ship, configure via the Rails console:
```ruby
Fleet::SensorConfig.upsert_for(account: Account.find("<id>"), sensor: "instance_status",
                               config: { silent_threshold_minutes: 10 })
```
If no `Fleet::SensorConfig` exists for an account, the defaults defined as constants in each sensor class apply.
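The default fallback can be sketched as a hash merge. Assumptions:
- `INSTANCE_STATUS_DEFAULTS` stands in for a sensor class constant.
- `stored` stands in for the account's stored config hash (nil when no record exists).
- A partial override is merged over the defaults, rather than replacing them wholesale.
- String keys from a JSON column are normalized to symbols.

```ruby
# Hypothetical sketch of the default fallback; names are illustrative.
INSTANCE_STATUS_DEFAULTS = { silent_threshold_minutes: 5 }.freeze

# `stored` is the account's config hash, or nil when no Fleet::SensorConfig record exists.
def effective_sensor_config(defaults, stored)
  defaults.merge((stored || {}).transform_keys(&:to_sym))
end

p effective_sensor_config(INSTANCE_STATUS_DEFAULTS, nil)                                    # defaults only
p effective_sensor_config(INSTANCE_STATUS_DEFAULTS, { "silent_threshold_minutes" => 10 })   # override wins
```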
To add a new sensor:
- Create `extensions/system/server/app/services/system/fleet/sensors/<name>_sensor.rb` extending `Fleet::Sensors::BaseSensor`.
- Implement `tick(account:)` returning an array of `FleetEvent` rows (or an empty array).
- Add the sensor class to `FleetAutonomyService::SENSORS` so the Fleet Autonomy reconciler runs it on each tick.
- Add an intervention policy entry for the action category your sensor's recommendation maps to. Put it in the seed file of the agent that owns that category (`db/seeds/fleet_autonomy_agent.rb` for general fleet ops).
- Add a corresponding skill executor (if remediation is automatable); see `SKILL_EXECUTORS.md`.
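A skeleton for the first two steps, runnable outside Rails. `BaseSensor` and `FleetEvent` are minimal stand-ins, not the real classes. `ExampleThresholdSensor`, its constructor input, and the `example.threshold_exceeded` kind are all hypothetical.

```ruby
# Stand-ins so the sketch runs outside Rails; the real BaseSensor and FleetEvent differ.
FleetEvent = Struct.new(:kind, :severity, :payload, :correlation_id, keyword_init: true)

module Fleet
  module Sensors
    class BaseSensor
      # Subclasses return an array of events (empty when nothing trips).
      def tick(account:)
        raise NotImplementedError, "#{self.class} must implement tick(account:)"
      end
    end

    # Hypothetical example sensor: flags readings at or above a threshold.
    class ExampleThresholdSensor < BaseSensor
      THRESHOLD = 0.9

      def initialize(readings)
        @readings = readings # instance id => reading (illustrative input)
      end

      def tick(account:)
        @readings.select { |_id, value| value >= THRESHOLD }.map do |id, value|
          FleetEvent.new(kind: "example.threshold_exceeded", severity: :warning,
                         payload: { instance_id: id, value: value },
                         correlation_id: "#{account}:#{id}")
        end
      end
    end
  end
end

events = Fleet::Sensors::ExampleThresholdSensor.new({ "i-1" => 0.95, "i-2" => 0.5 }).tick(account: "acct-1")
puts events.map(&:correlation_id).inspect # prints ["acct-1:i-1"]
```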
Seven AI agents seed intervention policies (action_category → policy mapping) since the 2026-05-10 domain split. Sourced from:
- `db/seeds/fleet_autonomy_agent.rb`: 27 policies (non-CVE / non-SDWAN / non-disk-image fleet ops, including the 7 AUTONOMOUS `system.sdwan_*` remediations Fleet Autonomy owns)
- `db/seeds/system_runtime_manager_agent.rb`: 7 policies (Phase 1 Docker + Phase 2 K3s runtime; the prior `system.runtime_docker_tls_rotate` was removed 2026-05-19 because no executor existed)
- `db/seeds/system_cve_responder_agent.rb`: 5 policies (CVE feed → exposure → remediation; CVE policies historically lived on Fleet Autonomy)
- `db/seeds/system_sdwan_manager_agent.rb`: 24 policies (operator-initiated `sdwan.*` CRUD for networks, peers, VIPs, firewall, route policies, and federation; moved off Fleet Autonomy 2026-05-10)
- `db/seeds/system_disk_image_manager_agent.rb`: 6 policies (disk image CI publication lifecycle)
- `db/seeds/system_concierge_agent.rb`: 0 action-category policies. Concierge is a chat agent; intervention is via the `request_confirmation` skill, not policy gating.
- `db/seeds/system_topology_designer_agent.rb`: 0 action-category policies. Topology Designer is a skill-gated specialist invoked by Concierge via `execute_agent`; intervention rides on the parent agent's queue.

Total: 69 action-category policies across the seven system-extension agents.
Prefix split (important): autonomous remediations use the `system.sdwan_*` action prefix and are owned by Fleet Autonomy; operator-initiated CRUD uses the bare `sdwan.*` prefix and is owned by the SDWAN Manager. The two prefixes are distinct policy namespaces.
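The prefix rule can be expressed as a tiny routing helper. `sdwan_policy_owner` and its return symbols are hypothetical. The helper only illustrates that the two prefixes never overlap: `system.sdwan_` cannot match a category starting with `sdwan.`, and vice versa.

```ruby
# Hypothetical helper mirroring the prefix rule: which agent's policy
# namespace an SDWAN action category falls into (nil for non-SDWAN categories).
def sdwan_policy_owner(action_category)
  if action_category.start_with?("system.sdwan_")
    :fleet_autonomy # autonomous remediation
  elsif action_category.start_with?("sdwan.")
    :sdwan_manager  # operator-initiated CRUD
  end
end

puts sdwan_policy_owner("sdwan.network_create")         # sdwan_manager
puts sdwan_policy_owner("system.cert_rotate").inspect   # nil
```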
Policy semantics:

| Policy | Behavior |
|---|---|
| `auto_approve` | Skill executes immediately on the next reconciler tick. Reversible / routine work only. |
| `notify_and_proceed` | Skill executes + operator notification fires. Operator opted in by upstream config. |
| `require_approval` | ApprovalRequest queued; skill blocked until operator clicks Approve. Sensitive / destructive work. |
| `blocked` | Action is disabled entirely. Reserved for incident response. |
Every policy is additionally subject to the agent's `trust_tier_minimum: monitored` condition. An agent below that trust threshold is auto-blocked regardless of policy.
Source: `db/seeds/fleet_autonomy_agent.rb`. Approval chain: Fleet Autonomy Actions (4-hour timeout, `*` approver, sequential).

As of 2026-05-10, three groups of policies moved out of this file and no longer live here:
- CVE policies moved to `system_cve_responder_agent.rb`.
- The operator-initiated `sdwan.*` CRUD policies moved to `system_sdwan_manager_agent.rb`.
- Disk Image policies moved to `system_disk_image_manager_agent.rb`.

Fleet Autonomy retains the 7 AUTONOMOUS `system.sdwan_*` remediation policies (peer remediate, key rotate, failover, VIP failover, BGP session remediate, route policy audit, federation peer remediate). These are not tabulated below, so the table does not list every seeded policy.
| Action category | Default policy | Why |
|---|---|---|
| `system.cert_rotate` | `auto_approve` | Routine + reversible (90-day mTLS rotation) |
| `system.module_assign` | `notify_and_proceed` | Operator already opted in by configuring template |
| `system.instance_reboot` | `notify_and_proceed` | Reversible; instance returns within ~60 s |
| `system.instance_reprovision` | `require_approval` | Destructive; wipes ephemeral state |
| `system.instance_terminate` | `require_approval` | Destructive; releases provider VM, cascade-FK deletes managed rows |
| `system.cert_revoke` | `require_approval` | Cuts active mTLS session |
| `system.module_promote_to_live` | `require_approval` | Promotes module across the fleet |
| `system.fleet_rolling_upgrade` | `require_approval` | Touches many instances; `rolling_module_upgrade` skill plans batches |
| `system.region_expansion` | `require_approval` | Cost-bearing |
| `system.capacity_resize` | `require_approval` | Cost-bearing; `capacity_recommend` skill emits the proposal |
| `system.observation` | `auto_approve` | Pure observation (no remediation); collects events for dashboards |
| `system.package_repository.sync` | `auto_approve` | Routine PackageRepository refresh |
| `system.package_module.create` | `notify_and_proceed` | Materialises a NodeModule from PackageRepository |
| `system.package_module.refresh` | `notify_and_proceed` | Re-resolves dependencies / re-validates manifest |
| `system.architecture.propose` | `notify_and_proceed` | `suggest_architectures_for_fleet` skill emits proposals |
| `system.architecture.create` | `require_approval` | Catalog change; affects future provisioning |
| `system.architecture.update` | `require_approval` | Catalog change |
| `system.architecture.delete` | `require_approval` | Catalog change |
Source: db/seeds/system_cve_responder_agent.rb. Approval chain: CVE Response Actions (8-hour timeout — security responses span business days).
| Action category | Default policy | Why |
|---|---|---|
| `system.cve_remediate` | `require_approval` | Composes `cve_response` + `rolling_module_upgrade`; touches fleet |
| `system.cve_sbom_ingest` | `auto_approve` | Routine SBOM refresh from NVD feed |
| `system.cve_exposure_scan` | `auto_approve` | Read-only scan for exposed modules |
| `system.cve_auto_remediate` | `require_approval` | Auto-remediation candidate (`CriticalUpgradeAvailableSensor`) |
| `system.module_critical_upgrade_ready` | `notify_and_proceed` | Patch already in catalog; fly it (gated by operator notify) |
Source: db/seeds/system_sdwan_manager_agent.rb. Approval chain: SDWAN Manager Actions (4-hour timeout). These are operator-initiated sdwan.* CRUD categories (network/peer/firewall/VIP/route-policy/port-mapping/access-grant/user-device/federation create/update/delete) — distinct from the AUTONOMOUS system.sdwan_* remediations that stay on Fleet Autonomy. Examples: sdwan.network_create, sdwan.firewall_rule_create, sdwan.access_grant_revoke, sdwan.federation_peer_accept. See SDWAN_MANAGER_AGENT.md for the full table.
Source: db/seeds/system_disk_image_manager_agent.rb. Approval chain: Disk Image Manager Actions (12-hour timeout — image promotions span release windows). See DISK_IMAGE_MANAGER_AGENT.md for the full table. Categories include system.disk_image_publication_promote, system.disk_image_publication_rollback, system.disk_image_webhook_trigger, system.disk_image_retention_update. Note: the 2026-05-19 accuracy audit found two seeded policies (system.disk_image_webhook_revoke, system.disk_image_webhook_rotate_secret) whose executors were still pending — confirm their current status before relying on autonomous handling.
Source: db/seeds/system_runtime_manager_agent.rb. Approval chain: Runtime Manager Actions (4-hour timeout, * approver, sequential, separate from Fleet Autonomy chain).
| Action category | Default policy | Why |
|---|---|---|
| `system.runtime_docker_provision` | `notify_and_proceed` | Operator opted in by assigning docker-engine module; provisioning is the obvious follow-through |
| `system.runtime_docker_decommission` | `require_approval` | Destructive; destroys managed `Devops::DockerHost` row + Vault TLS material |
| `system.runtime_k8s_cluster_bootstrap` | `notify_and_proceed` | Operator opted in by assigning k3s-server module |
| `system.runtime_k8s_cluster_decommission` | `require_approval` | Destructive; cascade-deletes member node rows |
| `system.runtime_k8s_node_join` | `notify_and_proceed` | Operator opted in by assigning k3s-agent module |
| `system.runtime_k8s_node_drain` | `require_approval` | Affects running pods |
| `system.runtime_k8s_runtime_upgrade` | `require_approval` | Affects workloads |
Operators can override any policy per-account via the AI Agents UI or by editing Ai::InterventionPolicy directly:
```javascript
// Tighten a default-auto policy
platform.update_intervention_policy({
  agent_id: "<fleet-autonomy-agent-id>",
  action_category: "system.cert_rotate",
  policy: "require_approval"
})
```
Policy changes take effect on the next reconciler tick (≤60 s at the default interval).
In addition to per-policy gates, operators can set a per-module consent budget capping the daily count of autonomous decisions touching that module. Once exhausted, all autonomous actions on that module are forced to require_approval regardless of policy. See app/services/system/fleet/consent_budget_service.rb.
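The gates in this section stack. Below is a hypothetical sketch of how they could combine into one effective policy. The precedence order is an assumption, not taken from the source:
1. The trust-tier block applies first.
2. A per-account override, if present, replaces the default policy.
3. An exhausted consent budget escalates only the autonomous policies (`auto_approve`, `notify_and_proceed`) to `require_approval`.

A `blocked` policy stays blocked. All names are illustrative.

```ruby
AUTONOMOUS_POLICIES = %w[auto_approve notify_and_proceed].freeze

# Hypothetical composition of the gates; consent_remaining nil means no budget is configured.
def effective_policy(default_policy, account_override: nil, below_trust_tier: false,
                     consent_remaining: nil)
  return "blocked" if below_trust_tier # trust tier overrides everything

  policy = account_override || default_policy
  if AUTONOMOUS_POLICIES.include?(policy) && consent_remaining && consent_remaining <= 0
    "require_approval" # consent budget exhausted for this module
  else
    policy
  end
end

puts effective_policy("auto_approve", consent_remaining: 0)               # require_approval
puts effective_policy("auto_approve", account_override: "blocked")        # blocked
```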
Related docs:
- `SKILL_EXECUTORS.md`: remediation actions invoked by sensor signals (`SKILL_EXECUTOR_CATALOG.md` for the full auto-generated list)
- `ARCHITECTURE.md`: autonomy + decision engine subsystem
- `CONTAINER_RUNTIMES.md`: runtime-specific monitoring (Runtime Manager agent has its own policies)
- `runbooks/cve-response.md`: operator runbook using the `cve_remediate` policy chain
- `runbooks/sdwan-network-setup.md`: operator runbook covering SDWAN policies
Last verified: 2026-06-03