Merged
14 changes: 13 additions & 1 deletion PROOFS_NEEDED.md
@@ -2,7 +2,7 @@

## Current State

- **src/abi/*.idr**: YES — `Protocol.idr`, `Types.idr`
- **src/abi/*.idr**: YES — `Protocol.idr`, `Types.idr`, `MTLSPolicy.idr` (obligation stub)
- **Dangerous patterns**: 0 (`believe_me` reference in Protocol.idr is documentation only)
- **LOC**: ~9,500
- **ABI layer**: Idris2 definitions present
@@ -15,6 +15,18 @@
| Protocol state machine | Session state transitions are total | Prevent stuck/invalid protocol states |
| Permission composition | Capability intersection/union laws | Ensure composed permissions don't escalate |
| ABI type safety | FFI boundary type marshalling correctness | Prevent memory corruption at language boundary |
| mTLS trust policy | An unverified client cert is never mapped to a privileged trust class (`classify CertUnverified _ = Untrusted`) | Security core of Phase B mTLS-as-primary-path: forged/unverified certs must be indistinguishable from anonymous |

### mTLS trust policy (Phase B / standards#97)

- **Claim stated:** `src/abi/MTLSPolicy.idr` — `unverifiedNeverPrivileged`.
- **Status:** PENDING (scheduled for Phase C/D — standards#98/#99). The
obligation is recorded as a `0`-multiplicity hole; not yet discharged and
intentionally excluded from `gateway.ipkg modules` so it does not gate the
build before discharge.
- **Mirrors:** the runtime decision in
`Gateway.determine_trust_level_from_cert/2` and the listener fail-closed
contract in `HttpCapabilityGateway.Application`.

## Recommended Prover

2 changes: 1 addition & 1 deletion TEST-NEEDS.md
@@ -50,7 +50,7 @@
- **Zig FFI integration test execution** — requires zig toolchain; covered by separate FFI build step.
- **Container build smoke test** — performed in CI, not in `mix test`.
- **Error handling: upstream timeout** — Req receive_timeout covered implicitly; no dedicated test.
- **Real-CA mTLS integration test** — code uses `Record.extract` accessors but no live cert in test fixtures.
- ~~**Real-CA mTLS integration test** — code uses `Record.extract` accessors but no live cert in test fixtures.~~ **CLOSED (Phase B / standards#97):** `test/mtls_test.exs` drives the cert→trust pipeline with a real test CA (`test/fixtures/mtls/`) and proves the CA trust invariant via `:public_key.pkix_path_validation/3`. Live-socket handshake test across the gateway↔BoJ seam is Phase C scope.
- **Self-tests for config validation on startup** — Application.start refuses without policy, but no dedicated assertion.

## Priority
10 changes: 9 additions & 1 deletion config/runtime.exs
@@ -9,5 +9,13 @@ if config_env() == :prod do
backend_url: System.get_env("BACKEND_URL"),
port: String.to_integer(System.get_env("PORT") || "4000"),
trust_level_header: System.get_env("TRUST_LEVEL_HEADER") || "x-trust-level",
trust_level_source: System.get_env("TRUST_LEVEL_SOURCE") || "header"
trust_level_source: System.get_env("TRUST_LEVEL_SOURCE") || "header",

# mTLS listener (Phase B). When TRUST_LEVEL_SOURCE=mtls these three paths
# MUST all be set and readable or the application refuses to start
# (fail-closed -- see HttpCapabilityGateway.Application.http_listeners/1).
tls_port: String.to_integer(System.get_env("GATEWAY_TLS_PORT") || "4443"),
mtls_ca_cert_path: System.get_env("MTLS_CA_CERT_PATH"),
gateway_cert_path: System.get_env("GATEWAY_CERT_PATH"),
gateway_key_path: System.get_env("GATEWAY_KEY_PATH")
end
100 changes: 100 additions & 0 deletions docs/mtls-rotation-runbook.md
@@ -0,0 +1,100 @@
<!-- SPDX-License-Identifier: PMPL-1.0-or-later -->

# mTLS CA Selection & Certificate Rotation Runbook

**Phase:** B (`hyperpolymath/standards#97`) — mTLS as the primary trust-level path
**Applies to:** `http-capability-gateway` deployed as BoJ tier-2 (ADR 0004)
**Companion:** `boj-server` `docs/integration/http-capability-gateway-plan.md` §Phase B

---

## 1. CA selection decision

Phase B deliverable B3 requires a committed decision on which CA anchors the
mTLS trust chain. The integration plan evaluated these options:

| Option | Decision |
|---|---|
| 1. BoJ-own CA (self-signed root, generated at deploy) | **Adopted for initial wiring.** No external dependency; the gateway and the BoJ gnosis-handler container are the only relying parties, so a dedicated single-purpose root is the smallest trust base. |
| 2. Estate SDP CA | Deferred. Adopt once an estate-wide SDP CA exists; migration is a CA-file swap (§4) with no gateway code change. |
| 3. Cloudflare Origin CA (AOP parity) | Complementary, not a replacement. Cloudflare Authenticated Origin Pulls protect the edge→origin hop; the gateway mTLS protects the gateway→BoJ hop. Both may run; they trust different roots. |

**Authenticated Origin Pulls parity.** The gateway's HTTPS listener is
configured with `verify: :verify_peer` and `fail_if_no_peer_cert: true`
(`HttpCapabilityGateway.Application.tls_socket_opts/0`). A client that does
not present a certificate chaining to `MTLS_CA_CERT_PATH` is rejected at the
TLS handshake — the gateway mirrors the Cloudflare AOP "reject unauthenticated
origin clients" model one tier inward.

## 2. Environment contract

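The gateway reads TLS material from three environment variables:
`MTLS_CA_CERT_PATH`, `GATEWAY_CERT_PATH` and `GATEWAY_KEY_PATH`. When
`TRUST_LEVEL_SOURCE=mtls` all three are **mandatory**: if any is missing or
unreadable the application refuses to start (fail-closed; it never silently
downgrades to the forgeable header path). `GATEWAY_TLS_PORT` and
`TRUST_LEVEL_SOURCE` control the listener port and the trust source.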

| Variable | Meaning |
|---|---|
| `MTLS_CA_CERT_PATH` | PEM CA bundle the client cert chain is verified against |
| `GATEWAY_CERT_PATH` | The gateway's own TLS server certificate |
| `GATEWAY_KEY_PATH` | Private key for `GATEWAY_CERT_PATH` |
| `GATEWAY_TLS_PORT` | HTTPS listener port (default `4443`) |
| `TRUST_LEVEL_SOURCE` | `mtls` to make the HTTPS listener mandatory |

Trust mapping (`Gateway.determine_trust_level_from_cert/2`): a verified client
cert with `OU=Internal Services` → `internal`; any other verified client cert
→ `authenticated`; unverified / no TLS → `untrusted`.
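
The environment contract above can be checked before a deploy. The following is a minimal sketch, not part of the gateway: `check_path` and `mtls_preflight` are hypothetical helper names, and the check uses the shell's `[ -r ]` readability test rather than anything the gateway itself runs.

```shell
#!/bin/sh
# Hypothetical preflight helpers (illustration only, not shipped with the gateway).

# check_path NAME VALUE: fail if VALUE is empty or not a readable file.
check_path() {
  if [ -z "$2" ]; then
    echo "missing: $1 is not set" >&2
    return 1
  fi
  if [ ! -r "$2" ]; then
    echo "unreadable: $1=$2" >&2
    return 1
  fi
}

# mtls_preflight: succeed unless TRUST_LEVEL_SOURCE=mtls and any of the
# three TLS paths is unset or unreadable. All three are checked so that
# every problem is reported in one run.
mtls_preflight() {
  [ "${TRUST_LEVEL_SOURCE:-header}" = "mtls" ] || return 0
  rc=0
  check_path MTLS_CA_CERT_PATH "${MTLS_CA_CERT_PATH:-}" || rc=1
  check_path GATEWAY_CERT_PATH "${GATEWAY_CERT_PATH:-}" || rc=1
  check_path GATEWAY_KEY_PATH "${GATEWAY_KEY_PATH:-}" || rc=1
  return "$rc"
}
```

Running `mtls_preflight` in the deployment environment before rolling the gateway would surface a missing path as a non-zero exit status instead of a refused start.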

## 3. Certificate generation

Production certificates are generated by estate tooling. The shape required:

- **CA**: a long-lived (e.g. 5y) RSA-2048+ or EC-P256 self-signed root.
- **Gateway server cert**: SAN must cover the gateway's service name.
- **Client certs** (one per calling service): subject `OU` carries the
trust class. `OU=Internal Services` grants `internal`. Subjects MUST be
UTF8String-encoded — the gateway's RDN matcher only reads `utf8String`
attribute values (see `test/fixtures/mtls/gen-test-ca.sh` for the exact
`string_mask = utf8only` recipe used by the test CA).
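
For orientation only, the shape above might be produced with the `openssl` CLI roughly as follows. This is a sketch under assumptions: `make_demo_pki`, the file names, and the subject values other than `OU=Internal Services` are hypothetical, and `test/fixtures/mtls/gen-test-ca.sh` remains the authoritative recipe. Production material comes from estate tooling, not from this snippet.

```shell
#!/bin/sh
# Hypothetical demo: self-signed root plus one client cert with
# OU=Internal Services, written into the directory given as $1.
make_demo_pki() {
  dir=$1

  # Minimal config: string_mask = utf8only asks openssl to encode subject
  # attribute values as UTF8String. The [dn] entry is only a prompt label
  # and is unused because -subj is passed on the command line.
  printf '%s\n' \
    '[req]' \
    'distinguished_name = dn' \
    'string_mask = utf8only' \
    'x509_extensions = v3_ca' \
    '[dn]' \
    'commonName = Common Name' \
    '[v3_ca]' \
    'basicConstraints = critical,CA:TRUE' \
    'keyUsage = critical,keyCertSign,cRLSign' \
    > "$dir/openssl.cnf"

  # Long-lived (5 years) RSA-2048 self-signed root.
  openssl req -x509 -newkey rsa:2048 -nodes -sha256 -days 1825 \
    -config "$dir/openssl.cnf" \
    -subj "/O=Example/CN=Demo mTLS Root" \
    -keyout "$dir/ca.key" -out "$dir/ca.crt" 2>/dev/null || return 1

  # Client key and CSR; the subject OU carries the trust class.
  openssl req -new -newkey rsa:2048 -nodes -sha256 \
    -config "$dir/openssl.cnf" \
    -subj "/O=Example/OU=Internal Services/CN=demo-client" \
    -keyout "$dir/client.key" -out "$dir/client.csr" 2>/dev/null || return 1

  # Sign the client CSR with the root (1 year validity).
  openssl x509 -req -sha256 -days 365 \
    -in "$dir/client.csr" -CA "$dir/ca.crt" -CAkey "$dir/ca.key" \
    -CAcreateserial -out "$dir/client.crt" 2>/dev/null || return 1

  # Confirm the client cert chains to the root.
  openssl verify -CAfile "$dir/ca.crt" "$dir/client.crt"
}
```

Calling `make_demo_pki "$(mktemp -d)"` leaves `ca.crt`, `client.crt` and the matching keys in the temporary directory. The gateway server certificate, whose SAN must cover the service name, is not covered by this sketch.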

## 4. Rotation without downtime

The gateway re-reads TLS files only on listener (re)start. Rotate with a
rolling restart, not an in-process reload:

1. **Stage new material** alongside the old (new paths or a versioned dir).
2. **Cross-sign / dual-trust window.** If rotating the *CA*, publish a
`MTLS_CA_CERT_PATH` bundle containing **both** the old and new roots.
Both old and new client certs validate during the overlap.
3. **Roll the gateway.** Update the gateway deployment env to the new
`GATEWAY_CERT_PATH`/`GATEWAY_KEY_PATH`; k9-svc performs a rolling replace.
The old replica drains in-flight requests before exit (circuit breaker
trips closed on the old replica, not the seam).
4. **Roll the BoJ-side client certs** the same way against the dual-trust
bundle.
5. **Retire the old root.** Once every client presents a new-root cert,
publish a `MTLS_CA_CERT_PATH` bundle containing only the new root and
roll the gateway once more.

At no point is there a window where the gateway accepts an unverified client:
each step keeps `verify: :verify_peer` + `fail_if_no_peer_cert: true` in force.
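
The dual-trust bundle in step 2 is a PEM file holding both roots. A minimal sketch of building it, assuming both roots are available as single PEM files; `build_dual_trust_bundle` and `count_certs` are hypothetical helper names, not existing tooling:

```shell
#!/bin/sh
# build_dual_trust_bundle OLD_ROOT NEW_ROOT OUT
# Concatenate two PEM roots into one bundle file.
build_dual_trust_bundle() {
  old_root=$1
  new_root=$2
  out=$3

  # Refuse inputs that do not contain a PEM certificate header.
  for f in "$old_root" "$new_root"; do
    grep -q -e '-----BEGIN CERTIFICATE-----' "$f" || {
      echo "not a PEM certificate: $f" >&2
      return 1
    }
  done

  # Write to a temporary file next to the target, then rename, so a
  # half-written bundle is never visible under the final path.
  # The echo guards against an input that lacks a trailing newline.
  tmp_out="$out.tmp.$$"
  { cat "$old_root"; echo; cat "$new_root"; } > "$tmp_out" || return 1
  mv "$tmp_out" "$out"
}

# count_certs FILE: print how many certificates a PEM bundle contains.
count_certs() {
  grep -c -e '^-----BEGIN CERTIFICATE-----$' "$1"
}
```

During the overlap window `count_certs` on the published bundle should print `2`; after step 5 the retired root is gone and it should print `1`.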

## 5. Failure & rollback

| Symptom | Likely cause | Action |
|---|---|---|
| Gateway refuses to start, log `mTLS listener configuration invalid` | A TLS path is unset/unreadable under `TRUST_LEVEL_SOURCE=mtls` | Restore the path or roll back the deployment env to the previous material |
| All clients get TLS handshake failure after a rotation | CA bundle replaced before clients rotated | Re-publish the dual-trust bundle (step 2) and re-roll |
| A specific service gets `untrusted` unexpectedly | Client cert subject not UTF8String, or wrong `OU` | Re-issue that client cert per §3 |

Full traffic-bypass rollback (re-route around the gateway entirely) is the
`docs/integration/gateway-rollback-runbook.md` Phase E deliverable; this
runbook covers only the certificate/CA layer.

## 6. Test fixtures

`test/fixtures/mtls/` contains a self-contained **test** CA (root, gateway
server cert, an internal-OU client, an ordinary client, and a rogue-CA client
that must fail verification). Regenerate with
`test/fixtures/mtls/gen-test-ca.sh`. These keys are committed deliberately for
the test suite and chain to nothing trusted in production.
191 changes: 151 additions & 40 deletions lib/http_capability_gateway/application.ex
@@ -38,46 +38,58 @@ defmodule HttpCapabilityGateway.Application do
# Start HTTP server and other children
port = Application.get_env(:http_capability_gateway, :port, 4000)

children = [
# Prometheus metrics exporter
{TelemetryMetricsPrometheus.Core, metrics: telemetry_metrics()},

# VeriSimDB async audit log client -- started early so that the
# ETS buffer table (:capgw_verisimdb_buffer) exists before the
# first request arrives. Writes are fire-and-forget casts.
{VeriSimDB, []},

# Circuit breaker FSM -- started BEFORE Minikaran and the HTTP
# server so its ETS table (:gateway_circuit_breaker) exists before
# the first request arrives. The gateway calls allow?/1 on every
# request, so the table must be available from startup.
{CircuitBreaker, []},

# Minikaran traffic anomaly detector -- started BEFORE the HTTP
# server so its telemetry handlers are attached before the first
# request arrives. This ensures zero observation loss at startup.
{Minikaran, name: Minikaran},

# HTTP server with our Gateway router
{Plug.Cowboy, scheme: :http, plug: HttpCapabilityGateway.Gateway, options: [port: port]}
]

opts = [strategy: :one_for_one, name: HttpCapabilityGateway.Supervisor]

Logger.info("Starting HTTP Capability Gateway", port: port)

# Attach Minikaran telemetry handlers after supervision tree starts.
# We use a callback to ensure handlers are attached only after
# the Minikaran GenServer is alive and ready to receive casts.
result = Supervisor.start_link(children, opts)

case result do
{:ok, _pid} ->
Minikaran.TelemetryHandler.attach()
result

error ->
error
# Build the listener child specs. When trust_level_source is "mtls",
# a valid TLS listener is mandatory; http_listeners/1 returns
# {:error, reason} and the application refuses to start (fail-closed)
# rather than silently falling back to the forgeable header path.
with {:ok, listeners} <- http_listeners(port) do
children =
[
# Prometheus metrics exporter
{TelemetryMetricsPrometheus.Core, metrics: telemetry_metrics()},

# VeriSimDB async audit log client -- started early so that the
# ETS buffer table (:capgw_verisimdb_buffer) exists before the
# first request arrives. Writes are fire-and-forget casts.
{VeriSimDB, []},

# Circuit breaker FSM -- started BEFORE Minikaran and the HTTP
# server so its ETS table (:gateway_circuit_breaker) exists before
# the first request arrives. The gateway calls allow?/1 on every
# request, so the table must be available from startup.
{CircuitBreaker, []},

# Minikaran traffic anomaly detector -- started BEFORE the HTTP
# server so its telemetry handlers are attached before the first
# request arrives. This ensures zero observation loss at startup.
{Minikaran, name: Minikaran}
] ++ listeners

opts = [strategy: :one_for_one, name: HttpCapabilityGateway.Supervisor]

Logger.info("Starting HTTP Capability Gateway", port: port)

# Attach Minikaran telemetry handlers after supervision tree starts.
# We use a callback to ensure handlers are attached only after
# the Minikaran GenServer is alive and ready to receive casts.
result = Supervisor.start_link(children, opts)

case result do
{:ok, _pid} ->
Minikaran.TelemetryHandler.attach()
result

error ->
error
end
else
{:error, reason} ->
Logger.error(
"mTLS listener configuration invalid; refusing to start (fail-closed)",
error: inspect(reason)
)

{:error, {:listener_config_invalid, reason}}
end

{:error, reason} ->
@@ -86,6 +98,105 @@ defmodule HttpCapabilityGateway.Application do
end
end

# Build the HTTP/HTTPS listener child specs.
#
# The plaintext HTTP listener is always started: it serves the development
# header-trust path and unauthenticated public routes. The mTLS HTTPS
# listener is started in addition whenever TLS material is configured.
#
# Trust-level-source contract (the Phase B security invariant):
#
# * "header" (default) -- HTTP listener only. Header trust is for
# development and for public routes behind a trusted edge.
#
# * "mtls" -- the HTTPS listener with `verify: :verify_peer` and
# `fail_if_no_peer_cert: true` is MANDATORY. If the TLS material is
# missing or unreadable we return {:error, _} so the application
# refuses to start. We never silently downgrade an mTLS deployment to
# the forgeable header path.
defp http_listeners(port) do
http = {Plug.Cowboy, scheme: :http, plug: HttpCapabilityGateway.Gateway, options: [port: port]}

trust_source = Application.get_env(:http_capability_gateway, :trust_level_source, "header")

case tls_socket_opts() do
{:ok, tls_opts} ->
tls_port = Application.get_env(:http_capability_gateway, :tls_port, 4443)

https =
{Plug.Cowboy,
scheme: :https,
plug: HttpCapabilityGateway.Gateway,
options: [port: tls_port] ++ tls_opts}

Logger.info("mTLS listener enabled", tls_port: tls_port, verify: :verify_peer)
{:ok, [http, https]}

:no_tls when trust_source == "mtls" ->
{:error,
"trust_level_source is \"mtls\" but TLS material is not configured. " <>
"Set MTLS_CA_CERT_PATH, GATEWAY_CERT_PATH and GATEWAY_KEY_PATH."}

:no_tls ->
{:ok, [http]}

{:error, _reason} = err when trust_source == "mtls" ->
err

{:error, reason} ->
Logger.warning(
"TLS material configured but unreadable; starting HTTP listener only",
error: inspect(reason)
)

{:ok, [http]}
end
end

# Resolve the Cowboy TLS socket options from the environment.
#
# Returns:
# * {:ok, opts} -- all three paths set and the files exist
# * :no_tls -- no TLS material configured at all
# * {:error, reason} -- partially configured or files missing
#
# `verify: :verify_peer` + `fail_if_no_peer_cert: true` makes the TLS
# handshake itself reject any client that does not present a certificate
# chaining to `cacertfile`. A request that reaches the Plug pipeline over
# this listener has therefore already had its client certificate chain
# verified by the transport (see Gateway.is_cert_verified/1).
defp tls_socket_opts do
ca = Application.get_env(:http_capability_gateway, :mtls_ca_cert_path)
cert = Application.get_env(:http_capability_gateway, :gateway_cert_path)
key = Application.get_env(:http_capability_gateway, :gateway_key_path)

cond do
is_nil(ca) and is_nil(cert) and is_nil(key) ->
:no_tls

is_nil(ca) or is_nil(cert) or is_nil(key) ->
{:error,
"incomplete TLS configuration: MTLS_CA_CERT_PATH, GATEWAY_CERT_PATH " <>
"and GATEWAY_KEY_PATH must all be set together"}

true ->
missing = Enum.reject([ca, cert, key], &File.exists?/1)

if missing == [] do
{:ok,
[
cacertfile: ca,
certfile: cert,
keyfile: key,
verify: :verify_peer,
fail_if_no_peer_cert: true
]}
else
{:error, "TLS files not found: #{Enum.join(missing, ", ")}"}
end
end
end

# Load policy from file or BoJ catalog, validate, and compile.
# Resolution order:
# 1. BOJ_CARTRIDGES_ROOT env var — catalog mode (auto-policy from cartridge.json)