Add ARCHITECTURE.md #374
base: main
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,229 @@ | ||||||||||||||||||||||||||||||
| # Architecture | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| This document describes the internal architecture of CipherStash Proxy. It's intended for anyone who wants to understand how the proxy pulls off transparent, searchable encryption without requiring application changes. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ## Overview | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| CipherStash Proxy sits between an application and PostgreSQL. It intercepts SQL statements over the PostgreSQL wire protocol, determines which columns are encrypted, rewrites queries to use [EQL v2](https://github.com/cipherstash/encrypt-query-language) operations, encrypts literals and parameters, forwards the transformed query to PostgreSQL, and decrypts results before returning them to the application. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| The two most interesting pieces of the system are: | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| 1. **eql-mapper** — a SQL type inference and transformation engine that understands which parts of a query touch encrypted columns | ||||||||||||||||||||||||||||||
| 2. **The protocol bridge** — a dual-stream PostgreSQL wire protocol interceptor that handles encryption and decryption transparently across both the simple and extended query protocols | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ## How a Query Flows Through the System | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||||||||
| Application CipherStash Proxy PostgreSQL | ||||||||||||||||||||||||||||||
| | | | | ||||||||||||||||||||||||||||||
| |--- SQL statement ------------->| | | ||||||||||||||||||||||||||||||
| | Parse SQL into AST | | ||||||||||||||||||||||||||||||
| | Import schema (tables, columns, EQL types) | | ||||||||||||||||||||||||||||||
| | Run type inference (unification) | | ||||||||||||||||||||||||||||||
| | Identify encrypted literals & parameters | | ||||||||||||||||||||||||||||||
| | Encrypt values via ZeroKMS | | ||||||||||||||||||||||||||||||
| | Apply transformation rules to AST | | ||||||||||||||||||||||||||||||
| | Emit rewritten SQL | | ||||||||||||||||||||||||||||||
| | |--- transformed SQL ------------------>| | ||||||||||||||||||||||||||||||
| | |<-- result rows ----------------------| | ||||||||||||||||||||||||||||||
| | Identify encrypted columns in results | | ||||||||||||||||||||||||||||||
| | Batch-decrypt values via ZeroKMS | | ||||||||||||||||||||||||||||||
| | Re-encode to PostgreSQL wire format | | ||||||||||||||||||||||||||||||
| |<-- plaintext results ----------| | | ||||||||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ## SQL Type Inference Engine (eql-mapper) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| The `eql-mapper` package is responsible for analyzing SQL statements and determining exactly which expressions, literals, and parameters need to be encrypted — and *how* they need to be encrypted. It does this through a constraint-based type inference system that operates entirely at parse time, without executing any SQL. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ### The Type System | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| Every AST node in a parsed SQL statement is assigned a type. Types are either: | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - **Native** — a standard PostgreSQL type. The proxy doesn't need to do anything special with these. | ||||||||||||||||||||||||||||||
| - **EQL** — an encrypted column type, carrying information about which operations it supports. | ||||||||||||||||||||||||||||||
| - **Projection** — an ordered list of column types (the result shape of a `SELECT`, subquery, or CTE). | ||||||||||||||||||||||||||||||
| - **Var** — an unresolved type variable, used during inference and resolved through unification. | ||||||||||||||||||||||||||||||
| - **Associated** — a type that depends on another type's trait implementation (e.g., "the tokenized form of this column"). | ||||||||||||||||||||||||||||||
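The five kinds of type listed above can be pictured as a single Rust enum. This is a minimal sketch, not the actual `eql-mapper` definitions: the `Type` enum, its field names, and the `touches_encrypted` helper are all hypothetical and exist only to illustrate the shape of the type system.

```rust
/// Hypothetical model of the five type kinds; not the real eql-mapper types.
#[derive(Debug, Clone, PartialEq)]
pub enum Type {
    /// A standard PostgreSQL type; nothing special to do.
    Native,
    /// An encrypted column type (trait bounds omitted in this sketch).
    Eql { table: String, column: String },
    /// Ordered column types: the result shape of a SELECT, subquery, or CTE.
    Projection(Vec<Type>),
    /// An unresolved type variable, identified by a numeric id.
    Var(u32),
    /// A type derived from another type, e.g. its tokenized form.
    Associated { parent: Box<Type>, name: String },
}

impl Type {
    /// True when this type, or any type nested inside it, is an EQL type.
    pub fn touches_encrypted(&self) -> bool {
        match self {
            Type::Eql { .. } => true,
            Type::Projection(cols) => cols.iter().any(Type::touches_encrypted),
            Type::Associated { parent, .. } => parent.touches_encrypted(),
            Type::Native | Type::Var(_) => false,
        }
    }
}
```

Under these assumptions, a projection containing one native column and one EQL column reports `true`, which is the signal a proxy would need to know that result rows require decryption.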
|
|
||||||||||||||||||||||||||||||
| EQL types carry **trait bounds** that describe what operations the encrypted column supports: | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| | Trait | Operations | Example | | ||||||||||||||||||||||||||||||
| |---|---|---| | ||||||||||||||||||||||||||||||
| | `Eq` | `=`, `<>` | `WHERE email = 'alice@example.com'` | | ||||||||||||||||||||||||||||||
| | `Ord` | `<`, `>`, `<=`, `>=`, `MIN`, `MAX` | `WHERE salary > 100000` | | ||||||||||||||||||||||||||||||
| | `TokenMatch` | `LIKE`, `ILIKE` | `WHERE name LIKE '%alice%'` | | ||||||||||||||||||||||||||||||
| | `JsonLike` | `->`, `->>`, `jsonb_path_query` | `WHERE data->>'role' = 'admin'` | | ||||||||||||||||||||||||||||||
| | `Contain` | `@>`, `<@` | `WHERE tags @> '["urgent"]'` | | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| Traits form a hierarchy — `Ord` implies `Eq`, and `JsonLike` implies both `Ord` and `Eq`. | ||||||||||||||||||||||||||||||
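The hierarchy described above amounts to a small implication closure. The sketch below is an illustration only: `EqlTrait`, `implies`, `closure`, and `satisfies` are hypothetical names, and the implication rules are taken directly from the sentence above (`Ord` implies `Eq`; `JsonLike` implies `Ord` and `Eq`).

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the trait bounds in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum EqlTrait {
    Eq,
    Ord,
    TokenMatch,
    JsonLike,
    Contain,
}

impl EqlTrait {
    /// Traits directly implied by this one, per the stated hierarchy.
    fn implies(self) -> &'static [EqlTrait] {
        match self {
            EqlTrait::Ord => &[EqlTrait::Eq],
            EqlTrait::JsonLike => &[EqlTrait::Ord, EqlTrait::Eq],
            _ => &[],
        }
    }
}

/// Expand a set of declared traits to include everything they imply.
pub fn closure(declared: &[EqlTrait]) -> BTreeSet<EqlTrait> {
    let mut out = BTreeSet::new();
    let mut stack: Vec<EqlTrait> = declared.to_vec();
    while let Some(t) = stack.pop() {
        // Only follow implications the first time a trait is seen.
        if out.insert(t) {
            stack.extend_from_slice(t.implies());
        }
    }
    out
}

/// Does a column declared with `declared` support an operator needing `required`?
pub fn satisfies(declared: &[EqlTrait], required: EqlTrait) -> bool {
    closure(declared).contains(&required)
}
```

With this model, a column declared only as `Ord` also satisfies an `Eq` requirement, while a column declared only as `Eq` does not satisfy `Ord`.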
|
|
||||||||||||||||||||||||||||||
| ### Unification | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| Type inference uses a **unification algorithm** (in the Robinson tradition, similar to what you'd find in a Hindley-Milner type system) adapted for SQL and encrypted types. When the type checker encounters an expression like `salary > 100000`, it: | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| 1. Looks up `salary` in the current scope and finds its type (e.g., `EQL(employees.salary, Ord+Eq)`) | ||||||||||||||||||||||||||||||
| 2. Assigns a fresh type variable to the literal `100000` | ||||||||||||||||||||||||||||||
| 3. Looks up the `>` operator's type signature: `<T>(T > T) -> Native where T: Ord` | ||||||||||||||||||||||||||||||
| 4. Unifies `T` with the salary's EQL type, checking that it satisfies the `Ord` bound | ||||||||||||||||||||||||||||||
| 5. Unifies `T` with the literal's type variable, binding it to the same EQL type | ||||||||||||||||||||||||||||||
| 6. Records that the literal `100000` must be encrypted as `EQL(employees.salary, Ord)` | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| This process propagates type information across the entire statement — through subqueries, CTEs, JOINs, `UNION` branches, function calls, and `RETURNING` clauses. | ||||||||||||||||||||||||||||||
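The six steps for `salary > 100000` can be sketched with a toy unifier. Everything here is an assumption made for illustration: `Ty`, `Unifier`, and `check_gt` are hypothetical, the `Ord` bound is reduced to a boolean, and the bound check runs after both unifications rather than during step 4. The real engine handles far more cases.

```rust
use std::collections::HashMap;

/// Toy types: enough to model one encrypted column and one literal.
#[derive(Debug, Clone, PartialEq)]
pub enum Ty {
    Native,
    Eql { column: String, ord: bool },
    Var(u32),
}

#[derive(Default)]
pub struct Unifier {
    subst: HashMap<u32, Ty>,
    next: u32,
}

impl Unifier {
    /// Allocate a fresh, unbound type variable.
    pub fn fresh(&mut self) -> Ty {
        let v = self.next;
        self.next += 1;
        Ty::Var(v)
    }

    /// Follow variable bindings until reaching a concrete type or an unbound variable.
    pub fn resolve(&self, ty: &Ty) -> Ty {
        let mut cur = ty.clone();
        while let Ty::Var(v) = cur {
            match self.subst.get(&v) {
                Some(next) => cur = next.clone(),
                None => return Ty::Var(v),
            }
        }
        cur
    }

    /// Make two types equal, binding variables as needed.
    pub fn unify(&mut self, a: &Ty, b: &Ty) -> Result<Ty, String> {
        let (a, b) = (self.resolve(a), self.resolve(b));
        match (a, b) {
            (Ty::Var(x), Ty::Var(y)) if x == y => Ok(Ty::Var(x)),
            (Ty::Var(v), other) | (other, Ty::Var(v)) => {
                self.subst.insert(v, other.clone());
                Ok(other)
            }
            (a, b) if a == b => Ok(a),
            (a, b) => Err(format!("cannot unify {a:?} with {b:?}")),
        }
    }

    /// Check `lhs > rhs` against `<T>(T > T) -> Native where T: Ord`.
    pub fn check_gt(&mut self, lhs: &Ty, rhs: &Ty) -> Result<Ty, String> {
        let t = self.fresh();
        self.unify(&t, lhs)?;
        self.unify(&t, rhs)?;
        match self.resolve(&t) {
            Ty::Eql { ord: false, column } => Err(format!("{column} does not support Ord")),
            _ => Ok(Ty::Native),
        }
    }
}
```

After `check_gt` succeeds on an `Ord`-capable column and a fresh variable for the literal, resolving that variable yields the column's EQL type, which corresponds to step 6: the literal is now known to need encryption for that column.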
|
|
||||||||||||||||||||||||||||||
| A particularly interesting aspect is how EQL types unify with each other. When two `Partial` EQL types for the same column meet, their bounds are merged (union). When a `Partial` meets a `Full`, the result promotes to `Full`. This means the system automatically infers the minimum encryption payload needed for each value. | ||||||||||||||||||||||||||||||
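The merge rule for `Partial` and `Full` can be written as a small function. This is a sketch under stated assumptions: `EqlTy`, `Bound`, and `unify_eql` are hypothetical names, and returning `None` for types that belong to different columns is a choice made here, since the text only describes the same-column case.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of the trait bounds a Partial type can carry.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Bound {
    Eq,
    Ord,
    TokenMatch,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EqlTy {
    Full { column: String },
    Partial { column: String, bounds: BTreeSet<Bound> },
}

impl EqlTy {
    fn column(&self) -> &str {
        match self {
            EqlTy::Full { column } | EqlTy::Partial { column, .. } => column,
        }
    }
}

/// Unify two EQL types; `None` means they refer to different columns (assumed).
pub fn unify_eql(a: &EqlTy, b: &EqlTy) -> Option<EqlTy> {
    if a.column() != b.column() {
        return None;
    }
    match (a, b) {
        // Partial meets Partial: merge the bounds (set union).
        (EqlTy::Partial { bounds: x, .. }, EqlTy::Partial { bounds: y, .. }) => {
            Some(EqlTy::Partial {
                column: a.column().to_string(),
                bounds: x.union(y).copied().collect(),
            })
        }
        // Any combination involving Full promotes to Full.
        _ => Some(EqlTy::Full { column: a.column().to_string() }),
    }
}
```

Unifying `Partial {Eq}` with `Partial {Ord}` for the same column gives `Partial {Eq, Ord}`, and unifying either with `Full` gives `Full`.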
|
||||||||||||||||||||||||||||||
| A particularly interesting aspect is how EQL types unify with each other. When two `Partial` EQL types for the same column meet, their bounds are merged (union). When a `Partial` meets a `Full`, the result promotes to `Full`. This means the system automatically infers the minimum encryption payload needed for each value. | |
| A particularly interesting aspect is how EQL types unify with each other. When two `Partial` EQL types for the same column meet, their bounds are merged (union). When a `Partial` meets a `Full`, the result promotes to `Full`. Within `eql-mapper`, this allows the type system to infer the minimum bounds required for each value; however, the current proxy encryption path does not yet use these `Partial` bounds and treats `Full`/`Partial`/`Tokenized` the same when deciding what to encrypt. |
Copilot (AI), Feb 12, 2026
The EQL operation routing table suggests Partial(Eq) vs Partial(Ord) map to different storage behavior and that JsonAccessor uses a distinct “field selector” query op. In the current proxy encryption service, Full/Partial/Tokenized all use EqlOperation::Store, and both JsonPath and JsonAccessor currently use EqlOperation::Query(..., QueryOp::SteVecSelector) when a SteVec index exists. Please align this table with the implemented routing logic (or note where behavior differs by design).
| | `Full` | `EqlOperation::Store` | Inserting a new encrypted value with all search terms | | |
| | `Partial(Eq)` | `EqlOperation::Store` | Equality query — only equality search terms needed | | |
| | `Partial(Ord)` | `EqlOperation::Store` | Comparison query — only ORE search terms needed | | |
| | `Tokenized` | `EqlOperation::Store` | LIKE query — tokenized search terms | | |
| | `JsonPath` | `EqlOperation::Query` with `SteVecSelector` | JSON path query argument | | |
| | `JsonAccessor` | `EqlOperation::Query` with field selector | JSON field access argument | | |
| | `Full` | `EqlOperation::Store` | Insert/update with all configured search terms materialised | | |
| | `Partial(Eq)` | `EqlOperation::Store` | Equality-oriented operations — only equality search terms are constructed | | |
| | `Partial(Ord)` | `EqlOperation::Store` | Ordering/comparison operations — only ORE search terms are constructed | | |
| | `Tokenized` | `EqlOperation::Store` | Pattern/LIKE-style operations — tokenized search terms are constructed | | |
| | `JsonPath` | `EqlOperation::Query` with `SteVecSelector` | JSON path query argument (uses SteVec index when available) | | |
| | `JsonAccessor` | `EqlOperation::Query` with `SteVecSelector` | JSON field access argument (same SteVec selector as `JsonPath` in current implementation) | | |
| **Note:** In the current proxy implementation, all of `Full`, `Partial(Eq)`, `Partial(Ord)`, and `Tokenized` are routed to `EqlOperation::Store`; the `Partial`/`Tokenized` variants only affect which search terms are built inside the payload. Likewise, both `JsonPath` and `JsonAccessor` use `EqlOperation::Query(..., QueryOp::SteVecSelector)` when a SteVec index exists, even though the mapper distinguishes them conceptually. |
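The routing described in the note can be sketched as a single `match`. The enums below are simplified stand-ins that reuse the names from the table, not the proxy's real definitions: the actual `EqlOperation::Query` takes additional arguments that are elided in the comment, and `route` is a hypothetical function. The source does not say what happens for JSON types when no SteVec index exists, so this sketch returns `None` in that case rather than guessing.

```rust
/// Simplified stand-ins for the mapper-side value types in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bound {
    Eq,
    Ord,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MapperType {
    Full,
    Partial(Bound),
    Tokenized,
    JsonPath,
    JsonAccessor,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryOp {
    SteVecSelector,
}

/// Simplified: the real Query variant carries more data than shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EqlOperation {
    Store,
    Query(QueryOp),
}

/// Hypothetical routing function mirroring the note above.
pub fn route(ty: MapperType, has_ste_vec_index: bool) -> Option<EqlOperation> {
    match ty {
        // Full, Partial, and Tokenized all route to Store.
        MapperType::Full | MapperType::Partial(_) | MapperType::Tokenized => {
            Some(EqlOperation::Store)
        }
        // JsonPath and JsonAccessor share one selector when a SteVec index exists.
        MapperType::JsonPath | MapperType::JsonAccessor if has_ste_vec_index => {
            Some(EqlOperation::Query(QueryOp::SteVecSelector))
        }
        // Behaviour without a SteVec index is not described in the source.
        MapperType::JsonPath | MapperType::JsonAccessor => None,
    }
}
```

Written this way, the point of the review comment is visible in the code: `Partial(Eq)` and `Partial(Ord)` land in the same arm, and so do `JsonPath` and `JsonAccessor`.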
Copilot (AI), Feb 12, 2026
The schema-loading description is inaccurate: the proxy’s schema loader marks encrypted columns by checking information_schema.columns.udt_name == 'eql_v2_encrypted' (see select_table_schemas.sql / SchemaManager::load_schema). It does not consult eql_v2_configuration for this, and index support comes from the separate Encrypt configuration loaded from public.eql_v2_configuration (via select_config.sql). Please update this paragraph to reflect the actual split between schema discovery vs encrypt-config loading.
| The proxy discovers the database schema at startup and reloads it periodically. Schema loading queries PostgreSQL's `information_schema` to discover tables and columns, then checks `eql_v2_configuration` to determine which columns are encrypted and what index types they support. | |
| The proxy discovers the database schema at startup and reloads it periodically. Schema loading queries PostgreSQL's `information_schema` to discover tables and columns and marks encrypted columns by checking `information_schema.columns.udt_name = 'eql_v2_encrypted'` (via `select_table_schemas.sql` / `SchemaManager::load_schema`). Separately, it loads the Encrypt configuration from `public.eql_v2_configuration` (via `select_config.sql`) to determine index types and other search capabilities for those encrypted columns. |
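The schema-discovery half of that split can be sketched as a pure function over column rows. This is an illustration, not `SchemaManager::load_schema`: `ColumnRow`, `Column`, and `build_schema` are hypothetical, and the rows are assumed to have already been fetched from `information_schema.columns`. Index types are deliberately absent because, per the comment, they come from the separately loaded Encrypt configuration.

```rust
use std::collections::BTreeMap;

/// UDT name that marks an encrypted column, as quoted in the review comment.
pub const ENCRYPTED_UDT: &str = "eql_v2_encrypted";

/// One row of a hypothetical `information_schema.columns` result.
pub struct ColumnRow<'a> {
    pub table: &'a str,
    pub column: &'a str,
    pub udt_name: &'a str,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Column {
    pub name: String,
    pub encrypted: bool,
}

/// Group rows by table and flag encrypted columns by UDT name only.
pub fn build_schema(rows: &[ColumnRow<'_>]) -> BTreeMap<String, Vec<Column>> {
    let mut tables: BTreeMap<String, Vec<Column>> = BTreeMap::new();
    for row in rows {
        tables.entry(row.table.to_string()).or_default().push(Column {
            name: row.column.to_string(),
            encrypted: row.udt_name == ENCRYPTED_UDT,
        });
    }
    tables
}
```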
Copilot (AI), Feb 12, 2026
The reload-cycle bullet for Startup says schema is loaded with exponential backoff retry. In the current implementation, startup loads schema/config without retries (SchemaManager::init calls load_schema, and EncryptConfigManager::init_reloader calls load_encrypt_config); retries with backoff only happen during reloads (load_schema_with_retry / load_encrypt_config_with_retry). Please adjust the bullet list so it matches runtime behavior.
| 1. **Startup** — load schema with exponential backoff retry (up to 10 attempts, max 2-second backoff) to handle cases where PostgreSQL isn't ready yet | |
| 2. **Periodic** — a background task reloads schema on a configurable interval | |
| 3. **On-demand** — DDL detection during a transaction triggers a reload when the transaction completes | |
| 1. **Startup** — load schema and encryption configuration once; if this fails, proxy startup fails (no retry/backoff is performed during startup) | |
| 2. **Periodic** — a background task reloads schema and encryption configuration on a configurable interval, using exponential backoff retry on failure (up to 10 attempts, max 2-second backoff) | |
| 3. **On-demand** — DDL detection during a transaction triggers a reload when the transaction completes, also using the same exponential backoff retry behavior on failure |
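The reload retry behaviour (up to 10 attempts, backoff capped at 2 seconds) can be sketched as follows. This is a synchronous illustration, not `load_schema_with_retry`: the function names are hypothetical, the 100 ms starting delay and the doubling factor are assumptions because the source gives only the attempt limit and the cap, and the sleep is passed in as a closure so the logic can be exercised without waiting.

```rust
use std::time::Duration;

/// Limits taken from the bullet list above.
pub const MAX_ATTEMPTS: u32 = 10;
pub const MAX_BACKOFF: Duration = Duration::from_secs(2);
/// Assumed starting delay; the source does not state it.
pub const INITIAL_BACKOFF: Duration = Duration::from_millis(100);

/// Delay to wait after the given failed attempt (1-based), doubling each time.
pub fn backoff_for(attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt.saturating_sub(1));
    INITIAL_BACKOFF.saturating_mul(factor).min(MAX_BACKOFF)
}

/// Run `op` until it succeeds or MAX_ATTEMPTS is reached, sleeping between tries.
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= MAX_ATTEMPTS => return Err(err),
            Err(_) => {
                sleep(backoff_for(attempt));
                attempt += 1;
            }
        }
    }
}
```

In a reload path, `op` would wrap the schema or configuration load and `sleep` would be `std::thread::sleep` or an async equivalent. At startup, by contrast, the load is called once with no such wrapper, as the corrected bullet states.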
Add a language identifier to the code fence.
Markdownlint flags the package tree block as missing a language specifier. Consider `text` for the tree.
Suggested fix
-```
+```text
packages/
├── cipherstash-proxy/ # Main proxy binary
│ └── src/
│ ├── postgresql/ # Wire protocol implementation
│ │ ├── frontend.rs # Client → Server message handling
│ │ ├── backend.rs # Server → Client message handling
│ │ ├── handler.rs # Connection startup and auth
│ │ ├── protocol.rs # Low-level message reading
│ │ ├── parser.rs # SQL parsing entry point
│ │ └── context/ # Session state (statements, portals, metadata)
│ ├── proxy/ # Encryption service, schema management, config
│ └── config/ # Configuration parsing
├── eql-mapper/ # SQL type inference and transformation
│ └── src/
│ ├── inference/ # Type inference engine
│ │ ├── unifier/ # Unification algorithm, type definitions, trait bounds
│ │ ├── sql_types/ # Operator and function type signatures
│ │ └── infer_type_impls/# Per-AST-node type inference implementations
│ ├── transformation_rules/# AST rewriting rules
│ ├── model/ # Schema, tables, columns, DDL tracking
│ └── scope_tracker.rs # Lexical scope management
├── eql-mapper-macros/ # Proc macros for operator/function declarations
└── showcase/ # Example healthcare data model
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 205-205: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@ARCHITECTURE.md` around lines 205 - 229, The fenced package tree in
ARCHITECTURE.md is missing a language identifier causing markdownlint warnings;
update the opening code fence for the package tree (the triple-backtick block
that contains the directory listing for packages/ and entries like
cipherstash-proxy/, eql-mapper/, eql-mapper-macros/, showcase/) to include a
language (use "text") so the block starts with ```text instead of ```; no other
content changes are needed.
Add a language identifier to the code fence.
Markdownlint flags the diagram block as missing a language specifier. Consider `text` (or `plain`) for the ASCII diagram to satisfy MD040.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents