From 6f4f8b4e22f937c3e266c72f33ae95d811df2bf5 Mon Sep 17 00:00:00 2001
From: JingsongLi
Date: Sat, 30 May 2026 11:58:21 +0800
Subject: [PATCH] [project] Add security threat model and security docs page

---
 SECURITY.md                   | 391 ++++++++++++++++++++++++++++++++++
 docs/docs/project/security.md |  73 +++++++
 docs/sidebars.js              |   3 +-
 3 files changed, 466 insertions(+), 1 deletion(-)
 create mode 100644 SECURITY.md
 create mode 100644 docs/docs/project/security.md

diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 000000000000..00de7a150e3a
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,391 @@
+
+
+# Apache Paimon Security Threat Model
+
+This document describes Apache Paimon's detailed security threat model for
+maintainers and automated security triage.
+
+It complements the shorter public-facing security model in
+[`docs/docs/project/security.md`](docs/docs/project/security.md) (published at the project website) by making
+Paimon's trust assumptions, security boundaries, and recurring non-security
+bug classes more explicit.
+
+## Purpose
+
+Apache Paimon is a streaming data lake platform that is often deployed as a
+library and integration layer inside larger systems (Flink, Spark, Hive, and
+other query engines) that provide their own authentication, authorization, and
+credential management. Because of that deployment model, many bug classes that
+look security-relevant in the abstract are not actually security
+vulnerabilities in Paimon itself.
+
+This model is intended to answer:
+
+- what Paimon generally treats as a security vulnerability
+- what Paimon generally treats as correctness, hardening, or deployment work
+- which boundaries are primarily owned by Paimon versus the surrounding
+  catalog, engine, or service
+- which issue classes should be downgraded by default by scanners
+
+## Scope
+
+This model is scoped to the Apache Paimon project itself:
+
+- the table format implementation (paimon-core)
+- client libraries (paimon-api, paimon-common)
+- the REST Catalog client and protocol (paimon-api, paimon-core)
+- engine integrations (Flink, Spark, Hive connectors)
+- the Python client (pypaimon)
+
+It is not a general threat model for every deployment that embeds Paimon.
+
+In particular, it does not attempt to define the complete security model for:
+
+- query engines or applications that embed Paimon
+- storage-level authorization enforced outside Paimon
+- REST Catalog server implementations (Paimon defines the client and protocol,
+  not the server)
+
+## Security Goals
+
+Paimon should:
+
+- avoid exposing secrets or delegated credentials to principals that were not
+  already trusted with them
+- avoid creating new unauthorized capabilities in Paimon-owned components or
+  integrations
+- avoid violating trust boundaries that Paimon itself owns, such as leaking
+  auth, signer, or credential-bearing state across catalog or session
+  boundaries in the same process
+- avoid leaking delegated storage tokens (data tokens) across table or
+  principal boundaries
+
+Paimon does not aim to be the primary enforcement point for:
+
+- user-to-user authorization inside a query engine
+- storage-level authorization (e.g., object store IAM policies)
+- service-side authorization performed by a REST Catalog server
+- row-level or column-level access control (Paimon relays server-provided
+  filters and column masking rules, but enforcement is in the server)
+
+## Roles
+
+### Operator
+
+The operator deploys and configures the catalog, REST Catalog server, engine,
+and storage integration around Paimon. This role is trusted to choose
+endpoints, warehouses, and storage integrations, configure credentials, and
+decide which users may create, read, or modify tables.
+
+### Catalog Control Plane
+
+The catalog control plane is responsible for resolving tables and supplying
+metadata, locations, configuration, and delegated credentials to Paimon.
+This role may be implemented by:
+
+- a REST Catalog server
+- a Hive Metastore
+- a JDBC-backed catalog
+- a filesystem-based catalog
+
+Regardless of implementation, it should not expose secrets to unintended
+principals or leak credential-bearing state across unintended boundaries.
+
+Paimon assumes a trusted catalog or metastore, which is outside its primary
+security boundary.
+
+### REST Catalog Server
+
+In REST deployments, part of the catalog control plane is implemented by a
+server that returns metadata, configuration, delegated storage credentials
+(data tokens), and query-level authorization (row filters and column masking)
+to the client. This server is generally treated as a trusted control-plane
+component.
+
+The REST Catalog server is responsible for:
+
+- authenticating clients
+- authorizing catalog operations (create/drop/alter databases, tables, views,
+  functions)
+- issuing scoped, time-limited data tokens for storage access
+- providing row-level filters and column masking rules via the auth table
+  query API
+- returning server-side configuration to merge with client configuration
+
+### REST Catalog Client
+
+In REST deployments, the client-side catalog (`RESTCatalog`, `RESTApi`)
+consumes server-provided metadata, configuration, and credentials. Where the
+client and server are meaningfully distinct, client-side bugs in token
+handling, caching, or reuse may still be security-relevant. This is especially
+true when the Paimon-owned client implementation leaks credential-bearing
+state across catalog, session, or principal boundaries it is expected to
+preserve.
+
+The REST Catalog client is responsible for:
+
+- sending authenticated requests using a configured `AuthProvider`
+- refreshing tokens before expiration (with a configurable safe time margin)
+- caching `FileIO` instances keyed by data token (via `RESTTokenFileIO`)
+  and evicting them when tokens expire
+- not mixing data tokens or auth state across different catalog instances or
+  tables in the same process
+
+### Engine or Embedding Application
+
+Query engines (Flink, Spark, Hive, Trino, StarRocks, etc.) and applications
+may expose only a subset of Paimon capabilities to users. They are responsible
+for their own user-facing authorization boundaries unless Paimon explicitly
+documents otherwise.
+
+### Table Writer or Maintainer
+
+This role may already have legitimate power to write or replace table
+metadata, write or delete data files, manage snapshots, create or delete
+branches and tags, and invoke destructive maintenance operations (compaction,
+expiration, rollback). If a report only shows a new way to achieve the same
+effect this role can already cause legitimately, it is usually not a security
+issue in Paimon.
+
+## Trust Boundaries
+
+### Boundary 1: Operator-Trusted Configuration
+
+The following are generally treated as trusted operator or deployment inputs:
+
+- catalog properties (including `uri`, `warehouse`, `token.provider`)
+- REST Catalog server endpoint configuration
+- warehouse and storage roots
+- authentication credentials
+- Kerberos keytab paths and principal names
+  (`security.kerberos.login.keytab`, `security.kerberos.login.principal`)
+- metastore wiring (Hive Metastore URI, JDBC connection strings)
+- custom HTTP headers (`header.*`)
+
+If a report depends on the attacker controlling those values directly, it is
+usually not a vulnerability in Paimon itself.
+
+### Boundary 2: Catalog-Supplied Metadata
+
+Paimon often accepts metadata locations, table properties, database
+properties, schema definitions, and related control-plane information from a
+catalog or metastore. By default, Paimon treats those sources as trusted.
+
+This means a malicious catalog supplying incorrect or malicious metadata is
+usually not a Paimon vulnerability by itself.
+
+### Boundary 3: REST Catalog Server-Supplied Configuration and Delegated Storage Access
+
+In REST deployments, Paimon accepts the following from the REST Catalog server:
+
+- **Server configuration**: merged into client options via the `/v1/config`
+  endpoint, including catalog prefix and additional headers
+- **Data tokens**: time-limited storage credentials returned by the
+  `/v1/{prefix}/databases/{database}/tables/{table}/token` endpoint, used by
+  `RESTTokenFileIO` to access the underlying object store
+- **Auth table query responses**: row-level filters and column masking rules
+  returned by the `/v1/{prefix}/databases/{database}/tables/{table}/auth`
+  endpoint
+
+By default, these are treated as trusted control-plane inputs unless Paimon
+explicitly documents a stronger guarantee.
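As a rough illustration of the per-table endpoints listed above, the following sketch expands the two path templates for a given prefix, database, and table. `RestPathSketch` and its methods are hypothetical names invented for this note, not part of Paimon's client, and the segment check is an assumption of the sketch rather than documented Paimon behavior.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class RestPathSketch {

    // Expands /v1/{prefix}/databases/{database}/tables/{table}/token
    static String tokenPath(String prefix, String database, String table) {
        return tablePath(prefix, database, table) + "/token";
    }

    // Expands /v1/{prefix}/databases/{database}/tables/{table}/auth
    static String authPath(String prefix, String database, String table) {
        return tablePath(prefix, database, table) + "/auth";
    }

    private static String tablePath(String prefix, String database, String table) {
        return "/v1/" + segment(prefix)
                + "/databases/" + segment(database)
                + "/tables/" + segment(table);
    }

    // Assumption of this sketch: reject empty segments and embedded slashes so
    // one table's identifier cannot be made to resolve to another table's path.
    private static String segment(String value) {
        if (value == null || value.isEmpty() || value.indexOf('/') >= 0) {
            throw new IllegalArgumentException("invalid path segment: " + value);
        }
        return value;
    }
}
```

Because each path names exactly one table, a token or rule set fetched from it is naturally associated with that table, which is the scoping the secret-exposure discussion relies on.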
+
+This means a malicious REST Catalog server sending dangerous configuration or
+overly broad data tokens is usually not a Paimon vulnerability by itself. It
+also means many client-side token-selection bugs are often correctness or
+specification issues rather than security boundary failures.
+
+The major exception is **secret exposure**. If Paimon surfaces credentials or
+secrets to a new audience that was not already trusted with them, that is
+security-relevant. In particular:
+
+- Data tokens for one table leaking to operations on a different table
+- Auth state from one catalog instance leaking into another
+- Credentials appearing in logs, error messages, or serialized state
+
+### Boundary 4: Storage-Level Authorization
+
+Object store permissions (e.g., OSS, S3, HDFS ACLs) are enforced by the
+storage provider and the credentials the surrounding deployment chooses to
+hand to Paimon. Paimon is not the root authority for bucket- or object-level
+authorization.
+
+Reports that depend primarily on over-broad IAM policies or permissive
+storage ACLs are usually deployment-sensitive rather than product-security
+issues in Paimon.
+
+### Boundary 5: Engine-Level User Authorization
+
+Paimon integrations may surface data and operations through a query engine or
+application, but Paimon is not a complete user-authorization framework for
+those systems.
+
+Paimon does provide a mechanism for the REST Catalog server to supply
+row-level filters and column masking rules via `authTableQuery`, but
+enforcement of those rules is a shared responsibility between the engine
+integration and the catalog server. Paimon relays the rules; the engine
+must apply them.
+
+## In-Scope Security Vulnerabilities
+
+The following categories are generally security-relevant in Paimon when the
+report is credible and reproducible.
+
+### 1. Secret or Credential Disclosure to a New Audience
+
+Examples include:
+
+- catalog credentials exposed through a user-visible engine surface
+  (e.g., query results, EXPLAIN output, table properties)
+- one catalog's credentials or auth state leaking into another catalog or
+  session within the same process
+- data tokens for table A being used for (or exposed to) table B
+- credentials or tokens logged at INFO or lower levels without redaction
+- credentials surviving in serialized `RESTTokenFileIO` or `RESTApi` state
+  beyond their intended scope
+
+### 2. Paimon-Owned Trust-Boundary Violations
+
+Security issues exist when Paimon itself is expected to separate catalogs,
+principals, or sessions and fails to do so.
+
+Examples include:
+
+- process-global auth provider or signer state crossing catalog instances
+  (e.g., the `FILE_IO_CACHE` in `RESTTokenFileIO` returning a `FileIO`
+  belonging to a different principal)
+- a data token obtained for one table being reused for a different table's
+  data access
+- auth header state from one `RESTApi` instance leaking into another
+
+### 3. Row-Level and Column-Level Access Control Bypass
+
+If Paimon's client-side handling of `authTableQuery` responses (row filters
+or column masking rules) allows a caller to bypass filters that the server
+intended to enforce, that is security-relevant when the bypass occurs within
+Paimon-owned code rather than in the engine integration.
+
+## Usually Out of Scope or Non-Security by Default
+
+These categories may still be real bugs worth fixing, but they are not usually
+security vulnerabilities in Paimon itself.
+
+### 1. Correctness Bugs
+
+Examples:
+
+- wrong byte offsets or stale decoded values in file formats
+- incorrect merge-tree compaction producing wrong query results
+- race conditions or logic bugs that do not create a new trust-boundary
+  violation
+- snapshot or schema version conflicts that produce incorrect metadata
+
+### 2. Parser Hardening and Malformed-Input Robustness
+
+Malformed-input crashes, raw runtime exceptions from invalid JSON or Avro
+data, and memory amplification from oversized manifests or schemas are usually
+treated as robustness or hardening work rather than security issues in Paimon
+itself.
+
+### 3. Malicious Catalog, Metastore, or External Service Scenarios
+
+Reports that require a malicious catalog, metastore, REST Catalog server, or
+other external service are usually outside Paimon's primary security boundary.
+
+Examples:
+
+- a REST Catalog server returning a data token with overly broad storage
+  permissions
+- a Hive Metastore returning a table location pointing to a sensitive path
+- a REST Catalog server returning malicious row filters designed to extract
+  data through side channels
+
+### 4. Equivalent-Harm Reports
+
+If the actor already has a legitimate capability that can cause the same harm,
+the new path is usually not a security issue. This often applies to writers or
+maintainers who already control metadata layout, file layout, or destructive
+maintenance operations (snapshot expiration, orphan file cleanup, branch
+deletion).
+
+### 5. Denial of Service Through Normal Operations
+
+Resource exhaustion caused by legitimate but expensive operations (e.g., large
+compaction, scanning many partitions, listing all snapshots) is usually
+treated as an operational concern rather than a security vulnerability.
+
+## REST Catalog Specific Security Considerations
+
+### Authentication
+
+Paimon's REST Catalog client supports pluggable authentication through the
+`AuthProvider` interface.
+
+Authentication providers are created via the `AuthProviderFactory` SPI, loaded
+using Java's `ServiceLoader` mechanism based on the `token.provider`
+configuration. The authentication provider is process-level per catalog
+instance and must not share mutable state across instances.
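The "must not share mutable state across instances" requirement above can be illustrated with a minimal sketch. The types below (`SketchAuthProvider`, `SketchBearerProvider`, `SketchCatalogClient`) are hypothetical stand-ins written for this note; they do not reproduce the signatures of Paimon's `AuthProvider` or `RESTApi`. The only point shown is that credential state lives in instance fields, never in statics.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an auth provider; not Paimon's actual interface.
interface SketchAuthProvider {
    Map<String, String> authHeaders();
}

// Keeps its token in an instance field, so two providers never share state.
final class SketchBearerProvider implements SketchAuthProvider {
    private final String token; // instance-scoped on purpose, never static

    SketchBearerProvider(String token) {
        this.token = token;
    }

    @Override
    public Map<String, String> authHeaders() {
        // A fresh map per call, so callers cannot mutate shared header state.
        Map<String, String> headers = new HashMap<>();
        headers.put("Authorization", "Bearer " + token);
        return headers;
    }
}

// Hypothetical catalog client that owns exactly one provider instance.
final class SketchCatalogClient {
    private final SketchAuthProvider provider;

    SketchCatalogClient(SketchAuthProvider provider) {
        this.provider = provider;
    }

    Map<String, String> requestHeaders() {
        return provider.authHeaders();
    }
}
```

With this shape, two clients built from different tokens always produce their own headers. A static token field in the provider would be the kind of process-global auth state the threat model lists as a trust-boundary violation.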
+
+### Data Token Lifecycle
+
+When `data-token.enabled` is `true`, `RESTTokenFileIO` manages delegated
+storage credentials:
+
+1. The client calls the table token endpoint to obtain a time-limited data
+   token
+2. The token is cached and used to construct a `FileIO` instance for storage
+   access
+3. Tokens are refreshed before expiration (1 hour safe time margin by default)
+4. `FileIO` instances are cached in a process-global cache
+   (`FILE_IO_CACHE`) keyed by `RESTToken`, with a maximum size of 1000
+   entries and 10-hour expiry
+
+Security-relevant invariants:
+
+- Data tokens must be scoped to specific tables by the server
+- The `FILE_IO_CACHE` keys on the full `RESTToken` (token content +
+  expiration), so different tokens produce different `FileIO` instances
+- Token refresh creates a new `RESTApi` instance from the catalog context if
+  the original instance is unavailable (e.g., after deserialization)
+
+### Kerberos
+
+Paimon supports Kerberos authentication for Hadoop-based deployments through
+`SecurityContext` and `SecurityConfiguration`. Keytab paths and principals
+are treated as trusted operator configuration.
+
+## Scanner Calibration Rules
+
+A scanner targeting Paimon should treat a finding as higher-confidence only if
+it plausibly shows one of the following:
+
+- exposure of a secret or delegated credential to a new audience
+- creation of a new unauthorized capability in a Paimon-owned component
+- violation of a Paimon-owned trust boundary (e.g., cross-catalog credential
+  leak, cross-table data token reuse)
+
+A finding should be downgraded or rejected by default if it instead depends
+primarily on:
+
+- malformed-input robustness or denial-of-service behavior
+- a malicious catalog, metastore, REST Catalog server, or external service
+- a principal that already has equivalent power through legitimate write or
+  maintenance capabilities
+- operator misconfiguration (overly broad credentials, missing TLS, etc.)
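The calibration rules above could be encoded by a scanner roughly as follows. This is a sketch under stated assumptions: the enum values and the `ScannerCalibrationSketch` class are hypothetical names chosen for this note, and the rule that any escalating trait outweighs the downgrade traits is one reading of "instead depends primarily on", not something the threat model spells out.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical labels a scanner might attach to a finding.
enum FindingTrait {
    SECRET_EXPOSED_TO_NEW_AUDIENCE,
    NEW_UNAUTHORIZED_CAPABILITY,
    PAIMON_OWNED_BOUNDARY_VIOLATION,
    MALFORMED_INPUT_OR_DOS,
    REQUIRES_MALICIOUS_EXTERNAL_SERVICE,
    EQUIVALENT_EXISTING_POWER,
    OPERATOR_MISCONFIGURATION
}

enum TriageResult { HIGHER_CONFIDENCE, DOWNGRADE }

final class ScannerCalibrationSketch {

    // The three conditions under which a finding is treated as higher-confidence.
    private static final Set<FindingTrait> ESCALATING = EnumSet.of(
            FindingTrait.SECRET_EXPOSED_TO_NEW_AUDIENCE,
            FindingTrait.NEW_UNAUTHORIZED_CAPABILITY,
            FindingTrait.PAIMON_OWNED_BOUNDARY_VIOLATION);

    // Assumption of this sketch: one escalating trait is enough to escalate;
    // a finding with none of them is downgraded by default.
    static TriageResult triage(Set<FindingTrait> traits) {
        for (FindingTrait trait : traits) {
            if (ESCALATING.contains(trait)) {
                return TriageResult.HIGHER_CONFIDENCE;
            }
        }
        return TriageResult.DOWNGRADE;
    }
}
```

For example, a finding tagged only `MALFORMED_INPUT_OR_DOS` is downgraded, while a finding tagged `SECRET_EXPOSED_TO_NEW_AUDIENCE` is escalated even if it also involves operator misconfiguration. A real scanner would likely need a human-reviewed judgment of "plausibly shows" before assigning the escalating traits.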
diff --git a/docs/docs/project/security.md b/docs/docs/project/security.md
new file mode 100644
index 000000000000..134569d19df9
--- /dev/null
+++ b/docs/docs/project/security.md
@@ -0,0 +1,73 @@
+---
+title: "Security"
+sidebar_position: 4
+---
+
+
+
+# Security
+
+## Reporting Security Issues
+
+The Apache Paimon Project uses the standard process outlined by the
+[Apache Security Team](https://www.apache.org/security/) for reporting
+vulnerabilities.
+
+Note that vulnerabilities should not be publicly disclosed until the project
+has responded.
+
+To report a possible security vulnerability, please email
+**[security@apache.org](mailto:security@apache.org)**.
+
+## Security Model
+
+Apache Paimon is a data lake platform and a set of libraries and integrations
+used inside larger systems such as catalogs, query engines, and services.
+
+In most deployments, the primary trust and authorization boundaries are
+enforced by the surrounding catalog, engine, service, operator configuration,
+and storage-level authorization rather than by Paimon alone.
+
+Paimon security issues generally include:
+
+- Secret or credential disclosure to a newly reachable audience (e.g., bearer
+  tokens, access keys, or delegated storage tokens leaking across catalog,
+  session, or table boundaries)
+- Other cases where Paimon itself creates a new unauthorized capability
+  rather than merely reflecting the trust decisions of a catalog, engine, or
+  operator
+
+Many other issues may still be valid bugs, but are not normally considered
+security vulnerabilities in Paimon. This includes:
+
+- Robustness issues such as malformed-input crashes or memory exhaustion
+- Issues that require a malicious catalog, metastore, REST Catalog server, or
+  other external service
+- Issues that depend on operator misconfiguration (e.g., overly broad IAM
+  policies, missing TLS)
+
+Potential vulnerabilities that fall within this security model should be
+reported privately using the process above. Other bugs and hardening issues
+should be reported through the
+[public issue tracker](https://github.com/apache/paimon/issues).
+
+For a more detailed threat model used for maintainer triage and scanner
+calibration, see the
+[Apache Paimon Security Threat Model](https://github.com/apache/paimon/blob/master/SECURITY.md).
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 2f57348cd432..6f6fa15afa2c 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -301,7 +301,8 @@ const sidebars = {
       "items": [
         "project/download",
         "project/contributing",
-        "project/committer"
+        "project/committer",
+        "project/security"
       ]
     },
     {