apache · JingsongLi · May 30, 2026
diff --git a/SECURITY.md b/SECURITY.md
@@ -0,0 +1,391 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Apache Paimon Security Threat Model
+
+This document describes Apache Paimon's detailed security threat model for
+maintainers and automated security triage.
+
+It complements the shorter public-facing security model in
+[`docs/docs/project/security.md`](docs/docs/project/security.md) (published at the project website) by making
+Paimon's trust assumptions, security boundaries, and recurring non-security
+bug classes more explicit.
+
+## Purpose
+
+Apache Paimon is a streaming data lake platform that is often deployed as a
+library and integration layer inside larger systems (Flink, Spark, Hive, and
+other query engines) that provide their own authentication, authorization, and
+credential management. Because of that deployment model, many bug classes that
+look security-relevant in the abstract are not actually security
+vulnerabilities in Paimon itself.
+
+This model is intended to answer:
+
+- what Paimon generally treats as a security vulnerability
+- what Paimon generally treats as correctness, hardening, or deployment work
+- which boundaries are primarily owned by Paimon versus the surrounding
+  catalog, engine, or service
+- which issue classes should be downgraded by default by scanners
+
+## Scope
+
+This model is scoped to the Apache Paimon project itself:
+
+- the table format implementation (paimon-core)
+- client libraries (paimon-api, paimon-common)
+- the REST Catalog client and protocol (paimon-api, paimon-core)
+- engine integrations (Flink, Spark, Hive connectors)
+- the Python client (pypaimon)
+
+It is not a general threat model for every deployment that embeds Paimon.
+
+In particular, it does not attempt to define the complete security model for:
+
+- query engines or applications that embed Paimon
+- storage-level authorization enforced outside Paimon
+- REST Catalog server implementations (Paimon defines the client and protocol,
+  not the server)
+
+## Security Goals
+
+Paimon should:
+
+- avoid exposing secrets or delegated credentials to principals that were not
+  already trusted with them
+- avoid creating new unauthorized capabilities in Paimon-owned components or
+  integrations
+- avoid violating trust boundaries that Paimon itself owns, such as leaking
+  auth, signer, or credential-bearing state across catalog or session
+  boundaries in the same process
+- avoid leaking delegated storage tokens (data tokens) across table or
+  principal boundaries
+
+Paimon does not aim to be the primary enforcement point for:
+
+- user-to-user authorization inside a query engine
+- storage-level authorization (e.g., object store IAM policies)
+- service-side authorization performed by a REST Catalog server
+- row-level or column-level access control (Paimon relays server-provided
+  filters and column masking rules, but enforcement is in the server)
+
+## Roles
+
+### Operator
+
+The operator deploys and configures the catalog, REST Catalog server, engine,
+and storage integration around Paimon. This role is trusted to choose
+endpoints, warehouses, and storage integrations, configure credentials, and 
+decide which users may create, read, or modify tables.
+
+### Catalog Control Plane
+
+The catalog control plane is responsible for resolving tables and supplying
+metadata, locations, configuration, and delegated credentials to Paimon.
+This role may be implemented by:
+
+- a REST Catalog server
+- a Hive Metastore
+- a JDBC-backed catalog
+- a filesystem-based catalog
+
+Regardless of implementation, it should not expose secrets to unintended
+principals or leak credential-bearing state across unintended boundaries.
+
+Paimon assumes a trusted catalog or metastore, which is outside its primary
+security boundary.
+
+### REST Catalog Server
+
+In REST deployments, part of the catalog control plane is implemented by a
+server that returns metadata, configuration, delegated storage credentials
+(data tokens), and query-level authorization (row filters and column masking)
+to the client. This server is generally treated as a trusted control-plane
+component.
+
+The REST Catalog server is responsible for:
+
+- authenticating clients
+- authorizing catalog operations (create/drop/alter databases, tables, views,
+  functions)
+- issuing scoped, time-limited data tokens for storage access
+- providing row-level filters and column masking rules via the auth table
+  query API
+- returning server-side configuration to merge with client configuration
+
+### REST Catalog Client
+
+In REST deployments, the client-side catalog (`RESTCatalog`, `RESTApi`)
+consumes server-provided metadata, configuration, and credentials. Where the
+client and server are meaningfully distinct, client-side bugs in token
+handling, caching, or reuse may still be security-relevant. This is especially
+true when the Paimon-owned client implementation leaks credential-bearing
+state across catalog, session, or principal boundaries it is expected to
+preserve.
+
+The REST Catalog client is responsible for:
+
+- sending authenticated requests using a configured `AuthProvider`
+- refreshing tokens before expiration (with a configurable safe time margin)
+- caching `FileIO` instances keyed by data token (via `RESTTokenFileIO`)
+  and evicting them when tokens expire
+- not mixing data tokens or auth state across different catalog instances or
+  tables in the same process
+
+### Engine or Embedding Application
+
+Query engines (Flink, Spark, Hive, Trino, StarRocks, etc.) and applications
+may expose only a subset of Paimon capabilities to users. They are responsible
+for their own user-facing authorization boundaries unless Paimon explicitly
+documents otherwise.
+
+### Table Writer or Maintainer
+
+This role may already have legitimate power to write or replace table
+metadata, write or delete data files, manage snapshots, create or delete
+branches and tags, and invoke destructive maintenance operations (compaction,
+expiration, rollback). If a report only shows a new way to achieve the same
+effect this role can already cause legitimately, it is usually not a security
+issue in Paimon.
+
+## Trust Boundaries
+
+### Boundary 1: Operator-Trusted Configuration
+
+The following are generally treated as trusted operator or deployment inputs:
+
+- catalog properties (including `uri`, `warehouse`, `token.provider`)
+- REST Catalog server endpoint configuration
+- warehouse and storage roots
+- authentication credentials
+- Kerberos keytab paths and principal names
+  (`security.kerberos.login.keytab`, `security.kerberos.login.principal`)
+- metastore wiring (Hive Metastore URI, JDBC connection strings)
+- custom HTTP headers (`header.*`)
+
+If a report depends on the attacker controlling those values directly, it is
+usually not a vulnerability in Paimon itself.
+
+### Boundary 2: Catalog-Supplied Metadata
+
+Paimon often accepts metadata locations, table properties, database
+properties, schema definitions, and related control-plane information from a
+catalog or metastore. By default, Paimon treats those sources as trusted.
+
+This means a malicious catalog supplying incorrect or malicious metadata is
+usually not a Paimon vulnerability by itself.
+
+### Boundary 3: REST Catalog Server-Supplied Configuration and Delegated Storage Access
+
+In REST deployments, Paimon accepts the following from the REST Catalog server:
+
+- **Server configuration**: merged into client options via the `/v1/config`
+  endpoint, including catalog prefix and additional headers
+- **Data tokens**: time-limited storage credentials returned by the
+  `/v1/{prefix}/databases/{database}/tables/{table}/token` endpoint, used by
+  `RESTTokenFileIO` to access the underlying object store
+- **Auth table query responses**: row-level filters and column masking rules
+  returned by the `/v1/{prefix}/databases/{database}/tables/{table}/auth`
+  endpoint
+
+By default, these are treated as trusted control-plane inputs unless Paimon
+explicitly documents a stronger guarantee.
+
+This means a malicious REST Catalog server sending dangerous configuration or
+overly broad data tokens is usually not a Paimon vulnerability by itself. It
+also means many client-side token-selection bugs are often correctness or
+specification issues rather than security boundary failures.
+
+The major exception is **secret exposure**. If Paimon surfaces credentials or
+secrets to a new audience that was not already trusted with them, that is
+security-relevant. In particular:
+
+- Data tokens for one table leaking to operations on a different table
+- Auth state from one catalog instance leaking into another
+- Credentials appearing in logs, error messages, or serialized state
+
+### Boundary 4: Storage-Level Authorization
+
+Object store permissions (e.g., OSS, S3, HDFS ACLs) are enforced by the
+storage provider and the credentials the surrounding deployment chooses to
+hand to Paimon. Paimon is not the root authority for bucket- or object-level
+authorization.
+
+Reports that depend primarily on over-broad IAM policies or permissive
+storage ACLs are usually deployment-sensitive rather than product-security
+issues in Paimon.
+
+### Boundary 5: Engine-Level User Authorization
+
+Paimon integrations may surface data and operations through a query engine or
+application, but Paimon is not a complete user-authorization framework for
+those systems.
+
+Paimon does provide a mechanism for the REST Catalog server to supply
+row-level filters and column masking rules via `authTableQuery`, but
+enforcement of those rules is a shared responsibility between the engine
+integration and the catalog server. Paimon relays the rules; the engine
+must apply them.
+
+## In-Scope Security Vulnerabilities
+
+The following categories are generally security-relevant in Paimon when the
+report is credible and reproducible.
+
+### 1. Secret or Credential Disclosure to a New Audience
+
+Examples include:
+
+- catalog credentials exposed through a user-visible engine surface 
+  (e.g., query results, EXPLAIN output, table properties)
+- one catalog's credentials or auth state leaking into another catalog or
+  session within the same process
+- data tokens for table A being used for (or exposed to) table B
+- credentials or tokens logged at INFO or lower levels without redaction
+- credentials surviving in serialized `RESTTokenFileIO` or `RESTApi` state
+  beyond their intended scope
+
+### 2. Paimon-Owned Trust-Boundary Violations
+
+Security issues exist when Paimon itself is expected to separate catalogs,
+principals, or sessions and fails to do so.
+
+Examples include:
+
+- process-global auth provider or signer state crossing catalog instances
+  (e.g., the `FILE_IO_CACHE` in `RESTTokenFileIO` returning a `FileIO`
+  belonging to a different principal)
+- a data token obtained for one table being reused for a different table's
+  data access
+- auth header state from one `RESTApi` instance leaking into another
+
+### 3. Row-Level and Column-Level Access Control Bypass
+
+If Paimon's client-side handling of `authTableQuery` responses (row filters
+or column masking rules) allows a caller to bypass filters that the server
+intended to enforce, that is security-relevant when the bypass occurs within
+Paimon-owned code rather than in the engine integration.
+
+## Usually Out of Scope or Non-Security by Default
+
+These categories may still be real bugs worth fixing, but they are not usually
+security vulnerabilities in Paimon itself.
+
+### 1. Correctness Bugs
+
+Examples:
+
+- wrong byte offsets or stale decoded values in file formats
+- incorrect merge-tree compaction producing wrong query results
+- race conditions or logic bugs that do not create a new trust-boundary
+  violation
+- snapshot or schema version conflicts that produce incorrect metadata
+
+### 2. Parser Hardening and Malformed-Input Robustness
+
+Malformed-input crashes, raw runtime exceptions from invalid JSON or Avro
+data, and memory amplification from oversized manifests or schemas are usually
+treated as robustness or hardening work rather than security issues in Paimon
+itself.
+
+### 3. Malicious Catalog, Metastore, or External Service Scenarios
+
+Reports that require a malicious catalog, metastore, REST Catalog server, or
+other external service are usually outside Paimon's primary security boundary.
+
+Examples:
+
+- a REST Catalog server returning a data token with overly broad storage
+  permissions
+- a Hive Metastore returning a table location pointing to a sensitive path
+- a REST Catalog server returning malicious row filters designed to extract
+  data through side channels
+
+### 4. Equivalent-Harm Reports
+
+If the actor already has a legitimate capability that can cause the same harm,
+the new path is usually not a security issue. This often applies to writers or
+maintainers who already control metadata layout, file layout, or destructive
+maintenance operations (snapshot expiration, orphan file cleanup, branch
+deletion).
+
+### 5. Denial of Service Through Normal Operations
+
+Resource exhaustion caused by legitimate but expensive operations (e.g., large
+compaction, scanning many partitions, listing all snapshots) is usually
+treated as an operational concern rather than a security vulnerability.
+
+## REST Catalog Specific Security Considerations
+
+### Authentication
+
+Paimon's REST Catalog client supports pluggable authentication through the
+`AuthProvider` interface.
+
+Authentication providers are created via the `AuthProviderFactory` SPI, loaded
+using Java's `ServiceLoader` mechanism based on the `token.provider`
+configuration. The authentication provider is process-level per catalog
+instance and must not share mutable state across instances.
+
+### Data Token Lifecycle
+
+When `data-token.enabled` is `true`, `RESTTokenFileIO` manages delegated
+storage credentials:
+
+1. The client calls the table token endpoint to obtain a time-limited data
+   token
+2. The token is cached and used to construct a `FileIO` instance for storage
+   access
+3. Tokens are refreshed before expiration (1 hour safe time margin by default)
+4. `FileIO` instances are cached in a process-global cache
+   (`FILE_IO_CACHE`) keyed by `RESTToken`, with a maximum size of 1000
+   entries and 10-hour expiry
+
+Security-relevant invariants:
+
+- Data tokens must be scoped to specific tables by the server
+- The `FILE_IO_CACHE` keys on the full `RESTToken` (token content +
+  expiration), so different tokens produce different `FileIO` instances
+- Token refresh creates a new `RESTApi` instance from the catalog context if
+  the original instance is unavailable (e.g., after deserialization)
+
+### Kerberos
+
+Paimon supports Kerberos authentication for Hadoop-based deployments through
+`SecurityContext` and `SecurityConfiguration`. Keytab paths and principals
+are treated as trusted operator configuration.
+
+## Scanner Calibration Rules
+
+A scanner targeting Paimon should treat a finding as higher-confidence only if
+it plausibly shows one of the following:
+
+- exposure of a secret or delegated credential to a new audience
+- creation of a new unauthorized capability in a Paimon-owned component
+- violation of a Paimon-owned trust boundary (e.g., cross-catalog credential
+  leak, cross-table data token reuse)
+
+A finding should be downgraded or rejected by default if it instead depends
+primarily on:
+
+- malformed-input robustness or denial-of-service behavior
+- a malicious catalog, metastore, REST Catalog server, or external service
+- a principal that already has equivalent power through legitimate write or
+  maintenance capabilities
+- operator misconfiguration (overly broad credentials, missing TLS, etc.)