391 changes: 391 additions & 0 deletions SECURITY.md
@@ -0,0 +1,391 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Paimon Security Threat Model

This document describes Apache Paimon's detailed security threat model for
maintainers and automated security triage.

It complements the shorter public-facing security model in
[`docs/docs/project/security.md`](docs/docs/project/security.md) (published on the project website) by making
Paimon's trust assumptions, security boundaries, and recurring non-security
bug classes more explicit.

## Purpose

Apache Paimon is a streaming data lake platform that is often deployed as a
library and integration layer inside larger systems (Flink, Spark, Hive, and
other query engines) that provide their own authentication, authorization, and
credential management. Because of that deployment model, many bug classes that
look security-relevant in the abstract are not actually security
vulnerabilities in Paimon itself.

This model is intended to answer:

- what Paimon generally treats as a security vulnerability
- what Paimon generally treats as correctness, hardening, or deployment work
- which boundaries are primarily owned by Paimon versus the surrounding
catalog, engine, or service
- which issue classes scanners should downgrade by default

## Scope

This model is scoped to the Apache Paimon project itself:

- the table format implementation (paimon-core)
- client libraries (paimon-api, paimon-common)
- the REST Catalog client and protocol (paimon-api, paimon-core)
- engine integrations (Flink, Spark, Hive connectors)
- the Python client (pypaimon)

It is not a general threat model for every deployment that embeds Paimon.

In particular, it does not attempt to define the complete security model for:

- query engines or applications that embed Paimon
- storage-level authorization enforced outside Paimon
- REST Catalog server implementations (Paimon defines the client and protocol,
not the server)

## Security Goals

Paimon should:

- avoid exposing secrets or delegated credentials to principals that were not
already trusted with them
- avoid creating new unauthorized capabilities in Paimon-owned components or
integrations
- avoid violating trust boundaries that Paimon itself owns, such as leaking
auth, signer, or credential-bearing state across catalog or session
boundaries in the same process
- avoid leaking delegated storage tokens (data tokens) across table or
principal boundaries

Paimon does not aim to be the primary enforcement point for:

- user-to-user authorization inside a query engine
- storage-level authorization (e.g., object store IAM policies)
- service-side authorization performed by a REST Catalog server
- row-level or column-level access control (Paimon relays server-provided
filters and column masking rules, but enforcement is in the server)

## Roles

### Operator

The operator deploys and configures the catalog, REST Catalog server, engine,
and storage integration around Paimon. This role is trusted to choose
endpoints, warehouses, and storage integrations, configure credentials, and
decide which users may create, read, or modify tables.

### Catalog Control Plane

The catalog control plane is responsible for resolving tables and supplying
metadata, locations, configuration, and delegated credentials to Paimon.
This role may be implemented by:

- a REST Catalog server
- a Hive Metastore
- a JDBC-backed catalog
- a filesystem-based catalog

Regardless of implementation, it should not expose secrets to unintended
principals or leak credential-bearing state across unintended boundaries.

Paimon assumes a trusted catalog or metastore, which is outside its primary
security boundary.

### REST Catalog Server

In REST deployments, part of the catalog control plane is implemented by a
server that returns metadata, configuration, delegated storage credentials
(data tokens), and query-level authorization (row filters and column masking)
to the client. This server is generally treated as a trusted control-plane
component.

The REST Catalog server is responsible for:

- authenticating clients
- authorizing catalog operations (create/drop/alter databases, tables, views,
functions)
- issuing scoped, time-limited data tokens for storage access
- providing row-level filters and column masking rules via the auth table
query API
- returning server-side configuration to merge with client configuration

### REST Catalog Client

In REST deployments, the client-side catalog (`RESTCatalog`, `RESTApi`)
consumes server-provided metadata, configuration, and credentials. Where the
client and server are meaningfully distinct, client-side bugs in token
handling, caching, or reuse may still be security-relevant. This is especially
true when the Paimon-owned client implementation leaks credential-bearing
state across catalog, session, or principal boundaries it is expected to
preserve.

The REST Catalog client is responsible for:

- sending authenticated requests using a configured `AuthProvider`
- refreshing tokens before expiration (with a configurable safe time margin)
- caching `FileIO` instances keyed by data token (via `RESTTokenFileIO`)
and evicting them when tokens expire
- not mixing data tokens or auth state across different catalog instances or
tables in the same process

### Engine or Embedding Application

Query engines (Flink, Spark, Hive, Trino, StarRocks, etc.) and applications
may expose only a subset of Paimon capabilities to users. They are responsible
for their own user-facing authorization boundaries unless Paimon explicitly
documents otherwise.

### Table Writer or Maintainer

This role may already have legitimate power to write or replace table
metadata, write or delete data files, manage snapshots, create or delete
branches and tags, and invoke destructive maintenance operations (compaction,
expiration, rollback). If a report only shows a new way to achieve the same
effect this role can already cause legitimately, it is usually not a security
issue in Paimon.

## Trust Boundaries

### Boundary 1: Operator-Trusted Configuration

The following are generally treated as trusted operator or deployment inputs:

- catalog properties (including `uri`, `warehouse`, `token.provider`)
- REST Catalog server endpoint configuration
- warehouse and storage roots
- authentication credentials
- Kerberos keytab paths and principal names
(`security.kerberos.login.keytab`, `security.kerberos.login.principal`)
- metastore wiring (Hive Metastore URI, JDBC connection strings)
- custom HTTP headers (`header.*`)

If a report depends on the attacker controlling those values directly, it is
usually not a vulnerability in Paimon itself.

### Boundary 2: Catalog-Supplied Metadata

Paimon often accepts metadata locations, table properties, database
properties, schema definitions, and related control-plane information from a
catalog or metastore. By default, Paimon treats those sources as trusted.

This means a malicious catalog supplying incorrect or malicious metadata is
usually not a Paimon vulnerability by itself.

### Boundary 3: REST Catalog Server-Supplied Configuration and Delegated Storage Access

In REST deployments, Paimon accepts the following from the REST Catalog server:

- **Server configuration**: merged into client options via the `/v1/config`
endpoint, including catalog prefix and additional headers
- **Data tokens**: time-limited storage credentials returned by the
`/v1/{prefix}/databases/{database}/tables/{table}/token` endpoint, used by
`RESTTokenFileIO` to access the underlying object store
- **Auth table query responses**: row-level filters and column masking rules
returned by the `/v1/{prefix}/databases/{database}/tables/{table}/auth`
endpoint

By default, these are treated as trusted control-plane inputs unless Paimon
explicitly documents a stronger guarantee.

This means a malicious REST Catalog server sending dangerous configuration or
overly broad data tokens is usually not a Paimon vulnerability by itself. It
also means that client-side token-selection bugs are often correctness or
specification issues rather than security boundary failures.

The major exception is **secret exposure**. If Paimon surfaces credentials or
secrets to a new audience that was not already trusted with them, that is
security-relevant. In particular:

- Data tokens for one table leaking to operations on a different table
- Auth state from one catalog instance leaking into another
- Credentials appearing in logs, error messages, or serialized state
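
The endpoint templates listed in this section can be expanded as in the
following minimal Python sketch. It is an illustration only, not Paimon's
client code: the helper names (`table_token_path`, `table_auth_path`,
`_segment`) are hypothetical, and percent-encoding every path segment is an
assumption about how a careful client might keep a table or database name from
altering the request path.

```python
from urllib.parse import quote


def _segment(value: str) -> str:
    # Percent-encode every reserved character, including "/", so that a
    # database or table name cannot add or remove path segments.
    return quote(value, safe="")


def config_path() -> str:
    # Server configuration endpoint named in this section.
    return "/v1/config"


def _table_path(prefix: str, database: str, table: str) -> str:
    return (
        f"/v1/{_segment(prefix)}"
        f"/databases/{_segment(database)}"
        f"/tables/{_segment(table)}"
    )


def table_token_path(prefix: str, database: str, table: str) -> str:
    # Data token endpoint for exactly one table.
    return _table_path(prefix, database, table) + "/token"


def table_auth_path(prefix: str, database: str, table: str) -> str:
    # Row filter and column masking endpoint for exactly one table.
    return _table_path(prefix, database, table) + "/auth"
```

Because the table name is part of the path, a token request is always made for
one specific table, which is the scope the cross-table leak examples above are
concerned with.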

### Boundary 4: Storage-Level Authorization

Object store permissions (e.g., OSS, S3, HDFS ACLs) are enforced by the
storage provider and the credentials the surrounding deployment chooses to
hand to Paimon. Paimon is not the root authority for bucket- or object-level
authorization.

Reports that depend primarily on over-broad IAM policies or permissive
storage ACLs are usually deployment-sensitive rather than product-security
issues in Paimon.

### Boundary 5: Engine-Level User Authorization

Paimon integrations may surface data and operations through a query engine or
application, but Paimon is not a complete user-authorization framework for
those systems.

Paimon does provide a mechanism for the REST Catalog server to supply
row-level filters and column masking rules via `authTableQuery`, but
enforcement of those rules is a shared responsibility between the engine
integration and the catalog server. Paimon relays the rules; the engine
must apply them.

## In-Scope Security Vulnerabilities

The following categories are generally security-relevant in Paimon when the
report is credible and reproducible.

### 1. Secret or Credential Disclosure to a New Audience

Examples include:

- catalog credentials exposed through a user-visible engine surface
(e.g., query results, EXPLAIN output, table properties)
- one catalog's credentials or auth state leaking into another catalog or
session within the same process
- data tokens for table A being used for (or exposed to) table B
- credentials or tokens written to logs at any level without redaction
- credentials surviving in serialized `RESTTokenFileIO` or `RESTApi` state
beyond their intended scope
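
As a rough illustration of the redaction these examples call for, the sketch
below masks option values before they are logged. It is not Paimon's
implementation: the function name `redact_options`, the key pattern, and the
option key `token` used in the usage note are assumptions made for the example.
Only `header.*` and `token.provider` come from this document.

```python
import re
from typing import Dict

# Assumed pattern for keys that may carry secrets. It deliberately
# over-matches (for example, it also masks "token.provider"), because
# masking a harmless value is cheaper than leaking a real one.
SENSITIVE_KEY = re.compile(
    r"(token|secret|password|credential|access[-_.]?key)", re.IGNORECASE
)
REDACTED = "******"


def redact_options(options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the options that is safe to log."""
    safe = {}
    for key, value in options.items():
        # Custom HTTP headers ("header.*") can carry authorization values,
        # so they are masked regardless of the header name.
        is_header = key.lower().startswith("header.")
        if is_header or SENSITIVE_KEY.search(key):
            safe[key] = REDACTED
        else:
            safe[key] = value
    return safe
```

For example, `redact_options({"uri": "http://localhost:8080", "token": "s3cr3t"})`
keeps the `uri` value and replaces the `token` value with the mask.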

### 2. Paimon-Owned Trust-Boundary Violations

Security issues exist when Paimon itself is expected to separate catalogs,
principals, or sessions and fails to do so.

Examples include:

- process-global auth provider or signer state crossing catalog instances
(e.g., the `FILE_IO_CACHE` in `RESTTokenFileIO` returning a `FileIO`
belonging to a different principal)
- a data token obtained for one table being reused for a different table's
data access
- auth header state from one `RESTApi` instance leaking into another

### 3. Row-Level and Column-Level Access Control Bypass

If Paimon's client-side handling of `authTableQuery` responses (row filters
or column masking rules) allows a caller to bypass filters that the server
intended to enforce, that is security-relevant when the bypass occurs within
Paimon-owned code rather than in the engine integration.

## Usually Out of Scope or Non-Security by Default

These categories may still be real bugs worth fixing, but they are not usually
security vulnerabilities in Paimon itself.

### 1. Correctness Bugs

Examples:

- wrong byte offsets or stale decoded values in file formats
- incorrect merge-tree compaction producing wrong query results
- race conditions or logic bugs that do not create a new trust-boundary
violation
- snapshot or schema version conflicts that produce incorrect metadata

### 2. Parser Hardening and Malformed-Input Robustness

Malformed-input crashes, raw runtime exceptions from invalid JSON or Avro
data, and memory amplification from oversized manifests or schemas are usually
treated as robustness or hardening work rather than security issues in Paimon
itself.

### 3. Malicious Catalog, Metastore, or External Service Scenarios

Reports that require a malicious catalog, metastore, REST Catalog server, or
other external service are usually outside Paimon's primary security boundary.

Examples:

- a REST Catalog server returning a data token with overly broad storage
permissions
- a Hive Metastore returning a table location pointing to a sensitive path
- a REST Catalog server returning malicious row filters designed to extract
data through side channels

### 4. Equivalent-Harm Reports

If the actor already has a legitimate capability that can cause the same harm,
the new path is usually not a security issue. This often applies to writers or
maintainers who already control metadata layout, file layout, or destructive
maintenance operations (snapshot expiration, orphan file cleanup, branch
deletion).

### 5. Denial of Service Through Normal Operations

Resource exhaustion caused by legitimate but expensive operations (e.g., large
compaction, scanning many partitions, listing all snapshots) is usually
treated as an operational concern rather than a security vulnerability.

## REST Catalog Specific Security Considerations

### Authentication

Paimon's REST Catalog client supports pluggable authentication through the
`AuthProvider` interface.

Authentication providers are created via the `AuthProviderFactory` SPI, loaded
using Java's `ServiceLoader` mechanism based on the `token.provider`
configuration. Each catalog instance in a process holds its own
authentication provider, and providers must not share mutable state across
instances.
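
The per-instance requirement can be pictured with a small Python sketch. It
only models the idea: the dictionary `FACTORIES` stands in for `ServiceLoader`
discovery, and the class names, the identifier `example-bearer`, and the
`token` option key are hypothetical rather than Paimon's actual names.

```python
from typing import Callable, Dict


class ExampleBearerProvider:
    """Hypothetical provider that keeps its token as instance state only."""

    def __init__(self, token: str) -> None:
        self._token = token

    def headers(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self._token}"}


# Stand-in for SPI discovery: provider identifier -> factory function.
# The registry holds factories only, never provider instances or tokens.
FACTORIES: Dict[str, Callable[[Dict[str, str]], ExampleBearerProvider]] = {
    "example-bearer": lambda options: ExampleBearerProvider(options["token"]),
}


class ExampleCatalog:
    """Hypothetical catalog that builds a fresh provider for itself."""

    def __init__(self, options: Dict[str, str]) -> None:
        factory = FACTORIES[options["token.provider"]]
        # A new provider per catalog instance, so auth state cannot cross
        # from one catalog to another through a shared object.
        self.auth = factory(options)
```

Two catalogs built with different tokens then produce different headers and
never observe each other's credentials.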

### Data Token Lifecycle

When `data-token.enabled` is `true`, `RESTTokenFileIO` manages delegated
storage credentials:

1. The client calls the table token endpoint to obtain a time-limited data
token
2. The token is cached and used to construct a `FileIO` instance for storage
access
3. Tokens are refreshed before expiration (1 hour safe time margin by default)
4. `FileIO` instances are cached in a process-global cache
(`FILE_IO_CACHE`) keyed by `RESTToken`, with a maximum size of 1000
entries and 10-hour expiry
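
Steps 1 to 3 above can be sketched as follows. This is a simplified Python
model, not the `RESTTokenFileIO` code: `DataToken`, `TableTokenHolder`, and
`needs_refresh` are hypothetical names, and the strict `<` comparison against
the margin is an assumption. Only the 1 hour default margin is taken from the
text.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

SAFE_MARGIN_MS = 60 * 60 * 1000  # 1 hour, the default margin stated above


@dataclass(frozen=True)
class DataToken:
    credentials: Dict[str, str]
    expire_at_ms: int


def needs_refresh(
    token: Optional[DataToken], now_ms: int, margin_ms: int = SAFE_MARGIN_MS
) -> bool:
    # Refresh when there is no token yet, or when the remaining lifetime
    # has dropped below the safe margin.
    if token is None:
        return True
    return token.expire_at_ms - now_ms < margin_ms


class TableTokenHolder:
    """Holds the current data token for exactly one table."""

    def __init__(
        self,
        fetch: Callable[[], DataToken],
        clock: Callable[[], int] = lambda: int(time.time() * 1000),
    ) -> None:
        self._fetch = fetch  # calls the table token endpoint
        self._clock = clock  # injectable so tests can control time
        self._token: Optional[DataToken] = None

    def current(self) -> DataToken:
        if needs_refresh(self._token, self._clock()):
            self._token = self._fetch()
        return self._token
```

Keeping one holder per table means a refreshed token only ever replaces the
token of the table it was requested for.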

Security-relevant invariants:

- Data tokens must be scoped to specific tables by the server
- The `FILE_IO_CACHE` keys on the full `RESTToken` (token content +
expiration), so different tokens produce different `FileIO` instances
- Token refresh creates a new `RESTApi` instance from the catalog context if
the original instance is unavailable (e.g., after deserialization)
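
The cache-key invariant can be illustrated with a toy cache. The Python sketch
below is not the `FILE_IO_CACHE` implementation: `TokenKey`, `ExampleFileIO`,
and `ExampleFileIOCache` are hypothetical, time-based expiry is omitted, and
evicting the oldest inserted entry is an assumed policy. Only the key shape
(token content plus expiration) and the 1000-entry limit come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TokenKey:
    # Token content as sorted (name, value) pairs so the key is hashable.
    credentials: Tuple[Tuple[str, str], ...]
    expire_at_ms: int


def make_key(credentials: Dict[str, str], expire_at_ms: int) -> TokenKey:
    return TokenKey(tuple(sorted(credentials.items())), expire_at_ms)


class ExampleFileIO:
    """Placeholder for a storage client built from one token."""

    def __init__(self, key: TokenKey) -> None:
        self.key = key


class ExampleFileIOCache:
    def __init__(self, max_size: int = 1000) -> None:
        self._max_size = max_size
        self._entries: Dict[TokenKey, ExampleFileIO] = {}

    def get(self, key: TokenKey) -> ExampleFileIO:
        io = self._entries.get(key)
        if io is None:
            if len(self._entries) >= self._max_size:
                # Dicts keep insertion order, so this drops the oldest entry.
                self._entries.pop(next(iter(self._entries)))
            io = ExampleFileIO(key)
            self._entries[key] = io
        return io
```

In this model, two lookups share an instance only when both the token content
and the expiration match. The table itself is not part of the key, which is one
reason the first invariant relies on the server to scope each token to a
specific table.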

### Kerberos

Paimon supports Kerberos authentication for Hadoop-based deployments through
`SecurityContext` and `SecurityConfiguration`. Keytab paths and principals
are treated as trusted operator configuration.

## Scanner Calibration Rules

A scanner targeting Paimon should treat a finding as higher-confidence only if
it plausibly shows one of the following:

- exposure of a secret or delegated credential to a new audience
- creation of a new unauthorized capability in a Paimon-owned component
- violation of a Paimon-owned trust boundary (e.g., cross-catalog credential
leak, cross-table data token reuse)

A finding should be downgraded or rejected by default if it instead depends
primarily on:

- malformed-input robustness or denial-of-service behavior
- a malicious catalog, metastore, REST Catalog server, or external service
- a principal that already has equivalent power through legitimate write or
maintenance capabilities
- operator misconfiguration (overly broad credentials, missing TLS, etc.)
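
The calibration rules above can be expressed as a small decision function. The
Python sketch below is one possible encoding, not a tool shipped with Paimon:
the `Finding` fields and the `triage` function are hypothetical, and the
`needs-human-review` outcome for findings that match both lists is an
assumption, since the rules do not say how to resolve that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Indicators that raise confidence.
    exposes_secret_to_new_audience: bool = False
    creates_unauthorized_capability: bool = False
    violates_paimon_owned_boundary: bool = False
    # Indicators that lower confidence by default.
    relies_on_malformed_input_or_dos: bool = False
    requires_malicious_control_plane: bool = False
    actor_has_equivalent_power: bool = False
    requires_operator_misconfiguration: bool = False


def triage(finding: Finding) -> str:
    raises = (
        finding.exposes_secret_to_new_audience
        or finding.creates_unauthorized_capability
        or finding.violates_paimon_owned_boundary
    )
    lowers = (
        finding.relies_on_malformed_input_or_dos
        or finding.requires_malicious_control_plane
        or finding.actor_has_equivalent_power
        or finding.requires_operator_misconfiguration
    )
    if raises and not lowers:
        return "higher-confidence"
    if raises and lowers:
        # Assumed outcome: mixed signals go to a maintainer.
        return "needs-human-review"
    # Without a raising indicator a finding is never higher-confidence.
    return "downgrade"
```

For example, a report that needs a malicious REST Catalog server and shows no
secret exposure is classified as `downgrade`.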