diff --git a/AGENTS.md b/AGENTS.md new file mode 120000 index 0000000000..ac55cbdc9c --- /dev/null +++ b/AGENTS.md @@ -0,0 +1 @@ +.claude/CLAUDE.md \ No newline at end of file diff --git a/docs/setup_installation/admin/airflow3.md b/docs/setup_installation/admin/airflow3.md new file mode 100644 index 0000000000..c4de368cf0 --- /dev/null +++ b/docs/setup_installation/admin/airflow3.md @@ -0,0 +1,81 @@ +# Operator Notes: Airflow 3 on Hopsworks + +Administrative reference for cluster operators upgrading or installing Hopsworks with Airflow 3. + +## What the chart deploys + +The Airflow subchart now creates four Kubernetes objects in addition to +what the v1 chart deployed: + +1. `dag-processor` Deployment — runs `airflow dag-processor`, parses + DAGs listed in the manifest. Carries only validator keys (no private + keys). +2. `keys-bootstrap` pre-install Job — generates two RSA 4096 keypairs + (api-server + scheduler) and writes them to the + `hopsworks-airflow-keys` Secret. Idempotent; re-runs are no-ops. +3. `db-reset` pre-install Job — drops and recreates the Airflow metadata database before migration. + Install-only: gated by `hopsworkslib.isInstall`, so it never re-fires on `helm upgrade` or ArgoCD resync. + Set `global._hopsworks.mode=install` on the first install only. +4. `hopsworks-airflow-keys` Secret — four PEM files (two private, two + public). Pods project only the keys they need. + +The existing `webserver` (now Airflow `api-server`) and `scheduler` +Deployments keep their resource names; only the container command and +environment changed. + +## Resource matrix + +| Component | Image | Command | Private keys mounted | +| --- | --- | --- | --- | +| `airflow-webserver` | `apache/airflow:3.0.6-python3.12` + Hopsworks layers | `airflow api-server --proxy-headers` | api-server-private | +| `airflow-scheduler` | same | `airflow scheduler` | scheduler-private | +| `airflow-dag-processor` | same | `airflow dag-processor` | none (validator-only) | + +All three pods render `/opt/airflow/airflow.cfg` from `airflow.cfg.template` at container start (the runtime MySQL password is substituted into the template before the main process execs). +Liveness probes use a TCP check on the scheduler's serve_logs port (8793) and a `/proc/1/comm` read for the dag-processor; `pgrep` is unreliable under kernel hardening such as `kernel.yama.ptrace_scope >= 1` or `hidepid`. + +## Key rotation + +Re-run the `keys-bootstrap` Job with the `--rotate` flag (or delete the +`hopsworks-airflow-keys` Secret and run a `helm upgrade`), then +rolling-restart in this order: + +```text +api-server → scheduler → dag-processor +``` + +Outstanding user cookies and outstanding Execution-API task tokens are +invalidated; the proxy re-mints on 401. Plan for ~30s of UI +unavailability during the api-server restart. + +## Metadata DB on upgrade + +The v3 chart drops and recreates the Airflow metadata database **on install only**, gated by `hopsworkslib.isInstall` (set `global._hopsworks.mode=install` on the first install). +There is no in-place 1.x → 3.x schema migration path; DAG-run history, ad-hoc Variables, ad-hoc Connections, and audit records from a 1.x deployment are not preserved by the cutover. +Schema changes after install go through the existing `migrate` job's Alembic migrations. + +Customer DAG files in HopsFS at `Projects/
/Airflow/` are untouched.
+Snapshot HopsFS before the upgrade if you need a rollback path that preserves DAG sources.
+
+## Reverse proxy contract
+
+`AirflowProxyServlet` in hopsworks-ee validates the Hopsworks JWT and forwards `Set-Cookie` from Airflow unchanged.
+The proxy does not rewrite cookie `Path=`; Airflow sets the cookie path from `[api] base_url` automatically.
+
+Membership is **not** refreshed on every forwarded request.
+The Airflow JWT carries `project_ids` / `project_roles` / `is_admin` at mint time and is stable for the cookie's TTL (1 hour by default).
+Real-time membership changes are propagated by the Hopsworks backend pushing to `POST /auth/internal/invalidate`, which evicts the affected user's cached entry so the next login re-fetches the membership.
+A 60-second safety-net TTL on the cache catches drift even without an explicit invalidation.
+See [Airflow Security Model](../../user_guides/projects/airflow/security_model.md#token--cookie-behavior) for the full description.
+
+## Metrics
+
+The legacy `airflow-exporter` 1.3.0 does not support Airflow 3. Metrics
+are now emitted via Airflow-native StatsD into a sidecar
+`statsd_exporter` Pod, scraped by Prometheus on its `/metrics`
+endpoint. The legacy `AllowMetricsSecurityManager` is gone.
+
+## See also
+
+- User-facing release notes: `user_guides/projects/airflow/airflow3_upgrade.md`
+- Security model: `user_guides/projects/airflow/security_model.md`
diff --git a/docs/user_guides/projects/airflow/airflow.md b/docs/user_guides/projects/airflow/airflow.md
index 8d2bb6f638..90a443f745 100644
--- a/docs/user_guides/projects/airflow/airflow.md
+++ b/docs/user_guides/projects/airflow/airflow.md
@@ -7,25 +7,30 @@ description: Documentation on how to orchestrate Hopsworks jobs using Apache Air
## Introduction
Hopsworks jobs can be orchestrated using [Apache Airflow](https://airflow.apache.org/).
-You can define a Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs.
+You can define an Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs.
You can then schedule the DAG to be executed at a specific schedule using a [cron](https://en.wikipedia.org/wiki/Cron) expression.
Airflow DAGs are defined as Python files.
Within the Python file, different operators can be used to trigger different actions.
-Hopsworks provides an operator to execute jobs on Hopsworks and a sensor to wait for a specific job to finish.
+Hopsworks provides operators to execute jobs on Hopsworks and sensors to wait for a specific job to finish or for a HopsFS path to appear.
+
+Hopsworks ships **Airflow 3.0.6**.
+The DAG-authoring API differs significantly from Airflow 1.x; see [the Airflow 3 upgrade page](airflow3_upgrade.md) for a porting checklist.
### Use Apache Airflow in Hopsworks
Hopsworks deployments include a deployment of Apache Airflow.
You can access it from the Hopsworks UI by clicking on the _Airflow_ button on the left menu.
-Airflow is configured to enforce Role Based Access Control (RBAC) to the Airflow DAGs.
-Admin users on Hopsworks have access to all the DAGs in the deployment.
-Regular users can access all the DAGs of the projects they are a member of.
+Authorization is per Hopsworks project:
+admins on Hopsworks (`HOPS_ADMIN`) have access to all DAGs;
+regular users see only DAGs of the projects they are a member of (DAGs, runs, logs, triggers, edits — all surfaces).
+The Audit Log is row-filtered for non-admins to events for DAGs in their projects.
+See the [security model](security_model.md) for the full surface-by-surface contract.
-!!! note "Access Control"
- Airflow does not have any knowledge of the Hopsworks project you are currently working on.
- As such, when opening the Airflow UI, you will see all the DAGs all of the projects you are a member of.
+The Hopsworks UI's Airflow page shows each DAG's most recent runs as colored squares in a **Last runs** column (green = success, red = failed, blue = running, yellow = queued / scheduled, gray = other).
+Clicking anywhere on a DAG row opens the DAG in the Airflow UI.
+The pencil at the row's end opens the generated Python file in an in-app editor.
#### Hopsworks DAG Builder
@@ -63,54 +68,67 @@ You can then create the DAG and Hopsworks will generate the Python file.
If you prefer to code the DAGs or you want to edit a DAG built with the builder tool, you can do so.
The Airflow DAGs are stored in the _Airflow_ dataset which you can access using the file browser in the project settings.
-When writing the code for the DAG you can invoke the operator as follows:
+The Hopsworks operators and sensors are shipped in the `apache-airflow-providers-hopsworks` provider package that is preinstalled on Hopsworks-managed Airflow.
+Import them from the standard provider namespace:
+
+```python
+from hopsworks.airflow.operators import HopsworksLaunchOperator # noqa: F401
+from hopsworks.airflow.sensors import ( # noqa: F401
+ HopsworksHdfsSensor,
+ HopsworksJobSuccessSensor,
+)
+```
+
+Launch a Hopsworks job:
```python
+from hopsworks.airflow.operators import HopsworksLaunchOperator
+
+
HopsworksLaunchOperator(
- dag=dag,
task_id="profiles_fg_0",
- project_name="airflow_doc",
+ project_id=42,
job_name="profiles_fg",
- job_arguments="",
+ args="",
wait_for_completion=True,
)
```
-You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`, `job_arguments`).
-You can set the `wait_for_completion` flag to `True` if you want the operator to block and wait for the job execution to be finished.
+Provide the Airflow task name (`task_id`) and the Hopsworks job information (`project_id`, `job_name`, `args`).
+Set `wait_for_completion=True` to block until the job execution finishes.
-Similarly, you can invoke the sensor as shown below.
-You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`)
+Wait for a job's most recent execution to be successful:
```python
+from hopsworks.airflow.sensors import HopsworksJobSuccessSensor
+
+
HopsworksJobSuccessSensor(
- dag=dag,
task_id="wait_for_profiles_fg",
- project_name="airflow_doc",
+ project_id=42,
job_name="profiles_fg",
)
```
-When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration.
-The `access_control` parameter specifies which projects have access to the DAG and which actions the project members can perform on it.
-If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI.
-
-!!! warning "Admin access"
- The `access_control` configuration does not apply to Hopsworks admin users which have full access to all the DAGs even if they are not member of the project.
+Wait for a HopsFS path to exist (replaces the Airflow 1.x `HopsworksHdfsSensor` plugin):
```python
- dag = DAG(
- dag_id="example_dag",
- default_args=args,
- access_control={
- "project_name": {"can_dag_read", "can_dag_edit"},
- },
- schedule_interval="0 4 * * *",
- )
+from hopsworks.airflow.sensors import HopsworksHdfsSensor
+
+
+HopsworksHdfsSensor(
+ task_id="wait_for_arrival",
+ project_id=42,
+ path="Resources/landing/2026-05-22/_SUCCESS",
+)
```
-!!! note "Project Name"
- You should replace the `project_name` in the snippet above with the name of your own project
+`project_id` can be replaced with `project_name=` if you prefer name-based lookup.
+
+!!! note "Authorization is automatic"
+ Airflow 3 on Hopsworks does not use Airflow's `access_control` parameter.
+ DAG visibility is enforced from the dag_id-to-project mapping that Hopsworks writes when the DAG is composed (see the [security model](security_model.md)); editing the DAG file cannot change ownership.
+ Hopsworks admins (`HOPS_ADMIN`) have full access to every DAG; project members access DAGs of their own projects.
#### Manage Airflow DAGs using Git
diff --git a/docs/user_guides/projects/airflow/airflow3_upgrade.md b/docs/user_guides/projects/airflow/airflow3_upgrade.md
new file mode 100644
index 0000000000..e8e38661e9
--- /dev/null
+++ b/docs/user_guides/projects/airflow/airflow3_upgrade.md
@@ -0,0 +1,137 @@
+# Airflow 3 in Hopsworks
+
+Hopsworks now ships Apache Airflow 3.0.6 as its workflow engine.
+Airflow 3 is a major release with breaking changes to the DAG authoring API; the old 1.10-era DAGs do not run on it.
+This page covers what changed, what you need to do to your DAGs, and what the new per-project security model guarantees.
+
+## Per-project DAG isolation
+
+A non-admin Hopsworks user can only see, trigger, edit, pause, clear, or read logs of DAGs that belong to a Hopsworks project they are a member of.
+The boundary is enforced on every authenticated request to the Airflow API server, every navigation in the Airflow UI, and every CLI call.
+
+The **Audit Log** is visible to every authenticated user but its rows are filtered server-side: non-admin users see only events whose `dag_id` belongs to one of their projects.
+The Hopsworks platform admin (`HOPS_ADMIN`) sees the unfiltered log.
+
+What this **does not** isolate in this release:
+
+- **Execution-time data access.** DAG tasks run in one shared scheduler process (LocalExecutor).
+ A task can in principle read any Airflow Variable, Connection, or XCom row, regardless of project.
+ Treat Airflow Variables, Connections, and Pools as cluster-wide.
+- **DAG parsing.** DAGs from all projects are parsed in one shared process.
+ Module-top-level code in a DAG file runs with the dag-processor's privileges; treat it as cluster-wide too.
+
+These are tracked for a future release that switches to KubernetesExecutor plus per-team dag-processors.
+Until then, do not put project-private secrets in Airflow Variables or Connections — the per-DAG Hopsworks API key written by Hopsworks (see [API key for operators](#api-key-for-operators-no-embed)) is the exception, written by the platform itself rather than by users.
+
+## What changed in the DAG API
+
+You **must rewrite** your existing 1.10 DAGs for Airflow 3.
+No automated rewrite tool ships with this release.
+Concrete things to change:
+
+| Old (Airflow 1.10) | New (Airflow 3.0.6) |
+| --- | --- |
+| `schedule_interval='@daily'` | `schedule='@daily'` |
+| `provide_context=True` | implicit, remove the argument |
+| `execution_date` in a template | `logical_date` |
+| `from airflow.operators.python_operator import ...` | `from airflow.operators.python import ...` |
+| `@apply_defaults` on custom operators | removed; declare `__init__` params normally |
+| `SubDagOperator` | TaskGroups + Assets |
+| `from airflow.models import BaseOperator` | `from airflow.sdk.bases.operator import BaseOperator` |
+| Custom Hopsworks operators imported via plugins | Provider package `apache-airflow-providers-hopsworks` |
+| Default `catchup_by_default = True` | Default `catchup=False`; set explicitly |
+
+The Hopsworks-provided operators are now exposed via a standard provider:
+
+```python
+from hopsworks.airflow.operators import HopsworksLaunchOperator # noqa: F401
+from hopsworks.airflow.sensors import ( # noqa: F401
+ HopsworksHdfsSensor,
+ HopsworksJobSuccessSensor,
+)
+```
+
+`HopsworksHdfsSensor` replaces the legacy `HopsworksHdfsSensor` plugin from the 1.x shim.
+It polls `/hopsworks-api/api/project/ /Airflow/` before upgrade so you can roll back DAG files; the DB itself is not recoverable through the chart.
+
+## Task logs across pod restarts
+
+Airflow 3 with the LocalExecutor writes task logs to the scheduler pod's local filesystem and serves them on port 8793 to the api-server pod.
+The scheduler records the source endpoint (host + port) on the task instance row.
+Hopsworks configures the scheduler with `[core] hostname_callable = airflow.utils.net.get_host_ip_address` so that endpoint is the pod IP (routable across pods), not the pod's DNS hostname (not resolvable from sibling pods).
+
+Logs from runs that started before the scheduler pod was last restarted are unrecoverable — the pod's filesystem is ephemeral.
+Re-trigger the DAG to regenerate task logs if you need them.
diff --git a/docs/user_guides/projects/airflow/security_model.md b/docs/user_guides/projects/airflow/security_model.md
new file mode 100644
index 0000000000..45859fbec2
--- /dev/null
+++ b/docs/user_guides/projects/airflow/security_model.md
@@ -0,0 +1,71 @@
+# Airflow Security Model
+
+Hopsworks deploys Apache Airflow 3 with a custom auth manager that enforces per-Hopsworks-project DAG visibility.
+This page is the authoritative reference for what the security boundary covers and what it does not.
+
+## What is isolated
+
+Every authenticated request to the Airflow API server, every page in the Airflow UI, every CLI call:
+
+| Surface | Behavior for a non-admin user |
+| --- | --- |
+| `GET /api/v2/dags` | Returns only DAGs in the user's projects |
+| `GET /api/v2/dags/{dag_id}` | 404 for cross-project DAGs |
+| `POST /api/v2/dags/{dag_id}/dagRuns` (trigger) | 403 for cross-project DAGs |
+| `POST /api/v2/dags/{dag_id}/pause` | 403 for cross-project |
+| `GET /api/v2/dags/.../taskInstances/.../logs` | 403 for cross-project |
+| Airflow UI menu | Connections / Variables / Pools / Config / Plugins / Providers / Cluster Activity stripped (admin-only). Audit Log is visible to all users but rows are filtered server-side to `dag_id IN (user's project dag_ids)`. |
+| `GET /api/v2/dagSources/{dag_id}` | 403 for cross-project |
+| Direct HopsFS access to `Projects/ /Airflow/` | POSIX ACLs deny cross-project read |
+
+The dag-to-project map is written by the Hopsworks backend whenever a
+DAG is composed or deleted, and stored in a table inside Airflow's own
+metadata DB. Editing a DAG file directly (e.g. changing `tags=[...]`)
+cannot move the DAG to a different project's namespace.
+
+## What is **not** isolated
+
+The shared `dag-processor` parses DAGs from all projects.
+The shared LocalExecutor runs tasks from all projects in one process tree.
+As a consequence:
+
+- Tasks can read any Variable, Connection, or Pool present in the
+ metadata DB.
+- Tasks can read XCom rows belonging to other projects' tasks.
+- DAG-author code at module top level runs in the shared parser with
+ metadata DB credentials.
+- A task can in principle read environment variables of co-running tasks
+ on the same scheduler pod.
+
+A future release switches to KubernetesExecutor with per-team
+dag-processors to close these gaps. Until then:
+
+- **Do not store project-private secrets in Airflow Variables or Connections.**
+ Use the Hopsworks-side secrets API for per-project credentials.
+ Hopsworks operators obtain a short-lived project-scoped token at task runtime via the Execution-API token exchange, so the primary auth path does not depend on Airflow Variables.
+
+- **The per-DAG `hopsworks_api_key_