Skip to content
Merged
1 change: 1 addition & 0 deletions AGENTS.md
81 changes: 81 additions & 0 deletions docs/setup_installation/admin/airflow3.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# Operator Notes: Airflow 3 on Hopsworks

Administrative reference for cluster operators upgrading or installing Hopsworks with Airflow 3.

## What the chart deploys

The Airflow subchart now creates four Kubernetes objects in addition to
what the v1 chart deployed:

1. `dag-processor` Deployment — runs `airflow dag-processor`, parses
DAGs listed in the manifest. Carries only validator keys (no private
keys).
2. `keys-bootstrap` pre-install Job — generates two RSA 4096 keypairs
(api-server + scheduler) and writes them to the
`hopsworks-airflow-keys` Secret. Idempotent; re-runs are no-ops.
3. `db-reset` pre-install Job — drops and recreates the Airflow metadata database before migration.
Install-only: gated by `hopsworkslib.isInstall`, so it never re-fires on `helm upgrade` or ArgoCD resync.
Set `global._hopsworks.mode=install` on the first install only.
4. `hopsworks-airflow-keys` Secret — four PEM files (two private, two
public). Pods project only the keys they need.

The existing `webserver` (now Airflow `api-server`) and `scheduler`
Deployments keep their resource names; only the container command and
environment changed.

## Resource matrix

| Component | Image | Command | Private keys mounted |
| --- | --- | --- | --- |
| `airflow-webserver` | `apache/airflow:3.0.6-python3.12` + Hopsworks layers | `airflow api-server --proxy-headers` | api-server-private |
| `airflow-scheduler` | same | `airflow scheduler` | scheduler-private |
| `airflow-dag-processor` | same | `airflow dag-processor` | none (validator-only) |

All three pods render `/opt/airflow/airflow.cfg` from `airflow.cfg.template` at container start (the runtime MySQL password is substituted into the template before the main process execs).
Liveness probes use a TCP check on the scheduler's serve_logs port (8793) and a `/proc/1/comm` read for the dag-processor; `pgrep` is unreliable under kernel hardening such as `kernel.yama.ptrace_scope >= 1` or `hidepid`.

## Key rotation

Re-run the `keys-bootstrap` Job with the `--rotate` flag (or delete the
`hopsworks-airflow-keys` Secret and run a `helm upgrade`), then
rolling-restart in this order:

```text
api-server → scheduler → dag-processor
```

Outstanding user cookies and outstanding Execution-API task tokens are
invalidated; the proxy re-mints on 401. Plan for ~30s of UI
unavailability during the api-server restart.

## Metadata DB on upgrade

The v3 chart drops and recreates the Airflow metadata database **on install only**, gated by `hopsworkslib.isInstall` (set `global._hopsworks.mode=install` on the first install).
There is no in-place 1.x → 3.x schema migration path; DAG-run history, ad-hoc Variables, ad-hoc Connections, and audit records from a 1.x deployment are not preserved by the cutover.
Schema changes after install go through the existing `migrate` job's Alembic migrations.

Customer DAG files in HopsFS at `Projects/<P>/Airflow/` are untouched.
Snapshot HopsFS before the upgrade if you need a rollback path that preserves DAG sources.

## Reverse proxy contract

`AirflowProxyServlet` in hopsworks-ee validates the Hopsworks JWT and forwards `Set-Cookie` from Airflow unchanged.
The proxy does not rewrite cookie `Path=`; Airflow sets the cookie path from `[api] base_url` automatically.

Membership is **not** refreshed on every forwarded request.
The Airflow JWT carries `project_ids` / `project_roles` / `is_admin` at mint time and is stable for the cookie's TTL (1 hour by default).
Real-time membership changes are propagated by the Hopsworks backend pushing to `POST /auth/internal/invalidate`, which evicts the affected user's cached entry so the next login re-fetches the membership.
A 60-second safety-net TTL on the cache catches drift even without an explicit invalidation.
See [Airflow Security Model](../../user_guides/projects/airflow/security_model.md#token--cookie-behavior) for the full description.

## Metrics

The legacy `airflow-exporter` 1.3.0 does not support Airflow 3. Metrics
are now emitted via Airflow-native StatsD into a sidecar
`statsd_exporter` Pod, scraped by Prometheus on its `/metrics`
endpoint. The legacy `AllowMetricsSecurityManager` is gone.

## See also

- User-facing release notes: `user_guides/projects/airflow/airflow3_upgrade.md`
- Security model: `user_guides/projects/airflow/security_model.md`
86 changes: 52 additions & 34 deletions docs/user_guides/projects/airflow/airflow.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,25 +7,30 @@ description: Documentation on how to orchestrate Hopsworks jobs using Apache Air
## Introduction

Hopsworks jobs can be orchestrated using [Apache Airflow](https://airflow.apache.org/).
You can define a Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs.
You can define an Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs.
You can then schedule the DAG to be executed at a specific schedule using a [cron](https://en.wikipedia.org/wiki/Cron) expression.

Airflow DAGs are defined as Python files.
Within the Python file, different operators can be used to trigger different actions.
Hopsworks provides an operator to execute jobs on Hopsworks and a sensor to wait for a specific job to finish.
Hopsworks provides operators to execute jobs on Hopsworks and sensors to wait for a specific job to finish or for a HopsFS path to appear.

Hopsworks ships **Airflow 3.0.6**.
The DAG-authoring API differs significantly from Airflow 1.x; see [the Airflow 3 upgrade page](airflow3_upgrade.md) for a porting checklist.
Comment thread
jimdowling marked this conversation as resolved.

Comment thread
jimdowling marked this conversation as resolved.
### Use Apache Airflow in Hopsworks

Hopsworks deployments include a deployment of Apache Airflow.
You can access it from the Hopsworks UI by clicking on the _Airflow_ button on the left menu.

Airflow is configured to enforce Role Based Access Control (RBAC) to the Airflow DAGs.
Admin users on Hopsworks have access to all the DAGs in the deployment.
Regular users can access all the DAGs of the projects they are a member of.
Authorization is per Hopsworks project:
admins on Hopsworks (`HOPS_ADMIN`) have access to all DAGs;
regular users see only DAGs of the projects they are a member of (DAGs, runs, logs, triggers, edits — all surfaces).
The Audit Log is row-filtered for non-admins to events for DAGs in their projects.
See the [security model](security_model.md) for the full surface-by-surface contract.
Comment thread
jimdowling marked this conversation as resolved.

Comment thread
jimdowling marked this conversation as resolved.
!!! note "Access Control"
Airflow does not have any knowledge of the Hopsworks project you are currently working on.
As such, when opening the Airflow UI, you will see all the DAGs all of the projects you are a member of.
The Hopsworks UI's Airflow page shows each DAG's most recent runs as colored squares in a **Last runs** column (green = success, red = failed, blue = running, yellow = queued / scheduled, gray = other).
Clicking anywhere on a DAG row opens the DAG in the Airflow UI.
The pencil at the row's end opens the generated Python file in an in-app editor.

#### Hopsworks DAG Builder

Expand Down Expand Up @@ -63,54 +68,67 @@ You can then create the DAG and Hopsworks will generate the Python file.
If you prefer to code the DAGs or you want to edit a DAG built with the builder tool, you can do so.
The Airflow DAGs are stored in the _Airflow_ dataset which you can access using the file browser in the project settings.

When writing the code for the DAG you can invoke the operator as follows:
The Hopsworks operators and sensors are shipped in the `apache-airflow-providers-hopsworks` provider package that is preinstalled on Hopsworks-managed Airflow.
Import them from the standard provider namespace:

```python
from hopsworks.airflow.operators import HopsworksLaunchOperator # noqa: F401
from hopsworks.airflow.sensors import ( # noqa: F401
HopsworksHdfsSensor,
HopsworksJobSuccessSensor,
)
```

Launch a Hopsworks job:

```python
from hopsworks.airflow.operators import HopsworksLaunchOperator


HopsworksLaunchOperator(
dag=dag,
task_id="profiles_fg_0",
project_name="airflow_doc",
project_id=42,
job_name="profiles_fg",
job_arguments="",
args="",
wait_for_completion=True,
)
Comment thread
jimdowling marked this conversation as resolved.
```

You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`, `job_arguments`).
You can set the `wait_for_completion` flag to `True` if you want the operator to block and wait for the job execution to be finished.
Provide the Airflow task name (`task_id`) and the Hopsworks job information (`project_id`, `job_name`, `args`).
Set `wait_for_completion=True` to block until the job execution finishes.

Similarly, you can invoke the sensor as shown below.
You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`)
Wait for a job's most recent execution to be successful:

```python
from hopsworks.airflow.sensors import HopsworksJobSuccessSensor


HopsworksJobSuccessSensor(
dag=dag,
task_id="wait_for_profiles_fg",
project_name="airflow_doc",
project_id=42,
job_name="profiles_fg",
)
Comment thread
jimdowling marked this conversation as resolved.
```

When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration.
The `access_control` parameter specifies which projects have access to the DAG and which actions the project members can perform on it.
If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI.

!!! warning "Admin access"
The `access_control` configuration does not apply to Hopsworks admin users which have full access to all the DAGs even if they are not member of the project.
Wait for a HopsFS path to exist (replaces the Airflow 1.x `HopsworksHdfsSensor` plugin):

```python
dag = DAG(
dag_id="example_dag",
default_args=args,
access_control={
"project_name": {"can_dag_read", "can_dag_edit"},
},
schedule_interval="0 4 * * *",
)
from hopsworks.airflow.sensors import HopsworksHdfsSensor


HopsworksHdfsSensor(
task_id="wait_for_arrival",
project_id=42,
path="Resources/landing/2026-05-22/_SUCCESS",
)
Comment thread
jimdowling marked this conversation as resolved.
```

!!! note "Project Name"
You should replace the `project_name` in the snippet above with the name of your own project
`project_id` can be replaced with `project_name=` if you prefer name-based lookup.

!!! note "Authorization is automatic"
Airflow 3 on Hopsworks does not use Airflow's `access_control` parameter.
DAG visibility is enforced from the dag_id-to-project mapping that Hopsworks writes when the DAG is composed (see the [security model](security_model.md)); editing the DAG file cannot change ownership.
Comment thread
jimdowling marked this conversation as resolved.
Hopsworks admins (`HOPS_ADMIN`) have full access to every DAG; project members access DAGs of their own projects.

#### Manage Airflow DAGs using Git

Expand Down
137 changes: 137 additions & 0 deletions docs/user_guides/projects/airflow/airflow3_upgrade.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# Airflow 3 in Hopsworks

Hopsworks now ships Apache Airflow 3.0.6 as its workflow engine.
Airflow 3 is a major release with breaking changes to the DAG authoring API; the old 1.10-era DAGs do not run on it.
This page covers what changed, what you need to do to your DAGs, and what the new per-project security model guarantees.

## Per-project DAG isolation

A non-admin Hopsworks user can only see, trigger, edit, pause, clear, or read logs of DAGs that belong to a Hopsworks project they are a member of.
The boundary is enforced on every authenticated request to the Airflow API server, every navigation in the Airflow UI, and every CLI call.

The **Audit Log** is visible to every authenticated user but its rows are filtered server-side: non-admin users see only events whose `dag_id` belongs to one of their projects.
The Hopsworks platform admin (`HOPS_ADMIN`) sees the unfiltered log.

What this **does not** isolate in this release:

- **Execution-time data access.** DAG tasks run in one shared scheduler process (LocalExecutor).
A task can in principle read any Airflow Variable, Connection, or XCom row, regardless of project.
Treat Airflow Variables, Connections, and Pools as cluster-wide.
- **DAG parsing.** DAGs from all projects are parsed in one shared process.
Module-top-level code in a DAG file runs with the dag-processor's privileges; treat it as cluster-wide too.

These are tracked for a future release that switches to KubernetesExecutor plus per-team dag-processors.
Until then, do not put project-private secrets in Airflow Variables or Connections — the per-DAG Hopsworks API key written by Hopsworks (see [API key for operators](#api-key-for-operators-no-embed)) is the exception, written by the platform itself rather than by users.

Comment thread
jimdowling marked this conversation as resolved.
## What changed in the DAG API

You **must rewrite** your existing 1.10 DAGs for Airflow 3.
No automated rewrite tool ships with this release.
Concrete things to change:

| Old (Airflow 1.10) | New (Airflow 3.0.6) |
| --- | --- |
| `schedule_interval='@daily'` | `schedule='@daily'` |
| `provide_context=True` | implicit, remove the argument |
| `execution_date` in a template | `logical_date` |
| `from airflow.operators.python_operator import ...` | `from airflow.operators.python import ...` |
| `@apply_defaults` on custom operators | removed; declare `__init__` params normally |
| `SubDagOperator` | TaskGroups + Assets |
| `from airflow.models import BaseOperator` | `from airflow.sdk.bases.operator import BaseOperator` |
| Custom Hopsworks operators imported via plugins | Provider package `apache-airflow-providers-hopsworks` |
| Default `catchup_by_default = True` | Default `catchup=False`; set explicitly |

The Hopsworks-provided operators are now exposed via a standard provider:

```python
from hopsworks.airflow.operators import HopsworksLaunchOperator # noqa: F401
from hopsworks.airflow.sensors import ( # noqa: F401
HopsworksHdfsSensor,
HopsworksJobSuccessSensor,
)
```

`HopsworksHdfsSensor` replaces the legacy `HopsworksHdfsSensor` plugin from the 1.x shim.
It polls `/hopsworks-api/api/project/<id>/dataset/<path>?action=stat` and accepts either `project_id` or `project_name`.

## API key for operators (no embed)

Hopsworks operators and sensors authenticate via the `HopsworksHook`, which resolves a credential in this order:

1. **Task token exchange** — the scheduler signs a per-task RS256 token; the hook POSTs it to `/api/auth/airflow-task-exchange/exchange` on Hopsworks and gets a project-scoped JWT back.
2. **Per-DAG Airflow Variable** — Hopsworks writes a per-DAG API key into an Airflow Variable named `hopsworks_api_key_<sha256(dag_id)[:16]>` (Fernet-encrypted at rest) on every DAG compose.
The hook reads it at task runtime via `Variable.get(...)` and uses it as `Authorization: ApiKey <key>`.
3. **Airflow Connection** `hopsworks_default` — `conn.password` is read as a literal API key.
Useful for out-of-cluster operators.
4. **`HOPSWORKS_API_KEY` env var** — manual override for power users.

The generated DAG file **never carries the API key** — the secret lives only in the Airflow Variables table (admin-only via `HopsworksAuthManager`), so the DAG `.py` is safe to inspect, version-control, or share.
Re-generate the DAG from the Hopsworks UI to rotate the key.

DAG files composed before this change still embed `os.environ.setdefault("HOPSWORKS_API_KEY", "<key>")` near the top of the file.
They continue to work because the env-var path is the fourth tier in the hook's fallback, but the secret is in the file.
Regenerate the DAG from the Hopsworks UI to drop the embed and switch to the Variable-fetch path.

## DAG identity

The Airflow `dag_id` for a Hopsworks-composed DAG is now:

```text
p_<project_slug>_<project_id>__<dag_user_name>
```

For example, project `acme` (id `42`) with a DAG named `daily_ingest`
becomes `p_acme_42__daily_ingest` in the Airflow UI. The Hopsworks UI
hides the prefix when displaying DAG names.

If you reference your DAGs by `dag_id` from external code (XCom pulls
across DAGs, `TriggerDagRunOperator`, REST API integrations), update
those references to the new format.

## REST authentication

The Airflow API server is reached through the standard Hopsworks reverse
proxy at `https://<hopsworks>/hopsworks-api/airflow/`. Browsers carry an
HttpOnly `_token` cookie set by the auth manager; external clients use
bearer tokens.

To obtain a bearer token from a Hopsworks JWT:

```bash
curl -X POST "https://<hopsworks>/hopsworks-api/airflow/auth/token" \
-H "Content-Type: application/json" \
-d '{"hopsworks_jwt": "<your-hopsworks-jwt>"}'
```

Response:

```json
{"access_token": "<airflow-jwt>", "token_type": "Bearer", "expires_in": 3600}
```

Use the returned `access_token` in `Authorization: Bearer` on all
`/api/v2/*` calls.

## Recent runs in the Hopsworks UI

The Hopsworks Airflow page lists each DAG with a **Last runs** column that renders the most recent runs as colored squares (green = success, red = failed, blue = running, yellow = queued / scheduled, gray = other).
Hover any square for the run id, state, and start time.
The data is read from a project-scoped Hopsworks endpoint that proxies to the auth manager and walks the same `dag_run` table the Airflow UI does, so the two views stay consistent.

Clicking anywhere on a DAG row opens the DAG in the Airflow UI (in a new tab).
The pencil column at the row's end opens the generated Python file in a Hopsworks editor without leaving Hopsworks.

## Metadata DB on upgrade

Upgrading from Airflow 1.10 to Airflow 3 drops and recreates the Airflow metadata DB.
Historical DAG-run records, task logs in the DB, and any ad-hoc Variables / Connections you had configured are not preserved.
Snapshot HopsFS `Projects/<P>/Airflow/` before upgrade so you can roll back DAG files; the DB itself is not recoverable through the chart.

## Task logs across pod restarts

Airflow 3 with the LocalExecutor writes task logs to the scheduler pod's local filesystem and serves them on port 8793 to the api-server pod.
The scheduler records the source endpoint (host + port) on the task instance row.
Hopsworks configures the scheduler with `[core] hostname_callable = airflow.utils.net.get_host_ip_address` so that endpoint is the pod IP (routable across pods), not the pod's DNS hostname (not resolvable from sibling pods).

Logs from runs that started before the scheduler pod was last restarted are unrecoverable — the pod's filesystem is ephemeral.
Re-trigger the DAG to regenerate task logs if you need them.
Loading
Loading