diff --git a/AGENTS.md b/AGENTS.md new file mode 120000 index 0000000000..ac55cbdc9c --- /dev/null +++ b/AGENTS.md @@ -0,0 +1 @@ +.claude/CLAUDE.md \ No newline at end of file diff --git a/docs/setup_installation/admin/airflow3.md b/docs/setup_installation/admin/airflow3.md new file mode 100644 index 0000000000..c4de368cf0 --- /dev/null +++ b/docs/setup_installation/admin/airflow3.md @@ -0,0 +1,81 @@ +# Operator Notes: Airflow 3 on Hopsworks + +Administrative reference for cluster operators upgrading or installing Hopsworks with Airflow 3. + +## What the chart deploys + +The Airflow subchart now creates four Kubernetes objects in addition to +what the v1 chart deployed: + +1. `dag-processor` Deployment — runs `airflow dag-processor`, parses + DAGs listed in the manifest. Carries only validator keys (no private + keys). +2. `keys-bootstrap` pre-install Job — generates two RSA 4096 keypairs + (api-server + scheduler) and writes them to the + `hopsworks-airflow-keys` Secret. Idempotent; re-runs are no-ops. +3. `db-reset` pre-install Job — drops and recreates the Airflow metadata database before migration. + Install-only: gated by `hopsworkslib.isInstall`, so it never re-fires on `helm upgrade` or ArgoCD resync. + Set `global._hopsworks.mode=install` on the first install only. +4. `hopsworks-airflow-keys` Secret — four PEM files (two private, two + public). Pods project only the keys they need. + +The existing `webserver` (now Airflow `api-server`) and `scheduler` +Deployments keep their resource names; only the container command and +environment changed. + +## Resource matrix + +| Component | Image | Command | Private keys mounted | +| --- | --- | --- | --- | +| `airflow-webserver` | `apache/airflow:3.0.6-python3.12` + Hopsworks layers | `airflow api-server --proxy-headers` | api-server-private | +| `airflow-scheduler` | same | `airflow scheduler` | scheduler-private | +| `airflow-dag-processor` | same | `airflow dag-processor` | none (validator-only) | + +All three pods render `/opt/airflow/airflow.cfg` from `airflow.cfg.template` at container start (the runtime MySQL password is substituted into the template before the main process execs). +Liveness probes use a TCP check on the scheduler's serve_logs port (8793) and a `/proc/1/comm` read for the dag-processor; `pgrep` is unreliable under kernel hardening such as `kernel.yama.ptrace_scope >= 1` or `hidepid`. + +## Key rotation + +Re-run the `keys-bootstrap` Job with the `--rotate` flag (or delete the +`hopsworks-airflow-keys` Secret and run a `helm upgrade`), then +rolling-restart in this order: + +```text +api-server → scheduler → dag-processor +``` + +Outstanding user cookies and outstanding Execution-API task tokens are +invalidated; the proxy re-mints on 401. Plan for ~30s of UI +unavailability during the api-server restart. + +## Metadata DB on upgrade + +The v3 chart drops and recreates the Airflow metadata database **on install only**, gated by `hopsworkslib.isInstall` (set `global._hopsworks.mode=install` on the first install). +There is no in-place 1.x → 3.x schema migration path; DAG-run history, ad-hoc Variables, ad-hoc Connections, and audit records from a 1.x deployment are not preserved by the cutover. +Schema changes after install go through the existing `migrate` job's Alembic migrations. + +Customer DAG files in HopsFS at `Projects/

/Airflow/` are untouched. +Snapshot HopsFS before the upgrade if you need a rollback path that preserves DAG sources. + +## Reverse proxy contract + +`AirflowProxyServlet` in hopsworks-ee validates the Hopsworks JWT and forwards `Set-Cookie` from Airflow unchanged. +The proxy does not rewrite cookie `Path=`; Airflow sets the cookie path from `[api] base_url` automatically. + +Membership is **not** refreshed on every forwarded request. +The Airflow JWT carries `project_ids` / `project_roles` / `is_admin` at mint time and is stable for the cookie's TTL (1 hour by default). +Real-time membership changes are propagated by the Hopsworks backend pushing to `POST /auth/internal/invalidate`, which evicts the affected user's cached entry so the next login re-fetches the membership. +A 60-second safety-net TTL on the cache catches drift even without an explicit invalidation. +See [Airflow Security Model](../../user_guides/projects/airflow/security_model.md#token--cookie-behavior) for the full description. + +## Metrics + +The legacy `airflow-exporter` 1.3.0 does not support Airflow 3. Metrics +are now emitted via Airflow-native StatsD into a sidecar +`statsd_exporter` Pod, scraped by Prometheus on its `/metrics` +endpoint. The legacy `AllowMetricsSecurityManager` is gone. + +## See also + +- User-facing release notes: `user_guides/projects/airflow/airflow3_upgrade.md` +- Security model: `user_guides/projects/airflow/security_model.md` diff --git a/docs/user_guides/projects/airflow/airflow.md b/docs/user_guides/projects/airflow/airflow.md index 8d2bb6f638..90a443f745 100644 --- a/docs/user_guides/projects/airflow/airflow.md +++ b/docs/user_guides/projects/airflow/airflow.md @@ -7,25 +7,30 @@ description: Documentation on how to orchestrate Hopsworks jobs using Apache Air ## Introduction Hopsworks jobs can be orchestrated using [Apache Airflow](https://airflow.apache.org/). -You can define a Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs. +You can define an Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs. You can then schedule the DAG to be executed at a specific schedule using a [cron](https://en.wikipedia.org/wiki/Cron) expression. Airflow DAGs are defined as Python files. Within the Python file, different operators can be used to trigger different actions. -Hopsworks provides an operator to execute jobs on Hopsworks and a sensor to wait for a specific job to finish. +Hopsworks provides operators to execute jobs on Hopsworks and sensors to wait for a specific job to finish or for a HopsFS path to appear. + +Hopsworks ships **Airflow 3.0.6**. +The DAG-authoring API differs significantly from Airflow 1.x; see [the Airflow 3 upgrade page](airflow3_upgrade.md) for a porting checklist. ### Use Apache Airflow in Hopsworks Hopsworks deployments include a deployment of Apache Airflow. You can access it from the Hopsworks UI by clicking on the _Airflow_ button on the left menu. -Airflow is configured to enforce Role Based Access Control (RBAC) to the Airflow DAGs. -Admin users on Hopsworks have access to all the DAGs in the deployment. -Regular users can access all the DAGs of the projects they are a member of. +Authorization is per Hopsworks project: +admins on Hopsworks (`HOPS_ADMIN`) have access to all DAGs; +regular users see only DAGs of the projects they are a member of (DAGs, runs, logs, triggers, edits — all surfaces). +The Audit Log is row-filtered for non-admins to events for DAGs in their projects. +See the [security model](security_model.md) for the full surface-by-surface contract. -!!! note "Access Control" - Airflow does not have any knowledge of the Hopsworks project you are currently working on. - As such, when opening the Airflow UI, you will see all the DAGs all of the projects you are a member of. +The Hopsworks UI's Airflow page shows each DAG's most recent runs as colored squares in a **Last runs** column (green = success, red = failed, blue = running, yellow = queued / scheduled, gray = other). +Clicking anywhere on a DAG row opens the DAG in the Airflow UI. +The pencil at the row's end opens the generated Python file in an in-app editor. #### Hopsworks DAG Builder @@ -63,54 +68,67 @@ You can then create the DAG and Hopsworks will generate the Python file. If you prefer to code the DAGs or you want to edit a DAG built with the builder tool, you can do so. The Airflow DAGs are stored in the _Airflow_ dataset which you can access using the file browser in the project settings. -When writing the code for the DAG you can invoke the operator as follows: +The Hopsworks operators and sensors are shipped in the `apache-airflow-providers-hopsworks` provider package that is preinstalled on Hopsworks-managed Airflow. +Import them from the standard provider namespace: + +```python +from hopsworks.airflow.operators import HopsworksLaunchOperator # noqa: F401 +from hopsworks.airflow.sensors import ( # noqa: F401 + HopsworksHdfsSensor, + HopsworksJobSuccessSensor, +) +``` + +Launch a Hopsworks job: ```python +from hopsworks.airflow.operators import HopsworksLaunchOperator + + HopsworksLaunchOperator( - dag=dag, task_id="profiles_fg_0", - project_name="airflow_doc", + project_id=42, job_name="profiles_fg", - job_arguments="", + args="", wait_for_completion=True, ) ``` -You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`, `job_arguments`). -You can set the `wait_for_completion` flag to `True` if you want the operator to block and wait for the job execution to be finished. +Provide the Airflow task name (`task_id`) and the Hopsworks job information (`project_id`, `job_name`, `args`). +Set `wait_for_completion=True` to block until the job execution finishes. -Similarly, you can invoke the sensor as shown below. -You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`) +Wait for a job's most recent execution to be successful: ```python +from hopsworks.airflow.sensors import HopsworksJobSuccessSensor + + HopsworksJobSuccessSensor( - dag=dag, task_id="wait_for_profiles_fg", - project_name="airflow_doc", + project_id=42, job_name="profiles_fg", ) ``` -When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration. -The `access_control` parameter specifies which projects have access to the DAG and which actions the project members can perform on it. -If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI. - -!!! warning "Admin access" - The `access_control` configuration does not apply to Hopsworks admin users which have full access to all the DAGs even if they are not member of the project. +Wait for a HopsFS path to exist (replaces the Airflow 1.x `HopsworksHdfsSensor` plugin): ```python - dag = DAG( - dag_id="example_dag", - default_args=args, - access_control={ - "project_name": {"can_dag_read", "can_dag_edit"}, - }, - schedule_interval="0 4 * * *", - ) +from hopsworks.airflow.sensors import HopsworksHdfsSensor + + +HopsworksHdfsSensor( + task_id="wait_for_arrival", + project_id=42, + path="Resources/landing/2026-05-22/_SUCCESS", +) ``` -!!! note "Project Name" - You should replace the `project_name` in the snippet above with the name of your own project +`project_id` can be replaced with `project_name=` if you prefer name-based lookup. + +!!! note "Authorization is automatic" + Airflow 3 on Hopsworks does not use Airflow's `access_control` parameter. + DAG visibility is enforced from the dag_id-to-project mapping that Hopsworks writes when the DAG is composed (see the [security model](security_model.md)); editing the DAG file cannot change ownership. + Hopsworks admins (`HOPS_ADMIN`) have full access to every DAG; project members access DAGs of their own projects. #### Manage Airflow DAGs using Git diff --git a/docs/user_guides/projects/airflow/airflow3_upgrade.md b/docs/user_guides/projects/airflow/airflow3_upgrade.md new file mode 100644 index 0000000000..e8e38661e9 --- /dev/null +++ b/docs/user_guides/projects/airflow/airflow3_upgrade.md @@ -0,0 +1,137 @@ +# Airflow 3 in Hopsworks + +Hopsworks now ships Apache Airflow 3.0.6 as its workflow engine. +Airflow 3 is a major release with breaking changes to the DAG authoring API; the old 1.10-era DAGs do not run on it. +This page covers what changed, what you need to do to your DAGs, and what the new per-project security model guarantees. + +## Per-project DAG isolation + +A non-admin Hopsworks user can only see, trigger, edit, pause, clear, or read logs of DAGs that belong to a Hopsworks project they are a member of. +The boundary is enforced on every authenticated request to the Airflow API server, every navigation in the Airflow UI, and every CLI call. + +The **Audit Log** is visible to every authenticated user but its rows are filtered server-side: non-admin users see only events whose `dag_id` belongs to one of their projects. +The Hopsworks platform admin (`HOPS_ADMIN`) sees the unfiltered log. + +What this **does not** isolate in this release: + +- **Execution-time data access.** DAG tasks run in one shared scheduler process (LocalExecutor). + A task can in principle read any Airflow Variable, Connection, or XCom row, regardless of project. + Treat Airflow Variables, Connections, and Pools as cluster-wide. +- **DAG parsing.** DAGs from all projects are parsed in one shared process. + Module-top-level code in a DAG file runs with the dag-processor's privileges; treat it as cluster-wide too. + +These are tracked for a future release that switches to KubernetesExecutor plus per-team dag-processors. +Until then, do not put project-private secrets in Airflow Variables or Connections — the per-DAG Hopsworks API key written by Hopsworks (see [API key for operators](#api-key-for-operators-no-embed)) is the exception, written by the platform itself rather than by users. + +## What changed in the DAG API + +You **must rewrite** your existing 1.10 DAGs for Airflow 3. +No automated rewrite tool ships with this release. +Concrete things to change: + +| Old (Airflow 1.10) | New (Airflow 3.0.6) | +| --- | --- | +| `schedule_interval='@daily'` | `schedule='@daily'` | +| `provide_context=True` | implicit, remove the argument | +| `execution_date` in a template | `logical_date` | +| `from airflow.operators.python_operator import ...` | `from airflow.operators.python import ...` | +| `@apply_defaults` on custom operators | removed; declare `__init__` params normally | +| `SubDagOperator` | TaskGroups + Assets | +| `from airflow.models import BaseOperator` | `from airflow.sdk.bases.operator import BaseOperator` | +| Custom Hopsworks operators imported via plugins | Provider package `apache-airflow-providers-hopsworks` | +| Default `catchup_by_default = True` | Default `catchup=False`; set explicitly | + +The Hopsworks-provided operators are now exposed via a standard provider: + +```python +from hopsworks.airflow.operators import HopsworksLaunchOperator # noqa: F401 +from hopsworks.airflow.sensors import ( # noqa: F401 + HopsworksHdfsSensor, + HopsworksJobSuccessSensor, +) +``` + +`HopsworksHdfsSensor` replaces the legacy `HopsworksHdfsSensor` plugin from the 1.x shim. +It polls `/hopsworks-api/api/project//dataset/?action=stat` and accepts either `project_id` or `project_name`. + +## API key for operators (no embed) + +Hopsworks operators and sensors authenticate via the `HopsworksHook`, which resolves a credential in this order: + +1. **Task token exchange** — the scheduler signs a per-task RS256 token; the hook POSTs it to `/api/auth/airflow-task-exchange/exchange` on Hopsworks and gets a project-scoped JWT back. +2. **Per-DAG Airflow Variable** — Hopsworks writes a per-DAG API key into an Airflow Variable named `hopsworks_api_key_` (Fernet-encrypted at rest) on every DAG compose. + The hook reads it at task runtime via `Variable.get(...)` and uses it as `Authorization: ApiKey `. +3. **Airflow Connection** `hopsworks_default` — `conn.password` is read as a literal API key. + Useful for out-of-cluster operators. +4. **`HOPSWORKS_API_KEY` env var** — manual override for power users. + +The generated DAG file **never carries the API key** — the secret lives only in the Airflow Variables table (admin-only via `HopsworksAuthManager`), so the DAG `.py` is safe to inspect, version-control, or share. +Re-generate the DAG from the Hopsworks UI to rotate the key. + +DAG files composed before this change still embed `os.environ.setdefault("HOPSWORKS_API_KEY", "")` near the top of the file. +They continue to work because the env-var path is the fourth tier in the hook's fallback, but the secret is in the file. +Regenerate the DAG from the Hopsworks UI to drop the embed and switch to the Variable-fetch path. + +## DAG identity + +The Airflow `dag_id` for a Hopsworks-composed DAG is now: + +```text +p____ +``` + +For example, project `acme` (id `42`) with a DAG named `daily_ingest` +becomes `p_acme_42__daily_ingest` in the Airflow UI. The Hopsworks UI +hides the prefix when displaying DAG names. + +If you reference your DAGs by `dag_id` from external code (XCom pulls +across DAGs, `TriggerDagRunOperator`, REST API integrations), update +those references to the new format. + +## REST authentication + +The Airflow API server is reached through the standard Hopsworks reverse +proxy at `https:///hopsworks-api/airflow/`. Browsers carry an +HttpOnly `_token` cookie set by the auth manager; external clients use +bearer tokens. + +To obtain a bearer token from a Hopsworks JWT: + +```bash +curl -X POST "https:///hopsworks-api/airflow/auth/token" \ + -H "Content-Type: application/json" \ + -d '{"hopsworks_jwt": ""}' +``` + +Response: + +```json +{"access_token": "", "token_type": "Bearer", "expires_in": 3600} +``` + +Use the returned `access_token` in `Authorization: Bearer` on all +`/api/v2/*` calls. + +## Recent runs in the Hopsworks UI + +The Hopsworks Airflow page lists each DAG with a **Last runs** column that renders the most recent runs as colored squares (green = success, red = failed, blue = running, yellow = queued / scheduled, gray = other). +Hover any square for the run id, state, and start time. +The data is read from a project-scoped Hopsworks endpoint that proxies to the auth manager and walks the same `dag_run` table the Airflow UI does, so the two views stay consistent. + +Clicking anywhere on a DAG row opens the DAG in the Airflow UI (in a new tab). +The pencil column at the row's end opens the generated Python file in a Hopsworks editor without leaving Hopsworks. + +## Metadata DB on upgrade + +Upgrading from Airflow 1.10 to Airflow 3 drops and recreates the Airflow metadata DB. +Historical DAG-run records, task logs in the DB, and any ad-hoc Variables / Connections you had configured are not preserved. +Snapshot HopsFS `Projects/

/Airflow/` before upgrade so you can roll back DAG files; the DB itself is not recoverable through the chart. + +## Task logs across pod restarts + +Airflow 3 with the LocalExecutor writes task logs to the scheduler pod's local filesystem and serves them on port 8793 to the api-server pod. +The scheduler records the source endpoint (host + port) on the task instance row. +Hopsworks configures the scheduler with `[core] hostname_callable = airflow.utils.net.get_host_ip_address` so that endpoint is the pod IP (routable across pods), not the pod's DNS hostname (not resolvable from sibling pods). + +Logs from runs that started before the scheduler pod was last restarted are unrecoverable — the pod's filesystem is ephemeral. +Re-trigger the DAG to regenerate task logs if you need them. diff --git a/docs/user_guides/projects/airflow/security_model.md b/docs/user_guides/projects/airflow/security_model.md new file mode 100644 index 0000000000..45859fbec2 --- /dev/null +++ b/docs/user_guides/projects/airflow/security_model.md @@ -0,0 +1,71 @@ +# Airflow Security Model + +Hopsworks deploys Apache Airflow 3 with a custom auth manager that enforces per-Hopsworks-project DAG visibility. +This page is the authoritative reference for what the security boundary covers and what it does not. + +## What is isolated + +Every authenticated request to the Airflow API server, every page in the Airflow UI, every CLI call: + +| Surface | Behavior for a non-admin user | +| --- | --- | +| `GET /api/v2/dags` | Returns only DAGs in the user's projects | +| `GET /api/v2/dags/{dag_id}` | 404 for cross-project DAGs | +| `POST /api/v2/dags/{dag_id}/dagRuns` (trigger) | 403 for cross-project DAGs | +| `POST /api/v2/dags/{dag_id}/pause` | 403 for cross-project | +| `GET /api/v2/dags/.../taskInstances/.../logs` | 403 for cross-project | +| Airflow UI menu | Connections / Variables / Pools / Config / Plugins / Providers / Cluster Activity stripped (admin-only). Audit Log is visible to all users but rows are filtered server-side to `dag_id IN (user's project dag_ids)`. | +| `GET /api/v2/dagSources/{dag_id}` | 403 for cross-project | +| Direct HopsFS access to `Projects/

/Airflow/` | POSIX ACLs deny cross-project read | + +The dag-to-project map is written by the Hopsworks backend whenever a +DAG is composed or deleted, and stored in a table inside Airflow's own +metadata DB. Editing a DAG file directly (e.g. changing `tags=[...]`) +cannot move the DAG to a different project's namespace. + +## What is **not** isolated + +The shared `dag-processor` parses DAGs from all projects. +The shared LocalExecutor runs tasks from all projects in one process tree. +As a consequence: + +- Tasks can read any Variable, Connection, or Pool present in the + metadata DB. +- Tasks can read XCom rows belonging to other projects' tasks. +- DAG-author code at module top level runs in the shared parser with + metadata DB credentials. +- A task can in principle read environment variables of co-running tasks + on the same scheduler pod. + +A future release switches to KubernetesExecutor with per-team +dag-processors to close these gaps. Until then: + +- **Do not store project-private secrets in Airflow Variables or Connections.** + Use the Hopsworks-side secrets API for per-project credentials. + Hopsworks operators obtain a short-lived project-scoped token at task runtime via the Execution-API token exchange, so the primary auth path does not depend on Airflow Variables. + +- **The per-DAG `hopsworks_api_key_` Variable is a fallback only, and is not per-DAG isolated at runtime.** + Hopsworks writes it during compose so the task-token-exchange path has a working credential to fall back to if the task-instance JWT is unreachable for any reason. + The Variables UI is admin-only via `HopsworksAuthManager`, but inside a running task `Variable.get("hopsworks_api_key_")` is not blocked — `dag_id` is non-secret and the hash is reproducible, so DAG code that calls `Variable.get` for another DAG's hashed name can read that key. + Until per-task secret isolation lands with KubernetesExecutor, treat the fallback path as a shared credential surface within the cluster: don't run untrusted DAG code, and prefer the task-token-exchange path which is signed per-task and not stored in the metadata DB. +- **Do not assume DAG code is sandboxed at parse time.** Module-top-level + code in a customer DAG can execute network calls and reach anything + the dag-processor's ServiceAccount can reach. + +## Token + cookie behavior + +The auth manager sets a `_token` cookie on UI logins. +The cookie is `HttpOnly`, `Secure` (under TLS), `SameSite=Lax`, and scoped to `/hopsworks-api/airflow`. +The Hopsworks proxy passes it through unchanged; cookie path scoping is set by Airflow itself from the `[api] base_url` config. + +The same auth manager mints bearer tokens for external clients via `POST /hopsworks-api/airflow/auth/token` (the auth manager's `/auth` router is mounted under the Airflow `[api] base_url`, which the Hopsworks chart pins to `/hopsworks-api/airflow/`). +The bearer token format and validation are identical to the cookie-borne token; only the carrier differs. + +The Airflow JWT carries the user's `project_ids`, `project_roles`, and `is_admin` flag at mint time. +Membership is **not** refreshed per request — Airflow 3.0.x does not expose a synchronous refresh hook on `BaseAuthManager` — so a user's authorization is stable for the cookie's TTL (1 hour by default). + +## Project membership changes + +When a Hopsworks project's membership changes (add member, remove member, role change), the Hopsworks backend immediately invalidates the affected user's entry in the Airflow auth manager's cache via `/auth/internal/invalidate`. +The next login by that user re-fetches their project list and reflects the new membership. +The cache also has a 60-second safety-net TTL even without an explicit invalidation. diff --git a/mkdocs.yml b/mkdocs.yml index f8dc7bf80f..222c040a80 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -179,7 +179,10 @@ nav: - Base: user_guides/projects/scheduling/kube_scheduler.md - Kueue: user_guides/projects/scheduling/kueue_details.md - - Airflow: user_guides/projects/airflow/airflow.md + - Airflow: + - Overview: user_guides/projects/airflow/airflow.md + - Airflow 3 upgrade: user_guides/projects/airflow/airflow3_upgrade.md + - Security model: user_guides/projects/airflow/security_model.md - OpenSearch: - Connect: user_guides/projects/opensearch/connect.md - KNN: user_guides/projects/opensearch/knn.md @@ -258,6 +261,7 @@ nav: - Configure Alerts: setup_installation/admin/alert.md - IAM Role Chaining: setup_installation/admin/roleChaining.md - Configure Project Mapping: setup_installation/admin/configure-project-mapping.md + - Airflow 3 operator notes: setup_installation/admin/airflow3.md - Monitoring: - Services Dashboards: setup_installation/admin/monitoring/grafana.md - Export metrics: setup_installation/admin/monitoring/export-metrics.md