
## Restore

Hopsworks supports two restore modes:

- **New cluster restore**: Install a fresh cluster and restore data from a backup during installation.
- **In-place restore**: Restore data onto an existing running cluster via `helm upgrade`.


!!! Note
    Use the exact Hopsworks version that was used to create the backup.

### New Cluster Restore

The new cluster restore process has two phases:

- Restore Kubernetes objects required for the cluster restore.
- Install the cluster with Helm using the correct backup IDs.

#### Restore Kubernetes objects

Restore the Kubernetes objects that were backed up using Velero.

```bash

# Restores the latest backup - if a specific backup is needed, set backupName instead
echo "=== Creating Velero Restore object for k8s-backups-main ==="
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-main
  namespace: velero
spec:
  scheduleName: k8s-backups-main
EOF

echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-main -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
  echo "Still waiting..."; sleep 5;
done

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-users-resources
  namespace: velero
spec:
  scheduleName: k8s-backups-users-resources
EOF

echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-users-resources -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
  echo "Still waiting..."; sleep 5;
done
```
```bash
kubectl get configmap opensearch-backups-metadata -n hopsworks -o json \
  | jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
  | sort -nr
```

#### Restore on Cluster installation

To restore a cluster during installation, configure the backup ID in the values YAML file:

```yaml
global:
  _hopsworks:
    restoreFromBackup:
      backupId: "254811200"
```

##### Customizations

!!! Warning
    Even if you override the backup IDs for RonDB and Opensearch, you must still set `.global._hopsworks.restoreFromBackup.backupId` to ensure HopsFS is restored.
```yaml
olk:
  payload:
    indices: "-myindex"
```

### In-Place Restore

!!! Note
    In-place restore is available from Hopsworks version 4.8.0.

In-place restore restores data onto an existing running cluster using `helm upgrade`. Unlike a new cluster restore, it does not require provisioning a fresh cluster: the existing stateful services are shut down, wiped if necessary, and restored from the backup.

!!! Warning
    In-place restore **replaces all existing data** in the cluster with the backup data. Any data written after the backup was taken will be lost.

!!! Info
    After a fresh install from backup (new cluster restore), in-place restores can only use backups taken **after** that fresh install, because the cluster certificates are regenerated during installation. To restore from a backup taken **before** the fresh install, perform another new cluster restore from that backup instead of an in-place restore.

#### In-place restore prerequisites

- A running Hopsworks cluster deployed via Helm.
- A previously created backup with a known backup ID.
- Object storage configured and accessible with the backup data.
- Velero installed and configured as described in the [prerequisites](#prerequisites).

#### Identify the backup ID

Get the backup ID from the **Cluster Settings > Backup** tab or by using the following commands.

```bash
# RonDB backup IDs (newest first)
kubectl get configmap rondb-backups-metadata -n hopsworks -o json \
  | jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
  | sort -nr

# Opensearch backup IDs (newest first)
kubectl get configmap opensearch-backups-metadata -n hopsworks -o json \
  | jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
  | sort -nr

# Velero backup IDs for the main schedule (newest first)
kubectl get backups -n velero -o json \
  | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl" and .metadata.labels["velero.io/schedule-name"] == "k8s-backups-main" and .status.phase == "Completed")] | sort_by(.status.completionTimestamp) | reverse[] | .metadata.name'

# Velero backup IDs for the users schedule (newest first)
kubectl get backups -n velero -o json \
  | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl" and .metadata.labels["velero.io/schedule-name"] == "k8s-backups-users-resources" and .status.phase == "Completed")] | sort_by(.status.completionTimestamp) | reverse[] | .metadata.name'
```
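The RonDB and Opensearch backup IDs are numeric, so the `sort -nr` in the commands above lists them newest first. A small self-contained illustration (the IDs below are made up for the example):

```shell
# Numeric reverse sort: the largest (newest) backup ID comes out first.
# These IDs are fabricated for the example.
printf '%s\n' 254811200 254897600 254724800 | sort -nr
```

Piping the result through `head -n 1` yields the most recent successful backup ID.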

#### Run the in-place restore

Configure the restore in the values file and run `helm upgrade`:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
      inPlace: true
      forceDataClear: true

# Optional: specify Velero backup IDs. If not set, the latest completed backup is used.
hopsworks:
  velero:
    restore:
      mainScheduleBackupId: "k8s-backups-main-20260213T153627Z"
      usersScheduleBackupId: "k8s-backups-users-resources-20260213T153627Z"
```

Then run:

```bash
helm upgrade hopsworks hopsworks/hopsworks --version <CHART_VERSION> \
  --namespace hopsworks \
  -f values.yaml \
  --timeout 1200s
```

You can also pass the restore flags directly on the command line:

```bash
helm upgrade hopsworks hopsworks/hopsworks --version <CHART_VERSION> \
  --namespace hopsworks \
  --set-string global._hopsworks.restoreFromBackup.backupId="254811200" \
  --set global._hopsworks.restoreFromBackup.inPlace=true \
  --set global._hopsworks.restoreFromBackup.forceDataClear=true \
  --set-string hopsworks.velero.restore.mainScheduleBackupId="k8s-backups-main-20260213T153627Z" \
  --set-string hopsworks.velero.restore.usersScheduleBackupId="k8s-backups-users-resources-20260213T153627Z" \
  --timeout 1200s
```

The required flags are:

| Parameter | Description |
| --------- | ----------- |
| `global._hopsworks.restoreFromBackup.backupId` | The backup ID to restore from. |
| `global._hopsworks.restoreFromBackup.inPlace` | Must be `true` to enable in-place restore mode. |
| `global._hopsworks.restoreFromBackup.forceDataClear` | Must be `true` to confirm that existing data will be replaced. This is a safety mechanism to prevent accidental data loss. |

The following flags are optional. If not set, the latest available Velero backup will be used:

| Parameter | Description |
| --------- | ----------- |
| `hopsworks.velero.restore.mainScheduleBackupId` | The Velero backup ID for the main schedule (`k8s-backups-main`). |
| `hopsworks.velero.restore.usersScheduleBackupId` | The Velero backup ID for the users schedule (`k8s-backups-users-resources`). |

!!! Important
    After a successful restore, remove the `restoreFromBackup` blocks from your values file and run `helm upgrade` to apply the change.
    If left in place, these blocks can cause subsequent upgrades to fail or behave unexpectedly.
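For reference, a minimal values fragment after this cleanup might look like the following, assuming the configuration shown earlier (only the restore-related keys are removed; everything else stays):

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    # restoreFromBackup was removed after the successful restore
```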

#### Re-running an in-place restore

In-place restore creates marker resources to prevent accidental re-runs. If you need to run the restore again with the same backup ID, delete the marker resources first:

```bash
# Delete the HopsFS restore job
kubectl delete job hopsfs-inplace-restore-<BACKUP_ID> -n hopsworks --ignore-not-found=true

# Delete the RonDB restore jobs
kubectl delete job restore-native-backup-<BACKUP_ID> -n hopsworks --ignore-not-found=true
kubectl delete job setup-mysqld-dont-remove-<BACKUP_ID> -n hopsworks --ignore-not-found=true

# Delete the Opensearch restore job
kubectl delete job opensearch-restore-default-default-<BACKUP_ID> -n hopsworks --ignore-not-found=true

# Delete the Velero restore objects; use the exact restore object names
kubectl delete restore.velero.io k8s-backups-main -n velero --ignore-not-found=true
kubectl delete restore.velero.io k8s-backups-users-resources -n velero --ignore-not-found=true
```
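When scripting this cleanup, the marker names can be derived from the backup ID. The sketch below only *prints* the delete commands so they can be reviewed before running; the name patterns follow the commands above, and `print_restore_cleanup` is a hypothetical helper, not part of the Hopsworks tooling:

```shell
# Hypothetical helper: print (not run) the cleanup commands for one backup ID.
print_restore_cleanup() {
  backup_id="$1"
  # Job names follow the patterns used by the in-place restore jobs above.
  for job in \
    "hopsfs-inplace-restore-${backup_id}" \
    "restore-native-backup-${backup_id}" \
    "setup-mysqld-dont-remove-${backup_id}" \
    "opensearch-restore-default-default-${backup_id}"; do
    echo "kubectl delete job ${job} -n hopsworks --ignore-not-found=true"
  done
  # The Velero restore objects are named after the schedules.
  for restore in k8s-backups-main k8s-backups-users-resources; do
    echo "kubectl delete restore.velero.io ${restore} -n velero --ignore-not-found=true"
  done
}

print_restore_cleanup "254811200"
```

Once reviewed, the printed commands can be piped to `sh` to execute them.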

#### In-place restore customizations

The same customization options for [RonDB and Opensearch](#customizations) backup IDs apply to in-place restore. You can override individual service backup IDs while keeping the global backup ID for HopsFS.
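For example, a values sketch that keeps the global backup ID for HopsFS while excluding an index from the Opensearch restore, reusing the `olk` payload override from the new-cluster customizations (the key paths are assumed to be identical for in-place restore):

```yaml
global:
  _hopsworks:
    restoreFromBackup:
      backupId: "254811200"
      inPlace: true
      forceDataClear: true

olk:
  payload:
    indices: "-myindex"
```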