protegeproject/webprotege-user-migration-tool

WebProtege User Migration: MongoDB to Keycloak

A Python tool for migrating user accounts from WebProtege's MongoDB database to Keycloak identity management. Designed to handle large-scale migrations (170K+ users) with configurable filtering, resume capability, and detailed logging.

Background

WebProtege stores user accounts in MongoDB (webprotege.Users). As part of moving authentication to Keycloak, all user accounts need to be migrated. Because the MongoDB passwords are salted MD5 hashes that are incompatible with Keycloak, they are not carried over: migrated users set a new password through the "Forgot password?" flow the first time they log in.

What gets migrated

| MongoDB Field | Keycloak Field | Notes |
|---|---|---|
| emailAddress | username and email | Passed through as-is; validated by filters before migration |
| _id | attributes.webprotege_username | Original WebProtégé username; preserves project ownership links |
| (not migrated) | firstName, lastName | Left blank; realName splitting was unreliable for many name formats |
| (not migrated) | requiredActions | Empty; migrated users use "Forgot password?" to set their initial password |
| (added) | attributes.mongo_migrated | Tags migrated users for auditing |

Passwords, salt values, and password digests are not migrated.

The _id and emailAddress fields are passed through without modification. If they contain characters Keycloak rejects, the filters catch them pre-migration and log the user as skipped.
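The field mapping above can be sketched as a small transformation function. This is a minimal illustration, not the tool's actual code: the function name `to_keycloak_user` and the attribute value `"true"` for mongo_migrated are assumptions, and the only Keycloak detail relied on is that user attributes are stored as lists of strings.

```python
from typing import Any


def to_keycloak_user(doc: dict[str, Any], enabled: bool = True) -> dict[str, Any]:
    """Build a Keycloak user payload from one webprotege.Users document.

    Hypothetical helper: the name and the "true" marker value are assumptions.
    """
    email = doc["emailAddress"]
    return {
        "username": email,      # the email doubles as the login name
        "email": email,
        "enabled": enabled,     # False when a disable filter matched
        "firstName": "",        # left blank; realName is not split
        "lastName": "",
        "requiredActions": [],  # empty; users go through "Forgot password?"
        "attributes": {
            # Keycloak stores each attribute value as a list of strings
            "webprotege_username": [doc["_id"]],
            "mongo_migrated": ["true"],
        },
    }
```

Note that password, salt, and digest fields are simply never read, so they cannot leak into the payload.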

Requirements

  • Python 3.12+
  • Access to the WebProtege MongoDB instance
  • A running Keycloak instance with the webprotege realm created
  • Keycloak admin credentials

Installation

cd webprotege-user-migration-tool
pip install -r requirements.txt

Configuration

Environment variables

Sensitive values are read from environment variables. Copy the example file and fill in your values:

cp .env.example .env
# Edit .env with your actual credentials

Required variables (see .env.example):

| Variable | Description |
|---|---|
| MONGODB_URI | MongoDB connection string |
| KEYCLOAK_ADMIN_USERNAME | Keycloak admin username |
| KEYCLOAK_ADMIN_PASSWORD | Keycloak admin password |

config.yaml

Non-sensitive settings are in config.yaml. Sensitive values use ${ENV_VAR} placeholders that are resolved from environment variables at runtime:

mongodb:
  uri: "${MONGODB_URI}"
  database: "webprotege"
  collection: "Users"
  batch_size: 500

keycloak:
  base_url: "http://webprotege-local.edu/keycloak"
  realm: "webprotege"
  admin_realm: "master"
  client_id: "admin-cli"
  username: "${KEYCLOAK_ADMIN_USERNAME}"
  password: "${KEYCLOAK_ADMIN_PASSWORD}"
  request_delay: 0.1    # seconds between API calls
  max_retries: 3
  retry_backoff: 2.0

Edit the Keycloak URL and non-sensitive settings directly in config.yaml.

Usage

Dry run (recommended first step)

Processes all users through filters and transformation without touching Keycloak. Produces full statistics and CSV logs of which users would be skipped and why.

python migrate.py --dry-run

Live migration

Before running a live migration, make sure you have a Keycloak H2 database backup so you can revert if needed. See Keycloak H2 Database Backup and Restore.

python migrate.py

Resume after interruption

Progress is saved after every batch. If the script is interrupted (Ctrl+C, crash, etc.), simply re-run it and it picks up where it left off:

python migrate.py

Reset and start fresh

python migrate.py --reset

Custom config file

python migrate.py --config /path/to/my-config.yaml

Filter System

The migration includes a configurable filter engine that determines which users to migrate and how. Filters are defined in config.yaml and come in three types:

  • Exclude filters — user is not migrated if any enabled exclude filter matches (OR logic)
  • Disable filters — user is migrated but disabled in Keycloak if any enabled disable filter matches (OR logic)
  • Include filters — user must pass all enabled include filters to be migrated (AND logic)

Evaluation order: exclude → disable → include (most restrictive wins).
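The evaluation order can be sketched as follows. This is one possible reading, not the tool's engine: the `func` key holding the filter callable, the `evaluate` name, and the convention that an include filter returns True when the user passes are all assumptions.

```python
def evaluate(doc: dict, filters: list[dict]) -> tuple[str, str]:
    """Return (decision, reason); decision is "skip", "disable", or "migrate".

    Each filter entry is assumed to look like
    {"name": ..., "type": ..., "enabled": ..., "func": callable, "params": {...}}.
    """
    active = [f for f in filters if f.get("enabled", False)]

    def run(kind: str):
        # Yield (name, flag, reason) for every active filter of one type.
        for f in active:
            if f["type"] == kind:
                flag, reason = f["func"](doc, f.get("params", {}))
                yield f["name"], flag, reason

    # 1. Exclude: any match skips the user (OR logic).
    for name, matched, reason in run("exclude"):
        if matched:
            return ("skip", f"{name}: {reason}")

    # 2. Disable: remember the first match, but keep checking include filters.
    disable_reason = ""
    for name, matched, reason in run("disable"):
        if matched:
            disable_reason = f"{name}: {reason}"
            break

    # 3. Include: every filter must pass (AND logic).
    for name, passed, reason in run("include"):
        if not passed:
            return ("skip", f"{name}: {reason}")

    if disable_reason:
        return ("disable", disable_reason)
    return ("migrate", "")
```

Under this reading, "most restrictive wins" means a user who matches a disable filter but fails an include filter is skipped rather than migrated as disabled.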

Built-in exclude filters (enabled by default)

| Filter | Description |
|---|---|
| xss_injection | Excludes users with HTML/script injection patterns in username, realName, or email |
| invalid_email | Excludes users with empty or structurally invalid email addresses (bad characters, consecutive dots, non-ASCII, etc.) |
| username_too_long | Excludes usernames exceeding a configurable max length (default: 255) |
| regex_username_blocklist | Excludes usernames matching blocklist patterns (e.g., blank/whitespace-only) |
| duplicate_email | When multiple users share an email, keeps the first alphabetically and skips the rest |
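Unlike the other filters, duplicate_email needs state that spans documents. The sketch below shows one way this could work, assuming documents arrive sorted by _id so that the first user seen for an email is also the first alphabetically. The factory name `make_duplicate_email_filter` and the case-insensitive comparison are assumptions, not confirmed behavior of the built-in filter.

```python
def make_duplicate_email_filter():
    """Return an exclude filter that remembers which emails it has seen.

    Assumes the caller feeds documents in ascending _id order.
    """
    seen: set[str] = set()

    def duplicate_email(doc: dict, params: dict) -> tuple[bool, str]:
        # Normalizing case is an assumption made for this sketch.
        email = (doc.get("emailAddress") or "").strip().lower()
        if not email:
            return (False, "")  # empty emails are left to invalid_email
        if email in seen:
            return (True, f"Duplicate email: {email}")
        seen.add(email)
        return (False, "")

    return duplicate_email
```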

Disable filters

Disable filters are for users you consider real but inactive. They are migrated to Keycloak with enabled: false. No built-in disable filters are provided — add your own in migration/filters/custom_filters.py and reference them in config.yaml with type: disable.
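A custom disable filter follows the same function signature as the exclude example under "Adding custom filters". The sketch below is purely illustrative: the filter name `disable_listed_usernames` and the `usernames` parameter are hypothetical, chosen because the source does not say which MongoDB fields indicate inactivity.

```python
def disable_listed_usernames(doc: dict, params: dict) -> tuple[bool, str]:
    """Match users whose _id appears in a configured list of inactive accounts.

    Hypothetical filter: params["usernames"] is an assumed config key.
    """
    inactive = set(params.get("usernames", []))
    if doc.get("_id", "") in inactive:
        return (True, "Listed as inactive")
    return (False, "")
```

It would be referenced in config.yaml like any other filter, with type: disable and the username list supplied under params.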

Built-in include filters (disabled by default)

| Filter | Description |
|---|---|
| email_domain_whitelist | Only migrates users from specified email domains |
| username_regex_whitelist | Only migrates usernames matching a regex pattern |
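To make the include semantics concrete, here is a sketch of what a domain whitelist filter could look like. The `domains` parameter matches the config shown under "Testing a Small Batch"; the function body, and the convention that True means the user passes, are assumptions rather than the built-in implementation.

```python
def email_domain_whitelist(doc: dict, params: dict) -> tuple[bool, str]:
    """Pass only users whose email domain is listed in params["domains"].

    Assumed convention for include filters: True means the user passes.
    """
    allowed = {domain.lower() for domain in params.get("domains", [])}
    email = doc.get("emailAddress") or ""
    # Take everything after the last "@"; no "@" means no usable domain.
    domain = email.rsplit("@", 1)[-1].lower() if "@" in email else ""
    if domain in allowed:
        return (True, "")
    return (False, f"Email domain not in whitelist: {domain or '(none)'}")
```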

Adding custom filters

  1. Write a filter function in migration/filters/custom_filters.py:
def exclude_test_accounts(doc: dict, params: dict) -> tuple[bool, str]:
    username = doc.get("_id", "")
    if username.lower().startswith("test_"):
        return (True, "Test account")
    return (False, "")
  2. Reference it in config.yaml:
filters:
  - name: exclude_test_accounts
    enabled: true
    type: exclude

Testing a Small Batch

Before running the full migration, test with a small subset of users:

  1. Enable an include filter in config.yaml to limit scope:
  - name: email_domain_whitelist
    enabled: true
    type: include
    params:
      domains:
        - "your-test-domain.edu"
  2. Run the migration:
python migrate.py
  3. Verify the migrated users in the Keycloak admin console and test logging in. Because no passwords are migrated, use "Forgot password?" on the login page to set an initial password.

  4. Once satisfied, disable the include filter, reset progress, and run the full migration:

python migrate.py --reset

Output and Logs

All logs are written to the logs/ directory:

| File | Contents |
|---|---|
| migration.log | Full migration log with timestamps, batch progress, and final summary |
| skipped_users.csv | Every skipped user with: username, email, filter name, reason |
| failed_users.csv | Every failed Keycloak API call with: username, email, error detail |
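As a rough picture of the skipped_users.csv layout, the sketch below writes rows with the four fields listed in the table. The exact header names (`username`, `email`, `filter`, `reason`) and the helper name `write_skipped` are assumptions; the real file may label its columns differently.

```python
import csv
from typing import Iterable, TextIO

# Assumed column names, based on the fields listed for skipped_users.csv.
SKIPPED_COLUMNS = ["username", "email", "filter", "reason"]


def write_skipped(rows: Iterable[dict], stream: TextIO) -> None:
    """Write skipped-user rows as CSV, quoting values that contain commas."""
    writer = csv.DictWriter(stream, fieldnames=SKIPPED_COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

When writing to a real file, open it with `newline=""` as the csv module documentation recommends, so line endings are not translated twice.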

The progress state is saved in migration_state.json at the project root. This file tracks last_processed_id and cumulative counters, enabling resume after interruption.
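A minimal sketch of how such a state file could support resuming is shown below. Only `last_processed_id` comes from the description above; the counter names, the helper names, and the write-then-rename step are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Load saved progress, or start fresh if no state file exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    # Counter names here are hypothetical.
    return {"last_processed_id": None, "migrated": 0, "skipped": 0, "failed": 0}


def save_state(path: Path, state: dict) -> None:
    """Write state to a temp file, then rename, so a crash cannot truncate it."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def resume_query(state: dict) -> dict:
    """Build a MongoDB query that skips documents already processed."""
    last_id = state.get("last_processed_id")
    return {"_id": {"$gt": last_id}} if last_id is not None else {}
```

The query returned by `resume_query` is only correct when the cursor is also sorted by _id in ascending order; otherwise unprocessed documents could be skipped.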

Keycloak H2 Database Backup and Restore

Keycloak uses an embedded H2 database for its internal state (realms, users, sessions, etc.). Before running a migration, back up this database so you can revert if needed.

Prerequisites

The Keycloak container name used below is webprotege-deploy-keycloak-1. Replace it with your actual container name if different.

Backup

  1. Stop Keycloak to ensure a consistent snapshot:
docker stop webprotege-deploy-keycloak-1
  2. Copy the H2 database files from the container to a local backup directory:
mkdir -p keycloak-h2-backup
docker cp webprotege-deploy-keycloak-1:/opt/keycloak/data/h2/. ./keycloak-h2-backup/
  3. Start Keycloak again:
docker start webprotege-deploy-keycloak-1

Restore

  1. Stop Keycloak:
docker stop webprotege-deploy-keycloak-1
  2. Copy the backup files back into the container:
docker cp ./keycloak-h2-backup/. webprotege-deploy-keycloak-1:/opt/keycloak/data/h2/
  3. Fix file ownership inside the container. docker cp sets files to root, but Keycloak runs as the keycloak user:
docker exec -u 0 webprotege-deploy-keycloak-1 chown -R keycloak:keycloak /opt/keycloak/data/h2/
  4. Start Keycloak:
docker start webprotege-deploy-keycloak-1

Note: Skipping the chown step causes Keycloak to open the database in read-only mode, resulting in JdbcBatchUpdateException: The database is read only errors.

Design Decisions

  • No passwords migrated — MD5-salted hashes are incompatible with Keycloak. Migrated users set their password via the "Forgot password?" flow on first login. Enable "Forgot password" and "User registration" in Realm Settings → Login in the Keycloak admin console.
  • Email as username — The user's email address is used as the Keycloak username. This avoids character-validation issues with MongoDB _id values and provides a familiar login identifier.
  • _id stored as webprotege_username attribute — The MongoDB _id is stored as a Keycloak user attribute (webprotege_username). This attribute serves as the identifier to query the project list that belongs to that user.
  • Raw field values — The _id and emailAddress fields are passed through to Keycloak without modification. Altering these values (e.g., stripping characters) could break downstream references such as project ownership. Invalid values are caught by filters before reaching Keycloak.
  • Credentials via environment variables — Sensitive values (MONGODB_URI, KEYCLOAK_ADMIN_USERNAME, KEYCLOAK_ADMIN_PASSWORD) are read from environment variables using ${VAR} placeholders in config.yaml.
  • Sequential API calls with throttle — Keycloak has no bulk user creation endpoint. A configurable delay (default 0.1s) between calls prevents overloading the server. Estimated full migration time at default rate: ~5 hours for 170K users.
  • Resume via state file — The migration cursor is sorted by _id. After each batch, the last processed _id is saved. On restart, a $gt filter skips already-processed records.
  • Direct requests library — No dependency on python-keycloak. Simpler, fewer version compatibility issues with Keycloak 26.1.
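The throttling and retry settings (request_delay, max_retries, retry_backoff) can be pictured with the sketch below, which also reproduces the time estimate. It is an assumption that retry_backoff is an exponential base, so waits grow as 1 s, 2 s, 4 s; the helper names are hypothetical, and a real client would catch only transient HTTP errors rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_retries: int = 3,
    retry_backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying up to max_retries times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted; let the caller log the failure
            sleep(retry_backoff ** attempt)  # assumed schedule: 1, 2, 4 seconds
            attempt += 1


def estimated_hours(user_count: int, request_delay: float = 0.1) -> float:
    """Lower bound on run time from the inter-request delay alone."""
    return user_count * request_delay / 3600
```

For 170,000 users at the default delay, the throttle alone accounts for 170,000 × 0.1 s = 17,000 s, or about 4.7 hours, which is consistent with the ~5 hour estimate once request latency is added.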
