[HWORKS-2682] websocket executor pools improvements and scalable executor pools#593
Draft
o-alex wants to merge 5 commits into
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

The earlier WebSocket-pump fix on this ticket bounded the proxy's thread pool but left pool exhaustion invisible to users: when the cluster ran out of pump threads, new sessions would either hang in a connecting state or come up and close unexpectedly, with no actionable signal anywhere. The pool is shared across Jupyter notebooks, terminals, and Streamlit apps, and on a busy multi-pod deployment there was no way for an admin to see how close the cluster was to saturation, no way for a user to learn why their session was failing, and no way for the backend to refuse a request that could not possibly succeed.

This change set adds the missing visibility and gating end-to-end:

- Each hopsworks-instance pod publishes its pool state into a Hazelcast map every five seconds.
- A JAX-RS endpoint exposes the cluster aggregate.
- Backend pre-flight on the three session-start paths rejects with HTTP 503 when the cluster is at the limit.
- The proxy itself pre-checks WebSocket upgrade requests so in-session opens fail at HTTP level rather than after the 101 handshake.
- Per-pod MicroProfile metrics drive new Grafana panels for sessions, CPU draw, and allocation rate.
- The UI on every tool page shows a compact badge and disables the start button when the cluster is full.

The docs site gains two pages.

A new admin page, setup_installation/admin/monitoring/websocket-pool.md, covers how the pool relates to concurrent users, the five Grafana panels in the Hopsworks dashboard (rejection rate, duration percentiles, sessions, pool CPU, pool allocation rate as GC pressure), the full MP-Metrics reference table including ws_pool_cpu_cores and ws_pool_alloc_bytes_per_second, the per-pod legend convention (pod_short rewrite stripping the hopsworks-instance- prefix), the Helm values that govern sizing and Grizzly timeouts, and a four-panel watchlist for after a sizing change (Memory + Max-heap line, pool CPU, allocation rate, GC).

A new user-facing page, user_guides/projects/jupyter/session_capacity_warnings.md, explains the two badges (instance and cluster) with their orange (warning) and red (critical) states, what happens to the action buttons when red, and the recovery steps; three screenshots from Jupyter, Terminal, and Apps show the badges in context. The admin doc links to the user page and vice versa, and mkdocs.yml is extended with the new user-guide entry under Projects > Jupyter.

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
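The aggregate-then-gate flow described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: the `PodPoolState` type, its field names, and both function names are hypothetical, and the only behavior taken from the PR is that per-pod pool state is aggregated across the cluster and that a session start is refused with HTTP 503 when no threads are left.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PodPoolState:
    """Hypothetical per-pod snapshot, as a pod might publish it periodically."""
    pod: str
    max_threads: int
    active_threads: int


def cluster_available_threads(states: Iterable[PodPoolState]) -> int:
    """Sum the free pump threads over every pod's latest snapshot."""
    return sum(max(s.max_threads - s.active_threads, 0) for s in states)


def preflight_status(states: Iterable[PodPoolState]) -> int:
    """Return the HTTP status a session-start pre-flight would answer with.

    Assumption: "at the limit" means no free threads remain cluster-wide.
    """
    return 503 if cluster_available_threads(states) <= 0 else 200


if __name__ == "__main__":
    snapshots = [
        PodPoolState("hopsworks-instance-0", max_threads=100, active_threads=100),
        PodPoolState("hopsworks-instance-1", max_threads=100, active_threads=96),
    ]
    print(cluster_available_threads(snapshots), preflight_status(snapshots))
```

Gating on the cluster aggregate rather than on a single pod matches the PR's stated intent of refusing only requests that could not succeed anywhere in the cluster.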
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Document the auto-size path that was just added to hopsworks-helm under payara.executorService.websockets.autoSize. The admin page now describes that core follows the worker's CPU request (always-on capacity), max follows the CPU limit (burst ceiling), and a memory cap derived from resources.worker.jvm.memory.buffer clamps the burst ceiling so the formula cannot exceed the JVM's native memory budget at very high CPU counts. A three-row table covers the common pod shapes; the guidance is deliberately user-visible only (which knobs to set, when to bump jvm.memory.buffer alongside CPU).

The detailed implementation reference (formula derivation, memory cap math, when the cap actually binds, how this relates to the fleet-wide guard in HWORKS-2682_extra) lives in hopsworks-helm/.claude/docs/architecture/websockets-pool-auto-sizing.md rather than the public docs.

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
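The shape of the auto-size rule described above (core from the CPU request, max from the CPU limit, clamped by a memory cap) might look something like the sketch below. The real formula lives in the hopsworks-helm reference and is not reproduced here: the function name, the rounding, and especially the `mib_per_thread` cost are assumptions made purely for illustration. The 25 and 50 multipliers are the chart defaults mentioned elsewhere in this PR.

```python
import math


def auto_size(cpu_request: float, cpu_limit: float, memory_buffer_mib: float,
              core_per_cpu: int = 25, max_per_cpu: int = 50,
              mib_per_thread: float = 1.0) -> tuple[int, int]:
    """Return (core_threads, max_threads) for a hypothetical worker pod.

    mib_per_thread is an invented per-thread native memory cost; the actual
    cap math is documented in the helm chart, not here.
    """
    core = math.floor(cpu_request * core_per_cpu)         # always-on capacity
    burst = math.floor(cpu_limit * max_per_cpu)           # burst ceiling from the CPU limit
    memory_cap = math.floor(memory_buffer_mib / mib_per_thread)
    # The memory cap only binds when the CPU-derived ceiling exceeds it.
    # This sketch does not handle a cap that falls below the core size.
    return core, min(burst, memory_cap)


if __name__ == "__main__":
    print(auto_size(2, 4, 512))     # small pod: CPU limit decides the ceiling
    print(auto_size(8, 32, 1024))   # large pod: memory cap decides the ceiling
```

Under these assumed numbers, the cap stays inactive on small pods and only clamps the ceiling at high CPU counts, which is the behavior the commit message describes.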
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Codex pre-review found that the user-facing capacity-warnings page claimed buttons disable while *either* badge is red, but the frontend hook only sets the disabled state when `clusterAvailableThreads <= 0` (cluster-critical only). The instance-only-critical case keeps the buttons enabled by design, because refreshing or signing back in may land the user on a different instance pod that still has capacity. Correct the "What happens when a badge turns red" bullet to match the actual behavior.

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
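The corrected rule can be restated as a small predicate. This is an illustrative Python restatement, not the frontend hook itself: the function and parameter names are hypothetical, and only the `clusterAvailableThreads <= 0` condition comes from the commit message.

```python
def start_button_disabled(cluster_available_threads: int,
                          instance_available_threads: int) -> bool:
    """Disable the start button only when the whole cluster is out of threads.

    instance_available_threads is accepted but deliberately ignored: an
    instance-only critical state keeps the button enabled, since a refresh
    or new sign-in may land the user on a pod that still has capacity.
    """
    return cluster_available_threads <= 0


if __name__ == "__main__":
    print(start_button_disabled(0, 0))    # cluster critical
    print(start_button_disabled(40, 0))   # instance critical only
```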
Pull request overview
Adds documentation for the new scalable WebSocket executor pool feature: an admin monitoring/tuning page and a user-facing page explaining the capacity badges that surface in the UI.
Changes:
- New admin page `docs/setup_installation/admin/monitoring/websocket-pool.md` covering the pool model, Grafana panels, MP-Metrics, Helm tuning (`autoSize`, `threadsPerCore`, etc.), and Grizzly timeouts.
- New user page `docs/user_guides/projects/jupyter/session_capacity_warnings.md` describing the instance/cluster × orange/red badge matrix, recovery steps, and three screenshots.
- `mkdocs.yml` nav entries added under Projects → Jupyter and Setup → Administration → Monitoring.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `mkdocs.yml` | Registers the two new pages in the navigation tree. |
| `docs/setup_installation/admin/monitoring/websocket-pool.md` | New admin/monitoring reference for the WebSocket proxy pool. |
| `docs/user_guides/projects/jupyter/session_capacity_warnings.md` | New user guide for the capacity badges shown in Jupyter, Terminal, and Apps. |
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Apply Copilot review cycle 1 auto-fix from the hopsworks-helm review. The admin doc's "Override threadsPerCore" hint said the default rule of thumb was "25 connections per CPU core sustained / 50 connections per CPU core burst", which overstated capacity by 2×: the chart-default multipliers are threads-per-core (`core: 25`, `max: 50`), and each WebSocket session uses two threads, so they translate to 12.5 sustained and 25 burst connections per CPU core. Rewrite the hint with the correct halving and a brief explanation of where the factor of 2 comes from.

Reviewed-by: Copilot
Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
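The halving in the commit message above is simple arithmetic, sketched here for reference. The constant and function names are illustrative; the two-threads-per-session figure and the 25/50 chart defaults come from the commit message.

```python
THREADS_PER_CONNECTION = 2  # each WebSocket session uses two threads


def connections_per_core(threads_per_core: float) -> float:
    """Convert a threads-per-core multiplier into connections per CPU core."""
    return threads_per_core / THREADS_PER_CONNECTION


if __name__ == "__main__":
    print(connections_per_core(25))  # sustained, from `core: 25`
    print(connections_per_core(50))  # burst, from `max: 50`
```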
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Apply Copilot review cycle 2 auto-fix: move the three WebSocket badge screenshots from `docs/assets/images/user_guides/jupyter/` to the repo's existing `docs/assets/images/guides/jupyter/` convention and update the references in `session_capacity_warnings.md`. The `user_guides/` path I created earlier was unique to this PR; every other Jupyter-related screenshot (in `ray_notebook.md`, `python_notebook.md`, `spark_job.md`, `pyspark_job.md`) lives under `guides/jupyter/`. Following the existing convention keeps the assets directory uncluttered and matches what an `mkdocs serve` user would expect.

Reviewed-by: Copilot
Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `docs/setup_installation/admin/monitoring/websocket-pool.md`: pool model (2 threads per WebSocket connection), the five Grafana panels in the Hopsworks dashboard, the full MP-Metrics reference table (`ws_connection_*`, `ws_pool_*`), the per-pod legend convention, the Helm values that govern sizing and the auto-size formula, the Grizzly timeouts, and a four-panel watchlist for after a sizing change.
- `docs/user_guides/projects/jupyter/session_capacity_warnings.md`: two-badge state matrix (instance + cluster × orange/red), what happens when each turns red, recovery steps. Three screenshots from the Jupyter, Terminal, and Apps pages in `docs/assets/images/user_guides/jupyter/`.

Test plan
- `touch docs/javadoc; uv run mkdocs build -s` clean.
- `mkdocs serve`; cross-links between admin and user pages resolve.

Companion PRs: hopsworks-ee#2875, hopsworks-front (new), hopsworks-helm (new).