
[HWORKS-2682] websocket executor pools improvements and scalable executor pools #593

Draft
o-alex wants to merge 5 commits into logicalclocks:main from o-alex:HWORKS-2682

Contributor

@o-alex o-alex commented May 29, 2026

Summary

  • New admin page docs/setup_installation/admin/monitoring/websocket-pool.md: pool model (2 threads per WebSocket connection), the five Grafana panels in the Hopsworks dashboard, the full MP-Metrics reference table (ws_connection_*, ws_pool_*), the per-pod legend convention, the Helm values that govern sizing and the auto-size formula, the Grizzly timeouts, and a four-panel watchlist for after a sizing change.
  • New user-facing page docs/user_guides/projects/jupyter/session_capacity_warnings.md: two-badge state matrix (instance + cluster × orange/red), what happens when each turns red, recovery steps. Three screenshots from the Jupyter, Terminal, and Apps pages in docs/assets/images/user_guides/jupyter/.
  • mkdocs nav extended under Projects → Jupyter.

Test plan

  • Strict build (touch docs/javadoc; uv run mkdocs build -s) completes cleanly.
  • Page rendering verified locally via mkdocs serve; cross-links between admin and user pages resolve.

Companion PRs: hopsworks-ee#2875, hopsworks-front (new), hopsworks-helm (new).

o-alex and others added 3 commits May 28, 2026 11:06
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

The earlier WebSocket-pump fix on this ticket bounded the proxy's
thread pool but left pool exhaustion invisible to users: when the
cluster ran out of pump threads, new sessions would either hang in a
connecting state or come up and close unexpectedly, with no actionable
signal anywhere. The pool is shared across Jupyter notebooks,
terminals, and Streamlit apps, and on a busy multi-pod deployment
there was no way for an admin to see how close the cluster was to
saturation, no way for a user to learn why their session was failing,
and no way for the backend to refuse a request that could not possibly
succeed.

This change set adds the missing visibility and gating end-to-end.
Each hopsworks-instance pod publishes its pool state into a Hazelcast
map every five seconds; a JAX-RS endpoint exposes the cluster
aggregate; backend pre-flight on the three session-start paths
rejects with HTTP 503 when the cluster is at the limit; the proxy
itself pre-checks WebSocket upgrade requests so in-session opens
fail at HTTP level rather than after the 101 handshake; per-pod
MicroProfile metrics drive new Grafana panels for sessions, CPU
draw, and allocation rate; and the UI on every tool page shows a
compact badge and disables the start button when the cluster is full.
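
The pre-flight gate described above can be sketched roughly as follows. This is a minimal illustration, not the Hopsworks implementation: the `PoolState` fields, the function names, and the plain dict standing in for the Hazelcast map are all assumptions made for the example. The `<= 0` threshold mirrors the `clusterAvailableThreads <= 0` check mentioned for the frontend hook; the backend may use a different rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolState:
    """Hypothetical per-pod snapshot; field names are illustrative only."""
    max_threads: int
    active_threads: int


def cluster_available_threads(states: dict[str, PoolState]) -> int:
    """Sum the free threads over every pod's most recent snapshot."""
    return sum(max(s.max_threads - s.active_threads, 0) for s in states.values())


def preflight_status(states: dict[str, PoolState]) -> int:
    """Return 503 when the cluster has no free pump threads, else 200."""
    if cluster_available_threads(states) <= 0:
        return 503
    return 200
```

In this sketch a session start would call `preflight_status` on the latest snapshots before doing any work, so a request that cannot succeed is refused up front instead of hanging in a connecting state.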

The docs site gains two pages. A new admin page,
setup_installation/admin/monitoring/websocket-pool.md, covers how the
pool relates to concurrent users, the five Grafana panels in the
Hopsworks dashboard (rejection rate, duration percentiles, sessions,
pool CPU, pool allocation rate as GC pressure), the full MP-Metrics
reference table including ws_pool_cpu_cores and
ws_pool_alloc_bytes_per_second, the per-pod legend convention
(pod_short rewrite stripping the hopsworks-instance- prefix), the
Helm values that govern sizing and Grizzly timeouts, and a four-panel
watchlist for after a sizing change (Memory + Max-heap line, pool
CPU, allocation rate, GC). A new user-facing page,
user_guides/projects/jupyter/session_capacity_warnings.md, explains
the two badges (instance and cluster) with their orange (warning)
and red (critical) states, what happens to the action buttons when
red, and the recovery steps; three screenshots from Jupyter,
Terminal, and Apps show the badges in context. The admin doc links
to the user page and vice versa, and mkdocs.yml is extended with
the new user-guide entry under Projects > Jupyter.
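
As a small illustration of the per-pod legend convention, the `pod_short` rewrite amounts to dropping a fixed prefix from the pod name. The sketch below only shows the string transformation; the function name and the sample pod name are made up for the example, and the real rewrite presumably happens in the dashboard or metrics pipeline rather than in Python.

```python
PREFIX = "hopsworks-instance-"


def pod_short(pod: str) -> str:
    """Strip the hopsworks-instance- prefix; leave other names unchanged."""
    return pod.removeprefix(PREFIX)  # requires Python 3.9+


print(pod_short("hopsworks-instance-7d9f"))  # 7d9f
```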

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Document the auto-size path that was just added to hopsworks-helm under
payara.executorService.websockets.autoSize. The admin page now describes
that core follows the worker's CPU request (always-on capacity), max
follows the CPU limit (burst ceiling), and a memory cap derived from
resources.worker.jvm.memory.buffer clamps the burst ceiling so the
formula cannot exceed the JVM's native memory budget at very high CPU
counts. A three-row table covers the common pod shapes; the guidance is
deliberately user-visible only (which knobs to set, when to bump
jvm.memory.buffer alongside CPU). The detailed implementation reference
(formula derivation, memory cap math, when the cap actually binds, how
this relates to the fleet-wide guard in HWORKS-2682_extra) lives in
hopsworks-helm/.claude/docs/architecture/websockets-pool-auto-sizing.md
rather than the public docs.
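
For orientation, the shape of the auto-size rule can be sketched as below. This is an approximation under stated assumptions, not the chart's formula: the 25 and 50 threads-per-core multipliers are the chart defaults quoted elsewhere in this PR, while the 1 MiB-per-thread memory cost, the parameter names, and the rule that `core` never exceeds `max` are assumptions made for the example.

```python
def auto_size(
    cpu_request: float,
    cpu_limit: float,
    memory_buffer_mib: int,
    core_per_cpu: int = 25,       # chart default quoted in this PR
    max_per_cpu: int = 50,        # chart default quoted in this PR
    mib_per_thread: float = 1.0,  # assumption: rough native cost per thread
) -> tuple[int, int]:
    """Return (core, max) thread counts for the WebSocket pool."""
    core = int(cpu_request * core_per_cpu)                 # always-on capacity
    burst = int(cpu_limit * max_per_cpu)                   # burst ceiling
    memory_cap = int(memory_buffer_mib / mib_per_thread)   # native memory budget
    max_threads = min(burst, memory_cap)                   # cap clamps the ceiling
    core = min(core, max_threads)                          # assumption: keep core <= max
    return core, max_threads
```

With these assumed numbers, `auto_size(2, 4, 1024)` gives `(50, 200)` and the memory cap does not bind, whereas `auto_size(8, 32, 512)` gives `(200, 512)` because the burst ceiling of 1600 exceeds the buffer. That second case is the situation where bumping `jvm.memory.buffer` alongside CPU matters.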

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Codex pre-review found that the user-facing capacity-warnings page
claimed buttons disable while *either* badge is red, but the frontend
hook only sets the disabled state when `clusterAvailableThreads <= 0`
(cluster-critical only). The instance-only-critical case keeps the
buttons enabled by design, because refreshing or signing back in may
land the user on a different instance pod that still has capacity.

Correct the "What happens when a badge turns red" bullet to match the
actual behavior.
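
The corrected behavior boils down to the rule sketched below. The function and parameter names are hypothetical, and the actual hook lives in the frontend rather than in Python; only the `clusterAvailableThreads <= 0` condition is taken from the description above.

```python
def start_button_disabled(instance_critical: bool, cluster_available_threads: int) -> bool:
    """Disable the start button only when the whole cluster is out of threads."""
    # instance_critical is deliberately ignored: a refresh or a new sign-in
    # may land the user on another instance pod that still has capacity.
    return cluster_available_threads <= 0
```

So an instance badge that is red on its own leaves the buttons enabled, while a red cluster badge disables them regardless of the instance state.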

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds documentation for the new scalable WebSocket executor pool feature: an admin monitoring/tuning page and a user-facing page explaining the capacity badges that surface in the UI.

Changes:

  • New admin page docs/setup_installation/admin/monitoring/websocket-pool.md covering the pool model, Grafana panels, MP-Metrics, Helm tuning (autoSize, threadsPerCore, etc.), and Grizzly timeouts.
  • New user page docs/user_guides/projects/jupyter/session_capacity_warnings.md describing the instance/cluster × orange/red badge matrix, recovery steps, and three screenshots.
  • mkdocs.yml nav entries added under Projects → Jupyter and Setup → Administration → Monitoring.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| mkdocs.yml | Registers the two new pages in the navigation tree. |
| docs/setup_installation/admin/monitoring/websocket-pool.md | New admin/monitoring reference for the WebSocket proxy pool. |
| docs/user_guides/projects/jupyter/session_capacity_warnings.md | New user guide for the capacity badges shown in Jupyter, Terminal, and Apps. |

…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Apply Copilot review cycle 1 auto-fix from the hopsworks-helm review.
The admin doc's "Override threadsPerCore" hint said the default rule
of thumb was "25 connections per CPU core sustained / 50 connections
per CPU core burst", which overstated capacity by 2×: the chart-
default multipliers are threads-per-core (`core: 25`, `max: 50`), and
each WebSocket session uses two threads, so they translate to 12.5
sustained and 25 burst connections per CPU core. Rewrite the hint
with the correct halving and a brief explanation of where the factor
of 2 comes from.
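
The halving can be checked with a couple of lines. The helper name is made up for the example; the multipliers and the two-threads-per-session figure are the ones stated above.

```python
THREADS_PER_SESSION = 2  # each WebSocket session uses two pool threads


def connections_per_core(threads_per_core: float) -> float:
    """Convert a threads-per-core multiplier into connections per CPU core."""
    return threads_per_core / THREADS_PER_SESSION


print(connections_per_core(25))  # 12.5 sustained (core: 25)
print(connections_per_core(50))  # 25.0 burst (max: 50)
```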

Reviewed-by: Copilot
Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 6 changed files in this pull request and generated 1 comment.

Comment thread docs/user_guides/projects/jupyter/session_capacity_warnings.md Outdated
…utor pools

https://hopsworks.atlassian.net/browse/HWORKS-2682

Apply Copilot review cycle 2 auto-fix: move the three WebSocket badge
screenshots from `docs/assets/images/user_guides/jupyter/` to the
repo's existing `docs/assets/images/guides/jupyter/` convention and
update the references in `session_capacity_warnings.md`.

The `user_guides/` path I created earlier was unique to this PR;
every other Jupyter-related screenshot (in `ray_notebook.md`,
`python_notebook.md`, `spark_job.md`, `pyspark_job.md`) lives under
`guides/jupyter/`. Following the existing convention keeps the assets
directory uncluttered and matches what an `mkdocs serve` user would
expect.

Reviewed-by: Copilot
Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants