Add clustering evaluation with color toggle, ARI/NMI, and cluster analysis #23

Closed
NetZissou wants to merge 1 commit into fix/precalculated-tooltip-null from feature/clustering-evaluation

@NetZissou (Collaborator) commented Mar 30, 2026

Summary

Adds evaluation capabilities to the "Use column values" clustering mode, allowing users to assess how well KMeans clusters align with known categorical labels.

  • Color toggle: radio in plot options to switch between KMeans cluster coloring and ground truth column coloring -- same points, same positions, instant comparison
  • ARI/NMI metrics: Adjusted Rand Index and Normalized Mutual Information computed automatically after KMeans, displayed in the Cluster Analysis section
  • Cluster analysis tree: full-width console output showing per-cluster composition with purity, entropy, and ranked breakdown by ground truth category
  • Cardinality cap at 16: blocks clustering when column has >16 unique values to prevent misleading duplicate colors -- suggests filtering first
  • Removed "Label by column" mode: redundant with the color toggle in "Use column values"
  • Cleaned up unused run_dim_reduction/run_dim_reduction_safe from ClusteringService
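
For readers unfamiliar with the two metrics named above, here is a minimal standard-library sketch of how ARI and NMI can be computed from a pair of labelings. It is an illustration only, not the PR's implementation: the function names are hypothetical, and the NMI here normalizes by the arithmetic mean of the two entropies. scikit-learn provides `adjusted_rand_score` and `normalized_mutual_info_score` for the same quantities; this page does not say which the PR uses.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:  # fewer than two points: treat as trivially identical
        return 1.0
    # Pairs of points that share a cell, a row (truth), or a column (prediction).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def _entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return sum((c / n) * log(n / c) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in Counter(zip(labels_true, labels_pred)).items():
        mi += (c / n) * log(c * n / (count_true[t] * count_pred[p]))
    denom = (_entropy(labels_true) + _entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0
```

Both metrics ignore the actual label names, so a KMeans result that matches the ground truth column up to a renaming of cluster IDs scores 1.0 on both.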

Depends on #22

Generated with Claude Code

…r analysis

- Preserve KMeans labels in "Use column values" mode (no longer overwritten)
- Color toggle: switch between KMeans clusters and ground truth column in plot
- ARI/NMI metrics computed automatically for evaluation mode
- Cluster analysis tree: full-width console output with per-cluster purity,
  entropy, and ranked breakdown by ground truth category
- Cardinality cap at 16: blocks clustering with error when column has too many
  unique values to avoid misleading duplicate colors
- Remove "Label by column" mode (redundant with color toggle)
- Clean up unused dim-reduction-only methods from ClusteringService
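
As a rough illustration of the cluster analysis tree described in the commit message, the sketch below computes per-cluster size, purity, entropy, and a ranked breakdown by ground truth category, then renders them as indented text. The function names, the dictionary layout, and the output format are assumptions made for this example; the PR's actual console output is not shown on this page.

```python
from collections import Counter, defaultdict
from math import log2


def cluster_composition(cluster_ids, truth_labels):
    """Per-cluster size, purity, entropy (bits), and categories ranked by count."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, truth_labels):
        members[cid].append(label)

    summary = {}
    for cid in sorted(members):
        counts = Counter(members[cid])
        size = len(members[cid])
        ranked = counts.most_common()  # most frequent category first
        summary[cid] = {
            "size": size,
            # Purity: share of the cluster taken by its dominant category.
            "purity": ranked[0][1] / size,
            # Entropy: 0 for a pure cluster, larger for mixed clusters.
            "entropy": sum((c / size) * log2(size / c) for c in counts.values()),
            "ranked": ranked,
        }
    return summary


def format_tree(summary):
    """Render the summary as an indented text tree (hypothetical layout)."""
    lines = []
    for cid, info in summary.items():
        lines.append(
            f"Cluster {cid} (n={info['size']}, "
            f"purity={info['purity']:.2f}, entropy={info['entropy']:.2f})"
        )
        last = len(info["ranked"]) - 1
        for i, (label, count) in enumerate(info["ranked"]):
            branch = "└──" if i == last else "├──"
            lines.append(f"  {branch} {label}: {count} ({count / info['size']:.0%})")
    return "\n".join(lines)
```

Calling `print(format_tree(cluster_composition(kmeans_labels, column_values)))` with two equal-length sequences would produce one block per cluster, with the dominant ground truth category listed first.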

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@egrace479 (Member)

It'd be great if we could color by other column values too -- maybe a dropdown?

Let's use a color scheme with 20 colors and just show a warning when colors are duplicated, instead of blocking clustering based on the number of distinct colors.
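
One way the warn-instead-of-block idea could look, as a minimal sketch: categories are mapped onto a 20-color palette, colors wrap around when there are more categories than colors, and the caller gets back a warning message to display rather than an error. The palette here is a placeholder generated with `colorsys`, not the scheme the project would actually pick, and `build_palette` and `assign_colors` are hypothetical names.

```python
import colorsys

PALETTE_SIZE = 20  # size suggested in the comment above


def build_palette(n=PALETTE_SIZE):
    """Evenly spaced hues as hex strings (stand-in for a real 20-color scheme)."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.65, 0.9)
        colors.append(f"#{round(r * 255):02x}{round(g * 255):02x}{round(b * 255):02x}")
    return colors


def assign_colors(values, palette=None):
    """Map each distinct value to a color; return (mapping, warning or None)."""
    palette = palette or build_palette()
    categories = sorted(set(values), key=str)
    warning = None
    if len(categories) > len(palette):
        # Do not block: reuse colors and tell the user that some are shared.
        warning = (
            f"{len(categories)} categories but only {len(palette)} colors; "
            "some categories share a color."
        )
    mapping = {cat: palette[i % len(palette)] for i, cat in enumerate(categories)}
    return mapping, warning
```

The UI layer would then show `warning` next to the plot when it is not `None`, leaving the clustering itself untouched.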

@NetZissou (Collaborator, Author)

Closing this PR in favor of a redesigned approach discussed in a follow-up meeting.

What's changing: The current design couples dim reduction and KMeans into a single "Run Clustering" button with mode switches. The redesign separates them into independent operations:

  1. Dim reduction runs first (standalone) -- produces the 2D scatter plot
  2. Color by any column via a general dropdown -- metadata columns are immediately available
  3. KMeans is optional -- when run, it just adds another label to the color dropdown (since KMeans on full embeddings is just a mapping of observation -> cluster ID)

This better reflects the actual data flow (both operations independently consume the same full-size embeddings) and makes the UI more intuitive.
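
The separation described above can be sketched roughly as follows. This is an assumption-laden illustration, not the planned implementation: `ExplorerState`, its fields, and its methods are hypothetical, and the reducer and clusterer are passed in as plain callables so that both visibly consume the same full-size embeddings.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ExplorerState:
    embeddings: list                     # full-size embeddings, one vector per observation
    metadata: dict                       # column name -> list of values, one per observation
    coords: Optional[list] = None        # 2D positions, set by dim reduction
    derived_labels: dict = field(default_factory=dict)  # e.g. {"KMeans": [...]}

    def run_dim_reduction(self, reducer: Callable[[list], list]) -> None:
        """Step 1: standalone; produces the scatter plot positions."""
        self.coords = reducer(self.embeddings)

    def run_kmeans(self, clusterer: Callable[[list], list], name: str = "KMeans") -> None:
        """Step 3 (optional): clusters the embeddings, not the 2D coords."""
        self.derived_labels[name] = clusterer(self.embeddings)

    def color_options(self) -> list:
        """Step 2: everything the color-by dropdown can offer right now."""
        return list(self.metadata) + list(self.derived_labels)

    def colors_for(self, option: str) -> list:
        """Values to color the points by for the selected dropdown entry."""
        if option in self.derived_labels:
            return self.derived_labels[option]
        return self.metadata[option]


# Toy stand-ins for a real reducer and a real KMeans, for illustration only.
state = ExplorerState(
    embeddings=[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7], [0.2, 0.1, 0.4]],
    metadata={"species": ["a", "b", "a"]},
)
state.run_dim_reduction(lambda emb: [tuple(v[:2]) for v in emb])
options_before = state.color_options()   # metadata columns only
state.run_kmeans(lambda emb: [0 if v[0] < 0.5 else 1 for v in emb])
options_after = state.color_options()    # KMeans appears as one more label
```

In this shape, running KMeans never changes `coords`, and skipping it still leaves every metadata column available for coloring.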

The evaluation features from this PR (ARI/NMI, cluster analysis tree, color comparison) will carry forward into the new implementation -- just triggered by the general color-by selector rather than a dedicated mode.

PR #22 (bug fixes) remains valid and should merge first.

@NetZissou closed this Mar 31, 2026
@NetZissou deleted the feature/clustering-evaluation branch April 1, 2026 18:19

2 participants