Reapply "Fix stats and Improve Search with Insights (#25495)" #25500

mohityadav766 · 2026-01-24T06:25:16Z

This reverts commit c006bdb.

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

New API endpoints:
- GET /v1/search/stats provides cluster health, index statistics, and orphan index detection
- DELETE /v1/search/stats/orphan removes orphaned indexes with safety checks
- GET /v1/search/reindex/failures retrieves reindexing failure records with pagination and filtering
Search cluster insights:
- New SearchStatsResponse schema tracks documents, shards, storage, and orphan indexes
- Settings UI route added for search insights dashboard
Reindex failure tracking:
- ReindexFailures UI component displays failed entity indexing with details drawer
- IndexingFailureRecorder persists failures with timestamps and error context in both ElasticSearch and OpenSearch sinks
Stats reconciliation:
- StatsReconciler and DistributedJobStatsAggregator aggregate success/failure counts across distributed workers
- CollectionDAO methods support failure record retrieval with entity type filtering
AWS IAM authentication:
- Configuration added for search, pipeline, and Redis cache services

_{This will update automatically on new commits.}

This reverts commit c006bdb.

github-actions · 2026-01-26T05:11:27Z

TypeScript types have been updated based on the JSON schema changes in the PR

gitar-bot · 2026-01-26T07:46:59Z

🔍 CI failure analysis for 3ef0e36: Nine CI jobs failed: SearchInsights (strict mode) and AWS credential tests (RELATED - consistent across 2 Maven jobs) need fixes; UserResourceIT, Python tests, Playwright flaky tests, and AppsResourceTest (UNRELATED).

Issue

Nine CI jobs failed across different test suites:

integration-tests-postgres-opensearch (job 61444540295) - 1 test failure
integration-tests-mysql-elasticsearch (job 61444540272) - 1 test failure
playwright-ci-postgresql (6, 6) (job 61444540146) - 2 test failures
playwright-ci-postgresql (2, 6) (job 61444540147) - 2 test failures
playwright-ci-postgresql (4, 6) (job 61444540185) - 2 test failures
py-run-tests (3.11) (job 61444540131) - 2 test failures
py-run-tests (3.10) (jobs 61444540033 and 61444540132) - 2 test failures (identical)
maven-postgresql-ci (job 61444540268) - 4 test failures
maven-sonarcloud-ci (job 61444705510) - 4 test failures (identical to maven-postgresql-ci)

Root Cause Analysis

RELATED TO PR - SearchInsights Playwright Tests

The PR introduces new SearchInsights UI components with Playwright strict mode violations:

Failures in job 61444540146:

getByText('57') resolved to 4 elements - duplicate text elements
getByText('Orphan Indexes') resolved to 3 elements - text appears multiple times

The fix is: Add unique data-testid attributes to SearchInsights components or use more specific selectors.

RELATED TO PR - AWS Credential Tests

The PR adds an enabled field to AWS IAM authentication configuration, causing 3 test errors in AwsCredentialsUtilTest (consistent across both Maven jobs):

Failures in jobs 61444540268 and 61444705510:

testBuildCredentialsProviderWithEmptyCredentials:77
testBuildCredentialsProviderWithNoCredentials:52
testBuildCredentialsProviderWithOnlyAccessKey:64

Error: IllegalArgumentException: AWS credentials not configured. Either provide accessKeyId/secretAccessKey or set enabled=true to use IAM authentication via the default credential provider chain.

Why this is related:

PR adds enabled field to AWS config:

aws:
  enabled: ${SEARCH_AWS_IAM_AUTH_ENABLED:-false}
  region: ${AWS_DEFAULT_REGION:-""}
  accessKeyId: ${AWS_ACCESS_KEY_ID:-""}

New validation logic requires either valid credentials OR enabled=true
Tests expect previous behavior where empty credentials were silently ignored
Identical failures in both maven-postgresql-ci and maven-sonarcloud-ci confirm this is consistent

The fix is: Update test expectations to account for the new enabled field validation, or adjust the validation logic to maintain backward compatibility.

UNRELATED TO PR - Integration Tests

Failed Test: UserResourceIT.test_listUsersWithAdminFilter:629

Error: Non-admin user should be in filtered list ==> expected: <true> but was: <false>
Occurs in: Both PostgreSQL+OpenSearch and MySQL+Elasticsearch
Statistics: 8,499 tests, 8,458 passed, 1 failed (99.99% pass rate)

Why this is unrelated: PR does not modify user management or admin filtering logic.

UNRELATED TO PR - Python Integration Tests

Failed Tests (consistent across Python 3.10 and 3.11):

test_it_returns_the_expected_classifications - SpacyRecognizer ML model not working in CI
test_s3_ingestion - AWS STS API version mismatch and Minio testcontainer issues

Why these are unrelated: PR does not modify Python ingestion, auto-classification, or S3 connectors.

UNRELATED TO PR - Playwright Flaky Tests

3 Playwright jobs with environmental timing issues, element visibility, and browser state problems.

UNRELATED TO PR - Maven Flaky Test

AppsResourceTest.post_trigger_app_200:369 - Max retries exceeded polling for eventual assert

Occurs in both maven-postgresql-ci and maven-sonarcloud-ci (2 runs, identical failure)
Timing/polling issue unrelated to search indexing or AWS config changes
Test statistics: 7,804 tests run, 701 skipped, 4 failed (99.95% pass rate)

Summary

2 failure categories are PR-related:
- SearchInsights strict mode violations (needs data-testid attributes)
- AWS credential validation tests (need update for new enabled field) - consistent across 2 Maven jobs
4 failure categories are unrelated:
- UserResourceIT admin filtering test
- Python flaky tests (SpacyRecognizer ML, S3/Minio testcontainer)
- Playwright environmental flakiness
- AppsResourceTest polling timeout - consistent across 2 Maven jobs

Code Review ⚠️ Changes requested 6 resolved / 8 findings

PR adds search insights features including stats endpoints, failure tracking, and orphan index management. Two issues remain: integer overflow in stats calculation and missing authorization on getFailures endpoint.

⚠️

Bug: Integer overflow when casting long to int for documents/size stats

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1438 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1441 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1464 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1473 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1474

The SearchResource.getSearchStats() method casts long values to int for document counts and size statistics. This will cause integer overflow for large indices:

Affected locations:

Line 1438: indexStat.setDocuments((int) stats.documents());
Line 1441: indexStat.setSizeInBytes((int) stats.sizeInBytes());
Line 1464: orphan.setSizeInBytes((int) size);
Line 1473: response.setTotalDocuments((int) totalDocs);
Line 1474: response.setTotalSizeInBytes((int) totalSize);

Impact: For production clusters with:

More than 2.1 billion documents (Integer.MAX_VALUE), document counts will wrap to negative values
Indices larger than 2GB will report incorrect sizes

The JSON schema already defines these fields as int64, so the issue is in the Java code casting.

Suggested fix: Use Long type setters or change the generated Java classes to use long for these fields.

💡 Security: Missing authorization check on getFailures endpoint

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchReindexResource.java:74

The SearchReindexResource.getFailures() endpoint does not call any authorization method despite having the Authorizer injected. This endpoint exposes internal failure details including entity IDs, error messages, and stack traces, which could reveal sensitive information about the system internals.

Affected code:

public ReindexFailuresResponse getFailures(
    @Context SecurityContext securityContext,
    ...

Note that the securityContext is received but not used for authorization.

Suggested fix: Add authorization check similar to other admin endpoints:

authorizer.authorizeAdminOrBot(securityContext);

✅ 6 resolved

✅ Edge Case: ConcurrentHashMap not bounded - potential memory growth

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java:405 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/OpenSearchBulkSink.java:398
In ElasticSearchBulkSink.CustomBulkProcessor and OpenSearchBulkSink.CustomBulkProcessor, a ConcurrentHashMap<String, String> docIdToEntityType is used to track entity types for failure reporting. While entries are removed on success/failure, there are scenarios where entries may accumulate:

If a bulk operation fails catastrophically before the response handler runs, entries remain

If the bulk processor is closed mid-operation, pending entries are not cleaned up

For long-running indexing jobs with failures, this map could grow unbounded.

Suggested fix: Consider:

Adding a bounded cache (e.g., Caffeine with max size)

Or adding explicit cleanup in the close() method

Or documenting the expected behavior and maximum size constraints

✅ Bug: Integer overflow when casting long to int for document/size stats

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java
The SearchResource.getSearchStats() method casts long values to int in multiple places, which will cause integer overflow and incorrect values for large search indices:

response.setTotalDocuments((int) totalDocs) - Total document counts can easily exceed 2 billion

response.setTotalSizeInBytes((int) totalSize) - Total size in bytes will overflow at ~2GB

indexStat.setDocuments((int) stats.documents()) - Per-index document counts can overflow

indexStat.setSizeInBytes((int) stats.sizeInBytes()) - Per-index size can overflow

orphan.setSizeInBytes((int) size) - Orphan index sizes can overflow

The JSON schema correctly defines these fields as int64 (long), but the Java code incorrectly downcasts them. This will cause negative numbers or incorrect values to be returned for production deployments with large indices.

Suggested fix: Remove the (int) casts and ensure the generated SearchStatsResponse classes use Long types for these fields. The schema already specifies "format": "int64", so the generated classes should support long values.

✅ Bug: OpenSearchIndexManager always reports health as "GREEN"

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchIndexManager.java
The getAllIndexStats() method in OpenSearchIndexManager hardcodes the health status as "GREEN" instead of getting the actual health from the cluster like ElasticSearchIndexManager does.

In ElasticSearchIndexManager:
String health = stats.health() != null ? stats.health().name().toUpperCase() : "UNKNOWN";
In OpenSearchIndexManager:
String health = "GREEN";  // Always hardcoded!
This means the Search Insights dashboard will always show all OpenSearch indexes as healthy even when they may be in YELLOW or RED state, defeating the purpose of the health monitoring feature.

Suggested fix: Retrieve actual health status from the OpenSearch client similar to how it's done for Elasticsearch.

✅ Bug: Missing isSearchIndexingRunning check in frontend before cleanup

📄 openmetadata-ui/src/main/resources/ui/src/rest/searchAPI.ts 📄 openmetadata-ui/src/main/resources/ui/src/pages/SearchInsightsPage/SearchInsightsPage.tsx
The backend endpoint DELETE /search/stats/orphan properly checks if search indexing is running and returns a 409 Conflict status, but the frontend SearchInsightsPage doesn't fetch or display the isSearchIndexingRunning property from the stats response.

The SearchStatsResponse interface includes isSearchIndexingRunning but:

The TypeScript interface in searchAPI.ts doesn't include this field

The UI doesn't disable the cleanup button when indexing is in progress

Users may see confusing error messages when trying to clean during indexing

Impact: Users might attempt cleanup during active indexing and receive an error, causing confusion.

Suggested fix:

Add isSearchIndexingRunning: boolean to the SearchStatsResponse interface

Disable the "Clean Orphan Indexes" button when indexing is running

Show a tooltip explaining why cleanup is disabled

✅ Edge Case: PartitionResult missing warningsCount in stopped case

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/PartitionWorker.java
In PartitionWorker.java, when a partition is stopped by request (e.g., during shutdown), the exception handler creates a PartitionResult without passing the warningsCount:
} catch (Exception e) {
  LOG.error("Error processing partition {}", partition.getId(), e);
  coordinator.failPartition(partition.getId(), e.getMessage());
  return new PartitionResult(successCount.get(), failedCount.get(), true, 0);
}
The last parameter 0 for readerFailed is passed, but warningsCount is missing (the constructor now takes 5 parameters). This appears to be incorrect as the exception handling path creates a 4-parameter result while the new signature requires 5 parameters.

Looking more carefully at the diff, it seems the stopped case does pass warningsCount.get() correctly. However, the exception handler on line ~229 in the truncated portion may still be missing the warnings count. This needs verification.

Suggested fix: Ensure all PartitionResult constructor calls include the warningsCount parameter.

...and 1 more resolved from earlier reviews

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

sonarqubecloud · 2026-01-26T08:19:15Z

Quality Gate passed for 'open-metadata-ui'

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Reapply "Fix stats and Improve Search with Insights (#25495)"

b29967a

This reverts commit c006bdb.

mohityadav766 requested a review from aniketkatkar97 January 24, 2026 06:25

mohityadav766 self-assigned this Jan 24, 2026

mohityadav766 requested review from a team, akash-jain-10, harshach and tutte as code owners January 24, 2026 06:25

mohityadav766 temporarily deployed to test January 24, 2026 06:25 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test January 24, 2026 06:25 — with GitHub Actions Failure

mohityadav766 temporarily deployed to test January 24, 2026 06:25 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test January 24, 2026 06:25 — with GitHub Actions Failure

mohityadav766 had a problem deploying to test January 24, 2026 06:25 — with GitHub Actions Error

github-actions bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jan 24, 2026

mohityadav766 removed the backend label Jan 24, 2026

mohityadav766 added 2 commits January 24, 2026 13:57

Fix Sonar coverage

cdccd64

Fix lang

7c9527d

mohityadav766 had a problem deploying to test January 24, 2026 08:28 — with GitHub Actions Error

github-actions bot added the backend label Jan 24, 2026

mohityadav766 temporarily deployed to test January 24, 2026 08:29 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test January 24, 2026 08:29 — with GitHub Actions Failure

mohityadav766 temporarily deployed to test January 25, 2026 03:52 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test January 25, 2026 03:52 — with GitHub Actions Failure

Merge branch 'main' into fix-stats-revert

1be1dea

mohityadav766 had a problem deploying to test January 25, 2026 11:55 — with GitHub Actions Failure

mohityadav766 temporarily deployed to test January 25, 2026 11:55 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test January 25, 2026 11:55 — with GitHub Actions Failure

mohityadav766 temporarily deployed to test January 25, 2026 11:55 — with GitHub Actions Inactive

Merge branch 'main' into fix-stats-revert

0b36882

mohityadav766 temporarily deployed to test January 26, 2026 05:07 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test January 26, 2026 05:07 — with GitHub Actions Failure

mohityadav766 temporarily deployed to test January 26, 2026 05:07 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test January 26, 2026 05:07 — with GitHub Actions Failure

Update generated TypeScript types

61c5234

Merge branch 'main' into fix-stats-revert

3ef0e36

pmbrull had a problem deploying to test January 26, 2026 07:46 — with GitHub Actions Failure

pmbrull temporarily deployed to test January 26, 2026 07:46 — with GitHub Actions Inactive

pmbrull had a problem deploying to test January 26, 2026 07:46 — with GitHub Actions Failure

pmbrull temporarily deployed to test January 26, 2026 07:46 — with GitHub Actions Inactive

pmbrull had a problem deploying to test January 26, 2026 07:46 — with GitHub Actions Failure

pmbrull temporarily deployed to test January 26, 2026 07:46 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reapply "Fix stats and Improve Search with Insights (#25495)" #25500

Reapply "Fix stats and Improve Search with Insights (#25495)" #25500

mohityadav766 commented Jan 24, 2026 •

edited by gitar-bot bot

Loading

Uh oh!

github-actions bot commented Jan 26, 2026

Uh oh!

gitar-bot bot commented Jan 26, 2026 •

edited

Loading

Issue

Root Cause Analysis

RELATED TO PR - SearchInsights Playwright Tests

RELATED TO PR - AWS Credential Tests

UNRELATED TO PR - Integration Tests

UNRELATED TO PR - Python Integration Tests

UNRELATED TO PR - Playwright Flaky Tests

UNRELATED TO PR - Maven Flaky Test

Summary

Uh oh!

sonarqubecloud bot commented Jan 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Reapply "Fix stats and Improve Search with Insights (#25495)" #25500

Are you sure you want to change the base?

Reapply "Fix stats and Improve Search with Insights (#25495)" #25500

Conversation

mohityadav766 commented Jan 24, 2026 • edited by gitar-bot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes:

Type of change:

Checklist:

Summary by Gitar

Uh oh!

github-actions bot commented Jan 26, 2026

Uh oh!

gitar-bot bot commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue

Root Cause Analysis

RELATED TO PR - SearchInsights Playwright Tests

RELATED TO PR - AWS Credential Tests

UNRELATED TO PR - Integration Tests

UNRELATED TO PR - Python Integration Tests

UNRELATED TO PR - Playwright Flaky Tests

UNRELATED TO PR - Maven Flaky Test

Summary

Uh oh!

sonarqubecloud bot commented Jan 26, 2026

Quality Gate passed for 'open-metadata-ui'

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mohityadav766 commented Jan 24, 2026 •

edited by gitar-bot bot

Loading

gitar-bot bot commented Jan 26, 2026 •

edited

Loading