Skip to content

Conversation

@mohityadav766
Copy link
Member

@mohityadav766 mohityadav766 commented Jan 24, 2026

This reverts commit c006bdb.

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • New API endpoints:
    • GET /v1/search/stats provides cluster health, index statistics, and orphan index detection
    • DELETE /v1/search/stats/orphan removes orphaned indexes with safety checks
    • GET /v1/search/reindex/failures retrieves reindexing failure records with pagination and filtering
  • Search cluster insights:
    • New SearchStatsResponse schema tracks documents, shards, storage, and orphan indexes
    • Settings UI route added for search insights dashboard
  • Reindex failure tracking:
    • ReindexFailures UI component displays failed entity indexing with details drawer
    • IndexingFailureRecorder persists failures with timestamps and error context in both ElasticSearch and OpenSearch sinks
  • Stats reconciliation:
    • StatsReconciler and DistributedJobStatsAggregator aggregate success/failure counts across distributed workers
    • CollectionDAO methods support failure record retrieval with entity type filtering
  • AWS IAM authentication:
    • Configuration added for search, pipeline, and Redis cache services

This will update automatically on new commits.


@github-actions
Copy link
Contributor

TypeScript types have been updated based on the JSON schema changes in the PR

@gitar-bot
Copy link

gitar-bot bot commented Jan 26, 2026

🔍 CI failure analysis for 3ef0e36: Nine CI jobs failed: SearchInsights (strict mode) and AWS credential tests (RELATED - consistent across 2 Maven jobs) need fixes; UserResourceIT, Python tests, Playwright flaky tests, and AppsResourceTest (UNRELATED).

Issue

Nine CI jobs failed across different test suites:

  1. integration-tests-postgres-opensearch (job 61444540295) - 1 test failure
  2. integration-tests-mysql-elasticsearch (job 61444540272) - 1 test failure
  3. playwright-ci-postgresql (6, 6) (job 61444540146) - 2 test failures
  4. playwright-ci-postgresql (2, 6) (job 61444540147) - 2 test failures
  5. playwright-ci-postgresql (4, 6) (job 61444540185) - 2 test failures
  6. py-run-tests (3.11) (job 61444540131) - 2 test failures
  7. py-run-tests (3.10) (jobs 61444540033 and 61444540132) - 2 test failures (identical)
  8. maven-postgresql-ci (job 61444540268) - 4 test failures
  9. maven-sonarcloud-ci (job 61444705510) - 4 test failures (identical to maven-postgresql-ci)

Root Cause Analysis

RELATED TO PR - SearchInsights Playwright Tests

The PR introduces new SearchInsights UI components with Playwright strict mode violations:

Failures in job 61444540146:

  • getByText('57') resolved to 4 elements - duplicate text elements
  • getByText('Orphan Indexes') resolved to 3 elements - text appears multiple times

The fix is: Add unique data-testid attributes to SearchInsights components or use more specific selectors.

RELATED TO PR - AWS Credential Tests

The PR adds an enabled field to AWS IAM authentication configuration, causing 3 test errors in AwsCredentialsUtilTest (consistent across both Maven jobs):

Failures in jobs 61444540268 and 61444705510:

  • testBuildCredentialsProviderWithEmptyCredentials:77
  • testBuildCredentialsProviderWithNoCredentials:52
  • testBuildCredentialsProviderWithOnlyAccessKey:64

Error: IllegalArgumentException: AWS credentials not configured. Either provide accessKeyId/secretAccessKey or set enabled=true to use IAM authentication via the default credential provider chain.

Why this is related:

  • PR adds enabled field to AWS config:
    aws:
      enabled: ${SEARCH_AWS_IAM_AUTH_ENABLED:-false}
      region: ${AWS_DEFAULT_REGION:-""}
      accessKeyId: ${AWS_ACCESS_KEY_ID:-""}
  • New validation logic requires either valid credentials OR enabled=true
  • Tests expect previous behavior where empty credentials were silently ignored
  • Identical failures in both maven-postgresql-ci and maven-sonarcloud-ci confirm this is consistent

The fix is: Update test expectations to account for the new enabled field validation, or adjust the validation logic to maintain backward compatibility.

UNRELATED TO PR - Integration Tests

Failed Test: UserResourceIT.test_listUsersWithAdminFilter:629

  • Error: Non-admin user should be in filtered list ==> expected: <true> but was: <false>
  • Occurs in: Both PostgreSQL+OpenSearch and MySQL+Elasticsearch
  • Statistics: 8,499 tests, 8,458 passed, 1 failed (99.99% pass rate)

Why this is unrelated: PR does not modify user management or admin filtering logic.

UNRELATED TO PR - Python Integration Tests

Failed Tests (consistent across Python 3.10 and 3.11):

  1. test_it_returns_the_expected_classifications - SpacyRecognizer ML model not working in CI
  2. test_s3_ingestion - AWS STS API version mismatch and Minio testcontainer issues

Why these are unrelated: PR does not modify Python ingestion, auto-classification, or S3 connectors.

UNRELATED TO PR - Playwright Flaky Tests

3 Playwright jobs with environmental timing issues, element visibility, and browser state problems.

UNRELATED TO PR - Maven Flaky Test

AppsResourceTest.post_trigger_app_200:369 - Max retries exceeded polling for eventual assert

  • Occurs in both maven-postgresql-ci and maven-sonarcloud-ci (2 runs, identical failure)
  • Timing/polling issue unrelated to search indexing or AWS config changes
  • Test statistics: 7,804 tests run, 701 skipped, 4 failed (99.95% pass rate)

Summary

  • 2 failure categories are PR-related:
    • SearchInsights strict mode violations (needs data-testid attributes)
    • AWS credential validation tests (need update for new enabled field) - consistent across 2 Maven jobs
  • 4 failure categories are unrelated:
    • UserResourceIT admin filtering test
    • Python flaky tests (SpacyRecognizer ML, S3/Minio testcontainer)
    • Playwright environmental flakiness
    • AppsResourceTest polling timeout - consistent across 2 Maven jobs
Code Review ⚠️ Changes requested 6 resolved / 8 findings

PR adds search insights features including stats endpoints, failure tracking, and orphan index management. Two issues remain: integer overflow in stats calculation and missing authorization on getFailures endpoint.

⚠️ Bug: Integer overflow when casting long to int for documents/size stats

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1438 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1441 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1464 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1473 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java:1474

The SearchResource.getSearchStats() method casts long values to int for document counts and size statistics. This will cause integer overflow for large indices:

Affected locations:

  • Line 1438: indexStat.setDocuments((int) stats.documents());
  • Line 1441: indexStat.setSizeInBytes((int) stats.sizeInBytes());
  • Line 1464: orphan.setSizeInBytes((int) size);
  • Line 1473: response.setTotalDocuments((int) totalDocs);
  • Line 1474: response.setTotalSizeInBytes((int) totalSize);

Impact: For production clusters with:

  • More than 2.1 billion documents (Integer.MAX_VALUE), document counts will wrap to negative values
  • Indices larger than 2GB will report incorrect sizes

The JSON schema already defines these fields as int64, so the issue is in the Java code casting.

Suggested fix: Use Long type setters or change the generated Java classes to use long for these fields.

💡 Security: Missing authorization check on getFailures endpoint

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchReindexResource.java:74

The SearchReindexResource.getFailures() endpoint does not call any authorization method despite having the Authorizer injected. This endpoint exposes internal failure details including entity IDs, error messages, and stack traces, which could reveal sensitive information about the system internals.

Affected code:

public ReindexFailuresResponse getFailures(
    @Context SecurityContext securityContext,
    ...

Note that the securityContext is received but not used for authorization.

Suggested fix: Add authorization check similar to other admin endpoints:

authorizer.authorizeAdminOrBot(securityContext);
✅ 6 resolved
Edge Case: ConcurrentHashMap not bounded - potential memory growth

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java:405 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/OpenSearchBulkSink.java:398
In ElasticSearchBulkSink.CustomBulkProcessor and OpenSearchBulkSink.CustomBulkProcessor, a ConcurrentHashMap<String, String> docIdToEntityType is used to track entity types for failure reporting. While entries are removed on success/failure, there are scenarios where entries may accumulate:

  1. If a bulk operation fails catastrophically before the response handler runs, entries remain
  2. If the bulk processor is closed mid-operation, pending entries are not cleaned up

For long-running indexing jobs with failures, this map could grow unbounded.

Suggested fix: Consider:

  • Adding a bounded cache (e.g., Caffeine with max size)
  • Or adding explicit cleanup in the close() method
  • Or documenting the expected behavior and maximum size constraints
Bug: Integer overflow when casting long to int for document/size stats

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/search/SearchResource.java
The SearchResource.getSearchStats() method casts long values to int in multiple places, which will cause integer overflow and incorrect values for large search indices:

  1. response.setTotalDocuments((int) totalDocs) - Total document counts can easily exceed 2 billion
  2. response.setTotalSizeInBytes((int) totalSize) - Total size in bytes will overflow at ~2GB
  3. indexStat.setDocuments((int) stats.documents()) - Per-index document counts can overflow
  4. indexStat.setSizeInBytes((int) stats.sizeInBytes()) - Per-index size can overflow
  5. orphan.setSizeInBytes((int) size) - Orphan index sizes can overflow

The JSON schema correctly defines these fields as int64 (long), but the Java code incorrectly downcasts them. This will cause negative numbers or incorrect values to be returned for production deployments with large indices.

Suggested fix: Remove the (int) casts and ensure the generated SearchStatsResponse classes use Long types for these fields. The schema already specifies "format": "int64", so the generated classes should support long values.

Bug: OpenSearchIndexManager always reports health as "GREEN"

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchIndexManager.java
The getAllIndexStats() method in OpenSearchIndexManager hardcodes the health status as "GREEN" instead of getting the actual health from the cluster like ElasticSearchIndexManager does.

In ElasticSearchIndexManager:

String health = stats.health() != null ? stats.health().name().toUpperCase() : "UNKNOWN";

In OpenSearchIndexManager:

String health = "GREEN";  // Always hardcoded!

This means the Search Insights dashboard will always show all OpenSearch indexes as healthy even when they may be in YELLOW or RED state, defeating the purpose of the health monitoring feature.

Suggested fix: Retrieve actual health status from the OpenSearch client similar to how it's done for Elasticsearch.

Bug: Missing isSearchIndexingRunning check in frontend before cleanup

📄 openmetadata-ui/src/main/resources/ui/src/rest/searchAPI.ts 📄 openmetadata-ui/src/main/resources/ui/src/pages/SearchInsightsPage/SearchInsightsPage.tsx
The backend endpoint DELETE /search/stats/orphan properly checks if search indexing is running and returns a 409 Conflict status, but the frontend SearchInsightsPage doesn't fetch or display the isSearchIndexingRunning property from the stats response.

The SearchStatsResponse interface includes isSearchIndexingRunning but:

  1. The TypeScript interface in searchAPI.ts doesn't include this field
  2. The UI doesn't disable the cleanup button when indexing is in progress
  3. Users may see confusing error messages when trying to clean during indexing

Impact: Users might attempt cleanup during active indexing and receive an error, causing confusion.

Suggested fix:

  1. Add isSearchIndexingRunning: boolean to the SearchStatsResponse interface
  2. Disable the "Clean Orphan Indexes" button when indexing is running
  3. Show a tooltip explaining why cleanup is disabled
Edge Case: PartitionResult missing warningsCount in stopped case

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/PartitionWorker.java
In PartitionWorker.java, when a partition is stopped by request (e.g., during shutdown), the exception handler creates a PartitionResult without passing the warningsCount:

} catch (Exception e) {
  LOG.error("Error processing partition {}", partition.getId(), e);
  coordinator.failPartition(partition.getId(), e.getMessage());
  return new PartitionResult(successCount.get(), failedCount.get(), true, 0);
}

The last parameter 0 for readerFailed is passed, but warningsCount is missing (the constructor now takes 5 parameters). This appears to be incorrect as the exception handling path creates a 4-parameter result while the new signature requires 5 parameters.

Looking more carefully at the diff, it seems the stopped case does pass warningsCount.get() correctly. However, the exception handler on line ~229 in the truncated portion may still be missing the warnings count. This needs verification.

Suggested fix: Ensure all PartitionResult constructor calls include the warningsCount parameter.

...and 1 more resolved from earlier reviews

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants