Skip to content

feat: Add setting to force enable/disable text index support for a source#2277

Merged
pulpdrew merged 1 commit into
mainfrom
drew/force-enable-text-index
May 20, 2026
Merged

feat: Add setting to force enable/disable text index support for a source#2277
pulpdrew merged 1 commit into
mainfrom
drew/force-enable-text-index

Conversation

@pulpdrew
Copy link
Copy Markdown
Contributor

@pulpdrew pulpdrew commented May 14, 2026

Summary

This PR adds a new source setting which allows for force-enabling or force-disabling text index support (generation of hasAllToken conditions for lucene implicit column search) for the source.

This is useful when

  1. A text index exists on the implicit column but HyperDX can't detect it due to a non-standard schema setup (typically one that includes the merge table engine). Force-enable handles this case.
  2. A text index is detected but shouldn't be used yet (maybe it isn't yet populated or is yielding undesirable results for some queries). Force-disable handles this case.

The default and recommended behaviour is still auto - where the index is detected at query time.

Screenshots or video

Merge engine based schema
CREATE TABLE otel_logs_local
(
    `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `TraceId` String CODEC(ZSTD(1)),
    `SpanId` String CODEC(ZSTD(1)),
    `TraceFlags` UInt32 CODEC(ZSTD(1)),
    `SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
    `SeverityNumber` Int32 CODEC(ZSTD(1)),
    `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
    `Body` String CODEC(ZSTD(1)),
    `ResourceSchemaUrl` String CODEC(ZSTD(1)),
    `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `ScopeSchemaUrl` String CODEC(ZSTD(1)),
    `ScopeName` String CODEC(ZSTD(1)),
    `ScopeVersion` String CODEC(ZSTD(1)),
    `ScopeAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_key mapKeys(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_value mapValues(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_lower_body lower(Body) TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 8,
    INDEX idx_ngram_body lower(Body) TYPE ngrambf_v1(3, 32768, 3, 0) GRANULARITY 8,
    INDEX fts_body Body TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lowerUTF8(Body)) GRANULARITY 100000000
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SeverityText, toUnixTimestamp(Timestamp), TraceId)
TTL toDateTime(Timestamp) + toIntervalDay(30)
SETTINGS ttl_only_drop_parts = 1, index_granularity = 8192;

CREATE TABLE otel_logs_merged (
  `Timestamp` DateTime64(9),
  `TraceId` String,
  `SpanId` String,
  `TraceFlags` UInt32,
  `SeverityText` LowCardinality(String),
  `SeverityNumber` Int32,
  `ServiceName` LowCardinality(String),
  `Body` String,
  `ResourceSchemaUrl` String,
  `ResourceAttributes` Map(LowCardinality(String), String),
  `ScopeSchemaUrl` String,
  `ScopeName` String,
  `ScopeVersion` String,
  `ScopeAttributes` Map(LowCardinality(String), String),
  `LogAttributes` Map(LowCardinality(String), String)
) ENGINE = Merge ('default', '^otel_logs_local(_old)?$');

CREATE TABLE otel_logs_merged_distributed (
  `Timestamp` DateTime64(9) CODEC (Delta (8), ZSTD (1)),
  `TraceId` String CODEC (ZSTD (1)),
  `SpanId` String CODEC (ZSTD (1)),
  `TraceFlags` UInt32 CODEC (ZSTD (1)),
  `SeverityText` LowCardinality(String) CODEC (ZSTD (1)),
  `SeverityNumber` Int32 CODEC (ZSTD (1)),
  `ServiceName` LowCardinality(String) CODEC (ZSTD (1)),
  `Body` String CODEC (ZSTD (1)),
  `ResourceSchemaUrl` String CODEC (ZSTD (1)),
  `ResourceAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1)),
  `ScopeSchemaUrl` String CODEC (ZSTD (1)),
  `ScopeName` String CODEC (ZSTD (1)),
  `ScopeVersion` String CODEC (ZSTD (1)),
  `ScopeAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1)),
  `LogAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1))
) ENGINE = Distributed (
  'hdx_cluster',
  'default',
  'otel_logs_merged',
  rand()
);

With this schema, using the default (auto) setting will result in hasToken(lower(Body), token) conditions which don't use the text index effectively due to the lower() function:

Screenshot 2026-05-14 at 10 43 05 AM

When force-enabled, the same query will now use the correct hasAllToken syntax:

Screenshot 2026-05-14 at 10 43 50 AM Screenshot 2026-05-14 at 10 42 41 AM

How to test on Vercel preview

This can be tested locally by creating a schema similar to the one provided above.

References

  • Linear Issue: Closes HDX-4132 Closes HDX-2178
  • Related PRs:

@vercel
Copy link
Copy Markdown

vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 20, 2026 12:26pm

Request Review

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 14, 2026

🦋 Changeset detected

Latest commit: 4d1ece1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 5 packages
Name Type
@hyperdx/common-utils Patch
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/cli Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (3):
    • packages/api/src/routers/external-api/v2/sources.ts
    • packages/api/src/tasks/checkAlerts/index.ts
    • packages/api/src/tasks/checkAlerts/template.ts
  • Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 22
  • Production lines changed: 296 (+ 289 in test files, excluded from tier calculation)
  • Branch: drew/force-enable-text-index
  • Author: pulpdrew

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Review

  • ⚠️ Form help text is slightly misleading → The UseTextIndexFormRow help text says "Force disable" falls back to hasToken(), but the implementation (and the Disabled still permits a bloom_filter tokens() index path test) shows Disabled only suppresses the text-index branch — it still uses the bloom_filter tokens() path if one exists. Either tighten the help text to reflect this, or rename the field to convey "text-index only" semantics. Not a bug, but users will assume Disabled means "pure hasToken/LIKE."

  • ⚠️ Enabled bypasses tokenizer detection → In queryParser.ts:1276-1277, Enabled always emits hasAllTokens() without checking parseTokenizerFromTextIndex(...)?.type === 'splitByNonAlpha'. If a user force-enables against a table whose text index uses a non-splitByNonAlpha tokenizer (e.g. ngrams), the emitted SQL will tokenize incorrectly and silently miss matches. Consider either (a) gating force-enable on tokenizer detection succeeding when an index is present and only skipping the check in the "no detectable index" merge-engine case, or (b) documenting this caveat in the form help text.

  • ℹ️ External API surface bound to schema parseformatExternalSource (packages/api/src/routers/external-api/v2/sources.ts:69) relies on SourceSchema to forward the field, which works because LogSourceSchema/TraceSourceSchema were updated in common-utils/src/types.ts. No action needed — flagging because future schema-drift here would silently drop the field from the external API response and the new sources.test.ts coverage is only against the internal /sources route, not the external v2 route.

  • ✅ Mongoose enum: Object.values(UseTextIndex) correctly rejects bad values; tests confirm 400 response.

  • ✅ Default auto semantics preserved end-to-end (undefined → UseTextIndex.Auto in the serializer constructor).

  • ✅ Field plumbed consistently through all chart-config builders, alerts, MCP tools, CLI, and dashboard tiles.

  • ✅ Good test coverage at both the SQL rendering layer (queryParser.test.ts) and the persistence layer (sources.test.ts).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

Deep Review

✅ No critical issues found. The diff is additive (a new optional enum field on LogSource/TraceSource with Auto as the implicit default), and existing sources without the field continue to behave exactly as before. Recommendations below focus on doc accuracy, contract consistency, and test coverage for the new branches.

🟡 P2 -- recommended

  • packages/app/src/components/Sources/SourceForm.tsx:1162 -- Help text and OpenAPI both claim Force disable falls back to hasToken()/LIKE, but the Disabled branch in queryParser.ts still falls through to the bloom_filter hasAll(tokens(...)) path (verified by the new "Disabled still permits a bloom_filter tokens() index path" test).

    • Fix: Update the form helpText and the JSDoc/openapi.json descriptions for both Log and Trace sources to state that disabled skips only the text-index hasAllTokens branch; a bloom_filter tokens() index is still used when present.
    • maintainability, api-contract
  • packages/api/openapi.json:3033 -- useTextIndexForImplicitColumn is declared nullable: true, but UseTextIndexSchema.optional() and the Mongoose enum: Object.values(UseTextIndex) both reject null; a generated client following the spec and sending "useTextIndexForImplicitColumn": null receives a 400.

    • Fix: Remove nullable: true from the new entries in openapi.json and the matching JSDoc in packages/api/src/routers/external-api/v2/sources.ts (the field is already correctly omittable since it is optional).
    • api-contract, adversarial
  • packages/common-utils/src/queryParser.ts:1276 -- Enabled mode emits hasAllTokens(column, term) with splitByNonAlpha-style JS tokenization regardless of the actual index tokenizer, so forcing enable against a text index that uses ngram/default/splitByWhitespace can return silently incorrect (often zero) results.

    • Fix: Either document the splitByNonAlpha assumption in the Force enable help text and OpenAPI description, or detect a non-splitByNonAlpha tokenizer (when an index is visible) and surface a UI warning before persisting Enabled.
    • adversarial, correctness
  • packages/common-utils/src/core/searchChartConfig.ts:145 -- The same isLogSource(source) || isTraceSource(source) ? source.useTextIndexForImplicitColumn : undefined block is repeated at 10+ call sites (DBDashboardPage.tsx:520, ServicesDashboardPage.tsx:39/450/465, ChartEditor/utils.ts:139/166, hooks/useDashboardFilterValues.tsx:95, tasks/checkAlerts/index.ts:137/594, mcp/tools/query/helpers.ts:198/273, runEventPatterns.ts:142); any future call site that forgets the field will silently fall back to Auto and ignore the operator's override.

    • Fix: Extract a pickImplicitColumnProps(source) helper alongside the existing pickSampleWeightExpressionProps in packages/common-utils/src/types.ts returning { implicitColumnExpression?, useTextIndexForImplicitColumn? }, and spread it at the call sites.
    • maintainability, adversarial
  • packages/common-utils/src/__tests__/queryParser.test.ts:1170 -- The new useTextIndexForImplicitColumn source preference block does not cover Enabled with a negated implicit field (NOT foo), Enabled with prefix/suffix wildcards (which must fall through to LIKE, not hasAllTokens), or Enabled with a >HAS_ALL_TOKENS_CHUNK_SIZE-token term (chunk-boundary path), all of which are new code paths in this refactor.

    • Fix: Add three unit tests asserting the negation wrap, the wildcard precedence over hasAllTokens, and the multi-batch hasAllTokens emission under Enabled.
    • testing, adversarial
  • packages/common-utils/src/core/renderChartConfig.ts:833 -- Unit tests construct CustomSchemaSQLSerializerV2 directly; no integration test confirms that useTextIndexForImplicitColumn actually flows from a TSource through buildSearchChartConfig/renderChartConfig (where it is threaded through 5+ renderWhereExpression/renderHaving/renderFiltersToSql call sites) into the rendered SQL.

    • Fix: Add a renderChartConfig integration test that constructs a chart config from a LogSource with useTextIndexForImplicitColumn: Enabled and asserts hasAllTokens appears in the produced SQL, plus a parallel test for Disabled.
    • testing, maintainability
  • packages/api/src/mcp/tools/dashboards/listSources.ts:73 -- The new field is consumed by MCP query helpers but is not surfaced in the hyperdx_list_sources meta object or the CLI hdx sources --json output (declared "for LLM / agent consumption" in packages/cli/src/cli.tsx:749), so agents cannot observe a setting that affects every Lucene query they issue.

    • Fix: Add useTextIndexForImplicitColumn: s.useTextIndexForImplicitColumn to the Log and Trace branches in listSources.ts:73-88, and include the field under expressions in the CLI --json source serializer in cli.tsx.
    • agent-native
🔵 P3 nitpicks (7)
  • packages/app/src/components/Sources/SourceForm.tsx:1180 -- SelectControlled uses placeholder='Auto (detect from schema)' with no defaultValue, so for pre-existing sources the dropdown displays "Auto" while the underlying RHF value is undefined; explicitly picking "Auto" persists the literal string 'auto', creating two storage states with identical UI but distinguishable wire output.

    • Fix: Seed defaultValues.useTextIndexForImplicitColumn = UseTextIndex.Auto in the source form initializer (or normalize undefined to Auto on read) so the UI and storage agree.
  • packages/api/src/mcp/tools/query/helpers.ts:198 -- 'useTextIndexForImplicitColumn' in source and 'implicitColumnExpression' in source are used at MCP sites while every other file uses isLogSource(source) || isTraceSource(source); the in form is more fragile if another source kind ever gains the field.

    • Fix: Switch the two MCP sites to the type-guard form (or replace both with the pickImplicitColumnProps helper proposed above).
  • packages/common-utils/src/queryParser.ts:1276 -- Enabled skips the enable_full_text_index = 1 check, so on ClickHouse versions where the experimental setting is off, every lucene implicit-column search against that source errors at runtime.

    • Fix: Note this caveat in the Force enable help text so operators understand the trade-off before flipping the setting.
    • correctness, adversarial
  • packages/common-utils/src/queryParser.ts:1093 -- enableTextIndexPromise and skipIndicesPromise are kicked off in the constructor unconditionally, so even Enabled (which never awaits them) still triggers metadata round-trips.

    • Fix: Gate the getSetting('enable_full_text_index') pre-fetch on useTextIndexForImplicitColumn === Auto; skipIndicesPromise is still needed for the bloom_filter fall-through under Disabled.
  • packages/api/src/routers/api/__tests__/sources.test.ts:807 -- The invalid-enum test exercises POST with 'maybe' but does not mirror the assertion on PUT, and there is no negative test confirming the field is rejected on Session/Metric source POST/PUT (since LogSourceSchema/TraceSourceSchema are the only schemas with the field).

    • Fix: Add a PUT-with-invalid-value test and a POST-on-Session-source-with-useTextIndexForImplicitColumn rejection test.
  • packages/common-utils/src/__tests__/queryParser.test.ts:1170 -- No Auto test exercises the case where a text index is detected but the tokenizer is not splitByNonAlpha (e.g. array, splitByWhitespace), which is the path that falls through to bloom_filter.

    • Fix: Add a unit test under the new describe block pinning the fall-through behavior so a future tokenizer-handling refactor cannot silently regress it.
  • packages/app/src/components/Sources/SourceForm.tsx:1148 -- UseTextIndexFormRow has no accompanying form-render test verifying it appears on Log and Trace source forms but not on Session or Metric forms.

    • Fix: Add a render-level test for SourceForm covering the per-SourceKind form-row visibility.

Reviewers (10): correctness, testing, maintainability, project-standards, performance, api-contract, adversarial, kieran-typescript, agent-native, learnings-researcher.

Testing gaps:

  • No end-to-end coverage that useTextIndexForImplicitColumn flows through renderChartConfig/buildSearchChartConfig into rendered SQL.
  • New Enabled branch is untested for negation, wildcards, multi-batch chunking, and non-splitByNonAlpha index targets.
  • No checkAlerts/MCP runConfigTile/runEventPatterns/external-API plumbing tests assert the field is forwarded.
  • No UI test asserts the new form row renders for Log/Trace only.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

E2E Test Results

All tests passed • 179 passed • 3 skipped • 1279s

Status Count
✅ Passed 179
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@pulpdrew pulpdrew force-pushed the drew/force-enable-text-index branch from 583106b to 4d1ece1 Compare May 14, 2026 14:58
@pulpdrew pulpdrew requested review from a team, dhable and fleon and removed request for a team and fleon May 14, 2026 15:58
Comment on lines +450 to +453
useTextIndexForImplicitColumn:
isLogSource(source) || isTraceSource(source)
? source.useTextIndexForImplicitColumn
: undefined,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why we have to pass around useTextIndexForImplicitColumn everywhere, but we really ought to come up with a more scalable solution.

Potential idea:

  1. Rendering functions have sources passed down to them
  2. For charts querying a raw table, we define rules for each of these settings and when to apply them. ex: apply useTextIndexForImplicitColumn if FROM matches source.from.tableName and current WHERE node is
  3. For MV optimizations we apply settings on a case by case basis, depending on the MV definition and if it makes sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! The same applies to timestampValueExpression and implicitColumnExpression, and probably some others like source filters in the EE repo. I think passing source down to the rendering functions is probably the way to go - I will ticket that up as a tech debt item if that's alright with you (since this change in particular is following the pattern established by implicitColumnExpression)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HDX-4290

@pulpdrew pulpdrew merged commit 8810ff0 into main May 20, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants