feat: Add setting to force enable/disable text index support for a source by pulpdrew · Pull Request #2277 · hyperdxio/hyperdx

pulpdrew · 2026-05-14T14:45:17Z

Summary

This PR adds a new source setting which allows for force-enabling or force-disabling text index support (generation of hasAllToken conditions for lucene implicit column search) for the source.

This is useful when

A text index exists on the implicit column but HyperDX can't detect it due to a non-standard schema setup (typically one that includes the merge table engine). Force-enable handles this case.
A text index is detected but shouldn't be used yet (maybe it isn't yet populated or is yielding undesirable results for some queries). Force-disable handles this case.

The default and recommended behaviour is still auto - where the index is detected at query time.

Screenshots or video

Merge engine based schema

CREATE TABLE otel_logs_local
(
    `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `TraceId` String CODEC(ZSTD(1)),
    `SpanId` String CODEC(ZSTD(1)),
    `TraceFlags` UInt32 CODEC(ZSTD(1)),
    `SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
    `SeverityNumber` Int32 CODEC(ZSTD(1)),
    `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
    `Body` String CODEC(ZSTD(1)),
    `ResourceSchemaUrl` String CODEC(ZSTD(1)),
    `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `ScopeSchemaUrl` String CODEC(ZSTD(1)),
    `ScopeName` String CODEC(ZSTD(1)),
    `ScopeVersion` String CODEC(ZSTD(1)),
    `ScopeAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_key mapKeys(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_value mapValues(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_lower_body lower(Body) TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 8,
    INDEX idx_ngram_body lower(Body) TYPE ngrambf_v1(3, 32768, 3, 0) GRANULARITY 8,
    INDEX fts_body Body TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lowerUTF8(Body)) GRANULARITY 100000000
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SeverityText, toUnixTimestamp(Timestamp), TraceId)
TTL toDateTime(Timestamp) + toIntervalDay(30)
SETTINGS ttl_only_drop_parts = 1, index_granularity = 8192;

CREATE TABLE otel_logs_merged (
  `Timestamp` DateTime64(9),
  `TraceId` String,
  `SpanId` String,
  `TraceFlags` UInt32,
  `SeverityText` LowCardinality(String),
  `SeverityNumber` Int32,
  `ServiceName` LowCardinality(String),
  `Body` String,
  `ResourceSchemaUrl` String,
  `ResourceAttributes` Map(LowCardinality(String), String),
  `ScopeSchemaUrl` String,
  `ScopeName` String,
  `ScopeVersion` String,
  `ScopeAttributes` Map(LowCardinality(String), String),
  `LogAttributes` Map(LowCardinality(String), String)
) ENGINE = Merge ('default', '^otel_logs_local(_old)?$');

CREATE TABLE otel_logs_merged_distributed (
  `Timestamp` DateTime64(9) CODEC (Delta (8), ZSTD (1)),
  `TraceId` String CODEC (ZSTD (1)),
  `SpanId` String CODEC (ZSTD (1)),
  `TraceFlags` UInt32 CODEC (ZSTD (1)),
  `SeverityText` LowCardinality(String) CODEC (ZSTD (1)),
  `SeverityNumber` Int32 CODEC (ZSTD (1)),
  `ServiceName` LowCardinality(String) CODEC (ZSTD (1)),
  `Body` String CODEC (ZSTD (1)),
  `ResourceSchemaUrl` String CODEC (ZSTD (1)),
  `ResourceAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1)),
  `ScopeSchemaUrl` String CODEC (ZSTD (1)),
  `ScopeName` String CODEC (ZSTD (1)),
  `ScopeVersion` String CODEC (ZSTD (1)),
  `ScopeAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1)),
  `LogAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1))
) ENGINE = Distributed (
  'hdx_cluster',
  'default',
  'otel_logs_merged',
  rand()
);

With this schema, using the default (auto) setting will result in hasToken(lower(Body), token) conditions which don't use the text index effectively due to the lower() function:

When force-enabled, the same query will now use the correct hasAllToken syntax:

How to test on Vercel preview

This can be tested locally by creating a schema similar to the one provided above.

References

Linear Issue: Closes HDX-4132 Closes HDX-2178
Related PRs:

vercel · 2026-05-14T14:45:23Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ready	Preview, Comment	May 20, 2026 12:26pm

changeset-bot · 2026-05-14T14:45:23Z

🦋 Changeset detected

Latest commit: 4d1ece1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 5 packages

Name	Type
@hyperdx/common-utils	Patch
@hyperdx/api	Patch
@hyperdx/app	Patch
@hyperdx/cli	Patch
@hyperdx/otel-collector	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

github-actions · 2026-05-14T14:45:45Z

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

Critical-path files (3):
- packages/api/src/routers/external-api/v2/sources.ts
- packages/api/src/tasks/checkAlerts/index.ts
- packages/api/src/tasks/checkAlerts/template.ts
Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats

Production files changed: 22
Production lines changed: 296 (+ 289 in test files, excluded from tier calculation)
Branch: drew/force-enable-text-index
Author: pulpdrew

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

github-actions · 2026-05-14T14:48:01Z

PR Review

⚠️ Form help text is slightly misleading → The UseTextIndexFormRow help text says "Force disable" falls back to hasToken(), but the implementation (and the Disabled still permits a bloom_filter tokens() index path test) shows Disabled only suppresses the text-index branch — it still uses the bloom_filter tokens() path if one exists. Either tighten the help text to reflect this, or rename the field to convey "text-index only" semantics. Not a bug, but users will assume Disabled means "pure hasToken/LIKE."
⚠️ Enabled bypasses tokenizer detection → In queryParser.ts:1276-1277, Enabled always emits hasAllTokens() without checking parseTokenizerFromTextIndex(...)?.type === 'splitByNonAlpha'. If a user force-enables against a table whose text index uses a non-splitByNonAlpha tokenizer (e.g. ngrams), the emitted SQL will tokenize incorrectly and silently miss matches. Consider either (a) gating force-enable on tokenizer detection succeeding when an index is present and only skipping the check in the "no detectable index" merge-engine case, or (b) documenting this caveat in the form help text.
ℹ️ External API surface bound to schema parse → formatExternalSource (packages/api/src/routers/external-api/v2/sources.ts:69) relies on SourceSchema to forward the field, which works because LogSourceSchema/TraceSourceSchema were updated in common-utils/src/types.ts. No action needed — flagging because future schema-drift here would silently drop the field from the external API response and the new sources.test.ts coverage is only against the internal /sources route, not the external v2 route.
✅ Mongoose enum: Object.values(UseTextIndex) correctly rejects bad values; tests confirm 400 response.
✅ Default auto semantics preserved end-to-end (undefined → UseTextIndex.Auto in the serializer constructor).
✅ Field plumbed consistently through all chart-config builders, alerts, MCP tools, CLI, and dashboard tiles.
✅ Good test coverage at both the SQL rendering layer (queryParser.test.ts) and the persistence layer (sources.test.ts).

github-actions · 2026-05-14T14:55:17Z

Deep Review

✅ No critical issues found. The diff is additive (a new optional enum field on LogSource/TraceSource with Auto as the implicit default), and existing sources without the field continue to behave exactly as before. Recommendations below focus on doc accuracy, contract consistency, and test coverage for the new branches.

🟡 P2 -- recommended

packages/app/src/components/Sources/SourceForm.tsx:1162 -- Help text and OpenAPI both claim Force disable falls back to hasToken()/LIKE, but the Disabled branch in queryParser.ts still falls through to the bloom_filter hasAll(tokens(...)) path (verified by the new "Disabled still permits a bloom_filter tokens() index path" test).
- Fix: Update the form helpText and the JSDoc/openapi.json descriptions for both Log and Trace sources to state that disabled skips only the text-index hasAllTokens branch; a bloom_filter tokens() index is still used when present.
- _{maintainability, api-contract}
packages/api/openapi.json:3033 -- useTextIndexForImplicitColumn is declared nullable: true, but UseTextIndexSchema.optional() and the Mongoose enum: Object.values(UseTextIndex) both reject null; a generated client following the spec and sending "useTextIndexForImplicitColumn": null receives a 400.
- Fix: Remove nullable: true from the new entries in openapi.json and the matching JSDoc in packages/api/src/routers/external-api/v2/sources.ts (the field is already correctly omittable since it is optional).
- _{api-contract, adversarial}
packages/common-utils/src/queryParser.ts:1276 -- Enabled mode emits hasAllTokens(column, term) with splitByNonAlpha-style JS tokenization regardless of the actual index tokenizer, so forcing enable against a text index that uses ngram/default/splitByWhitespace can return silently incorrect (often zero) results.
- Fix: Either document the splitByNonAlpha assumption in the Force enable help text and OpenAPI description, or detect a non-splitByNonAlpha tokenizer (when an index is visible) and surface a UI warning before persisting Enabled.
- _{adversarial, correctness}
packages/common-utils/src/core/searchChartConfig.ts:145 -- The same isLogSource(source) || isTraceSource(source) ? source.useTextIndexForImplicitColumn : undefined block is repeated at 10+ call sites (DBDashboardPage.tsx:520, ServicesDashboardPage.tsx:39/450/465, ChartEditor/utils.ts:139/166, hooks/useDashboardFilterValues.tsx:95, tasks/checkAlerts/index.ts:137/594, mcp/tools/query/helpers.ts:198/273, runEventPatterns.ts:142); any future call site that forgets the field will silently fall back to Auto and ignore the operator's override.
- Fix: Extract a pickImplicitColumnProps(source) helper alongside the existing pickSampleWeightExpressionProps in packages/common-utils/src/types.ts returning { implicitColumnExpression?, useTextIndexForImplicitColumn? }, and spread it at the call sites.
- _{maintainability, adversarial}
packages/common-utils/src/__tests__/queryParser.test.ts:1170 -- The new useTextIndexForImplicitColumn source preference block does not cover Enabled with a negated implicit field (NOT foo), Enabled with prefix/suffix wildcards (which must fall through to LIKE, not hasAllTokens), or Enabled with a >HAS_ALL_TOKENS_CHUNK_SIZE-token term (chunk-boundary path), all of which are new code paths in this refactor.
- Fix: Add three unit tests asserting the negation wrap, the wildcard precedence over hasAllTokens, and the multi-batch hasAllTokens emission under Enabled.
- _{testing, adversarial}
packages/common-utils/src/core/renderChartConfig.ts:833 -- Unit tests construct CustomSchemaSQLSerializerV2 directly; no integration test confirms that useTextIndexForImplicitColumn actually flows from a TSource through buildSearchChartConfig/renderChartConfig (where it is threaded through 5+ renderWhereExpression/renderHaving/renderFiltersToSql call sites) into the rendered SQL.
- Fix: Add a renderChartConfig integration test that constructs a chart config from a LogSource with useTextIndexForImplicitColumn: Enabled and asserts hasAllTokens appears in the produced SQL, plus a parallel test for Disabled.
- _{testing, maintainability}
packages/api/src/mcp/tools/dashboards/listSources.ts:73 -- The new field is consumed by MCP query helpers but is not surfaced in the hyperdx_list_sources meta object or the CLI hdx sources --json output (declared "for LLM / agent consumption" in packages/cli/src/cli.tsx:749), so agents cannot observe a setting that affects every Lucene query they issue.
- Fix: Add useTextIndexForImplicitColumn: s.useTextIndexForImplicitColumn to the Log and Trace branches in listSources.ts:73-88, and include the field under expressions in the CLI --json source serializer in cli.tsx.
- _agent-native

🔵 P3 nitpicks (7)

packages/app/src/components/Sources/SourceForm.tsx:1180 -- SelectControlled uses placeholder='Auto (detect from schema)' with no defaultValue, so for pre-existing sources the dropdown displays "Auto" while the underlying RHF value is undefined; explicitly picking "Auto" persists the literal string 'auto', creating two storage states with identical UI but distinguishable wire output.
- Fix: Seed defaultValues.useTextIndexForImplicitColumn = UseTextIndex.Auto in the source form initializer (or normalize undefined to Auto on read) so the UI and storage agree.
packages/api/src/mcp/tools/query/helpers.ts:198 -- 'useTextIndexForImplicitColumn' in source and 'implicitColumnExpression' in source are used at MCP sites while every other file uses isLogSource(source) || isTraceSource(source); the in form is more fragile if another source kind ever gains the field.
- Fix: Switch the two MCP sites to the type-guard form (or replace both with the pickImplicitColumnProps helper proposed above).
packages/common-utils/src/queryParser.ts:1276 -- Enabled skips the enable_full_text_index = 1 check, so on ClickHouse versions where the experimental setting is off, every lucene implicit-column search against that source errors at runtime.
- Fix: Note this caveat in the Force enable help text so operators understand the trade-off before flipping the setting.
- _{correctness, adversarial}
packages/common-utils/src/queryParser.ts:1093 -- enableTextIndexPromise and skipIndicesPromise are kicked off in the constructor unconditionally, so even Enabled (which never awaits them) still triggers metadata round-trips.
- Fix: Gate the getSetting('enable_full_text_index') pre-fetch on useTextIndexForImplicitColumn === Auto; skipIndicesPromise is still needed for the bloom_filter fall-through under Disabled.
packages/api/src/routers/api/__tests__/sources.test.ts:807 -- The invalid-enum test exercises POST with 'maybe' but does not mirror the assertion on PUT, and there is no negative test confirming the field is rejected on Session/Metric source POST/PUT (since LogSourceSchema/TraceSourceSchema are the only schemas with the field).
- Fix: Add a PUT-with-invalid-value test and a POST-on-Session-source-with-useTextIndexForImplicitColumn rejection test.
packages/common-utils/src/__tests__/queryParser.test.ts:1170 -- No Auto test exercises the case where a text index is detected but the tokenizer is not splitByNonAlpha (e.g. array, splitByWhitespace), which is the path that falls through to bloom_filter.
- Fix: Add a unit test under the new describe block pinning the fall-through behavior so a future tokenizer-handling refactor cannot silently regress it.
packages/app/src/components/Sources/SourceForm.tsx:1148 -- UseTextIndexFormRow has no accompanying form-render test verifying it appears on Log and Trace source forms but not on Session or Metric forms.
- Fix: Add a render-level test for SourceForm covering the per-SourceKind form-row visibility.

Reviewers (10): correctness, testing, maintainability, project-standards, performance, api-contract, adversarial, kieran-typescript, agent-native, learnings-researcher.

Testing gaps:

No end-to-end coverage that useTextIndexForImplicitColumn flows through renderChartConfig/buildSearchChartConfig into rendered SQL.
New Enabled branch is untested for negation, wildcards, multi-batch chunking, and non-splitByNonAlpha index targets.
No checkAlerts/MCP runConfigTile/runEventPatterns/external-API plumbing tests assert the field is forwarded.
No UI test asserts the new form row renders for Log/Trace only.

github-actions · 2026-05-14T14:55:22Z

E2E Test Results

✅ All tests passed • 179 passed • 3 skipped • 1279s

Status	Count
✅ Passed	179
❌ Failed	0
⚠️ Flaky	3
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

View full report →

…urce

knudtty · 2026-05-20T13:59:06Z

+        useTextIndexForImplicitColumn:
+          isLogSource(source) || isTraceSource(source)
+            ? source.useTextIndexForImplicitColumn
+            : undefined,


I understand why we have to pass around useTextIndexForImplicitColumn everywhere, but we really ought to come up with a more scalable solution.

Potential idea:

Rendering functions have sources passed down to them

For charts querying a raw table, we define rules for each of these settings and when to apply them. ex: apply useTextIndexForImplicitColumn if FROM matches source.from.tableName and current WHERE node is

For MV optimizations we apply settings on a case by case basis, depending on the MV definition and if it makes sense.

I agree! The same applies to timestampValueExpression and implicitColumnExpression, and probably some others like source filters in the EE repo. I think passing source down to the rendering functions is probably the way to go - I will ticket that up as a tech debt item if that's alright with you (since this change in particular is following the pattern established by implicitColumnExpression)

github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 14, 2026

pulpdrew force-pushed the drew/force-enable-text-index branch from 583106b to 4d1ece1 Compare May 14, 2026 14:58

vercel Bot deployed to Preview May 14, 2026 15:01 View deployment

pulpdrew requested review from a team, dhable and fleon and removed request for a team and fleon May 14, 2026 15:58

feat: Add setting to force enable/disable text index support for a so…

08d968c

…urce

pulpdrew force-pushed the drew/force-enable-text-index branch from 4d1ece1 to 08d968c Compare May 20, 2026 12:21

vercel Bot deployed to Preview May 20, 2026 12:26 View deployment

pulpdrew requested a review from fleon May 20, 2026 13:47

knudtty reviewed May 20, 2026

View reviewed changes

dhable approved these changes May 20, 2026

View reviewed changes

pulpdrew merged commit 8810ff0 into main May 20, 2026
26 checks passed

pulpdrew mentioned this pull request May 20, 2026

HyperDX UI do not detect FTS when merge tree is in the hierarchy #2178

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add setting to force enable/disable text index support for a source#2277

feat: Add setting to force enable/disable text index support for a source#2277
pulpdrew merged 1 commit into
mainfrom
drew/force-enable-text-index

pulpdrew commented May 14, 2026 •

edited

Loading

Uh oh!

vercel Bot commented May 14, 2026 •

edited

Loading

Uh oh!

changeset-bot Bot commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

knudtty May 20, 2026

Uh oh!

pulpdrew May 20, 2026

Uh oh!

pulpdrew May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pulpdrew commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Screenshots or video

How to test on Vercel preview

References

Uh oh!

vercel Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

changeset-bot Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔴 Tier 4 — Critical

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review

🟡 P2 -- recommended

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Test Results

Uh oh!

knudtty May 20, 2026

Choose a reason for hiding this comment

Uh oh!

pulpdrew May 20, 2026

Choose a reason for hiding this comment

Uh oh!

pulpdrew May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pulpdrew commented May 14, 2026 •

edited

Loading

vercel Bot commented May 14, 2026 •

edited

Loading

changeset-bot Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading