Add Highspot connector + fix slack-bot create button silent submit #41

Merged
rajivml merged 3 commits into feature/darwin from feature/highspot
May 6, 2026

Conversation

Collaborator

@rajivml rajivml commented May 6, 2026

Summary

Adds a new Highspot connector to Darwin and includes a small drive-by fix for the slack-bot config admin page (Create button silently doing nothing on a fresh form).

What was implemented

1. Highspot connector

Indexes Spots and the Items inside them via Highspot's REST API (https://api-su2.highspot.com/v1.0/). Auth: HTTP Basic with an API key + secret pair generated from the Highspot admin console; an optional highspot_url covers tenants on non-default Highspot regions.
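
A minimal sketch of the auth scheme described above, not the actual contents of client.py. HTTP Basic sends the base64 encoding of "key:secret" in the Authorization header. The function names and the DEFAULT_HIGHSPOT_URL constant are hypothetical; only the URL value itself comes from this PR.

```python
import base64
from typing import Optional

# Default base URL quoted in this PR; the constant name is illustrative.
DEFAULT_HIGHSPOT_URL = "https://api-su2.highspot.com/v1.0/"


def build_auth_headers(key: str, secret: str) -> dict:
    """Return an HTTP Basic Authorization header for a key + secret pair."""
    raw = f"{key}:{secret}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def resolve_base_url(highspot_url: Optional[str] = None) -> str:
    """Prefer the tenant-specific URL when given; always end with a slash."""
    base = highspot_url or DEFAULT_HIGHSPOT_URL
    return base if base.endswith("/") else base + "/"
```

Keeping the base URL optional means a credential for the default region needs only the key and secret.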

For each Item, a Document is built whose section text comes from one of three tiers:

  1. WebLink items → headless-Chromium scrape of the linked URL via Playwright. Falls back to title + description if the scrape returns empty.
  2. File items with a downloadable, supported extension (.pdf, .docx, .pptx, .xlsx, .eml, .epub, .html, .txt) → extract_file_text over the bytes returned by items/{id}/content.
  3. Else / on any error → title + "\n" + description.
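
The three tiers above can be sketched as a single dispatch function. This is an illustration under stated assumptions rather than the connector's real code: the item field names (content_type, url, content_name, can_download) and the injected scrape / download / extract callables are hypothetical, while the extension list and the (file_name, file) argument order come from this PR.

```python
from typing import Callable

# Extensions listed in this PR as supported by extract_file_text.
SUPPORTED_EXTENSIONS = (
    ".pdf", ".docx", ".pptx", ".xlsx", ".eml", ".epub", ".html", ".txt",
)


def fallback_text(item: dict) -> str:
    """Tier 3: title and description joined by a newline."""
    return f"{item.get('title', '')}\n{item.get('description', '')}"


def section_text(
    item: dict,
    scrape: Callable,
    download: Callable,
    extract: Callable,
) -> str:
    """Pick the section text for one item, falling back on any error."""
    try:
        # Tier 1: WebLink items are scraped; an empty scrape falls back.
        if item.get("content_type") == "WebLink":
            text = scrape(item["url"])
            return text if text else fallback_text(item)
        # Tier 2: downloadable files with a supported extension.
        name = item.get("content_name", "")
        if item.get("can_download") and name.lower().endswith(SUPPORTED_EXTENSIONS):
            return extract(name, download(item["id"]))  # (file_name, file) order
    except Exception:
        # Any failure in tiers 1-2 drops through to the fallback.
        pass
    return fallback_text(item)
```

Injecting the three callables keeps the tier logic testable without a browser or network access.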

Backend:

  • backend/danswer/connectors/highspot/{__init__,client,utils,connector}.py: new package.
  • backend/danswer/configs/constants.py: DocumentSource.HIGHSPOT = "highspot".
  • backend/danswer/connectors/factory.py: registered.
  • backend/danswer/server/documents/connector.py: new GET /manage/admin/connector/highspot/spots/{credential_id} route that returns the live list of Spots visible to a saved credential (powers the multi-select on the admin page).

Frontend:

  • web/src/lib/types.ts: "highspot" in ValidSources, HighspotConfig, HighspotCredentialJson.
  • web/src/components/icons/icons.tsx: HighspotIcon (placeholder asset; see TODO below).
  • web/src/lib/sources.ts: tile entry under SOURCE_METADATA_MAP.highspot.
  • web/src/app/admin/connectors/highspot/page.tsx: full Step 1 (credentials) + Step 2 (multi-select Spots + create connector) admin page. Selecting at least one Spot is mandatory; the dropdown is populated live from the new backend route using the saved credential.
  • web/public/Highspot.png: placeholder icon (see TODO).

Notable adaptations vs upstream Onyx

This fork is ~2 years behind upstream and lacks the perm-sync / slim-doc / OnyxFileExtensions / IndexingHeartbeatInterface rewrite that upstream's connector depends on. Adjustments:

  • Drops the Slim/perm-sync interface entirely; the connector is a plain LoadConnector + PollConnector.
  • Replaces upstream's TextSection with this fork's Section.
  • Replaces OnyxFileExtensions.TEXT_AND_DOCUMENT_EXTENSIONS with an inline tuple matching this fork's extract_file_text dispatch.
  • extract_file_text argument order is (file_name, file, ...) here vs upstream's (file, file_name, ...).
  • Document.doc_updated_at is datetime | None here (upstream is str); ISO strings are parsed before assignment.
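
To illustrate the last adjustment, here is a minimal sketch of parsing an ISO 8601 string into a datetime before assigning it to a datetime | None field. The helper name is hypothetical, and treating timezone-naive values as UTC is an assumption of this sketch, not something the PR states.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_updated_at(value: Optional[str]) -> Optional[datetime]:
    """Convert an ISO 8601 string to an aware datetime, or None if unusable."""
    if not value:
        return None
    # Older Python versions reject a trailing "Z", so normalize it first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    # Assumption: naive timestamps are interpreted as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Returning None on a malformed string keeps a single bad timestamp from failing the whole document.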

Lifecycle / perf adaptations to coexist with other connectors

The naive upstream Highspot connector spawns a fresh Chromium process per WebLink item. We observed this starving the worker's FDs / RAM and causing co-running connectors (specifically Slack's conversations.list) to fail with IncompleteRead mid-response. Fixed:

  • Shared browser per poll_source run: mirrors connectors/web/connector.py's pattern. One playwright.start() + chromium.launch() for the entire run, context.new_page() per WebLink, page.close() after each, full teardown in a try/finally at end-of-run (or on error).
  • Bounded scroll loop: WEB_CONNECTOR_MAX_SCROLL_ATTEMPTS = 10 (down from upstream's 20) and per-scroll wait_for_load_state("networkidle", timeout=5000) (down from 60000). Caps single-WebLink worst case at ~110s vs the upstream ~20-minute stall on pages where networkidle never settles.
  • Smaller yield batch: _YIELD_BATCH_SIZE = 4, decoupled from INDEX_BATCH_SIZE so the indexer's docs_indexed counter ticks up more often. Per-item processing in this connector is slow enough (Playwright + extract_file_text) that yielding every 16 items can mean minutes between UI counter updates.
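
The shared-browser lifecycle from the first bullet can be sketched as follows. This is a simplified illustration, not the connector's code: the two function names are hypothetical, the scroll loop is omitted, and Playwright is imported lazily so the module loads even where it is not installed.

```python
from contextlib import contextmanager
from typing import Iterable, Iterator, Tuple


@contextmanager
def shared_browser_context():
    """Start Playwright and Chromium once; tear everything down on exit."""
    from playwright.sync_api import sync_playwright  # lazy import

    playwright = sync_playwright().start()
    try:
        browser = playwright.chromium.launch(headless=True)
        try:
            context = browser.new_context()
            try:
                yield context
            finally:
                context.close()
        finally:
            browser.close()
    finally:
        playwright.stop()


def scrape_weblinks(context, urls: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Open one page per URL on the shared context and always close it."""
    for url in urls:
        page = context.new_page()
        try:
            page.goto(url)
            yield url, page.content()
        finally:
            # Closing each page keeps FD and memory use flat across the run.
            page.close()
```

A run would wrap the whole loop in `with shared_browser_context() as context:` and iterate `scrape_weblinks(context, urls)`, so only one Chromium process exists no matter how many WebLink items a Spot contains.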

2. Drive-by fix: slack-bot config Create button

web/src/app/admin/bot/SlackBotConfigCreationForm.tsx — gate curated_response_config.response_message validation behind the enable_curated_response_integration toggle. Without the gate, the schema unconditionally required the field but the input is only rendered when the toggle is on (default off), so on a fresh /admin/bot/new validation silently failed, the Create button did nothing, and no error was visible because the errored field wasn't on screen. Mirrors the existing jira_config .when() pattern.

(Same fix as commit 4ed8bcbd on feature/multilanguage-support; applied here independently because feature/highspot was branched off feature/darwin before that PR landed.)
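
The silent failure and its fix can be modeled in a few lines. This is a Python model of the validation logic only; the real change is a Yup .when() condition in the TSX form. The two field names come from this PR, and the function names are hypothetical.

```python
FIELD = "curated_response_config.response_message"
TOGGLE = "enable_curated_response_integration"


def validate_before_fix(values: dict) -> dict:
    """Old behavior: the field is required even when its input is hidden."""
    errors = {}
    if not values.get(FIELD):
        errors[FIELD] = "Required"
    return errors


def validate_after_fix(values: dict) -> dict:
    """New behavior: the field is required only when the toggle is on."""
    errors = {}
    if values.get(TOGGLE) and not values.get(FIELD):
        errors[FIELD] = "Required"
    return errors


# A fresh form: toggle off by default, message empty and not rendered.
fresh_form = {TOGGLE: False, FIELD: ""}
```

On the fresh form the old check reports an error for a field the user cannot see, which blocks submission without any visible message; the gated check returns no errors.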

What was tested

Pre-commit / quality checks (per .pre-commit-config.yaml):

  • black --check on every Python file touched
  • reorder_python_imports --py311-plus
  • ruff (clean)
  • prettier --check on every TS/TSX touched
  • tsc --noEmit (clean, exit 0)
  • ✅ Backend module imports cleanly (from danswer.connectors.highspot.connector import HighspotConnector succeeds in the venv)

Manual verification path:

  • Standalone smoke test inside connector.py's if __name__ == "__main__": block:
    cd backend
    PYTHONPATH=$(pwd) HIGHSPOT_KEY=… HIGHSPOT_SECRET=… HIGHSPOT_SPOT_NAMES="My Spot" \
      python danswer/connectors/highspot/connector.py
  • End-to-end via admin UI (verified in dev):
    1. Open /admin/connectors/highspot → enter key + secret in Step 1 → save credential.
    2. Step 2: dropdown populates live with the actual Spots from the API.
    3. Pick one or more Spots → Create.
    4. cc-pair appears, indexing kicks off, docs_indexed ticks up every ~few items (the smaller yield-batch effect).
    5. Search for content from one of the indexed Items returns it as a Highspot citation.

What's NOT in this PR (follow-ups, if needed)

  • Replace web/public/Highspot.png — currently a placeholder copy of HubSpot.png. Swap it for the real Highspot logo before merge.
  • Per-spot config concurrency: client.get_item(item_id) is called sequentially for every item, even when the result is only needed for time-window rejection. Parallelizing this with a ThreadPoolExecutor (5-10 workers) could raise indexing throughput 2-5×; deferred until we see real-world Spot sizes.
  • Time-budget-based scroll loop: WEB_CONNECTOR_MAX_SCROLL_ATTEMPTS=10 is a count cap; a wall-clock cap (e.g. 30s/page) would handle the long-tail pages better. Also deferred.
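
One possible shape for the deferred concurrency follow-up, as a sketch only: the helper name and the default of 8 workers are assumptions, and get_item stands in for client.get_item.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


def fetch_items_concurrently(
    get_item: Callable[[str], Any],
    item_ids: Iterable[str],
    max_workers: int = 8,
) -> List[Any]:
    """Fetch item details in parallel while preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order; an exception raised in a
        # worker is re-raised here when its result is consumed.
        return list(pool.map(get_item, item_ids))
```

Threads suit this case because each call is dominated by network wait rather than CPU work; the actual speedup would depend on Highspot's rate limits, which this PR does not describe.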

Process bounce required after deploy

Per CLAUDE.md's footgun list:

  1. dapi (api-server) — new ORM imports + new admin route.
  2. dbe (background indexer + Celery worker) — new connector class registered in factory.
  3. dsl (slack listener) — adding a DocumentSource enum value triggers pydantic ValidationError: source_type on the slackbot otherwise.
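
Why a process that is not bounced breaks can be illustrated with the standard library alone. In this sketch the two enum classes are hypothetical stand-ins for DocumentSource as loaded before and after the deploy; the ValidationError named above is assumed to be pydantic's wrapper around this same kind of lookup failure.

```python
from enum import Enum


class StaleDocumentSource(str, Enum):
    """Enum as seen by a process started before the deploy."""
    SLACK = "slack"
    WEB = "web"


class CurrentDocumentSource(str, Enum):
    """Enum as seen by a process restarted after the deploy."""
    SLACK = "slack"
    WEB = "web"
    HIGHSPOT = "highspot"


def can_parse(enum_cls, raw: str) -> bool:
    """Return True if the raw string maps to a member of enum_cls."""
    try:
        enum_cls(raw)
        return True
    except ValueError:
        return False
```

A stale process rejects the "highspot" value that a restarted process accepts, which is why all three services need the bounce.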

Test plan

  • Replace web/public/Highspot.png with the real Highspot logo
  • Pull, bounce dapi + dbe + dsl
  • Open /admin/connectors/highspot, enter creds, verify the Spot multi-select populates from the live API
  • Try Create with zero Spots selected → expect inline "please select at least one Spot" error
  • Pick 1-2 Spots, create the connector, watch indexing — docs_indexed should tick up every ~few items
  • Search for content from an indexed Highspot Item — expect a citation linking back to the original
  • Smoke /admin/bot/new Create button — should now either submit successfully or display backend validation errors (no more silent "nothing happens")

🤖 Generated with Claude Code

rajivml and others added 3 commits May 6, 2026 15:06
Indexes Spots and the Items inside them via Highspot's REST API.
Authenticates with HTTP Basic (key+secret) generated from the
Highspot admin console; an optional base URL covers tenants on
non-default Highspot regions.

Per-item content extraction is tiered:
  1. WebLink items -> headless-Chromium scrape via Playwright,
     reusing one shared browser/context for the whole poll_source
     run (mirrors connectors/web/connector.py — spawning Chromium
     per item starves worker FDs/RAM and was making co-running
     Slack indexing fail with IncompleteRead).
  2. Items with a downloadable, supported extension
     (.pdf .docx .pptx .xlsx .eml .epub .html .txt) ->
     extract_file_text over the bytes from items/{id}/content.
  3. Else / on any error -> title + description fallback.

Notable adaptations vs upstream Onyx:
  - Drops the Slim/perm-sync interface; this fork has no
    SlimConnectorWithPermSync / SlimDocument / TextSection /
    OnyxFileExtensions / IndexingHeartbeatInterface in the
    upstream shape.
  - Uses Section instead of TextSection.
  - extract_file_text arg order is (file_name, file, ...) here;
    upstream is (file, file_name, ...).
  - Parses ISO date_updated to datetime before assignment because
    Document.doc_updated_at is typed datetime | None.
  - Scroll loop bounds: max_attempts=10 (down from 20), per-scroll
    networkidle timeout=5s (down from 60s) — caps single-WebLink
    worst case at ~110s, vs the upstream 20-min stall.
  - _YIELD_BATCH_SIZE=4 so the indexer's docs_indexed counter
    ticks more frequently; API pagination still uses
    INDEX_BATCH_SIZE.

Frontend:
  - HighspotConfig + HighspotCredentialJson in lib/types.ts.
  - HighspotIcon (placeholder Highspot.png — replace with the
    real logo before merge).
  - Tile in lib/sources.ts (AppConnection category).
  - Admin page at /admin/connectors/highspot mirrors the
    sf-account/page.tsx template; Spot selection is a live
    multi-select dropdown driven by GET /manage/admin/connector/
    highspot/spots/{credential_id} that calls the Highspot API
    using the saved credential and renders the actual Spot list.
    Selecting >=1 Spot is mandatory.

Process bounce after merge + deploy: dapi + dbe + dsl
(DocumentSource enum addition footgun per CLAUDE.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Yup schema unconditionally required curated_response_config.
response_message, but the matching text input is only rendered
when enable_curated_response_integration is true. Default is
false, so on a fresh /admin/bot/new the field was empty,
validation failed silently, the Create button did nothing, and
no error rendered because the errored field wasn't on screen.

Mirror the jira_config pattern: only require when the toggle is
enabled.

Same fix as commit 4ed8bcb on feature/multilanguage-support;
applied here independently because feature/highspot was branched
off feature/darwin before that PR landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's pre-commit prettier (v3.1.0) defaults to trailingComma:"all"
and flags the missing comma after the last generic param.
Local npm prettier 2.8.8 defaults to "es5" and didn't catch it.
Adding the comma to satisfy the canonical CI hook.

Pre-existing issue surfaced by this PR's diff scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rajivml rajivml merged commit c7108dc into feature/darwin May 6, 2026
7 checks passed

3 participants