
feat: add word_confidence_threshold parameter to whisper#33

Merged
johnyrahul merged 3 commits into main from feat/add-word-confidence-threshold on Jun 10, 2026

@johnyrahul

Contributor

Summary

Adds the new word_confidence_threshold parameter to the v2 client's whisper() method, in sync with llmwhisperer-docs#57.

  • word_confidence_threshold (float, default 0.3): the minimum OCR confidence score a word must have to be included in the extracted text. Words below the threshold are excluded from the final output.
  • Works only with form, high_quality and table modes.
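
The forwarding behaviour described above can be sketched as follows. This is a minimal illustration, not the code in client_v2.py: `build_whisper_params` is a hypothetical helper, and only the `mode` and `word_confidence_threshold` names and the 0.3 default come from this PR.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for the pattern this PR describes; the real
# whisper() method takes many more parameters than are shown here.
def build_whisper_params(mode: str = "form",
                         word_confidence_threshold: float = 0.3) -> dict:
    """Collect request params: the threshold is always included,
    with no client-side validation."""
    return {
        "mode": mode,
        "word_confidence_threshold": word_confidence_threshold,
    }

# Default call: the threshold is sent even when the caller omits it.
default_query = urlencode(build_whisper_params())

# Custom call: the caller's value is passed through unchanged.
custom_query = urlencode(
    build_whisper_params(mode="table", word_confidence_threshold=0.75)
)

print(default_query)
print(custom_query)
```

Because the value is always part of the params dict, a caller who never sets it still sends the 0.3 default with every request.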

Changes

  • Added the parameter to the whisper() signature, docstring, and the request params dict in client_v2.py.
  • Added unit tests verifying the default (0.3) and a custom value are forwarded as request params.

Testing

uv run pytest tests/unit/client_v2_test.py -k word_confidence -v
# 2 passed

🤖 Generated with Claude Code

Adds the `word_confidence_threshold` (float, default 0.3) parameter to
the v2 client's whisper() method, forwarding it to the API. Words whose
OCR confidence falls below the threshold are excluded from the output.
Works with form, high_quality and table modes.

Adds unit tests covering the default and a custom value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR adds a word_confidence_threshold parameter (default 0.3) to the whisper() method in the v2 client, allowing callers to filter out low-confidence OCR words from extracted text. It also patches the integration test with a pre-test webhook cleanup to prevent failures from stale state.

  • client_v2.py: new word_confidence_threshold: float = 0.3 parameter appended to the signature, docstring, and params dict; the table mode is also added to the mode docstring.
  • tests/unit/client_v2_test.py: two new unit tests verify the default and a custom value are forwarded correctly as URL query params.
  • tests/integration/client_v2_test.py: defensive delete_webhook call added before webhook registration to clean up any leftover state from prior runs.

Confidence Score: 5/5

Safe to merge — the change is a straightforward additive parameter with a sensible default, backed by unit tests and consistent with the existing parameter-passing pattern.

The new parameter is always forwarded unconditionally in the params dict (matching every other param in the method), the default of 0.3 preserves existing server-side behaviour, and two focused unit tests confirm both the default and a custom value reach the prepared request URL. No auth, data-integrity, or breaking-change concerns are introduced.

No files require special attention.

Important Files Changed

Filename Overview
src/unstract/llmwhisperer/client_v2.py Adds word_confidence_threshold parameter to whisper() signature, docstring, and params dict; also adds "table" to the mode docstring. No validation is performed client-side, consistent with all other numeric params in this method.
tests/unit/client_v2_test.py Two new unit tests check that the default (0.3) and a custom (0.75) word_confidence_threshold value are forwarded as query params; uses parse_qs on the prepared request URL, which is the correct approach.
tests/integration/client_v2_test.py Pre-test webhook cleanup added to prevent duplicate-registration failures from prior failed runs; no functional changes to the test assertions.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant LLMWhispererClientV2
    participant LLMWhispererAPI

    Caller->>LLMWhispererClientV2: "whisper(url, word_confidence_threshold=0.3, ...)"
    Note over LLMWhispererClientV2: Build params dict including<br/>word_confidence_threshold
    LLMWhispererClientV2->>LLMWhispererAPI: "POST /whisper?...&word_confidence_threshold=0.3"
    LLMWhispererAPI-->>LLMWhispererClientV2: "{"status_code": 200, "extraction": {...}}"
    LLMWhispererClientV2-->>Caller: response dict

Reviews (3): Last reviewed commit: "test: delete pre-existing webhook before..." | Re-trigger Greptile

@jaseemjaskp jaseemjaskp left a comment

Contributor

Automated PR Review Toolkit pass (Code Reviewer, Silent Failure Hunter, Type Design Analyzer, PR Test Analyzer, Comment Analyzer, Code Simplifier).

The change is small, additive, and follows the established whisper() convention exactly (new kwarg with default -> entry in the params dict). Blast radius is contained to this one public method; no callers need changes. Inline comments below, ordered by priority. Net recommendation: address the docstring "table" mode inaccuracy and tighten the test assertions; the rest are confirm/optional.

Comment thread src/unstract/llmwhisperer/client_v2.py Outdated
Comment thread src/unstract/llmwhisperer/client_v2.py
Comment thread src/unstract/llmwhisperer/client_v2.py
Comment thread tests/unit/client_v2_test.py Outdated
Comment thread tests/unit/client_v2_test.py Outdated
johnyrahul and others added 2 commits June 9, 2026 15:57
- Add "table" to the list of valid whisper() modes in the docstring
- Document the valid range [0.0, 1.0] for word_confidence_threshold and
  make the wording word-consistent
- Assert on the parsed query value (parse_qs) instead of a URL substring
  so the tests can't false-match on a prefix (e.g. 0.3 vs 0.35)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
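
The prefix false-match that this commit guards against can be reproduced with the standard library alone. The sketch below is illustrative: the URL is a made-up placeholder, not the actual prepared request from the test suite.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical request URL; host and path are placeholders.
url = "https://example.invalid/whisper?mode=form&word_confidence_threshold=0.35"

# Substring check: matches even though the value sent is 0.35, not 0.3.
substring_match = "word_confidence_threshold=0.3" in url

# Parsed check: compares the complete value, so 0.35 is not mistaken for 0.3.
query = parse_qs(urlsplit(url).query)
parsed_match = query["word_confidence_threshold"] == ["0.3"]

print(substring_match)  # True (a false positive)
print(parsed_match)     # False
```

`parse_qs` returns each value as a list of strings, which is why the comparison is against `["0.3"]` and not a bare float.
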
A webhook left over from a previous failed run caused register_webhook to
fail on a stale record. Delete any existing webhook (ignoring not-found)
before registering so the test starts from a clean slate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
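
The delete-then-register cleanup described in this commit might look roughly like the sketch below. Everything here is an assumption made for illustration: `FakeWebhookClient`, `WebhookNotFound`, and `register_clean` are hypothetical stand-ins, and the real client's method signatures and error types may differ.

```python
class WebhookNotFound(Exception):
    """Hypothetical error standing in for the client's not-found failure."""


class FakeWebhookClient:
    """In-memory stand-in used only for this sketch; not the real client."""

    def __init__(self) -> None:
        self._webhooks: dict[str, str] = {}

    def register_webhook(self, url: str, webhook_name: str) -> None:
        # Mimic the failure mode from the commit: a stale record blocks re-registration.
        if webhook_name in self._webhooks:
            raise ValueError(f"webhook {webhook_name!r} already exists")
        self._webhooks[webhook_name] = url

    def delete_webhook(self, webhook_name: str) -> None:
        if webhook_name not in self._webhooks:
            raise WebhookNotFound(webhook_name)
        del self._webhooks[webhook_name]


def register_clean(client: FakeWebhookClient, url: str, webhook_name: str) -> None:
    """Delete any leftover webhook (ignoring not-found), then register."""
    try:
        client.delete_webhook(webhook_name)
    except WebhookNotFound:
        pass  # Nothing to clean up; this is the normal first-run case.
    client.register_webhook(url, webhook_name)


client = FakeWebhookClient()
client.register_webhook("https://example.invalid/hook", "demo")  # simulated stale record
register_clean(client, "https://example.invalid/hook", "demo")   # succeeds despite it
```

Only the not-found case is swallowed, so any other failure during cleanup still surfaces in the test run.
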
@johnyrahul johnyrahul requested a review from jaseemjaskp June 9, 2026 10:44

@jaseemjaskp jaseemjaskp left a comment

Contributor

Re-reviewed the latest revision with the PR Review Toolkit (code review, silent-failure, type-design, test, comment, and simplifier passes). All previously-raised threads have been addressed and resolved. Remaining observations are NIT/Suggestion-level only (no range validation by design — consistent with sibling numeric params; param forwarded for all modes and ignored server-side like include_line_confidence). No correctness, contract, or breaking-change issues found. LGTM 👍
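
Since the client performs no range validation by design, a caller who wants an early failure could check the value before calling whisper(). The sketch below is one possible approach and is not part of this PR: `checked_threshold` is a hypothetical helper, and the [0.0, 1.0] range is the one documented in the docstring update.

```python
def checked_threshold(value: float) -> float:
    """Return value unchanged if it lies in [0.0, 1.0], else raise ValueError.

    Hypothetical caller-side helper; the client itself does not validate.
    """
    # The chained comparison is also False for NaN, so NaN is rejected too.
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"word_confidence_threshold must be within [0.0, 1.0], got {value!r}"
        )
    return value


threshold = checked_threshold(0.75)
print(threshold)
```

The result can then be passed as `word_confidence_threshold=threshold`, keeping the validation policy in the caller's code and out of the client.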

@johnyrahul johnyrahul merged commit 3832713 into main Jun 10, 2026
1 of 3 checks passed
@johnyrahul johnyrahul deleted the feat/add-word-confidence-threshold branch June 10, 2026 04:28
3 participants