⚡️ Speed up function `calculate_percent_missing_text` by 18% #271
Open

codeflash-ai[bot] wants to merge 1 commit into `main`
Conversation
The optimized code achieves an 18% speedup through two key algorithmic improvements in the `bag_of_words` function, plus a dictionary-lookup optimization in `calculate_percent_missing_text`:

**1. Replaced nested while-loop with single-pass enumeration**

The original code used a manual while-loop with complex index manipulation (`i`, `j`) to scan through words, including an inner while-loop to concatenate consecutive single-character tokens. This approach required:

- Repeated `len(words)` calls in loop conditions
- Manual index incrementing and jumping (`i = j`)
- Building intermediate `incorrect_word` strings that were often discarded

The optimized version uses Python's `enumerate()` for a single pass with direct indexing, eliminating the nested-loop overhead and string concatenation entirely.

**2. Streamlined single-character word detection**

Instead of scanning ahead to concatenate consecutive single-character tokens (which the original logic then mostly rejected), the optimized code makes local adjacency checks:

- `prev_single = i > 0 and len(words[i - 1]) == 1`
- `next_single = i + 1 < n and len(words[i + 1]) == 1`

This processes only isolated single alphanumeric characters, matching the original behavior while avoiding string-building overhead.

**3. Dictionary lookup optimization in `calculate_percent_missing_text`**

Replaced the `if source_word not in output_bow` check followed by a separate dictionary access with `output_bow.get(source_word, 0)`, reducing dictionary lookups from two to one per iteration.

**Performance impact based on workloads:**

From the line profiler, the original code spent 20.9% of its time in the while-loop condition and 18.6% checking word lengths. The optimized version shifts this to 22.1% for enumeration (which covers both iteration and indexing) and 20.8% for length checks, a net reduction in control-flow overhead.
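The single-pass pattern described above can be sketched as follows. This is a hypothetical reconstruction based only on the description in this PR, not the library's exact code; the rule for skipping runs of single-character tokens is an assumption.

```python
from collections import defaultdict

def bag_of_words(text: str) -> dict:
    """Sketch: count words in one pass, skipping single alphanumeric
    characters that sit next to other single-character tokens (likely
    OCR noise), per the adjacency checks described above."""
    words = text.split()
    n = len(words)
    bow = defaultdict(int)
    for i, word in enumerate(words):  # single pass, no manual index jumping
        if len(word) == 1 and word.isalnum():
            # Local adjacency checks replace the original scan-ahead loop
            prev_single = i > 0 and len(words[i - 1]) == 1
            next_single = i + 1 < n and len(words[i + 1]) == 1
            if prev_single or next_single:
                continue  # part of a run of single characters; skip it
        bow[word] += 1
    return dict(bow)
```

With this sketch, `bag_of_words("w o r d soup")` drops the single-character run but keeps `soup`, while an isolated single character such as `a` in `"on a mat"` is still counted.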
The test results show the optimization is particularly effective for:

- Large texts with repeated words: 31-42% faster (e.g., `test_large_identical_texts`, `test_large_repeated_words_some_missing`)
- Documents with many unique words: 9-25% faster (e.g., `test_large_scale_half_missing`, `test_large_text_unique_words`)
- Text with few single-character tokens that would have triggered the expensive concatenation path

Since `calculate_percent_missing_text` is called in `_process_document` for document evaluation, this optimization directly benefits document-processing pipelines where text-extraction quality metrics are computed. The function is in a hot path during batch document evaluation, making the 18% improvement particularly valuable when processing large document sets.
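The metric itself, with the single-lookup change from point 3, can be sketched like this. Whitespace splitting stands in for the real `bag_of_words` helper, and the normalization and rounding details are assumptions, not the library's exact behavior.

```python
from collections import Counter

def calculate_percent_missing_text(source_text: str, output_text: str) -> float:
    """Sketch: fraction of source words absent (or under-represented)
    in the extracted output text."""
    source_bow = dict(Counter(source_text.split()))
    output_bow = dict(Counter(output_text.split()))
    total = sum(source_bow.values())
    if total == 0:
        return 0.0
    missing = 0
    for source_word, source_count in source_bow.items():
        # One .get() lookup replaces the `not in` membership test
        # followed by a second access on the same key.
        missing += max(source_count - output_bow.get(source_word, 0), 0)
    return round(missing / total, 2)
```

For example, `calculate_percent_missing_text("a b c d", "a b")` yields `0.5` under these assumptions, since half of the source words are missing from the output.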
📄 18% (0.18x) speedup for `calculate_percent_missing_text` in `unstructured/metrics/text_extraction.py`

⏱️ Runtime: 7.27 milliseconds → 6.14 milliseconds (best of 45 runs)
✅ Correctness verification report:
⚙️ Existing Unit Tests

- `metrics/test_text_extraction.py::test_calculate_percent_missing_text`

🌀 Generated Regression Tests

🔎 Concolic Coverage Tests

- `codeflash_concolic_xdo_puqm/tmpj3sc54h5/test_concolic_coverage.py::test_calculate_percent_missing_text`

To edit these changes, run `git checkout codeflash/optimize-calculate_percent_missing_text-mks2sdnk` and push.