
Add Boyer-Moore-Horspool string-search algorithm #14630

Open
ChrisJr404 wants to merge 1 commit into TheAlgorithms:master from ChrisJr404:add-boyer-moore-horspool

@ChrisJr404

Describe your change

Adds the Boyer-Moore-Horspool string-search algorithm under strings/boyer_moore_horspool.py.

Horspool (1980) is the most widely taught simplification of the Boyer-Moore family: it keeps only the bad-character shift table and uses the rightmost character of the current text window to drive the shift. It is sub-linear on average (~O(n / m) on random text), has a worst case of O(n * m), and uses O(sigma) memory, where sigma is the number of distinct characters that appear in the pattern. It is often the simplest sub-linear matcher to implement, and Horspool-style variants are used internally by libraries such as glibc's memmem for short patterns.
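To make the shift behaviour concrete, here is a small illustrative sketch (not the code in this PR) that only traces which window start positions a Horspool scan visits. The function name `horspool_alignments` is hypothetical, and the sketch deliberately skips the actual comparison step so the effect of the shift table is easy to see.

```python
from collections.abc import Iterator


def horspool_alignments(text: str, pattern: str) -> Iterator[int]:
    """Yield every window start index that a Horspool scan examines."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return
    # Bad-character table: for each character of pattern[:-1], the distance
    # from its last occurrence to the final pattern position.
    shift = {char: m - 1 - i for i, char in enumerate(pattern[:-1])}
    start = 0
    while start <= n - m:
        yield start
        # The rightmost character of the window drives the shift; characters
        # missing from the table skip a full pattern length.
        start += shift.get(text[start + m - 1], m)


# Favourable case: no window character occurs in the pattern, so each shift is m.
print(list(horspool_alignments("x" * 12, "abcd")))  # [0, 4, 8]
# Unfavourable case: the shift is always 1, so every alignment is visited.
print(list(horspool_alignments("a" * 6, "baa")))  # [0, 1, 2, 3]
```

In the first call only n / m windows are inspected, which is where the ~O(n / m) figure comes from. In the second, every window is inspected and each one can cost up to m character comparisons, giving the O(n * m) worst case.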

The repository already contains the original Boyer-Moore (strings/boyer_moore_search.py) and Knuth-Morris-Pratt (strings/knuth_morris_pratt.py); Horspool is a distinct, well-known variant and was missing. I confirmed the file does not already exist (rg horspool only matches the dictionary corpus) and DIRECTORY.md had no entry for it.

Reference: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm

The module exposes:

  • boyer_moore_horspool_search(text, pattern) -> int - index of the first match, or -1 if there is none; matches str.find for the empty-pattern edge case.
  • boyer_moore_horspool_search_all(text, pattern) -> list[int] - all (overlapping) match indices.
  • _build_shift_table(pattern) - private helper, also doctested.

Checklist

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. The file does not require a third-party library.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

Tests

  • 18 doctests across the three functions, covering empty pattern, empty text, pattern longer than text, no match, single match, overlapping matches, and parity with str.find.
  • python -m doctest strings/boyer_moore_horspool.py -v -> 18 passed.
  • python -m pytest strings/boyer_moore_horspool.py --doctest-modules -> 3 passed.
  • Local fuzz of 2000 random inputs over a 4-letter alphabet against str.find and a naive all-occurrences oracle: 100% agreement.
  • pre-commit run --files strings/boyer_moore_horspool.py DIRECTORY.md passes (ruff check, ruff format, codespell, auto-walrus, validate-filenames).
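The local fuzz check mentioned above can be reproduced with something along these lines. This is a sketch rather than the exact script that was run: the names `naive_find_all` and `fuzz`, the length ranges, and the fixed seed are all assumptions. The functions under test are passed in as parameters so the harness stays self-contained, and patterns are kept non-empty so the comparison does not depend on empty-pattern conventions.

```python
import random
from collections.abc import Callable


def naive_find_all(text: str, pattern: str) -> list[int]:
    """Oracle: every index where pattern occurs, found by brute force."""
    last_start = len(text) - len(pattern)
    return [i for i in range(last_start + 1) if text.startswith(pattern, i)]


def fuzz(
    search: Callable[[str, str], int],
    search_all: Callable[[str, str], list[int]],
    rounds: int = 2000,
    alphabet: str = "abcd",
    seed: int = 0,
) -> int:
    """Compare both functions against the oracles; return the rounds checked."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(rounds):
        text = "".join(rng.choices(alphabet, k=rng.randint(0, 30)))
        pattern = "".join(rng.choices(alphabet, k=rng.randint(1, 5)))
        assert search(text, pattern) == text.find(pattern), (text, pattern)
        assert search_all(text, pattern) == naive_find_all(text, pattern), (
            text,
            pattern,
        )
    return rounds


# Self-check of the harness: the oracles trivially agree with themselves.
print(fuzz(str.find, naive_find_all))  # 2000
```

To exercise the new module, pass `boyer_moore_horspool_search` and `boyer_moore_horspool_search_all` in place of `str.find` and `naive_find_all`. A small alphabet keeps matches and partial matches frequent, which is what stresses the shift logic.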

DIRECTORY.md updated alphabetically in the Strings section.
