Skip to content

feat: make min_word_length parameter configurable (#120)#153

Open
Yegorov wants to merge 2 commits into
cardmagic:masterfrom
Yegorov:120
Open

feat: make min_word_length parameter configurable (#120)#153
Yegorov wants to merge 2 commits into
cardmagic:masterfrom
Yegorov:120

Conversation

@Yegorov
Copy link
Copy Markdown

@Yegorov Yegorov commented May 14, 2026

Closes #120

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Yegorov
Copy link
Copy Markdown
Author

Yegorov commented May 14, 2026

@cardmagic can you take a look, please?

Copy link
Copy Markdown
Owner

@cardmagic cardmagic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! The feature looks good overall. I've left a few inline comments - one critical bug to fix before merging.

Comment thread lib/classifier/tfidf.rb Outdated
Comment thread test/config/config_test.rb
Comment thread lib/classifier/config.rb
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR adds a configurable min_word_length parameter across all classifiers (Bayes, LSI, TFIDF, LogisticRegression) and a new Classifier::Config singleton so the default can be set globally. The feature is well-threaded through training, classification, and serialization paths — with one copy-paste mistake in tfidf.rb's JSON deserialization and a test isolation issue in the new config test.

  • lib/classifier/tfidf.rb: from_json_data sets :@storage instead of :@min_word_length, so @storage is corrupted to an integer and the persisted word-length setting is silently lost on every JSON round-trip.
  • test/config/config_test.rb: test_configure mutates the global Classifier.config singleton without a teardown reset, which can cause downstream tests to see min_word_length = 1 instead of the default 3.
  • All other classifiers (Bayes, LSI, LogisticRegression) handle the new parameter correctly through initialization, training, and both marshal and JSON serialization.

Confidence Score: 3/5

The overall change is straightforward but carries one copy-paste mistake in TFIDF's JSON deserialization that corrupts @storage and silently drops the saved min_word_length on every round-trip — fix before merging.

The TFIDF from_json_data method overwrites @storage with an integer instead of setting @min_word_length, leaving the instance in a broken state after JSON deserialization. The config test also mutates global singleton state without cleanup. Two concrete defects on new code paths pull the score down despite the rest of the implementation being clean.

lib/classifier/tfidf.rb (wrong ivar in from_json_data) and test/config/config_test.rb (missing teardown for global state mutation).

Important Files Changed

Filename Overview
lib/classifier/tfidf.rb Adds min_word_length parameter throughout, but from_json_data uses :@storage instead of :@min_word_length, corrupting @storage and losing the persisted value on deserialization.
lib/classifier/config.rb New module-level Config class with min_word_length defaulting to 3; clean implementation with module_function wiring.
lib/classifier/extensions/word_hash.rb Adds min_word_length parameter to word_hash, clean_word_hash, and word_hash_for_words; correctly changes > 2 to >= min_word_length preserving existing behavior at the default of 3.
lib/classifier/bayes.rb Threads min_word_length through initialize, train/untrain, classifications, serialization, and JSON deserialization correctly.
lib/classifier/lsi.rb Adds min_word_length via options hash and wires it through add_item and marshal serialization correctly.
lib/classifier/logistic_regression.rb Correctly propagates min_word_length through all train/classify paths, JSON, and marshal serialization.
test/config/config_test.rb New config test mutates global singleton state in test_configure without teardown, risking cross-test pollution.
test/bayes/bayesian_test.rb Adds a sensible min_word_length: 5 integration test for Bayes.
test/tfidf/tfidf_test.rb Adds min_word_length: 5 test for TFIDF tokenization, validates expected vocabulary reduction.

Sequence Diagram

sequenceDiagram
    participant User
    participant Classifier
    participant Config
    participant WordHash as String#word_hash

    User->>Classifier: "Classifier.configure { |c| c.min_word_length = 5 }"
    Classifier->>Config: "config.min_word_length = 5"

    User->>Classifier: Bayes.new('Spam', min_word_length: 4)
    Classifier->>Classifier: "@min_word_length = 4"

    User->>Classifier: train('Spam', 'some text')
    Classifier->>WordHash: "text.word_hash(@min_word_length)"
    WordHash-->>Classifier: filtered word hash

    User->>Classifier: classifier.to_json
    Classifier-->>User: JSON with min_word_length: 4

    User->>Classifier: Bayes.from_json(json)
    Classifier->>Config: "data['min_word_length'] || config.min_word_length"
    Classifier-->>User: restored instance
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
lib/classifier/tfidf.rb:238-239
Wrong ivar name in `from_json_data`. Line 238 sets `@storage` to `nil`, then line 239 immediately overwrites `@storage` with the deserialized integer value. `@min_word_length` is never restored, so a round-tripped TFIDF instance silently falls back to the global default instead of its saved setting, and `@storage` is left as an integer instead of `nil`.

```suggestion
      instance.instance_variable_set(:@storage, nil)
      instance.instance_variable_set(:@min_word_length, data['min_word_length'] || Classifier.config.min_word_length)
```

### Issue 2 of 2
test/config/config_test.rb:5-11
`test_configure` mutates the memoized `Classifier.config` singleton (sets `min_word_length` to 1) and never resets it. Any test that runs after this one in the same process and relies on `Classifier.config.min_word_length` defaulting to 3 — for example, Bayes or TFIDF tests that omit an explicit `min_word_length` — will see 1 instead of 3, causing spurious failures. Add a `teardown` that resets `Classifier.instance_variable_set(:@config, nil)` (or sets it back to 3) to isolate the mutation.

Reviews (1): Last reviewed commit: "feat: make `min_word_length` parameter c..." | Re-trigger Greptile

Comment thread lib/classifier/tfidf.rb Outdated
Comment on lines +238 to +239
instance.instance_variable_set(:@storage, nil)
instance.instance_variable_set(:@storage, data['min_word_length'] || Classifier.config.min_word_length)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Wrong ivar name in from_json_data. Line 238 sets @storage to nil, then line 239 immediately overwrites @storage with the deserialized integer value. @min_word_length is never restored, so a round-tripped TFIDF instance silently falls back to the global default instead of its saved setting, and @storage is left as an integer instead of nil.

Suggested change
instance.instance_variable_set(:@storage, nil)
instance.instance_variable_set(:@storage, data['min_word_length'] || Classifier.config.min_word_length)
instance.instance_variable_set(:@storage, nil)
instance.instance_variable_set(:@min_word_length, data['min_word_length'] || Classifier.config.min_word_length)
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 238-239

Comment:
Wrong ivar name in `from_json_data`. Line 238 sets `@storage` to `nil`, then line 239 immediately overwrites `@storage` with the deserialized integer value. `@min_word_length` is never restored, so a round-tripped TFIDF instance silently falls back to the global default instead of its saved setting, and `@storage` is left as an integer instead of `nil`.

```suggestion
      instance.instance_variable_set(:@storage, nil)
      instance.instance_variable_set(:@min_word_length, data['min_word_length'] || Classifier.config.min_word_length)
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +5 to +11
def test_configure
Classifier.configure do |config|
config.min_word_length = 1
end

assert_equal(1, Classifier.config.min_word_length)
end
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 test_configure mutates the memoized Classifier.config singleton (sets min_word_length to 1) and never resets it. Any test that runs after this one in the same process and relies on Classifier.config.min_word_length defaulting to 3 — for example, Bayes or TFIDF tests that omit an explicit min_word_length — will see 1 instead of 3, causing spurious failures. Add a teardown that resets Classifier.instance_variable_set(:@config, nil) (or sets it back to 3) to isolate the mutation.

Prompt To Fix With AI
This is a comment left during a code review.
Path: test/config/config_test.rb
Line: 5-11

Comment:
`test_configure` mutates the memoized `Classifier.config` singleton (sets `min_word_length` to 1) and never resets it. Any test that runs after this one in the same process and relies on `Classifier.config.min_word_length` defaulting to 3 — for example, Bayes or TFIDF tests that omit an explicit `min_word_length` — will see 1 instead of 3, causing spurious failures. Add a `teardown` that resets `Classifier.instance_variable_set(:@config, nil)` (or sets it back to 3) to isolate the mutation.

How can I resolve this? If you propose a fix, please make it concise.

cardmagic
cardmagic previously approved these changes May 16, 2026
Add missing instance variable declarations for Steep type checker:
- @min_word_length in Bayes, TFIDF, LogisticRegression, LSI
- @config and method signatures in Config module

These annotations allow the type checker to properly verify
the new min_word_length parameter added in this PR.
@cardmagic
Copy link
Copy Markdown
Owner

Steep Type Check Fix

The CI is failing because the @min_word_length instance variable is missing RBS inline annotations. Here are the required changes:

1. lib/classifier/bayes.rb - Add to the instance variable declarations block:

    # @rbs @min_word_length: Integer

2. lib/classifier/config.rb - Replace the entire file with:

# rbs_inline: enabled

module Classifier
  # @rbs @config: Config?

  # This lazy initialization is not thread-safe.
  # In multi-threaded environments, ensure this method is called
  # or configuration is set explicitly during startup before using classifiers.
  # @rbs () -> Config
  def config
    @config ||= Config.new
  end

  # @rbs () { (Config) -> void } -> void
  def configure(&block)
    block&.call(config)
  end

  module_function :config, :configure

  class Config
    # @rbs @min_word_length: Integer

    attr_accessor :min_word_length #: Integer

    # @rbs () -> void
    def initialize
      @min_word_length = 3
    end
  end
end

3. lib/classifier/tfidf.rb - Add to the instance variable declarations block:

    # @rbs @min_word_length: Integer

4. lib/classifier/logistic_regression.rb - Add to the instance variable declarations block:

    # @rbs @min_word_length: Integer

5. lib/classifier/lsi.rb - Add to the instance variable declarations block:

    # @rbs @min_word_length: Integer

After making these changes, run bundle exec rbs-inline --output sig/generated lib/ to regenerate the RBS files, then CI should pass.

@cardmagic cardmagic self-requested a review May 16, 2026 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Configurable minimum word length for tokenization

2 participants