Is your feature request related to a problem? Please describe.
In the current implementation, when drop_last=True, OneHotEncoder always drops the last category in alphabetical order. This leaves users no direct way to choose which category serves as the reference group. In many modeling scenarios, such as logistic regression or other linear models, the choice of reference category matters, and users may want to drop a different one.
Describe the solution you'd like
Add a drop parameter that allows users to control which dummy category is dropped.
drop: str = "last" # options: "last", "first", "most_frequent"
"last" (default): preserves current behaviour — drops the last category alphabetically.
"first": drops the first category alphabetically.
"most_frequent": drops the most frequent category found during fit(), which can be a more statistically meaningful reference group.
If drop="most_frequent" and multiple categories share the highest frequency, the transformer should issue a UserWarning and fall back to dropping the first category found.
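To make the three options concrete, here is a minimal sketch of how the dropped category could be chosen for a single variable. The helper name select_dropped_category is hypothetical and not part of the library. The sketch also assumes that "first category found" means the first tied category in alphabetical order, which is only one possible reading of the tie-breaking rule.

```python
import warnings
from collections import Counter


def select_dropped_category(values, drop="last"):
    """Return the category to drop for one variable (hypothetical helper)."""
    if drop not in ("last", "first", "most_frequent"):
        raise ValueError(
            f"drop must be 'last', 'first' or 'most_frequent', got {drop!r}"
        )

    # Categories sorted alphabetically, matching the ordering described above.
    categories = sorted(set(values))
    if drop == "last":
        return categories[-1]
    if drop == "first":
        return categories[0]

    # drop == "most_frequent": count occurrences seen during fit.
    counts = Counter(values)
    top = max(counts.values())
    tied = [c for c in categories if counts[c] == top]
    if len(tied) > 1:
        # Assumption: on a tie, fall back to the first tied category
        # in alphabetical order.
        warnings.warn(
            f"Categories {tied} share the highest frequency; "
            f"dropping {tied[0]!r}.",
            UserWarning,
        )
    return tied[0]
```

For example, for the values ["b", "a", "c", "a"], drop="last" selects "c", drop="first" selects "a", and drop="most_frequent" selects "a".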
The existing drop_last parameter should remain for backward compatibility, but a deprecation warning should be issued if it is used together with the new drop parameter.
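One way the two parameters could interact is sketched below. This is only an illustration under several assumptions that the proposal does not settle: the helper resolve_drop is hypothetical, drop uses None as a sentinel so that an explicit value can be detected, and drop takes precedence when both parameters are given.

```python
import warnings


def resolve_drop(drop_last=False, drop=None):
    """Return the effective drop rule, or None if no category is dropped.

    Hypothetical helper. Assumptions: drop=None means "not set by the user",
    and an explicit drop value wins over drop_last.
    """
    if drop is not None and drop_last:
        warnings.warn(
            "drop_last is deprecated when used together with drop; "
            "the value of drop is used.",
            DeprecationWarning,
            stacklevel=2,
        )
        return drop
    if drop is not None:
        return drop
    # Backward-compatible path: drop_last=True keeps today's behaviour.
    return "last" if drop_last else None
```

With this sketch, resolve_drop(drop_last=True) still yields "last", so existing code keeps working, while resolve_drop(drop_last=True, drop="first") yields "first" and emits a DeprecationWarning.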
Describe alternatives you've considered
Users can currently control the reference category only by manually reordering or preprocessing the categorical values before applying the encoder. However, this is inconvenient and error-prone, especially in larger pipelines.
Additional context
Adding this parameter would make OneHotEncoder more flexible and align it better with common machine learning workflows, particularly when building statistical or linear models where the reference category is important.