Skip to content

Give precises types to all **opts argument, optimize ensure_list()#7393

Merged
georgesittas merged 36 commits intotobymao:mainfrom
OutSquareCapital:unpacked-args
Mar 31, 2026
Merged

Give precises types to all **opts argument, optimize ensure_list()#7393
georgesittas merged 36 commits intotobymao:mainfrom
OutSquareCapital:unpacked-args

Conversation

@OutSquareCapital
Copy link
Copy Markdown
Contributor

@OutSquareCapital OutSquareCapital commented Mar 25, 2026

This PR is the natural next step of my previous two.

Now that I'm more familiarised with the codebase and that all **opts have a base annotation, I could finally start giving them precise type hints.
To do this, a hierarchy of TypedDict has been created in _typing.py module, and typing_extensions.Unpack is used at each call site.
Unfortunately, this type of "brutal" change (from Any | object to T) can't be easily localized and often require a lot of changes at various places, hence the PR size.

This PR also bring performance improvements.

  • ensure_list() is called 22 times, but had an inefficient implementation -> doing list(x) when x is already one will copy it.
    Now it simply return it directly, vastly improving the efficiency. This function was also incorrectly typed, and thus this required various typing adjustements at other places (only list and tuple containers handled, instead of Collection).
  • Expr.isin method internal t.cast call is moved from the elements to the container. Python call overhead is a real thing, so each small improvement like this help.

Yes the TypedDict hierarchy is a bit daunting at first (naming could be better..), but unless there's a refactor toward a builder pattern, this is the only way to manage various default arguments across different functions AND **kwargs.
OFC, another solution could be to have kwargs for the public API, and then create dataclasses from them for internals.
Both of those solution would avoid a TypedDict hierarchy, and would improve performance (repeated unpacking of arguments, especially kwargs, is relatively expensive)

All in all, strong typing across all public API entry points is a big improvement for the end user, who can now directly see all possible options in his IDE, with early warnings from LSP and type checkers in case of incorrect values.

In VSCode with basedpyright for example:

image
No overloads for "parse_one" match the provided arguments
  Argument types: (Literal[''], Literal[5])basedpyright[reportCallIssue](https://docs.basedpyright.com/v1.38.3/configuration/config-files/#reportCallIssue)

Question for the next ones

I saw that most of the time t.Union and t.Optional were used, which is what I prioritized in my own work.
However, I saw already existing annotations with the modern syntax, e.g str | int | None instead of t.Optional[t.Union[str, int]]
I have a strong preference for the modern one. What should be the convention?

@geooo109
Copy link
Copy Markdown
Collaborator

/benchmark

@georgesittas
Copy link
Copy Markdown
Collaborator

@OutSquareCapital thanks for the PR- seems like we'll need some time to get to reviewing this. Thanks for the patience.

@georgesittas georgesittas self-assigned this Mar 27, 2026
Copy link
Copy Markdown
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OutSquareCapital did a quick first pass. Can you take a look and see if we can simplify this a but more before I review everything?

- deleted dead code
- deleted "table" existence from the TypedDict hierarchy
@georgesittas
Copy link
Copy Markdown
Collaborator

Let me know when this is ready for another review.

@OutSquareCapital
Copy link
Copy Markdown
Contributor Author

Let me know when this is ready for another review.

Checks passed, ready

Copy link
Copy Markdown
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress @OutSquareCapital.

Comment on lines -1128 to +1138
def tokenize(self, sql: str, **opts: object) -> list[Token]:
return self.tokenizer(**opts).tokenize(sql)
def tokenize(self, sql: str, dialect: DialectType = None, **opts: object) -> list[Token]:
return self.tokenizer(dialect=dialect, **opts).tokenize(sql)

def tokenizer(self, **opts: t.Any) -> Tokenizer:
return self.tokenizer_class(**{"dialect": self, **opts})
def tokenizer(self, dialect: DialectType = None, **opts: object) -> Tokenizer:
return self.tokenizer_class(dialect=dialect or self, **opts)

def jsonpath_tokenizer(self, **opts: t.Any) -> JSONPathTokenizer:
return self.jsonpath_tokenizer_class(**{"dialect": self, **opts})
def jsonpath_tokenizer(self, dialect: DialectType = None) -> JSONPathTokenizer:
return self.jsonpath_tokenizer_class(dialect=dialect or self)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tokenizer's constructor appears to expect an **ops but it's currently dead code. Only the dialect kwarg is used. This is a good candidate for cleaning up, post-merge.

Copy link
Copy Markdown
Contributor Author

@OutSquareCapital OutSquareCapital Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually used at one place, in the Athena Dialect.
But then it call super().tokenize() and pass it the same updated kwargs, who will anyway be ignored. So there's this place to clean up as well.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for pointing it out. Looks like a remnant from the recent refactors.

@OutSquareCapital
Copy link
Copy Markdown
Contributor Author

OutSquareCapital commented Mar 30, 2026

FYI, I adressed all your latest comments and done the needed changes, except the maybe_parse signature.

@georgesittas
Copy link
Copy Markdown
Collaborator

A couple of final comments + need to fix CI, then this should be good to go. Thank you for the quick turnaround.

@OutSquareCapital
Copy link
Copy Markdown
Contributor Author

A couple of final comments + need to fix CI, then this should be good to go. Thank you for the quick turnaround.

Yep. I'm trying to work on the maybe_parse signature to adress the NoneType and "necessary overloads" comments but it's a bit ... challenging. Will notify when I'm done

- Use `isinstance` check in `Expression.isin` instead of type ignore coment
- Revert `convert` and `maybe_parse` to `copy=False` by default
- Annotate new_join as `t.Any` to try to make it pass checks
- clean up parse_one overloads, revert them mostly to what it was originally
- allow to delete `if into is None` branch in `maybe_parse` function body
@OutSquareCapital
Copy link
Copy Markdown
Contributor Author

OutSquareCapital commented Mar 31, 2026

Okay! all checks are passing and I have reworked the maybe_parse and parse_one overloads to make them as close as possible as the current main branch. See the last 3 commits.
It's important to note that currently the second overload of maybe parse don't work as intended. It will return Any when sql_or_expression is an ExpOrStr AND when into is None.
This is why I had to annotate new_join here
This should be inferred at least as Expr and never as Any, but changing things here immediatly create +50 errors.
This would be better adressed in another PR IMO, as this is a pre-existing issue who is made apparent by the improved typing coverage from this PR.

@georgesittas
Copy link
Copy Markdown
Collaborator

Ok, just one more small comment from me, LGTM otherwise.

This #7393 (comment) should be a small lift. Good candidate for a follow-up PR, if you want to take it on.

Thanks a bunch for these improvements, great work!

@OutSquareCapital
Copy link
Copy Markdown
Contributor Author

Ok, just one more small comment from me, LGTM otherwise.

This #7393 (comment) should be a small lift. Good candidate for a follow-up PR, if you want to take it on.

Thanks a bunch for these improvements, great work!

Thanks for the kind words!
I'll take it on then.

By the way, if it's ok I will continue with follow-ups on typing PR's.
If at the same time this can help end users, dev time, mypc compilation, and also teach me how to use best this library, it's a win-win I guess :)
Just, I'll try to really limit the size of the next ones, should it mean more PR's at the end.
I realized that a few small ones are easier to review, and to act on said reviews, than a big one!
Oh and before we forget, what would be the answer to the question I asked in the description about Union types?

@georgesittas
Copy link
Copy Markdown
Collaborator

Just, I'll try to really limit the size of the next ones, should it mean more PR's at the end.
I realized that a few small ones are easier to review, and to act on said reviews, than a big one!

Definitely agreed. Let's do that moving forward, and thanks for proactively suggesting it.

Oh and before we forget, what would be the answer to the question I asked in the description about Union types?

I saw that most of the time t.Union and t.Optional were used, which is what I prioritized in my own work. However, I saw already existing annotations with the modern syntax, e.g str | int | None instead of t.Optional[t.Union[str, int]]. I have a strong preference for the modern one. What should be the convention?

Is there a PEP for what to prefer? I'm assuming the "modern syntax", as you refer to it, is the proper way to type unions? If so, we can do that in a follow-up, sure.

@georgesittas georgesittas merged commit 7a91128 into tobymao:main Mar 31, 2026
8 checks passed
@OutSquareCapital OutSquareCapital deleted the unpacked-args branch March 31, 2026 17:49
@OutSquareCapital
Copy link
Copy Markdown
Contributor Author

OutSquareCapital commented Mar 31, 2026

Just, I'll try to really limit the size of the next ones, should it mean more PR's at the end.
I realized that a few small ones are easier to review, and to act on said reviews, than a big one!

Definitely agreed. Let's do that moving forward, and thanks for proactively suggesting it.

Oh and before we forget, what would be the answer to the question I asked in the description about Union types?

I saw that most of the time t.Union and t.Optional were used, which is what I prioritized in my own work. However, I saw already existing annotations with the modern syntax, e.g str | int | None instead of t.Optional[t.Union[str, int]]. I have a strong preference for the modern one. What should be the convention?

Is there a PEP for what to prefer? I'm assuming the "modern syntax", as you refer to it, is the proper way to type unions? If so, we can do that in a follow-up, sure.

Yes, ,more infos here:
https://docs.python.org/3/library/stdtypes.html#types-union
and there:
https://peps.python.org/pep-0604/

By the way with a few more rules enabled Ruff would allow to clean up and standardize this across the codebase at once, and for other type syntaxes also. But that's another story/decision.
EDIT:
After some testing, only typing valid at python 3.9 should be used.
This entails my changes of collections.abc/builtins instead of typing (e.g Sequence, dict), but NOT Union and Optional

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants