Give precises types to all **opts argument, optimize `ensure_list()` by OutSquareCapital · Pull Request #7393 · tobymao/sqlglot

OutSquareCapital · 2026-03-25T16:11:31Z

This PR is the natural next step after my previous two.

Now that I'm more familiar with the codebase and all **opts have a base annotation, I could finally start giving them precise type hints.
To do this, a hierarchy of TypedDict classes has been created in the _typing.py module, and typing_extensions.Unpack is used to annotate **opts in each function signature.
Unfortunately, this kind of "brutal" change (from Any | object to T) can't easily be localized and often requires a lot of changes in various places, hence the PR size.
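
To illustrate the TypedDict plus Unpack pattern described above, here is a minimal sketch. The class names, option names, and the `render` function are hypothetical and are not taken from `_typing.py`; the sketch only shows how an inherited TypedDict can describe `**opts`.

```python
import typing as t

try:
    from typing import Unpack  # available in the stdlib from Python 3.11
except ImportError:
    from typing_extensions import Unpack  # backport for older interpreters


class ParserOpts(t.TypedDict, total=False):
    # Hypothetical option names, for illustration only.
    dialect: str
    error_level: str


class GeneratorOpts(ParserOpts, total=False):
    # A child TypedDict inherits every key of its parent.
    pretty: bool
    indent: int


def render(sql: str, **opts: Unpack[GeneratorOpts]) -> str:
    # Type checkers now know the exact keys and value types of **opts.
    body = sql.strip()
    if opts.get("pretty", False):
        return " " * opts.get("indent", 2) + body
    return body
```

With this annotation, a call such as `render("SELECT 1", pretty="yes")` or `render("SELECT 1", prety=True)` should be reported by a type checker, while runtime behaviour is unchanged because TypedDict adds no runtime validation.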

This PR also brings performance improvements.

ensure_list() is called 22 times but had an inefficient implementation: calling list(x) when x is already a list copies it.
Now it simply returns the list directly, which avoids that copy. The function was also incorrectly typed (only list and tuple containers are handled, not every Collection), so this required typing adjustments in other places.
The internal t.cast call in the Expr.isin method is moved from the elements to the container. Python call overhead is real, so each small improvement like this helps.
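
The difference between the two ensure_list() behaviours can be sketched as follows. This is an assumption-based illustration of the idea described above, not the exact code from the PR; `ensure_list_copying` is a hypothetical name for the previous behaviour.

```python
import typing as t

T = t.TypeVar("T")


def ensure_list_copying(value: t.Any) -> t.List[t.Any]:
    # Sketch of the previous behaviour: lists and tuples are both rebuilt.
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)  # copies even when value is already a list
    return [value]


def ensure_list(value: t.Union[T, t.List[T], t.Tuple[T, ...], None]) -> t.List[T]:
    # Sketch of the optimized behaviour: an existing list is returned as-is.
    if value is None:
        return []
    if isinstance(value, list):
        return value  # no copy
    if isinstance(value, tuple):
        return list(value)  # a tuple still has to be converted
    return [t.cast(T, value)]  # t.cast is a no-op at runtime


items = [1, 2, 3]
same = ensure_list(items)
copied = ensure_list_copying(items)
```

One trade-off of returning the input list is aliasing: `same` and `items` are the same object, so a caller that mutates the result also mutates the argument.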

Yes, the TypedDict hierarchy is a bit daunting at first (the naming could be better), but unless there's a refactor toward a builder pattern, this is the only way to manage varying default arguments across different functions AND **kwargs.
Of course, another solution could be to keep kwargs for the public API and then create dataclasses from them for internal use.
Both of those solutions would avoid a TypedDict hierarchy and would improve performance (repeated unpacking of arguments, especially kwargs, is relatively expensive).
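
The "kwargs at the boundary, dataclass internally" alternative mentioned above could look roughly like this. Everything here (`GenerateOptions`, `generate`, `_generate`, and the option names) is hypothetical and only sketches the shape of the idea.

```python
import typing as t
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerateOptions:
    # Hypothetical option names, for illustration only.
    dialect: t.Optional[str] = None
    pretty: bool = False
    indent: int = 2


def _generate(sql: str, options: GenerateOptions) -> str:
    # Internal code receives one typed object instead of re-unpacking **opts.
    body = sql.strip()
    return (" " * options.indent + body) if options.pretty else body


def generate(sql: str, **opts: t.Any) -> str:
    # Public entry point: kwargs are converted once, at the boundary.
    # An unknown key raises TypeError from the dataclass constructor.
    return _generate(sql, GenerateOptions(**opts))
```

Defaults live in one place (the dataclass fields), and internal helpers pass a single object along rather than unpacking and repacking keyword arguments at every level.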

All in all, strong typing across all public API entry points is a big improvement for end users, who can now see all possible options directly in their IDE, with early warnings from the LSP and type checkers when a value is incorrect.

In VSCode with basedpyright for example:

No overloads for "parse_one" match the provided arguments
  Argument types: (Literal[''], Literal[5])
  basedpyright [reportCallIssue] (https://docs.basedpyright.com/v1.38.3/configuration/config-files/#reportCallIssue)

Question for the next ones

I saw that t.Union and t.Optional are used most of the time, so that is what I prioritized in my own work.
However, I also saw existing annotations with the modern syntax, e.g. str | int | None instead of t.Optional[t.Union[str, int]].
I have a strong preference for the modern one. What should the convention be?


geooo109 · 2026-03-26T14:01:31Z

/benchmark

georgesittas · 2026-03-27T14:01:23Z

@OutSquareCapital thanks for the PR. It seems like we'll need some time to get to reviewing this. Thanks for your patience.

georgesittas

@OutSquareCapital did a quick first pass. Can you take a look and see if we can simplify this a bit more before I review everything?

sqlglot/expressions/dml.py

sqlglot/_typing.py


georgesittas · 2026-03-27T17:22:48Z

Let me know when this is ready for another review.

OutSquareCapital · 2026-03-27T18:01:59Z

> Let me know when this is ready for another review.

Checks passed, ready

georgesittas

Great progress @OutSquareCapital.

georgesittas · 2026-03-30T13:25:46Z

sqlglot/dialects/dialect.py

-    def tokenize(self, sql: str, **opts: object) -> list[Token]:
-        return self.tokenizer(**opts).tokenize(sql)
+    def tokenize(self, sql: str, dialect: DialectType = None, **opts: object) -> list[Token]:
+        return self.tokenizer(dialect=dialect, **opts).tokenize(sql)

-    def tokenizer(self, **opts: t.Any) -> Tokenizer:
-        return self.tokenizer_class(**{"dialect": self, **opts})
+    def tokenizer(self, dialect: DialectType = None, **opts: object) -> Tokenizer:
+        return self.tokenizer_class(dialect=dialect or self, **opts)

-    def jsonpath_tokenizer(self, **opts: t.Any) -> JSONPathTokenizer:
-        return self.jsonpath_tokenizer_class(**{"dialect": self, **opts})
+    def jsonpath_tokenizer(self, dialect: DialectType = None) -> JSONPathTokenizer:
+        return self.jsonpath_tokenizer_class(dialect=dialect or self)


The tokenizer's constructor appears to expect **opts, but it's currently dead code. Only the dialect kwarg is used. This is a good candidate for cleaning up, post-merge.

It's actually used in one place, in the Athena dialect.
But it then calls super().tokenize() and passes it the same updated kwargs, which will be ignored anyway. So that place needs cleaning up as well.

I see, thanks for pointing it out. Looks like a remnant from the recent refactors.
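
The behavioural difference between the two call styles in the diff above can be sketched with stand-in classes. `SketchTokenizer` and `SketchDialect` are hypothetical and only mimic the shape of the real methods.

```python
import typing as t


class SketchTokenizer:
    # Stand-in for a tokenizer class that only accepts a dialect.
    def __init__(self, dialect: t.Optional["SketchDialect"] = None) -> None:
        self.dialect = dialect


class SketchDialect:
    tokenizer_class = SketchTokenizer

    def tokenizer_old(self, **opts: t.Any) -> SketchTokenizer:
        # Dict merge: any key is accepted by the signature, and an explicit
        # dialect=None from the caller overrides the default of self.
        return self.tokenizer_class(**{"dialect": self, **opts})

    def tokenizer(self, dialect: t.Optional["SketchDialect"] = None) -> SketchTokenizer:
        # Explicit keyword: the accepted arguments are visible in the
        # signature, and None falls back to self.
        return self.tokenizer_class(dialect=dialect or self)


d, other = SketchDialect(), SketchDialect()
```

The explicit form avoids building a temporary dict on every call and lets a type checker reject unknown keyword arguments before the code runs.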

sqlglot/expressions/builders.py

sqlglot/expressions/core.py

sqlglot/expressions/query.py

OutSquareCapital · 2026-03-30T19:06:37Z

FYI, I addressed all your latest comments and made the needed changes, except for the maybe_parse signature.


sqlglot/expressions/builders.py

sqlglot/expressions/core.py

georgesittas · 2026-03-31T12:49:01Z

A couple of final comments + need to fix CI, then this should be good to go. Thank you for the quick turnaround.

OutSquareCapital · 2026-03-31T13:04:59Z

> A couple of final comments + need to fix CI, then this should be good to go. Thank you for the quick turnaround.

Yep. I'm trying to work on the maybe_parse signature to address the NoneType and "necessary overloads" comments, but it's a bit... challenging. Will notify when I'm done.


OutSquareCapital · 2026-03-31T15:39:25Z

Okay! All checks are passing and I have reworked the maybe_parse and parse_one overloads to make them as close as possible to the current main branch. See the last 3 commits.
It's important to note that the second overload of maybe_parse currently doesn't work as intended. It returns Any when sql_or_expression is an ExpOrStr AND into is None.
This is why I had to annotate new_join here.
This should be inferred at least as Expr and never as Any, but changing things here immediately creates 50+ errors.
This would be better addressed in another PR IMO, as it is a pre-existing issue that is made apparent by the improved typing coverage from this PR.

sqlglot/expressions/core.py

georgesittas · 2026-03-31T17:20:31Z

Ok, just one more small comment from me, LGTM otherwise.

This #7393 (comment) should be a small lift. Good candidate for a follow-up PR, if you want to take it on.

Thanks a bunch for these improvements, great work!

OutSquareCapital · 2026-03-31T17:45:31Z

> Ok, just one more small comment from me, LGTM otherwise.
>
> This #7393 (comment) should be a small lift. Good candidate for a follow-up PR, if you want to take it on.
>
> Thanks a bunch for these improvements, great work!

Thanks for the kind words!
I'll take it on then.

By the way, if it's OK, I will continue with follow-up typing PRs.
If this can help end users, dev time, and mypyc compilation, and also teach me how best to use this library, it's a win-win I guess :)
I'll just try to really limit the size of the next ones, even if it means more PRs in the end.
I realized that a few small ones are easier to review, and to act on said reviews, than a big one!
Oh, and before we forget, what would be the answer to the question I asked in the description about Union types?

georgesittas · 2026-03-31T17:48:26Z

> Just, I'll try to really limit the size of the next ones, should it mean more PR's at the end.
> I realized that a few small ones are easier to review, and to act on said reviews, than a big one!

Definitely agreed. Let's do that moving forward, and thanks for proactively suggesting it.

> Oh and before we forget, what would be the answer to the question I asked in the description about Union types?

> I saw that most of the time t.Union and t.Optional were used, which is what I prioritized in my own work. However, I saw already existing annotations with the modern syntax, e.g str | int | None instead of t.Optional[t.Union[str, int]]. I have a strong preference for the modern one. What should be the convention?

Is there a PEP for what to prefer? I'm assuming the "modern syntax", as you refer to it, is the proper way to type unions? If so, we can do that in a follow-up, sure.

OutSquareCapital · 2026-03-31T19:46:00Z

> > Just, I'll try to really limit the size of the next ones, should it mean more PR's at the end.
> > I realized that a few small ones are easier to review, and to act on said reviews, than a big one!
>
> Definitely agreed. Let's do that moving forward, and thanks for proactively suggesting it.
>
> > Oh and before we forget, what would be the answer to the question I asked in the description about Union types?
>
> > I saw that most of the time t.Union and t.Optional were used, which is what I prioritized in my own work. However, I saw already existing annotations with the modern syntax, e.g str | int | None instead of t.Optional[t.Union[str, int]]. I have a strong preference for the modern one. What should be the convention?
>
> Is there a PEP for what to prefer? I'm assuming the "modern syntax", as you refer to it, is the proper way to type unions? If so, we can do that in a follow-up, sure.

Yes, more info here:
https://docs.python.org/3/library/stdtypes.html#types-union
and here:
https://peps.python.org/pep-0604/

By the way, with a few more rules enabled, Ruff could clean up and standardize this across the codebase at once, and other type syntaxes as well. But that's another story/decision.
EDIT:
After some testing, only typing constructs that are valid on Python 3.9 should be used.
This permits my switch to collections.abc/builtins instead of typing (e.g. Sequence, dict), but NOT replacing Union and Optional.
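
A small sketch of the Python 3.9 constraint mentioned in the EDIT above. The alias names and the helper function are hypothetical; the point is only which spellings can be evaluated at runtime on which interpreter.

```python
import collections.abc as cabc
import sys
import typing as t

# PEP 585 generics: builtins and collections.abc are subscriptable on 3.9+.
Names = cabc.Sequence[str]
Counts = dict[str, int]

# Portable union spelling: valid on every supported interpreter.
MaybeKey = t.Optional[t.Union[str, int]]


def pep604_works_at_runtime() -> bool:
    # PEP 604 unions between plain classes are only evaluable on 3.10+;
    # on 3.9 the | operator between two types raises TypeError.
    try:
        eval("str | int | None")
    except TypeError:
        return False
    return True


supported = pep604_works_at_runtime()
```

On 3.9, `from __future__ import annotations` can defer evaluation of annotations, but that does not help for unions evaluated at runtime, such as type aliases or arguments to `t.cast`.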

OutSquareCapital added 15 commits March 25, 2026 01:06

WIP: adding typedDict for opts arguments across the codebase

f1e97c0

core builders with generic return

df5dee4

added needed type annotations

96186fc

added overload to to_bool helper func, narrowed the into arg type…

deaffa8

…s in core funcs

fixed various uncorrect "table" arguments

0ea8bea

continuing on typing fixes

f7aeeae

continuation of typing fixes

43ba1bd

type narrowing + substantial optimization for ensure_list helper:

e832891

- it was only checking for tuple or list, `Collection` was way too broad - uncessary full copy of the list in case it was already one

apply_list_builder don't handle Sequence

2990263

align cast usage

c75b223

ensure_list refactor:

78f5936

- avoid overloads with simple type union - narrow the accepted types at various places This allows to delete a few `t.cast` calls

mypy is bad to infer nested unions

10bb57d

import issues and hive arg to tokenizer fixs

b9e56ed

ruff format

3a43bdc

Ruff check fix

f672e95

georgesittas self-assigned this Mar 27, 2026

georgesittas reviewed Mar 27, 2026

sqlglot/expressions/dml.py (resolved)

sqlglot/expressions/dml.py (resolved)

sqlglot/_typing.py (outdated, resolved)

sqlglot/_typing.py (outdated, resolved)

sqlglot/_typing.py (outdated, resolved)

OutSquareCapital added 2 commits March 27, 2026 18:06

Merge branch 'tobymao:main' into unpacked-args

414691a

fixs:

4d7ffba

- deleted dead code - deleted "table" existence from the TypedDict hierarchy

refactor: make copy argument explicit everywhere

42c2de2

Merge branch 'tobymao:main' into unpacked-args

654b458

georgesittas reviewed Mar 30, 2026


OutSquareCapital added 4 commits March 30, 2026 19:43

fixs various review issues

4eab549

Merge branch 'main' into unpacked-args

3e24858

linter fix

c90c0cf

revert back to the original maybe_parse overload structure

41920af

OutSquareCapital added 6 commits March 30, 2026 21:15

try with re-reverting change for mypc

0d7af5d

reduce unecessary imports duplication

9c5be5f

refactor:

af74b20

- re-revert back to original overloads in maybe_parse - improve Select.join internal typing, avoid creating a new dict just to pass kwargs

fix: shouldn't have passed the "dialect" arg to new_join

2a0d270

fix: last commit was wrong. reverting back to original code, but with…

781a0ca

… correct annotation of parse_args

widen type union of new_join

d1cb085

georgesittas reviewed Mar 31, 2026

sqlglot/expressions/builders.py (outdated, resolved)

sqlglot/expressions/core.py (outdated, resolved)

inline msg for ValueError path in _apply_child_list_builder

31ec85d

OutSquareCapital added 5 commits March 31, 2026 15:23

refactor: maybe_parse, convert, and update are now copy=True …

f3be4b0

…by default

fixs:

a2c3a60

- Use `isinstance` check in `Expression.isin` instead of type ignore coment - Revert `convert` and `maybe_parse` to `copy=False` by default - Annotate new_join as `t.Any` to try to make it pass checks

fixs:

5e6e685

- clean up parse_one overloads, revert them mostly to what it was originally - allow to delete `if into is None` branch in `maybe_parse` function body

maybe_parse can return a generic Expression ONLY IF `sql_or_expressio…

3e07593

…n` is NOT already an `Expr`

try to make checks pass with Join instead of t.Any

d5c310e

Merge branch 'main' into unpacked-args

39441f5

georgesittas approved these changes Mar 31, 2026

sqlglot/expressions/core.py (resolved)

georgesittas merged commit 7a91128 into tobymao:main Mar 31, 2026
8 checks passed

OutSquareCapital deleted the unpacked-args branch March 31, 2026 17:49

This was referenced Mar 31, 2026

Chore: clean tokenizer/dialect code from **opts, and minor typing improvements #7422

Merged

typing: improve core expression typing coverage #7424

Merged
