
Using sql_tokenize() table function for tokenization. #338

@Dtenwolde

Description

What happens?

Original issue here: https://github.com/duckdblabs/duckdb-internal/issues/6733

duckdb/duckdb#20171 introduced a new table function, sql_tokenize, which takes a query as input and returns its tokens along with their categories.
Currently the Python client uses a custom implementation somewhere near PyTokenize (see here). I think this can now be moved over to the table function instead, which would simplify the client a bit and make it possible to expose more information about tokenization in the future.

To Reproduce

See above

OS:

MacOS

DuckDB Package Version:

latest

Python Version:

3.14?

Full Name:

Daniel ten Wolde

Affiliation:

DuckDB Labs

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
