-
Notifications
You must be signed in to change notification settings - Fork 21
Use duckdb loading throughout pyprophet #131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
since duckdb is now a dependency and does joining of sql tables more efficiently use it more widespread throughout pyprophet
since sql queries to no guarentee order, change from duckdb to regular sql in this method as it is used in testing
|
I am not sure why the tests are not being conducted. |
|
Ok I think the tests are passing just not appearing in this PR for some reason |
singjc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review. Thanks for implementing and propagating the use of duckgb across the modules. It definitely seems much better and faster at retrieving the data.
I just had a few questions:
- Do we still need to import and have sqlite3 connections, or can we fully remove those?
- What is the purpose of the duckdb extension?
I haven't tested the modules, but I assume they should all still work if it's just simply swapping out the way the data is retrieved. Otherwise everything looks good.
| @@ -1,16 +1,21 @@ | |||
| import pandas as pd | |||
| import numpy as np | |||
| import sqlite3 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need sqlite3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point probably for export we don't need sqlite3
| from .data_handling import write_scores_sql_command | ||
| from .report import plot_scores | ||
|
|
||
| ## ensure proper extension installed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a bit more info on why this extension is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duckdb base extension does not come with sqlite3. To install duckdb with sqlite3 after duckdb is installed something like db.install_extension("sqlite3") needs to be executed or else the program crashes.
However, this command cannot be run in singularity containers because you are modifying the container.
A workaround is to install the sqlite3 extension with pip (community extension) which is done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mm, how stable is that? I tried to install the community extension on cedar, but it fails to find a matching distribution
$ pip install duckdb-extension-sqlite-scanner
Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/x86-64-v3, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
ERROR: Could not find a version that satisfies the requirement duckdb-extension-sqlite-scanner (from versions: none)
ERROR: No matching distribution found for duckdb-extension-sqlite-scanner
If I try install from the wheel and modify the filename (since the manylinux does not get recognized), then it fails when trying to load the extension
extension_importer.import_extension("sqlite_scanner")
File "/project/6011811/singjust/bin/py310/lib/python3.10/site-packages/duckdb_extensions/extension_importer.py", line 24, in import_extension
duckdb.sql(f"INSTALL '{extension_file}'")
duckdb.duckdb.IOException: IO Error: Failed to install '/project/6011811/singjust/bin/py310/lib/python3.10/site-packages/duckdb_extension_sqlite_scanner/extensions/v1.2.0/sqlite_scanner.duckdb_extension'
The file was built for the platform 'linux_amd64_gcc4', but we can only load extensions built for platform 'linux_amd64'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how stable it is but I have had some problems in the past with duckdb on the cluster. I guess just the way it is built
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mm I think we will have to put this on hold for now, or implement it as an extra feature if someone wants to use a version of pyprophet with duckdb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok sounds good. When I have time I will implement duckdb as a default and sqlite3 as a fallback.
|
Will close this, as the recent PR #142 covers this. |
With export-parquet, we have an additional dependency of duckdb which allows for fast SQL queries, especially those involving lots of joins.
Here roll out duckdb SQL queries in pyprophet for greater data loading efficiency.
Examples
Conducted on dell XPS ubuntu
Export Command
time pyprophet export --in=39041_Hela_500ng_15SPD_DIA_Py3_1_S2-A7_1_4502.oswOld timings:
real 0m56.284s
user 0m35.997s
sys 0m15.130s
New timings:
real 0m12.832s
user 0m40.578s
sys 0m8.378s
Score Command
time pyprophet score --in=39041_Hela_500ng_15SPD_DIA_Py3_1_S2-A7_1_4502.osw --ss_num_iter=1Old Timings:
real 0m59.466s
user 1m30.275s
sys 0m11.004s
New timings:
real 0m30.482s
user 1m21.186s
sys 0m9.460s