Skip to content

Conversation

@jcharkow
Copy link
Contributor

@jcharkow jcharkow commented Dec 4, 2024

With export-parquet, we have an additional dependency of duckdb which allows for fast SQL queries, especially those involving lots of joins.

Here roll out duckdb SQL queries in pyprophet for greater data loading efficiency.

Examples

Conducted on dell XPS ubuntu

Export Command

time pyprophet export --in=39041_Hela_500ng_15SPD_DIA_Py3_1_S2-A7_1_4502.osw

Old timings:
real 0m56.284s
user 0m35.997s
sys 0m15.130s

New timings:
real 0m12.832s
user 0m40.578s
sys 0m8.378s

Score Command

  • Only 1 iteration so most of the time showcased is loading the data
    time pyprophet score --in=39041_Hela_500ng_15SPD_DIA_Py3_1_S2-A7_1_4502.osw --ss_num_iter=1

Old Timings:
real 0m59.466s
user 1m30.275s
sys 0m11.004s

New timings:
real 0m30.482s
user 1m21.186s
sys 0m9.460s

since duckdb is now a dependency and does joining of sql tables more
efficiently use it more widespread throughout pyprophet
since sql queries to no guarentee order, change from duckdb to regular
sql in this method as it is used in testing
@jcharkow
Copy link
Contributor Author

jcharkow commented Dec 4, 2024

I am not sure why the tests are not being conducted.

@jcharkow
Copy link
Contributor Author

jcharkow commented Dec 4, 2024

Ok I think the tests are passing just not appearing in this PR for some reason

Copy link
Contributor

@singjc singjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. Thanks for implementing and propagating the use of duckgb across the modules. It definitely seems much better and faster at retrieving the data.

I just had a few questions:

  1. Do we still need to import and have sqlite3 connections, or can we fully remove those?
  2. What is the purpose of the duckdb extension?

I haven't tested the modules, but I assume they should all still work if it's just simply swapping out the way the data is retrieved. Otherwise everything looks good.

@@ -1,16 +1,21 @@
import pandas as pd
import numpy as np
import sqlite3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need sqlite3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point probably for export we don't need sqlite3

from .data_handling import write_scores_sql_command
from .report import plot_scores

## ensure proper extension installed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a bit more info on why this extension is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duckdb base extension does not come with sqlite3. To install duckdb with sqlite3 after duckdb is installed something like db.install_extension("sqlite3") needs to be executed or else the program crashes.

However, this command cannot be run in singularity containers because you are modifying the container.
A workaround is to install the sqlite3 extension with pip (community extension) which is done here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm, how stable is that? I tried to install the community extension on cedar, but it fails to find a matching distribution

$ pip install duckdb-extension-sqlite-scanner
Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/x86-64-v3, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
ERROR: Could not find a version that satisfies the requirement duckdb-extension-sqlite-scanner (from versions: none)
ERROR: No matching distribution found for duckdb-extension-sqlite-scanner

If I try install from the wheel and modify the filename (since the manylinux does not get recognized), then it fails when trying to load the extension

extension_importer.import_extension("sqlite_scanner")
  File "/project/6011811/singjust/bin/py310/lib/python3.10/site-packages/duckdb_extensions/extension_importer.py", line 24, in import_extension
    duckdb.sql(f"INSTALL '{extension_file}'")
duckdb.duckdb.IOException: IO Error: Failed to install '/project/6011811/singjust/bin/py310/lib/python3.10/site-packages/duckdb_extension_sqlite_scanner/extensions/v1.2.0/sqlite_scanner.duckdb_extension'
The file was built for the platform 'linux_amd64_gcc4', but we can only load extensions built for platform 'linux_amd64'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how stable it is but I have had some problems in the past with duckdb on the cluster. I guess just the way it is built

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm I think we will have to put this on hold for now, or implement it as an extra feature if someone wants to use a version of pyprophet with duckdb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good. When I have time I will implement duckdb as a default and sqlite3 as a fallback.

@singjc
Copy link
Contributor

singjc commented Jun 4, 2025

Will close this, as the recent PR #142 covers this.

@singjc singjc closed this Jun 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants