Description
We have a somewhat unusual situation: when nodeenv downloads Node and then calls extractall, the extraction can take three to four hours. This is caused by the various security scanners we are required to use, combined with the fact that extractall extracts files synchronously, one at a time. Note this is on Windows, so there isn't currently an option to use the system Node (which would also solve a lot of problems).
Specifically, we're running into this with pre-commit, which sets up a separate Node environment using nodeenv for each Node-based hook. With four or five Node-based hooks, it can take up to a day to get pre-commit initialized, and when it's time to update a hook... be prepared to spend some time.
I downloaded a single version of the Node.js zip file locally to test the difference. I (basically) replicated the download_node_src method and had it extract the files the way it works now.
Running this script takes three hours to finish extracting for me.
```python
import operator
import re
import zipfile


def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"

    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        # Skip the top-level README, CHANGELOG and LICENSE files.
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
        # Sequential extraction: one file at a time.
        archive.extractall(src_dir, extract_list)


if __name__ == '__main__':
    main()
```

I found this interesting blog article that explained how to use ThreadPoolExecutor to unzip in parallel. This lets me unzip in three minutes, because the security scanner can do its thing in parallel across the thread pool. In the example below I have it set to 100 threads. If I increase that to 200 threads, the time is cut roughly in half, to about 90 seconds.
```python
import operator
import re
import zipfile
from concurrent.futures import ThreadPoolExecutor


def unzip_file(handle, filename, path):
    handle.extract(filename, path)


def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip', 'r')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"

    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
        # The pool is used inside the archive's `with` block so the
        # zip file stays open until every worker has finished.
        with ThreadPoolExecutor(100) as exe:
            futures = [
                exe.submit(unzip_file, archive, m, src_dir)
                for m in extract_list
            ]
            # result() re-raises any exception from a worker, so a
            # failed extraction is not silently ignored.
            for future in futures:
                future.result()


if __name__ == '__main__':
    main()
```

I'm curious whether this project would be interested in a pull request that updates the zip file extraction to work in parallel. I'm not a huge Python developer, but I'd be happy to give it a go.
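To give a rough idea of what such a change might look like, here is a minimal sketch of a reusable helper. The name `extract_parallel` and the `max_workers` parameter are hypothetical and not part of nodeenv. The sketch assumes a trusted archive whose member names are plain relative paths (as in the Node.js release zip), and it creates all directories up front so that worker threads never have to race to create the same directory.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor


def extract_parallel(zip_path, dest_dir, members=None, max_workers=32):
    """Extract members of zip_path into dest_dir using a thread pool.

    Hypothetical helper for illustration; not part of nodeenv.
    Assumes member names are plain relative paths from a trusted archive.
    Returns the list of extracted file paths.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = list(archive.namelist() if members is None else members)

        # Create every needed directory single-threaded, before any
        # worker starts, so threads do not compete to create them.
        for name in names:
            parent = os.path.join(dest_dir, os.path.dirname(name))
            os.makedirs(parent, exist_ok=True)

        # Directory entries end with "/" and were handled above.
        files = [n for n in names if not n.endswith("/")]

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # list() drains the results, which re-raises any exception
            # that occurred in a worker thread.
            return list(pool.map(lambda n: archive.extract(n, dest_dir), files))
```

In the scripts above, the call would replace `archive.extractall(src_dir, extract_list)` with something like `extract_parallel('node-v22.4.1-win-x64.zip', src_dir, extract_list, max_workers=100)`. The best worker count likely depends on the machine and the scanner, so it is left as a parameter rather than hard-coded.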