
Support parallel extraction for zip files #366

@tillig


We have a somewhat unique situation where, when nodeenv downloads Node and then calls extractall, the extraction can take three to four hours. This is due to various security scanners we are required to use, combined with the fact that extractall is a synchronous, one-file-at-a-time operation. Note this is on Windows, so there isn't currently an option to use the system Node (which would also solve a lot of problems).

Specifically, we're running into this in the context of using pre-commit, which sets up a separate Node environment via nodeenv for each Node-based hook. If you have four or five Node-based hooks, that means pre-commit can take up to a day to initialize, and when it's time to update a hook... be prepared to spend some time.

I downloaded a single version of the Node.js zip file locally just to test the differences. I replicated the download_node_src method (more or less) and had it extract the files the way nodeenv does today.

Running this script takes three hours to finish extracting for me.

import operator
import re
import zipfile

def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"
    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
        archive.extractall(src_dir, extract_list)

if __name__ == '__main__':
    main()
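For anyone who wants to reproduce the comparison, here is a minimal timing harness I could have used (the synthetic archive, file names, and sizes here are placeholders, not the real Node zip); it isolates the extractall call being timed:

```python
# Hypothetical timing harness: builds a small synthetic zip in a temp
# directory and times ZipFile.extractall, the same call nodeenv makes.
import os
import tempfile
import time
import zipfile


def timed_extractall(zip_path, dest):
    """Extract the whole archive and return the elapsed seconds."""
    start = time.perf_counter()
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return time.perf_counter() - start


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, 'sample.zip')
        # Write a synthetic archive standing in for node-v22.4.1-win-x64.zip.
        with zipfile.ZipFile(zip_path, 'w') as archive:
            for i in range(100):
                archive.writestr('node-v22.4.1-win-x64/file%d.txt' % i, 'x' * 1024)
        elapsed = timed_extractall(zip_path, os.path.join(tmp, 'out'))
        print('extracted in %.3fs' % elapsed)
```

On a machine with an on-access security scanner, this is where the per-file overhead shows up; without one, the synthetic archive extracts almost instantly.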

I found an interesting blog article that explained how to use ThreadPoolExecutor to unzip in parallel. With that approach the extraction takes three minutes, because the security scanner can scan files in parallel alongside the thread pool. In the example below the pool is set to 100 threads; increasing it to 200 roughly halves the time again, to about 90 seconds.

import operator
import re
import zipfile
from concurrent.futures import ThreadPoolExecutor

def unzip_file(handle, filename, path):
    handle.extract(filename, path)

def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip', 'r')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"
    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
        with ThreadPoolExecutor(100) as exe:
            futures = [
                exe.submit(unzip_file, archive, m, src_dir)
                for m in extract_list
            ]
            for future in futures:
                future.result()  # re-raise any extraction errors

if __name__ == '__main__':
    main()

I'm curious if this project would be interested in a pull request to update the zip file extraction to work in parallel. I'm not a huge Python developer but I'd be happy to give it a go.
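To make the proposal concrete, the change could be as small as a helper along these lines. A sketch, not nodeenv's actual code; the name, signature, and default worker count are my own:

```python
# Hypothetical drop-in replacement for ZipFile.extractall that extracts
# members concurrently. Worker exceptions are re-raised in the caller so
# failures are not silently swallowed.
import zipfile
from concurrent.futures import ThreadPoolExecutor


def parallel_extractall(archive, dest, members=None, max_workers=100):
    """Extract `members` of an open ZipFile to `dest` using a thread pool.

    `members` defaults to every entry in the archive, matching the
    semantics of ZipFile.extractall(dest, members).
    """
    if members is None:
        members = archive.namelist()
    with ThreadPoolExecutor(max_workers=max_workers) as exe:
        futures = [exe.submit(archive.extract, m, dest) for m in members]
        for future in futures:
            future.result()  # propagate the first failure, if any
```

The call site in download_node_src would then change from archive.extractall(src_dir, extract_list) to parallel_extractall(archive, src_dir, extract_list), possibly with the worker count configurable or a sequential fallback as a safety valve.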
