
Support parallel extraction for zip files #366

@tillig


We have a somewhat unique situation where, when nodeenv downloads Node and then calls extractall, the extraction can take three to four hours. This is due to various security scanners we are required to use, combined with the fact that extractall is a synchronous, one-file-at-a-time operation. Note this is on Windows, so there isn't currently an option to use the system Node (which would also solve a lot of problems).

Specifically, we're running into this in the context of using pre-commit, which sets up a separate Node environment via nodeenv for each Node-based hook. If you have four or five Node-based hooks, that means pre-commit can take up to a day to initialize, and when it's time to update a hook... be prepared to spend some time.

I downloaded a single version of the Node.js zip file locally just to test the differences. I replicated the download_node_src method (more or less) and had it extract the files the way nodeenv does today.

Running this script takes three hours to finish extracting for me.

import operator
import re
import zipfile

def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"
    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
        archive.extractall(src_dir, extract_list)

if __name__ == '__main__':
    main()
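For anyone who wants to reproduce the comparison, here is a minimal timing harness I could have used (the synthetic archive, file names, and sizes here are placeholders, not the real Node zip); it isolates the extractall call being timed:

```python
# Hypothetical timing harness: builds a small synthetic zip in a temp
# directory and times ZipFile.extractall, the same call nodeenv makes.
import os
import tempfile
import time
import zipfile


def timed_extractall(zip_path, dest):
    """Extract the whole archive and return the elapsed seconds."""
    start = time.perf_counter()
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return time.perf_counter() - start


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, 'sample.zip')
        # Write a synthetic archive standing in for node-v22.4.1-win-x64.zip.
        with zipfile.ZipFile(zip_path, 'w') as archive:
            for i in range(100):
                archive.writestr('node-v22.4.1-win-x64/file%d.txt' % i, 'x' * 1024)
        elapsed = timed_extractall(zip_path, os.path.join(tmp, 'out'))
        print('extracted in %.3fs' % elapsed)
```

On a machine with an on-access security scanner, this is where the per-file overhead shows up; without one, the synthetic archive extracts almost instantly.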

I found an interesting blog article that explained how to use ThreadPoolExecutor to unzip in parallel. With that approach the extraction takes three minutes, because the security scanner can scan files in parallel alongside the thread pool. In the example below the pool is set to 100 threads; increasing it to 200 roughly halves the time again, to about 90 seconds.

import operator
import re
import zipfile
from concurrent.futures import ThreadPoolExecutor

def unzip_file(handle, filename, path):
    handle.extract(filename, path)

def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip', 'r')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"
    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
        with ThreadPoolExecutor(100) as exe:
            futures = [
                exe.submit(unzip_file, archive, m, src_dir)
                for m in extract_list
            ]
            for future in futures:
                future.result()  # re-raise any extraction errors

if __name__ == '__main__':
    main()

I'm curious if this project would be interested in a pull request to update the zip file extraction to work in parallel. I'm not a huge Python developer but I'd be happy to give it a go.
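To make the proposal concrete, the change could be as small as a helper along these lines. A sketch, not nodeenv's actual code; the name, signature, and default worker count are my own:

```python
# Hypothetical drop-in replacement for ZipFile.extractall that extracts
# members concurrently. Worker exceptions are re-raised in the caller so
# failures are not silently swallowed.
import zipfile
from concurrent.futures import ThreadPoolExecutor


def parallel_extractall(archive, dest, members=None, max_workers=100):
    """Extract `members` of an open ZipFile to `dest` using a thread pool.

    `members` defaults to every entry in the archive, matching the
    semantics of ZipFile.extractall(dest, members).
    """
    if members is None:
        members = archive.namelist()
    with ThreadPoolExecutor(max_workers=max_workers) as exe:
        futures = [exe.submit(archive.extract, m, dest) for m in members]
        for future in futures:
            future.result()  # propagate the first failure, if any
```

The call site in download_node_src would then change from archive.extractall(src_dir, extract_list) to parallel_extractall(archive, src_dir, extract_list), possibly with the worker count configurable or a sequential fallback as a safety valve.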
