fix: avoid NLTK download race condition in parallel test workers#477
fix: avoid NLTK download race condition in parallel test workers#477
Conversation
Try loading stopwords before downloading to prevent concurrent pytest-xdist workers from racing on nltk.download(), which raises FileExistsError when multiple processes write to the same path.
There was a problem hiding this comment.
Pull request overview
This PR fixes a race condition that occurs when multiple pytest-xdist workers attempt to download NLTK stopwords data concurrently. The fix implements a "try-load-first" pattern that checks if the data is already available before attempting to download it, preventing FileExistsError exceptions when multiple processes write to the same path simultaneously.
Changes:
- Wrapped NLTK stopwords loading in a try-except block to check if data exists before downloading
- Applied the fix consistently in both
_get_stopwordsand_set_stopwordsmethods
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| redisvl/utils/full_text_query_helper.py | Updated _get_stopwords to try loading stopwords before downloading, preventing race conditions in parallel test workers |
| redisvl/query/query.py | Updated _set_stopwords in TextQuery class with the same race condition fix |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
vishal-bala
left a comment
There was a problem hiding this comment.
LGTM! I'd guess there's technically still a race condition in that
- two workers could try to start downloading at the same time, or
- a worker could try to load a partially downloaded dataset, fail, and then decide to start downloading it as well
But these are both fairly niche and unlikely, and probably not worth designing for.
Try loading stopwords before downloading to prevent concurrent pytest-xdist workers from racing on nltk.download(), which raises FileExistsError when multiple processes write to the same path.