Description
Elsewhere, @jameshadfield wrote:
Using --augur ~/github/nextstrain/augur (as per #419 (comment)) works but it takes 20min to upload (because an in-use augur repo size balloons to 675MB). #295 should help here, or excluding certain paths (our docs, .mypy_cache etc).
I replied with some potential solutions:
Perhaps runner.aws_batch.s3.upload_workdir needs to be extended to a) use .gitignore files for the overlay volumes (but not the workdir itself) and/or b) support a Nextstrain-specific ignore file (e.g. .nextstrain-ignore) which could be applied everywhere (as suggested in 6d465f0).
I looked into this a bit more, and we could use git check-ignore (conditional on git being available) to filter paths for upload. It could be invoked one-at-a-time on each path, which would be simplest to integrate but slow, or be fed a stream of paths on stdin while we read its stdout concurrently, which would be more complex to integrate but fast. This seemed promising and would be transparent, Just Working™ without a new ignores file or intervention from anyone.
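For reference, a rough sketch of the batch (stdin-fed) variant, not a proposed implementation. The helper name `git_ignored` is made up for illustration; the only real interface assumed is `git check-ignore --stdin -z`. It leans on `subprocess.run(..., input=...)` to write stdin and read stdout concurrently rather than hand-rolling that part.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Iterable, Set


def git_ignored(repo: Path, paths: Iterable[str]) -> Set[str]:
    """
    Return the subset of *paths* (relative to *repo*) that git would ignore.

    Hypothetical helper for illustration.  Returns an empty set when git
    isn't installed, so callers simply filter nothing in that case.
    """
    paths = list(paths)
    if not paths or shutil.which("git") is None:
        return set()

    # -z makes both input and output NUL-delimited, so unusual filenames
    # survive the round trip.  subprocess.run() with input= feeds stdin and
    # drains stdout concurrently, avoiding a pipe deadlock on large inputs.
    result = subprocess.run(
        ["git", "check-ignore", "--stdin", "-z"],
        cwd=repo,
        input="\0".join(paths).encode() + b"\0",
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

    # Exit status 0: at least one path ignored; 1: none ignored; else: error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.decode(errors="replace"))

    return {p.decode() for p in result.stdout.split(b"\0") if p}
```

Everything is sent in one batch here, which is simpler than true streaming but still a single git invocation for all paths instead of one per path.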
In some ad-hoc testing, though, I realized a big caveat: for overlay purposes, we actually need some of the files ignored by git, e.g. Python packaging metadata in nextstrain_augur.egg-info/ for Augur (for the installed augur to locate its entrypoint) and dist/ for Auspice (for the transpiled code served by auspice view). The way I realized this was by routinely deleting all ignored files with git clean -fXd and then noticing (to my surprise) that an overlay no longer worked.
This makes git ignores entirely inappropriate for use as excludes in overlays, I think. And for basically the same reason we wouldn't apply git ignores to workdirs themselves: build-time artifacts not suitable for version control are important at execution time.
That leaves us with the new feature of Nextstrain-specific ignore files, which nicely enough can be applied to both overlay sources and workdirs alike. Implementation will require a fair bit of new complexity, but I don't see any major algorithm questions or uncertainty. Biggest questions are about design/interface, perhaps:
1. What's the filename we use? .nextstrain-ignore? .nextstrain-exclude?
2. Is there anywhere else this ignore file would be used that we should be taking into account?
3. If not (2), perhaps we should make the filename (1) more specific to AWS Batch or Nextstrain CLI?
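To make the ignore-file idea a bit more concrete, here's a minimal sketch of what the filtering could look like. Everything in it is an assumption: the function names are hypothetical, and the pattern semantics are much simpler than real gitignore rules (shell-style globs only, no negation, no anchoring, and `*` also matches across `/`). A pattern excludes a path if it matches the path itself or any of its parent directories, either by name or by full relative path.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, Iterator, List


def parse_ignore_file(text: str) -> List[str]:
    """Read patterns, one per line, skipping blank lines and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_excluded(path: str, patterns: List[str]) -> bool:
    """True if *path* or any of its parent directories matches a pattern."""
    p = PurePosixPath(path)
    for candidate in [p, *p.parents]:
        if str(candidate) == ".":
            continue  # the root of the upload itself is never excluded
        for pattern in patterns:
            pat = pattern.rstrip("/")  # treat "docs/" the same as "docs"
            if fnmatchcase(candidate.name, pat) or fnmatchcase(str(candidate), pat):
                return True
    return False


def filter_upload(paths: Iterable[str], patterns: List[str]) -> Iterator[str]:
    """Yield only the paths that should still be uploaded."""
    return (p for p in paths if not is_excluded(p, patterns))


# Example with a made-up ignore file for an Augur overlay.
patterns = parse_ignore_file("# not needed at runtime\ndocs/\n.mypy_cache\n*.pyc\n")
kept = list(filter_upload(
    [
        "augur/__init__.py",
        "docs/index.rst",
        ".mypy_cache/3.8/x.json",
        "augur/__pycache__/a.pyc",
        "nextstrain_augur.egg-info/entry_points.txt",
    ],
    patterns,
))
```

In this example the egg-info metadata is kept because nothing in the ignore file names it, which is the behaviour git ignores couldn't give us.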
And of course, maybe there's another option/solution to consider.