[aws batch] Overlays (e.g. --augur) in uploaded workdir ZIP include too much #421

@tsibley

Elsewhere, @jameshadfield wrote:

Using --augur ~/github/nextstrain/augur (as per #419 (comment)) works, but it takes 20 min to upload (because an in-use augur repo balloons to 675MB). #295 should help here, or excluding certain paths (our docs, .mypy_cache, etc.).

I replied with some potential solutions:

Perhaps runner.aws_batch.s3.upload_workdir needs to be extended to a) use .gitignore files for the overlay volumes (but not the workdir itself) and/or b) support a Nextstrain-specific ignore file (e.g. .nextstrain-ignore) which could be applied everywhere (as suggested in 6d465f0).

I looked into this a bit more, and we could use git check-ignore (conditional on git being available) to filter paths for upload. It could be invoked one-at-a-time on each path, which would be simplest to integrate but slow, or be fed a stream of paths on stdin while we read its stdout concurrently, which would be more complex to integrate but fast. This seemed promising and would be transparent, Just Working™ without a new ignores file or intervention from anyone.
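
For illustration, here's a minimal sketch of the faster, stream-fed variant. It is not code from the CLI: the function name `git_ignored` is made up for this example, and it leans on `subprocess.run()` to write stdin and read stdout concurrently rather than managing the pipes by hand. The `--stdin` and `-z` options and the 0/1 exit statuses are documented behaviour of `git check-ignore`.

```python
import os
import shutil
import subprocess
from pathlib import Path
from typing import Iterable, Optional, Set


def git_ignored(repo: Path, paths: Iterable[str]) -> Optional[Set[str]]:
    """
    Return the subset of *paths* (relative to *repo*) which git ignores,
    or None if git isn't available.

    Hypothetical helper for illustration only.
    """
    if shutil.which("git") is None:
        return None

    # NUL-delimit the paths (-z) so unusual filenames survive the round trip.
    stdin = b"".join(os.fsencode(p) + b"\0" for p in paths)

    # subprocess.run() feeds stdin and drains stdout concurrently (via
    # communicate()), so long path lists can't deadlock on a full pipe.
    result = subprocess.run(
        ["git", "check-ignore", "--stdin", "-z"],
        cwd=repo,
        input=stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

    # Exit status 0 means some paths are ignored, 1 means none are;
    # anything else is an error (e.g. *repo* isn't a git repository).
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.decode(errors="replace"))

    return {os.fsdecode(p) for p in result.stdout.split(b"\0") if p}
```

Something like `git_ignored(Path("~/github/nextstrain/augur").expanduser(), relative_paths)` would then give the set of paths to skip. Note that by default `git check-ignore` doesn't report tracked files, even if they match an ignore pattern.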

In some ad-hoc testing, though, I realized a big caveat: for overlay purposes, we actually need some of the files ignored by git, e.g. Python packaging metadata in nextstrain_augur.egg-info/ for Augur (for the installed augur to locate its entrypoint) and dist/ for Auspice (for the transpiled code served by auspice view). The way I realized this was by routinely deleting all ignored files with git clean -fXd and then noticing (to my surprise) that an overlay no longer worked.

This makes git ignores entirely inappropriate for use as excludes in overlays, I think. It's basically the same reason we wouldn't apply git ignores to workdirs themselves: build-time artifacts not suitable for version control are still important at execution time.

That leaves us with the new feature of Nextstrain-specific ignore files, which nicely enough can be applied to both overlay sources and workdirs alike. Implementation will require a fair bit of new complexity, but I don't see any major algorithm questions or uncertainty. Biggest questions are about design/interface, perhaps:

  1. What's the filename we use? .nextstrain-ignore? .nextstrain-exclude?
  2. Is there anywhere else this ignore file would be used that we should be taking into account?
  3. If not (2), perhaps we should make the filename (1) more specific to AWS Batch or Nextstrain CLI?
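
To make the idea concrete, here's a rough sketch of how such an ignore file might be applied when collecting files for upload. Everything in it is an assumption for discussion: the `.nextstrain-ignore` filename is just one of the candidates above, the helper names are invented, and the pattern semantics are deliberately simplified to `fnmatch`-style globs rather than full gitignore rules (no negation, no anchoring).

```python
from fnmatch import fnmatchcase
from pathlib import Path, PurePosixPath
from typing import Iterator, List

# Hypothetical filename; the actual name is an open design question.
IGNORE_FILENAME = ".nextstrain-ignore"


def read_patterns(root: Path) -> List[str]:
    """Read glob patterns from the ignore file in *root*, if there is one."""
    ignore_file = root / IGNORE_FILENAME
    if not ignore_file.is_file():
        return []
    patterns = []
    for line in ignore_file.read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if line and not line.startswith("#"):
            # Treat "docs/" the same as "docs" in this simplified sketch.
            patterns.append(line.rstrip("/"))
    return patterns


def is_excluded(relpath: str, patterns: List[str]) -> bool:
    """
    True if *relpath*, or any directory leading to it, matches a pattern
    either by its path relative to the root or by its bare name.
    """
    parts = PurePosixPath(relpath).parts
    # e.g. "docs", "docs/build", "docs/build/index.html"
    prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(
        fnmatchcase(prefix, pattern)
        or fnmatchcase(prefix.rsplit("/", 1)[-1], pattern)
        for prefix in prefixes
        for pattern in patterns
    )


def files_to_upload(root: Path) -> Iterator[Path]:
    """Yield files under *root* which aren't excluded by its ignore file."""
    patterns = read_patterns(root)
    for path in sorted(root.rglob("*")):
        relpath = path.relative_to(root).as_posix()
        if path.is_file() and not is_excluded(relpath, patterns):
            yield path
```

With an ignore file in the Augur checkout listing `docs/` and `.mypy_cache/`, this would skip those trees while still uploading `nextstrain_augur.egg-info/`, since nothing is excluded unless explicitly listed. The same function could be pointed at a workdir or at an overlay source.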

And of course, maybe there's another option/solution to consider.
