Snakemake Storage Plugin: Cached HTTP

A Snakemake storage plugin for downloading files via HTTP with local caching, checksum verification, and adaptive rate limiting.

Supported sources:

zenodo.org - Zenodo data repository (checksum from API)
data.pypsa.org - PyPSA data repository (checksum from manifest.yaml)

Features

Local caching: Downloads are cached to avoid redundant transfers (can be disabled)
Checksum verification: Automatically verifies checksums (from Zenodo API or data.pypsa.org manifests)
Rate limit handling: Automatically respects Zenodo's rate limits using X-RateLimit-* headers with exponential backoff retry
Concurrent download control: Limits simultaneous downloads to prevent overwhelming Zenodo
Progress bars: Shows download progress with tqdm
Immutable URLs: Returns mtime=0 since Zenodo URLs are persistent
Environment variable support: Configure via environment variables for CI/CD workflows

Installation

From the pypsa-eur repository root:

pip install -e plugins/snakemake-storage-plugin-cached-http

Configuration

The cached HTTP storage plugin works alongside other storage providers (such as the standard HTTP plugin). Snakemake routes each URL to the correct provider based on the URL pattern.

Register additional settings in your Snakefile if you want to customize the defaults:

# Optional: Configure cached HTTP storage with custom settings
# This extends the existing storage configuration (e.g., for HTTP)
storage cached_http:
    provider="cached-http",
    cache="~/.cache/snakemake-pypsa-eur",  # Default location on Linux
    max_concurrent_downloads=3,  # Download max 3 files at once

If you don't explicitly configure it, the plugin will use default settings automatically.

Settings

cache (optional): Cache directory for downloaded files
- Default: Platform-dependent user cache directory (via platformdirs.user_cache_dir("snakemake-pypsa-eur"))
- Set to "" (empty string) to disable caching
- Files are cached here to avoid re-downloading
- Environment variable: SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE
skip_remote_checks (optional): Skip metadata checking with remote API
- Default: False (perform checks)
- Set to True or "1" to skip remote existence/size checks (useful for CI/CD)
- Environment variable: SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS
max_concurrent_downloads (optional): Maximum concurrent downloads
- Default: 3
- Controls how many files can be downloaded simultaneously
- No environment variable support
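
The semantics of the two environment variables above can be illustrated with a small sketch. This is not the plugin's actual code (Snakemake and the plugin resolve settings internally); the helper names `resolve_cache_dir` and `resolve_skip_remote_checks` are hypothetical, and accepting the string "true" case-insensitively is an assumption:

```python
import os
from typing import Mapping, Optional

CACHE_ENV = "SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE"
SKIP_ENV = "SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS"


def resolve_cache_dir(
    default_dir: str, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the cache directory to use, or None if caching is disabled.

    Hypothetical helper: an unset variable falls back to the default,
    an empty string disables caching, anything else is used as the path.
    """
    value = env.get(CACHE_ENV)
    if value is None:
        return os.path.expanduser(default_dir)
    if value == "":
        return None
    return os.path.expanduser(value)


def resolve_skip_remote_checks(env: Mapping[str, str] = os.environ) -> bool:
    """Interpret "1" (or, by assumption, "true") as a request to skip checks."""
    return env.get(SKIP_ENV, "").strip().lower() in {"1", "true"}
```

For example, `resolve_cache_dir("/tmp/default", env={CACHE_ENV: ""})` returns `None`, which mirrors the documented "empty string disables caching" behaviour.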

Usage

Use Zenodo or data.pypsa.org URLs directly in your rules. Snakemake automatically detects supported URLs and routes them to this plugin:

rule download_zenodo:
    input:
        storage("https://zenodo.org/records/3520874/files/natura.tiff"),
    output:
        "resources/natura.tiff"
    shell:
        "cp {input} {output}"

rule download_pypsa:
    input:
        storage("https://data.pypsa.org/workflows/eur/eez/v12_20231025/World_EEZ_v12_20231025_LR.zip"),
    output:
        "resources/eez.zip"
    shell:
        "cp {input} {output}"

Or if you configured a tagged storage entity:

rule download_data:
    input:
        storage.cached_http(
            "https://zenodo.org/records/3520874/files/natura.tiff"
        ),
    output:
        "resources/natura.tiff"
    shell:
        "cp {input} {output}"

The plugin will:

Check if the file exists in the cache (if caching is enabled)
If cached, copy from cache (fast)
If not cached, download with:
- Progress bar showing download status
- Automatic rate limit handling with exponential backoff retry
- Concurrent download limiting
- Checksum verification (from Zenodo API or data.pypsa.org manifest)
Store in cache for future use (if caching is enabled)
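
The steps above can be sketched roughly as follows. This is a minimal illustration, not the plugin's implementation: the function names, the `host/path` cache layout, the injected `download` callable, and the use of MD5 as the checksum algorithm are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Optional
from urllib.parse import urlparse


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a URL to a path inside the cache (hypothetical layout: host/path)."""
    parsed = urlparse(url)
    return cache_dir / parsed.netloc / parsed.path.lstrip("/")


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def retrieve(
    url: str,
    target: Path,
    cache_dir: Optional[Path],
    download: Callable[[str, Path], None],
    expected_md5: Optional[str] = None,
) -> str:
    """Fetch url into target; return "cache" or "download" to show the source."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # cache_dir=None stands in for "caching disabled".
    cached = cache_path_for(url, cache_dir) if cache_dir is not None else None

    # Serve from the cache when the file is already there.
    if cached is not None and cached.exists():
        shutil.copy2(cached, target)
        return "cache"

    # Otherwise download and verify the checksum if one is known.
    download(url, target)
    if expected_md5 is not None and md5_of(target) != expected_md5:
        target.unlink()
        raise ValueError(f"Checksum mismatch for {url}")

    # Populate the cache for future runs.
    if cached is not None:
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(target, cached)
    return "download"
```

Passing the downloader in as a callable keeps the sketch independent of any HTTP library; progress bars, rate limiting, and concurrency limits would live inside that callable.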

Example: CI/CD Configuration

For continuous integration environments where you want to skip caching and remote checks:

# GitHub Actions example
- name: Run snakemake workflows
  env:
    SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE: ""
    SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS: "1"
  run: |
    snakemake --cores all

Rate Limiting and Retry

Zenodo API limits:

Guest users: 60 requests/minute
Authenticated users: 100 requests/minute

The plugin automatically:

Monitors X-RateLimit-Remaining header
Waits when rate limit is reached
Uses X-RateLimit-Reset to calculate wait time
Retries failed requests with exponential backoff (up to 5 attempts)
Handles transient errors: HTTP errors, timeouts, checksum mismatches, and network issues
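
The behaviour listed above can be approximated with the sketch below. It is an illustration under stated assumptions, not the plugin's code: the function names are hypothetical, `X-RateLimit-Reset` is assumed to carry a Unix timestamp in seconds, and the one-second base delay with doubling is an arbitrary choice for the example.

```python
import time
from typing import Callable, List, Mapping, TypeVar

T = TypeVar("T")


def seconds_until_reset(headers: Mapping[str, str], now: float) -> float:
    """Return how long to wait before the next request (0.0 if no wait needed).

    Assumes X-RateLimit-Reset is a Unix timestamp and that header names
    are given with exactly this capitalisation.
    """
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    return max(0.0, float(reset) - now)


def backoff_delays(attempts: int = 5, base: float = 1.0) -> List[float]:
    """Delays between attempts: base, 2*base, 4*base, ... (attempts - 1 values)."""
    return [base * 2.0**i for i in range(attempts - 1)]


def with_retry(
    func: Callable[[], T],
    attempts: int = 5,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponential backoff."""
    delays = backoff_delays(attempts, base)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # A real implementation would catch only transient error types.
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
    raise RuntimeError("attempts must be at least 1")
```

With five attempts and a base of one second, the waits between attempts would be 1, 2, 4, and 8 seconds. The `sleep` parameter is injectable so the logic can be exercised without actually waiting.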

URL Handling

Handles URLs from zenodo.org, sandbox.zenodo.org, and data.pypsa.org
Other HTTP(S) URLs are handled by the standard snakemake-storage-plugin-http
Both plugins can coexist in the same workflow
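
A host-based check along these lines would produce the routing described above. This is only a sketch; the names `SUPPORTED_HOSTS` and `is_supported_url` are hypothetical, and the plugin's real matching logic may differ:

```python
from urllib.parse import urlparse

# Hosts listed in this section as handled by the cached HTTP plugin.
SUPPORTED_HOSTS = {"zenodo.org", "sandbox.zenodo.org", "data.pypsa.org"}


def is_supported_url(url: str) -> bool:
    """Return True if the URL is HTTP(S) and points at a supported host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in SUPPORTED_HOSTS
```

Matching on the parsed hostname rather than on a substring avoids accepting look-alike URLs such as `https://example.com/zenodo.org/file`.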

Plugin Priority

When using storage() without specifying a plugin name, Snakemake checks all installed plugins:

Cached HTTP plugin: Only accepts zenodo.org and data.pypsa.org URLs
HTTP plugin: Accepts all HTTP/HTTPS URLs (including zenodo.org)

If both plugins are installed, supported URLs would be ambiguous because both plugins accept them. Snakemake would normally raise the error "Multiple suitable storage providers found" when storage() is used without naming a plugin, which would force you to call the Cached HTTP provider explicitly with storage.cached_http(url) instead of storage(url). To avoid this, the plugin monkey-patches the HTTP plugin so that it refuses zenodo.org and data.pypsa.org URLs.

License

MIT License
