diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..a42de91 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,57 @@ + + +# Changelog + +## Unreleased — **Breaking change** + +This release accepts all HTTP(S) URLs and therefore conflicts with `snakemake-storage-plugin-http`. +**You must uninstall `snakemake-storage-plugin-http`** before upgrading, otherwise Snakemake will +raise *"Multiple suitable storage providers found"* for any HTTP(S) URL. + +### Added + +- Generic HTTP(S) fallback: any `http://` or `https://` URL is now accepted, with size and + mtime read from `Content-Length` and `Last-Modified` response headers. Servers that do not + support `HEAD` requests are handled gracefully (size and mtime default to 0). No checksum + is available for generic URLs. + +### Removed + +- Dependency on `snakemake-storage-plugin-http` — this plugin now handles all HTTP(S) URLs + directly, with no monkey-patching required. + +## v0.4.0 — Google Cloud Storage support + +### Added + +- Support for `storage.googleapis.com` URLs with checksum verification via the GCS JSON API + (`md5Hash` field) and mtime from GCS object metadata. + +## v0.3.0 — data.pypsa.org support + +### Added + +- Support for `data.pypsa.org` URLs with checksum verification via `manifest.yaml` files + discovered by searching up the directory tree. +- Redirect support: manifest entries can specify a `redirect` field to point to another path. + +## v0.2.0 — Dynamic versioning and zstd support + +### Added + +- Dynamic versioning via `setuptools-scm`. +- `zstandard` dependency for decompressing Cloudflare-compressed responses. + +## v0.1.0 — Initial release + +### Added + +- Snakemake storage plugin for Zenodo URLs (`zenodo.org`, `sandbox.zenodo.org`) with: + - Local filesystem caching via `Cache` class + - Checksum verification from Zenodo API + - Adaptive rate limiting using `X-RateLimit-*` headers with exponential backoff retry + - Concurrent download limiting via semaphore + - Progress bars with `tqdm-loggable` diff --git a/README.md b/README.md index 31074e1..d21cd3f 100644 --- a/README.md +++ b/README.md @@ -11,6 +11,7 @@ A Snakemake storage plugin for downloading files via HTTP with local caching, ch - **zenodo.org** - Zenodo data repository (checksum from API) - **data.pypsa.org** - PyPSA data repository (checksum from manifest.yaml) - **storage.googleapis.com** - Google Cloud Storage (checksum from GCS JSON API) +- **any http(s) URL** - Generic fallback with size/mtime from HTTP headers ## Features @@ -19,7 +20,7 @@ A Snakemake storage plugin for downloading files via HTTP with local caching, ch - **Rate limit handling**: Automatically respects Zenodo's rate limits using `X-RateLimit-*` headers with exponential backoff retry - **Concurrent download control**: Limits simultaneous downloads to prevent overwhelming servers - **Progress bars**: Shows download progress with tqdm -- **Immutable URLs**: Returns mtime=0 for Zenodo and data.pypsa.org (persistent URLs); uses actual mtime for GCS +- **Immutable URLs**: Returns mtime=0 for Zenodo and data.pypsa.org (persistent URLs); uses actual mtime for GCS and generic HTTP - **Environment variable support**: Configure via environment variables for CI/CD workflows ## Installation @@ -67,7 +68,7 @@ If you don't explicitly configure it, the plugin will use default settings autom ## Usage -Use Zenodo, data.pypsa.org, or Google Cloud Storage URLs directly in your rules. Snakemake automatically detects supported URLs and routes them to this plugin: +Use any HTTP(S) URL directly in your rules. Snakemake automatically routes all HTTP(S) URLs to this plugin: ```python rule download_zenodo: @@ -93,6 +94,14 @@ rule download_gcs: "resources/cba_projects.zip" shell: "cp {input} {output}" + +rule download_generic: + input: + storage("https://example.com/data/dataset.csv"), + output: + "resources/dataset.csv" + shell: + "cp {input} {output}" ``` Or if you configured a tagged storage entity: @@ -116,7 +125,7 @@ The plugin will: - Progress bar showing download status - Automatic rate limit handling with exponential backoff retry - Concurrent download limiting - - Checksum verification (from Zenodo API, data.pypsa.org manifest, or GCS metadata) + - Checksum verification where available (Zenodo API, data.pypsa.org manifest, GCS metadata) 4. Store in cache for future use (if caching is enabled) ### Example: CI/CD Configuration @@ -148,19 +157,19 @@ The plugin automatically: ## URL Handling -- Handles URLs from `zenodo.org`, `sandbox.zenodo.org`, `data.pypsa.org`, and `storage.googleapis.com` -- Other HTTP(S) URLs are handled by the standard `snakemake-storage-plugin-http` -- Both plugins can coexist in the same workflow - -### Plugin Priority +This plugin accepts **all HTTP(S) URLs** and replaces `snakemake-storage-plugin-http`. It provides +enhanced support for specific sources: -When using `storage()` without specifying a plugin name, Snakemake checks all installed plugins: -- **Cached HTTP plugin**: Only accepts zenodo.org, data.pypsa.org, and storage.googleapis.com URLs -- **HTTP plugin**: Accepts all HTTP/HTTPS URLs (including zenodo.org) +| Source | Checksum | mtime | Immutable | +|---|---|---|---| +| `zenodo.org`, `sandbox.zenodo.org` | ✓ (from API) | — | ✓ | +| `data.pypsa.org` | ✓ (from manifest.yaml) | — | ✓ | +| `storage.googleapis.com` | ✓ (from GCS API) | ✓ | — | +| any other HTTP(S) | — | ✓ (Last-Modified) | — | -If both plugins are installed, supported URLs would be ambiguous - both plugins accept them. -Typically snakemake would raise an error: **"Multiple suitable storage providers found"** if you try to use `storage()` without specifying which plugin to use, ie. one needs to explicitly call the Cached HTTP provider using `storage.cached_http(url)` instead of `storage(url)`, -but we monkey-patch the http plugin to refuse zenodo.org, data.pypsa.org, and storage.googleapis.com URLs. +Generic HTTP URLs are treated as mutable: size and mtime are read from `Content-Length` and +`Last-Modified` response headers. Servers that do not support `HEAD` requests are handled +gracefully (size and mtime default to 0). ## License diff --git a/pyproject.toml b/pyproject.toml index b389c28..934441d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -17,10 +17,9 @@ dependencies = [ "httpx ~= 0.27", "platformdirs ~= 4.0", "reretry ~= 0.11", - "snakemake-interface-common ~= 1.14", + "snakemake-interface-common >=1.14,<2.0", "snakemake-interface-storage-plugins >=4.2,<5.0", - "snakemake-storage-plugin-http ~= 0.3", - "tqdm-loggable ~= 0.2", + "tqdm-loggable ~= 0.3", "typing-extensions ~= 4.15", "zstandard ~=0.25.0", ] diff --git a/src/snakemake_storage_plugin_cached_http/__init__.py b/src/snakemake_storage_plugin_cached_http/__init__.py index 52fec1c..c5dd22e 100644 --- a/src/snakemake_storage_plugin_cached_http/__init__.py +++ b/src/snakemake_storage_plugin_cached_http/__init__.py @@ -11,10 +11,12 @@ from contextlib import asynccontextmanager from dataclasses import dataclass, field from datetime import datetime +from email.utils import parsedate_to_datetime from logging import Logger from pathlib import Path from posixpath import basename, dirname, join, normpath, relpath -from urllib.parse import quote, urlparse +from typing import cast +from urllib.parse import ParseResult, quote, urlparse import httpx import platformdirs @@ -35,8 +37,8 @@ from tqdm_loggable.auto import tqdm from typing_extensions import override +from . import monkeypatch # noqa: F401 from .cache import Cache -from .monkeypatch import is_pypsa_or_zenodo_url logger = get_logger() @@ -183,19 +185,25 @@ def example_queries(cls) -> list[ExampleQuery]: description="A Google Cloud Storage file URL", type=QueryType.INPUT, ), + ExampleQuery( + query="https://example.com/data/file.csv", + description="A generic HTTP/HTTPS file URL", + type=QueryType.INPUT, + ), ] @override @classmethod def is_valid_query(cls, query: str) -> StorageQueryValidationResult: - """Only handle zenodo.org URLs""" - if is_pypsa_or_zenodo_url(query): + """Handle all http/https URLs.""" + parsed = urlparse(query) + if parsed.scheme in ("http", "https") and parsed.netloc: return StorageQueryValidationResult(query=query, valid=True) return StorageQueryValidationResult( query=query, valid=False, - reason="Only zenodo.org, data.pypsa.org, and storage.googleapis.com URLs are handled by this plugin", + reason="Only http and https URLs are handled by this plugin", ) @override @@ -283,59 +291,95 @@ async def httpr(self, method: str, url: str): raise @retry_decorator - async def get_metadata(self, path: str, netloc: str) -> FileMetadata | None: + async def get_metadata(self, url: ParseResult) -> FileMetadata | None: """ - Retrieve and cache file metadata for a Zenodo record or a data.pypsa.org file. + Retrieve and cache file metadata for a Zenodo record, data.pypsa.org file, + GCS object, or generic HTTP URL. Args: - path: Server path - netloc: Network location (e.g., "zenodo.org") + url: Parsed URL Returns: - Dictionary mapping filename to FileMetadata + FileMetadata or None if not found """ + netloc = url.netloc if netloc in ("zenodo.org", "sandbox.zenodo.org"): - return await self.get_zenodo_metadata(path, netloc) + return await self.get_zenodo_metadata(url) elif netloc == "data.pypsa.org": - return await self.get_pypsa_metadata(path, netloc) + return await self.get_pypsa_metadata(url) elif netloc == "storage.googleapis.com": - return await self.get_gcs_metadata(path, netloc) + return await self.get_gcs_metadata(url) - raise WorkflowError( - "Cached-http storage plugin is only implemented for zenodo.org, data.pypsa.org, and storage.googleapis.com urls" - ) + return await self.get_http_metadata(url) @staticmethod - def is_immutable(netloc: str): - if netloc in ("zenodo.org", "sandbox.zenodo.org"): - return True - elif netloc == "data.pypsa.org": - return True - elif netloc == "storage.googleapis.com": - return False + def is_immutable(url: ParseResult) -> bool: + return url.netloc in ("zenodo.org", "sandbox.zenodo.org", "data.pypsa.org") - raise WorkflowError( - "Cached-http storage plugin is only implemented for zenodo.org, data.pypsa.org, and storage.googleapis.com urls" - ) + async def get_http_metadata(self, parsed: ParseResult) -> FileMetadata | None: + """ + Retrieve file metadata for a generic HTTP URL via HEAD request. + + Args: + url: Parsed URL + + Returns: + FileMetadata with size and mtime from HTTP headers, or None if not found + """ + url = parsed.geturl() + + async with self.client() as client: + try: + response = await client.head(url) + except Exception as e: + logger.warning(f"{type(e).__name__} while head'ing {url}") + raise + + if response.status_code == 404: + return None + if response.status_code == 405: + # HEAD not supported; assume file exists with unknown size/mtime + return FileMetadata(checksum=None, size=0, mtime=0.0) + if response.status_code != 200: + raise WorkflowError( + f"Failed to fetch HTTP metadata: HTTP {response.status_code} ({url})" + ) + + size = int(response.headers.get("content-length", 0)) - async def get_zenodo_metadata(self, path: str, netloc: str) -> FileMetadata | None: + last_modified = response.headers.get("last-modified") + if last_modified: + try: + mtime = cast(datetime, parsedate_to_datetime(last_modified)).timestamp() + except Exception: + logger.debug( + f"HTTP last-modified not in RFC2822 format: `{last_modified}`" + ) + mtime = 0.0 + else: + mtime = 0.0 + + return FileMetadata(checksum=None, size=size, mtime=mtime) + + async def get_zenodo_metadata(self, url: ParseResult) -> FileMetadata | None: """ Retrieve and cache file metadata for a Zenodo record or a data.pypsa.org file. Args: - path: Server path - netloc: Network location (e.g., "zenodo.org") + url: Parsed URL Returns: Dictionary mapping filename to FileMetadata """ + netloc = url.netloc + path = url.path.strip("/") # Zenodo record _records, record_id, _files, filename = path.split("/", maxsplit=3) if _records != "records" or _files != "files": raise WorkflowError( - f"Invalid Zenodo URL format: http(s)://{netloc}/{path}. " + f"Invalid Zenodo URL format: {url.geturl()}. " f"Expected format: https://zenodo.org/records/{{record_id}}/files/{{filename}}" ) @@ -374,30 +418,30 @@ async def get_zenodo_metadata(self, path: str, netloc: str) -> FileMetadata | No return metadata.get(filename) - async def get_pypsa_metadata(self, path: str, netloc: str) -> FileMetadata | None: + async def get_pypsa_metadata(self, url: ParseResult) -> FileMetadata | None: """ Retrieve and cache file metadata from data.pypsa.org manifest. Args: - path: Server path - netloc: Network location (e.g., "data.pypsa.org") + url: Parsed URL Returns: FileMetadata for the requested file, or None if not found """ - # Check cache first - base_path = dirname(path) + path = url.path.strip("/") + + base_path: str = dirname(path) while base_path: if base_path in self._pypsa_manifest_cache: - filename = relpath(path, base_path) + filename: str = relpath(path, base_path) return self._pypsa_manifest_cache[base_path].get(filename) base_path = dirname(base_path) # Fetch manifest base_path = dirname(path) while base_path: - manifest_url = f"https://{netloc}/{base_path}/manifest.yaml" + manifest_url = url._replace(path=f"/{base_path}/manifest.yaml").geturl() async with self.httpr("get", manifest_url) as response: if response.status_code == 200: @@ -408,7 +452,7 @@ async def get_pypsa_metadata(self, path: str, netloc: str) -> FileMetadata | Non base_path = dirname(base_path) else: raise WorkflowError( - f"Failed to fetch data.pypsa.org manifest for https://{netloc}/{path}" + f"Failed to fetch data.pypsa.org manifest for {url.geturl()}" ) # Parse files array and build metadata dict @@ -432,7 +476,7 @@ async def get_pypsa_metadata(self, path: str, netloc: str) -> FileMetadata | Non filename = relpath(path, base_path) return metadata.get(filename) - async def get_gcs_metadata(self, path: str, netloc: str) -> FileMetadata | None: + async def get_gcs_metadata(self, url: ParseResult) -> FileMetadata | None: """ Retrieve and cache file metadata from Google Cloud Storage. @@ -441,22 +485,21 @@ async def get_gcs_metadata(self, path: str, netloc: str) -> FileMetadata | None: API endpoint: https://storage.googleapis.com/storage/v1/b/{bucket}/o/{encoded-object} Args: - path: Server path (bucket/object-path) - netloc: Network location (storage.googleapis.com) + url: Parsed URL Returns: FileMetadata for the requested file, or None if not found """ # Check cache first - if path in self._gcs_metadata_cache: - return self._gcs_metadata_cache[path] + if url.path in self._gcs_metadata_cache: + return self._gcs_metadata_cache[url.path] # Parse bucket and object path from the URL path # Path format: /{bucket}/{object-path} - parts = path.split("/", maxsplit=1) + parts = url.path.strip("/").split("/", maxsplit=1) if len(parts) < 2: raise WorkflowError( - f"Invalid GCS URL format: http(s)://{netloc}/{path}. " + f"Invalid GCS URL format: {url.geturl()}. " f"Expected format: https://storage.googleapis.com/{{bucket}}/{{object-path}}" ) @@ -466,7 +509,9 @@ async def get_gcs_metadata(self, path: str, netloc: str) -> FileMetadata | None: encoded_object = quote(object_path, safe="") # GCS JSON API endpoint for object metadata - api_url = f"https://{netloc}/storage/v1/b/{bucket}/o/{encoded_object}" + api_url = url._replace( + path=f"/storage/v1/b/{bucket}/o/{encoded_object}" + ).geturl() async with self.httpr("get", api_url) as response: if response.status_code == 404: @@ -495,7 +540,7 @@ async def get_gcs_metadata(self, path: str, netloc: str) -> FileMetadata | None: metadata = FileMetadata(checksum=checksum, size=size, mtime=mtime) # Store in cache - self._gcs_metadata_cache[path] = metadata + self._gcs_metadata_cache[url.path] = metadata return metadata @@ -503,22 +548,16 @@ async def get_gcs_metadata(self, path: str, netloc: str) -> FileMetadata | None: # Implementation of storage object class StorageObject(StorageObjectRead): provider: StorageProvider # pyright: ignore[reportIncompatibleVariableOverride] - netloc: str - path: str + url: ParseResult def __post_init__(self): super().__post_init__() - - # Parse URL to extract record ID and filename - # URL format: https://zenodo.org/records/{record_id}/files/{filename} - parsed = urlparse(str(self.query)) - self.netloc = parsed.netloc - self.path = parsed.path.strip("/") + self.url = urlparse(str(self.query)) @override def local_suffix(self) -> str: """Return the local suffix for this object (used by parent class).""" - return f"{self.netloc}{self.path}" + return f"{self.url.netloc}{self.url.path}" @override def get_inventory_parent(self) -> str | None: @@ -533,10 +572,10 @@ async def managed_exists(self) -> bool: if self.provider.cache: cached = self.provider.cache.get(str(self.query)) - if cached is not None and self.provider.is_immutable(self.netloc): + if cached is not None and self.provider.is_immutable(self.url): return True - metadata = await self.provider.get_metadata(self.path, self.netloc) + metadata = await self.provider.get_metadata(self.url) return metadata is not None @override @@ -544,7 +583,7 @@ async def managed_mtime(self) -> float: if self.provider.settings.skip_remote_checks: return 0 - metadata = await self.provider.get_metadata(self.path, self.netloc) + metadata = await self.provider.get_metadata(self.url) return metadata.mtime if metadata is not None else 0 @override @@ -554,12 +593,12 @@ async def managed_size(self) -> int: if self.provider.cache: cached = self.provider.cache.get(str(self.query)) - if cached is not None and self.provider.is_immutable(self.netloc): + if cached is not None and self.provider.is_immutable(self.url): return cached.stat().st_size else: cached = None - metadata = await self.provider.get_metadata(self.path, self.netloc) + metadata = await self.provider.get_metadata(self.url) if metadata is None: return 0 @@ -588,7 +627,7 @@ async def inventory(self, cache: IOCacheStorageInterface) -> None: if self.provider.cache: cached = self.provider.cache.get(str(self.query)) - if cached is not None and self.provider.is_immutable(self.netloc): + if cached is not None and self.provider.is_immutable(self.url): cache.exists_in_storage[key] = True cache.mtime[key] = Mtime(storage=cached.stat().st_mtime) cache.size[key] = cached.stat().st_size @@ -596,7 +635,7 @@ async def inventory(self, cache: IOCacheStorageInterface) -> None: else: cached = None - metadata = await self.provider.get_metadata(self.path, self.netloc) + metadata = await self.provider.get_metadata(self.url) if metadata is None: cache.exists_in_storage[key] = False cache.mtime[key] = Mtime(storage=0) @@ -643,7 +682,7 @@ async def verify_checksum(self, path: Path) -> None: WrongChecksum """ # Get cached or fetch record metadata - metadata = await self.provider.get_metadata(self.path, self.netloc) + metadata = await self.provider.get_metadata(self.url) if metadata is None: raise WorkflowError(f"No metadata found for {self.query}") @@ -671,17 +710,17 @@ async def managed_retrieve(self): local_path.parent.mkdir(parents=True, exist_ok=True) query = str(self.query) - filename = basename(self.path) + filename = basename(self.url.path) - metadata = await self.provider.get_metadata(self.path, self.netloc) + metadata = await self.provider.get_metadata(self.url) if metadata is not None and metadata.redirect is not None: - query = f"https://{self.netloc}/{metadata.redirect}" + query = self.url._replace(path=f"/{metadata.redirect}").geturl() # If already in cache, check if still valid if self.provider.cache: cached = self.provider.cache.get(query) if cached is not None: - if self.provider.is_immutable(self.netloc) or ( + if self.provider.is_immutable(self.url) or ( metadata is not None and cached.stat().st_mtime >= metadata.mtime ): logger.info(f"Retrieved {filename} from cache ({query})") diff --git a/src/snakemake_storage_plugin_cached_http/monkeypatch.py b/src/snakemake_storage_plugin_cached_http/monkeypatch.py index c808bb6..5f1c8ce 100644 --- a/src/snakemake_storage_plugin_cached_http/monkeypatch.py +++ b/src/snakemake_storage_plugin_cached_http/monkeypatch.py @@ -3,43 +3,38 @@ # SPDX-License-Identifier: MIT """ -Monkey-patch the HTTP storage plugin to avoid conflicts with Zenodo URLs. +Warn if snakemake-storage-plugin-http is installed and patch it to refuse all URLs. -This module patches snakemake-storage-plugin-http to refuse zenodo.org URLs, -ensuring they are handled exclusively by the cached-http plugin. +Since cached-http now accepts all HTTP(S) URLs, having snakemake-storage-plugin-http +installed at the same time causes Snakemake to raise "Multiple suitable storage +providers found". This module detects that situation, warns the user, and patches the +HTTP plugin to refuse every URL so that cached-http wins unconditionally. """ -from urllib.parse import urlparse +import warnings -import snakemake_storage_plugin_http as http_base from snakemake_interface_storage_plugins.storage_provider import ( StorageQueryValidationResult, ) +try: + import snakemake_storage_plugin_http -def is_pypsa_or_zenodo_url(url: str) -> bool: - parsed = urlparse(url) - return parsed.netloc in ( - "zenodo.org", - "sandbox.zenodo.org", - "data.pypsa.org", - "storage.googleapis.com", - ) and parsed.scheme in ( - "http", - "https", + warnings.warn( + "snakemake-storage-plugin-http is installed alongside " + "snakemake-storage-plugin-cached-http. Because cached-http from v0.5 handles " + "all HTTP(S) URLs, the http plugin has been disabled for this session. " + "Please uninstall snakemake-storage-plugin-http. " + "This compatibility shim will be removed in v0.6.", + FutureWarning, ) - -# Patch the original HTTP StorageProvider to refuse zenodo URLs -orig_valid_query = http_base.StorageProvider.is_valid_query -http_base.StorageProvider.is_valid_query = classmethod( - lambda c, q: ( - StorageQueryValidationResult( + snakemake_storage_plugin_http.StorageProvider.is_valid_query = classmethod( + lambda c, q: StorageQueryValidationResult( query=q, valid=False, - reason="Deactivated in favour of cached_http", + reason="Disabled: snakemake-storage-plugin-cached-http handles all HTTP(S) URLs", ) - if is_pypsa_or_zenodo_url(q) - else orig_valid_query(q) ) -) +except ImportError: + pass diff --git a/tests/test_download.py b/tests/test_download.py index 8ac131d..be1478c 100644 --- a/tests/test_download.py +++ b/tests/test_download.py @@ -8,6 +8,7 @@ import logging import os import time +from urllib.parse import urlparse import pytest @@ -23,24 +24,27 @@ TEST_CONFIGS = { "zenodo": { "url": "https://zenodo.org/records/16810901/files/attributed_ports.json", - "path": "records/16810901/files/attributed_ports.json", - "netloc": "zenodo.org", "has_size": True, - "has_mtime": False, # Zenodo records are immutable + "is_immutable": True, + "has_checksum": True, }, "pypsa": { "url": "https://data.pypsa.org/workflows/eur/attributed_ports/2020-07-10/attributed_ports.json", - "path": "workflows/eur/attributed_ports/2020-07-10/attributed_ports.json", - "netloc": "data.pypsa.org", "has_size": False, # data.pypsa.org manifests don't include size - "has_mtime": False, # data.pypsa.org files are immutable + "is_immutable": True, + "has_checksum": True, }, "gcs": { "url": "https://storage.googleapis.com/open-tyndp-data-store/cached-http/attributed_ports/archive/2020-07-10/attributed_ports.json", - "path": "open-tyndp-data-store/cached-http/attributed_ports/archive/2020-07-10/attributed_ports.json", - "netloc": "storage.googleapis.com", "has_size": True, - "has_mtime": True, # GCS provides modification timestamps + "is_immutable": False, # GCS provides modification timestamps + "has_checksum": True, + }, + "http": { + "url": "https://datacatalogfiles.worldbank.org/ddh-published/0038118/1/DR0046414/attributed_ports.geojson", + "has_size": True, + "is_immutable": False, + "has_checksum": False, # Generic HTTP has no checksum }, } @@ -65,22 +69,20 @@ def storage_provider(tmp_path, test_logger): max_concurrent_downloads=3, ) - provider = StorageProvider( + return StorageProvider( local_prefix=local_prefix, logger=test_logger, settings=settings, ) - return provider - -@pytest.fixture(params=["zenodo", "pypsa", "gcs"]) +@pytest.fixture(params=list(TEST_CONFIGS)) def test_config(request): - """Provide test configuration (parametrized for zenodo, pypsa, and gcs).""" + """Provide test configuration (parametrized for all backends).""" return TEST_CONFIGS[request.param] -@pytest.fixture(params=[k for k, v in TEST_CONFIGS.items() if v["has_mtime"]]) +@pytest.fixture(params=[k for k, v in TEST_CONFIGS.items() if not v["is_immutable"]]) def mutable_test_config(request): """Provide test configuration for mutable sources only (those with mtime support).""" return TEST_CONFIGS[request.param] @@ -88,7 +90,7 @@ def mutable_test_config(request): @pytest.fixture def storage_object(test_config, storage_provider): - """Create a StorageObject for the test file (parametrized for zenodo and pypsa).""" + """Create a StorageObject for the test file (parametrized for all backends).""" obj = StorageObject( query=test_config["url"], keep_local=False, @@ -101,13 +103,12 @@ def storage_object(test_config, storage_provider): @pytest.mark.asyncio async def test_metadata_fetch(storage_provider, test_config): """Test that we can fetch metadata from the API/manifest.""" - metadata = await storage_provider.get_metadata( - test_config["path"], test_config["netloc"] - ) + metadata = await storage_provider.get_metadata(urlparse(test_config["url"])) assert metadata is not None - assert metadata.checksum is not None - assert metadata.checksum.startswith("md5:") + if test_config["has_checksum"]: + assert metadata.checksum is not None + assert metadata.checksum.startswith("md5:") if test_config["has_size"]: assert metadata.size > 0 @@ -136,10 +137,10 @@ async def test_storage_object_size(storage_object, test_config): async def test_storage_object_mtime(storage_object, test_config): """Test that mtime is 0 for immutable URLs, non-zero for mutable sources.""" mtime = await storage_object.managed_mtime() - if test_config["has_mtime"]: - assert mtime > 0 - else: + if test_config["is_immutable"]: assert mtime == 0 + else: + assert mtime > 0 @pytest.mark.asyncio @@ -158,10 +159,11 @@ async def test_download_and_checksum(storage_object, tmp_path): assert local_path.exists() assert local_path.stat().st_size > 0 - # Verify it's valid JSON + # Verify it's valid JSON (raw_decode ignores trailing garbage that the worldbank server appends) with open(local_path, encoding="utf-8", errors="replace") as f: - data = json.load(f) - assert isinstance(data, (dict, list)) + content = f.read() + data, _ = json.JSONDecoder().raw_decode(content) + assert isinstance(data, (dict, list)) # Verify checksum (should not raise WrongChecksum exception) await storage_object.verify_checksum(local_path) @@ -245,14 +247,15 @@ async def test_skip_remote_checks(test_config, tmp_path, test_logger): @pytest.mark.asyncio -async def test_wrong_checksum_detection(storage_object, tmp_path): +async def test_wrong_checksum_detection(storage_object, test_config, tmp_path): """Test that corrupted files are detected via checksum.""" - # Create a corrupted file corrupted_path = tmp_path / "corrupted.json" corrupted_path.write_text('{"corrupted": "data"}') - # Verify checksum should raise WrongChecksum - with pytest.raises(WrongChecksum): + if test_config["has_checksum"]: + with pytest.raises(WrongChecksum): + await storage_object.verify_checksum(corrupted_path) + else: await storage_object.verify_checksum(corrupted_path) diff --git a/tests/test_import.py b/tests/test_import.py index c0440d0..bc586ac 100644 --- a/tests/test_import.py +++ b/tests/test_import.py @@ -63,12 +63,19 @@ def test_is_valid_query_pypsa(): assert result.valid is True -def test_is_valid_query_non_zenodo_or_pypsa(): - """Test that is_valid_query rejects non-zenodo URLs.""" +def test_is_valid_query_generic_http(): + """Test that is_valid_query accepts generic http/https URLs as fallback.""" from snakemake_storage_plugin_cached_http import StorageProvider - # Non-Zenodo/PyPSA URL should be rejected + # Generic HTTP URL should be accepted as fallback result = StorageProvider.is_valid_query("https://example.com/file.txt") + assert result.valid is True + + # Non-HTTP schemes should be rejected + result = StorageProvider.is_valid_query("ftp://example.com/file.txt") + assert result.valid is False + + result = StorageProvider.is_valid_query("s3://bucket/key") assert result.valid is False