Skip to content

Feature/drs server#32

Open
matthewpeterkort wants to merge 13 commits intodevelopfrom
feature/drs-server
Open

Feature/drs server#32
matthewpeterkort wants to merge 13 commits intodevelopfrom
feature/drs-server

Conversation

@matthewpeterkort
Copy link
Copy Markdown
Collaborator

@matthewpeterkort matthewpeterkort commented Mar 26, 2026

big architectural refactor to support drs server changelog:

Full Metadata Migration: The entire indexd package (1,158 lines) has been deprecated and removed, replaced by a native GA4GH-compliant drs registry system.

Universal Cloud Storage: Added gocloud.dev integration, enabling native S3, Google Storage, and Azure Blob support via a unified transfer backend.

Client-Side Orchestration: Shifted multipart upload management from server-side to a robust client-driven orchestrator, capable of handling enterprise-scale files (5GB+).

Type-Safe Infrastructure: Integrated 18.5k lines of OpenAPI-generated model code (apigen) to ensure 100% fidelity with the latest DRS specifications.

@bwalsh
Copy link
Copy Markdown
Contributor

bwalsh commented Mar 31, 2026

ADR: Adopt GA4GH DRS Server as the canonical metadata and transfer surface

Branch: feature/drs-server

Context

Gen3 data-client historically coupled multiple concerns: metadata resolution through Indexd, presigned URL orchestration through Fence, and bespoke download/upload flows wired directly to those services. That architecture made it difficult to:

  • support GA4GH DRS-compatible runtimes or local object services,
  • reuse metadata + transfer logic across upload/download commands, and
  • reason about chunk sizing, concurrency, and resumability in a unified way.

As new projects adopt GA4GH-native endpoints and non-Gen3 storage, we need the client to speak DRS natively, generate API bindings from the published schema, and encapsulate transfer orchestration behind a stable interface. We also need a local/runtime abstraction so that the same CLI can target either a Gen3 commons or a standalone DRS service.

Decision

  • Replace the bespoke Indexd client with a GA4GH DRS metadata client (drs/client.go) generated from upstream OpenAPI specs under apigen/api/openapi.yaml and apigen/drs/*.
  • Introduce a transfer backend abstraction (transfer/interface.go, transfer/service.go) that composes with the DRS metadata client via drs.ComposeServerClient, allowing both Gen3 (transfer/signer/gen3) and local DRS (transfer/signer/local) runtimes to share upload/download logic.
  • Extend the CLI runtime bootstrap (runtime/client.go, localclient/client.go, g3client/client.go) so commands resolve a drs.ServerClient regardless of mode, while still exposing Fence/Sower/Requestor surfaces when running against Gen3.
  • Rework download and upload flows (download/transfer.go, upload/multipart.go, upload/utils.go, transfer/storage/gocloud.go) to consume the new backend, add resumable single-stream fallbacks, smarter chunk sizing (OptimalChunkSize), and multi-cloud storage support through gocloud.dev.
  • Update commands, mocks, and tests (cmd/*.go, mocks/mock_gen3interface.go, tests/*.go, download/transfer_test.go, upload/multipart_test.go) to depend on the new interfaces and to exercise the DRS pathways.
  • Add build tooling to keep the generated models in sync (Makefile targets gen/gen-internal, ga4gh/data-repository-service-schemas submodule) and raise the module toolchain to Go 1.26.1 (go.mod, go.sum).

Implementation Details

GA4GH DRS metadata layer

  • apigen/api/openapi.yaml now bundles the upstream GA4GH schema plus local extensions (apigen/specs/drs-extensions-overlay.yaml), and make gen re-generates the Go models in apigen/drs/ and apigen/internalapi/.
  • drs/client.go, drs/drs.go, drs/convert.go, drs/resolve.go, and related helpers provide rich operations: object CRUD, listing, hash queries, project authorization filtering, and registration helpers that map Calypr storage layout to DRS resources.
  • A new drs.ServerClient interface (drs/server_client.go) wraps both metadata and transfer behaviors so higher layers interact with a single facade. Legacy indexd/* sources and mocks were removed.

Transfer backend abstraction & runtime selection

  • transfer/interface.go defines downloader/uploader contracts; transfer/service.go implements a concrete backend that delegates signed URL work to pluggable signers.
  • transfer/signer/gen3 bridges Fence + Shepherd when running against a Gen3 commons, whereas transfer/signer/local speaks directly to /data/* endpoints exposed by standalone DRS services. Both reuse drs.ResolveDownloadURL when Fence does not produce URLs.
  • transfer/storage/gocloud.go introduces a gocloud.dev/blob-backed helper so future runtimes can address arbitrary cloud buckets with a uniform API.
  • runtime/client.go and localclient/client.go determine whether the CLI should wire itself to Gen3 (g3client.Gen3Interface) or a local DRS endpoint based on the --mode flag, but always expose a drs.ServerClient to the command layer.

CLI, download, and upload workflows

  • All CLI commands updated under cmd/ request the appropriate runtime and operate through the new DRS client; e.g. cmd/download-multiple.go and cmd/upload.go no longer talk to Fence/Indexd directly.
  • download/transfer.go, download/orchestrator.go, and helpers now fetch metadata via drs.ResolveObject, perform capability checks, and stream data using the injected backend (transfer.Downloader). The same file adds resumable single-stream handling and improved progress reporting.
  • Upload workflows (upload/multipart.go, upload/batch.go, upload/request.go) rely on transfer.Uploader, apply OptimalChunkSize from upload/utils.go, and consolidate retry/reporting rules. Unit tests in upload/multipart_test.go and tests/utils_test.go cover chunk math and orchestration edge cases.

Tooling, dependencies, and tests

  • Makefile now contains OpenAPI generation, schema init, and coverage helpers so regenerated clients stay reproducible.
  • go.mod bumps the toolchain to Go 1.26.1 and adds new dependencies (gocloud.dev, Azure SDKs, updated AWS SDKs) required by the transfer and generation layers; go.sum captures the resolved versions.
  • The tests/ tree gained integration-style suites (tests/download-multiple_test.go, tests/utils_test.go) plus a new tests/mock_gen3_adapter.go to bridge the refactored interfaces. Legacy Indexd mocks were removed, while DRS-aware mocks were added under mocks/.

Consequences

  • Benefits: Unified metadata + transfer surface reduces duplicate logic, unlocks GA4GH interoperability, and simplifies adding new runtime modes. OpenAPI-driven clients reduce drift from server APIs, and smarter chunk sizing improves upload reliability on large files.
  • Costs: Generated code and the GA4GH schema submodule add build overhead. The dependency graph is larger (Azure/GCP SDKs, gocloud.dev), and Go 1.26.1 becomes a hard requirement for contributors.
  • Risks: The removal of Indexd paths means older runtimes that relied on fence/indexd invariants must upgrade to DRS-compatible endpoints. Any divergence between server OpenAPI versions and the generated client must be handled by re-running make gen.

Alternatives Considered

  1. Continue using Indexd/Fence contracts only: Rejected because it locks the client to Gen3 infrastructures and cannot serve GA4GH-only deployments or local testing scenarios.
  2. Add DRS support without refactoring transfer logic: Rejected because upload/download flows would remain duplicated and error-prone; composable backends make future runtime additions (e.g., direct cloud storage) much easier.
  3. Maintain separate binaries for Gen3 vs DRS: Rejected to avoid user confusion and duplicated maintenance; the runtime switch in runtime/client.go keeps a single CLI artifact.

Follow-Up

  • Harden the new tests to cover multi-cloud URL signing paths under transfer/storage/gocloud.go once those runtimes are available.
  • Document the new runtime modes and make gen workflow in docs/ for contributors.
  • Monitor performance telemetry from download/transfer.go and adjust default chunk sizes/concurrency when real-world evidence suggests new thresholds.

@bwalsh
Copy link
Copy Markdown
Contributor

bwalsh commented Mar 31, 2026

PR Review: feature/drs-server

Findings (ordered by severity)

  1. Multipart downloads pass query strings as access IDs
    File: download/transfer.go lines 333-339.
    downloadToPathMultipart builds protocolText := "?protocol=" + protocol and hands that to bk.ResolveDownloadURL as the accessID. The DRS/Fence API expects the identifier segment at /access/{access_id}, not a query string. Servers will look up an access ID literally named ?protocol=s3, so multipart downloads with --protocol fail up front while single-stream downloads still work. Pass the actual access ID (e.g. s3) and keep query parameters separate, mirroring DownloadSingleWithProgress. Without this fix the new code path is unusable for protocol overrides.

  2. Multipart download buffers entire parts into memory
    File: download/transfer.go lines 388-439.
    Each goroutine calls io.ReadAll(partResp.Body) before file.WriteAt, so peak RAM equals chunkSize * concurrency. With the new OptimalChunkSize this can reach ~1 GB (128 MB chunk × 8 workers, or worse for 512 MB chunks), which is both unnecessary and dangerous for CLI users. Stream each part directly with io.CopyN(io.NewSectionWriter(file, ps, pe-ps+1), partResp.Body, ...) or reuse a pooled buffer. This keeps memory bounded and improves testability—right now the implementation cannot be sensibly fuzzed for large files.

  3. MinChunkSize comment/docs still promise git-config overrides
    File: common/constants.go lines 77-90 (and related docs).
    The file says "MinChunkSize is configurable via git config and initialized in init()", but the init logic and git config lookups were removed in this branch. That is a silent behavioral change with no documentation and will confuse existing operators who rely on the override. Either reinstate the configuration hook (with tests) or update the constant, docs, and ADR to state that the knob was removed—right now maintenance is harder because the code lies about its capabilities.

  4. No tests cover the new multipart range downloader
    File: download/transfer.go (downloadToPathMultipart).
    The only download test (download/transfer_test.go) exercises single-stream flows; nothing asserts the behavior of the new concurrent range writer. As a result, regressions like the access-ID bug and the unbounded buffering slipped through. Please extend the fakeBackend harness (it already supports RangeStart/RangeEnd) to cover: successful ranged writes, retry/error propagation, and progress accounting. This dramatically improves confidence in the most complicated logic added in this branch.

  5. downloadToPathMultipart progress updates can over-count
    File: download/transfer.go lines 428-455.
    Progress updates add len(buf) after each range—when io.ReadAll returns fewer bytes than requested (e.g., mid-range error or short read), the code still reports the partial amount but then the retry will add the same bytes again, yielding duplicate accounting. Once streaming writes replace io.ReadAll, please increment based on actual bytes copied from the response reader and only after success, or consolidate all progress reporting in a shared helper as the single-stream path does. This keeps progress observers accurate and simplifies maintenance across both download strategies.

Open Questions / Follow-Ups

  • Should the removal of git-config-driven chunk sizing be documented in docs/optimal-chunk-size.md and the user-facing release notes? Right now it is an undocumented breaking change.
  • Are we planning to add integration tests that hit both Gen3 and local DRS runtimes for uploads/downloads? The new runtime abstraction enables it but there is no automated coverage yet.

Suggested Next Steps

  1. Fix the multipart access-ID plumbing and add unit tests that would have caught it.
  2. Rework multipart range writes to stream data (avoiding io.ReadAll) and tighten progress accounting.
  3. Decide whether MinChunkSize remains configurable; either restore the functionality or update the code/docs to reflect the new constant behavior.
  4. Add dedicated tests for downloadToPathMultipart to lock in the expected concurrency, error handling, and progress semantics.

@bwalsh
Copy link
Copy Markdown
Contributor

bwalsh commented Mar 31, 2026

Failing test:

The job is failing in unit tests due to a gomock return type mismatch for Gen3Interface.DRSClient().

From the logs (ref dfecdcd8ecd2ea4df6cc970a8d2bc9bba2f494b3), multiple tests fail with:

wrong type of argument 0 to Return for *mocks.MockGen3Interface.DRSClient: *mocks.MockDrsClient is not assignable to drs.ServerClient
plus follow-on missing call(s) to *mocks.MockGen3Interface.DRSClient() because gomock aborts after the bad Return(...).
This points to a signature mismatch: the mock MockDrsClient type you’re returning does not implement the interface type that DRSClient() is declared to return (the logs say that return type is drs.ServerClient).

Copy link
Copy Markdown
Contributor

@bwalsh bwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

@matthewpeterkort
Copy link
Copy Markdown
Collaborator Author

I moved syfon client out of this repo so your feedbacks for this repo are no longer relevant. I will still address these feedbacks in syfon repo though.

@matthewpeterkort matthewpeterkort requested a review from bwalsh April 8, 2026 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants