
fix: add retry, timeout, and backoff defaults for cloud storage access #1304

Merged
jonbrenas merged 5 commits into malariagen:master from khushthecoder:fix/issue-1303-cloud-storage-retry-backoff
Apr 18, 2026
Conversation

@khushthecoder
Contributor

Summary

Fixes #1303

All remote data access (GCS via gcsfs, S3 via s3fs, and HTTP via fsspec) previously relied entirely on the default timeout/retry behavior of the underlying filesystem libraries. The codebase had zero explicit timeout, retry, or backoff configuration, meaning:

  • A single transient error (GCS 503, network timeout, DNS failure) caused the entire operation to fail with a cryptic OSError
  • Operations could hang indefinitely on unreliable networks with no timeout
  • This particularly affects MalariaGEN's primary user base — researchers in malaria-endemic regions (sub-Saharan Africa, Southeast Asia) where internet connectivity can be unreliable

Changes

Added sensible retry and timeout defaults to _init_filesystem() in malariagen_data/util.py:

Backend              Parameter        Default  Description
GCS (gcsfs)          retry            3        Number of retry attempts for transient errors
GCS (gcsfs)          timeout          60s      Request timeout in seconds
S3 (s3fs)            retries          3        Number of retry attempts for transient errors
S3 (s3fs)            connect_timeout  60s      Connection timeout in seconds
S3 (s3fs)            read_timeout     60s      Read timeout in seconds
HTTP/HTTPS (fsspec)  timeout          60s      Request timeout in seconds

Design decisions

  • setdefault() everywhere — all defaults can be overridden by users via storage_options, so this is fully backwards-compatible
  • Local filesystems unaffected — the HTTP timeout is only applied when the URL starts with http:// or https://, not for local file paths
  • Parameter names match each library's API: retry for gcsfs, retries for s3fs, and connect_timeout/read_timeout via s3fs config_kwargs
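
A minimal sketch of how these defaults might be wired up, assuming a helper that dispatches on the URL scheme; the function name here is hypothetical, and the real logic lives in _init_filesystem() in malariagen_data/util.py:

```python
# Illustrative sketch only: mirrors the defaults described in this PR.
def _apply_storage_defaults(url: str, storage_options: dict) -> dict:
    if url.startswith("gs://"):
        # gcsfs: retry transient errors, bound each request.
        storage_options.setdefault("retry", 3)
        storage_options.setdefault("timeout", 60)
    elif url.startswith("s3://"):
        # s3fs: retries at the top level, timeouts via botocore config_kwargs.
        storage_options.setdefault("retries", 3)
        config_kwargs = storage_options.setdefault("config_kwargs", {})
        config_kwargs.setdefault("connect_timeout", 60)
        config_kwargs.setdefault("read_timeout", 60)
    elif url.startswith(("http://", "https://")):
        # fsspec HTTP filesystem: per-request timeout only.
        storage_options.setdefault("timeout", 60)
    # Local paths fall through untouched.
    return storage_options
```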

Impact

  • Transient network errors are automatically retried (up to 3 times) instead of immediately failing
  • Operations have a 60-second timeout instead of hanging indefinitely
  • Users on unreliable connections get clear timeout errors instead of indefinite hangs
  • Fully backwards-compatible — existing code and user overrides work unchanged

Test Plan

  • Python syntax validated
  • Existing test_util_filesystem.py tests unaffected (local filesystem paths)
  • CI linting passes
  • CI unit tests pass across Python 3.10, 3.11, 3.12
  • Verified setdefault() does not override user-provided values
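
For reference, the non-override guarantee in the last item is just standard dict.setdefault() semantics:

```python
# setdefault() only fills in a key when it is absent, so user-provided
# storage_options always take precedence over the new defaults.
opts = {"timeout": 120}          # user override
opts.setdefault("timeout", 60)   # ignored: key already present
opts.setdefault("retry", 3)      # applied: key absent
assert opts == {"timeout": 120, "retry": 3}
```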

khushthecoder added 2 commits April 15, 2026 23:44
…ilesystem access

Configure sensible retry and timeout defaults in _init_filesystem() so
that transient network errors (503, 429, DNS failures, timeouts) are
automatically retried instead of immediately crashing with a cryptic
OSError.

Changes:
- GCS (gcsfs): retry=3, timeout=60s
- S3 (s3fs): retries=3, connect_timeout=60s, read_timeout=60s
- HTTP/HTTPS (fsspec): timeout=60s
- Local filesystems are unaffected

All defaults use setdefault() so users can still override via
storage_options if needed.

Closes malariagen#1303
When the frequency DataFrame has only one row (e.g., due to sample
query filtering), the "contig" column is unique, so passing
index="contig" does not raise ValueError. Only assert the error when
the contig values are actually non-unique.

This fixes a flaky test failure in
test_gene_cnv_frequencies_with_str_cohorts_and_sample_query.
@khushthecoder
Contributor Author

Hi @jonbrenas

Thanks so much for merging #1294 — really appreciate the quick review!

I also have another PR up that I'd love your eyes on when you get a chance:
#1304 (this PR): Cloud Storage Retry Backoff

It addresses the retry/backoff logic for cloud storage operations. Since you're already familiar with the codebase context from my recent contributions, your feedback would be really valuable here as well.

No rush at all — happy to wait until you have the bandwidth.

@jonbrenas
Collaborator

Thanks @khushthecoder, I don't think that we have received any complaint from partners that they couldn't access their data when their connection is spotty, but you never know. I am not quite sure what the change in test_frq has to do with it, though.

@khushthecoder
Contributor Author

> Thanks @khushthecoder, I don't think that we have received any complaint from partners that they couldn't access their data when their connection is spotty, but you never know. I am not quite sure what the change in test_frq has to do with it, though.

Thanks @jonbrenas — you're right that the test_frq change is unrelated to the retry/backoff itself, apologies for the noise.

I hit the failure while running the suite locally against this branch: when sample_query narrows the fixture down to a single row, the contig column is trivially unique, so plot_frequencies_heatmap(..., index="contig") doesn't raise and the pytest.raises(ValueError) block fails. The existing assertion assumed contig would always be non-unique. The guard (and not frq_df["contig"].is_unique) only exercises the error path when it's actually reachable.
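
A hypothetical sketch of the guard described above, using the names from this comment and assuming frq_df and plot_frequencies_heatmap are in scope within the test:

```python
import pytest

# Only exercise the error path when "contig" is actually non-unique;
# with a single-row frame the column is trivially unique and
# plot_frequencies_heatmap(..., index="contig") does not raise.
if not frq_df["contig"].is_unique:
    with pytest.raises(ValueError):
        plot_frequencies_heatmap(frq_df, index="contig")
```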

I'd rather not leave it bundled here. Happy to either:

  • Drop it from this PR and open a small standalone one against the flakiness, or
  • Keep it here if you'd prefer to land both together — your call.

On the retry/backoff severity: agreed that it's defensive rather than driven by a reported incident. The motivation was more about making transient GCS hiccups recoverable without surfacing as hard failures to users on institutional networks. If that framing isn't worth the added surface area, I'm fine scaling it back.

@khushthecoder
Contributor Author

Thanks for the review @jonbrenas — appreciate you picking up the merge commits along the way. Let me know if anything else comes up before merge.

jonbrenas merged commit c8344ea into malariagen:master Apr 18, 2026
8 checks passed

Linked issue (#1303): Cloud Storage Access (GCS/S3) Has No Retry, Backoff, or Timeout Configuration