
fix: add retry, timeout, and backoff defaults for cloud storage access #1304

Merged
jonbrenas merged 5 commits into malariagen:master from khushthecoder:fix/issue-1303-cloud-storage-retry-backoff
Apr 18, 2026
Conversation

@khushthecoder
Contributor

Summary

Fixes #1303

All remote data access (GCS via gcsfs, S3 via s3fs, and HTTP via fsspec) previously relied entirely on the default timeout/retry behavior of the underlying filesystem libraries. The codebase had zero explicit timeout, retry, or backoff configuration, meaning:

  • A single transient error (GCS 503, network timeout, DNS failure) caused the entire operation to fail with a cryptic OSError
  • Operations could hang indefinitely on unreliable networks with no timeout
  • This particularly affects MalariaGEN's primary user base — researchers in malaria-endemic regions (sub-Saharan Africa, Southeast Asia) where internet connectivity can be unreliable

Changes

Added sensible retry and timeout defaults to _init_filesystem() in malariagen_data/util.py:

Backend              Parameter        Default  Description
GCS (gcsfs)          retry            3        Number of retry attempts for transient errors
GCS (gcsfs)          timeout          60s      Request timeout in seconds
S3 (s3fs)            retries          3        Number of retry attempts for transient errors
S3 (s3fs)            connect_timeout  60s      Connection timeout in seconds
S3 (s3fs)            read_timeout     60s      Read timeout in seconds
HTTP/HTTPS (fsspec)  timeout          60s      Request timeout in seconds

Design decisions

  • setdefault() everywhere — all defaults can be overridden by users via storage_options, so this is fully backwards-compatible
  • Local filesystems unaffected — the HTTP timeout is only applied when the URL starts with http:// or https://, not for local file paths
  • Parameter names match each library's API: retry for gcsfs, retries for s3fs, and connect_timeout/read_timeout via s3fs config_kwargs
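
A minimal sketch of how these defaults might be wired up, assuming a helper that dispatches on the URL scheme; the function name here is hypothetical, and the real logic lives in _init_filesystem() in malariagen_data/util.py:

```python
# Illustrative sketch only: mirrors the defaults described in this PR.
def _apply_storage_defaults(url: str, storage_options: dict) -> dict:
    if url.startswith("gs://"):
        # gcsfs: retry transient errors, bound each request.
        storage_options.setdefault("retry", 3)
        storage_options.setdefault("timeout", 60)
    elif url.startswith("s3://"):
        # s3fs: retries at the top level, timeouts via botocore config_kwargs.
        storage_options.setdefault("retries", 3)
        config_kwargs = storage_options.setdefault("config_kwargs", {})
        config_kwargs.setdefault("connect_timeout", 60)
        config_kwargs.setdefault("read_timeout", 60)
    elif url.startswith(("http://", "https://")):
        # fsspec HTTP filesystem: per-request timeout only.
        storage_options.setdefault("timeout", 60)
    # Local paths fall through untouched.
    return storage_options
```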

Impact

  • Transient network errors are automatically retried (up to 3 times) instead of immediately failing
  • Operations have a 60-second timeout instead of hanging indefinitely
  • Users on unreliable connections get clear timeout errors instead of indefinite hangs
  • Fully backwards-compatible — existing code and user overrides work unchanged

Test Plan

  • Python syntax validated
  • Existing test_util_filesystem.py tests unaffected (local filesystem paths)
  • CI linting passes
  • CI unit tests pass across Python 3.10, 3.11, 3.12
  • Verified setdefault() does not override user-provided values
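
For reference, the non-override guarantee in the last item is just standard dict.setdefault() semantics:

```python
# setdefault() only fills in a key when it is absent, so user-provided
# storage_options always take precedence over the new defaults.
opts = {"timeout": 120}          # user override
opts.setdefault("timeout", 60)   # ignored: key already present
opts.setdefault("retry", 3)      # applied: key absent
assert opts == {"timeout": 120, "retry": 3}
```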

khushthecoder added 2 commits April 15, 2026 23:44
…ilesystem access

Configure sensible retry and timeout defaults in _init_filesystem() so
that transient network errors (503, 429, DNS failures, timeouts) are
automatically retried instead of immediately crashing with a cryptic
OSError.

Changes:
- GCS (gcsfs): retry=3, timeout=60s
- S3 (s3fs): retries=3, connect_timeout=60s, read_timeout=60s
- HTTP/HTTPS (fsspec): timeout=60s
- Local filesystems are unaffected

All defaults use setdefault() so users can still override via
storage_options if needed.

Closes malariagen#1303
When the frequency DataFrame has only one row (e.g., due to sample
query filtering), the "contig" column is unique, so passing
index="contig" does not raise ValueError. Only assert the error when
the contig values are actually non-unique.

This fixes a flaky test failure in
test_gene_cnv_frequencies_with_str_cohorts_and_sample_query.
@khushthecoder
Contributor Author

Hi @jonbrenas

Thanks so much for merging #1294 — really appreciate the quick review!

I also have another PR up that I'd love your eyes on when you get a chance:
#1304 (this PR): Cloud Storage Retry Backoff

It addresses the retry/backoff logic for cloud storage operations. Since you're already familiar with the codebase context from my recent contributions, your feedback would be really valuable here as well.

No rush at all — happy to wait until you have the bandwidth.

@jonbrenas
Collaborator

Thanks @khushthecoder, I don't think that we have received any complaint from partners that they couldn't access their data when their connection is spotty, but you never know. I am not quite sure what the change in test_frq has to do with it, though.

@khushthecoder
Contributor Author

> Thanks @khushthecoder, I don't think that we have received any complaint from partners that they couldn't access their data when their connection is spotty, but you never know. I am not quite sure what the change in test_frq has to do with it, though.

Thanks @jonbrenas — you're right that the test_frq change is unrelated to the retry/backoff itself, apologies for the noise.

I hit the failure while running the suite locally against this branch: when sample_query narrows the fixture down to a single row, the contig column is trivially unique, so plot_frequencies_heatmap(..., index="contig") doesn't raise and the pytest.raises(ValueError) block fails. The existing assertion assumed contig would always be non-unique. The guard (and not frq_df["contig"].is_unique) only exercises the error path when it's actually reachable.
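
A hypothetical sketch of the guard described above, using the names from this comment and assuming frq_df and plot_frequencies_heatmap are in scope within the test:

```python
import pytest

# Only exercise the error path when "contig" is actually non-unique;
# with a single-row frame the column is trivially unique and
# plot_frequencies_heatmap(..., index="contig") does not raise.
if not frq_df["contig"].is_unique:
    with pytest.raises(ValueError):
        plot_frequencies_heatmap(frq_df, index="contig")
```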

I'd rather not leave it bundled here. Happy to either:

  • Drop it from this PR and open a small standalone one against the flakiness, or
  • Keep it here if you'd prefer to land both together — your call.

On the retry/backoff severity: agreed that it's defensive rather than driven by a reported incident. The motivation was more about making transient GCS hiccups recoverable without surfacing as hard failures to users on institutional networks. If that framing isn't worth the added surface area, I'm fine scaling it back.

@khushthecoder
Contributor Author

Thanks for the review @jonbrenas — appreciate you picking up the merge commits along the way. Let me know if anything else comes up before merge.

jonbrenas merged commit c8344ea into malariagen:master Apr 18, 2026
8 checks passed

Linked issue (#1303): Cloud Storage Access (GCS/S3) Has No Retry, Backoff, or Timeout Configuration