fix: add retry, timeout, and backoff defaults for cloud storage access#1304
Conversation
…ilesystem access Configure sensible retry and timeout defaults in _init_filesystem() so that transient network errors (503, 429, DNS failures, timeouts) are automatically retried instead of immediately crashing with a cryptic OSError. Changes: - GCS (gcsfs): retry=3, timeout=60s - S3 (s3fs): retries=3, connect_timeout=60s, read_timeout=60s - HTTP/HTTPS (fsspec): timeout=60s - Local filesystems are unaffected All defaults use setdefault() so users can still override via storage_options if needed. Closes malariagen#1303
When the frequency DataFrame has only one row (e.g., due to sample query filtering), the "contig" column is unique, so passing index="contig" does not raise ValueError. Only assert the error when the contig values are actually non-unique. This fixes a flaky test failure in test_gene_cnv_frequencies_with_str_cohorts_and_sample_query.
|
Hi @jonbrenas Thanks so much for merging #1294 — really appreciate the quick review! I also have another PR up that I'd love your eyes on when you get a chance: It addresses the retry/backoff logic for cloud storage operations. Since you're already familiar with the codebase context from my recent contributions, your feedback would be really valuable here as well. No rush at all — happy to wait until you have the bandwidth. |
|
Thanks @khushthecoder, I don't think that we have received any complaint from partners that they couldn't access their data when their connection is spotty but you never know. I am not quite sure what the change in |
Thanks @jonbrenas — you're right that the test_frq change is unrelated to the retry/backoff itself, apologies for the noise. I hit the failure while running the suite locally against this branch: when sample_query narrows the fixture down to a single row, the contig column is trivially unique, so plot_frequencies_heatmap(..., index="contig") doesn't raise and the pytest.raises(ValueError) block fails. The existing assertion assumed contig would always be non-unique. The guard (and not frq_df["contig"].is_unique) only exercises the error path when it's actually reachable. I'd rather not leave it bundled here. Happy to either: Drop it from this PR and open a small standalone one against the flakiness, or |
|
Thanks for the review @jonbrenas — appreciate you picking up the merge commits along the way. Let me know if anything else comes up before merge. |
Summary
Fixes #1303
All remote data access (GCS via
gcsfs, S3 vias3fs, and HTTP viafsspec) previously relied entirely on the default timeout/retry behavior of the underlying filesystem libraries. The codebase had zero explicit timeout, retry, or backoff configuration, meaning:OSErrorChanges
Added sensible retry and timeout defaults to
_init_filesystem()inmalariagen_data/util.py:gcsfs)retry3gcsfs)timeout60ss3fs)retries3s3fs)connect_timeout60ss3fs)read_timeout60sfsspec)timeout60sDesign decisions
setdefault()everywhere — all defaults can be overridden by users viastorage_options, so this is fully backwards-compatiblehttp://orhttps://, not for local file pathsretryfor gcsfs,retriesfor s3fs,connect_timeout/read_timeoutvia s3fs config_kwargsImpact
Test Plan
test_util_filesystem.pytests unaffected (local filesystem paths)setdefault()does not override user-provided values