Description:
ag3.py line 10 executes the following statement at module import time:
dask.config.set(**{"array.slicing.split_native_chunks": False})
This runs the moment any code imports malariagen_data, since ag3.py is loaded during package initialization. It silently disables a Dask performance optimization globally, affecting all Dask operations in the entire Python process — not just those belonging to malariagen_data.
Steps to Reproduce
import dask
print(dask.config.get("array.slicing.split_native_chunks")) # True (default)
import malariagen_data # Side effect triggered here!
print(dask.config.get("array.slicing.split_native_chunks")) # False — globally changed
# Now ANY dask array operation in this process uses non-split behavior,
# even operations completely unrelated to malariagen_data:
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = x[::2] # Uses non-split slicing due to the global config change
Root Cause
The call to dask.config.set() is made at module scope (the top level of ag3.py) rather than inside a context manager scoped to the specific operations that require it. As a result, importing malariagen_data is enough to change the Dask configuration for the rest of the Python session, unless the user explicitly resets it.
Why This Is a Problem
1. Unexpected Side Effects
Importing a library should not modify the global configuration of other libraries. This violates the principle of least surprise and breaks the assumption that import statements are side-effect-free.
2. Performance Impact
split_native_chunks was introduced to improve performance for certain slicing operations. Disabling it globally may degrade performance for non-malariagen_data Dask workloads running in the same session (e.g., xarray, pangeo, custom pipelines).
3. Difficult to Debug
Researchers combining malariagen_data with other Dask-based tools will observe changed Dask behaviour with no visible indication of the cause. The configuration change is silent and persistent, and nothing in tracebacks or logs points back to the import that caused it.
Expected Behavior
The dask.config.set() override should be scoped only to the specific operations that require it, using the context manager form:
with dask.config.set(**{"array.slicing.split_native_chunks": False}):
    # perform the specific operation that needs this config
    ...
This ensures the configuration is restored to its previous value immediately after the operation completes, with no lasting effect on the rest of the process.
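The restore-on-exit behaviour described above can be checked with a minimal sketch like the one below. It assumes only that dask is installed; the key name is copied from this issue, and `read_option` is a hypothetical helper. The lookup passes `default=None` so that it does not raise if the key is unset in the installed Dask version.

```python
import dask

# Key name copied from the issue text.
KEY = "array.slicing.split_native_chunks"


def read_option():
    # default=None avoids a KeyError if the key is not set in this Dask version.
    return dask.config.get(KEY, default=None)


before = read_option()

# The override is active only inside the with block.
with dask.config.set({KEY: False}):
    inside = read_option()

# On exit, dask.config.set restores whatever was there before.
after = read_option()

print("before:", before, "inside:", inside, "after:", after)
```

Note that `dask.config` is process-wide state, so code running in other threads may still see the override while the block is active; the context manager limits how long the change lasts, not which callers can see it.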
Proposed Fix
Before (current behaviour — ag3.py, line 10):
# Module-level — runs on import, permanently changes global config
dask.config.set(**{"array.slicing.split_native_chunks": False})
After (proposed fix — scoped to specific methods):
# Inside the method(s) that need this behaviour
def some_method(self, *args, **kwargs):
    with dask.config.set(**{"array.slicing.split_native_chunks": False}):
        # operation that requires this config
        ...
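If several methods need the override, one possible way to avoid repeating the with block is a small decorator. The sketch below is illustrative only: `with_scoped_dask_config`, `SCOPED_OVERRIDE`, and `ExampleResource` are hypothetical names and not part of malariagen_data, and the method body simply reports the value that Dask code called from inside it would see.

```python
import functools

import dask

# Key name copied from the issue text; everything else here is hypothetical.
SCOPED_OVERRIDE = {"array.slicing.split_native_chunks": False}


def with_scoped_dask_config(func):
    """Run func with SCOPED_OVERRIDE active, then restore the prior config."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with dask.config.set(SCOPED_OVERRIDE):
            return func(*args, **kwargs)

    return wrapper


class ExampleResource:
    """Hypothetical stand-in for a class defined in ag3.py."""

    @with_scoped_dask_config
    def option_seen_inside(self):
        # Reports the value that Dask code called from here would see.
        return dask.config.get("array.slicing.split_native_chunks", default=None)


value_before = dask.config.get("array.slicing.split_native_chunks", default=None)
value_inside = ExampleResource().option_seen_inside()
value_after = dask.config.get("array.slicing.split_native_chunks", default=None)
```

Because Dask arrays are evaluated lazily, the override only covers work done while the with block is active. If a method returns a lazy array that the caller slices or computes later, that later work runs under the caller's configuration, so the scope of each block should be chosen with that in mind.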
Expected Impact After Resolution
- Importing malariagen_data will no longer modify the global Dask configuration.
- The split_native_chunks override will be active only during the specific operations that require it.
- Third-party Dask-based tools used alongside malariagen_data (e.g., xarray, pangeo) will no longer have their performance characteristics silently altered.
- The fix is fully backward-compatible — no changes to public API or user-facing behaviour are required.
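
The first point in this list could be guarded by a regression test. The sketch below is one possible approach, not an existing test in the repository: `import_changes_dask_option` is a hypothetical helper that imports a module in a fresh interpreter (so that earlier imports in the test session cannot mask the effect) and reports whether a given Dask option changed.

```python
import subprocess
import sys

# Probe script run in a fresh interpreter so earlier imports cannot mask the effect.
_PROBE = """
import importlib, sys
import dask
key, module_name = sys.argv[1], sys.argv[2]
before = dask.config.get(key, default=None)
importlib.import_module(module_name)
after = dask.config.get(key, default=None)
print("unchanged" if before == after else "changed")
"""


def import_changes_dask_option(module_name, key):
    """Return True if importing module_name alters the Dask option `key`."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE, key, module_name],
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last line in case the imported module prints during import.
    return result.stdout.strip().splitlines()[-1] == "changed"
```

With the fix in place, `import_changes_dask_option("malariagen_data", "array.slicing.split_native_chunks")` would be expected to return False.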
Affected File
| File | Line | Code |
| --- | --- | --- |
| ag3.py | 10 | dask.config.set(**{"array.slicing.split_native_chunks": False}) |