refactor: migrate karyotype tag loading to GCS and add simulated tests#1278
refactor: migrate karyotype tag loading to GCS and add simulated tests#1278adilraza99 wants to merge 3 commits intomalariagen:masterfrom
Conversation
2da28e8 to
3e7859e
Compare
|
Hi @jonbrenas, I've made the changes. Take a look when you have a moment. |
| "SITE_ANNOTATIONS_ZARR_PATH": "reference/genome/agamp4/Anopheles-gambiae-PEST_SEQANNOTATION_AgamP4.12.zarr", | ||
| "DEFAULT_AIM_ANALYSIS": "20220528", | ||
| "DEFAULT_SITE_FILTERS_ANALYSIS": "dt_20200416", | ||
| "DEFAULT_KARYOTYPE_ANALYSIS": "20231213", |
There was a problem hiding this comment.
Where does this value come from?
There was a problem hiding this comment.
This is a simulated placeholder used in the test config. I’ve added a comment to make that clearer.
There was a problem hiding this comment.
The date 13/12/2023 still seems fairly random.
There was a problem hiding this comment.
Good catch , the previous value was arbitrary and ended up looking like a real date, which is confusing. I have replaced it with "simtest" so it is clearly just a test value and not mistaken for an actual analysis version on GCS.
|
|
||
| # Generate tag SNP data using positions from simulated SNP sites. | ||
| tags = [] | ||
| for contig, inversion in [("2R", "2Rb"), ("2L", "2La")]: |
There was a problem hiding this comment.
Why are the inversions hard coded here? If we ever implement karyotypes for ever species they won't have the same inversions.
There was a problem hiding this comment.
These are defined in the Ag3 simulator, so they are specific to that dataset. When karyotype support is added for other species, their simulators would define their own inversions, similar to how other simulator methods are structured.
| "No inversion tags are available for this data resource." | ||
| path = ( | ||
| f"{self._base_path}/{self._major_version_path}" | ||
| f"/karyotype/{self._karyotype_analysis}/karyotype_tag_snps.csv" |
There was a problem hiding this comment.
The actual path would probably be snp_karyotype to be consistent with the others (e.g., snp_haplotypes). Also, the name of the actual file shouldn't be hard-coded. It should probably be added to the config.
There was a problem hiding this comment.
Updated the path to follow the snp_karyotype convention and made the filename configurable via the config.
5d223e4 to
01111ab
Compare
|
Hi @jonbrenas, I’ve updated this to follow the existing patterns and addressed the review points. Let me know if this looks good. |
|
Could you please take a look when you have a moment? |
jonbrenas
left a comment
There was a problem hiding this comment.
Hi @adilraza99, I am not sure why there is an empty MANIFEST.in.
| "SITE_ANNOTATIONS_ZARR_PATH": "reference/genome/agamp4/Anopheles-gambiae-PEST_SEQANNOTATION_AgamP4.12.zarr", | ||
| "DEFAULT_AIM_ANALYSIS": "20220528", | ||
| "DEFAULT_SITE_FILTERS_ANALYSIS": "dt_20200416", | ||
| "DEFAULT_KARYOTYPE_ANALYSIS": "20231213", |
There was a problem hiding this comment.
The date 13/12/2023 still seems fairly random.
257700c to
fa6bccc
Compare
fa6bccc to
18ce2bb
Compare
|
Hi @jonbrenas, thanks a lot for the review. I've addressed the changes:
Please let me know if anything else needs adjustment. Thank you! |
This PR implements the karyotype refactor outlined in #689.
The main change is aligning karyotype tag loading with existing data access patterns used across the codebase (e.g.,
site_filters). Instead of loading a bundled CSV viaimportlib.resources, tag SNP data is now accessed from GCS using a config-driven analysis version and the shared filesystem interface.Changes:
importlib.resources-based loading with GCS-backed loading viaself._fsDEFAULT_KARYOTYPE_ANALYSISconfig key with optional constructor override{version}/karyotype/{analysis}/karyotype_tag_snps.csvValueErrorfor unknown inputs)inversion[0:2]string slicinginit_karyotype_tags) and test coverage undertests/anoph/karyotype_tag_snps.csvfrom package resourcesload_inversion_tagsinAg3.rstNote:
Simulated unit tests pass independently. Integration tests may require the tag SNP data to be available on GCS along with the corresponding config update.
Fixes #689