Commit b0e4ed0

Merge pull request #933 from Tanisha127/improve-sample-metadata-docs
docs: expand documentation for sample_metadata()
2 parents 2f07b41 + 31dc1fd commit b0e4ed0

1 file changed

Lines changed: 97 additions & 2 deletions

malariagen_data/anoph/sample_metadata.py

@@ -713,8 +713,103 @@ def clear_extra_metadata(self):
 
 @_check_types
 @doc(
-summary="Access sample metadata for one or more sample sets.",
-returns="A dataframe of sample metadata, one row per sample.",
+summary="""
+Access sample-level metadata for one or more sample sets.
+This method returns a pandas DataFrame where each row corresponds
+to a single sample. The metadata is assembled by merging multiple
+sources including general metadata, sequence quality control (QC)
+metadata, surveillance flags, and, when available, AIM and cohort
+metadata.
+""",
+returns="""
+A pandas DataFrame with one row per sample. Columns are grouped
+by metadata source:
+
+**General metadata** (present for all sample sets):
+
+- ``sample_id`` - Unique identifier for the sample.
+- ``partner_sample_id`` - Sample ID used by the contributing partner.
+- ``contributor`` - Name of the contributing institution or individual.
+- ``country`` - Country where the sample was collected.
+- ``location`` - Specific collection site (e.g., village or site name).
+- ``year`` - Year of collection.
+- ``month`` - Month of collection.
+- ``quarter`` - Quarter of the year derived from month (1–4).
+- ``latitude`` - GPS latitude of the collection site.
+- ``longitude`` - GPS longitude of the collection site.
+- ``sex_call`` - Sex determination call; ``'F'`` for female, ``'M'`` for male.
+- ``sample_set`` - Sample set containing the sample.
+- ``release`` - Data release containing the sample.
+- ``study_id`` - Identifier of the study the sample set belongs to.
+- ``study_url`` - URL of the study the sample set belongs to.
+- ``terms_of_use_expiry_date`` - Expiry date of terms of use for the sample.
+- ``terms_of_use_url`` - URL of the terms of use for the sample.
+- ``unrestricted_use`` - Whether the sample can be used without restrictions.
+- ``is_surveillance`` - Whether the sample can be used for surveillance.
+
+**Sequence QC metadata** (present for all sample sets; values may
+be missing if QC data is unavailable for a given sample set):
+
+- ``mean_cov`` - Mean sequencing coverage across the genome.
+- ``median_cov`` - Median sequencing coverage across the genome.
+- ``modal_cov`` - Modal (most frequent) sequencing coverage.
+- ``mean_cov_2L`` - Mean coverage on chromosome arm 2L.
+- ``median_cov_2L`` - Median coverage on chromosome arm 2L.
+- ``mode_cov_2L`` - Modal coverage on chromosome arm 2L.
+- ``mean_cov_2R`` - Mean coverage on chromosome arm 2R.
+- ``median_cov_2R`` - Median coverage on chromosome arm 2R.
+- ``mode_cov_2R`` - Modal coverage on chromosome arm 2R.
+- ``mean_cov_3L`` - Mean coverage on chromosome arm 3L.
+- ``median_cov_3L`` - Median coverage on chromosome arm 3L.
+- ``mode_cov_3L`` - Modal coverage on chromosome arm 3L.
+- ``mean_cov_3R`` - Mean coverage on chromosome arm 3R.
+- ``median_cov_3R`` - Median coverage on chromosome arm 3R.
+- ``mode_cov_3R`` - Modal coverage on chromosome arm 3R.
+- ``mean_cov_X`` - Mean coverage on chromosome X.
+- ``median_cov_X`` - Median coverage on chromosome X.
+- ``mode_cov_X`` - Modal coverage on chromosome X.
+- ``frac_gen_cov`` - Fraction of the genome covered.
+- ``divergence`` - Sequence divergence from the reference.
+- ``contam_pct`` - Estimated contamination percentage.
+- ``contam_LLR`` - Log-likelihood ratio for contamination estimate.
+
+**AIM (Ancestry-Informative Marker) metadata** (only present when
+an AIM analysis is available for the data resource, e.g., *Ag3*):
+
+- ``aim_species_fraction_arab`` - Fraction of gambcolu vs. arabiensis
+AIMs indicating arabiensis.
+- ``aim_species_fraction_colu`` - Fraction of gambiae vs. coluzzii AIMs
+indicating coluzzii.
+- ``aim_species_fraction_colu_no2l`` - Fraction of gambiae vs. coluzzii
+AIMs indicating coluzzii, excluding chromosome arm 2L.
+- ``aim_species_gambcolu_arabiensis`` - Taxon assigned by gambcolu vs.
+arabiensis AIMs.
+- ``aim_species_gambiae_coluzzii`` - Taxon assigned by gambiae vs.
+coluzzii AIMs.
+- ``aim_species`` - Final species assignment combining both AIM analyses.
+
+**Cohort metadata** (only present when a cohorts analysis is available
+for the data resource; quarter columns are only present for cohorts
+analyses from 20230223 onwards):
+
+- ``country_iso`` - ISO code of the country of collection.
+- ``admin1_name`` - Name of the first-level administrative region.
+- ``admin1_iso`` - ISO code of the first-level administrative region.
+- ``admin2_name`` - Name of the second-level administrative region.
+- ``taxon`` - Taxon assigned by combining AIM and cohort analyses.
+- ``cohort_admin1_year`` - Cohort grouping by admin level 1 and year.
+- ``cohort_admin1_month`` - Cohort grouping by admin level 1 and month.
+- ``cohort_admin1_quarter`` - Cohort grouping by admin level 1 and
+quarter (cohorts analysis >= 20230223 only).
+- ``cohort_admin2_year`` - Cohort grouping by admin level 2 and year.
+- ``cohort_admin2_month`` - Cohort grouping by admin level 2 and month.
+- ``cohort_admin2_quarter`` - Cohort grouping by admin level 2 and
+quarter (cohorts analysis >= 20230223 only).
+
+The exact columns present depend on the data resource and sample sets
+requested. The returned DataFrame is a copy and can be safely modified
+without affecting internal caches.
+""",
 notes="""
 Some samples in the dataset are lab crosses: mosquitoes bred in
 the laboratory that have no real collection date. These samples
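
The new docstring describes three behaviours: the frame is assembled by merging several metadata sources, `quarter` is derived from `month`, and the returned DataFrame is a copy that can be modified without touching internal caches. The sketch below illustrates those ideas with plain pandas. It is not the malariagen_data implementation: the `ToyMetadata` class, the toy tables, and the quarter formula are assumptions made for illustration only.

```python
import pandas as pd


class ToyMetadata:
    """Hypothetical stand-in; not the malariagen_data implementation."""

    def __init__(self, general, sequence_qc):
        self._general = general
        self._sequence_qc = sequence_qc
        self._cache = None

    def sample_metadata(self):
        if self._cache is None:
            # Left-merge so every sample in the general table keeps its row;
            # samples with no QC record get NaN in the QC columns.
            merged = self._general.merge(
                self._sequence_qc, on="sample_id", how="left"
            )
            # Assumed formula: map a 1-12 month onto a 1-4 quarter.
            merged["quarter"] = (merged["month"] - 1) // 3 + 1
            self._cache = merged
        # Return a copy so callers cannot mutate the cached frame.
        return self._cache.copy()


# Toy tables standing in for the per-source metadata (made-up values).
general = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "country": ["Ghana", "Kenya", "Mali"],
        "month": [2, 7, 11],
    }
)
sequence_qc = pd.DataFrame(
    {
        "sample_id": ["S1", "S3"],  # no QC row for S2
        "mean_cov": [31.2, 28.5],
    }
)

toy = ToyMetadata(general, sequence_qc)
df = toy.sample_metadata()
df["country"] = "edited"  # modifies the returned copy only
fresh = toy.sample_metadata()  # still reflects the cached, unmodified data
print(fresh)
```

A left merge keeps one row per sample even when a source has no record for it, which matches the docstring's remark that QC values may be missing. Returning `.copy()` costs some memory on each call but spares callers from making a defensive copy of their own.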
