@@ -713,8 +713,103 @@ def clear_extra_metadata(self):
713 713
714 714 @_check_types
715 715 @doc(
716- summary="Access sample metadata for one or more sample sets.",
717- returns="A dataframe of sample metadata, one row per sample.",
716+ summary="""
717+ Access sample-level metadata for one or more sample sets.
718+ This method returns a pandas DataFrame in which each row corresponds
719+ to a single sample. The metadata is assembled by merging several
720+ sources: general metadata, sequence quality control (QC) metadata,
721+ surveillance flags and, when available, AIM and cohort
722+ metadata.
723+ """,
724+ returns="""
725+ A pandas DataFrame with one row per sample. Columns are grouped
726+ by metadata source:
727+
728+ **General metadata** (present for all sample sets):
729+
730+ - ``sample_id`` - Unique identifier for the sample.
731+ - ``partner_sample_id`` - Sample ID used by the contributing partner.
732+ - ``contributor`` - Name of the contributing institution or individual.
733+ - ``country`` - Country where the sample was collected.
734+ - ``location`` - Specific collection site (e.g., village or site name).
735+ - ``year`` - Year of collection.
736+ - ``month`` - Month of collection.
737+ - ``quarter`` - Quarter of the year (1–4), derived from ``month``.
738+ - ``latitude`` - GPS latitude of the collection site.
739+ - ``longitude`` - GPS longitude of the collection site.
740+ - ``sex_call`` - Sex determination call; ``'F'`` for female, ``'M'`` for male.
741+ - ``sample_set`` - Sample set containing the sample.
742+ - ``release`` - Data release containing the sample.
743+ - ``study_id`` - Identifier of the study the sample set belongs to.
744+ - ``study_url`` - URL of the study the sample set belongs to.
745+ - ``terms_of_use_expiry_date`` - Expiry date of the terms of use for the sample.
746+ - ``terms_of_use_url`` - URL of the terms of use for the sample.
747+ - ``unrestricted_use`` - Whether the sample can be used without restrictions.
748+ - ``is_surveillance`` - Whether the sample can be used for surveillance.
749+
750+ **Sequence QC metadata** (present for all sample sets; values may
751+ be missing if QC data is unavailable for a given sample set):
752+
753+ - ``mean_cov`` - Mean sequencing coverage across the genome.
754+ - ``median_cov`` - Median sequencing coverage across the genome.
755+ - ``modal_cov`` - Modal (most frequent) sequencing coverage.
756+ - ``mean_cov_2L`` - Mean coverage on chromosome arm 2L.
757+ - ``median_cov_2L`` - Median coverage on chromosome arm 2L.
758+ - ``mode_cov_2L`` - Modal coverage on chromosome arm 2L.
759+ - ``mean_cov_2R`` - Mean coverage on chromosome arm 2R.
760+ - ``median_cov_2R`` - Median coverage on chromosome arm 2R.
761+ - ``mode_cov_2R`` - Modal coverage on chromosome arm 2R.
762+ - ``mean_cov_3L`` - Mean coverage on chromosome arm 3L.
763+ - ``median_cov_3L`` - Median coverage on chromosome arm 3L.
764+ - ``mode_cov_3L`` - Modal coverage on chromosome arm 3L.
765+ - ``mean_cov_3R`` - Mean coverage on chromosome arm 3R.
766+ - ``median_cov_3R`` - Median coverage on chromosome arm 3R.
767+ - ``mode_cov_3R`` - Modal coverage on chromosome arm 3R.
768+ - ``mean_cov_X`` - Mean coverage on chromosome X.
769+ - ``median_cov_X`` - Median coverage on chromosome X.
770+ - ``mode_cov_X`` - Modal coverage on chromosome X.
771+ - ``frac_gen_cov`` - Fraction of the genome covered.
772+ - ``divergence`` - Sequence divergence from the reference.
773+ - ``contam_pct`` - Estimated contamination percentage.
774+ - ``contam_LLR`` - Log-likelihood ratio for the contamination estimate.
775+
776+ **AIM (Ancestry-Informative Marker) metadata** (only present when
777+ an AIM analysis is available for the data resource, e.g., *Ag3*):
778+
779+ - ``aim_species_fraction_arab`` - Fraction of gambcolu vs. arabiensis
780+ AIMs indicating arabiensis.
781+ - ``aim_species_fraction_colu`` - Fraction of gambiae vs. coluzzii AIMs
782+ indicating coluzzii.
783+ - ``aim_species_fraction_colu_no2l`` - Fraction of gambiae vs. coluzzii
784+ AIMs indicating coluzzii, excluding chromosome arm 2L.
785+ - ``aim_species_gambcolu_arabiensis`` - Taxon assigned by gambcolu vs.
786+ arabiensis AIMs.
787+ - ``aim_species_gambiae_coluzzii`` - Taxon assigned by gambiae vs.
788+ coluzzii AIMs.
789+ - ``aim_species`` - Final species assignment combining both AIM analyses.
790+
791+ **Cohort metadata** (only present when a cohorts analysis is available
792+ for the data resource; quarter columns are only present for cohorts
793+ analyses from 20230223 onwards):
794+
795+ - ``country_iso`` - ISO code of the country of collection.
796+ - ``admin1_name`` - Name of the first-level administrative region.
797+ - ``admin1_iso`` - ISO code of the first-level administrative region.
798+ - ``admin2_name`` - Name of the second-level administrative region.
799+ - ``taxon`` - Taxon assigned by combining AIM and cohort analyses.
800+ - ``cohort_admin1_year`` - Cohort grouping by admin level 1 and year.
801+ - ``cohort_admin1_month`` - Cohort grouping by admin level 1 and month.
802+ - ``cohort_admin1_quarter`` - Cohort grouping by admin level 1 and
803+ quarter (cohorts analysis >= 20230223 only).
804+ - ``cohort_admin2_year`` - Cohort grouping by admin level 2 and year.
805+ - ``cohort_admin2_month`` - Cohort grouping by admin level 2 and month.
806+ - ``cohort_admin2_quarter`` - Cohort grouping by admin level 2 and
807+ quarter (cohorts analysis >= 20230223 only).
808+
809+ The exact columns present depend on the data resource and sample sets
810+ requested. The returned DataFrame is a copy and can be safely modified
811+ without affecting internal caches.
812+ """,
718 813 notes="""
719 814 Some samples in the dataset are lab crosses: mosquitoes bred in
720 815 the laboratory that have no real collection date. These samples
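Because the new docstring says the exact columns depend on the data resource and the sample sets requested, calling code may want to check which column groups are present before using them. The following is a minimal sketch of such a check, not part of the library: `GROUP_MARKERS`, `present_groups` and `quarter_cohorts_available` are hypothetical names, the choice of one marker column per group is an assumption, and the column lists are made up to stand in for `list(df.columns)`.

```python
# Hypothetical helpers, not part of the library: report which of the
# documented column groups appear in a list of column names.

# One marker column per group, taken from the docstring in the diff.
# Using a single marker per group is an assumption made for brevity.
GROUP_MARKERS = {
    "general": "sample_id",
    "sequence_qc": "mean_cov",
    "aim": "aim_species",
    "cohort": "cohort_admin1_year",
}


def present_groups(columns):
    """Return the names of the groups whose marker column is present."""
    available = set(columns)
    return [group for group, marker in GROUP_MARKERS.items() if marker in available]


def quarter_cohorts_available(columns):
    """Return True if both quarter-level cohort columns are present."""
    return {"cohort_admin1_quarter", "cohort_admin2_quarter"} <= set(columns)


# Made-up column lists standing in for list(df.columns).
ag3_like = ["sample_id", "country", "mean_cov", "aim_species", "cohort_admin1_year"]
no_aim = ["sample_id", "country", "mean_cov"]

print(present_groups(ag3_like))
print(present_groups(no_aim))
print(quarter_cohorts_available(ag3_like))
```

Checking column names rather than the data resource name keeps the check independent of which resource or cohorts analysis produced the DataFrame.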