@@ -709,30 +709,90 @@ def clear_extra_metadata(self):
709709 metadata.
710710 """,
711711 returns = """
712- A DataFrame with one row per sample. Columns include:
713-
714- - **sample_id** (*str*) - Unique sample identifier.
715- - **partner_sample_id** (*str*) - Sample ID assigned by the contributing partner.
716- - **contributor** (*str*) - Name of the contributing institution or individual.
717- - **country** (*str*) - Country where the sample was collected.
718- - **location** (*str*) - Specific collection location (e.g. village or site name).
719- - **year** (*int*) - Year of collection.
720- - **month** (*int*) - Month of collection, if available.
721- - **latitude** (*float*) - GPS latitude of the collection site.
722- - **longitude** (*float*) - GPS longitude of the collection site.
723- - **sex_call** (*str*) - Sex determination call; ``'F'`` for female, ``'M'`` for male.
724- - **taxon** (*str*) - Species or taxon assignment.
725- - **mean_cov** (*float*) - Mean sequencing coverage across the genome.
726- - **median_cov** (*float*) - Median sequencing coverage.
727- - **frac_reads_mapped** (*float*) - Fraction of reads mapped to the reference genome.
728- - **contam_pct** (*float*) - Estimated contamination percentage.
729- - **pass_qc** (*bool*) - Whether the sample passed quality control filters.
730- - **cohort_admin1_year** (*str*) - Cohort label combining admin level 1 region and year (if available).
731- - **cohort_admin2_year** (*str*) - Cohort label combining admin level 2 region and year (if available).
732- - **aim_species** (*str*) - Species assignment from ancestry-informative markers (if available).
733-
734- The returned DataFrame is a copy and can be safely modified
735- without affecting internal caches.
712+ A pandas DataFrame with one row per sample. Columns are grouped
713+ by metadata source:
714+
715+ **General metadata** (present for all sample sets):
716+
717+ - ``sample_id`` - Unique identifier for the sample.
718+ - ``partner_sample_id`` - Sample ID used by the contributing partner.
719+ - ``contributor`` - Name of the contributing institution or individual.
720+ - ``country`` - Country where the sample was collected.
721+ - ``location`` - Specific collection site (e.g., village or site name).
722+ - ``year`` - Year of collection.
723+ - ``month`` - Month of collection.
724+ - ``quarter`` - Quarter of the year derived from month (1–4; -1 if missing).
725+ - ``latitude`` - GPS latitude of the collection site.
726+ - ``longitude`` - GPS longitude of the collection site.
727+ - ``sex_call`` - Sex determination call; ``'F'`` for female, ``'M'`` for male.
728+ - ``sample_set`` - Sample set containing the sample.
729+ - ``release`` - Data release containing the sample.
730+ - ``study_id`` - Identifier of the study the sample set belongs to.
731+ - ``study_url`` - URL of the study the sample set belongs to.
732+ - ``terms_of_use_expiry_date`` - Expiry date of terms of use for the sample.
733+ - ``terms_of_use_url`` - URL of the terms of use for the sample.
734+ - ``unrestricted_use`` - Whether the sample can be used without restrictions.
735+
736+ **Sequence QC metadata**:
737+
738+ - ``mean_cov`` - Mean sequencing coverage across the genome.
739+ - ``median_cov`` - Median sequencing coverage across the genome.
740+ - ``modal_cov`` - Modal (most frequent) sequencing coverage.
741+ - ``mean_cov_2L`` - Mean coverage on chromosome arm 2L.
742+ - ``median_cov_2L`` - Median coverage on chromosome arm 2L.
743+ - ``mode_cov_2L`` - Modal coverage on chromosome arm 2L.
744+ - ``mean_cov_2R`` - Mean coverage on chromosome arm 2R.
745+ - ``median_cov_2R`` - Median coverage on chromosome arm 2R.
746+ - ``mode_cov_2R`` - Modal coverage on chromosome arm 2R.
747+ - ``mean_cov_3L`` - Mean coverage on chromosome arm 3L.
748+ - ``median_cov_3L`` - Median coverage on chromosome arm 3L.
749+ - ``mode_cov_3L`` - Modal coverage on chromosome arm 3L.
750+ - ``mean_cov_3R`` - Mean coverage on chromosome arm 3R.
751+ - ``median_cov_3R`` - Median coverage on chromosome arm 3R.
752+ - ``mode_cov_3R`` - Modal coverage on chromosome arm 3R.
753+ - ``mean_cov_X`` - Mean coverage on chromosome X.
754+ - ``median_cov_X`` - Median coverage on chromosome X.
755+ - ``mode_cov_X`` - Modal coverage on chromosome X.
756+ - ``frac_gen_cov`` - Fraction of the genome covered.
757+ - ``divergence`` - Sequence divergence from the reference.
758+ - ``contam_pct`` - Estimated contamination percentage.
759+ - ``contam_LLR`` - Log-likelihood ratio for contamination estimate.
760+
761+ **Surveillance flags**:
762+
763+ - ``is_surveillance`` - Whether the sample can be used for surveillance.
764+
765+ **AIM (Ancestry-Informative Marker) metadata** (if available):
766+
767+ - ``aim_species_fraction_arab`` - Fraction of gambcolu vs. arabiensis AIMs
768+ indicating arabiensis.
769+ - ``aim_species_fraction_colu`` - Fraction of gambiae vs. coluzzii AIMs
770+ indicating coluzzii.
771+ - ``aim_species_fraction_colu_no2l`` - Fraction of gambiae vs. coluzzii AIMs
772+ indicating coluzzii, excluding chromosome arm 2L.
773+ - ``aim_species_gambcolu_arabiensis`` - Taxon assigned by gambcolu vs.
774+ arabiensis AIMs.
775+ - ``aim_species_gambiae_coluzzii`` - Taxon assigned by gambiae vs.
776+ coluzzii AIMs.
777+ - ``aim_species`` - Final species assignment combining both AIM analyses.
778+
779+ **Cohort metadata** (if available):
780+
781+ - ``country_iso`` - ISO code of the country of collection.
782+ - ``admin1_name`` - Name of the first-level administrative region.
783+ - ``admin1_iso`` - ISO code of the first-level administrative region.
784+ - ``admin2_name`` - Name of the second-level administrative region.
785+ - ``taxon`` - Taxon assigned by combining AIM and cohort analyses.
786+ - ``cohort_admin1_year`` - Cohort grouping by admin level 1 and year.
787+ - ``cohort_admin1_month`` - Cohort grouping by admin level 1 and month.
788+ - ``cohort_admin1_quarter`` - Cohort grouping by admin level 1 and quarter.
789+ - ``cohort_admin2_year`` - Cohort grouping by admin level 2 and year.
790+ - ``cohort_admin2_month`` - Cohort grouping by admin level 2 and month.
791+ - ``cohort_admin2_quarter`` - Cohort grouping by admin level 2 and quarter.
792+
793+ The exact columns present depend on the sample sets requested and
794+ which analyses are available. The returned DataFrame is a copy and
795+ can be safely modified without affecting internal caches.
736796 """,
737797 )
738798 def sample_metadata(
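Note on the new docstring: it states that the exact columns present depend on the sample sets requested and on which analyses are available, so calling code may want to check for optional columns (such as the AIM and cohort ones) before selecting them. A minimal sketch of that check follows. The helper `split_columns` and the example column listing are hypothetical and not part of this change; the helper works on plain column names so that it runs without pandas.

```python
def split_columns(available, wanted):
    """Partition `wanted` into names found in `available` and names that are not.

    `available` is any iterable of column names, for example `df.columns`
    from a sample metadata DataFrame. Order of `wanted` is preserved.
    """
    available = set(available)
    present = [name for name in wanted if name in available]
    missing = [name for name in wanted if name not in available]
    return present, missing


# Hypothetical column listing, standing in for list(df.columns) when
# no AIM or cohort analyses are available for the requested sample sets.
available = ["sample_id", "year", "month", "mean_cov", "is_surveillance"]

# Columns the caller would like to use; the last two are documented
# as optional ("if available").
wanted = ["sample_id", "year", "aim_species", "cohort_admin1_year"]

present, missing = split_columns(available, wanted)
print("present:", present)
print("missing:", missing)
```

With a real DataFrame `df`, `df[present]` would then select only the columns that exist, avoiding the `KeyError` that pandas raises when a list selection includes absent labels.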