Commit 293cf4c

docs: fix indentation in sample_metadata() returns docstring
1 parent b34d043 commit 293cf4c

1 file changed

Lines changed: 84 additions & 24 deletions

malariagen_data/anoph/sample_metadata.py

Original file line number | Diff line number | Diff line change
@@ -709,30 +709,90 @@ def clear_extra_metadata(self):
709 709
metadata.
710 710
""",
711 711
returns="""
712-
A DataFrame with one row per sample. Columns include:
713-
714-
- **sample_id** (*str*) - Unique sample identifier.
715-
- **partner_sample_id** (*str*) - Sample ID assigned by the contributing partner.
716-
- **contributor** (*str*) - Name of the contributing institution or individual.
717-
- **country** (*str*) - Country where the sample was collected.
718-
- **location** (*str*) - Specific collection location (e.g. village or site name).
719-
- **year** (*int*) - Year of collection.
720-
- **month** (*int*) - Month of collection, if available.
721-
- **latitude** (*float*) - GPS latitude of the collection site.
722-
- **longitude** (*float*) - GPS longitude of the collection site.
723-
- **sex_call** (*str*) - Sex determination call; ``'F'`` for female, ``'M'`` for male.
724-
- **taxon** (*str*) - Species or taxon assignment.
725-
- **mean_cov** (*float*) - Mean sequencing coverage across the genome.
726-
- **median_cov** (*float*) - Median sequencing coverage.
727-
- **frac_reads_mapped** (*float*) - Fraction of reads mapped to the reference genome.
728-
- **contam_pct** (*float*) - Estimated contamination percentage.
729-
- **pass_qc** (*bool*) - Whether the sample passed quality control filters.
730-
- **cohort_admin1_year** (*str*) - Cohort label combining admin level 1 region and year (if available).
731-
- **cohort_admin2_year** (*str*) - Cohort label combining admin level 2 region and year (if available).
732-
- **aim_species** (*str*) - Species assignment from ancestry-informative markers (if available).
733-
734-
The returned DataFrame is a copy and can be safely modified
735-
without affecting internal caches.
712+
A pandas DataFrame with one row per sample. Columns are grouped
713+
by metadata source:
714+
715+
**General metadata** (present for all sample sets):
716+
717+
- ``sample_id`` - Unique identifier for the sample.
718+
- ``partner_sample_id`` - Sample ID used by the contributing partner.
719+
- ``contributor`` - Name of the contributing institution or individual.
720+
- ``country`` - Country where the sample was collected.
721+
- ``location`` - Specific collection site (e.g., village or site name).
722+
- ``year`` - Year of collection.
723+
- ``month`` - Month of collection.
724+
- ``quarter`` - Quarter of the year derived from month (1–4; -1 if missing).
725+
- ``latitude`` - GPS latitude of the collection site.
726+
- ``longitude`` - GPS longitude of the collection site.
727+
- ``sex_call`` - Sex determination call; ``'F'`` for female, ``'M'`` for male.
728+
- ``sample_set`` - Sample set containing the sample.
729+
- ``release`` - Data release containing the sample.
730+
- ``study_id`` - Identifier of the study the sample set belongs to.
731+
- ``study_url`` - URL of the study the sample set belongs to.
732+
- ``terms_of_use_expiry_date`` - Expiry date of terms of use for the sample.
733+
- ``terms_of_use_url`` - URL of the terms of use for the sample.
734+
- ``unrestricted_use`` - Whether the sample can be used without restrictions.
735+
736+
**Sequence QC metadata**:
737+
738+
- ``mean_cov`` - Mean sequencing coverage across the genome.
739+
- ``median_cov`` - Median sequencing coverage across the genome.
740+
- ``modal_cov`` - Modal (most frequent) sequencing coverage.
741+
- ``mean_cov_2L`` - Mean coverage on chromosome arm 2L.
742+
- ``median_cov_2L`` - Median coverage on chromosome arm 2L.
743+
- ``mode_cov_2L`` - Modal coverage on chromosome arm 2L.
744+
- ``mean_cov_2R`` - Mean coverage on chromosome arm 2R.
745+
- ``median_cov_2R`` - Median coverage on chromosome arm 2R.
746+
- ``mode_cov_2R`` - Modal coverage on chromosome arm 2R.
747+
- ``mean_cov_3L`` - Mean coverage on chromosome arm 3L.
748+
- ``median_cov_3L`` - Median coverage on chromosome arm 3L.
749+
- ``mode_cov_3L`` - Modal coverage on chromosome arm 3L.
750+
- ``mean_cov_3R`` - Mean coverage on chromosome arm 3R.
751+
- ``median_cov_3R`` - Median coverage on chromosome arm 3R.
752+
- ``mode_cov_3R`` - Modal coverage on chromosome arm 3R.
753+
- ``mean_cov_X`` - Mean coverage on chromosome X.
754+
- ``median_cov_X`` - Median coverage on chromosome X.
755+
- ``mode_cov_X`` - Modal coverage on chromosome X.
756+
- ``frac_gen_cov`` - Fraction of the genome covered.
757+
- ``divergence`` - Sequence divergence from the reference.
758+
- ``contam_pct`` - Estimated contamination percentage.
759+
- ``contam_LLR`` - Log-likelihood ratio for contamination estimate.
760+
761+
**Surveillance flags**:
762+
763+
- ``is_surveillance`` - Whether the sample can be used for surveillance.
764+
765+
**AIM (Ancestry-Informative Marker) metadata** (if available):
766+
767+
- ``aim_species_fraction_arab`` - Fraction of gambcolu vs. arabiensis AIMs
768+
indicating arabiensis.
769+
- ``aim_species_fraction_colu`` - Fraction of gambiae vs. coluzzii AIMs
770+
indicating coluzzii.
771+
- ``aim_species_fraction_colu_no2l`` - Fraction of gambiae vs. coluzzii AIMs
772+
indicating coluzzii, excluding chromosome arm 2L.
773+
- ``aim_species_gambcolu_arabiensis`` - Taxon assigned by gambcolu vs.
774+
arabiensis AIMs.
775+
- ``aim_species_gambiae_coluzzii`` - Taxon assigned by gambiae vs.
776+
coluzzii AIMs.
777+
- ``aim_species`` - Final species assignment combining both AIM analyses.
778+
779+
**Cohort metadata** (if available):
780+
781+
- ``country_iso`` - ISO code of the country of collection.
782+
- ``admin1_name`` - Name of the first-level administrative region.
783+
- ``admin1_iso`` - ISO code of the first-level administrative region.
784+
- ``admin2_name`` - Name of the second-level administrative region.
785+
- ``taxon`` - Taxon assigned by combining AIM and cohort analyses.
786+
- ``cohort_admin1_year`` - Cohort grouping by admin level 1 and year.
787+
- ``cohort_admin1_month`` - Cohort grouping by admin level 1 and month.
788+
- ``cohort_admin1_quarter`` - Cohort grouping by admin level 1 and quarter.
789+
- ``cohort_admin2_year`` - Cohort grouping by admin level 2 and year.
790+
- ``cohort_admin2_month`` - Cohort grouping by admin level 2 and month.
791+
- ``cohort_admin2_quarter`` - Cohort grouping by admin level 2 and quarter.
792+
793+
The exact columns present depend on the sample sets requested and
794+
which analyses are available. The returned DataFrame is a copy and
795+
can be safely modified without affecting internal caches.
736 796
""",
737 797
)
738 798
def sample_metadata(
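
The new docstring describes ``quarter`` as derived from ``month`` (1–4; -1 if missing). A minimal sketch of one way such a mapping could work is shown below. The function name `quarter_from_month` is hypothetical, and treating `None` or any value outside 1–12 as "missing" is an assumption; this is not taken from the library's implementation.

```python
from typing import Optional


def quarter_from_month(month: Optional[int]) -> int:
    """Map a calendar month (1-12) to a quarter (1-4), or -1 if missing.

    Assumption: a missing month arrives as None or as a value outside 1-12.
    """
    if month is None or not 1 <= month <= 12:
        return -1
    # Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
    return (month - 1) // 3 + 1


for m in (1, 3, 4, 12, -1, None):
    print(m, quarter_from_month(m))
```

Because the docstring also notes that the exact columns depend on the sample sets requested, callers may want to check that ``quarter`` is present before relying on it.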
