Awesome MAG

A curated list of metagenome-assembled genome (MAG) datasets, catalogs, and database websites, with reproducible download notes and lightweight automation helpers for sources that still require manual clicks or multi-step browser workflows.

Scope

This repository is intended to collect:

MAG datasets and genome catalogs
MAG-oriented databases and project portals
Download pages, mirrors, and release notes
Access notes for sources that are difficult to fetch reproducibly
Automation scripts for sources that are not easily downloadable from a single stable URL

This repository is not intended to:

mirror large upstream datasets in Git
replace official documentation from source websites
host full analysis pipelines for assembly, binning, or annotation

Why This Repository Exists

MAG resources are distributed across project websites, supplemental data pages, institutional portals, and database interfaces. Discovery is often easy; reproducible access is not. An awesome list for MAG resources is therefore most useful when it separates three concerns:

A human-readable curated index for quick discovery
Structured source metadata for consistent maintenance
Small source-specific scripts for awkward download flows

Suggested Top-Level Sections

As the list grows, the main README.md should stay concise and group links by user intent. A practical structure is:

General MAG catalogs and database portals
Human-associated MAG resources
Animal-associated MAG resources
Marine and freshwater MAG resources
Soil and terrestrial environment MAG resources
Wastewater, engineered, and extreme-environment resources
Integrated or multi-biome collections
Metadata, annotation, and companion resources
Download notes, mirrors, and access restrictions
Deprecated, moved, or archived resources

General MAG Catalogs and Database Portals

Resource	Scope	Type	Access	Automation	Notes
MAGdb	Clinical, Environment, Animal	Database portal	Public listings; cookie-gated archive downloads	Download script	Per-study `data.tar.gz`; 74 study packages downloaded; see notes
gcMeta	Multi-biome catalogues	Database portal	Public catalogue APIs; public direct archive files	Download script	50 catalogue bundles; public `catalogueTree` and `catalogueNameList` enumeration plus derived direct files on `open.nmdc.cn`; see notes
GEM	Global multi-biome bacterial and archaeal MAGs	Static dataset portal	Public NERSC file indexes and direct archives	None	52,515 MAGs; 39.5G genome FASTA tar, 26.7G protein tar, 39.3G CDS tar, metadata, OTU, BGC, prophage, protein-cluster, and tree files; see notes
GOMC	Global ocean microbial MAG/genome catalogue	Dataset portal and CNGBdb archive	Public direct bulk archives, MD5 file, and CNGB/CNSA accession files	None	43,191 recovered MAGs, 24,195 GOMC genomes, 171.3G protein catalogue, supplementary files, and CNP0004049 accession archive; see notes
SPIRE	Multi-biome MAGs and assemblies	Dataset portal	Public direct URLs; Apache indexes	URL helper	714 page-listed studies; script prints URLs only for use with `wget`, `aria2c`, or other tools; see notes
mOTUs DB	Multi-biome prokaryotic genomes and mOTUs	Database portal and tool-backed dataset	Public bulk 4.0 file host; targeted access through `motus-tool`	Official `motus-tool`	2.7T all-genomes tar, full metadata, supplementary tables, and marker/annotation DBs; see notes
Microbiome Datahub	Multi-biome MAG metadata, annotations, and sequences	Database portal and API-backed dataset	Public Zenodo metadata; public NIG bulk sequence files; targeted download API	Download helper	218,653 MAGs in site docs; 146G all-contig FASTA, 79G all-protein FASTA, Zenodo metadata/matrix files, and targeted URL APIs; see notes
Bin Chicken Rare Biosphere Genomes	Multi-biome rare biosphere MAGs	Zenodo supplementary dataset	Public direct Zenodo files; latest record is metadata-only	None	77,562 Bin Chicken-recovered genomes; MAG archives are earlier explicit Zenodo versions, while latest record is revised metadata; see notes
SMAG	Global soil MAGs	Project portal, Zenodo dataset, and code repository	Public Zenodo split archive; CyVerse and S3 mirrors have access friction	None	40,039 soil MAG bins from 3,304 metagenomes, 21,077 SGBs, plus SNV and virus files; see notes

Directory Design

`README.md`

The main landing page for humans. In an awesome repository, this is the canonical entry point and should remain easy to scan. Keep descriptions short and avoid turning the front page into a raw data dump.

`sources/`

Stores structured information for individual data sources when a one-line README entry is not enough. Over time, each source can grow into a dedicated folder such as:

sources/<slug>/
├── metadata.yaml
├── notes.md
└── download.md

Recommended use:

metadata.yaml: machine-readable source metadata
notes.md: short curation notes, caveats, or history
download.md: manual download steps, quirks, tokens, cookies, or browser requirements
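
As a rough illustration of the layout above, the following sketch creates a `sources/<slug>/` folder with the three files. The `slugify` and `scaffold_source` helpers, the slug rule, and the placeholder file contents are assumptions made for this example; they are not existing repository tooling.

```python
import re
from pathlib import Path

# Placeholder starter content per file (hypothetical; adapt to templates/).
SOURCE_FILES = {
    "metadata.yaml": "name: {name}\nhomepage:\n",
    "notes.md": "{name} notes\n",
    "download.md": "{name} download steps\n",
}


def slugify(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def scaffold_source(root: Path, name: str) -> Path:
    """Create sources/<slug>/ under root without overwriting existing files."""
    folder = root / "sources" / slugify(name)
    folder.mkdir(parents=True, exist_ok=True)
    for filename, template in SOURCE_FILES.items():
        path = folder / filename
        if not path.exists():  # keep any curated content already present
            path.write_text(template.format(name=name), encoding="utf-8")
    return folder
```

Calling `scaffold_source(Path("."), "Microbiome Datahub")` would create `sources/microbiome-datahub/` under this slug rule. Skipping files that already exist keeps the helper safe to re-run.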

`scripts/`

Contains automation helpers for websites that require clicking through multiple pages, filling forms, resolving dynamic URLs, or replaying authenticated browser actions. Keep scripts source-specific and reproducible.

Example future layout:

scripts/
├── shared/
└── <slug>/
    ├── README.md
    └── download.py
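
A `download.py` under this layout might look roughly like the sketch below, which fetches a list of direct URLs with a pause between requests and offers an MD5 helper for sources that publish checksum files. It uses only the Python standard library; the function names, the default delay, and the skip-if-present behaviour are assumptions for illustration and do not describe any script currently in the repository.

```python
import hashlib
import shutil
import time
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_to_file(url: str, target: Path) -> None:
    """Stream one URL to disk."""
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)


def download_all(
    urls: Iterable[str],
    dest: Path,
    delay_seconds: float = 2.0,  # assumed default; tune per source
    fetch: Callable[[str, Path], None] = fetch_to_file,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Download each URL into dest, pausing between actual requests."""
    dest.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    requests_made = 0
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():  # skip files already on disk
            if requests_made:
                sleep(delay_seconds)  # simple rate limiting
            fetch(url, target)
            requests_made += 1
        saved.append(target)
    return saved
```

Passing `fetch` and `sleep` as parameters keeps the helper testable without network access, and skipping existing files lets an interrupted run resume. A partially written file would also be skipped, so comparing `md5sum` output against a published checksum is a sensible follow-up step.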

`docs/`

Holds repository conventions that should not clutter the front page, such as style rules, source field definitions, and roadmap notes.

`templates/`

Contains reusable starter files for adding new sources consistently.

Suggested Entry Fields

Each source should eventually capture as many of the following fields as practical:

name
homepage
short summary
environment or biome
source type
release or version
update date
license or terms of use
download method
automation availability
known access quirks
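
One way to keep these fields consistent is to model them as a small record and report which ones are still empty. The sketch below is an assumption about how that could look: the attribute names are derived from the list above, and treating `name` and `homepage` as the only required fields is a choice made for this example, not a rule stated by the repository.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SourceEntry:
    # Required in this sketch (an assumption, not a repository rule).
    name: str
    homepage: str
    # Optional fields, filled in as curation progresses.
    summary: Optional[str] = None
    biome: Optional[str] = None
    source_type: Optional[str] = None
    release: Optional[str] = None
    update_date: Optional[str] = None
    license: Optional[str] = None
    download_method: Optional[str] = None
    automation: Optional[str] = None
    access_quirks: Optional[str] = None


def missing_fields(entry: SourceEntry) -> List[str]:
    """List the field names that are still None or empty."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) in (None, "")]
```

A maintainer could run `missing_fields` over every entry to see which sources still need a license, release, or access note before they are considered complete.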

Curation Rules

Prefer original project or database pages over third-party mirrors.
Record direct download URLs when they are stable, but keep the landing page as the primary reference.
Call out access friction explicitly, such as login requirements, JavaScript-only buttons, request forms, or temporary tokens.
Keep main-list descriptions brief and move detailed notes into sources/ or scripts/.
Do not commit downloaded datasets or large derived files.
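
The last rule can be checked mechanically before committing. The following sketch walks a working tree and reports files above a size threshold; the function name and the 5 MiB default are illustrative assumptions, not a documented repository limit.

```python
from pathlib import Path
from typing import List


def find_large_files(root: Path, max_bytes: int = 5 * 1024 * 1024) -> List[Path]:
    """Return files under root larger than max_bytes, ignoring .git internals."""
    large: List[Path] = []
    for path in sorted(root.rglob("*")):
        if ".git" in path.relative_to(root).parts:
            continue  # repository internals are not tracked content
        if path.is_file() and path.stat().st_size > max_bytes:
            large.append(path)
    return large
```

Running `find_large_files(Path("."))` from the repository root before a commit gives a quick list of candidates to delete or add to `.gitignore`.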

Contributing

See CONTRIBUTING.md.

License

Add a repository license for the curation text and scripts, while respecting the original licenses and terms attached to each upstream dataset or database.

Disclaimer

This repository contains automated download scripts designed to simplify the retrieval of public data. Please note:

Compliance & Fair Use: All scripts provided in this repository are simple wrappers that interact exclusively with publicly available APIs, direct download links, or standard web endpoints as intended by the data providers.
No Malicious Activity: These scripts do not scrape aggressively, bypass access controls, or intrude on host servers. They typically include built-in rate limiting (e.g., delays between requests) to respect server workloads.
User Responsibility: Users of these scripts are responsible for ensuring that their data retrieval complies with the specific terms of service, data usage policies, and licensing requirements of each respective database or dataset provider.
Takedown Requests: If any database maintainer or data provider believes that a script in this repository infringes upon their rights or violates their policies, please open an issue. We will promptly review and remove the relevant scripts.
