The Genome Taxonomy DataBase (GTDB) is an initiative to establish a standardised microbial taxonomy based on genome phylogeny, primarily funded by an Australian Research Council Laureate Fellowship.
The genomes used to construct the phylogeny are obtained from RefSeq and GenBank, and GTDB releases are indexed to RefSeq releases, starting with release 76. Increasingly, this dataset includes draft genomes of uncultured microorganisms obtained from metagenomes and single cells, which improves genomic representation of the microbial world. All genomes are independently quality controlled using CheckM before inclusion in GTDB; summary statistics are available on this website.
The genome tree on which the taxonomy is based is inferred using FastTree from an aligned, concatenated set of 120 single-copy marker proteins for Bacteria and 122 marker proteins for Archaea (available from the download page). Additional marker sets, including a concatenation of 16S and 23S ribosomal RNA genes, are also used to cross-validate tree topologies.
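To make the idea of an "aligned concatenated set" of marker proteins concrete, the following is a minimal sketch, not the GTDB pipeline itself. It assumes each marker has already been aligned separately and is held as a mapping from genome ID to aligned sequence; the function name, the genome IDs, and the choice to pad missing markers with gap characters are illustrative assumptions.

```python
from typing import Dict, List


def concatenate_alignments(
    marker_alignments: List[Dict[str, str]],
    gap_char: str = "-",
) -> Dict[str, str]:
    """Concatenate per-marker alignments into one sequence per genome.

    Each element of marker_alignments maps a genome ID to that genome's
    aligned sequence for one marker. A genome that lacks a marker is
    padded with gap characters of that marker's alignment width
    (an assumption made for this sketch).
    """
    # Collect every genome seen in any marker alignment.
    genomes = sorted({g for aln in marker_alignments for g in aln})
    parts: Dict[str, List[str]] = {g: [] for g in genomes}

    for aln in marker_alignments:
        # All sequences within one marker alignment must share a width.
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in a marker alignment must have equal length")
        width = lengths.pop()
        for g in genomes:
            parts[g].append(aln.get(g, gap_char * width))

    return {g: "".join(p) for g, p in parts.items()}


# Hypothetical toy input: two markers, three genomes.
markers = [
    {"genome_A": "MKV-L", "genome_B": "MRVAL"},
    {"genome_A": "GT", "genome_C": "GS"},
]
concatenated = concatenate_alignments(markers)
```

The resulting sequences all have the same length, so they could be written out as a single FASTA alignment and passed to a tree inference program such as FastTree.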
NCBI taxonomy was initially used to decorate the genome tree via tax2tree. The 16S rRNA-based Greengenes taxonomy is used to supplement the taxonomy, particularly in regions of the tree with no cultured representatives. LPSN is used as the primary taxonomic authority for establishing naming priorities. Taxonomic ranks are normalised using phylorank, and the taxonomy is manually curated to remove polyphyletic groups. Polyphyly and rank evenness can be visualised in phylorank plots.
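As an illustration of what removing polyphyletic groups involves, the sketch below flags taxa whose member genomes do not form a single clade in a rooted tree. It is not how phylorank works internally; the nested-tuple tree representation, the function names, and the toy labels are assumptions made for the example, and it does not distinguish polyphyly from paraphyly.

```python
from typing import Dict, FrozenSet, List, Set, Tuple, Union

# A rooted tree as nested tuples: a leaf is a genome ID string,
# an internal node is a tuple of child subtrees (hypothetical format).
Tree = Union[str, Tuple["Tree", ...]]


def clade_leaf_sets(tree: Tree) -> List[FrozenSet[str]]:
    """Return the leaf set of every clade in the tree, leaves included."""
    clades: List[FrozenSet[str]] = []

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        clades.append(leaves)
        return leaves

    visit(tree)
    return clades


def non_monophyletic_taxa(tree: Tree, taxon_of: Dict[str, str]) -> List[str]:
    """List taxa whose genomes do not exactly match any clade's leaf set."""
    clades = set(clade_leaf_sets(tree))
    members: Dict[str, Set[str]] = {}
    for genome, taxon in taxon_of.items():
        members.setdefault(taxon, set()).add(genome)
    return sorted(t for t, m in members.items() if frozenset(m) not in clades)


# Hypothetical toy tree and taxon assignments.
toy_tree: Tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
toy_taxa = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C"}
flagged = non_monophyletic_taxa(toy_tree, toy_taxa)
```

In this toy case taxon B is flagged because its two genomes sit in different parts of the tree, while A and C each correspond to a single clade. The check assumes every genome in the assignment table is present in the tree.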
The GTDB taxonomy can be queried and downloaded through a number of tools on this website.
Stay tuned for a publication. In the meantime, if you use the GTDB taxonomy in your research, please cite this website.