The GTDB version indicates both the GTDB and RefSeq release numbers.
For example, R05-RS95 designates the fifth release of the GTDB and indicates reference
genomes were obtained from RefSeq release 95.
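As a minimal sketch, a version label of this form can be split into its two release numbers with a regular expression. The function and type names below are hypothetical, and the pattern assumes every label follows the `R<GTDB release>-RS<RefSeq release>` shape of the example above.

```python
import re
from typing import NamedTuple


class GtdbVersion(NamedTuple):
    gtdb_release: int
    refseq_release: int


# Assumed label shape, taken from the single example "R05-RS95".
_VERSION_PATTERN = re.compile(r"^R(\d+)-RS(\d+)$")


def parse_gtdb_version(label: str) -> GtdbVersion:
    """Split a label such as 'R05-RS95' into GTDB and RefSeq release numbers."""
    match = _VERSION_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised GTDB version label: {label!r}")
    return GtdbVersion(int(match.group(1)), int(match.group(2)))


print(parse_gtdb_version("R05-RS95"))
```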
A strain identifier is used as a placeholder for the genus name when
there is no existing genus
name and no binomially named representative genome. For example, the genome
GCF_000318095.2 has the NCBI organism
name Prevotella sp. oral taxon 473 str. F0040. However, this genome is more closely
related to Prevotellamassilia
and Alloprevotella. Consequently, we assign it to the placeholder genus g__F0040. If the
organism had been assigned
a binomial species name such as Prevotella oralitaxus str. F0040 we would assign it to
the placeholder genus g__Prevotella_A
to indicate it is not a true Prevotella species, but that there are representative
genomes that have been assigned to a species.
A strain identifier is used as a placeholder for the genus name when there is no
existing genus name and no binomially named
representative genome. For example, the genome GCF_000318095.2 has the NCBI organism
name Prevotella sp. oral taxon 473 str. F0040
and is assigned to the genus Alloprevotella in NCBI. However, in GTDB this genome does not
appear to belong to Prevotella, Alloprevotella, or the closely related genus
Prevotellamassilia. Consequently, we assign it to the placeholder genus g__F0040. If the
organism had been assigned a binomial species name such as Prevotella oralitaxus str. F0040,
and it was not part of the true Prevotella in GTDB, we would assign it to the placeholder
genus g__Prevotella_A to indicate it is not a true Prevotella species, but that there are
representative genomes that have been assigned to a species.
Genus names ending with an alphabetic suffix indicate genera that are i) polyphyletic according to the current GTDB reference tree, or ii) subdivided based on taxonomic rank normalisation according to the current GTDB reference tree.
Species names end with an alphabetic suffix if the GTDB species cluster is (or was previously) associated with a species name, but the correct application of this name is ambiguous or the name is assigned to a different GTDB species cluster based on the presence of type material or via majority voting.
The lineage or species cluster containing the nomenclature type or, in case of species, satisfying the majority vote criteria retains the unsuffixed name and all other lineages/clusters are given alphabetic suffixes, indicating that they are placeholder names that need to be replaced in due course. A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.
Taxon names above the rank of genus appended with an alphabetic suffix indicate groups that fall into one of the following categories: i) groups that are not monophyletic in the GTDB reference tree, but for which there exists alternative evidence that they are monophyletic groups; ii) groups whose placement is unstable between releases.
A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.
Genomes are obtained from NCBI and must meet the following criteria to be included in the GTDB reference trees and database:
Bacterial and archaeal multiple sequence alignments (MSAs) are formed from the
concatenation of 120 or 122 phylogenetically informative markers, respectively. These
marker sets are referred to as bac120 and ar122 for bacterial and archaeal markers,
respectively, and comprise proteins or protein domains specified in the Pfam v27 or
TIGRFAMs v15.0 databases. Details on these markers are available for bacteria and archaea.
Gene calling is performed with Prodigal v2.6.3, and markers are identified and aligned
using HMMER v3.1b1. Columns in the MSA with >50% gaps or with a single amino acid spanning
<25% or >95% of taxa are removed. In order to reduce computational requirements, 42 amino
acids per marker were randomly selected from the remaining columns to produce MSAs of
~5,000 columns. The final masks applied to the concatenated MSAs are available for bacteria
and archaea, and the identical filtering approach is implemented in GTDB-Tk.
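The column filter described above can be sketched roughly as follows. This is an illustration only, not the GTDB-Tk implementation: the function names are hypothetical, gaps are assumed to be encoded as `-`, and the frequency of the most common amino acid is assumed to be measured over all taxa (including gapped ones).

```python
import random
from collections import Counter


def keep_column(column, max_gaps=0.50, min_consensus=0.25, max_consensus=0.95):
    """Return True if an alignment column passes the gap and consensus filters."""
    n_taxa = len(column)
    if column.count("-") / n_taxa > max_gaps:
        return False  # more than 50% gaps
    residues = Counter(c for c in column if c != "-")
    if not residues:
        return False
    # Assumption: frequency of the most common residue is taken over all taxa.
    top_fraction = residues.most_common(1)[0][1] / n_taxa
    return min_consensus <= top_fraction <= max_consensus


def trim_marker(seqs, cols_per_marker=42, seed=0):
    """Filter the columns of one marker alignment, then subsample the survivors."""
    kept = [i for i in range(len(seqs[0]))
            if keep_column([s[i] for s in seqs])]
    if len(kept) > cols_per_marker:
        # Random selection of columns; a fixed seed keeps the mask reproducible.
        kept = sorted(random.Random(seed).sample(kept, cols_per_marker))
    return ["".join(s[i] for i in kept) for s in seqs]


toy_marker = ["AAKL", "AAKM", "A-RM", "A--V"]
print(trim_marker(toy_marker))
```

With 42 columns retained per marker, 120 bacterial markers give 42 × 120 = 5,040 columns and 122 archaeal markers give 42 × 122 = 5,124 columns, consistent with the stated ~5,000 columns.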
Bacterial and archaeal reference trees are inferred from the filtered bac120 and ar122
multiple sequence alignments, respectively. Reference trees contain 1 genome per GTDB
species cluster. The bacterial reference tree is inferred with FastTree v2.1.10 under the
WAG model. The archaeal reference tree is inferred with IQ-Tree v1.6.9 under the PMSF
model, a rapid approximation of the C10 mixture model (LG+C10+F+G), using FastTree v2.1.10
to infer an initial guide tree. Both trees contain non-parametric bootstrap support values.
The full methodology used to establish species clusters is described in:
Parks, D.H., et al. (2020). "A complete domain-to-species taxonomy for Bacteria and Archaea." Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
An internal node representing a genus without any descendant genomes with validly or effectively published genus names is assigned a placeholder name. This placeholder genus name is generally derived from the oldest representative genome within the lineage and formed, in priority order, from the:
Many of these placeholder names have been automatically generated, with manual inspection
used to change them to more suitable, human-readable names when appropriate.
GTDB species clusters without any validly or effectively published
specific name are assigned a placeholder name
which is formed from the NCBI accession number of the GTDB representative genome of the
species. For example,
if GCF_000192635.1 is the representative genome of a species cluster within the genus
Agrobacterium, the cluster will be named Agrobacterium sp000192635. Representative
genomes of a GTDB species cluster are
updated between releases when genomes of sufficiently higher quality become available,
but placeholder names are not updated as
preference is given to the stability of names. As a consequence, the placeholder name of
GTDB species clusters may not reflect the
current representative genome.
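The naming pattern in the example above can be sketched as below. The function name is hypothetical, and the sketch assumes accessions follow the usual NCBI assembly shape of `GCA_` or `GCF_`, nine digits, and a version suffix. Because placeholder names are kept stable, such a function would only apply when a name is first created, not on every release.

```python
import re

# Assumed accession shape: GCA_/GCF_ prefix, nine digits, then ".<version>".
_ACCESSION_PATTERN = re.compile(r"^GC[AF]_(\d{9})\.\d+$")


def placeholder_species_name(genus: str, accession: str) -> str:
    """Build a placeholder species name from a representative genome accession."""
    match = _ACCESSION_PATTERN.match(accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    return f"{genus} sp{match.group(1)}"


print(placeholder_species_name("Agrobacterium", "GCF_000192635.1"))
```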
Each taxon at the rank of species or genus is counted separately, including
those with an alphabetic suffix.
For ranks higher than genus, suffixed names are collapsed and counted once
(e.g. Firmicutes, Firmicutes_A, Firmicutes_B, ... are counted as a single phylum).
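A minimal sketch of this counting rule follows. The function name is hypothetical, and it assumes taxon names carry rank prefixes such as `p__` or `g__` and that suffixes are an underscore followed by uppercase letters.

```python
import re

# Assumption: an alphabetic suffix is "_" followed by one or more capital letters.
_SUFFIX_PATTERN = re.compile(r"_[A-Z]+$")


def count_taxa(names, rank_prefix):
    """Count distinct taxa at one rank, collapsing suffixes above genus."""
    at_rank = [n for n in names if n.startswith(rank_prefix)]
    if rank_prefix in ("g__", "s__"):
        # Genus and species: suffixed names are counted as separate taxa.
        return len(set(at_rank))
    # Higher ranks: Firmicutes, Firmicutes_A, ... count as one taxon.
    return len({_SUFFIX_PATTERN.sub("", n) for n in at_rank})


phyla = ["p__Firmicutes", "p__Firmicutes_A", "p__Firmicutes_B", "p__Proteobacteria"]
genera = ["g__Prevotella", "g__Prevotella_A"]
print(count_taxa(phyla, "p__"), count_taxa(genera, "g__"))
```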
Each GTDB species is defined by a single representative genome, and species assignments are established by considering the ANI and AF to these representative genomes (Parks et al., Nature Biotechnology, 2020). Species representatives are re-evaluated each GTDB release with an emphasis placed on retaining representatives so they can serve as effective nomenclatural type material. However, the goal of stable representatives must be balanced with the desire to use high-quality genomes as representatives, the incorporation of changing taxonomic opinion, and identified errors in genome classification or assembly.
GTDB representatives are updated according to two primary principles: i) representatives should be assembled from the type strain of a species whenever possible, and ii) representatives should only be replaced by assemblies of suitably higher overall quality. These two principles are quantitatively defined by the balanced ANI score (BAS), which is 0.5 * (ANI score) + 0.5 * (quality score), where the ANI score is 100 - 20 * (100 - ANI to current representative) and the quality score is defined by the criteria given in Table 1. An existing representative is only replaced if the candidate representative has a BAS at least 10 above the BAS of the current representative. Intuitively, the BAS achieves the goal of stable representatives by requiring a new representative to be of increasingly higher quality (as defined by the quality score) the more dissimilar it is from the current representative (as defined by the ANI score).
Representatives are also updated when a genome assembly is removed from NCBI or when the underlying assembly is updated at NCBI.

Table 1. Criteria used to establish quality score of an assembly
|Criterion|Score|
|---|---|
|Type strain of species|100,000|
|Effective type strain of species according to NCBI|10,000|
|NCBI representative of species|1,000|
|CheckM quality estimate|completeness - 5*contamination|
|MAG or SAG|-100|
|Contig count|-5 * (no. contigs/100)|
|Undetermined bases|-5 * (no. undetermined bases/10,000)|
|Full length 16S rRNA gene|10|
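The BAS calculation can be illustrated with a short sketch. It rests on assumptions not stated in the text: the Table 1 criteria are summed to give the quality score, the current representative has an ANI of 100 to itself, and all class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    ani_to_current_rep: float  # percent; assumed 100 for the current representative
    completeness: float        # CheckM completeness, percent
    contamination: float       # CheckM contamination, percent
    contig_count: int = 1
    undetermined_bases: int = 0
    type_strain: bool = False
    ncbi_effective_type_strain: bool = False
    ncbi_representative: bool = False
    mag_or_sag: bool = False
    full_length_16s: bool = False


def quality_score(a: Assembly) -> float:
    """Sum the Table 1 criteria (summation is an assumption of this sketch)."""
    score = a.completeness - 5 * a.contamination
    score += 100_000 if a.type_strain else 0
    score += 10_000 if a.ncbi_effective_type_strain else 0
    score += 1_000 if a.ncbi_representative else 0
    score -= 100 if a.mag_or_sag else 0
    score -= 5 * (a.contig_count / 100)
    score -= 5 * (a.undetermined_bases / 10_000)
    score += 10 if a.full_length_16s else 0
    return score


def ani_score(ani: float) -> float:
    return 100 - 20 * (100 - ani)


def bas(a: Assembly) -> float:
    return 0.5 * ani_score(a.ani_to_current_rep) + 0.5 * quality_score(a)


def should_replace(current: Assembly, candidate: Assembly, margin: float = 10) -> bool:
    """Replace only if the candidate's BAS is at least `margin` above the current BAS."""
    return bas(candidate) >= bas(current) + margin


current = Assembly(100.0, completeness=95, contamination=2, contig_count=200)
better = Assembly(99.0, completeness=99, contamination=0.5, contig_count=10,
                  full_length_16s=True)
print(bas(current), bas(better), should_replace(current, better))
```

In this made-up example the candidate is a cleaner assembly, but its 1% ANI distance costs 20 ANI-score points, so its BAS advantage stays below the threshold of 10 and the current representative is retained. A type-strain assembly, by contrast, would dominate through the 100,000-point criterion.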
The names assigned to GTDB species clusters are re-evaluated each GTDB release with an emphasis placed on nomenclature stability. However, names are changed in some cases to reflect changes in taxonomic opinions and/or to correct identified errors in GTDB or NCBI assignments. Species clusters containing one or more genomes assembled from the type strain of a species are named after the species with nomenclatural priority (Parker et al., 2019), with the generic and specific names changed as necessary to reflect any genus level reclassifications in the GTDB. Species names identified as synonyms are provided as separate files in the GTDB repository and updated each release.
Species clusters without a type strain genome are assigned via a majority voting approach based on NCBI species assignments regarded as correct under the GTDB framework. A genome is considered to have an erroneous NCBI species assignment if a genome assembled from the type strain of this species exists and resides in a different GTDB species cluster. A cluster is assigned a name by majority voting if >50% of genomes in the cluster with a GTDB-validated NCBI name are from a single species and >50% of all genomes with this species classification are in the cluster. Otherwise, the species cluster is assigned an alphanumeric or Latin suffixed placeholder name. In order to maximize the stability of GTDB names, placeholder names are not updated to new placeholder names (e.g., Bacillus sp002153395 to B. subtilis_A or vice versa) even if an updated placeholder name might better reflect the current classification of genomes within a cluster.
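The two majority voting conditions can be sketched as follows. The function and parameter names are hypothetical, the inputs are assumed to already be restricted to GTDB-validated NCBI species names, and the counts in the usage example are invented for illustration.

```python
from collections import Counter


def majority_vote_name(cluster_species, total_by_species):
    """Return the species name a cluster earns by majority vote, or None.

    cluster_species: validated NCBI species name of each named genome in the cluster.
    total_by_species: number of genomes with each validated species name overall.
    """
    if not cluster_species:
        return None
    species, in_cluster = Counter(cluster_species).most_common(1)[0]
    majority_of_cluster = in_cluster > 0.5 * len(cluster_species)
    majority_of_species = in_cluster > 0.5 * total_by_species[species]
    if majority_of_cluster and majority_of_species:
        return species
    return None  # the cluster would need a placeholder name instead


cluster = ["Bacillus subtilis"] * 3 + ["Bacillus velezensis"]
print(majority_vote_name(cluster, {"Bacillus subtilis": 4, "Bacillus velezensis": 10}))
print(majority_vote_name(cluster, {"Bacillus subtilis": 10, "Bacillus velezensis": 10}))
```

In the second call the cluster is still dominated by one species name, but it holds only 3 of the 10 genomes carrying that name, so the second condition fails and no name is assigned.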
Species clusters containing an assembly from the type strain of a subspecies or a subspecies satisfying the majority voting criteria will have the subspecies name promoted to the specific name of the cluster in cases where a placeholder name would otherwise be required.
In GTDB, we violate the rule of priority in cases where correct names may lead to
confusion (see Rule 38, ICNP).
Such situations may happen when a taxon, whose name is considered as a later heterotypic
synonym in GTDB,
serves as a nomenclature type of its parent taxon or when the name of the earlier
synonym is typified by a
type genus different from that of a parent taxon. Implementation of the correct name in
such cases can
result in classifications that are likely to cause confusion.
For instance, after merging the families Burkholderiaceae, Alcaligenaceae, Comamonadaceae and Sutterellaceae in GTDB, we chose to name the merged taxon Burkholderiaceae. This results in the classification o__Burkholderiales; f__Burkholderiaceae; g__Burkholderia. Application of the correct name with priority, Alcaligenaceae, would result in the ‘virtual’ absence of the family name f__Burkholderiaceae, as the classification would be o__Burkholderiales; f__Alcaligenaceae; g__Burkholderia. We believe that it is more logical and practical to preserve order and family names based on the same type genus, so that it is clear which type genus is included in the taxon. This aids in reclassification and typification of this and other taxa. Another example of a name with lower priority applied to a GTDB taxon (and elsewhere) is the order name Rhizobiales, which is regarded as illegitimate since the order contains the type genus of another order, Hyphomicrobiales, that has priority (see https://pubmed.ncbi.nlm.nih.gov/32373076/).
The methodology used by the GTDB changes over time in order to reflect best practices in the field and updates to reference databases. As such, a separate METHODS file is provided with each GTDB release. You can find the methods used in the latest GTDB release at: https://data.gtdb.ecogenomic.org/releases/latest/METHODS
Oren A, et al. (2015). Proposal to include the rank of phylum in the international code of nomenclature of prokaryotes. Int J Syst Evol Microbiol 65, 4284-4287.
Whitman WB, et al. (2018). Proposal of the suffix -ota to denote phyla. Addendum to 'Proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes'. Int J Syst Evol Microbiol 68, 967-969.