Genus names ending with an alphabetic suffix indicate genera that are i) polyphyletic according to the current GTDB reference tree, or ii) subdivided based on taxonomic rank normalisation according to the current GTDB reference tree.
Species names end with an alphabetic suffix if the GTDB species cluster is (or was previously) associated with a species name, but the correct application of this name is ambiguous or the name is assigned to a different GTDB species cluster based on the presence of type material or via majority voting.
The lineage or species cluster containing the nomenclature type or, in case of species, satisfying the majority vote criteria retains the unsuffixed name and all other lineages/clusters are given alphabetic suffixes, indicating that they are placeholder names that need to be replaced in due course. A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.
Taxon names above the rank of genus appended with an alphabetic suffix indicate groups that fall into one of the following categories: i) groups that are not monophyletic in the GTDB reference tree, but for which there exists alternative evidence that they are monophyletic groups; ii) groups whose placement is unstable between releases.
A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.
Genomes are obtained from NCBI and must meet the following criteria to be included in the GTDB reference trees and database:
- CheckM completeness estimate >50%
- CheckM contamination estimate <10%
- quality score, defined as completeness - 5*contamination, >50
- contain >40% of the bac120 or arc53 marker genes
- contain <1000 contigs
- have an N50 >5kb
- contain <100,000 ambiguous bases
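The criteria above can be expressed as a single predicate. The following is a minimal sketch, not GTDB's own code: the `GenomeStats` record and its field names are hypothetical, and the statistics are assumed to have been computed beforehand (for example by CheckM).

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Hypothetical container for per-genome statistics."""
    completeness: float     # CheckM completeness estimate (%)
    contamination: float    # CheckM contamination estimate (%)
    marker_percent: float   # % of bac120 or arc53 marker genes present
    contig_count: int
    n50: int                # N50 in base pairs
    ambiguous_bases: int


def quality_score(g: GenomeStats) -> float:
    """Quality score as defined in the list: completeness - 5*contamination."""
    return g.completeness - 5 * g.contamination


def passes_qc(g: GenomeStats) -> bool:
    """Return True only if every listed inclusion criterion is met."""
    return (
        g.completeness > 50
        and g.contamination < 10
        and quality_score(g) > 50
        and g.marker_percent > 40
        and g.contig_count < 1000
        and g.n50 > 5000
        and g.ambiguous_bases < 100_000
    )


good = GenomeStats(95.0, 2.0, 90.0, 120, 50_000, 0)
poor = GenomeStats(70.0, 5.0, 80.0, 300, 20_000, 0)  # quality score is 45
print(passes_qc(good), passes_qc(poor))
```

Note that the second genome passes the completeness and contamination thresholds individually but fails on the combined quality score.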
Bacterial and archaeal multiple sequence alignments (MSAs) are formed from the concatenation of 120 (bac120) or 53 (arc53) phylogenetically informative markers, respectively. These markers comprise proteins or protein domains specified in the Pfam v33.1 or TIGRFAMs v15.0 databases. Details on these markers are available for download. Gene calling is performed with Prodigal v2.6.3, and markers are identified and aligned using HMMER v3.1b1. Columns in the MSA with >50% gaps or with a single amino acid spanning <25% or >95% of taxa are removed. To reduce the computational requirements of the bacterial reference tree, 42 amino acids per marker were randomly selected from the remaining columns to produce an MSA of ~5,000 columns. The final masks applied to the concatenated MSAs are available for download, and the identical filtering approach is implemented in GTDB-Tk.
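The column filtering and subsampling described above might look like the following sketch for a single marker. It is an illustration, not the GTDB-Tk implementation: the function names are hypothetical, and it assumes that "a single amino acid spanning <25% or >95% of taxa" refers to the most common residue in a column, with its frequency measured over all taxa (gaps included in the denominator).

```python
import random
from collections import Counter


def keep_column(column, max_gap_frac=0.5, min_cons=0.25, max_cons=0.95):
    """column: one character per taxon, '-' marks a gap."""
    n = len(column)
    if column.count("-") / n > max_gap_frac:
        return False  # more than 50% gaps
    residues = [c for c in column if c != "-"]
    if not residues:
        return False
    # Assumption: frequency of the most common residue over all taxa.
    top_count = Counter(residues).most_common(1)[0][1]
    frac = top_count / n
    return min_cons <= frac <= max_cons


def filter_and_subsample(marker_msa, cols_per_marker=42, seed=0):
    """marker_msa: equal-length aligned sequences for one marker.

    Returns the sorted indices of the columns that are retained.
    """
    length = len(marker_msa[0])
    kept = [
        i for i in range(length)
        if keep_column("".join(seq[i] for seq in marker_msa))
    ]
    if len(kept) > cols_per_marker:
        rng = random.Random(seed)  # fixed seed so the mask is reproducible
        kept = sorted(rng.sample(kept, cols_per_marker))
    return kept


# Toy alignment of four taxa: column 0 is invariant, column 3 is mostly gaps.
seqs = ["AKL-", "AKM-", "ARL-", "ARMG"]
print(filter_and_subsample(seqs))
```

With 120 bacterial markers and 42 columns each, this scheme yields at most 5,040 columns, consistent with the ~5,000 columns stated above.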
The full methodology used to establish species clusters is described in:
Parks, D.H., et al. (2020). "A complete domain-to-species taxonomy for Bacteria and Archaea." Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
- Identify a GTDB representative genome for each validly or effectively published species with one or more genomes passing quality control. In most cases this will be a genome sequenced from the type strain of the species. When this is not possible, the representative genome is selected based on its quality and with consideration to additional metadata (e.g., NCBI reference or representative genome, genome assembled from type strain of subspecies).
- Assign genomes to selected GTDB representative genomes using average nucleotide identity (ANI) and alignment fraction (AF) criteria. GTDB uses an ANI circumscription radius of 95%, though permits this to be as high as 97% in order to retain a larger number of existing species names. Species with an ANI >97% are synonyms within the GTDB. Species assignments use an AF of 50% as of R07-RS207 and 65% prior to this release. ANI and AF values are calculated with skani v0.2.1.
- Remaining genomes are formed into de novo species clusters using a greedy clustering approach that emphasizes selecting representative genomes of high quality. This clustering consists of 3 steps: i) sort remaining genomes by their estimated genome quality, ii) select the highest-quality genome to form a new species cluster, and iii) assign genomes to this species cluster using the ANI and AF criteria. These steps are repeated until all genomes have been assigned to a species.
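The greedy de novo clustering in the last step can be sketched as follows. This is a simplified illustration under stated assumptions: `ani_af` is a hypothetical callable standing in for an ANI/AF tool such as skani, the thresholds are treated as inclusive, and each genome joins the first cluster whose representative it matches rather than the closest one.

```python
def greedy_species_clusters(genomes, quality, ani_af, ani_min=95.0, af_min=50.0):
    """Form de novo species clusters.

    genomes: IDs of genomes not yet assigned to a species
    quality: dict mapping genome ID -> estimated genome quality
    ani_af:  hypothetical callable (rep, query) -> (ANI %, AF %)
    Returns a dict mapping each representative to its cluster members.
    """
    # i) sort remaining genomes by estimated quality, best first
    remaining = sorted(genomes, key=lambda g: quality[g], reverse=True)
    clusters = {}
    while remaining:
        # ii) the highest-quality genome seeds a new species cluster
        rep = remaining.pop(0)
        clusters[rep] = [rep]
        unassigned = []
        # iii) assign genomes meeting the ANI and AF criteria
        for g in remaining:
            ani, af = ani_af(rep, g)
            if ani >= ani_min and af >= af_min:
                clusters[rep].append(g)
            else:
                unassigned.append(g)
        remaining = unassigned
    return clusters


# Toy data: g1/g2 and g3/g4 are close pairs, everything else is distant.
pairs = {
    frozenset(("g1", "g2")): (98.0, 90.0),
    frozenset(("g3", "g4")): (96.0, 90.0),
}


def toy_ani_af(a, b):
    return pairs.get(frozenset((a, b)), (80.0, 90.0))


quality = {"g1": 90, "g2": 80, "g3": 95, "g4": 70}
clusters = greedy_species_clusters(quality.keys(), quality, toy_ani_af)
print(clusters)
```

Sorting by quality first is what makes the highest-quality genome of each cluster its representative.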
An internal node representing a genus without any descendant genomes with validly or effectively published genus names is assigned a placeholder name. This placeholder genus name is generally derived from the oldest representative genome within the lineage and formed, in priority order, from the:
- NCBI organism name,
- NCBI infraspecific/strain ID,
- NCBI WGS identifier, or
- NCBI genome assembly ID
Many of these placeholder names have been automatically generated, with manual inspection used to modify them to more suitable, human-readable names when appropriate.
Each GTDB species is defined by a single representative genome and species assignments established by considering the ANI and AF to these representative genomes (Parks et al., Nature Biotechnology, 2020). Species representatives are re-evaluated each GTDB release with an emphasis placed on retaining representatives so they can serve as effective nomenclatural type material. However, the goal of stable representatives must be balanced with the desire to use high-quality genomes as representatives, the incorporation of changing taxonomic opinion, and identified errors in genome classification or assembly.
GTDB representatives are updated according to two primary principles: i) representatives should be assembled from the type strain of a species whenever possible, and ii) representatives should only be replaced by assemblies of suitably higher overall quality. These two principles are quantified by the balanced ANI score (BAS), defined as 0.5 * (ANI score) + 0.5 * (quality score), where the ANI score is 100 - 20 * (100 - ANI to current representative) and the quality score is defined by the criteria given in Table 1. An existing representative is only replaced if the candidate genome has a BAS at least 10 higher than that of the current representative. Intuitively, the BAS achieves the goal of stable representatives by requiring a new representative to be of increasingly higher quality (as defined by the quality score) the more dissimilar it is from the current representative (as defined by the ANI score).
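A short worked sketch of the BAS rule follows. The function names are hypothetical, and it assumes that the current representative is scored with an ANI of 100% to itself, which the text does not state explicitly.

```python
def ani_score(ani_to_current_rep):
    """ANI score: 100 - 20 * (100 - ANI to current representative)."""
    return 100 - 20 * (100 - ani_to_current_rep)


def balanced_ani_score(ani_to_current_rep, quality_score):
    """BAS: equal-weight mix of the ANI score and the quality score."""
    return 0.5 * ani_score(ani_to_current_rep) + 0.5 * quality_score


def should_replace(current_quality, candidate_quality, candidate_ani, margin=10):
    """True if the candidate's BAS exceeds the current one by at least `margin`."""
    # Assumption: the current representative has ANI 100 to itself.
    current_bas = balanced_ani_score(100.0, current_quality)
    candidate_bas = balanced_ani_score(candidate_ani, candidate_quality)
    return candidate_bas - current_bas >= margin


# Candidate at 99% ANI to a representative with quality score 90.
print(should_replace(90, 120, 99.0))  # quality gain of 30
print(should_replace(90, 130, 99.0))  # quality gain of 40
```

Under this assumption, a candidate at 100% ANI needs a quality score at least 20 higher than the current representative, and each 1% drop in ANI raises the required gain by a further 20, so at 99% ANI the required gain is 40.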
Representatives are also updated to account for genome assemblies being removed from NCBI and representatives are updated whenever the underlying assembly is updated at NCBI.
Table 1. Criteria used to establish the quality score of an assembly

CRITERIA | SCORE |
---|---|
Type strain of species | 100,000 |
Effective type strain of species according to NCBI | 10,000 |
NCBI representative of species | 1,000 |
Complete genome | 100 |
CheckM quality estimate | completeness - 5*contamination |
MAG or SAG | -100 |
Contig count | -5 * (no. contigs/100) |
Undetermined bases | -5 * (no. undetermined bases/10,000) |
Full length 16S rRNA gene | 10 |
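The quality score in Table 1 can be sketched as a sum of the listed terms. This is an illustration under assumptions: the parameter names are hypothetical, the rows are taken to be additive, the per-contig and per-base penalties use real (not integer) division, and the first row is read as "assembled from the type strain of the species".

```python
def assembly_quality_score(
    completeness,
    contamination,
    contig_count,
    undetermined_bases,
    is_type_strain=False,            # first row of Table 1 (see assumption above)
    is_ncbi_effective_type=False,    # effective type strain according to NCBI
    is_ncbi_representative=False,
    is_complete_genome=False,
    is_mag_or_sag=False,
    has_full_length_16s=False,
):
    """Sum the Table 1 criteria into one quality score (assumed additive)."""
    score = 0.0
    if is_type_strain:
        score += 100_000
    if is_ncbi_effective_type:
        score += 10_000
    if is_ncbi_representative:
        score += 1_000
    if is_complete_genome:
        score += 100
    score += completeness - 5 * contamination    # CheckM quality estimate
    if is_mag_or_sag:
        score -= 100
    score -= 5 * (contig_count / 100)            # contig count penalty
    score -= 5 * (undetermined_bases / 10_000)   # undetermined bases penalty
    if has_full_length_16s:
        score += 10
    return score


# A fragmented MAG: 80 (CheckM) - 100 (MAG) - 10 (contigs) - 10 (N bases)
mag = assembly_quality_score(90.0, 2.0, 200, 20_000, is_mag_or_sag=True)
print(mag)
```

The order-of-magnitude gaps between the first three rows mean that type strain status dominates any difference in assembly statistics.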
The names assigned to GTDB species clusters are re-evaluated each GTDB release with an emphasis placed on nomenclature stability. However, names are changed in some cases to reflect changes in taxonomic opinions and/or to correct identified errors in GTDB or NCBI assignments. Species clusters containing one or more genomes assembled from the type strain of a species are named after the species with nomenclatural priority (Parker et al., 2019), with the generic and specific names changed as necessary to reflect any genus level reclassifications in the GTDB. Species names identified as synonyms are provided as separate files in the GTDB repository and updated each release.
Species clusters without a type strain genome are assigned names via a majority voting approach based on NCBI species assignments regarded as correct under the GTDB framework. A genome is considered to have an erroneous NCBI species assignment if a genome assembled from the type strain of this species exists and resides in a different GTDB species cluster. A cluster is assigned a name by majority voting if >50% of genomes in the cluster with a GTDB-validated NCBI name are from a single species and >50% of all genomes with this species classification are in the cluster. Otherwise, the species cluster is assigned an alphanumeric or Latin-suffixed placeholder name. To maximize the stability of GTDB names, placeholder names are not updated to new placeholder names (e.g., Bacillus sp002153395 to B. subtilis_A or vice versa) even if an updated placeholder name might better reflect the current classification of genomes within a cluster.
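The two-sided majority vote can be sketched as below. The function and argument names are hypothetical, and the inputs are assumed to already be restricted to GTDB-validated NCBI species names.

```python
from collections import Counter


def majority_vote_name(cluster_names, total_per_species):
    """Return the species name for a cluster, or None if a placeholder is needed.

    cluster_names:     validated NCBI species name of each genome in the cluster
    total_per_species: dict mapping species name -> number of genomes with that
                       validated classification across all clusters
    """
    if not cluster_names:
        return None
    species, in_cluster = Counter(cluster_names).most_common(1)[0]
    # Condition 1: >50% of validated genomes in the cluster share one species.
    majority_in_cluster = in_cluster / len(cluster_names) > 0.5
    # Condition 2: >50% of all genomes with this species name are in the cluster.
    majority_of_species = in_cluster / total_per_species[species] > 0.5
    return species if majority_in_cluster and majority_of_species else None


names = ["Bacillus subtilis"] * 3 + ["Bacillus velezensis"]
print(majority_vote_name(names, {"Bacillus subtilis": 4, "Bacillus velezensis": 10}))
print(majority_vote_name(names, {"Bacillus subtilis": 10, "Bacillus velezensis": 10}))
```

In the second call most genomes labelled with the majority species sit outside the cluster, so the vote fails and a placeholder name would be required.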
Species clusters containing an assembly from the type strain of a subspecies or a subspecies satisfying the majority voting criteria will have the subspecies name promoted to the specific name of the cluster in cases where a placeholder name would otherwise be required.
For instance, after merging the families Burkholderiaceae, Alcaligenaceae, Comamonadaceae and Sutterellaceae in GTDB, we chose to name the merged taxon Burkholderiaceae. This results in the classification o__Burkholderiales; f__Burkholderiaceae; g__Burkholderia. Applying the correct name with priority, Alcaligenaceae, would result in the ‘virtual’ absence of the family name f__Burkholderiaceae, as the classification would be o__Burkholderiales; f__Alcaligenaceae; g__Burkholderia. We believe that it is more logical and practical to preserve order and family names based on the same type genus, so that it is clear which (type) genus is included in the taxon. This aids in reclassification and typification of this and other taxa. Another example of a name with lower priority applied to a GTDB taxon (and elsewhere) is the order name Rhizobiales, which is regarded as illegitimate because the order contains the type genus of another order, Hyphomicrobiales, that has priority (see https://pubmed.ncbi.nlm.nih.gov/32373076/).
Submission of genomes to NCBI or other INSDC repositories requires genomes to be classified according to the NCBI Taxonomy. There is no direct translation from GTDB taxa to NCBI taxa. For example, GTDB may have merged two families defined in the NCBI Taxonomy into a single family, or split a family in the NCBI Taxonomy into two families. However, we would like to facilitate the process of submitting genomes to an INSDC repository to the extent possible.
We recommend processing genomes through the GTDB-Tk “classify_wf” which will place each of your genomes in a GTDB-Tk reference tree. An NCBI classification can then be determined by considering the NCBI classification of all reference genomes descendant from the parent node of your genome. GTDB-Tk provides a script “gtdb_to_ncbi_majority_vote.py” that can be run on GTDB-Tk output to produce an NCBI classification for your genome based on a majority vote of these NCBI classifications.
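The idea behind a per-rank majority vote over NCBI classifications can be illustrated with the sketch below. It is not the logic of the GTDB-Tk script itself: the function name, the >50% threshold, and the rule of leaving all ranks below the first unresolved rank empty are assumptions made for illustration.

```python
from collections import Counter

RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def ncbi_majority_vote(ncbi_lineages):
    """Derive a consensus NCBI lineage from reference genome lineages.

    ncbi_lineages: seven-rank strings such as 'd__Bacteria;p__...;s__...'
    Ranks without a >50% majority (assumed threshold) are left as bare prefixes.
    """
    consensus = []
    resolved = True
    for i, prefix in enumerate(RANK_PREFIXES):
        taxa = [lineage.split(";")[i].strip() for lineage in ncbi_lineages]
        taxon, count = Counter(taxa).most_common(1)[0]
        if resolved and taxon != prefix and count / len(taxa) > 0.5:
            consensus.append(taxon)
        else:
            resolved = False  # once unresolved, lower ranks stay empty
            consensus.append(prefix)
    return ";".join(consensus)


shared = "d__Bacteria;p__Pseudomonadota;c__Betaproteobacteria;o__Burkholderiales;f__Burkholderiaceae"
lineages = [
    shared + ";g__Burkholderia;s__Burkholderia cepacia",
    shared + ";g__Burkholderia;s__Burkholderia mallei",
    shared + ";g__Paraburkholderia;s__Paraburkholderia fungorum",
]
print(ncbi_majority_vote(lineages))
```

Here two of three reference genomes agree at genus level, so the genus is assigned, while no species reaches a majority and that rank is left empty.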