Frequently asked questions

How can I classify my own genomes with the GTDB?
We have developed a stand-alone application called GTDB-Tk. A specific FAQ for GTDB-Tk can be found here.

Why has the suffix of phyla names been changed to -ota?
We have adopted the proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes, originally made by Oren et al. (2015) and updated by Whitman et al. (2018), under which the suffix -ota is used to denote phyla.

Why are some genus names formed from a strain identifier?
A strain identifier is used as a placeholder for the genus name when there is no existing genus name and no binomially named representative genome. For example, the genome GCF_000318095.2 has the NCBI organism name Prevotella sp. oral taxon 473 str. F0040. However, this genome is more closely related to Prevotellamassilia and Alloprevotella. Consequently, we assign it to the placeholder genus g__F0040. If the organism had been assigned a binomial species name such as Prevotella oralitaxus str. F0040 we would assign it to the placeholder genus g__Prevotella_A to indicate it is not a true Prevotella species, but that there are representative genomes that have been assigned to a species.

Why do some genera and species names end with an alphabetic suffix?
Genus and species names with an alphabetic suffix indicate genera and species that are polyphyletic or that needed to be subdivided based on taxonomic rank normalisation according to the current GTDB reference tree. The lineage containing the type strain retains the unsuffixed (valid) name and all other lineages are given alphabetic suffixes, indicating that they are placeholder names that need to be replaced in due course. A best effort is made to retain the same alphabetic suffix for a taxon between GTDB releases, but this is not guaranteed.

Why do some family and higher rank names end with an alphabetic suffix?
Taxon names above the rank of genus appended with an alphabetic suffix indicate groups that are not monophyletic in the GTDB reference tree, but for which there exists alternative evidence that they are monophyletic groups. A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.

What criteria are used to select genomes for inclusion in the GTDB?
Genomes are obtained from NCBI and must meet the following criteria to be included in the GTDB reference trees and database:

  1. CheckM completeness estimate >50%
  2. CheckM contamination estimate <10%
  3. quality score, defined as completeness - 5*contamination, >50
  4. contain >40% of the bac120 or ar122 marker genes
  5. contain <1000 contigs
  6. have an N50 >5kb
  7. contain <100,000 ambiguous bases
Filtered genomes are manually inspected and exceptions are made for genomes of high nomenclatural or taxonomic significance, e.g. the isolate genome Ktedonobacter racemifer representing the class Ktedonobacteria in the phylum Chloroflexota has a contamination estimate of 11%.
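
The criteria above can be expressed as a simple filter. The following is a minimal sketch rather than the code used to build the GTDB: the GenomeStats class and its field names are hypothetical, and 5kb is assumed to mean 5,000 bp. It reports which criteria a genome fails, so that genomes such as the Ktedonobacter racemifer isolate can be flagged for manual inspection instead of being silently dropped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenomeStats:
    """Per-genome statistics. Field names are hypothetical, not a GTDB schema."""
    completeness: float    # CheckM completeness estimate, in percent
    contamination: float   # CheckM contamination estimate, in percent
    marker_percent: float  # percent of bac120 or ar122 marker genes present
    contig_count: int
    n50: int               # in base pairs
    ambiguous_bases: int


def quality_score(g: GenomeStats) -> float:
    """Quality score as defined in the list: completeness - 5*contamination."""
    return g.completeness - 5 * g.contamination


def failed_criteria(g: GenomeStats) -> List[str]:
    """Return the criteria a genome fails; an empty list means it passes."""
    checks = [
        ("completeness >50%", g.completeness > 50),
        ("contamination <10%", g.contamination < 10),
        ("quality score >50", quality_score(g) > 50),
        ("markers >40%", g.marker_percent > 40),
        ("contigs <1000", g.contig_count < 1000),
        ("N50 >5kb", g.n50 > 5000),  # assumes 5kb = 5,000 bp
        ("ambiguous bases <100,000", g.ambiguous_bases < 100000),
    ]
    return [name for name, ok in checks if not ok]
```

A genome with 95% completeness and 11% contamination fails both the contamination criterion and the quality score criterion (95 - 5*11 = 40), which is why exceptions of this kind require manual review.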

How are the bacterial and archaeal multiple sequence alignments constructed?
Bacterial and archaeal multiple sequence alignments (MSAs) are formed from the concatenation of 120 or 122 phylogenetically informative markers, respectively. These marker sets are referred to as bac120 and ar122, and comprise proteins or protein domains specified in the Pfam v27 or TIGRFAMs v15.0 databases. Details on these markers are available for download (here). Gene calling is performed with Prodigal v2.6.3, and markers are identified and aligned using HMMER v3.1b1. Columns in the MSA with >50% gaps, or in which a single amino acid spans <25% or >95% of taxa, are removed. To reduce computational requirements, 42 amino acids per marker are then randomly selected from the remaining columns to produce MSAs of ~5,000 columns. The final masks applied to the concatenated MSAs are available for download (here) and the identical filtering approach is implemented in GTDB-Tk.
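
The column filtering and subsampling described above can be sketched as follows. This is an illustration, not the GTDB-Tk implementation: the function names are hypothetical, the "single amino acid" rule is assumed to refer to the most common amino acid in a column, and frequencies are assumed to be computed over all taxa, including those with a gap.

```python
import random
from collections import Counter
from typing import List


def keep_column(column: str, max_gap: float = 0.5,
                min_freq: float = 0.25, max_freq: float = 0.95) -> bool:
    """Decide whether one alignment column passes the filters."""
    n = len(column)
    if column.count("-") / n > max_gap:
        return False  # more than 50% gaps
    residues = Counter(c for c in column if c != "-")
    # Frequency of the most common amino acid across all taxa (assumption).
    top = residues.most_common(1)[0][1] / n
    return min_freq <= top <= max_freq


def mask_marker(alignment: List[str], cols_per_marker: int = 42,
                seed: int = 0) -> List[int]:
    """Return sorted indices of the columns retained for one marker."""
    length = len(alignment[0])
    kept = [i for i in range(length)
            if keep_column("".join(row[i] for row in alignment))]
    if len(kept) > cols_per_marker:
        # Randomly subsample the columns that survived filtering.
        kept = sorted(random.Random(seed).sample(kept, cols_per_marker))
    return kept
```

With 42 columns retained per marker, the concatenated alignments contain at most 120 * 42 = 5,040 (bac120) or 122 * 42 = 5,124 (ar122) columns, consistent with the ~5,000 columns stated above.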

How are the bacterial and archaeal reference trees inferred?
Bacterial and archaeal reference trees are inferred from the filtered bac120 and ar122 multiple sequence alignments, respectively. Reference trees contain one genome per GTDB species cluster. The bacterial reference tree is inferred with FastTree v2.1.10 under the WAG model. The archaeal reference tree is inferred with IQ-Tree v1.6.9 under the PMSF model, a rapid approximation of the C10 mixture model (LG+C10+F+G), using FastTree v2.1.10 to infer an initial guide tree. Both trees contain non-parametric bootstrap support values.

How are GTDB species clusters formed?
Species clusters are formed as follows:
  1. Identify a GTDB representative genome for each validly or effectively published species with one or more genomes passing quality control. In most cases this will be a genome sequenced from the type strain of the species. When this is not possible, the representative genome is selected based on its quality and with consideration of additional metadata (e.g., NCBI reference or representative genome, genome assembled from type strain of subspecies).
  2. Assign genomes to the selected GTDB representative genomes using average nucleotide identity (ANI) and alignment fraction (AF) criteria. GTDB uses an ANI circumscription radius of 95%, but permits this to be as high as 97% in order to retain a larger number of existing species names. Species with an ANI >97% are treated as synonyms within the GTDB. Species assignments use an AF of 65%. ANI and AF values are calculated with FastANI v1.1.
  3. Form the remaining genomes into de novo species clusters using a greedy clustering approach that emphasizes selecting representative genomes of high quality. This clustering consists of 3 steps: i) sort the remaining genomes by their estimated genome quality, ii) select the highest-quality genome to form a new species cluster, and iii) assign genomes to this species cluster using the ANI and AF criteria. These steps are repeated until all genomes have been assigned to a species.
A manuscript fully describing and evaluating the GTDB species clusters is being prepared.
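
The greedy clustering in the third step can be sketched as follows. This is a minimal illustration under stated assumptions, not the GTDB implementation: the function name and the ani_af callback are hypothetical, the callback stands in for values that would be computed with FastANI, both ANI and AF are expressed in percent, and the thresholds are assumed to be inclusive.

```python
from typing import Callable, Dict, List, Tuple


def greedy_species_clusters(
    quality: Dict[str, float],
    ani_af: Callable[[str, str], Tuple[float, float]],
    ani_radius: float = 95.0,
    min_af: float = 65.0,
) -> Dict[str, List[str]]:
    """Cluster genomes de novo; keys of the result are representatives.

    quality maps genome IDs to an estimated quality (higher is better).
    ani_af(rep, genome) returns (ANI, AF) in percent; it is a placeholder
    for precomputed values and is not part of any real FastANI API.
    """
    # i) sort remaining genomes by quality (ties broken by ID for determinism)
    remaining = sorted(quality, key=lambda g: (-quality[g], g))
    clusters: Dict[str, List[str]] = {}
    while remaining:
        # ii) the highest-quality genome seeds a new species cluster
        rep = remaining.pop(0)
        members, unassigned = [rep], []
        # iii) assign genomes meeting the ANI and AF criteria to this cluster
        for genome in remaining:
            ani, af = ani_af(rep, genome)
            if ani >= ani_radius and af >= min_af:
                members.append(genome)
            else:
                unassigned.append(genome)
        clusters[rep] = members
        remaining = unassigned
    return clusters
```

Because representatives are chosen in order of quality, each cluster is seeded by the best remaining genome, which matches the stated emphasis on high-quality representatives.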

How are placeholder genus names formed?
An internal node representing a genus without any descendant genomes with validly or effectively published genus names is assigned a placeholder name. This placeholder genus name is generally derived from the oldest representative genome within the lineage and formed, in priority order, from the:

  1. NCBI organism name,
  2. NCBI infraspecific/strain ID,
  3. NCBI WGS identifier, or
  4. NCBI genome assembly ID

Many of these placeholder names have been automatically generated with manual inspection used to modify names to more suitable, human-readable names when appropriate.

How is the specific name of novel GTDB species clusters formed?
GTDB species clusters without any validly or effectively published specific name are assigned a placeholder name which is formed from the NCBI accession number of the GTDB representative genome of the species. For example, GCF_000192635.1 is the representative genome of a species cluster within the genus Agrobacterium resulting in the species name Agrobacterium sp000192635.
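
As an illustration of this naming rule, the sketch below derives the placeholder name from an accession. The function name is hypothetical, and the accession is assumed to follow the pattern of a GCF_ or GCA_ prefix, nine digits and a version number.

```python
import re


def placeholder_species_name(genus: str, accession: str) -> str:
    """Build a placeholder species name from a representative's accession."""
    match = re.fullmatch(r"GC[AF]_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    # Keep the numeric part only; the prefix and version are dropped.
    return f"{genus} sp{match.group(1)}"


print(placeholder_species_name("Agrobacterium", "GCF_000192635.1"))
# prints: Agrobacterium sp000192635
```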

How is the number of taxa at each rank counted?
Each taxon at the rank of species and genus is counted, including those with an alphabetic suffix. For ranks higher than genus, suffixed names are collapsed and counted once (e.g. Firmicutes, Firmicutes_A, Firmicutes_B, ... are counted as a single phylum).
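
The counting rule can be sketched as follows. The function names are hypothetical, and the pattern assumes that a suffix is an underscore followed by one or more capital letters at the end of the name, optionally preceded by a rank prefix such as p__.

```python
import re
from typing import Iterable

# Optional rank prefix (e.g. "p__"), the base name, optional alphabetic suffix.
_NAME = re.compile(r"(?:[a-z]__)?(.+?)(?:_[A-Z]+)?")


def base_name(taxon: str) -> str:
    """Strip any rank prefix and alphabetic suffix from a taxon name."""
    match = _NAME.fullmatch(taxon)
    return match.group(1) if match else taxon


def count_taxa(names: Iterable[str], rank: str) -> int:
    """Count distinct taxa at a rank, collapsing suffixes above genus."""
    unique = set(names)
    if rank in ("genus", "species"):
        return len(unique)  # suffixed names are counted separately
    return len({base_name(name) for name in unique})
```

For example, the names Firmicutes, Firmicutes_A, Firmicutes_B and Proteobacteria count as two phyla, whereas Prevotella and Prevotella_A count as two genera.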

References

Oren A, et al. (2015). Proposal to include the rank of phylum in the international code of nomenclature of prokaryotes. Int J Syst Evol Microbiol 65, 4284-4287.

Whitman WB, et al. (2018). Proposal of the suffix -ota to denote phyla. Addendum to 'Proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes'. Int J Syst Evol Microbiol 68, 967-969.
