GTDB - FAQ

Frequently Asked Questions

How can I classify my own genomes with the GTDB?

We have developed a stand-alone application called GTDB-Tk. A specific FAQ for GTDB-Tk can be found here.

What is the GTDB versioning scheme?

The GTDB version indicates both the GTDB and RefSeq release numbers. For example, R05-RS95 designates the fifth release of the GTDB and indicates reference genomes were obtained from RefSeq release 95.
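
As a small illustration, a version label of this form can be split into its two release numbers. This is only a sketch: the function name is hypothetical and the pattern assumes labels always look like the R05-RS95 example above.

```python
import re


def parse_gtdb_version(label: str) -> tuple:
    """Split a label such as 'R05-RS95' into (GTDB release, RefSeq release)."""
    # Assumed label shape: 'R' + digits, a hyphen, then 'RS' + digits.
    match = re.fullmatch(r"R(\d+)-RS(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognised GTDB version label: {label!r}")
    return int(match.group(1)), int(match.group(2))


gtdb_release, refseq_release = parse_gtdb_version("R05-RS95")
print(gtdb_release, refseq_release)  # prints: 5 95
```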

Why has the suffix of phyla names been changed to -ota?

This is based on a Whitman et al. (2018) proposal to normalise the suffix of the rank of phylum as is done with other ranks. See the Microbiology Society website.

Why are some genus names formed from a strain identifier?

A strain identifier is used as a placeholder for the genus name when there is no existing genus name and no binomially named representative genome. For example, the genome GCF_000318095.2 has the NCBI organism name Prevotella sp. oral taxon 473 str. F0040. However, this genome is more closely related to Prevotellamassilia and Alloprevotella. Consequently, we assign it to the placeholder genus g__F0040. If the organism had been assigned a binomial species name such as Prevotella oralitaxus str. F0040 we would assign it to the placeholder genus g__Prevotella_A to indicate it is not a true Prevotella species, but that there are representative genomes that have been assigned to a species.

Why are some higher taxon names formed from a culture collection strain identifier?

A culture collection strain identifier is used as a placeholder for taxon names above the rank of genus when:

i) there is no existing taxon name (e.g., family or order) for a parent taxon that contains children with validly or effectively published names based on isolates, or

ii) the name is based on a genus name formed from the strain identifier.

In the first case, some of the children may have a parent with an existing name, but their classification (parent taxon) is different in GTDB. For example, the order o__DSM-8532 contains the family f__DSM-8532, which has seven children, including the genus Thermoclostridium. This genus was originally assigned to the family Hungateiclostridiaceae (currently illegitimate) and later to the family Oscillospiraceae. However, it is classified as a separate family in GTDB, necessitating the creation of a new parent name. The latter has been formed based on the type strain identifier of the type species of the genus Thermoclostridium to indicate that this family includes representatives with cultured strains.

Why do some genus and species names end with an alphabetic suffix?

Genus names ending with an alphabetic suffix indicate genera that are i) polyphyletic according to the current GTDB reference tree, or ii) subdivided based on taxonomic rank normalisation according to the current GTDB reference tree.

Species names end with an alphabetic suffix if the GTDB species cluster is (or was previously) associated with a species name, but the correct application of this name is ambiguous or the name is assigned to a different GTDB species cluster based on the presence of type material or via majority voting.

The lineage or species cluster containing the nomenclature type or, in the case of species, satisfying the majority vote criteria retains the unsuffixed name, and all other lineages/clusters are given alphabetic suffixes, indicating that they are placeholder names that need to be replaced in due course. A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.

Why do some family and other higher taxon names end with an alphabetic suffix?

Taxon names above the rank of genus appended with an alphabetic suffix indicate groups that fall into one of the following categories: i) groups that are not monophyletic in the GTDB reference tree, but for which there exists alternative evidence that they are monophyletic groups; ii) groups whose placement is unstable between releases.

A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.

What criteria are used to select genomes for inclusion in the GTDB?

Genomes are obtained from NCBI and must meet the following criteria to be included in the GTDB reference trees and database:

i) CheckM completeness estimate >50%
ii) CheckM contamination estimate <10%
iii) quality score, defined as completeness - 5*contamination, >50
iv) contain >40% of the bac120 or ar53 marker genes
v) contain <2,000 contigs (raised from 1,000 in R08-RS214 to match RefSeq filtering criteria)
vi) have an N50 >5 kb
vii) contain <100,000 ambiguous bases
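
The seven criteria above can be sketched as a single predicate. This is a minimal illustration only: the class and field names are hypothetical (they are not GTDB metadata column names), and the sketch covers one set of CheckM estimates without modelling the CheckM v2 rule or the manually curated exceptions.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    # Hypothetical field names chosen for this sketch.
    completeness: float     # CheckM completeness estimate, in percent
    contamination: float    # CheckM contamination estimate, in percent
    marker_percent: float   # percent of the marker gene set that was found
    contig_count: int
    n50: int                # in base pairs
    ambiguous_bases: int


def passes_qc(g: GenomeStats) -> bool:
    """Return True if the genome satisfies all seven listed thresholds."""
    quality = g.completeness - 5 * g.contamination
    return (
        g.completeness > 50
        and g.contamination < 10
        and quality > 50
        and g.marker_percent > 40
        and g.contig_count < 2000
        and g.n50 > 5000
        and g.ambiguous_bases < 100_000
    )


# 90% complete with 9% contamination clears the first two thresholds,
# but its quality score is 90 - 5*9 = 45, so it is filtered.
print(passes_qc(GenomeStats(90.0, 9.0, 95.0, 150, 80_000, 0)))  # prints: False
```

The example shows why the quality score is listed separately: it penalises contamination five times more heavily than it rewards completeness.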

Starting with GTDB R10-RS226, CheckM v2 genome quality estimates have been incorporated into the QC process. Genomes must now satisfy all three quality metrics (completeness, contamination, and quality score) for both CheckM v1 and v2. The exception is genomes with <10 contigs, which are retained if they pass QC according to either CheckM v1 or v2.

Filtered genomes are manually inspected and exceptions are made for genomes of high nomenclatural or taxonomic significance, e.g. the isolate genome Ktedonobacter racemifer representing the class Ktedonobacteria in the phylum Chloroflexota has a contamination estimate of 11%. Genomes with CheckM contamination between 10% and 20% which pass criteria i and iv to vii are also retained if >80% of all duplicate marker genes are 100% identical, as this suggests a large legitimate genome duplication event, e.g. GCF_004799645.1, a complete isolate genome from the type strain of Natronorubrum bangense.

How are the bacterial and archaeal multiple sequence alignments constructed?

Bacterial and archaeal multiple sequence alignments (MSAs) are formed from the concatenation of 120 (bac120) or 53 (ar53) phylogenetically informative markers, respectively. These markers consist of proteins or protein domains specified in the Pfam v33.1 or TIGRFAMs v15.0 databases. Details on these markers are available for download (here). Gene calling is performed with Prodigal v2.6.3, and markers are identified and aligned using HMMER v3.1b1. Columns in the MSA with >50% gaps or with a single amino acid spanning <25% or >95% of taxa are removed. To reduce the computational requirements of the bacterial reference tree, 42 amino acids per marker were randomly selected from the remaining columns to produce an MSA of ~5,000 columns. The final masks applied to the concatenated MSAs are available for download (here) and the identical filtering approach is implemented in GTDB-Tk.
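
The column filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the GTDB or GTDB-Tk implementation: the function names are hypothetical, "-" is assumed to be the gap character, and "a single amino acid spanning <25% or >95% of taxa" is read as the frequency of the most common residue measured against all taxa in the column.

```python
from collections import Counter


def keep_column(column: str, max_gap: float = 0.5,
                min_top: float = 0.25, max_top: float = 0.95) -> bool:
    """Decide whether one alignment column survives the filter."""
    n_taxa = len(column)
    if column.count("-") / n_taxa > max_gap:
        return False  # more than 50% gaps
    residues = Counter(c for c in column if c != "-")
    if not residues:
        return False
    # Assumption: the denominator is all taxa, gaps included.
    top_fraction = residues.most_common(1)[0][1] / n_taxa
    return min_top <= top_fraction <= max_top


def filter_msa(seqs: list) -> list:
    """Remove filtered columns from a list of equal-length aligned sequences."""
    columns = ["".join(col) for col in zip(*seqs)]
    kept = [i for i, col in enumerate(columns) if keep_column(col)]
    return ["".join(seq[i] for i in kept) for seq in seqs]


# Column 1 is invariant and column 3 is all gaps, so both are dropped.
print(filter_msa(["AC-A", "AC-G", "AD-T", "AE-A"]))  # prints: ['CA', 'CG', 'DT', 'EA']
```

For the bacterial tree, sampling 42 columns from each of the 120 markers gives 42 * 120 = 5,040 columns, which matches the ~5,000 columns quoted above.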

How are the bacterial and archaeal reference trees inferred?

Bacterial and archaeal reference trees are inferred from the filtered bac120 and ar53 multiple sequence alignments, respectively. Reference trees contain 1 genome per GTDB species cluster. The bacterial reference tree is inferred with FastTree v2.1.10 under the WAG model. The archaeal reference tree is inferred with IQ-Tree v1.6.9 under the PMSF model, a rapid approximation of the C10 mixture model (LG+C10+F+G), using FastTree v2.1.10 to infer an initial guide tree. Both trees contain non-parametric bootstrap support values.

How are GTDB species clusters formed?

The full methodology used to establish species clusters is described in:
Parks, D.H., et al. (2020). "A complete domain-to-species taxonomy for Bacteria and Archaea." Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.

Briefly, species clusters are formed as follows:

Identify a GTDB representative genome for each validly or effectively published species with one or more genomes passing quality control. In most cases this will be a genome sequenced from the type strain of the species. When this is not possible, the representative genome is selected based on its quality and with consideration to additional metadata (e.g., NCBI reference or representative genome, genome assembled from type strain of subspecies).
Assign genomes to selected GTDB representative genomes using average nucleotide identity (ANI) and alignment fraction (AF) criteria. GTDB uses an ANI circumscription radius of 95%, though permits this to be as high as 97% in order to retain a larger number of existing species names. Species with an ANI >97% are synonyms within the GTDB. Species assignments use an AF of 50% as of R07-RS207 and 65% prior to this release. ANI and AF values are calculated with skani v0.2.1.
Remaining genomes are formed into de novo species clusters using a greedy clustering approach that emphasizes selecting representative genomes of high quality. This clustering consists of 3 steps: i) sort remaining genomes by their estimated genome quality, ii) select the highest-quality genome to form a new species cluster, and iii) assign genomes to this species cluster using the ANI and AF criteria. These steps are repeated until all genomes have been assigned to a species.
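
The greedy de novo clustering in the final step can be sketched as below. This is a simplified illustration, not the GTDB pipeline: the function and parameter names are hypothetical, the pairwise ANI/AF values are supplied by a caller-provided function rather than computed with skani, and treating the 95% ANI and 50% AF thresholds as inclusive is an assumption.

```python
from typing import Callable


def greedy_species_clusters(
    genomes: list,
    quality: dict,
    ani_af: Callable,          # hypothetical: (rep, genome) -> (ANI %, AF %)
    ani_min: float = 95.0,
    af_min: float = 50.0,
) -> dict:
    """Map each chosen representative to the genomes assigned to it."""
    # i) sort remaining genomes by estimated quality, best first
    remaining = sorted(genomes, key=lambda g: quality[g], reverse=True)
    clusters = {}
    while remaining:
        # ii) the highest-quality genome seeds a new species cluster
        rep = remaining.pop(0)
        members, unassigned = [rep], []
        # iii) assign genomes meeting the ANI and AF criteria to this cluster
        for genome in remaining:
            ani, af = ani_af(rep, genome)
            if ani >= ani_min and af >= af_min:
                members.append(genome)
            else:
                unassigned.append(genome)
        clusters[rep] = members
        remaining = unassigned  # repeat until every genome is assigned
    return clusters


# Toy data: g2 is the best genome, g1 is within 95% ANI of it, g3/g4 pair up.
pairs = {
    frozenset(("g2", "g1")): (98.0, 80.0),
    frozenset(("g2", "g3")): (90.0, 60.0),
    frozenset(("g2", "g4")): (85.0, 40.0),
    frozenset(("g3", "g4")): (96.0, 70.0),
}
result = greedy_species_clusters(
    ["g1", "g2", "g3", "g4"],
    {"g1": 90.0, "g2": 95.0, "g3": 80.0, "g4": 70.0},
    lambda a, b: pairs[frozenset((a, b))],
)
print(result)  # prints: {'g2': ['g2', 'g1'], 'g3': ['g3', 'g4']}
```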

How are placeholder genus names formed?

An internal node representing a genus without any descendant genomes with validly or effectively published genus names is assigned a placeholder name. This placeholder genus name is generally derived from the oldest representative genome within the lineage and formed, in priority order, from the:

NCBI organism name,
NCBI infraspecific/strain ID,
NCBI WGS identifier, or
NCBI genome assembly ID

Many of these placeholder names have been automatically generated with manual inspection used to modify names to more suitable, human-readable names when appropriate.

How is the specific name of novel GTDB species clusters formed?

GTDB species clusters without any validly or effectively published specific name are assigned a placeholder name which is formed from the NCBI accession number of the GTDB representative genome of the species. For example, if GCF_000192635.1 is the representative genome of a species cluster within the genus Agrobacterium the cluster will be named Agrobacterium sp000192635. Representative genomes of a GTDB species cluster are updated between releases when genomes of sufficiently higher quality become available, but placeholder names are not updated as preference is given to the stability of names. As a consequence, the placeholder name of GTDB species clusters may not reflect the current representative genome.
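
The naming pattern in the example above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the standard NCBI assembly accession layout of GCA_ or GCF_, nine digits, and a version number.

```python
import re


def placeholder_species_name(genus: str, accession: str) -> str:
    """Build a placeholder such as 'Agrobacterium sp000192635'."""
    match = re.fullmatch(r"GC[AF]_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"unexpected assembly accession: {accession!r}")
    # Only the nine-digit core is used; the prefix and version are dropped.
    return f"{genus} sp{match.group(1)}"


print(placeholder_species_name("Agrobacterium", "GCF_000192635.1"))
# prints: Agrobacterium sp000192635
```

Because placeholder names are kept stable between releases, the accession recovered from a placeholder is not guaranteed to be the current representative genome.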

How are the number of taxa at each rank counted?

Each taxon at the rank of species or genus is counted, including those with an alphabetic suffix. For ranks higher than genus, suffixed names are collapsed and counted once (e.g. Firmicutes, Firmicutes_A, Firmicutes_B, ... are counted as a single phylum).
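
This counting rule can be sketched as follows. The function name is hypothetical, and the sketch assumes an alphabetic suffix is always an underscore followed by one or more capital letters at the end of the name.

```python
import re

# Assumed suffix shape: '_A', '_B', ... at the end of the taxon name.
ALPHA_SUFFIX = re.compile(r"_[A-Z]+$")


def count_taxa(names: list, above_genus: bool) -> int:
    """Count distinct taxa, collapsing suffixed names only above genus rank."""
    if above_genus:
        names = [ALPHA_SUFFIX.sub("", name) for name in names]
    return len(set(names))


phyla = ["Firmicutes", "Firmicutes_A", "Firmicutes_B", "Proteobacteria"]
genera = ["Prevotella", "Prevotella_A"]
print(count_taxa(phyla, above_genus=True))    # prints: 2
print(count_taxa(genera, above_genus=False))  # prints: 2
```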

How are GTDB species representatives updated with each release?

Each GTDB species is defined by a single representative genome, and species assignments are established by considering the ANI and AF to these representative genomes (Parks et al., Nature Biotechnology, 2020). Species representatives are re-evaluated each GTDB release with an emphasis placed on retaining representatives so they can serve as effective nomenclatural type material. However, the goal of stable representatives must be balanced with the desire to use high-quality genomes as representatives, the incorporation of changing taxonomic opinion, and identified errors in genome classification or assembly.

GTDB representatives are updated according to two primary principles: i) representatives should be assembled from the type strain of a species whenever possible, and ii) representatives should only be replaced by assemblies of suitably higher overall quality. These two principles are quantitatively defined by the balanced ANI score (BAS), which is 0.5 * (ANI score) + 0.5 * (quality score), where the ANI score is 100 - 20 * (100 - ANI to current representative) and the quality score is defined by the criteria given in Table 1. An existing representative is only replaced if the candidate representative has a BAS at least 10 above the BAS of the current representative. Intuitively, the BAS achieves the goal of stable representatives by requiring a new representative to be of increasingly higher quality (as defined by the quality score) the more dissimilar it is from the current representative (as defined by the ANI score).

Representatives are also updated when genome assemblies are removed from NCBI or when the underlying assembly is updated at NCBI.

Table 1. Criteria used to establish quality score of an assembly

CRITERIA	SCORE
Type strain of species	100,000
Effective type strain of species according to NCBI	10,000
NCBI representative of species	1,000
Complete genome	100
CheckM quality estimate	completeness - 5*contamination
MAG or SAG	-100
Contig count	-5 * (no. contigs/100)
Undetermined bases	-5 * (no. undetermined bases/10,000)
Full length 16S rRNA gene	10
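
The balanced ANI score and the replacement rule can be sketched as below. The function names are hypothetical, the quality score is taken as an input already computed from Table 1, and the sketch assumes the current representative has an ANI of 100 to itself.

```python
def ani_score(ani_to_current_rep: float) -> float:
    """ANI score: 100 - 20 * (100 - ANI to current representative)."""
    return 100 - 20 * (100 - ani_to_current_rep)


def balanced_ani_score(ani_to_current_rep: float, quality_score: float) -> float:
    """BAS: 0.5 * (ANI score) + 0.5 * (quality score)."""
    return 0.5 * ani_score(ani_to_current_rep) + 0.5 * quality_score


def should_replace(candidate_ani: float, candidate_quality: float,
                   current_quality: float, margin: float = 10.0) -> bool:
    """True if the candidate's BAS is at least `margin` above the current one."""
    # Assumption: the current representative scores ANI = 100 against itself.
    current_bas = balanced_ani_score(100.0, current_quality)
    candidate_bas = balanced_ani_score(candidate_ani, candidate_quality)
    return candidate_bas - current_bas >= margin


# Same quality gap of 40, different similarity to the current representative.
print(should_replace(99.0, 140.0, 100.0))  # prints: True
print(should_replace(97.0, 140.0, 100.0))  # prints: False
```

Under that assumption the rule reduces to requiring a quality score gain of at least 20 + 20 * (100 - ANI): a gain of 40 at 99% ANI and 80 at 97% ANI, which is why more divergent candidates must be of markedly higher quality.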

How are the names of GTDB species clusters updated with each release?

The names assigned to GTDB species clusters are re-evaluated each GTDB release with an emphasis placed on nomenclature stability. However, names are changed in some cases to reflect changes in taxonomic opinions and/or to correct identified errors in GTDB or NCBI assignments. Species clusters containing one or more genomes assembled from the type strain of a species are named after the species with nomenclatural priority (Parker et al., 2019), with the generic and specific names changed as necessary to reflect any genus level reclassifications in the GTDB. Species names identified as synonyms are provided as separate files in the GTDB repository and updated each release.

Species clusters without a type strain genome are assigned via a majority voting approach based on NCBI species assignments regarded as correct under the GTDB framework. A genome is considered to have an erroneous NCBI species assignment if a genome assembled from the type strain of this species exists and resides in a different GTDB species cluster. A cluster is assigned a name by majority voting if >50% of genomes in the cluster with a GTDB-validated NCBI name are from a single species and >50% of all genomes with this species classification are in the cluster. Otherwise, the species cluster is assigned an alphanumeric or Latin suffixed placeholder name. In order to maximize the stability of GTDB names, placeholder names are not updated to new placeholder names (e.g., Bacillus sp002153395 to B. subtilis_A or vice versa) even if an updated placeholder name might better reflect the current classification of genomes within a cluster.
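
The two-sided majority vote can be sketched as follows. The function and argument names are hypothetical, and the sketch assumes that erroneous NCBI assignments have already been removed and that "all genomes with this species classification" is counted over the same validated set.

```python
from collections import Counter
from typing import Optional


def majority_vote_name(cluster_names: list, species_totals: dict) -> Optional[str]:
    """Return the species name winning the vote, or None if no name qualifies.

    cluster_names: validated NCBI species names of the genomes in one cluster.
    species_totals: validated genome count per species name across all clusters.
    """
    if not cluster_names:
        return None
    name, in_cluster = Counter(cluster_names).most_common(1)[0]
    majority_of_cluster = in_cluster / len(cluster_names) > 0.5
    majority_of_species = in_cluster / species_totals[name] > 0.5
    if majority_of_cluster and majority_of_species:
        return name
    return None  # the cluster would receive a placeholder name instead


names = ["Bacillus subtilis"] * 3 + ["Bacillus velezensis"]
print(majority_vote_name(names, {"Bacillus subtilis": 5, "Bacillus velezensis": 10}))
# prints: Bacillus subtilis
print(majority_vote_name(names, {"Bacillus subtilis": 10, "Bacillus velezensis": 10}))
# prints: None
```

The second call fails the vote because only 3 of the 10 genomes carrying the name sit in this cluster, even though they form 75% of the cluster itself.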

Species clusters containing an assembly from the type strain of a subspecies or a subspecies satisfying the majority voting criteria will have the subspecies name promoted to the specific name of the cluster in cases where a placeholder name would otherwise be required.

Why has the priority rule been violated for a number of selected taxa that were merged in GTDB?

In GTDB, we violate the rule of priority in cases where correct names may lead to confusion (see Rule 38, ICNP). Such situations may happen when a taxon, whose name is considered as a later heterotypic synonym in GTDB, serves as a nomenclature type of its parent taxon or when the name of the earlier synonym is typified by a type genus different from that of a parent taxon. Implementation of the correct name in such cases can result in classifications that are likely to cause confusion.

For instance, after merging the families Burkholderiaceae, Alcaligenaceae, Comamonadaceae and Sutterellaceae in GTDB, we chose to name the merged taxon Burkholderiaceae. This results in the classification o__Burkholderiales; f__Burkholderiaceae; g__Burkholderia. Application of the correct name with priority, Alcaligenaceae, would result in the ‘virtual’ absence of the family name f__Burkholderiaceae, as the classification would be o__Burkholderiales; f__Alcaligenaceae; g__Burkholderia. We believe that it is more logical and practical to preserve both order and family names based on the same type genus in order to know which genus (type) is included in the taxon. This aids in reclassification and typification of this and other taxa. Another example of a name with lower priority applied to a GTDB taxon (and elsewhere) is the order name Rhizobiales, which is regarded as illegitimate since the order contains the type genus of another order, Hyphomicrobiales, that has priority (see https://pubmed.ncbi.nlm.nih.gov/32373076/).

What nomenclatural resources does the GTDB use for determining validly published names?

GTDB makes extensive use of the LPSN and makes a best effort to follow the International Code of Nomenclature of Prokaryotes.

Where can I find details on the methods used by the GTDB?

The methodology used by the GTDB changes over time in order to reflect best practices in the field and updates to reference databases. As such, a separate METHODS file is provided with each GTDB release. You can find the methods used in the latest GTDB release at: https://data.gtdb.ecogenomic.org/releases/latest/METHODS

Why doesn't the GTDB contain Shigella species?

Shigella species are considered heterotypic synonyms of E. coli (Parks et al., 2021).

Why do some published phylum (and other higher rank) names not appear in the GTDB taxonomy?

Effectively published Latin names above the rank of genus without designated type material, either a sequenced type strain or MAG, will no longer be incorporated into GTDB, and those with designated type material will only be introduced when the associated type genome is present in GTDB. This change is necessary as establishing the correct interior node in the reference tree for taxa without type material can be ambiguous, particularly when the addition of new genomes or alternative inference methods results in the named taxon becoming polyphyletic in later releases.

Why is my taxon of interest not present in the GTDB taxonomy?

The most common reason a given taxon does not appear in GTDB is rank normalization, which can result in splitting or lumping of taxa depending on their relative evolutionary divergence (RED). For example, the class Cytophagia does not exist in GTDB because it is too shallow for a class according to RED and has been united with the class Bacteroidia. Ultimately, the GTDB reflects the taxonomic opinion of the GTDB curators who adhere to the principle of taxonomic freedom enshrined in all nomenclatural codes.

Can I obtain an NCBI taxonomy string for my genomes to facilitate submitting them to NCBI?

Submission of genomes to NCBI or other INSDC repositories requires genomes to be classified according to the NCBI Taxonomy. There is no direct translation from GTDB taxa to NCBI taxa. For example, GTDB may have merged two families defined in the NCBI Taxonomy into a single family, or split a family in the NCBI Taxonomy into two families. However, we would like to facilitate the process of submitting genomes to an INSDC repository to the extent possible.

We recommend processing genomes through the GTDB-Tk “classify_wf” which will place each of your genomes in a GTDB-Tk reference tree. An NCBI classification can then be determined by considering the NCBI classification of all reference genomes descendant from the parent node of your genome. GTDB-Tk provides a script “gtdb_to_ncbi_majority_vote.py” that can be run on GTDB-Tk output to produce an NCBI classification for your genome based on a majority vote of these NCBI classifications.

How can I download data from NCBI?

Here we discuss three ways to obtain genomic data from NCBI for genomes in GTDB:

Genomic FASTA files for GTDB species representative genomes are available from the GTDB FTP sites. Specifically, the file gtdb_genomes_reps_.tar.gz in the genomic_files_reps directory.
If you only require this data and already have the GTDB-Tk reference data on your system, you can find these genomic FASTA files in the skani directory. The genome_paths.tsv file indicates the relative path to each FASTA file for each genome.
For small numbers of genomes (<1,000), the GTDB Advanced Search interface can be used. This allows searching for a specific set of genomes by taxonomy, genomic features, and/or other genome metadata. The GENOMES button can then be used to download a script that allows different genomic data files to be downloaded via curl or the NCBI Datasets tool.
For larger numbers of genomes (>1,000), we suggest using the NCBI Datasets tool and the dehydrate / rehydrate approach recommended by NCBI. This allows efficient downloading of large numbers of genomes. This can be done by creating a text file listing the NCBI accession numbers of all genomes to download, e.g.:
GCF_000290775.1
GCF_000633475.1
GCF_000783055.1
…

For example, the following commands can be used to obtain the genomic FASTA files for genomes specified in a file named genomes.lst:

> datasets download genome accession --dehydrated --include genome --inputfile genomes.lst --filename my-genomes.zip
> unzip my-genomes.zip -d my-genomes
> datasets rehydrate --directory my-genomes

In addition to the genomic FASTA file for each genome, you can download additional data files such as protein or CDS sequences. The desired file types are specified with the --include parameter (see: the CLI help).

Genomic data files for genomes that are suppressed at NCBI are no longer available for download. This can occur when a genome has been removed at the submitter's request because the corresponding paper has not yet been published (e.g. GCF_024053485.1). Genomes can also be suppressed for other reasons, including a newer version being available (e.g. GCA_026184055.1). Unfortunately, it is possible that such genomes were available when a given GTDB release was created and thus are included in the release. We are working with NCBI to see if this situation can be resolved as we appreciate it causes complications for GTDB users.

References

Oren A, et al. (2015). Proposal to include the rank of phylum in the international code of nomenclature of prokaryotes. Int J Syst Evol Microbiol 65, 4284-4287.

Whitman WB, et al. (2018). Proposal of the suffix -ota to denote phyla. Addendum to 'Proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes'. Int J Syst Evol Microbiol 68, 967-969.