Commit 4424634

get taxonomy format documentation from master branch
1 parent 165c624 commit 4424634

2 files changed

Lines changed: 183 additions & 0 deletions

doc/interim-format.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Interim taxonomy file format

This page describes the format used to represent the taxonomies that are the inputs and outputs of the Open Tree of Life taxonomy build system.

The format derives from NCBI and is intentionally rudimentary because our needs are minimal. A better format to use in the long run might be [Darwin Core Archive](https://code.google.com/p/gbif-ecat/wiki/DwCArchive), which is what is used by GBIF, EOL, and the Global Names Architecture (GNA).

***

Each source taxonomy (NCBI, GBIF, Index Fungorum, ...) has its own script that converts its native format into this format.

A taxonomy consists of a directory of files with fixed names. Example: `mycobank/taxonomy.tsv`, `mycobank/synonyms.tsv`, `mycobank/about.md`.

## Character encoding

All files use the UTF-8 character encoding. Native taxonomy files often use some other encoding, so conversion might be necessary. Some aggregated taxonomies on the web have gotten this wrong and are a mess of mixed encodings and spurious re-encodings.

## Taxonomy

### File `taxonomy.tsv`

The file has four required columns, each followed by tab, vertical bar, tab (even the last column, which is unlike NCBI). The taxonomy build tool smasher doesn't require the vertical bars; they are optional, although they should be either all present or all absent. Some other consumers of these files may still require the vertical bars.

A header row of column names is recommended but not required (by smasher). If provided, it looks like:

    uid | parent_uid | name | rank |

Each following row describes one taxon.

**Columns:**

1. `uid` - an identifier for the taxon, unique within this file. Should be the native accession number whenever possible. Usually this is an integer, but it need not be.
2. `parent_uid` - the identifier of this taxon's parent, or the empty string if there is no parent (i.e., it's a root).
3. `name` - arbitrary text for the taxon name; not necessarily unique within the file.
4. `rank` - the taxon's rank, e.g. species, family, class. Should be all lower case. If no rank is assigned, or the rank is unknown, put "no rank".

Example (from NCBI):

    5157 | 1028423 | Ceratocystis | genus |
    5156 | 91171 | Gondwanamyces proteae | species |
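
As a rough illustration of the row layout described above, the following Python sketch splits rows with or without the vertical bars and uses the header row when one is present. The function names `split_row` and `parse_taxonomy`, and the rule of detecting bars by looking for a tab followed by a bar, are assumptions made for this example; they are not taken from smasher.

```python
import re

DEFAULT_COLUMNS = ["uid", "parent_uid", "name", "rank"]


def split_row(line):
    """Split one line into fields, with or without vertical bars."""
    line = line.rstrip("\n")
    if "\t|" in line:
        # Every column, including the last, is followed by tab-bar(-tab),
        # so drop the trailing terminator before splitting.
        line = re.sub(r"\t\|\t?$", "", line)
        return line.split("\t|\t")
    return line.split("\t")


def parse_taxonomy(lines):
    """Yield one dict per taxon; the header row is optional."""
    columns = DEFAULT_COLUMNS
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        fields = split_row(line)
        if i == 0 and fields[:1] == ["uid"]:
            # A header row names the columns (possibly more than four).
            columns = fields
            continue
        yield dict(zip(columns, fields))
```

With a file opened as UTF-8, `parse_taxonomy(open(path, encoding="utf-8"))` yields dictionaries keyed by column name; all values, including `uid`, stay as strings because identifiers need not be integers.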

**Optional additional columns:**

* `sourceinfo`: a comma-separated list of source specifiers, each one either a [URL or a CURIE](https://www.w3.org/TR/rdfa-syntax/#dfn-curieoriri). If a URL, it should be either a DOI in the form of a URL, or a link to some other source such as a database. URLs usually begin 'http://' or 'https://' and DOI URLs begin 'http://dx.doi.org/10.'. A CURIE is an abbreviated URI using a prefix drawn from a known set, e.g. ncbi:1234 is taxon 1234 in the NCBI taxonomy. Other prefixes include gbif:, if: (Index Fungorum), mb: (Mycobank). New prefixes can be added, but this is a manual process, so please request them explicitly.
* `uniqueName`: a human-readable string that is unique to this taxon, typically the taxon name if it is unique, or the taxon name followed by "([rank] in [ancestor])" where rank is the taxon's rank and ancestor is an ancestor that is unique to this taxon (among the taxa that have the same name). If the field is empty, the taxon name is already unique in the taxonomy.
* `flags`: a comma-separated list of flags or markers. Usually these are generated by taxonomy synthesis and are used to decide whether a taxon is 'hidden' or not. For example, if there's an 'extinct' flag then it may be desirable to suppress the taxon in an application. See the [taxon flags page](./taxon-flags.md).

Example (from OTT) (long line):

    2829583 | 4037065 | Symbiodinium pilosum | species | ncbi:2952,gbif:3207147,irmng:10996086,irmng:11902428 | | unclassified_inherited,infraspecific |
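
The two list-valued optional columns can be unpacked along these lines. This is a minimal sketch: the helper names `parse_sourceinfo` and `parse_flags` are hypothetical, and the sketch assumes that individual source specifiers never contain commas.

```python
def parse_sourceinfo(field):
    """Split a sourceinfo field into ("url", str) or ("curie", (prefix, id)) pairs."""
    specs = []
    for item in (s.strip() for s in field.split(",")):
        if not item:
            continue
        if item.startswith(("http://", "https://")):
            specs.append(("url", item))
        else:
            # Treat anything else as a CURIE such as ncbi:1234.
            prefix, _, local_id = item.partition(":")
            specs.append(("curie", (prefix, local_id)))
    return specs


def parse_flags(field):
    """Return the flags column as a set; an empty field gives an empty set."""
    return {f.strip() for f in field.split(",") if f.strip()}
```

For the OTT example row above, `parse_sourceinfo` would return four CURIE entries with the prefixes `ncbi`, `gbif`, and `irmng` (twice), and `parse_flags` would return a two-element set.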

### Synonyms

Usually there are synonyms. These go into a second file, `synonyms.tsv`. This file must have a header row:

    uid | name | type | rank |

The header is necessary because it designates the order of the columns, which can sometimes change. These are the four columns:

* _uid_ - the id for the taxon (from the taxonomy file) that this synonym resolves to
* _name_ - the synonymic taxon name
* _type_ - typically will be 'synonym' but could be any of the NCBI synonym types (authority, common name, etc.)
* _rank_ - currently ignored for taxonomy synthesis.

Example from NCBI:

    89373 | Flexibacteraceae | synonym | |
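
Because the header determines the column order, a reader should look fields up by name rather than by position. A minimal sketch, assuming the vertical bars are present and using the hypothetical helper names `split_fields` and `synonyms_by_uid`:

```python
from collections import defaultdict


def split_fields(line):
    """Split a tab-bar-tab separated line, dropping the trailing terminator."""
    line = line.rstrip("\n")
    if line.endswith("\t|\t"):
        line = line[:-3]
    elif line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def synonyms_by_uid(lines):
    """Map each taxon uid to a list of (synonym name, synonym type) pairs."""
    rows = (line for line in lines if line.strip())
    header = split_fields(next(rows))  # the header row is mandatory
    index = defaultdict(list)
    for line in rows:
        record = dict(zip(header, split_fields(line)))
        index[record["uid"]].append((record["name"], record["type"]))
    return dict(index)
```

Looking fields up through the header means the same code keeps working if the columns appear in a different order.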

### Forwarding pointers

When two records are combined into one, as when a newly learned synonymy reveals that two names name the same taxon, one of the records' ids is kept and the other one is retired. The file `forwards.tsv` lists all such retired ids and, for each, the id of the record it was merged into.

The file format is a simple tab-separated file with two columns and a header row, e.g.

    id	replacement
    5533177	886365
    5533176	135041
    5533174	195815
    3878986	385523
    5533172	135041
    5533171	898152
    5533170	5533295
    2983263	2983269
    4967339	2915806
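
A consumer holding old ids can use this file to map them to current ones. The sketch below reads the two-column file with Python's standard `csv` module. The names `load_forwards` and `resolve` are hypothetical, and following chains of forwards (a replacement that has itself been retired) is a defensive assumption of this example, not something the format description guarantees will occur.

```python
import csv
import io


def load_forwards(text):
    """Read forwards.tsv content into a dict of retired id -> replacement id."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["id"]: row["replacement"] for row in reader}


def resolve(uid, forwards):
    """Follow forwarding pointers from uid until reaching an id that is not retired."""
    seen = set()
    while uid in forwards and uid not in seen:
        seen.add(uid)  # guard against accidental cycles
        uid = forwards[uid]
    return uid
```

An id that does not appear in the file is returned unchanged, so `resolve` can be applied to every id in a dataset without first checking whether it was retired.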

### Version

The file `version.txt` contains just the OTT version number, e.g. "2.9draft12".

### Reports

Taxonomies that are the output of smasher also contain a number of files to assist research into decisions made by smasher.

* `conflicts.tsv` - gives details on conflicts between source taxonomies.
* `log.tsv` - traces how node mappings were chosen, for a selected subset of nodes (the entire trace for all nodes would be far too big).
* `deprecated.tsv` - lists ids that were retired in this version (but only for ids that occur as OTUs in phylesystem). Also lists ids that were not suppressed before, but are suppressed now.
* a few others

### Metadata

Overall metadata for the taxonomy is placed in a separate file. The metadata format is currently under development. Smasher generates this in JSON format as `about.json`, but this file is currently not used programmatically, and is in the process of being overhauled. When generating a taxonomy according to this format in external tools, for now it is best to simply write a markdown or plain text file called `about.md` (in the same directory as `taxonomy.tsv` and `synonyms.tsv`).

The metadata provided in the file should include the source of the taxonomy (article or database) as a URL and any other descriptive information that's available. The metadata should not only describe the taxonomy but also explain how to check its correctness against its source and how to make corrections and other improvements should the source be updated. When using information from changing sources (databases), the date or dates of retrieval should be recorded.

***

_This page was originally part of the [open tree wiki](https://github.com/OpenTreeOfLife/opentree/wiki/Interim-taxonomy-file-format). On 2014-02-06 it was transferred to the [reference-taxonomy wiki](https://github.com/OpenTreeOfLife/reference-taxonomy/wiki/Interim-taxonomy-file-format), and then from there into the reference-taxonomy repository's doc directory on 2017-02-25._

doc/taxon-flags.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Taxon flags

The last column in the `taxonomy.tsv` file in the [interim taxonomy file format](./interim-format.md) is "flags". The flags entry is a comma-separated list of flags or markers. Usually these are generated by taxonomy synthesis and are used to decide whether a taxon is to be suppressed in downstream processing. For example, if there's a `not_otu` flag then the name may not correspond to anything taxon-like and it may be desirable to suppress the name.

The possible values in that field are:

* 'Incertae sedis'-like flags
  * `incertae_sedis` - in source taxonomy, was a member of an "incertae sedis" container (also "unallocated", "unclassified", "mitosporic")
  * `incertae_sedis_inherited` - descends from a node flagged `incertae_sedis`
  * `major_rank_conflict` - in source taxonomy, there is a gap, skipping a Linnaean rank, between the node's rank and its parent's rank, while there is a sibling not showing such a gap. For example: a genus in an order, that has a sibling that is a family. This flag is only applied in certain sources, e.g. GBIF, which happen to represent "incertae sedis" in this way. Does not apply to NCBI. Processed the same as `incertae_sedis`
  * `major_rank_conflict_inherited` - descends from a node flagged `major_rank_conflict`
  * `unplaced` (new in OTT 2.9) - equivalent to `incertae_sedis`. The node's parent is inconsistent with OTT, i.e. does not fit into the hierarchy, so the node has been made to be a child of the MRCA of the children of the inconsistent taxon
  * `unplaced_inherited` - descends from a node flagged `unplaced`
  * `environmental` - child of a node whose name contains the strings "environmental samples" or "mycorrhizal samples". Equivalent to `incertae_sedis`
  * `environmental_inherited` - descends from a node flagged `environmental`
  * `sibling_higher` - has a sibling with a higher rank, where `major_rank_conflict` does not apply. For example: a subfamily with a sibling that's a family. Similar to `major_rank_conflict`, but treatment as incertae sedis is not definitely warranted. Currently this only serves as a warning to a human browsing the taxonomy; it has no effect on assembly.
  * `inconsistent` (new in OTT 2.9) - a placeholder or "tombstone" for a taxon that has been removed due to its being inconsistent with higher priority taxa (judged to be not a clade). Does not have children, and can generally be ignored.
  * `merged` (OTT 2.9) - similar to `inconsistent`, but the children were directly placed in a larger taxon
* Other flags
  * `barren` - there are only higher taxa at and below this node, no species or unranked tips
  * `extinct` - node is annotated as extinct (usually but not always by IRMNG)
  * `extinct_inherited` - descends from a node flagged `extinct`.
  * `hidden` - marked hidden due to Open Tree curatorial decision (e.g. microbes from GBIF)
  * `hidden_inherited` - descends from a node flagged `hidden`
  * `hybrid` - taxon name contains "hybrid" or " x " indicating that it is a hybrid. Also, any node descended from such a node.
  * `infraspecific` - descends from a node with rank "species"
  * `not_otu` - the name suggests that this is not a taxon. Keywords interpreted this way include "uncultured", "unclassified", "unidentified", "unknown", "metagenome", "other sequences", "artificial", "libraries", "transposons", and a few others. Also "sp." when at the end of a name. Also, any node descended from such a node. This flag is applied to NCBI taxa but not to SILVA taxa.
  * `viral` - the taxon name suggests that it has something to do with viruses. Also, any node descended from such a node.
  * `was_container` - this node used to be a container pseudo-taxon (incertae sedis, environmental samples, etc.) but its children have all been flagged and moved to the node's parent

* Deprecated flags: (occur in old versions of OTT but not current ones)
  * `major_rank_conflict_direct` - superseded by `was_container`
  * `unclassified` - this is NCBI's way of saying incertae sedis
  * `unclassified_inherited` - descends from a node flagged `unclassified`
  * `sibling_lower` (deprecated as of OTT 2.9)
  * `tattered` (deprecated as of OTT 2.9 in favor of `was_container`)
  * `tattered_inherited` (deprecated as of OTT 2.9 in favor of `unplaced` and `unplaced_inherited`)
  * `edited` - the taxon has been subject to an ad hoc edit ("patch")
  * `forced_visible` - not currently used
  * `extinct_direct` - superseded by `was_container`
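
Several flags in the list come in pairs, where the `_inherited` form marks descendants of a node carrying the base flag. The sketch below illustrates that relationship on a small in-memory tree. It is only an illustration of the definition: the function name `add_inherited` and the dictionary-based inputs are assumptions made here, and the code is not a description of how smasher actually computes flags. It also does not cover flags such as `hybrid`, `not_otu`, and `viral`, where descendants receive the same flag rather than an `_inherited` variant.

```python
def add_inherited(parent_of, flags_of, base):
    """Return uid -> flag set, adding '<base>_inherited' below nodes flagged `base`.

    parent_of maps every uid to its parent_uid ("" for a root);
    flags_of maps some uids to their directly assigned flags.
    """
    inherited = base + "_inherited"
    result = {uid: set(flags_of.get(uid, ())) for uid in parent_of}
    for uid in parent_of:
        node = parent_of[uid]
        seen = set()
        # Walk up toward the root; an empty parent_uid marks a root.
        while node and node not in seen:
            if base in flags_of.get(node, ()):
                result[uid].add(inherited)
                break
            seen.add(node)
            node = parent_of.get(node, "")
    return result
```

For a chain such as root, genus, species where only the genus carries `extinct`, the species would come back with `extinct_inherited` while the genus keeps its direct flag.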
40+
41+
For more detail see the [taxomachine source code](https://github.com/OpenTreeOfLife/taxomachine/blob/master/src/main/java/org/opentree/taxonomy/OTTFlag.java) and the smasher source code.
42+
43+
Synthesis (treemachine and future methods) and taxomachine are guided
44+
by the presence of these flags; each has its own list of flags that it
45+
uses as criteria for deciding whether to include an OTT entity in
46+
processing. For taxomachine, the flags affect which names are offered
47+
via the TNRS. For synthesis, the flags determine whether a node is to
48+
be included in the tree.
49+
50+
## Flags leading to taxa being unavailable for TNRS
51+
52+
Taxon flags influence the behavior of the [taxonomic name resolution services](https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-of-Life-APIs#match_names). If a taxon has any of the following flags, it is suppressed for TNRS purposes (i.e. not offered in TNRS results):
53+
54+
* not_otu
55+
* environmental
56+
* environmental_inherited
57+
* viral
58+
* hidden
59+
* hidden_inherited
60+
* was_container
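
The rule above amounts to a set-intersection test on the flags column. A minimal sketch, in which the constant `TNRS_SUPPRESS_FLAGS` and the function `suppressed_for_tnrs` are names invented for this example (the authoritative list lives in the taxomachine source, not here):

```python
# Flags listed on this page as suppressing a taxon for TNRS purposes.
TNRS_SUPPRESS_FLAGS = frozenset({
    "not_otu",
    "environmental",
    "environmental_inherited",
    "viral",
    "hidden",
    "hidden_inherited",
    "was_container",
})


def suppressed_for_tnrs(flags_field):
    """Return True if the comma-separated flags field contains any suppressing flag."""
    flags = {f.strip() for f in flags_field.split(",") if f.strip()}
    return not flags.isdisjoint(TNRS_SUPPRESS_FLAGS)
```

A taxon with an empty flags field is never suppressed by this test.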
61+
62+
## Flags leading to taxa being suppressed from the synthetic tree
63+
64+
WRITE ME
65+
66+
-----
67+
68+
_This page was copied from the [reference-taxonomy
69+
wiki](https://github.com/OpenTreeOfLife/reference-taxonomy/wiki/Taxon-flags)
70+
on 2017-02-25._
