OTT maintainer's guide

Addressing feedback issues

Ideally one starts by writing a test for each issue, both to see if it has already been resolved and for regression testing. Unfortunately there's no pleasant way to write such tests.

There is no systematic approach to feedback issues. The best way to understand how to deal with them is to go through some worked examples, which is what this document does.

All solutions currently involve edits either to adjustments.py or amendments.py, both in the curation directory. See the patch language documentation for information on the notations used for writing patches. It would be better if we had a directory full of python files containing patches, to facilitate editing by multiple authors. TBD.

Issue 397 - Scallops in Cnidaria

The issue points to duplicate locations of the "same" taxon, 'Placopecten magellanicus', one with ottid 449041 (in Bilateria) and one with ottid 6370755 (in Cnidaria). This taxon is a scallop and belongs in Bilateria, i.e. the Cnidaria one is misplaced. Running bin/investigate "Placopecten magellanicus" shows that the incorrect placement comes from gbif:

r/ott-NEW/source/taxonomy.tsv:6370755	|	6370754	|	Placopecten magellanicus	|	species	|	gbif:2285952	|	Placopecten magellanicus (species in phylum Cnidaria)	|		|
r/ott-NEW/source/taxonomy.tsv:449041	|	449040	|	Placopecten magellanicus	|	species	|	worms:156972,ncbi:6577,irmng:11384074,irmng:10529011	|	Placopecten magellanicus (species in Bilateria)	|	 

This appears to be fixed in the current version of gbif, but we don't have that yet, and updating gbif is going to take some time, so we will patch instead.
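A duplicate like this can also be detected by a script, which is one way to get a crude regression check for an issue. The following is a minimal sketch and not part of the OTT tooling: it assumes that taxonomy.tsv rows use the tab-pipe-tab separator and the column order seen in the rows above. The column names and the function names parse_row and names_with_multiple_ids are my own. Homonyms are often legitimate, so the result is a list of candidates to investigate, not a list of errors.

```python
import re
from collections import defaultdict

# Assumed column order, inferred from the rows shown above.
COLUMNS = ("uid", "parent_uid", "name", "rank", "sourceinfo", "uniqname", "flags")

# Fields are separated by tab, pipe, tab; the final tab may be missing.
SEPARATOR = re.compile(r"\t\|\t?")


def parse_row(line):
    """Turn one taxonomy.tsv line into a dict keyed by column name."""
    fields = SEPARATOR.split(line.rstrip("\n"))
    # zip stops at the shorter sequence, so short rows are tolerated.
    return dict(zip(COLUMNS, fields))


def names_with_multiple_ids(lines):
    """Map each name that occurs under more than one uid to its uids."""
    ids_by_name = defaultdict(list)
    for line in lines:
        row = parse_row(line)
        if "name" in row:
            ids_by_name[row["name"]].append(row["uid"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```

Called on the lines of r/ott-NEW/source/taxonomy.tsv, this would list 'Placopecten magellanicus' for as long as both records exist.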

Compare the lineage of these two ott taxa (only showing output to the point where they match). Using bin/lineage 449041 r/ott-NEW/source:

449041	449040	Placopecten magellanicus	species	worms:156972,ncbi:6577,irmng:11384074,irmng:10529011	Placopecten magellanicus (species in Bilateria)		
449040	5692602	Placopecten	genus	worms:156971,ncbi:6576,irmng:1019019	Placopecten (genus in Bilateria)		
5692602	975312	Palliolinae	subfamily	worms:393793			
975312	951120	Pectinidae	family	worms:213,ncbi:6566,irmng:115891			
951120	951119	Pectinoidea	superfamily	worms:151320,ncbi:106219		sibling_higher
951119	1025543	Pectinida	order	worms:387324,irmng:12740		sibling_higher
1025543	1025545	Pteriomorphia	subclass	worms:206,ncbi:6545			
1025545	802117	Bivalvia	class	worms:105,ncbi:6544,gbif:137,irmng:1146		sibling_higher
802117	155737	Mollusca	phylum	worms:51,ncbi:6447,gbif:52,irmng:175			
155737	189832	Lophotrochozoa	no rank	ncbi:1206795			
189832	117569	Protostomia	no rank	ncbi:33317			
117569	641038	Bilateria	no rank	ncbi:33213			
641038	691846	Eumetazoa	no rank	ncbi:6072			

and bin/lineage 6370755 r/ott-NEW/source:

6370755	6370754	Placopecten magellanicus	species	gbif:2285952	Placopecten magellanicus (species in phylum Cnidaria)		
6370754	442914	Placopecten	genus	gbif:2285951	Placopecten (genus in phylum Cnidaria)		
442914	712383	Merulinidae	family	worms:196102,ncbi:46736,ncbi:46733,gbif:3358,gbif:5181,gbif:8219,irmng:104400,irmng:100317,irmng:100502		sibling_higher
712383	1084485	Scleractinia	order	worms:1363,ncbi:6125,gbif:714,irmng:11376		sibling_higher
1084485	1084488	Hexacorallia	subclass	worms:1340,ncbi:6102			
1084488	641033	Anthozoa	class	worms:1292,ncbi:6101,gbif:206,irmng:1134			
641033	641038	Cnidaria	phylum	worms:1267,ncbi:6073,gbif:43,irmng:151			
641038	691846	Eumetazoa	no rank	ncbi:6072			

GBIF places the genus Placopecten in Merulinidae, which is definitely in Cnidaria, so let's move the GBIF genus to Pectinidae (not to Palliolinae, which is a barren subfamily that only exists in worms).
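Finding the point where the two lineages meet can be done by eye, as above, or mechanically. Here is a small sketch that assumes the bin/lineage output is tab-separated with the taxon id in the first column and the name in the third, as in the listings above. The helper names parse_lineage and first_shared_ancestor are hypothetical and do not exist in the repository.

```python
def parse_lineage(text):
    """Return [(taxon_id, name), ...] in output order (child first)."""
    lineage = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        lineage.append((fields[0], fields[2]))
    return lineage


def first_shared_ancestor(lineage_a, lineage_b):
    """Return the first (id, name) in lineage_a that also occurs in lineage_b."""
    ids_b = {taxon_id for taxon_id, _ in lineage_b}
    for taxon_id, name in lineage_a:
        if taxon_id in ids_b:
            return taxon_id, name
    return None  # the lineages never meet
```

For the two lineages above this returns ('641038', 'Eumetazoa'), confirming that the two records only come together above the phylum level.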

Issue 345 - Conolophus extinct?

As discussed on github, the easy fix is to get rid of the mammal Conolophus, keeping the other one. For this approach, the following should go in adjust_irmng or align_irmng:

irmng.taxon('Conolophus', 'Mammalia').prune()

If we wanted to keep both taxa, we could use establish, with the following in align_irmng:

establish('Conolophus', ott, 'genus', ancestor='Mammalia', source='irmng:1415243')

We don't need to establish the iguana because it's already present in NCBI and GBIF.

That ought to be enough in itself, but I haven't tried it yet. To be sure, one would align the IRMNG genera to OTT:

a.same(irmng.taxon('Conolophus', 'Iguania'), ott.taxon('Conolophus', 'Iguania'))
a.same(irmng.taxon('Conolophus', 'Mammalia'), ott.taxon('Conolophus', 'Mammalia'))

Issue 341 - Campanulales = Asterales?

Is there an error or not? I followed the supplied OTT link to Asterales, and from there to the second IRMNG record, for Campanulales. Indeed, equating Campanulales with Asterales sounds fishy. But this is what NCBI does, so it stands. The synonymy is pro parte, I believe, so perhaps we should consider these to be distinct taxon records. I think I would leave this one alone for now.

See new issue 316 on pro parte synonyms.

Issue 340 - Myomorpha incorrectly synonymised with Sciurognathi

Mark has already analyzed this. It turns out to be logically the same as the previous issue (341).

Issue 336 - Miliaria calandra is synonym for Emberiza calandra

Running the following commands at the shell

bin/investigate "Miliaria calandra"
bin/investigate "Emberiza calandra"

shows that NCBI has the two names as separate species; so that is the source of the error.

To find out which of the two names is currently accepted, I checked worldbirdnames.org and wikipedia.

The easiest fix - requiring the least amount of thinking - is in amendments.py:

proclaim(ott, synonym_of(taxon('Miliaria calandra'), taxon('Emberiza calandra'),
                         'objective synonym', otc(60)))

or equivalently (using the v2 patch facility)

ott.taxon('Emberiza calandra').absorb(ott.taxon('Miliaria calandra'))

In the proclaim form, be sure to change 60 to some otc number that is not already in use.
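Picking an unused otc number by hand is error-prone. One way to find a candidate is to scan the patch files for existing otc(...) calls. This is only a sketch: the function names are made up, and it assumes that every number in use appears literally as otc(N) in the text you pass in, so concatenate the contents of all files that contain patches (at least adjustments.py and amendments.py) before calling it.

```python
import re

# Matches otc(60), otc( 60 ), etc.; \b keeps it from matching e.g. "xotc(60)".
OTC_CALL = re.compile(r"\botc\(\s*(\d+)\s*\)")


def used_otc_numbers(source_text):
    """Return the set of integers that appear as otc(N) in source_text."""
    return {int(number) for number in OTC_CALL.findall(source_text)}


def next_unused_otc(source_text):
    """Return one more than the largest otc number found (1 if none found)."""
    return max(used_otc_numbers(source_text), default=0) + 1
```

Taking the maximum plus one, instead of filling gaps, avoids reusing a number that was once assigned and later removed from the files.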

A more nuanced method is to put the directive in patch_ncbi in adjustments.py, to prevent the creation of incorrect structure in OTT in the first place. This requires some knowledge of the merge order, so is less robust to future taxonomy changes than the amendments.py method.

proclaim(ncbi, synonym_of(taxon('Miliaria calandra'), taxon('Emberiza calandra'),
                          'objective synonym', otc(60)))

There will be an empty genus 'Miliaria' left over. It's not too harmful to leave it in, since it will be flagged barren and suppressed from synthesis, but to be tidy one would want to get rid of it:

proclaim(ott, synonym_of(taxon('Miliaria', 'Aves'), taxon('Emberiza', 'Aves'),
                         'proparte synonym', otc(61)))

I'm specifying an ancestor for the genera because genus names are so often ambiguous, and it doesn't hurt.

Issue 332 - Misspelling of Strigops habroptilus

First try to understand what's going on.

grep "Strigops habroptilus" r/ott-NEW/source/taxonomy.tsv
809432	|	512918	|	Strigops habroptilus	|	species	|	ncbi:57251,irmng:11435975

grep "Strigops habroptila" r/ott-NEW/source/taxonomy.tsv
5857013	|	512918	|	Strigops habroptila	|	species	|	gbif:2479236

It's not obvious which is right without a knowledge of Latin. Wikipedia has -ilus; IUCN has -ila (as mentioned in the Wikipedia article). IUCN asserts "Strigops habroptilus Gray, 1845 [orth. error in BirdLife International (2004)]". GBIF imports IUCN, and if I remember correctly GBIF also has significant smarts for correcting the gender ending of epithets. My bet is on -ila because IUCN explicitly says that -ilus is an error.
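The two rows found by grep share the parent 512918 and differ only in the ending of the epithet. Pairs like this can be found mechanically by comparing sibling names. The sketch below uses difflib from the Python standard library. The function name, the (uid, parent_uid, name) input shape, and the 0.9 threshold are my assumptions, and the threshold in particular would need tuning against real data. The script only finds candidates; deciding which spelling is correct still takes the kind of research described above.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_siblings(rows, threshold=0.9):
    """Return pairs of distinct sibling names whose similarity >= threshold.

    rows is an iterable of (uid, parent_uid, name) tuples.
    """
    names_by_parent = defaultdict(list)
    for _uid, parent_uid, name in rows:
        names_by_parent[parent_uid].append(name)

    pairs = []
    for siblings in names_by_parent.values():
        for name_a, name_b in combinations(siblings, 2):
            if name_a == name_b:
                continue  # exact homonyms are a different problem
            if SequenceMatcher(None, name_a, name_b).ratio() >= threshold:
                pairs.append((name_a, name_b))
    return pairs
```

Comparing only siblings keeps the number of comparisons manageable and avoids flagging similar names in unrelated parts of the tree.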

Easier fix: in amendments.py, merge the two:

proclaim(ott, synonym_of(taxon('Strigops habroptilus'), taxon('Strigops habroptila'),
                         'gender variant', otc(62)))

Alternate fix: fix the name in align_ncbi in adjustments.py:

proclaim(ncbi, synonym_of(taxon('Strigops habroptilus'), taxon('Strigops habroptila'),
                          'gender variant', otc(62)))

The latter leaves a synonymy behind, which is enough to cause the misspelled IRMNG node to align to it.

Issue 327 - nozaki vs nozakii, lamarckii vs lamarcki

First look at taxon 152136 (Cyanea) in the taxonomy browser to see sources for both nodes. citrea is from GBIF via ITIS; citrae is from WoRMS (and subsequently CoL, GBIF, and IRMNG - GBIF has both spellings). But it's not at all clear which is right.

Cyanea citrea Kishinouye, 1910 - ITIS (NOAA)
Cyanea citrae (Kishinouye, 1910) - WoRMS (URMO)

Cyanea citrae seems to be the misspelling, based on primary literature (found by Google Scholar searches of the two spellings). The best fix would be in align_worms since WoRMS is merged before GBIF and IRMNG:

proclaim(worms, synonym_of(taxon('Cyanea citrae'), taxon('Cyanea citrea'),
                           'misspelling', otc(63)))

This creates a synonym record, allowing both spellings to align to this record down the line.

It's not clear whether rosea and rosella are different. rosella smells like a synonym because of its meagre provenance, but is it? First we check whether the two have the same authority; that would suggest a misspelling. No, the authorities are clearly different, per GBIF (GBIF and IRMNG are both good sources of authority information). rosella has no occurrence records and comes to GBIF only from IRMNG (according to its GBIF page). IRMNG gets it from CAAB (Codes for Australian Aquatic Biota), and CAAB takes us to Gowlett-Holmes, K.L., 2008. A field guide to the marine invertebrates of South Australia. Notomares, Hobart, TAS. 333pp. At about this point my energy runs out and I say let's just assume it's a separate species.

C. lamarcki vs. C. lamarckii: the spelling with one i gets 132 Google Scholar results; the spelling with two i's gets 344. WoRMS also has two i's. Rather than spend a lot of time trying to track down the original description, I would go with the majority. It can always be fixed later. A possible fix would be to add the synonym to align_worms:

proclaim(worms, synonym_of(taxon('Cyanea lamarcki'), taxon('Cyanea lamarckii'),
                           'misspelling', otc(64)))

This could also be done in adjustments.py with ott as the taxonomy.

Another option would be to use establish at the beginning of assembly to force the correct spelling at the outset, but I consider establish to be a sledgehammer to be used only when necessary.

C. nozaki vs. C. nozakii - nozaki 17 hits, nozakii 520 hits.

proclaim(worms, synonym_of(taxon('Cyanea nozaki'), taxon('Cyanea nozakii'),
                           'misspelling', otc(65)))

Issue 318 - Syntexix misspelling for Syntexis

Syntexis has two senses. Using the taxonomy browser we see that the one in Anaxyelidae (in Hymenoptera) (893510) comes from NCBI. The other is a synonym for Mollisia in Fungi. Syntexix comes from GBIF and IRMNG (it's probably in GBIF by way of IRMNG) and is in Anaxyelidae (in Hymenoptera). So the first sense of Syntexis is pretty clearly the same as the single sense of Syntexix.

This is a bit trickier because we need to merge both the genus and its single species. I would recommend modifying the GBIF-to-OTT alignment in align_gbif, but one could also modify OTT in adjustments.py. This time, specifying an ancestor of the genus is essential in order to avoid the Fungi ambiguity.

a.same(gbif.taxon('Syntexix', 'Hymenoptera'),
       ott.taxon('Syntexis', 'Hymenoptera'))
a.same(gbif.taxon('Syntexix libocedrii', 'Hymenoptera'),
       ott.taxon('Syntexis libocedrii', 'Hymenoptera'))

Issue 317 - All the Nautilidae genera ending in -ceras are extinct

The original problem is fixed in the OTT 3.1 draft, but there remains a superfluity of species in Nautilus. I sent email to Tony Rees asking him to fix the genus in IRMNG, and GBIF picks up the genus from IRMNG, so in principle the correct fix is to wait for the IRMNG correction to propagate. However, that could take years, and if new versions of IRMNG are not made available without terms of use (a TOU), it could be forever.

The easiest short-term fix is just to clean out the bad ones, or set them hidden. Wikipedia lists four valid extant species, so let's keep those and remove the others.

valid = ['belauensis', 'macromphalus', 'pompilius', 'stenomphalus']

for child in ott.taxon('Nautilus', 'Cephalopoda').getChildren():
    # Child names are binomials: the genus, then the specific epithet
    (genus, epithet) = child.name.split(' ', 1)
    if epithet not in valid:
        child.hide()
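Before hiding anything it can be useful to see which names the loop would affect. The following stand-alone sketch reproduces only the filtering step on plain strings, so it can be run without a loaded taxonomy. The function name names_to_hide is hypothetical, and skipping names that have no epithet (anything without a space) is my assumption about what one would want, not something the patch above does.

```python
def names_to_hide(child_names, valid_epithets):
    """Return the child names whose epithet is not in valid_epithets."""
    to_hide = []
    for name in child_names:
        parts = name.split(" ", 1)
        if len(parts) != 2:
            continue  # no epithet to check, so leave this one alone
        _genus, epithet = parts
        if epithet not in valid_epithets:
            to_hide.append(name)
    return to_hide
```

Printing the result for the children of Nautilus gives a list that can be reviewed, or pasted into the issue, before the patch is committed.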