Use-case: the portal data

In this example, we will use NCBITaxonomy to validate the names of the species used in the Portal teaching dataset:

Ernest, Morgan; Brown, James; Valone, Thomas; White, Ethan P. (2017): Portal Project Teaching Database. figshare. https://doi.org/10.6084/m9.figshare.1314459.v6

We will download a list of species from figshare, which is given as a JSON file:

using NCBITaxonomy
using DataFrames
using JSON
using StringDistances

species_file = download("https://ndownloader.figshare.com/files/3299486")
species = JSON.parsefile(species_file)
54-element Array{Any,1}:
 Dict{String,Any}("species" => "bilineata","genus" => "Amphispiza","taxa" => "Bird","species_id" => "AB")
 Dict{String,Any}("species" => "harrisi","genus" => "Ammospermophilus","taxa" => "Rodent","species_id" => "AH")
 Dict{String,Any}("species" => "savannarum","genus" => "Ammodramus","taxa" => "Bird","species_id" => "AS")
 Dict{String,Any}("species" => "taylori","genus" => "Baiomys","taxa" => "Rodent","species_id" => "BA")
 Dict{String,Any}("species" => "brunneicapillus","genus" => "Campylorhynchus","taxa" => "Bird","species_id" => "CB")
 Dict{String,Any}("species" => "melanocorys","genus" => "Calamospiza","taxa" => "Bird","species_id" => "CM")
 Dict{String,Any}("species" => "squamata","genus" => "Callipepla","taxa" => "Bird","species_id" => "CQ")
 Dict{String,Any}("species" => "scutalatus","genus" => "Crotalus","taxa" => "Reptile","species_id" => "CS")
 Dict{String,Any}("species" => "tigris","genus" => "Cnemidophorus","taxa" => "Reptile","species_id" => "CT")
 Dict{String,Any}("species" => "uniparens","genus" => "Cnemidophorus","taxa" => "Reptile","species_id" => "CU")
 ⋮
 Dict{String,Any}("species" => "tereticaudus","genus" => "Spermophilus","taxa" => "Rodent","species_id" => "ST")
 Dict{String,Any}("species" => "undulatus","genus" => "Sceloporus","taxa" => "Reptile","species_id" => "SU")
 Dict{String,Any}("species" => "sp.","genus" => "Sigmodon","taxa" => "Rodent","species_id" => "SX")
 Dict{String,Any}("species" => "sp.","genus" => "Lizard","taxa" => "Reptile","species_id" => "UL")
 Dict{String,Any}("species" => "sp.","genus" => "Pipilo","taxa" => "Bird","species_id" => "UP")
 Dict{String,Any}("species" => "sp.","genus" => "Rodent","taxa" => "Rodent","species_id" => "UR")
 Dict{String,Any}("species" => "sp.","genus" => "Sparrow","taxa" => "Bird","species_id" => "US")
 Dict{String,Any}("species" => "leucophrys","genus" => "Zonotrichia","taxa" => "Bird","species_id" => "ZL")
 Dict{String,Any}("species" => "macroura","genus" => "Zenaida","taxa" => "Bird","species_id" => "ZM")

Cleaning up the portal names

There are two things we want to do at this point: extract the species names from the file, and then validate that they are spelled correctly and that they are the most recent taxonomic name according to NCBI.
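The first of these two steps can be sketched on its own, without any NCBI lookup. The snippet below is a minimal illustration, assuming records shaped like the ones printed above (three of them are copied by hand); the helper name portal_name is hypothetical and is not part of NCBITaxonomy. Records whose species is "sp." only carry a genus or a vernacular group name, so only that field is kept:

```julia
# Three records copied by hand from the parsed JSON shown above
records = [
    Dict("species" => "bilineata", "genus" => "Amphispiza", "taxa" => "Bird", "species_id" => "AB"),
    Dict("species" => "sp.", "genus" => "Sigmodon", "taxa" => "Rodent", "species_id" => "SX"),
    Dict("species" => "sp.", "genus" => "Lizard", "taxa" => "Reptile", "species_id" => "UL"),
]

# Hypothetical helper: build the name that will be searched for
portal_name(r) = r["species"] == "sp." ? r["genus"] : string(r["genus"], " ", r["species"])

# Map each Portal code to its query name
portal_names = Dict(r["species_id"] => portal_name(r) for r in records)
```

The loop further down applies the same rule to every record before querying NCBI.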

We will store our results in a data frame:

cleanup = DataFrame(
    code = String[],
    portal = String[],
    name = String[],
    rank = Symbol[],
    order = String[],
    taxid = Int[]
)

0 rows × 6 columns

code    portal  name    rank    order   taxid
String  String  String  Symbol  String  Int64

The next step is to loop through the species, and figure out what to do with them:

for sp in species
    # Entries recorded as "sp." only carry a genus (or a vernacular group name)
    portal_name = sp["species"] == "sp." ? sp["genus"] : sp["genus"]*" "*sp["species"]
    # Try an exact match first, and fall back to a fuzzy search if it fails
    ncbi_tax = taxid(portal_name)
    if isnothing(ncbi_tax)
        ncbi_tax = taxid(portal_name; fuzzy=true)
    end
    # Walk up the lineage to find the order this taxon belongs to
    ncbi_lin = lineage(ncbi_tax)
    push!(cleanup,
        (
            sp["species_id"], portal_name, ncbi_tax.name, rank(ncbi_tax),
            first(filter(t -> rank(t) == :order, ncbi_lin)).name,
            ncbi_tax.id
        )
    )
end

first(cleanup, 5)

5 rows × 6 columns

 Row │ code    portal                           name                             rank     order          taxid
     │ String  String                           String                           Symbol   String         Int64
   1 │ AB      Amphispiza bilineata             Amphispiza bilineata             species  Passeriformes  198939
   2 │ AH      Ammospermophilus harrisi         Ammospermophilus harrisii        species  Rodentia       45487
   3 │ AS      Ammodramus savannarum            Ammodramus savannarum            species  Passeriformes  135422
   4 │ BA      Baiomys taylori                  Baiomys taylori                  species  Rodentia       56219
   5 │ CB      Campylorhynchus brunneicapillus  Campylorhynchus brunneicapillus  species  Passeriformes  141853

Looking at species with a name discrepancy

Finally, we can look at the codes for which there is a likely issue because the names do not match – this can be because of new names, improper use of vernacular, or spelling issues:

filter(r -> r.portal != r.name, cleanup)

14 rows × 6 columns

 Row │ code    portal                     name                           rank     order          taxid
     │ String  String                     String                         Symbol   String         Int64
   1 │ AH      Ammospermophilus harrisi   Ammospermophilus harrisii      species  Rodentia       45487
   2 │ CS      Crotalus scutalatus        Crotalus scutulatus            species  Squamata       8737
   3 │ CT      Cnemidophorus tigris       Aspidoscelis tigris            species  Squamata       52180
   4 │ CU      Cnemidophorus uniparens    Aspidoscelis uniparens         species  Squamata       37197
   5 │ EO      Eumeces obsoletus          Plestiodon obsoletus           species  Squamata       463535
   6 │ GS      Gambelia silus             Gambelia sila                  species  Squamata       475046
   7 │ PH      Perognathus hispidus       Strophanthus hispidus          species  Gentianales    2605637
   8 │ PU      Pipilo fuscus              Kieneria fusca                 species  Passeriformes  40205
   9 │ SC      Sceloporus clarki          Sceloporus clarkii             species  Squamata       235405
  10 │ SS      Spermophilus spilosoma     Xerospermophilus spilosoma     species  Rodentia       45471
  11 │ ST      Spermophilus tereticaudus  Xerospermophilus tereticaudus  species  Rodentia       99860
  12 │ UL      Lizard                     Lisarda                        genus    Hemiptera      204543
  13 │ UR      Rodent                     Rodentia                       order    Rodentia       9989
  14 │ US      Sparrow                    Passeridae                     family   Passeriformes  9158
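A first pass at sorting these rows into the three causes listed above can be sketched as follows. This is a rough heuristic and not something NCBITaxonomy provides: the function name discrepancy and its labels are hypothetical, and the labels say nothing about whether a match is correct (the PH row would be labelled as a change of genus even though the match itself is wrong).

```julia
# Hypothetical helper: give a coarse label to a (portal name, NCBI name, rank) triple
function discrepancy(portal::AbstractString, name::AbstractString, rank::Symbol)
    portal == name && return :match
    # Anything matched above the species level was a genus or a vernacular name
    rank == :species || return :not_species_level
    same_genus = first(split(portal)) == first(split(name))
    return same_genus ? :epithet_differs : :genus_differs
end

# Rows copied by hand from the table above
spelling = discrepancy("Crotalus scutalatus", "Crotalus scutulatus", :species)
renamed = discrepancy("Cnemidophorus tigris", "Aspidoscelis tigris", :species)
vernacular = discrepancy("Rodent", "Rodentia", :order)
```

Here spelling is :epithet_differs, renamed is :genus_differs, and vernacular is :not_species_level, which is only a starting point for the manual check.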

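Note that these results should always be manually curated. For example, two names have been assigned to groups that are obviously wrong for a dataset of vertebrates (Gentianales is an order of flowering plants, and Hemiptera is an order of insects):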

filter(r -> r.order ∈ ["Gentianales","Hemiptera"], cleanup)

2 rows × 6 columns

 Row │ code    portal                name                   rank     order        taxid
     │ String  String                String                 Symbol   String       Int64
   1 │ PH      Perognathus hispidus  Strophanthus hispidus  species  Gentianales  2605637
   2 │ UL      Lizard                Lisarda                genus    Hemiptera    204543
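The filter above works because we already know which orders are wrong. When we do not, the check can be turned around and written as an allow-list. The sketch below uses plain named tuples instead of a data frame so that it runs on its own; the variable names are hypothetical, and the list of expected orders is an assumption that only covers orders visible in the tables above, so it would need to be completed for the full dataset.

```julia
# Rows copied by hand from the tables above, as plain named tuples
matches = [
    (code = "CS", portal = "Crotalus scutalatus", name = "Crotalus scutulatus", order = "Squamata"),
    (code = "PH", portal = "Perognathus hispidus", name = "Strophanthus hispidus", order = "Gentianales"),
    (code = "UL", portal = "Lizard", name = "Lisarda", order = "Hemiptera"),
    (code = "UR", portal = "Rodent", name = "Rodentia", order = "Rodentia"),
]

# Assumption: an illustrative, incomplete list of orders we expect in this dataset
expected_orders = Set(["Passeriformes", "Rodentia", "Squamata"])

# Keep the rows whose order is not on the list, so they can be checked by hand
to_review = filter(r -> r.order ∉ expected_orders, matches)
```

With these four rows, to_review contains the PH and UL entries.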

Fixing the mis-identified species

The obvious choice here is manual cleaning, and it is a good solution. NCBITaxonomy also offers the ability to build a namefinder from a list of known NCBI taxa, which is useful when we know that the names we expect to find are part of a reference list.

In this case, we know that the species are going to be vertebrates, so we can use the vertebratefinder function to restrict the search to these groups:

vertebratefinder(true)("Lizard"; fuzzy=true)
Lepidosauria (8504)

However, this approach does not seem to work for the second name:

vertebratefinder(true)("Perognathus hispidus"; fuzzy=true)
Perognathus fasciatus (38677)

The mystery of the hispid pocket mouse

This one will not be solved by our approach, as it is an invalid name: Perognathus hispidus should actually be Chaetodipus hispidus. Here is the list of issues that prevent this name from being identified easily. First, Chaetodipus is a valid name, for which Perognathus is not a synonym, so searching by genus is not going to help. Second, there are a whole lot of species whose name ends with hispidus, so trying different string distances is not going to help either. We can try:

vertebratefinder(true)("Perognathus hispidus"; fuzzy=true, dist=DamerauLevenshtein)
Perognathus fasciatus (38677)

This returns a valid taxon, but an incorrect one (the Olive-backed pocket mouse). There is no obvious way to solve this problem.
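Why string distances cannot rescue this name can be checked by hand. The sketch below uses a small pure-Julia Levenshtein distance as a stand-in: it is not the DamerauLevenshtein distance from StringDistances used above, it is not necessarily how NCBITaxonomy ranks candidates internally, and the function name levenshtein is ours. It compares the Portal name against the taxon that was returned and against the name we actually want:

```julia
# Plain Levenshtein edit distance (insertions, deletions, substitutions),
# written out so the example has no dependency
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))      # distances from the empty prefix of `a`
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i                  # deleting the first i characters of `a`
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

query = "Perognathus hispidus"
d_returned = levenshtein(query, "Perognathus fasciatus")  # what the search gave us
d_wanted = levenshtein(query, "Chaetodipus hispidus")     # what we were hoping for
```

The returned name is 5 edits away from the query, and the correct name is further away than that, so a search ranked by edit distance alone will prefer the wrong taxon.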

Or is there?

To solve the issue with Lizards, we had to move away from taxid, and use vertebratefinder to limit the scope of the search. It would save some time to use this for the entire portal dataset, so let's create a portalnamesolver function:

portalnamesolver = vertebratefinder(true)
(::NCBITaxonomy.var"#_inner_finder#2"{NCBITaxonomy.var"#_inner_finder#1#3"{DataFrames.DataFrame}}) (generic function with 1 method)

It currently does not help with our example, but this is ok, as we can use one of Julia's features to hard-code the solution: dispatching on values. Because portalnamesolver is a singleton function (due to the way namefinder works), we need to be explicit about which module we want to extend it in (@__MODULE__ will get the appropriate value, which can be Main if you work from the REPL, the Weave sandbox if you are generating a document, or your own module if you structure your analysis this way):

Env = @__MODULE__
function Env.portalnamesolver(::Type{Val{Symbol("Perognathus hispidus")}})
    return ncbi"Chaetodipus hispidus"
end

This definition says "every time we call portalnamesolver with the Val type wrapping a Symbol of this species name, return this species". We can call it with:

portalnamesolver(Val{Symbol("Perognathus hispidus")})
Chaetodipus hispidus (38665)

Note that this is not changing the behavior of our portalnamesolver, it is simply adding a method:

portalnamesolver("Lizards"; fuzzy=true)
Lepidosauria (8504)
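The mechanism can be seen in isolation, without NCBITaxonomy. In the sketch below, fallback_lookup is a hypothetical stand-in for the fuzzy search and solve is a hypothetical stand-in for portalnamesolver. Unlike the code above, it dispatches on Val instances rather than on the Val type, and it adds a string entry point so that the override is picked up automatically:

```julia
# Hypothetical stand-in for the fuzzy search done by the name finder
fallback_lookup(name::AbstractString) = "fuzzy match for " * name

# Generic method: any name without an override goes through the fallback
solve(::Val{S}) where {S} = fallback_lookup(String(S))

# Value-specific method: hard-code the answer for one problematic name
solve(::Val{Symbol("Perognathus hispidus")}) = "Chaetodipus hispidus"

# String entry point: wrap the name in a Val and let dispatch pick the method
solve(name::AbstractString) = solve(Val(Symbol(name)))

fixed = solve("Perognathus hispidus")  # uses the hard-coded method
other = solve("Lizard")                # uses the generic fallback
```

Adding the value-specific method leaves the generic one untouched, which is the same property shown above for portalnamesolver. This pattern suits a handful of overrides; each distinct name creates its own method specialization.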

At this point, we may want to update the very first loop, to use the portalnamesolver throughout.

Wrapping-up

This vignette illustrates how to go through a list of names, and match them against the NCBI taxonomy. We have seen a number of functions from NCBITaxonomy, including fuzzy string searching, using custom string distances, limiting the taxonomic scope of the search, and finally using value-based dispatch to fix the unfixable. The last step can be automated a lot by relying on Julia's existing code generation techniques, but this goes beyond the scope of this vignette.