Use-case: the portal data
In this example, we will use NCBITaxonomy to validate the names of the species used in the Portal teaching dataset:
Ernest, Morgan; Brown, James; Valone, Thomas; White, Ethan P. (2017): Portal Project Teaching Database. figshare. https://doi.org/10.6084/m9.figshare.1314459.v6
We will download a list of species from figshare, which is given as a JSON file:
using NCBITaxonomy
using DataFrames
using JSON
using StringDistances
species_file = download("https://ndownloader.figshare.com/files/3299486")
species = JSON.parsefile(species_file)
54-element Array{Any,1}:
 Dict{String,Any}("species" => "bilineata","genus" => "Amphispiza","taxa" => "Bird","species_id" => "AB")
 Dict{String,Any}("species" => "harrisi","genus" => "Ammospermophilus","taxa" => "Rodent","species_id" => "AH")
 Dict{String,Any}("species" => "savannarum","genus" => "Ammodramus","taxa" => "Bird","species_id" => "AS")
 Dict{String,Any}("species" => "taylori","genus" => "Baiomys","taxa" => "Rodent","species_id" => "BA")
 Dict{String,Any}("species" => "brunneicapillus","genus" => "Campylorhynchus","taxa" => "Bird","species_id" => "CB")
 Dict{String,Any}("species" => "melanocorys","genus" => "Calamospiza","taxa" => "Bird","species_id" => "CM")
 Dict{String,Any}("species" => "squamata","genus" => "Callipepla","taxa" => "Bird","species_id" => "CQ")
 Dict{String,Any}("species" => "scutalatus","genus" => "Crotalus","taxa" => "Reptile","species_id" => "CS")
 Dict{String,Any}("species" => "tigris","genus" => "Cnemidophorus","taxa" => "Reptile","species_id" => "CT")
 Dict{String,Any}("species" => "uniparens","genus" => "Cnemidophorus","taxa" => "Reptile","species_id" => "CU")
 ⋮
 Dict{String,Any}("species" => "tereticaudus","genus" => "Spermophilus","taxa" => "Rodent","species_id" => "ST")
 Dict{String,Any}("species" => "undulatus","genus" => "Sceloporus","taxa" => "Reptile","species_id" => "SU")
 Dict{String,Any}("species" => "sp.","genus" => "Sigmodon","taxa" => "Rodent","species_id" => "SX")
 Dict{String,Any}("species" => "sp.","genus" => "Lizard","taxa" => "Reptile","species_id" => "UL")
 Dict{String,Any}("species" => "sp.","genus" => "Pipilo","taxa" => "Bird","species_id" => "UP")
 Dict{String,Any}("species" => "sp.","genus" => "Rodent","taxa" => "Rodent","species_id" => "UR")
 Dict{String,Any}("species" => "sp.","genus" => "Sparrow","taxa" => "Bird","species_id" => "US")
 Dict{String,Any}("species" => "leucophrys","genus" => "Zonotrichia","taxa" => "Bird","species_id" => "ZL")
 Dict{String,Any}("species" => "macroura","genus" => "Zenaida","taxa" => "Bird","species_id" => "ZM")
Cleaning up the portal names
There are two things we want to do at this point: extract the species names from the file, and then check that they are spelled correctly and that they are the most recent taxonomic names according to NCBI.
We will store our results in a data frame:
cleanup = DataFrame(
    code = String[],
    portal = String[],
    name = String[],
    rank = Symbol[],
    order = String[],
    taxid = Int[]
)
code | portal | name | rank | order | taxid |
---|---|---|---|---|---|
String | String | String | Symbol | String | Int64 |
The next step is to loop through the species, and figure out what to do with them:
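for sp in species
    # Records without a species epithet use "sp.": search on the genus field alone
    portal_name = sp["species"] == "sp." ? sp["genus"] : sp["genus"] * " " * sp["species"]
    # Try an exact match first, and fall back to a fuzzy match if it fails
    ncbi_tax = taxid(portal_name)
    if isnothing(ncbi_tax)
        ncbi_tax = taxid(portal_name; fuzzy=true)
    end
    # Get the lineage once, and reuse it to find the order of the taxon
    ncbi_lin = lineage(ncbi_tax)
    push!(cleanup,
        (
            sp["species_id"], portal_name, ncbi_tax.name, rank(ncbi_tax),
            first(filter(t -> isequal(:order)(rank(t)), ncbi_lin)).name,
            ncbi_tax.id
        )
    )
end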
first(cleanup, 5)
Row | code | portal | name | rank | order | taxid |
---|---|---|---|---|---|---|
| | String | String | String | Symbol | String | Int64 |
1 | AB | Amphispiza bilineata | Amphispiza bilineata | species | Passeriformes | 198939 |
2 | AH | Ammospermophilus harrisi | Ammospermophilus harrisii | species | Rodentia | 45487 |
3 | AS | Ammodramus savannarum | Ammodramus savannarum | species | Passeriformes | 135422 |
4 | BA | Baiomys taylori | Baiomys taylori | species | Rodentia | 56219 |
5 | CB | Campylorhynchus brunneicapillus | Campylorhynchus brunneicapillus | species | Passeriformes | 141853 |
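The rule used in the loop to build the search name can be checked on its own, without querying NCBI. The sketch below uses a hypothetical helper, portal_name (not part of NCBITaxonomy), and three records copied from the species list; it only shows that entries marked "sp." are searched by their genus field alone.

```julia
# Hypothetical helper: build the name to search for from one Portal record.
# Records without a species epithet use "sp." as the species.
function portal_name(record::AbstractDict)
    if record["species"] == "sp."
        return record["genus"]
    end
    return record["genus"] * " " * record["species"]
end

# Three records copied from the species list above
records = [
    Dict("species" => "bilineata", "genus" => "Amphispiza", "taxa" => "Bird", "species_id" => "AB"),
    Dict("species" => "sp.", "genus" => "Sigmodon", "taxa" => "Rodent", "species_id" => "SX"),
    Dict("species" => "sp.", "genus" => "Lizard", "taxa" => "Reptile", "species_id" => "UL"),
]

portal_names = [portal_name(r) for r in records]
```

Note that the "genus" field sometimes holds a vernacular label (Lizard, Rodent, Sparrow), which is one reason fuzzy matching is needed later on.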
Looking at species with a name discrepancy
Finally, we can look at the codes for which there is a likely issue because the names do not match – this can be because of new names, improper use of vernacular, or spelling issues:
filter(r -> r.portal != r.name, cleanup)
Row | code | portal | name | rank | order | taxid |
---|---|---|---|---|---|---|
| | String | String | String | Symbol | String | Int64 |
1 | AH | Ammospermophilus harrisi | Ammospermophilus harrisii | species | Rodentia | 45487 |
2 | CS | Crotalus scutalatus | Crotalus scutulatus | species | Squamata | 8737 |
3 | CT | Cnemidophorus tigris | Aspidoscelis tigris | species | Squamata | 52180 |
4 | CU | Cnemidophorus uniparens | Aspidoscelis uniparens | species | Squamata | 37197 |
5 | EO | Eumeces obsoletus | Plestiodon obsoletus | species | Squamata | 463535 |
6 | GS | Gambelia silus | Gambelia sila | species | Squamata | 475046 |
7 | PH | Perognathus hispidus | Strophanthus hispidus | species | Gentianales | 2605637 |
8 | PU | Pipilo fuscus | Kieneria fusca | species | Passeriformes | 40205 |
9 | SC | Sceloporus clarki | Sceloporus clarkii | species | Squamata | 235405 |
10 | SS | Spermophilus spilosoma | Xerospermophilus spilosoma | species | Rodentia | 45471 |
11 | ST | Spermophilus tereticaudus | Xerospermophilus tereticaudus | species | Rodentia | 99860 |
12 | UL | Lizard | Lisarda | genus | Hemiptera | 204543 |
13 | UR | Rodent | Rodentia | order | Rodentia | 9989 |
14 | US | Sparrow | Passeridae | family | Passeriformes | 9158 |
Note that these results should always be manually curated. For example, two species have been assigned to groups that are obviously wrong:
filter(r -> r.order ∈ ["Gentianales","Hemiptera"], cleanup)
Row | code | portal | name | rank | order | taxid |
---|---|---|---|---|---|---|
| | String | String | String | Symbol | String | Int64 |
1 | PH | Perognathus hispidus | Strophanthus hispidus | species | Gentianales | 2605637 |
2 | UL | Lizard | Lisarda | genus | Hemiptera | 204543 |
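One way to support manual curation is to flag every match whose order falls outside the groups we expect to see, rather than naming the wrong orders by hand. The sketch below works on plain named tuples copied from the discrepancy table, so it runs without DataFrames; the expected_orders set is an assumption made for this illustration and would need to be extended for the full dataset.

```julia
# A few rows copied from the discrepancy table above, as plain named tuples
matches = [
    (code = "PH", portal = "Perognathus hispidus", name = "Strophanthus hispidus", order = "Gentianales"),
    (code = "PU", portal = "Pipilo fuscus", name = "Kieneria fusca", order = "Passeriformes"),
    (code = "UL", portal = "Lizard", name = "Lisarda", order = "Hemiptera"),
    (code = "UR", portal = "Rodent", name = "Rodentia", order = "Rodentia"),
]

# Assumption: orders considered plausible for these rows (illustrative, not exhaustive)
expected_orders = Set(["Passeriformes", "Rodentia", "Squamata"])

# Keep the rows whose order is outside the expected set; these need manual review
suspicious = filter(r -> !(r.order in expected_orders), matches)
suspicious_codes = [r.code for r in suspicious]
```

The same predicate can be passed to filter on the cleanup data frame, since each row exposes an order field.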
Fixing the mis-identified species
The obvious choice here is manual cleaning, and it is a good solution. NCBITaxonomy also offers the ability to build a namefinder from a list of known NCBI taxa, which is useful when we know that the names we expect to find are part of a reference list.
In this case, we know that the species are going to be vertebrates, so we can use the vertebratefinder function to restrict the search to these groups:
vertebratefinder(true)("Lizard"; fuzzy=true)
Lepidosauria (8504)
However, this approach does not seem to work for the second name:
vertebratefinder(true)("Perognathus hispidus"; fuzzy=true)
Perognathus fasciatus (38677)
The mystery of the hispid pocket mouse
This one will not be solved by our approach, as it is an invalid name: Perognathus hispidus should actually be Chaetodipus hispidus. Here is the list of issues that make this name hard to identify. First, Chaetodipus is a valid name, for which Perognathus is not a synonym, so searching by genus is not going to help. Second, there are a whole lot of species whose names end with hispidus, and trying different string distances is not going to help either. We can try:
vertebratefinder(true)("Perognathus hispidus"; fuzzy=true, dist=DamerauLevenshtein)
Perognathus fasciatus (38677)
This returns a valid taxon, but an incorrect one (the Olive-backed pocket mouse). There is no obvious way to solve this problem.
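To see why changing the string distance is unlikely to help, it is useful to compare edit distances by hand. The sketch below is a plain Levenshtein implementation written for this illustration; it is not how NCBITaxonomy or StringDistances score candidates, and the function name levenshtein is our own. It only shows that, character for character, the query is closer to the wrong name than to the right one.

```julia
# Plain Levenshtein distance (insertions, deletions, substitutions),
# computed with two rows of the dynamic programming table
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))   # distances from the empty prefix of a
    curr = similar(prev)
    for i in 1:length(s)
        curr[1] = i
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev, curr = curr, prev
    end
    return prev[end]
end

query = "Perognathus hispidus"
d_wrong = levenshtein(query, "Perognathus fasciatus")
d_right = levenshtein(query, "Chaetodipus hispidus")
println((d_wrong, d_right))
```

Because the query shares its whole genus with Perognathus fasciatus, fewer edits are needed to reach that name than to reach Chaetodipus hispidus, so a purely string-based search has no reason to prefer the correct taxon.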
Or is there?
To solve the issue with Lizards, we had to move away from taxid, and use vertebratefinder to limit the scope of the search. It would save some time to use this for the entire portal dataset, so let's create a portalnamesolver function:
portalnamesolver = vertebratefinder(true)
(::NCBITaxonomy.var"#_inner_finder#2"{NCBITaxonomy.var"#_inner_finder#1#3"{DataFrames.DataFrame}}) (generic function with 1 method)
It currently does not help with our example, but this is fine, as we can use one of Julia's features to hard-code the solution: dispatching on values. Because portalnamesolver is a singleton function (due to the way namefinder works), we need to be explicit about which module we want to extend it from (@__MODULE__ will return the appropriate value, which can be Main if you work from the REPL, the Weave sandbox if you are generating a document, or your own module if you structure your analysis this way):
Env = @__MODULE__
function Env.portalnamesolver(::Type{Val{Symbol("Perognathus hispidus")}})
return ncbi"Chaetodipus hispidus"
end
This definition says "every time we call portalnamesolver with a Val type whose parameter is the Symbol for this species name, return this species". We can call it with:
portalnamesolver(Val{Symbol("Perognathus hispidus")})
Chaetodipus hispidus (38665)
Note that this does not change the existing behavior of portalnamesolver; it simply adds a method:
portalnamesolver("Lizards"; fuzzy=true)
Lepidosauria (8504)
At this point, we may want to update the very first loop to use portalnamesolver throughout.
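If the loop is updated this way, it is convenient to keep passing plain strings and let the hard-coded corrections take over when they exist. The sketch below shows one possible arrangement, using a stand-in function instead of the real finder so that it runs on its own: resolve and fallback_finder are hypothetical names, and names are represented as strings rather than NCBI taxa.

```julia
# Stand-in for a fuzzy finder (hypothetical): here it just returns the name unchanged
fallback_finder(name::AbstractString) = name

# Generic method: no correction is known for this name, so defer to the finder
resolve(::Type{Val{S}}) where {S} = fallback_finder(String(S))

# Hard-coded correction, selected by dispatching on the value of the name
resolve(::Type{Val{Symbol("Perognathus hispidus")}}) = "Chaetodipus hispidus"

# Convenience method: wrap a plain string in a Val type and dispatch on it
resolve(name::AbstractString) = resolve(Val{Symbol(name)})

println(resolve("Perognathus hispidus"))
println(resolve("Baiomys taylori"))
```

The design choice here is that the most specific method wins: names with a dedicated method return the corrected value, and every other name falls through to the generic method.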
Wrapping-up
This vignette illustrates how to go through a list of names and match them against the NCBI taxonomy. We have seen a number of functions from NCBITaxonomy, including fuzzy string searching, using custom string distances, limiting the taxonomic scope of the search, and finally using value-based dispatch to fix the unfixable. The last step can be largely automated by relying on Julia's existing code generation techniques, but this goes beyond the scope of this vignette.
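As a rough idea of what such automation could look like, the sketch below generates one value-dispatched method per entry of a correction table with @eval. It is an illustration only: corrected and corrections are hypothetical names, the two entries are taken from the results shown earlier, and strings stand in for NCBI taxa so that the example runs without the package.

```julia
# Hypothetical table of manual corrections (portal name => accepted name)
corrections = Dict(
    "Perognathus hispidus" => "Chaetodipus hispidus",
    "Cnemidophorus tigris" => "Aspidoscelis tigris",
)

# Fallback when no correction is known: return the name unchanged
corrected(::Type{Val{S}}) where {S} = String(S)

# Generate one method per entry; $old and $accepted are spliced in as string literals
for (old, accepted) in corrections
    @eval corrected(::Type{Val{Symbol($old)}}) = $accepted
end

println(corrected(Val{Symbol("Perognathus hispidus")}))
```

Keeping the corrections in a table means the curated fixes live in one place, and the methods are regenerated whenever the table changes.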