35 years of work: More than 1000 leaf-mining pygmy moths classified & catalogued

The leaf-mining pygmy moths (family Nepticulidae) and the white eyecap moths (family Opostegidae) are among the smallest moths in the world with a wingspan of just a few millimetres. Their caterpillars make characteristic patterns in leaves: leaf mines. For the first time, the evolutionary relationships of the more than 1000 species have been analysed on the basis of DNA, resulting in a new classification.

Today, a team of scientists led by Dr Erik J. van Nieukerken and Dr Camiel Doorenweerd of Naturalis Biodiversity Center, Leiden, the Netherlands, published three interlinked scientific papers in the journal Systematic Entomology and the open access journal ZooKeys, together with two online databases that provide a catalogue of the names of all species involved.

The evolutionary study, which forms part of Doorenweerd's PhD thesis, used DNA methods, together with the occurrence of leaf mines in fossil leaves, to show that the group is ancient and was already diverse in the early Cretaceous, about 100 million years ago. The moths are all specialised on particular species of flowering plants (angiosperms), and could therefore diversify as the angiosperms themselves diversified and largely replaced other groups of plants ecologically during the Cretaceous. The study led to the discovery of three new genera from South and Central America, which are described in one of the two ZooKeys papers and underline the peculiar character and largely undescribed diversity of the Neotropical fauna.

Changing a classification requires a change in many species names, which prompted the authors to simultaneously publish a full catalogue of all 1072 valid species names that are known worldwide and the many synonymic names from the literature from the past 150 years.

Such a large and comprehensive overview was made possible by the moth and leaf-mine collections of the world’s natural history museums, and it is the culmination of the 35 years of research that van Nieukerken has devoted to this group. However, a small but far from trivial note in one of the publications indicates that we can expect at least another 1000 species of pygmy leaf-mining moths that are still undiscovered.

###

Original sources:

Doorenweerd C, Nieukerken EJ van, Hoare RJB (2016) Phylogeny, classification and divergence times of pygmy leafmining moths (Lepidoptera: Nepticulidae): the earliest lepidopteran radiation on Angiosperms? Systematic Entomology, Early View. doi: 10.1111/syen.1221.

Nieukerken EJ van, Doorenweerd C, Nishida K, Snyers C (2016) New taxa, including three new genera show uniqueness of Neotropical Nepticulidae (Lepidoptera). ZooKeys 628: 1-63. doi: 10.3897/zookeys.628.9805.

Nieukerken EJ van, Doorenweerd C, Hoare RJB, Davis DR (2016) Revised classification and catalogue of global Nepticulidae and Opostegidae (Lepidoptera: Nepticuloidea). ZooKeys 628: 65-246. doi: 10.3897/zookeys.628.9799.

Nieukerken EJ van (ed) (2016) Nepticulidae and Opostegidae of the world, version 2.0. Scratchpads, biodiversity online.

Nieukerken EJ van (ed) (2016). Nepticuloidea: Nepticulidae and Opostegidae of the World (Oct 2016 version). In: Species 2000 & ITIS Catalogue of Life, 31st October 2016 (Roskov Y., Abucay L., Orrell T., Nicolson D., Flann C., Bailly N., Kirk P., Bourgoin T., DeWalt R.E., Decock W., De Wever A., eds). Digital resource at http://www.catalogueoflife.org/col. Species 2000: Naturalis, Leiden, the Netherlands. ISSN 2405-8858. http://www.catalogueoflife.org/col/details/database/id/172

How the names of organisms help to turn ‘small data’ into ‘Big Data’

Innovation in ‘Big Data’ helps address problems that were previously overwhelming. What we know about organisms is spread over hundreds of millions of pages published over 250 years. New software tools from the Global Names project find scientific names, index digital documents quickly, and correct and update the names. These advances help in “making small data big” by linking together the content of many research efforts. The study was published in the open access journal Biodiversity Data Journal.

The ‘Big Data’ vision of science is transformed by computing resources to capture, manage, and interrogate the deluge of information coming from new technologies, infrastructural projects to digitise physical resources (such as our literature from the Biodiversity Heritage Library), or digital versions of specimens and records about specimens by museums.

Increased bandwidth has made dialogue among distributed data centres feasible and this is how new insights into biology are arising. In the case of biodiversity sciences, data centres range in size from the large GenBank for molecular records and the Global Biodiversity Information Facility for records of occurrences of species, to a long tail of tens of thousands of smaller datasets and web-sites which carry information compiled by individuals, research projects, funding agencies, local, state, national and international governmental agencies.

The large biological repositories do not yet approach the scale of astronomy and nuclear physics, but the very large number of sources in the long tail of useful resources does present biodiversity informaticians with a major challenge – how to discover, index, organize and interconnect the information contained in a very large number of locations.

In this regard, biology is fortunate that, from the middle of the 18th century, the community has accepted the use of Latin binomials such as Homo sapiens or Ba humbugi for species. All names are listed by taxonomists. Name recognition tools can call on large expert compilations of names (Catalogue of Life, ZooBank, Index Fungorum, Global Names Index) to find matches in sources of digital information. This allows for the rapid indexing of content.

Even when we do not know a name, we can ‘discover’ it because scientific names have certain distinctive characteristics: they are written in italics and most often consist of two successive words in a latinised form, with the first one capitalised. These properties allow names not yet present in compilations of names to be discovered in digital data sources.
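The discovery idea can be illustrated with a minimal Python sketch. It is not the Global Names software: the regular expression, the stop list and the function name `find_candidate_names` are assumptions made for this illustration, and italics are ignored because they are lost in plain text.

```python
import re

# Heuristic only: a capitalised word followed by a lower-case word.
# This mirrors the "two successive words, first one capitalised" rule;
# it does not check that the words are really latinised.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{2,})\b")

# Hypothetical stop list of ordinary words that often start a sentence.
STOP_WORDS = {"The", "This", "These", "In", "All", "Even", "When"}


def find_candidate_names(text):
    """Return candidate binomials found in plain text, in order of appearance."""
    candidates = []
    for genus, epithet in BINOMIAL.findall(text):
        if genus in STOP_WORDS:
            continue  # skip ordinary sentence openers such as "The names"
        candidates.append(f"{genus} {epithet}")
    return candidates


sample = ("The names Homo sapiens and Ba humbugi appear here. "
          "In this text they are plain.")
print(find_candidate_names(sample))
```

A pattern this loose will also pick up ordinary phrases that happen to start with a capitalised word, so the candidates would still need checking against a compilation of names or by a person.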

The idea of a names-based cyberinfrastructure is to use names to interconnect large and small sites of expert knowledge distributed across the Internet. This is the concept behind the Global Names project, which carried out the work described in this paper.

The effectiveness of such an infrastructure is compromised by the changes to names over time because of taxonomic and phylogenetic research. Names are often misspelled, or there might be errors in the way names are presented. Meanwhile, increasing numbers of species have no names, but are distinguished by their molecular characteristics.

In order to assess the challenge that these problems may present to the realization of a names-based cyberinfrastructure, we compared names from GenBank and DRYAD (a digital data repository) with names from Catalogue of Life to assess how well matched they are.

As a result, we found that fewer than 15% of the names in pair-wise comparisons of these data sources could be matched. However, with a names parser to break the scientific names into all of their component parts, those parts that present the greatest number of problems could be removed to produce a simplified or canonical version of the name. Thanks to such tools, name-matching was improved to almost 85%, and in some cases to 100%.
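To show why a canonical form improves matching, here is a simplified Python sketch. It is not the parser used in the study: the function names `canonical_form` and `match_rate` are hypothetical, and the assumption that authorship, year and parenthetical parts are the only components to drop is a deliberate simplification (authors with lower-case prefixes, hybrids and rank markers are not handled).

```python
import re


def canonical_form(name_string):
    """Reduce a name string to the genus plus its lower-case epithets."""
    # Drop parenthetical parts such as "(Linnaeus, 1758)" or a subgenus.
    without_parens = re.sub(r"\([^)]*\)", " ", name_string)
    tokens = without_parens.replace(",", " ").split()
    if not tokens:
        return ""
    genus = tokens[0].capitalize()
    epithets = []
    for token in tokens[1:]:
        # The first token that is not a plain lower-case word is assumed
        # to start the authorship or year, so parsing stops there.
        if not re.fullmatch(r"[a-z-]+", token):
            break
        epithets.append(token)
    return " ".join([genus] + epithets)


def match_rate(names_a, names_b, normalise=lambda name: name):
    """Fraction of names in names_a that also occur in names_b."""
    if not names_a:
        return 0.0
    reference = {normalise(name) for name in names_b}
    hits = sum(1 for name in names_a if normalise(name) in reference)
    return hits / len(names_a)


source_a = ["Homo sapiens Linnaeus, 1758", "Ba humbugi"]
source_b = ["Homo sapiens", "Ba humbugi Solem, 1983"]

print(match_rate(source_a, source_b))                  # exact string comparison
print(match_rate(source_a, source_b, canonical_form))  # canonical comparison
```

In this toy example none of the strings match exactly, while all of them match once both lists are reduced to their canonical forms.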

The study confirms the potential for the use of names to link distributed data and to make small data big. Nonetheless, it is clear that we need to continue to invest in more and better names-management software specially designed to address the problems in the biodiversity sciences.

###

Original source:

Patterson D, Mozzherin D, Shorthouse D, Thessen A (2016) Challenges with using names to link digital biodiversity information. Biodiversity Data Journal, doi: 10.3897/BDJ.4.e8080.

Additional information:

The study was supported by the National Science Foundation.

How to import occurrence records into manuscripts from GBIF, BOLD, iDigBio and PlutoF

On October 20, 2015, we published a blog post about the novel functionalities in ARPHA that allow streamlined import of specimen or occurrence records into taxonomic manuscripts.

Recently, this process was reflected in the “Tips and Tricks” section of the ARPHA authoring tool. Here, we’ll list the individual workflows:

Based on our earlier post, we will now go through our latest updates and highlight the new features that have been added since then.

Repositories and data indexing platforms, such as GBIF, BOLD systems, iDigBio, or PlutoF, hold, among other types of data, specimen or occurrence records. It is now possible to directly import specimen or occurrence records into ARPHA taxonomic manuscripts from these platforms [see Fig. 1]. We’ll refer to specimen or occurrence records as simply occurrence records for the rest of this post.

[Fig. 1] Workflow for directly importing occurrence records into a taxonomic manuscript.
Until now, when users of the ARPHA writing tool wanted to include occurrence records as materials in a manuscript, they would have had to format the occurrences as an Excel sheet that is uploaded to the Biodiversity Data Journal, or enter the data manually. While the “upload from Excel” approach significantly simplifies the process of importing materials, it still requires a transposition step – the data which is stored in a database needs to be reformatted to the specific Excel format. With the introduction of the new import feature, occurrence data that is stored at GBIF, BOLD systems, iDigBio, or PlutoF, can be directly inserted into the manuscript by simply entering a relevant record identifier.

The functionality shows up when one creates a new “Taxon treatment” in a taxonomic manuscript in the ARPHA Writing Tool. To import records, the author needs to:

  1. Locate an occurrence record or records in one of the supported data portals;
  2. Note the ID(s) of the records that ought to be imported into the manuscript (see Tips and Tricks for screenshots);
  3. Enter the ID(s) of the occurrence record(s) in a form that is to be seen in the “Materials” section of the species treatment;
  4. Select a particular database from a list, and then simply click ‘Add’ to import the occurrence directly into the manuscript.

In the case of BOLD Systems, the author may also select a given Barcode Index Number (BIN; for a treatment of BINs, read below), which then pulls all occurrences in the corresponding BIN.

We will illustrate this workflow by creating a fictitious treatment of the red moss, Sphagnum capillifolium, in a test manuscript. We have started a taxonomic manuscript in ARPHA and know that the occurrence records belonging to S. capillifolium can be found on iDigBio. What we need to do is to locate the ID of the occurrence record in the iDigBio webpage. In the case of iDigBio, the ARPHA system supports import via a Universally Unique Identifier (UUID). We have already created a treatment for S. capillifolium and clicked on the pencil to edit materials [Fig. 2].

[Fig. 2] Edit materials
In this example, type or paste the UUID (b9ff7774-4a5d-47af-a2ea-bdf3ecc78885), select the iDigBio source and click ‘Add’. This will pull the occurrence record for S. capillifolium from iDigBio and insert it as a material in the current paper [Fig. 3].
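Before pasting an identifier into the form, it can be worth checking that it is a well-formed UUID. The short Python sketch below uses only the standard `uuid` module; the function name `is_valid_uuid` is hypothetical, and how ARPHA itself validates identifiers is not described in this post.

```python
import uuid


def is_valid_uuid(identifier):
    """Return True if the text is a UUID in the usual hyphenated form."""
    cleaned = identifier.strip()
    try:
        parsed = uuid.UUID(cleaned)
    except ValueError:
        return False
    # uuid.UUID also accepts forms without hyphens or with braces, so the
    # input is compared with the standard hyphenated form.
    return str(parsed) == cleaned.lower()


print(is_valid_uuid("b9ff7774-4a5d-47af-a2ea-bdf3ecc78885"))
print(is_valid_uuid("b9ff7774-4a5d-47af"))
```

The first call uses the UUID quoted above; the second uses a truncated copy, which is the kind of copy-and-paste slip such a check is meant to catch.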

[Fig. 3] Materials after they have been imported
This workflow can be used for a number of purposes. An interesting future application is the rapid re-description of species, but even more exciting is the description of new species from BINs. BINs (Barcode Index Numbers) delimit Operational Taxonomic Units (OTUs), created algorithmically at BOLD Systems. If a taxonomist decides that an OTU is indeed a new species, then he/she can import all the type information associated with that OTU for the purposes of describing it as a new species.

Because they do not have to retype or copy and paste species occurrence records, authors save a lot of effort. Moreover, the records are automatically imported in a structured Darwin Core format, which anyone who needs the data for reuse can easily download from the article text as structured data.
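To give a feel for what “structured Darwin Core format” means for reuse, here is a small Python sketch that writes a record keyed by Darwin Core terms to CSV and reads it back. The field names (`occurrenceID`, `scientificName`, `basisOfRecord`, `country`) are Darwin Core terms, but the values marked as placeholders and the helper names are assumptions for this illustration; this is not the export format ARPHA produces.

```python
import csv
import io


def records_to_csv(records):
    """Serialise a list of dicts keyed by Darwin Core terms to CSV text."""
    # Collect column names in the order they are first seen.
    fieldnames = []
    for record in records:
        for term in record:
            if term not in fieldnames:
                fieldnames.append(term)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_records(text):
    """Read CSV text back into a list of dicts, one per occurrence record."""
    return list(csv.DictReader(io.StringIO(text)))


example_record = {
    "occurrenceID": "example-occurrence-1",  # placeholder, not a real ID
    "scientificName": "Sphagnum capillifolium",
    "basisOfRecord": "PreservedSpecimen",
    "country": "Example country",  # placeholder value
}

csv_text = records_to_csv([example_record])
print(csv_text)
print(csv_to_records(csv_text)[0]["scientificName"])
```

Because every value sits under a named term, anyone reusing the data can load it straight into a spreadsheet or script without re-parsing free text.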

Another important aspect of the workflow is that it will serve as a platform for peer-review, publication and curation of raw data, that is, of unpublished individual data records coming from collections or observations stored at GBIF, BOLD, iDigBio and PlutoF. Taxonomists are used to publishing only records of specimens they or their co-authors have personally studied. In a sense, the workflow will serve as a “cleaning filter” for portions of data that are passed through the publishing process. Thereafter, the published records can be used to curate raw data at collections, e.g. to correct identifications, assign newly described species names to specimens belonging to the respective BIN and so on.

 

Additional Information:

The work has been partially supported by the EC-FP7 EU BON project (ENV 308454, Building the European Biodiversity Observation Network) and the ITN Horizon 2020 project BIG4 (Biosystematics, informatics and genomics of the big 4 insect groups: training tomorrow’s researchers and entrepreneurs), under Marie Skłodowska-Curie grant agreement No. 642241.

From Sherborn to ZooBank: Moving to the interconnected digital nomenclature of the future

From the outside, it can seem that taxonomy has a commitment issue with scientific names. They shift for reasons that seem obscure and unnecessarily wonkish to people who simply want to use names to refer to a consistent, knowable taxon such as species, genus or family. However, the relationship between nomenclature and taxonomy, as two quite separate but mutually dependent systems, is a sophisticated way of balancing what we know and what is open to further interpretation.

Nomenclature is a bureaucracy that follows rules and is tied to published records and type specimens. It provides a rigid framework or skeleton for knowledge. Taxonomy, on the other hand, is a data-driven science, influenced by interpretation and resulting in concepts that are open to further test and change. To actually get the answers right, taxonomy needs to be responsive and fluid as a system of knowledge. The link between nomenclature and the published record is also the junction with the data that fuels taxonomic interpretation.

Biodiversity informatics aims to solve this issue, and its founding father is Charles Davies Sherborn. His magnum opus, Index Animalium, provided the bibliographic foundation for current zoological nomenclature. In the 43 years he spent working on this extraordinary resource, he anchored our understanding of animal diversity through the published scientific record. No work has equaled it, and it is still in current and critical use.

This special volume of the open-access journal ZooKeys celebrates Sherborn, his contributions, context and the future for the discipline of biodiversity informatics. The papers in this volume fall into three general areas of history, current practice and frontiers.

The first section presents facets of Sherborn as a man, scientist and bibliographer, and describes the historical context for taxonomic indexing from the 19th century to today. The second section discusses existing tools and innovations for bringing legacy biodiversity information into the modern age. The final section tackles the future of biological nomenclature, including digital access, innovative publishing models and the changing tools and sociology needed for communicating taxonomy.

In the late 1880s Charles Davies Sherborn recognised the need for a full index of names to the original sources that gave them legitimacy, their first publications. He set about making a complete index for names of animals, which are the largest group of described organisms (1.4 million of the current 1.8 million described species are animals). Because this work began while the very basics of nomenclatural rules were being thrashed out, the work itself affected how those rules were codified.

Sherborn’s monumental work, Index Animalium, comprises more than 9,000 pages in 11 volumes and about 440,000 names. This was on the scale of other hugely ambitious tasks at the time that changed the course of communication, such as the Oxford English Dictionary. The error rates are astonishingly low, and it became, and remains to this day, the most complete reference source for animal nomenclature. Taxonomic studies rely on Sherborn’s work today. While the future for information access is one of the most exciting frontiers for our increasingly interconnected, accelerated society, biodiversity information will continue to be grounded in this seminal work. The future for biodiversity informatics is built on Sherborn’s work, and is expanding to be digital, diversified and accessible.

The publisher of this volume, the journal ZooKeys, is itself a pioneer in developing a more stable and accessible scientific nomenclature. Together with PhytoKeys, ZooKeys is piloting an innovative workflow with a pre-publication automated pipeline for registration of nomenclatural acts. This initiative comes from the EU FP7 project pro-iBiosphere, and is carried out in close collaboration with ZooBank (the official online registry for scientific names of animals), Zoological Record, IPNI, MycoBank and Index Fungorum, and the Global Names project. The volume was inspired by a symposium held in Sherborn’s honour at the Natural History Museum (NHM), London, on the 150th anniversary of his birth in 2011, organised by the International Commission on Zoological Nomenclature (ICZN), in collaboration with the Society for the History of Natural History (SHNH).

Sherborn was a man with a vision for the future and respect for the accomplishments of the past. He would have celebrated the new tools for the ambitious goal of linking all biological information through names that are readable for both machines and humans. He would have understood the tremendous power of interconnected names for biodiversity science overall. And he would have knuckled down and got to work to make it happen.

###

Original source:

Michel E (2016) Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. In: Michel E (Ed.) Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. ZooKeys 550: 1-11. doi: 10.3897/zookeys.550.7460

 

###

Anchoring Biodiversity Information – Sherborn Special Issue is available to read and order from here.