Audit finds biodiversity data aggregators ‘lose and confuse’ data

In an effort to improve the quality of biodiversity records, the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF) use automated data processing to check individual data items. The records are provided to the ALA and GBIF by museums, herbaria and other biodiversity data sources.

However, an independent analysis of such records reports that ALA and GBIF data processing also leads to data loss and unjustified changes in scientific names.

The study was carried out by Dr Robert Mesibov, an Australian millipede specialist who also works as a data auditor. Dr Mesibov checked around 800,000 records retrieved from the Australian Museum, Museums Victoria and the New Zealand Arthropod Collection. His results are published in the open access journal ZooKeys, and also archived in a public data repository.

“I was mainly interested in changes made by the aggregators to the genus and species names in the records,” said Dr Mesibov.

“I found that names in up to 1 in 5 records were changed, often because the aggregator couldn’t find the name in the look-up table it used.”

Another worrying result concerned type specimens – the reference specimens upon which scientific names are based. On a number of occasions, the aggregators were found to have replaced the name of a type specimen with a name tied to an entirely different type specimen.

The biggest surprise, according to Dr Mesibov, was the major disagreement on names between aggregators.

“There was very little agreement,” he explained. “One aggregator would change a name and the other wouldn’t, or would change it in a different way.”

Furthermore, dates, names and locality information were sometimes lost from records, mainly due to programming errors in the software used by aggregators to check data items. In some data fields the loss reached 100%, with no original data items surviving the processing.

“The lesson from this audit is that biodiversity data aggregation isn’t harmless,” said Dr Mesibov. “It can lose and confuse perfectly good data.”

“Users of aggregated data should always download both original and processed data items, and should check for data loss or modification, and for replacement of names,” he concluded.
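The check Dr Mesibov recommends can be sketched in a few lines. A minimal Python sketch, assuming the original and processed records have already been parsed into lists of dictionaries keyed by Darwin Core fields (the field values and record IDs below are illustrative, not from the audit itself):

```python
def changed_names(original_rows, processed_rows,
                  key="occurrenceID", field="scientificName"):
    """Return (id, original, processed) for records whose name was changed."""
    processed = {r[key]: r for r in processed_rows}
    changes = []
    for r in original_rows:
        p = processed.get(r[key])
        if p is None:
            # Record vanished during processing: a form of data loss.
            changes.append((r[key], r[field], None))
        elif p[field] != r[field]:
            changes.append((r[key], r[field], p[field]))
    return changes

original = [{"occurrenceID": "1", "scientificName": "Ba humbugi"}]
processed = [{"occurrenceID": "1", "scientificName": "Ba humbug"}]
changed_names(original, processed)  # → [("1", "Ba humbugi", "Ba humbug")]
```

Running the same comparison over every field, not just names, would also surface the kinds of data loss the audit reported.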

###

Original source:

Mesibov R (2018) An audit of some filtering effects in aggregated occurrence records. ZooKeys 751: 129-146. https://doi.org/10.3897/zookeys.751.24791

Dispatch from the field II: Students describe an elusive spider while stationed in Borneo

A mystery has long shrouded the orb-weaving spider genus Opadometa, where males and females belonging to one and the same species look nothing alike. Furthermore, the males appear to be so elusive that scientists still doubt whether both sexes are correctly linked to each other even in the best-known species.

Such is the case for Opadometa sarawakensis – a species known only from female specimens. While remarkable for their striking red and blue colors and large size, the females give no hint of the likely appearance of the male Opadometa sarawakensis.

The red and blue female Opadometa sarawakensis

Nevertheless, students taking part in a recent two-week tropical ecology field course organized by the Naturalis Biodiversity Center and Leiden University, and hosted by the Danau Girang Field Centre (DGFC) on the island of Borneo, Malaysia, found a mature male spider hanging on the web of a red and blue female, later identified as Opadometa sarawakensis. Still quite striking, the male was colored in a blend of orange, gray, black, and silver.

On the brink of a long-awaited discovery and eager to describe the male, the students, their lecturers and the field station's scientific staff faced a problem: with difficult species like this orb weaver, they needed strong evidence to prove that the male matched the female from the web. Moreover, molecular DNA-based analysis was not an option at the time, since the necessary equipment was not available at DGFC.

On the other hand, being at the center of the action turned out to have advantages no less persuasive than DNA evidence. Having conducted thorough field surveys in the area, the team concluded that the male's presence on that particular female's web, combined with the fact that no other Opadometa species were found in the area, was enough to establish that the two were indeed representatives of the same species.

Adapting to the quite basic conditions at the DGFC laboratory, the students and their mentors put to use various items they had on hand, including smartphones paired with headlights mounted on gooseneck clips in place of sophisticated cameras.

In the end, they gathered all the necessary data to prepare the formal description of the newly identified male.

Once they had the observations and the data, there was only one question left to answer: how could they submit a manuscript to a scholarly journal, so that their finding would be formally announced and recognised?


Thanks to the elaborate, highly automated workflow of the peer-reviewed open access Biodiversity Data Journal and its underlying ARPHA Writing Tool, the researchers managed to compile their manuscript, including all underlying data such as geolocations, and submit it from the field station. All in all, the authoring, peer review and publication – each step taking place within the ARPHA Platform's single environment – took less than a month to complete. In fact, the paper was published within a few days of submission.

This is the second publication in the series “Dispatch from the field”, resulting from an initiative led by spider taxonomist Dr Jeremy Miller. In 2014, another team of students and their mentors described a curious new species of one-millimetre-long spider from the Danau Girang Field Centre. Both papers showcase the feasibility of publishing and sharing biodiversity data that are easy to find, access and re-use.

“This has been a unique educational experience for the students,” says Jeremy. “They got to experience how tropical field biologists work, which is often from remote locations and without sophisticated equipment. This means that creativity and persistence are necessary to solve problems and complete a research objective. The fact that the students got to participate in advancing knowledge about this remarkable spider species by contributing to a manuscript was really exciting.”

###

Original source:

Miller J, Freund C, Rambonnet L, Koets L, Barth N, van der Linden C, Geml J, Schilthuizen M, Burger R, Goossens B (2018) Dispatch from the field II: the mystery of the red and blue Opadometa male (Araneae, Tetragnathidae, Opadometa sarawakensis). Biodiversity Data Journal 6: e24777. https://doi.org/10.3897/BDJ.6.e24777

Data Quality Checklist and Recommendations at Pensoft

As much as research data sharing and re-usability are staples of open science practice, shared data lose much of their value if their quality is compromised.

At a time when machine readability and the software that relies on it are becoming ever more crucial in science, and data are piling up by the minute, it is essential that researchers format, structure and deposit their data efficiently, so that they remain accessible and re-usable for their successors.

Errors, meaning data items that computer programs fail to read correctly, can easily creep into any dataset. They are as diverse as invalid characters, missing brackets, blank fields and incomplete geolocations.
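Checks of this kind are straightforward to automate. A minimal Python sketch, assuming occurrence data in a CSV with Darwin Core column names (the specific checks and the sample row are illustrative, not Pensoft's actual audit script):

```python
import csv
import io

def audit(csv_text):
    """Flag blank fields, control characters and half-filled coordinates."""
    problems = []
    # start=2 so reported line numbers count the header as line 1.
    for lineno, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        for field, value in row.items():
            if value is None or not value.strip():
                problems.append((lineno, field, "blank field"))
            elif any(ord(c) < 32 for c in value):
                problems.append((lineno, field, "invalid control character"))
        lat = (row.get("decimalLatitude") or "").strip()
        lon = (row.get("decimalLongitude") or "").strip()
        if bool(lat) != bool(lon):
            problems.append((lineno, "decimalLatitude/decimalLongitude",
                             "incomplete geolocation"))
    return problems

sample = "scientificName,decimalLatitude,decimalLongitude\nHomo sapiens,-41.4,\n"
audit(sample)  # flags the blank longitude and the incomplete geolocation
```

A real audit adds many more checks (date formats, truncated values, mismatched names), but the shape is the same: iterate over every field and report anything a downstream program would choke on.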

To summarise the lessons learnt from our extensive experience in biodiversity data audit at Pensoft, we have now included a Data Quality Checklist and Recommendations page in the About section of each of our data-publishing journals.

We are hopeful that these guidelines will help authors prepare and publish datasets of higher quality, so that their work can be fully utilised in subsequent research.

At the end of the day, proofreading your data is no different than running through your text looking for typos.

 

We would like to use the occasion to express our gratitude to Dr Robert Mesibov, who prepared the checklist and whose expertise in biodiversity data audit has contributed greatly to Pensoft through the years.

How the names of organisms help to turn ‘small data’ into ‘Big Data’

Innovation in ‘Big Data’ helps address problems that were previously overwhelming. What we know about organisms is spread across hundreds of millions of pages published over 250 years. New software tools from the Global Names project find scientific names, index digital documents quickly, and correct and update the names they contain. These advances help ‘make small data big’ by linking together the content of many research efforts. The study was published in the open access journal Biodiversity Data Journal.

The ‘Big Data’ vision of science relies on computing resources to capture, manage and interrogate the deluge of information coming from new technologies, from infrastructural projects to digitise physical resources (such as our literature, via the Biodiversity Heritage Library), and from digital versions of specimens and records about specimens held by museums.

Increased bandwidth has made dialogue among distributed data centres feasible and this is how new insights into biology are arising. In the case of biodiversity sciences, data centres range in size from the large GenBank for molecular records and the Global Biodiversity Information Facility for records of occurrences of species, to a long tail of tens of thousands of smaller datasets and web-sites which carry information compiled by individuals, research projects, funding agencies, local, state, national and international governmental agencies.

The large biological repositories do not yet approach the scale of astronomy and nuclear physics, but the very large number of sources in the long tail of useful resources do present biodiversity informaticians with a major challenge – how to discover, index, organize and interconnect the information contained in a very large number of locations.

In this regard, biology is fortunate that, from the middle of the 18th century, the community has accepted the use of latinised binomials such as Homo sapiens or Ba humbugi for species, and that taxonomists compile lists of all such names. Name recognition tools can call on large expert compilations of names (Catalogue of Life, ZooBank, Index Fungorum, Global Names Index) to find matches in sources of digital information. This allows for the rapid indexing of content.

Even when we do not know a name, we can ‘discover’ it, because scientific names have certain distinctive characteristics: they are written in italics and most often consist of two successive words in a latinised form, the first of them capitalised. These properties allow names not yet present in compilations to be discovered in digital data sources.
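As an illustration, a crude version of such a name-discovery heuristic can be written as a regular expression. This Python sketch is ours, not the Global Names implementation, and it deliberately over-matches to show why real tools also rely on dictionaries and context:

```python
import re

# A capitalised word followed by a lower-case latinised word, e.g.
# "Homo sapiens". Note that ordinary sentence openings like "We compared"
# also match, which is why this heuristic needs dictionary filtering.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{2,})\b")

def candidate_names(text):
    """List every candidate binomial found in a plain-text passage."""
    return [genus + " " + epithet for genus, epithet in BINOMIAL.findall(text)]

candidate_names("We compared Homo sapiens with the snail Ba humbugi.")
# → ['We compared', 'Homo sapiens', 'Ba humbugi']
```

The false positive ('We compared') is the point: pattern matching finds candidates, and compilations of known names and surrounding context then decide which candidates are real.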

The idea of a names-based cyberinfrastructure is to use the names themselves to interconnect large and small sites of expert knowledge distributed across the Internet. This is the concept behind the Global Names project, which carried out the work described in this paper.

The effectiveness of such an infrastructure is compromised by the changes to names over time because of taxonomic and phylogenetic research. Names are often misspelled, or there might be errors in the way names are presented. Meanwhile, increasing numbers of species have no names, but are distinguished by their molecular characteristics.

In order to assess the challenge that these problems may present to the realization of a names-based cyberinfrastructure, we compared names from GenBank and DRYAD (a digital data repository) with names from Catalogue of Life to assess how well matched they are.

We found that fewer than 15% of the names in pair-wise comparisons of these data sources could be matched. However, using a names parser to break the scientific names into their component parts, the parts that cause the greatest number of problems could be removed to produce a simplified, canonical version of each name. With such tools, name-matching improved to almost 85%, and in some cases to 100%.
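The canonicalisation step can be sketched as follows. This Python example is a simplified illustration of the idea, not the parser used in the study:

```python
import re

def canonical(name):
    """Reduce a name string to its canonical genus + epithet form."""
    # Drop parenthesised parts (subgenus, authorship in brackets).
    name = re.sub(r"\([^)]*\)", " ", name)
    words = name.replace(",", " ").split()
    # Genus: first capitalised alphabetic word; epithet: first all-lower-case
    # alphabetic word. Trailing authorship ("Linnaeus") and years are ignored.
    genus = next((w for w in words if w[0].isupper() and w.isalpha()), "")
    epithet = next((w for w in words if w.islower() and w.isalpha()), "")
    return (genus + " " + epithet).strip()

canonical("Homo sapiens Linnaeus, 1758")         # → "Homo sapiens"
canonical("Pardosa (Pardosa) amentata (Clerck)") # → "Pardosa amentata"
```

Both strings above would fail an exact-match comparison against a bare binomial, yet both reduce to the same canonical form, which is exactly how matching rates rose from under 15% to almost 85%.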

The study confirms the potential for the use of names to link distributed data and to make small data big. Nonetheless, it is clear that we need to continue to invest in more and better names-management software specially designed to address the problems of the biodiversity sciences.

###

Original source:

Patterson D, Mozzherin D, Shorthouse D, Thessen A (2016) Challenges with using names to link digital biodiversity information. Biodiversity Data Journal 4: e8080. https://doi.org/10.3897/BDJ.4.e8080

Additional information:

The study was supported by the National Science Foundation.

How to import occurrence records into manuscripts from GBIF, BOLD, iDigBio and PlutoF

On October 20, 2015, we published a blog post about the novel functionality in ARPHA that allows streamlined import of specimen or occurrence records into taxonomic manuscripts.

Recently, this process was documented in the “Tips and Tricks” section of the ARPHA authoring tool, where the individual workflows are listed.

Based on our earlier post, we will now go through our latest updates and highlight the new features that have been added since then.

Repositories and data indexing platforms, such as GBIF, BOLD systems, iDigBio, or PlutoF, hold, among other types of data, specimen or occurrence records. It is now possible to directly import specimen or occurrence records into ARPHA taxonomic manuscripts from these platforms [see Fig. 1]. We’ll refer to specimen or occurrence records as simply occurrence records for the rest of this post.

[Fig. 1] Workflow for directly importing occurrence records into a taxonomic manuscript.
Until now, when users of the ARPHA writing tool wanted to include occurrence records as materials in a manuscript, they would have had to format the occurrences as an Excel sheet that is uploaded to the Biodiversity Data Journal, or enter the data manually. While the “upload from Excel” approach significantly simplifies the process of importing materials, it still requires a transposition step – the data which is stored in a database needs to be reformatted to the specific Excel format. With the introduction of the new import feature, occurrence data that is stored at GBIF, BOLD systems, iDigBio, or PlutoF, can be directly inserted into the manuscript by simply entering a relevant record identifier.

The functionality shows up when one creates a new “Taxon treatment” in a taxonomic manuscript in the ARPHA Writing Tool. To import records, the author needs to:

  1. Locate an occurrence record or records in one of the supported data portals;
  2. Note the ID(s) of the records that ought to be imported into the manuscript (see Tips and Tricks for screenshots);
  3. Enter the ID(s) of the occurrence record(s) in a form that is to be seen in the “Materials” section of the species treatment;
  4. Select the particular database from the list, and simply click ‘Add’ to import the occurrence directly into the manuscript.

In the case of BOLD Systems, the author may also enter a Barcode Identification Number (BIN; more on BINs below), which then pulls in all occurrences in the corresponding BIN.

We will illustrate this workflow by creating a fictitious treatment of the red moss, Sphagnum capillifolium, in a test manuscript. We have started a taxonomic manuscript in ARPHA and know that occurrence records belonging to S. capillifolium can be found on iDigBio. What we need is the ID of the occurrence record from the iDigBio webpage; in the case of iDigBio, ARPHA supports import via a Universally Unique Identifier (UUID). We have already created a treatment for S. capillifolium and clicked on the pencil to edit materials [Fig. 2].

[Fig. 2] Edit materials
In this example, type or paste the UUID (b9ff7774-4a5d-47af-a2ea-bdf3ecc78885), select the iDigBio source and click ‘Add’. This will pull the occurrence record for S. capillifolium from iDigBio and insert it as a material in the current paper [Fig. 3].
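Behind the scenes, a record like this can also be retrieved directly from iDigBio's public search API. A minimal Python sketch, with the endpoint shape taken from iDigBio's API documentation (treat it as an assumption if the API has since changed; the UUID is the one from the example above):

```python
import json
import urllib.request

# Base URL of iDigBio's record "view" endpoint (assumed from its API docs).
IDIGBIO_VIEW = "https://search.idigbio.org/v2/view/records/"

def record_url(uuid):
    """Build the view URL for a single iDigBio occurrence record."""
    return IDIGBIO_VIEW + uuid

def fetch_record(uuid):
    """Download and parse one record as JSON (requires network access)."""
    with urllib.request.urlopen(record_url(uuid)) as resp:
        return json.load(resp)

record_url("b9ff7774-4a5d-47af-a2ea-bdf3ecc78885")
```

ARPHA presumably performs an equivalent lookup when you click ‘Add’: resolve the identifier against the selected portal, then map the returned fields into the Materials section.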

[Fig. 3] Materials after they have been imported
This workflow can be used for a number of purposes. An interesting future application is the rapid re-description of species, but even more exciting is the description of new species from BINs. BINs (Barcode Identification Numbers) delimit Operational Taxonomic Units (OTUs), created algorithmically at BOLD Systems. If a taxonomist decides that an OTU is indeed a new species, they can import all the type information associated with that OTU for the purposes of describing it as a new species.

By not having to retype or copy/paste species occurrence records, authors save a great deal of effort. Moreover, the records are imported in a structured Darwin Core format, which can easily be extracted from the article text as structured data by anyone who needs it for reuse.

Another important aspect of the workflow is that it will serve as a platform for the peer review, publication and curation of raw data, that is, of unpublished individual data records coming from collections or observations stored at GBIF, BOLD, iDigBio and PlutoF. Taxonomists are used to publishing only records of specimens they or their co-authors have personally studied. In a sense, the workflow will serve as a “cleaning filter” for the portions of data that pass through the publishing process. Thereafter, the published records can be used to curate the raw data in collections, e.g. by correcting identifications, assigning newly described species names to specimens belonging to the respective BINs, and so on.

 

Additional Information:

The work has been partially supported by the EC-FP7 EU BON project (ENV 308454, Building the European Biodiversity Observation Network) and the Horizon 2020 ITN project BIG4 (Biosystematics, informatics and genomics of the big 4 insect groups: training tomorrow’s researchers and entrepreneurs), under Marie Skłodowska-Curie grant agreement No. 642241.