The Biodiversity Data Journal launches its own data portal on GBIF

Designed to lower technical demands, this simple website lets data managers and other stakeholders focus on data exploration and reuse.

The Biodiversity Data Journal (BDJ) became the second open-access peer-reviewed scholarly title to make use of the hosted portals service provided by the Global Biodiversity Information Facility (GBIF): an international network and data infrastructure aimed at providing anyone, anywhere, open access to data about all types of life on Earth. 

The Biodiversity Data Journal portal, hosted on the GBIF platform, is designed to support biodiversity data use and engagement at national, institutional, regional and thematic scales by facilitating access to and reuse of data by users with varying expertise in data use and management.

Having piloted the GBIF hosted portal solution with arguably the most revolutionary biodiversity journal in its exclusively open-access scholarly portfolio, Pensoft will soon replicate the effort with at least 35 other journals in the field. In doing so, the publisher will more than double the number of currently existing GBIF-hosted portals.

As of the time of writing, the BDJ portal provides seamless access to nearly 300,000 occurrences of biological organisms from all over the world, extracted from the journal’s publications to date. In addition, the portal provides direct access to more than 800 datasets published alongside papers in BDJ, as well as over 1,000 citations of the journal articles associated with those datasets.


Using the search categories featured in the portal, users can narrow their query by geography, location, taxon, IUCN Global Red List Category, geological context and many other criteria. The dashboard also lets users access multiple statistics about the data, and even explore potentially related records with the help of the clustering feature (e.g. a specimen sequenced by another institution or type material deposited at different institutions). Additionally, the BDJ portal provides basic information about the journal itself and links to the news section on its website.

A video displaying an interactive map with occurrence data on the BDJ portal.

Launched in 2013 with the aim to bring together openly available data and narrative into a peer-reviewed scholarly paper, the Biodiversity Data Journal has remained at the forefront of scholarly publishing in the field of biodiversity research. Over the years, it has been amongst the first to adopt many novelties developed by Pensoft, including the entirely XML-based ARPHA Writing Tool (AWT) that has underpinned the journal’s submission and review process for several years now. Besides the convenience of an entirely online authoring environment, AWT provides multiple integrations with key databases, such as GBIF and BOLD, to allow direct export and import at the authoring stage, thereby further facilitating the publication and dissemination of biodiversity data. More recently, BDJ also piloted the “Nanopublications for Biodiversity” workflow and format as a novel solution to future-proof biodiversity knowledge by sharing “pixels” of machine-actionable scientific statements.   

“I am thrilled to see the Biodiversity Data Journal’s (BDJ) hosted portal active, ten years since it became the first journal to submit taxon treatments and Darwin Core occurrence records automatically to GBIF! Since its launch in 2013, BDJ has been unrivalled amongst taxonomy and biodiversity journals in its unique workflows that provide authors with import and export functions for structured biodiversity data to/from GBIF, BOLD, iDigBio and more. I am also glad to announce that more than 30 Pensoft biodiversity journals will soon be present as separate hosted portals on GBIF thanks to our long-time collaboration with Plazi, ensuring proper publication, dissemination and re-use of FAIR biodiversity data,” said Prof. Dr. Lyubomir Penev, founder and CEO of Pensoft, and founding editor of BDJ.

“The release of the BDJ portal and subsequent ones planned for other Pensoft journals should inspire other publishers to follow suit in advancing a more interconnected, open and accessible ecosystem for biodiversity research,” said Vince Smith, editor-in-chief of BDJ and head of digital, data and informatics at the Natural History Museum, London.

Data checking for biodiversity collections and other biodiversity data compilers from Pensoft

Guest blog post by Dr Robert Mesibov

Proofreading the text of scientific papers isn’t hard, although it can be tedious. Are all the words spelled correctly? Is all the punctuation correct and in the right place? Is the writing clear and concise, with correct grammar? Are all the cited references listed in the References section, and vice-versa? Are the figure and table citations correct?

Proofreading of text is usually done first by the reviewers, and then finished by the editors and copy editors employed by scientific publishers. A similar kind of proofreading is also done with the small tables of data found in scientific papers, mainly by reviewers familiar with the management and analysis of the data concerned.

But what about proofreading the big volumes of data that are common in biodiversity informatics? Tables with tens or hundreds of thousands of rows and dozens of columns? Who does the proofreading?

Sadly, the answer is usually “No one”. Proofreading large amounts of data isn’t easy and requires special skills and digital tools. The people who compile biodiversity data often lack the skills, the software or the time to properly check what they’ve compiled.

The result is that a great deal of the data made available through biodiversity projects like GBIF is — to be charitable — “messy”. Biodiversity data often needs a lot of patient cleaning by end-users before it’s ready for analysis. To assist end-users, GBIF and other aggregators attach “flags” to each record in the database where an automated check has found a problem. These checks find the most obvious problems amongst the many possible data compilation errors. End-users often have much more work to do after the flags have been dealt with.
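As an illustration of how an end-user might work with the flags described above, the sketch below tallies issue flags in an occurrence file. It assumes a tab-delimited download with an `issue` column containing semicolon-separated flag names, as in GBIF's simple download format; the delimiter and column name would need adjusting for other sources.

```python
import csv
from collections import Counter

def tally_issue_flags(path):
    """Count issue flags across records in an occurrence download.

    Assumes a tab-delimited file with an 'issue' column whose values
    are semicolon-separated flag names (as in GBIF's simple download
    format); adjust the delimiter and column name for other sources.
    """
    counts = Counter()
    with open(path, encoding="utf-8", newline="") as f:
        for rec in csv.DictReader(f, delimiter="\t"):
            # an empty 'issue' cell means no problems were flagged
            for flag in filter(None, (rec.get("issue") or "").split(";")):
                counts[flag] += 1
    return counts
```

A tally like this helps an end-user decide which flags are frequent enough in a download to need systematic cleaning before analysis.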

In 2017, Pensoft employed a data specialist to proofread the online datasets that are referenced in manuscripts submitted to Pensoft’s journals as data papers. The results of the data-checking are sent to the data paper’s authors, who then edit the datasets. This process has substantially improved many datasets (including those already made available through GBIF) and made them more suitable for digital re-use. At blog publication time, more than 200 datasets have been checked in this way.

Note that a Pensoft data audit does not check the accuracy of the data, for example, whether the authority for a species name is correct, or whether the latitude/longitude for a collecting locality agrees with the verbal description of that locality. For a more or less complete list of what does get checked, see the Data checklist at the bottom of this blog post. These checks are aimed at ensuring that datasets are correctly organised, consistently formatted and easy to move from one digital application to another. The next reader of a digital dataset is likely to be a computer program, not a human. It is essential that the data are structured and formatted so that they can be easily processed by that program and by other programs in the pipeline between the data compiler and the next human user of the data.

Pensoft’s data-checking workflow was previously offered only to authors of data paper manuscripts. It is now available to data compilers generally, with three levels of service:

  • Basic: the compiler gets a detailed report on what needs fixing
  • Standard: minor problems are fixed in the dataset and reported
  • Premium: all detected problems are fixed in collaboration with the data compiler and a report is provided

Because datasets vary so much in size and content, it is not possible to set a price in advance for basic, standard and premium data-checking. To get a quote for a dataset, send an email with a small sample of the data to publishing@pensoft.net.


Data checklist

Minor problems:

  • dataset not UTF-8 encoded
  • blank or broken records
  • characters other than letters, numbers, punctuation and plain whitespace
  • more than one version of the same character (the simplest or most correct one is retained)
  • unnecessary whitespace
  • Windows carriage returns (retained if required)
  • encoding errors (e.g. “Dum?ril” instead of “Duméril”)
  • missing data with a variety of representations (blank, “-”, “NA”, “?”, etc.)
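Several of the minor problems above lend themselves to automated scanning. The sketch below is a minimal illustration, not Pensoft's actual tooling: it opens a CSV as UTF-8 (a decoding error signals a non-UTF-8 file), then flags blank records, unnecessary whitespace, and a few illustrative missing-data tokens.

```python
import csv
import re

# Tokens often used to mean "no data"; this list is illustrative only
MISSING_TOKENS = {"", "-", "na", "n/a", "?", "null"}

def check_minor_problems(path):
    """Scan a CSV for a few of the 'minor' issues listed above.

    Returns a dict mapping problem type to row numbers (for blank
    records) or (row, column) locations (for cell-level problems).
    Opening with encoding='utf-8' raises UnicodeDecodeError if the
    file is not UTF-8 encoded.
    """
    problems = {"blank_record": [], "whitespace": [], "missing": []}
    with open(path, encoding="utf-8", newline="") as f:
        for row_num, row in enumerate(csv.reader(f), start=1):
            if not any(cell.strip() for cell in row):
                problems["blank_record"].append(row_num)
                continue
            for col_num, cell in enumerate(row, start=1):
                # leading/trailing or doubled internal whitespace
                if cell != cell.strip() or re.search(r"\s{2,}", cell):
                    problems["whitespace"].append((row_num, col_num))
                if cell.strip().lower() in MISSING_TOKENS:
                    problems["missing"].append((row_num, col_num))
    return problems
```

A real audit covers far more cases (character repertoires, encoding errors, carriage returns), but the pattern is the same: walk every cell and report locations rather than silently fixing them.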

Major problems:

  • unintended shifts of data items between fields
  • incorrect or inconsistent formatting of data items (e.g. dates)
  • different representations of the same data item (pseudo-duplication)
  • for Darwin Core datasets, incorrect use of Darwin Core fields
  • data items that are invalid or inappropriate for a field
  • data items that should be split between fields
  • data items referring to unexplained entities (e.g. “habitat is type A”)
  • truncated data items
  • disagreements between fields within a record
  • missing, but expected, data items
  • incorrectly associated data items (e.g. two country codes for the same country)
  • duplicate records, or partial duplicate records where not needed
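Two of the major problems above, inconsistent date formatting and pseudo-duplication, can be surfaced with simple grouping. The sketch below is illustrative, not Pensoft's actual method: the date patterns cover only a few common layouts, and the pseudo-duplicate check uses only case- and whitespace-normalisation.

```python
import re
from collections import defaultdict

# Regexes for a few common date layouts; a real audit covers many more
DATE_FORMATS = {
    "ISO (YYYY-MM-DD)": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "slashed (D/M/YYYY)": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
    "verbose (D Month YYYY)": re.compile(r"^\d{1,2} [A-Za-z]+ \d{4}$"),
}

def date_format_report(values):
    """Group a column of date strings by apparent format.

    More than one non-empty group signals inconsistent formatting.
    """
    groups = defaultdict(list)
    for v in values:
        for name, rx in DATE_FORMATS.items():
            if rx.match(v):
                groups[name].append(v)
                break
        else:
            groups["unrecognised"].append(v)
    return dict(groups)

def pseudo_duplicates(values):
    """Find different spellings of what is probably the same item,
    by comparing case- and whitespace-normalised forms."""
    seen = defaultdict(set)
    for v in values:
        key = " ".join(v.split()).lower()
        seen[key].add(v)
    return {k: vs for k, vs in seen.items() if len(vs) > 1}
```

Reports like these locate the problems; deciding which variant is correct still needs the data compiler, which is why the checking workflow returns reports to authors rather than rewriting their data unasked.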

For details of the methods used, see the author’s online resources:

***

Find out more about Pensoft’s data audit workflow for data papers submitted to Pensoft journals on Pensoft’s blog.