Call for data papers describing datasets from Russia to be published in Biodiversity Data Journal

GBIF partners with FinBIF and Pensoft to support publication of new datasets about biodiversity from across Russia

Original post via GBIF

In collaboration with the Finnish Biodiversity Information Facility (FinBIF) and Pensoft Publishers, GBIF has announced a new call for authors to submit and publish data papers on Russia in a special collection of Biodiversity Data Journal (BDJ). The call extends and expands upon a successful effort in 2020 to mobilize data from European Russia.

Between now and 15 September 2021, the article processing fee (normally €550) will be waived for the first 36 papers, provided that the publications are accepted and meet the following criteria: the data paper describes a dataset that a) contains records new to GBIF.org, b) includes high-quality data and metadata, and c) has geographic coverage in Russia (see the definition of terms below).

The manuscript must be prepared in English and submitted in accordance with BDJ's instructions to authors by 15 September 2021. Late submissions will not be eligible for APC waivers.

Sponsorship is limited to the first 36 accepted submissions meeting these criteria on a first-come, first-served basis. The call for submissions may therefore close prior to the stated deadline of 15 September 2021. Authors may contribute to more than one manuscript, but artificial division of logically uniform data and data stories, or "salami publishing", is not allowed.

BDJ will publish a special issue including the selected papers by the end of 2021. The journal is indexed by Web of Science (Impact Factor 1.331), Scopus (CiteScore: 2.1) and listed in РИНЦ / eLibrary.ru.

Non-native speakers should have their English checked either by native speakers or by professional English-language editors prior to submission. You may credit these individuals as a "Contributor" through the AWT interface. Contributors are not listed as co-authors but can help you improve your manuscript.

In addition to the BDJ instructions to authors, the data paper must a) cite the dataset's DOI, b) include the dataset in the paper's list of references, and c) describe a dataset whose metadata contains "Russia 2021" in Project Data: Title and "N-Eurasia-Russia2021" in Project Data: Identifier.
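
Authors can verify these metadata fields themselves before submission. Below is a minimal, unofficial sketch in Python that scans an IPT-exported eml.xml for the required strings; the file name and the assumption that the project title and identifier appear as element text are illustrative, not a prescribed workflow.

import xml.etree.ElementTree as ET

REQUIRED_TITLE = "Russia 2021"
REQUIRED_IDENTIFIER = "N-Eurasia-Russia2021"

def check_project_metadata(eml_path="eml.xml"):
    # Collect the local tag name and text of every element, so the check does not
    # depend on exact EML paths or namespaces.
    elements = [(el.tag.split("}")[-1], (el.text or "").strip())
                for el in ET.parse(eml_path).iter()]
    has_title = any(tag == "title" and text == REQUIRED_TITLE for tag, text in elements)
    has_identifier = any(REQUIRED_IDENTIFIER in text for _, text in elements)
    return has_title, has_identifier

title_ok, identifier_ok = check_project_metadata()
print("Project Data: Title contains 'Russia 2021':", title_ok)
print("Project Data: Identifier contains 'N-Eurasia-Russia2021':", identifier_ok)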

Authors should explore the GBIF.org section on data papers and Strategies and guidelines for scholarly publishing of biodiversity data. Manuscripts and datasets will go through a standard peer-review process. When submitting a manuscript to BDJ, authors are requested to select the Biota of Russia collection.

To see an example, view this dataset on GBIF.org and the corresponding data paper published by BDJ.

Questions may be directed either to Dmitry Schigel, GBIF scientific officer, or Yasen Mutafchiev, managing editor of Biodiversity Data Journal.

The 2021 extension of the collection of data papers will be edited by Vladimir Blagoderov, Pedro Cardoso, Ivan Chadin, Nina Filippova, Alexander Sennikov, Alexey Seregin, and Dmitry Schigel.

This project is a continuation of the successful call for data papers from European Russia in 2020. The funded papers are available in the Biota of Russia special collection and the datasets are shown on the project page.

***

Definition of terms

Datasets with more than 5,000 records that are new to GBIF.org

Datasets should contain a minimum of 5,000 records that are new to GBIF.org. While the focus is on additional records for the region, records already published in GBIF may meet the criterion of 'new' if they are substantially improved, particularly through the addition of georeferenced locations. Artificial reduction of records from otherwise uniform datasets to the necessary minimum ("salami publishing") is discouraged and may result in rejection of the manuscript. New submissions describing updates of datasets already presented in earlier published data papers will not be sponsored.

Justification for publishing datasets with fewer records (e.g. sampling-event datasets, sequence-based data, checklists with endemics etc.) will be considered on a case-by-case basis.

Datasets with high-quality data and metadata

Authors should start by publishing a dataset comprised of data and metadata that meet GBIF's stated data quality requirements. This effort will involve work in an installation of the GBIF Integrated Publishing Toolkit (IPT).

Only when the dataset is prepared should authors turn to working on the manuscript text. The extended metadata entered in the IPT while describing the dataset can be converted into a manuscript with a single click of a button in the ARPHA Writing Tool (see also Creation and Publication of Data Papers from Ecological Metadata Language (EML) Metadata). Authors can then complete, edit and submit the manuscript to BDJ for review.

Datasets with geographic coverage in Russia

In line with the funding priorities of this programme, at least 80% of the records in a dataset should have coordinates that fall within the priority area of Russia. However, authors of the paper may be affiliated with institutions anywhere in the world.
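
As a rough self-check before submission, authors could estimate this share from a Darwin Core occurrence text file. The sketch below is illustrative only: it assumes a tab-separated occurrence.txt with decimalLatitude/decimalLongitude columns and uses a crude bounding box rather than Russia's actual borders, so it cannot replace the organisers' assessment.

import csv

def within_rough_russia_bbox(lat, lon):
    # Very coarse approximation; Russia crosses the antimeridian at Chukotka.
    return 41.0 <= lat <= 82.0 and (19.0 <= lon <= 180.0 or -180.0 <= lon <= -168.0)

def coordinate_coverage(path="occurrence.txt"):
    total = georeferenced = in_bbox = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            total += 1
            try:
                lat = float(row.get("decimalLatitude") or "")
                lon = float(row.get("decimalLongitude") or "")
            except (TypeError, ValueError):
                continue
            georeferenced += 1
            if within_rough_russia_bbox(lat, lon):
                in_bbox += 1
    share = in_bbox / total if total else 0.0
    print(f"{total} records, {georeferenced} georeferenced, "
          f"{share:.0%} inside the rough bounding box for Russia")
    print("Meets the 5,000-record and 80% guidance:", total >= 5000 and share >= 0.8)

coordinate_coverage()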

***

Check out the Biota of Russia dynamic data paper collection so far.

Follow Biodiversity Data Journal on Twitter and Facebook to keep yourself posted about the new research published.

Pensoft Annotator – a tool for text annotation with ontologies

By Mariya Dimitrova, Georgi Zhelezov, Teodor Georgiev and Lyubomir Penev

The use of written language to record new knowledge is one of the advancements of civilisation that has helped us achieve progress. However, in the era of Big Data, the amount of published writing greatly exceeds the physical ability of humans to read and understand all written information. 

More than ever, we need computers to help us process and manage written knowledge. Unlike humans, computers are “naturally fluent” in many languages, such as the formats of the Semantic Web. These standards were developed by the World Wide Web Consortium (W3C) to enable computers to understand data published on the Internet. As a result, computers can index web content and gather data and metadata about web resources.

To help manage knowledge in different domains, humans have started to develop ontologies: shared conceptualisations of real-world objects, phenomena and abstract concepts, expressed in machine-readable formats. Such ontologies can provide computers with the necessary basic knowledge, or axioms, to help them understand the definitions and relations between resources on the Web. Ontologies outline data concepts, each with its own unique identifier, definition and human-legible label.

Matching data to its underlying ontological model is called ontology population and involves data handling and parsing that gives it additional context and semantics (meaning). Over the past couple of years, Pensoft has been working on an ontology population tool, the Pensoft Annotator, which matches free text to ontological terms.

The Pensoft Annotator is a web application that allows users to annotate text input with any of the available ontologies. Currently, these are the Environment Ontology (ENVO) and the Relation Ontology (RO), but we plan to upload many more. The Annotator can be run with multiple ontologies and returns a table of matched ontology term identifiers, their labels and the ontology from which they originate (Fig. 1). The results can also be downloaded as a Tab-Separated Values (TSV) file, and individual records can be removed from the table of results if desired. In addition, the Pensoft Annotator allows users to exclude certain words ("stopwords") from the free-text matching algorithm. There is a list of default stopwords common in English, such as prepositions and pronouns, but anyone can add new stopwords.

Figure 1. Interface of the Pensoft Annotator application

In Figure 1, we have annotated a sentence with the Pensoft Annotator, which yields a single matched term, labeled ‘host of’, from the Relation Ontology (RO). The ontology term identifier is linked to a webpage in Ontobee, which points to additional metadata about the ontology term (Fig. 2).

Figure 2. Web page about ontology term

Such annotation requests can be used for text analyses such as topic modelling, for example to discover texts that describe host-pathogen interactions. Topic modelling is used to build algorithms for content recommendation (recommender systems), which can be implemented in online news platforms, streaming services, shopping websites and others.

At Pensoft, we use the Pensoft Annotator to enrich biodiversity publications with semantics. We are currently annotating taxonomic treatments with a custom-made ontology based on the Relation Ontology (RO) to discover treatments potentially describing species interactions. You can read more about using the Annotator to detect biotic interactions in this abstract.

The Pensoft Annotator can also be used programmatically through an API, allowing you to integrate the Annotator into your own script. For more information about using the Pensoft Annotator, please check out the documentation.
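
To give an idea of what such an integration could look like, here is a minimal Python sketch. The endpoint URL, request payload and response fields below are placeholders, not the documented Pensoft Annotator API; consult the official documentation for the real interface.

import requests

# Placeholder URL -- not the documented endpoint; see the Annotator documentation.
ANNOTATOR_URL = "https://annotator.pensoft.example/annotate"

def annotate(text, ontologies=("ENVO", "RO")):
    response = requests.post(
        ANNOTATOR_URL,
        json={"text": text, "ontologies": list(ontologies)},  # assumed payload shape
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

result = annotate("Bats are hosts of many coronaviruses found in cave habitats.")
for match in result.get("matches", []):  # assumed response structure
    print(match.get("term_id"), match.get("label"), match.get("ontology"))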

Data checking for biodiversity collections and other biodiversity data compilers from Pensoft

Guest blog post by Dr Robert Mesibov

Proofreading the text of scientific papers isn’t hard, although it can be tedious. Are all the words spelled correctly? Is all the punctuation correct and in the right place? Is the writing clear and concise, with correct grammar? Are all the cited references listed in the References section, and vice-versa? Are the figure and table citations correct?

Proofreading of text is usually done first by the reviewers, and then finished by the editors and copy editors employed by scientific publishers. A similar kind of proofreading is also done with the small tables of data found in scientific papers, mainly by reviewers familiar with the management and analysis of the data concerned.

But what about proofreading the big volumes of data that are common in biodiversity informatics? Tables with tens or hundreds of thousands of rows and dozens of columns? Who does the proofreading?

Sadly, the answer is usually “No one”. Proofreading large amounts of data isn’t easy and requires special skills and digital tools. The people who compile biodiversity data often lack the skills, the software or the time to properly check what they’ve compiled.

The result is that a great deal of the data made available through biodiversity projects like GBIF is — to be charitable — “messy”. Biodiversity data often needs a lot of patient cleaning by end-users before it’s ready for analysis. To assist end-users, GBIF and other aggregators attach “flags” to each record in the database where an automated check has found a problem. These checks find the most obvious problems amongst the many possible data compilation errors. End-users often have much more work to do after the flags have been dealt with.
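
GBIF's public occurrence API exposes these flags in an "issues" field for each record, so end-users can get a quick overview of what needs attention before downloading a dataset. A minimal sketch (the datasetKey is a placeholder to replace with a real dataset UUID):

from collections import Counter
import requests

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"

def issue_summary(dataset_key, max_records=300):
    counts, offset = Counter(), 0
    while offset < max_records:
        page = requests.get(
            GBIF_SEARCH,
            params={"datasetKey": dataset_key, "limit": 100, "offset": offset},
            timeout=30,
        ).json()
        for record in page.get("results", []):
            counts.update(record.get("issues", []))
        if page.get("endOfRecords", True):
            break
        offset += 100
    return counts

# Placeholder UUID -- replace with the datasetKey of a real GBIF dataset.
for issue, n in issue_summary("00000000-0000-0000-0000-000000000000").most_common():
    print(f"{issue}: {n}")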

In 2017, Pensoft employed a data specialist to proofread the online datasets that are referenced in manuscripts submitted to Pensoft’s journals as data papers. The results of the data-checking are sent to the data paper’s authors, who then edit the datasets. This process has substantially improved many datasets (including those already made available through GBIF) and made them more suitable for digital re-use. At blog publication time, more than 200 datasets have been checked in this way.

Note that a Pensoft data audit does not check the accuracy of the data, for example, whether the authority for a species name is correct, or whether the latitude/longitude for a collecting locality agrees with the verbal description of that locality. For a more or less complete list of what does get checked, see the Data checklist at the bottom of this blog post. These checks are aimed at ensuring that datasets are correctly organised, consistently formatted and easy to move from one digital application to another. The next reader of a digital dataset is likely to be a computer program, not a human. It is essential that the data are structured and formatted, so that they are easily processed by that program and by other programs in the pipeline between the data compiler and the next human user of the data.
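
To give a flavour of what automated proofreading involves, here is a minimal, illustrative Python sketch of a few of the simpler checks from the Data checklist below (UTF-8 encoding, stray whitespace, missing-data placeholders, duplicate records). It is not Pensoft's audit tooling, and the tab-separated file name is an assumption.

import csv
from collections import Counter

MISSING_VARIANTS = {"-", "NA", "N/A", "?", "null", "None"}

def audit(path="occurrence.txt"):
    problems = Counter()
    seen_rows = set()
    # errors="strict" makes Python raise an error if the file is not valid UTF-8.
    with open(path, newline="", encoding="utf-8", errors="strict") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            values = tuple((v or "") if not isinstance(v, list) else "|".join(v)
                           for v in row.values())
            if values in seen_rows:
                problems["duplicate record"] += 1
            seen_rows.add(values)
            for field, value in row.items():
                if field is None or value is None:
                    problems["row with too many or too few fields"] += 1
                    continue
                if value != value.strip() or "  " in value:
                    problems[f"unnecessary whitespace in '{field}'"] += 1
                if value.strip() in MISSING_VARIANTS:
                    problems[f"missing-data placeholder in '{field}'"] += 1
    for problem, count in problems.most_common():
        print(f"{count:6d}  {problem}")

audit()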

Pensoft’s data-checking workflow was previously offered only to authors of data paper manuscripts. It is now available to data compilers generally, with three levels of service:

  • Basic: the compiler gets a detailed report on what needs fixing
  • Standard: minor problems are fixed in the dataset and reported
  • Premium: all detected problems are fixed in collaboration with the data compiler and a report is provided

Because datasets vary so much in size and content, it is not possible to set a price in advance for basic, standard and premium data-checking. To get a quote for a dataset, send an email with a small sample of the data to publishing@pensoft.net.


Data checklist

Minor problems:

  • dataset not UTF-8 encoded
  • blank or broken records
  • characters other than letters, numbers, punctuation and plain whitespace
  • more than one version (the simplest or most correct one) for each character
  • unnecessary whitespace
  • Windows carriage returns (retained if required)
  • encoding errors (e.g. “Dum?ril” instead of “Duméril”)
  • missing data with a variety of representations (blank, "-", "NA", "?" etc.)

Major problems:

  • unintended shifts of data items between fields
  • incorrect or inconsistent formatting of data items (e.g. dates)
  • different representations of the same data item (pseudo-duplication)
  • for Darwin Core datasets, incorrect use of Darwin Core fields
  • data items that are invalid or inappropriate for a field
  • data items that should be split between fields
  • data items referring to unexplained entities (e.g. “habitat is type A”)
  • truncated data items
  • disagreements between fields within a record
  • missing, but expected, data items
  • incorrectly associated data items (e.g. two country codes for the same country)
  • duplicate records, or partial duplicate records where not needed

For details of the methods used, see the author's online resources.

***

Find out more about Pensoft's data audit workflow for data papers submitted to Pensoft journals on Pensoft's blog.

How to get data from research articles back into the research cycle at no additional cost?

Pensoft’s journals introduce a standard appendix template for primary biodiversity data to provide direct harvesting and conversion to interlinked FAIR data

by Lyubomir Penev, Mariya Dimitrova, Iva Kostadinova, Teodor Georgiev, Donat Agosti, Jorrit Poelen

Linking open data has hardly been a "new" or "innovative" concept since Tim Berners-Lee published his "5-Star Rating of Linked Open Data (LOD)" in 2006. The real question is how to implement it in practice, especially when most data are still waiting to be liberated from the narratives of the more than 2.5 million scholarly articles published annually. We are still far from the dream world of linked and re-usable open data, not least because the inertia of academic publishing practices appears much stronger than the necessary cultural change.

There are already many exciting tools and projects that harvest data from large corpora of published literature, including historical papers, such as PubMed Central in biomedicine or the Biodiversity Heritage Library in biodiversity science. Yet, finding data elements within the text of these corpora and linking them to external resources, even with the help of AI tools, is still in its infancy.

Data should not only be extracted; they should be semantically enriched and linked both to their original resources (e.g. accession numbers for sequences need to be linked to GenBank) and to each other, as well as to data from other domains. Only then can the data be made FAIR: Findable, Accessible, Interoperable and Re-usable. There are already research infrastructures that provide extraction, liberation and semantic enrichment of data from published narratives, for example, the Biodiversity Literature Repository, established at Zenodo by the digitisation company Plazi and the science publisher and technology provider Pensoft.

Quick access to high-quality Linked Open Data can become vitally important in cases like the current COVID-19 pandemic, when scientists need re-usable data from different research fields to come up with healthcare solutions. To complete the puzzle, they need data related to the taxonomy and biology of viruses, but also data taken from their potential hosts and vectors in the animal world, like bats or pangolins. Therefore, what could publishers do to facilitate the re-usability and interoperability of data they publish? 

In a recently published paper by Patterson et al. (2020) on the phylogenetics of Old World Leaf-nosed bats in the journal ZooKeys, the authors and the publisher worked together to present the data on the studied voucher specimens of bats in an Appendix table, where each row represents a set of valuable links between the different data related to a specimen (see Fig. 1). 

Fig. 1. Screenshot of the Appendix table with data on 324 specimens of bats (Patterson et al. 2020).


Specimens in natural history collections, for instance, have so-called human-readable Specimen codes; for example, FMNH 221308 translates to a specimen with Catalogue No 221308, preserved in the collection of the Field Museum of Natural History, Chicago (FMNH). When added to a collection, such voucher specimens are also assigned Globally Unique Identifiers (GUIDs). For example, the GUID of the above-mentioned specimen looks like this:

25634cae-5a0c-490b-b380-9cabe456316a 

and is available from the Global Biodiversity Information Facility (GBIF) under Original Occurrence ID (Fig. 2), from where computer algorithms can locate various types of data associated with the GUID of a particular specimen, regardless of where these data are stored. Examples of data types and relevant repositories, besides the occurrence record of the specimen available from GBIF, are specimen data stored at the US-based natural history collection network iDigBio, the specimen's genetic sequences at GenBank, and images or sound recordings stored in other third-party databases (e.g. MorphoSource, BioAcoustica) and others.

Together, the complex digital environment of information linked to the globally unique identifier of a physical specimen in a collection constitutes its "openDS" digital specimen representation, recently formulated within the EU project ICEDIG. Nevertheless, this complex linking could be achieved more easily and at modest cost if only the GUIDs were always submitted to the respective data repositories together with the data about that particular specimen. Unfortunately, this is too rarely the case, hence we have to look for other ways to link these fragmented data.

Fig. 2. The representation of the specimen FMNH 221308 on GBIF. The Globally Unique Identifier (GUID) of the specimen is shown in the Original Occurrence ID field.

Next to the Specimen code in the table (Fig. 1), there are one or more columns containing accession numbers of different gene sequences from that specimen, linked to their entries in GenBank. There is also a column for the species names associated with the specimens, linked through the Pensoft Taxon Profile (PTP) tool to several trusted international resources in whose data holdings they appear, such as GBIF, GenBank, the Biodiversity Heritage Library, PubMed Central and many more (see the example for the bat species Hipposideros ater). The next column contains the country where the specimen was collected, and the last columns contain the geographic coordinates of the collecting locality.

The structure of such a specimen-based table is not fixed and can include several other data elements, for example, resolvable persistent identifiers for the deposition of microCT or other images of the specimen at a repository (e.g. MorphoSource), or of a tissue sample from which a virus has been isolated (see the sample table template below).

So far, so good, but what is the true value of those interlinked data, beyond the fact that a reader can click on a linked data item and immediately see more information about that particular element? What other missing links can we include to bring extra value to the data, so that these can be put together and re-used by the research community? And where do we find these missing links?

The missing links are present in the table rows!

Firstly, if we open the GBIF record for the specimen in question (FMNH 221308), we see a lot of additional information there (Fig. 2), which can be read by humans and retrieved by computers through GBIF's Application Programming Interface (API). However, the link to the GenBank accession number KT583829 of the cyt-b gene sequenced from that specimen is missing, probably because, at the time the specimen data were deposited in GBIF, its sequences had not yet been submitted to GenBank.

Now, suppose we wish to determine from which specimen a particular gene deposited in GenBank has been sequenced, and where that specimen is preserved. We can easily click on any accession number in the table but, again, while we find a lot of useful information about the gene, for example about the sequencing methods, its taxon name etc., the voucher specimen's GUID is missing (see accession number KT583829 of the specimen FMNH 221308, Fig. 3). How could we then locate the GUID of that specimen and the additional information linked to it? By publishing all this information in the Appendix in the way described here, we can easily recover this missing link between the specimen's GUID and its sequence, either "by hand" or through API calls made by computers.

Fig. 3. GenBank record for the accession number KT583829 of the voucher specimen FMNH 221308. The GUID for the voucher specimen is not present in the record. 


While biodiversity researchers are used to working with taxon names, these names are far from being stable entities. Names can change over time, several different names can be associated with the same "thing" (synonyms), or identical names (homonyms) may be used for different "things". The biodiversity community needs to resolve this problem by agreeing, in the future Catalogue of Life, on taxon names that are unambiguously identified with GUIDs through their taxon concepts (the content behind each name, according to a particular author who has used that name in a publication; for example, Hipposideros vittatus (Peters, 1852) as used in the work of Patterson et al. (2020)). Here comes another missing link that the table can provide: the link between the specimen, the taxon name to which it belongs and the taxon concept of that name, according to the article in which the name has been used and published.

Now, once we have listed all available linked information about several specimens belonging to a number of different species in a table, we can continue by adding some other important data, such as the biotic interactions between specimens or species. For example, we can name the table we have already constructed "Source specimens/species" and add to it some more columns under the heading "Target specimens/species". The linking between the two groups of specimens or species in the extended biotic interaction table can be modelled using the OBO Relations Ontology, with its list of terms offered in a drop-down menu in the table template. Observed biotic interactions between specimens or species of the type "pathogen of", "preys on", "has vector" etc. can then be easily harvested and recorded in the Global Biotic Interactions database GloBI (see example on interactions between specimens).

As a result, we could have a table like the one below, where column names and data elements linked in the rows follow simple but strict rules: 

Appendix A. Specimen data table. Legend: 1 – Two groupings of specimen/species data (Source and Target); 2 – Data type groups – not changeable, linked to the appropriate ontology terms whenever possible; 3 – Column names – not changeable, linked to the appropriate ontology terms whenever possible; 4 – Linked to; 5 – Linked by.

1 – Groupings: Source specimens/species | Biotic interactions (after OBO Relation Ontology) | Target specimens/species

2 – Data type groups:
  Source: Preserved specimen (Specimen code) | Associated sequences | Taxon name/MOTU | Other thematic repositories | Location | Habitat / Environment (after ENVO Ontology)
  Target: Preserved specimen (Specimen code) | Associated sequences | Taxon name/MOTU

3 – Column names:
  Source: Institution Code | Collection Code | Catalogue ID | Gene #1 | Gene #2 | PID (e.g. images dataset) | PID (e.g. sound recordings) | Latitude | Longitude
  Target: Institution Code | Collection Code | Catalogue ID | Gene #1

4 – Linked to: GRSciColl | GRSciColl | GBIF, iDigBio or DiSSCo | INSDC (GenBank, ENA or DDBJ) | INSDC (GenBank, ENA or DDBJ) | Pensoft Taxon Profile | Image repository | Google Maps | Google Maps | ENVO vocabulary | OBO term vocabulary | GRSciColl | GRSciColl | GBIF, iDigBio or DiSSCo | INSDC (GenBank, ENA or DDBJ) | Pensoft Taxon Profile

5 – Linked by: Pensoft | Pensoft | Author | Pensoft | Pensoft | Pensoft | Author | Author | Pensoft | Pensoft | Pensoft | Author | Pensoft | Pensoft | Author | Pensoft | Pensoft

(Google spreadsheet format: https://docs.google.com/spreadsheets/d/1AWf75FSHppTifNpmhpvWNgtTJJGu-vFtFudYrhbMOuY/edit#gid=0)

As one can imagine, some columns or cells provided in the table could be empty, as the full completion of this kind of data is rarely possible. For the purposes of a publication, the author can remove all empty columns or add additional columns, for example, for listing more genes or other types of data repository records containing data about a particular specimen. What should not be changed, though, are the column names, because they give the semantic meaning of the data in the table, which allows computers to transform them into machine-readable formats.

At the end of the publishing process, this table is published not only for humans, but also in Extensible Markup Language (XML), which makes the data in the table "understandable" to computers. At the moment of publication, tables published in XML contain not only data, but also information about what these data mean (semantics) and how they can be identified. Thanks to these two features, an algorithm can automatically convert the data into another machine-readable language, the Resource Description Framework (RDF), which in turn makes the data interoperable with other data, so that they can be linked together using any of the identifiers of the data elements in the table. Such converted data are represented as simple statements, called "RDF triples", and stored in dedicated triple stores, such as OpenBiodiv or Ozymandias, from which knowledge graphs can be created and used further. For example, one can search for and access data associated with a particular specimen but deposited in various data repositories; other research groups might be interested in bringing together all pathogens that have been extracted from particular tissues of specimens belonging to a particular host species within a specific geographical location, and so on.
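
To illustrate that conversion, the sketch below (using the Python library rdflib and Darwin Core terms) expresses a single table row as RDF triples. The URI patterns and choice of predicates are our own simplifications, not the exact OpenBiodiv model, and the taxon name is a placeholder value.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

# One row of the Appendix table (specimen code, GUID and sequence from the example above);
# the scientificName is a placeholder and would be taken from the actual table row.
row = {
    "occurrenceID": "25634cae-5a0c-490b-b380-9cabe456316a",
    "institutionCode": "FMNH",
    "catalogNumber": "221308",
    "scientificName": "Hipposideros sp. (placeholder)",
    "associatedSequences": "KT583829",
}

g = Graph()
specimen = URIRef(f"urn:uuid:{row['occurrenceID']}")  # the specimen GUID as subject
g.add((specimen, RDF.type, DWC.Occurrence))
g.add((specimen, DWC.institutionCode, Literal(row["institutionCode"])))
g.add((specimen, DWC.catalogNumber, Literal(row["catalogNumber"])))
g.add((specimen, DWC.scientificName, Literal(row["scientificName"])))
g.add((specimen, DWC.associatedSequences,
       URIRef(f"https://www.ncbi.nlm.nih.gov/nuccore/{row['associatedSequences']}")))

print(g.serialize(format="turtle"))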

Finding and preserving links between the different data elements, for example between a Specimen, Tissue, Sequence, Taxon name and Location, is in itself a task deserving special attention and investment. How could such bi- and multilateral linking work? Having the table above in place, alongside all relevant tools and competences, one can run, for example, the following operations via scripts and APIs (a minimal sketch follows the list):

  1. Locate the GUID for Specimen Code at GBIF (= OccurrenceID)
  2. Lookup sequence data associated with that GUID at GenBank
  3. Represent the link between the GUID and Sequence accession numbers in a series of RDF triples
  4. Link and express in RDF the presentation of the specimen on GBIF with the article where it has been published.
  5. Automatically inform institutions/collections for published materials containing data on their holdings (specimens, authors, publications, links to other research infrastructures, etc.).
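
Here is a minimal sketch of the first two steps, using GBIF's public occurrence search API and NCBI's E-utilities. Note that finding sequences by searching GenBank for the specimen code is a heuristic, and exactly the kind of fragile, indirect link the Appendix table is meant to replace.

import requests

def gbif_occurrence(institution_code, catalog_number):
    r = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"institutionCode": institution_code, "catalogNumber": catalog_number},
        timeout=30,
    )
    results = r.json().get("results", [])
    return results[0] if results else None

def genbank_uids(specimen_code):
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "nucleotide", "term": specimen_code, "retmode": "json"},
        timeout=30,
    )
    return r.json()["esearchresult"]["idlist"]

occ = gbif_occurrence("FMNH", "221308")
if occ:
    print("GBIF occurrenceID (GUID):", occ.get("occurrenceID"))
print("GenBank nucleotide UIDs mentioning the voucher:", genbank_uids("FMNH 221308"))
# Step 3 would express these links as RDF triples, as in the rdflib sketch above.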

Semantic representation of data found in such an Appendix Specimen Data Table allows the utilisation of the Linked Open Data model to map and link several data elements to each other, including the provenance record, that is the original source (article) from where these links have been extracted (Fig. 4). 

Fig. 4. Example of a semantic representation between some of the data elements from the Appendix Specimen Data Table. The proposed schema for mapping these elements uses mostly Darwin Core terms to maintain interoperability across different platforms. The link between the specimen GUID, GBIF occurrence, GenBank sequence and scientific name is marked in red.

At the very end, we will be able to construct a new "virtual super-table" of semantic links between the data elements associated with a specimen, which, in the ideal case, would provide the fully-linked information on data and metadata along and across the lines:

Species A: Specimen <> Tissue sample <> Sequence <> Location <> Taxon name <> Habitat <> Publication source 

↑↓

Species B: Specimen <> Tissue sample <> Sequence <> Location <> Taxon name <> Habitat <> Publication source

Retrieving such additional information, for example about an occurrence from GBIF or sequence information from GenBank, through APIs and linking these pieces of information together in one dataset opens new possibilities for data discovery and re-use, as well as for the reproducibility of research results.

An example of how data from different resources can be brought together and represented is the visualisation of host-parasite interactions between species, such as those between bats and coronaviruses, indexed by Global Biotic Interactions (GloBI) (Fig. 5). Various other interactions, such as pollination, feeding, co-existence and others, are stored in GloBI's database, which is also available in the form of a Linked Open Dataset, openly accessible through files or through a SPARQL endpoint.
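
For readers who want to retrieve such interaction records programmatically, GloBI also exposes a web API. The sketch below reflects our understanding of its query parameters and response shape and should be verified against GloBI's own documentation before use.

import requests

def globi_interactions(source_taxon, interaction_type="interactsWith"):
    r = requests.get(
        "https://api.globalbioticinteractions.org/interaction",
        params={
            "sourceTaxon": source_taxon,
            "interactionType": interaction_type,
            "type": "json",
        },  # assumed parameter names -- check the GloBI documentation
        timeout=30,
    )
    r.raise_for_status()
    payload = r.json()
    columns = payload.get("columns", [])  # assumed response shape
    return [dict(zip(columns, record)) for record in payload.get("data", [])]

for hit in globi_interactions("Plecotus auritus", "hostOf")[:10]:
    print(hit.get("source_taxon_name"), "->", hit.get("target_taxon_name"))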

Fig. 5. Visualisation resulting from querying biotic interactions existing between a bat species from order Chiroptera (Plecotus auritus) and bat coronavirus.

The technology of Linked Open Data is already widely used across many fields, so data scientists will not be tremendously impressed by the fact that all of the above is possible. The problem is how to get there. One of the most obvious ways seems to be for publishers to start publishing data in a standard, community-agreed format so that these can easily be handled by machines with little or no human intervention. Will they do that? Some will, but until it becomes routine practice, most of the published data, i.e. high-quality, peer-reviewed data vetted by the act of publishing, will remain hardly accessible, hence unusable.

This pilot was elaborated as a use case published as the first article in a free-to-publish special issue on the biology of bats and pangolins as potential vectors for Coronaviruses in the journal ZooKeys. An additional benefit from the use case is the digitisation and data liberation from many articles on bats contained in the bibliography of the Patterson et al. article by Plazi. The use case is also a contribution to the recently opened COVID-19 Joint Task Force of the Consortium of European Taxonomic Facilities (CETAF), the Distributed System for Scientific Collections (DiSSCo) and the Integrated Digitized Biocollections (iDigBio).

To facilitate the quick adoption of the improved data table standards, Pensoft invites all who would like to test and see how their data are distributed and re-used after publication to submit manuscripts containing specimen data and biotic interaction tables, following the standard described above. The authors would be provided with a template table for completion of all fields relevant to their study while conforming to the standard used by Pensoft.

This initiative was supported in part by the IGNITE project.

Information: 

Pensoft Publishers

Field Museum of Natural History Chicago

References:

Patterson BD, Webala PW, Lavery TH, Agwanda BR, Goodman SM, Kerbis Peterhans JC, Demos TC (2020) Evolutionary relationships and population genetics of the Afrotropical leaf-nosed bats (Chiroptera, Hipposideridae). ZooKeys 929: 117-161. https://doi.org/10.3897/zookeys.929.50240

Poelen JH, Simons JD, Mungall CJ (2014) Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2014.08.005

Hardisty A, Ma K, Nelson G, Fortes J (2019) ‘openDS’ – A New Standard for Digital Specimens and Other Natural Science Digital Object Types. Biodiversity Information Science and Standards 3: e37033. https://doi.org/10.3897/biss.3.37033

Could biodiversity data be finally here to last?

While digital curation, publication and dissemination of data have been steadily picking up in recent years in scientific fields ranging from biodiversity and ecology to chemistry and aeronautics, so have imminent concerns about their quality, availability and reusability. What’s the use of any dataset if it isn’t FAIR (i.e. findable, accessible, interoperable and reusable)?  

With the all-too-fresh memory of researchers like Elizabeth "Lizzie" Wolkovich, who would spend a great deal of time chasing down crucial and impossible-to-replicate data by pleading with colleagues (or their successors) via inactive email addresses and literally dusting off card folders and floppy disks, it is easy to imagine history repeating itself. At the end of yet another day in today's "Big Data" world, data loss caused by accidental entry errors or misused data standards seems even more plausible than an outdated contact or a drawer that has suddenly caught fire.

When a 2013 study, which looked into 516 papers from 1991 to 2011, reported that the chances of associated datasets being available for reuse fell by 17% each year after publication, it cited issues mostly to do with the data having simply been lost over the years or stored on inaccessible media. However, today's researchers increasingly log their data into external repositories, where datasets are swiftly assigned a persistent link via a unique digital object identifier (DOI), while more and more publishers and funders require authors and project investigators to make their research data openly available upon publication of the associated paper. Further, we saw the emergence of the Data Paper, a research article type later customised for the needs of various fields, including biodiversity, launched in order to describe datasets and bring them to a wider audience. So, aren't data finally here to last?

The majority of research funders, such as the EU’s Framework Programme Horizon2020, have already adopted Open Access policies and are currently working on their further development and exhaustiveness.

Credit: OpenAIRE Research Data Management Briefing paper, available to download from <https://www.openaire.eu/briefpaper-rdm-infonoads/download>.

Today, biodiversity scientists publish and deposit biodiversity data at an unprecedented rate, and the pace is only increasing, boosted by the harrowing effects of climate change, species loss, pollution and habitat degradation, among others. Meanwhile, the field is yet to adopt universal practices and standards for efficiently linking all those data together – currently available from rather isolated platforms – so that researchers can indeed navigate the available knowledge and build on it, rather than unknowingly duplicate the efforts of multiple teams from across the globe. Given the limited human capacity to process the unrestricted amounts of data piling up by the minute, biodiversity science is bound to stagnate if researchers don't hand over the "data chase" to their computers.

Truth be told, a machine that stumbles across "messy" data – data whose format and structure have been compromised so that the dataset is no longer interoperable, i.e. it fails to move intact from one application to another – differs little from a researcher whose personal request to a colleague is being ignored. Easily missable details such as line breaks within data items, invalid characters or empty fields can lead to data loss, eventually compromising future research that would otherwise build on those same records. Unfortunately, institutionally available data collections are just as prone to "messiness", as evidenced by data expert and auditor Dr Robert Mesibov.

“Proofreading data takes at least as much time and skill as proofreading text,” says Dr Mesibov. “Just as with text, mistakes easily creep into data files, and the bigger the data file, the more likely it has duplicates, disagreements between data fields, misplaced and truncated (cut-off) data items, and an assortment of formatting errors.”

Snapshot from a data audit report received by University of Cordoba’s Dr Gloria Martínez-Sagarra and Prof Juan Antonio Devesa while preparing their data paper, which describes the herbarium dataset for the vascular plants in COFC.

Similarly to research findings and conclusions, which cannot be considered truthful until backed up by substantial evidence, that same evidence (i.e. a dataset) should be considered of questionable relevance and credibility if its components are not easily retrievable for anyone wishing to follow them up, be it a human researcher or a machine. In order to ensure that their research contribution is made responsibly and in compliance with good scientific practices, scientists should not only make their datasets openly available online, but also ensure they are clean and tidy, and therefore truly FAIR.

With the kind help of Dr Robert Mesibov, Pensoft has implemented mandatory data audit for all data paper manuscripts submitted to the relevant journals in its exclusively open access portfolio to support responsibility, efficiency and FAIRness in biodiversity science. Learn more about the workflow here. The workflow is illustrated in a case study, describing the experience of University of Cordoba’s Dr Gloria Martínez-Sagarra and Prof Juan Antonio Devesa, while preparing their data paper later published in PhytoKeys. A “Data Quality Checklist and Recommendations” is accessible on the websites of the relevant Pensoft journals, including Biodiversity Data Journal.