Tag: metadata

First scientific publication using digital specimen DOIs is out

Now, scientists will be in a much better position to really exchange and link data across institutions.


“The genera Chrysilla and Phintelloides revisited with the description of a new species (Araneae, Salticidae) using digital specimen DOIs and nanopublications” is the first scientific publication that uses digital specimen DOIs.

Linking data across collections

It is nothing new that our planet is facing a number of serious threats: climate change, biodiversity loss, pandemics… If you have been watching the news, all this is probably familiar to you. The wealth of data hosted in natural history collections can contribute to finding a response to these challenges. Alas, today’s practices of working with collected bio- and geodiversity specimens lack efficiency, limiting what scientists can achieve.

In particular, there is a serious absence of linkage between specimen data. Each specimen in a collection usually has its own catalogue ID that is unique within that collection, but the moment collections attempt to work with other collections (as they should in the face of planetary threats) problems arise, because each collection has its own way of identifying its data, which leads to confusion.

Persistent identifiers: the DOIs

To avoid this problem, several initiatives have been launched in recent years to establish a globally accepted system of persistent identifiers (PIDs) that guarantee the “uniqueness” of collection specimens—physical or digital—over time. 

Digital specimen DOIs can point to individual specimens in a collection.

You can think of a PID as a marker: an identifier that points at one individual object, and only that one, differentiating it from any other in the world. You may have heard of acronyms such as ISBN or ORCID; those are PIDs for books and individual scholars, respectively. For digital research content, the most widely used PID is the DOI (Digital Object Identifier), proposed by the DOI Foundation.

A DOI is an alphanumeric code that looks like this: 10.prefix/suffix

For example, if you type https://doi.org/10.15468/w6ubjx in your browser, you will reach the Royal Belgian Institute of Natural Sciences’ mollusk collection database, accessed through GBIF. This specific DOI will never point at anything else, and the identifier will remain the same in the future, even if the content of this particular database changes.
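If you want to see this resolution mechanism in action, here is a minimal Python sketch (assuming the requests library is installed) that follows a DOI to its current landing page:

```python
import requests

# Follow the redirect chain from doi.org to the current landing page.
# The landing page may move over time; the DOI itself never changes.
response = requests.get("https://doi.org/10.15468/w6ubjx", allow_redirects=True)

print("DOI resolved to:", response.url)
print("HTTP status:", response.status_code)
```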

DiSSCo and the DOIs

The Distributed System of Scientific Collections (DiSSCo) aims to provide a DOI for all individual digital specimens in European natural history collections. The point is not only to accurately identify specimens. That is, of course, crucial, but the DOI of a digital specimen provides a number of other advantages that are extremely interesting for DiSSCo and natural history collections in general. Among them, two are simply revolutionary. 

The digital specimen DOI stores quick-access, basic metadata about the specimen.

Firstly, using DOIs allows linking the digital specimen to all other relevant information about the same specimen that might be hosted in other repositories (e.g. ecological data, genomic data, etc.). In creating this extended digital specimen that links different data types, digital specimen DOIs make a huge contribution to inter-institutional scientific work, filling the gap that is described at the beginning of this piece. Now scientists will be in a much better position to really exchange and link data across institutions.

Secondly, in contrast to most other persistent identifiers, the DOI of a digital specimen stores additional metadata (e.g. name, catalogue number) beyond the URL to which it redirects. This allows access to some information about the specimen without having to retrieve the full data object, i.e. without being redirected to the specimen’s HTML page. This metadata also enables AI systems to quickly navigate billions of digital specimens and perform automated work on them, saving us (humans) precious time.
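As an illustration of this “metadata without dereferencing” idea: every DOI is technically a Handle, and the Handle System exposes handle records through a public REST API. The sketch below reads such a record; the DOI shown is the article’s own, and the exact fields stored for a digital specimen DOI depend on DiSSCo’s profile, so treat the printout as illustrative:

```python
import requests

# Every DOI is a Handle, and handle records are served as JSON by the
# Handle System's public REST API. For digital specimen DOIs, extra
# metadata (e.g. specimen name, catalogue number) lives directly in this
# record; field names vary by registrar, so the printout is generic.
doi = "10.3897/BDJ.12.e129438"  # the article's DOI, used here as an example
record = requests.get(f"https://hdl.handle.net/api/handles/{doi}").json()

for value in record.get("values", []):
    print(value["type"], "->", value["data"]["value"])
```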

Use of DOIs in publications

With all this in mind, it is easier to understand why being able to cite digital specimens in scholarly publications using DOIs is an important step. So far, the only DOIs we could use in publications were those at the dataset level, not at the individual specimen level. In the example above, if a scientist were to publish an article about a specific type of bivalve in the Belgian collection, the only DOI available for citation in the article would be that of the entire mollusk database (containing hundreds or thousands of specimens), not that of the specific oyster or scallop that might be the focus of the publication.

Main page of DiSSCo’s sandbox, the future DiSSCover service.

The publication in Biodiversity Data Journal about the Chrysilla and Phintelloides genera is the first of its kind and opens the door to citing not only dataset-level objects but also individual specimens in publications using DOIs. You can try it yourself: hover over the DOIs cited in the publication and you will get some basic information that might save you the time of visiting the page of the institution holding the specimen. Click on one and you will be taken to DiSSCo’s sandbox (the future DiSSCover service), where you will find all the information about the digital specimen. There you will also be able to comment on and annotate the specimen, and more, making science more dynamic and efficient than before.

A note about Christa Deeleman-Reinhold

At 94 years old, the Dutch arachnologist Christa Deeleman-Reinhold is not only one of the authors of the Chrysilla and Phintelloides article but also one of the most important arachnologists in the world. Born in 1930 on the island of Java (then part of the Dutch East Indies), Christa gained her PhD from Leiden University in 1978. Since then, she has developed a one-of-a-kind scientific career, mainly focused on spider species from Southeast Asia. In her Forest Spiders of South East Asia (2001), Dr. Deeleman-Reinhold revised six spider families, describing 18 new genera and 115 new species. The Naturalis Biodiversity Center hosts the Christa Laetitia Deeleman-Reinhold collection, with more than 20,000 specimens.

Text and images provided by DiSSCo RI.

Research article:

Deeleman-Reinhold CL, Addink W, Miller JA (2024) The genera Chrysilla and Phintelloides revisited with the description of a new species (Araneae, Salticidae) using digital specimen DOIs and nanopublications. Biodiversity Data Journal 12: e129438. https://doi.org/10.3897/BDJ.12.e129438

Author: Pensoft Editorial Team | Posted on: September 30, 2024 | Categories: Biodiversity Data Journal | Tags: data, dataset, digital specimens, DOI, metadata

Pensoft – GloBI workflow for FAIR data exchange and indexing of biotic interactions locked within scholarly articles


by Mariya Dimitrova, Jorrit Poelen, Georgi Zhelezov, Teodor Georgiev, Lyubomir Penev

Fig. 1. Pensoft-GloBI workflow for indexing biotic interactions from scholarly literature

Tables published in scholarly literature are a rich source of primary biodiversity data. They are often used for communicating species occurrence data, morphological characteristics of specimens, links of species or specimens to particular genes, ecology data and biotic interactions between species, etc. Tables provide a structured format for sharing numerous facts about biodiversity in a concise and clear way. 

Together with the rest of the article narrative, Pensoft publishes all tables in the semi-structured eXtensible Markup Language (XML) format. Tables are semantically enhanced with annotated taxonomic names, coordinates, localities and other fields from the Darwin Core Standard. 

Inspired by the potential use of semantically-enhanced tables for text and data mining, Pensoft and Global Biotic Interactions (GloBI) developed a workflow for extracting and indexing biotic interactions from tables published in scholarly literature. GloBI is an open infrastructure enabling the discovery and sharing of species interaction data. GloBI ingests and accumulates individual datasets containing biotic interactions and standardises them by mapping them to community-accepted ontologies, vocabularies and taxonomies. Data integrated by GloBI is accessible through an application programming interface (API) and as archives in different formats (e.g. n-quads). GloBI has indexed millions of species interactions from hundreds of existing datasets spanning over a hundred thousand taxa.
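As a taste of that API, the sketch below asks GloBI for interactions of a given source taxon. The endpoint and parameter names follow GloBI’s documented REST interface at the time of writing, so verify them against the current documentation before relying on this:

```python
import requests

# Ask GloBI for interactions where the source taxon pollinates something.
# Endpoint and parameter names are assumptions based on GloBI's documented
# REST API -- check https://www.globalbioticinteractions.org to confirm.
resp = requests.get(
    "https://api.globalbioticinteractions.org/interaction",
    params={"sourceTaxon": "Apis mellifera", "interactionType": "pollinates"},
)
result = resp.json()

# The response is tabular: a list of column names plus rows of values.
for row in result.get("data", [])[:5]:
    print(dict(zip(result.get("columns", []), row)))
```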

The workflow

First, all tables extracted from Pensoft publications and stored in the OpenBiodiv triple store were automatically retrieved (Step 1 in Fig. 1). There were 6,993 tables from 21 different journals. To identify only the tables containing biotic interactions, we used an ontology annotator, currently developed by Pensoft using terms from the OBO Relation Ontology (RO). The Pensoft Annotator analyses free text and finds words and phrases matching ontology term labels.

We used the RO to create a custom ontology, or list of terms, describing different biotic interactions (e.g. ‘host of’, ‘parasite of’, ‘pollinates’) (Step 2 in Fig. 1). We used all subproperties of the RO term labelled ‘biotically interacts with’ and expanded the list of terms with additional word spellings and variations (e.g. ‘hostof’, ‘host’), which were added to the custom ontology as synonyms of already existing terms using the property oboInOwl:hasExactSynonym.

This custom ontology was used to perform annotation of all tables via the Pensoft Annotator (Step 3 in Fig. 1). Tables were split into rows and columns and accompanying table metadata (captions). Each of these elements was then processed through the Pensoft Annotator and if a match from the custom ontology was found, the resulting annotation was written to a MongoDB database, together with the article metadata. The original table in XML format, containing marked-up taxa, was also stored in the records.
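The matching step can be pictured as a simple label lookup: each ontology term maps to its label plus any synonyms, and every table element is scanned for those strings. The following minimal sketch is not the actual Pensoft Annotator, just an illustration, and the RO IDs are examples:

```python
import re

# Minimal stand-in for the annotation step: scan a table element's text
# for ontology term labels or synonyms. The real Pensoft Annotator is
# more sophisticated; the RO IDs below are illustrative.
ONTOLOGY = {
    "RO:0002453": ["host of"],
    "RO:0002444": ["parasite of"],
    "RO:0002455": ["pollinates"],
}

def annotate(text):
    """Return (term_id, label) pairs whose label occurs in the text."""
    hits = []
    for term_id, labels in ONTOLOGY.items():
        for label in labels:
            if re.search(r"\b" + re.escape(label) + r"\b", text, re.IGNORECASE):
                hits.append((term_id, label))
    return hits

print(annotate("Apis mellifera pollinates Trifolium pratense"))
# [('RO:0002455', 'pollinates')]
```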

Thus, we detected 233 tables which contain biotic interactions, constituting about 3.4% of all examined tables. The scripts used for parsing the tables and annotating them, together with the custom ontology, are open source and available on GitHub. The database records were exported as JSON to a GitHub repository, from where they could be accessed by GloBI.

GloBI processed the tables further, involving the generation of a table citation from the article metadata and the extraction of interactions between species from the table rows (Step 4 in Fig. 1). Table citations were generated by querying the OpenBiodiv database with the DOI of the article containing each table to obtain the author list, article title, journal name and publication year. The extraction of table contents was not a straightforward process, because tables do not follow a single schema and can contain both merged rows and columns (signified using the ‘rowspan’ and ‘colspan’ attributes in the XML). GloBI was able to index such tables by duplicating rows and columns where needed in order to extract the biotic interactions within them. Taxonomic name markup allowed GloBI to identify the taxonomic names of species participating in the interactions. However, the underlying interaction could not be established for each table without introducing false positives, due to complicated table structures that do not specify the directionality of the interaction. Hence, for now, interactions are only of the type ‘biotically interacts with’ (Fig. 2), because it is bi-directional (e.g. ‘Species A interacts with Species B’ is equivalent to ‘Species B interacts with Species A’).

Fig. 2. Example of a biotic interaction indexed by GloBI.

Examples of species interactions provided by OpenBiodiv and indexed by GloBI are available on GloBI’s website.
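As an aside, the merged-cell normalisation described above can be pictured in a few lines of code. This is a minimal sketch, not GloBI’s actual implementation: cells that claim extra columns via ‘colspan’ are simply repeated so that every logical column receives a value (‘rowspan’ would be handled analogously, carrying values down rows):

```python
import xml.etree.ElementTree as ET

# Repeat a cell's text across the extra columns claimed by its 'colspan'
# attribute so that every logical column holds a value.
def expand_row(tr):
    cells = []
    for td in tr.iter("td"):
        span = int(td.get("colspan", "1"))
        cells.extend([td.text or ""] * span)
    return cells

row = ET.fromstring('<tr><td colspan="2">Apis mellifera</td><td>pollinates</td></tr>')
print(expand_row(row))  # ['Apis mellifera', 'Apis mellifera', 'pollinates']
```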

In the future, we plan to expand the capacity of the workflow to recognise interaction types in more detail. This could be implemented by applying part-of-speech tagging to establish the subject and object of an interaction.

In addition to being accessible via an API and as archives, biotic interactions indexed by GloBI are available as Linked Open Data and can be accessed via a SPARQL endpoint. Hence, we plan on creating a user-friendly service for federated querying of GloBI and OpenBiodiv biodiversity data. 
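Querying such an endpoint uses the standard SPARQL-over-HTTP protocol. In the sketch below, the endpoint URL and the query are placeholders, since the concrete endpoint address and data model are not given here:

```python
import requests

# Standard SPARQL-over-HTTP request; the endpoint URL is a placeholder --
# substitute the actual GloBI/OpenBiodiv endpoint.
ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint

QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"  # adapt to the data model

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
for binding in resp.json()["results"]["bindings"]:
    print(binding)
```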

This collaborative project is an example of the benefits of open and FAIR data, enabling the enhancement of biodiversity data through the integration between Pensoft and GloBI. Transformation of knowledge contained in existing scholarly works into giant, searchable knowledge graphs increases the visibility and attributed re-use of scientific publications.


References

Poelen JH, Simons JD, Mungall CJ (2014) Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2014.08.005

Additional Information

The work has been partially supported by the International Training Network (ITN) IGNITE funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 764840.

Author: Pensoft Editorial Team | Posted on: July 17, 2020 | Categories: Data Publishing | Tags: biodiversity, biodiversity data, biotic interactions, data publishing, ecology data, FAIR data, GloBI, linked data, linked open data, metadata, open data, OpenBiodiv, Pensoft, species diversity, species occurrence

Streamlined import of omics metadata from the European Nucleotide Archive (ENA) into an OMICS Data Paper manuscript

Pensoft creates a specialised data paper article type for the omics community within Biodiversity Data Journal to reflect the specific nature of omics data. The scholarly publisher and technology provider established a manuscript template to help standardise the description of such datasets and their most important features.


By Mariya Dimitrova, Raïssa Meyer, Pier Luigi Buttigieg, Lyubomir Penev

Data papers are scientific papers which describe a dataset rather than present and discuss research results. The concept was introduced to the biodiversity community by Chavan and Penev in 2011 as the result of a joint project of GBIF and Pensoft.

Since then, Pensoft has implemented the data paper in several of its journals (Fig. 1). The recognition gained through data papers is an important incentive for researchers and data managers to author better-quality metadata and to make it Findable, Accessible, Interoperable and Re-usable (FAIR). High quality and FAIRness of (meta)data are promoted through peer review, data audit, a permanent scientific record and citation credit, as for any other scholarly publication. One can read more on the different types of data papers and how they help to achieve these goals in Strategies and guidelines for scholarly publishing of biodiversity data (https://doi.org/10.3897/rio.3.e12431).

Fig. 1 Number of data papers published in Pensoft’s journals since 2011.

The data paper concept was initially based on standard metadata descriptions, using the Ecological Metadata Language (EML). Apart from distinguishing a specialised place for dataset descriptions by creating a data paper article type, Pensoft has developed multiple workflows for streamlined import of metadata from various repositories and their conversion into data paper manuscripts in Pensoft’s ARPHA Writing Tool (AWT). You can read more about the EML workflow in this blog post.

Similarly, we decided to create a specialised data paper article type for the omics community within Pensoft’s Biodiversity Data Journal to reflect the specific nature of omics data. We established a manuscript template to help standardise the description of such datasets and their most important features. This initiative was supported in part by the IGNITE project.

How can authors publish omics data papers?

There are two ways to publish omics data papers: (1) write a data paper manuscript following the respective template in the ARPHA Writing Tool (AWT), or (2) convert metadata describing a project or study deposited in EMBL-EBI’s European Nucleotide Archive (ENA) into a manuscript within the AWT.

The first method is straightforward, but the second one deserves more attention. We focused on metadata published in ENA, which is part of the International Nucleotide Sequence Database Collaboration (INSDC) and synchronises its records with those of the other two members (DDBJ and NCBI). ENA is linked to the ArrayExpress and BioSamples databases, which describe sequencing experiments and samples and follow the community-accepted metadata standards MINSEQE and MIxS. To auto-populate a manuscript with a click of a button, authors can provide the accession number of the relevant ENA Study or Project, and our workflow will automatically retrieve all metadata from ENA, as well as any available ArrayExpress or BioSamples records linked to it (Fig. 2). After that, authors can edit any of the article sections in the manuscript by filling in the relevant template fields or creating new sections, adding text, figures, citations and so on.
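Under the hood, a retrieval like this can be done against ENA’s public Browser API, which serves Study and Project metadata as XML. A minimal sketch, with a hypothetical accession number:

```python
import requests
import xml.etree.ElementTree as ET

# Fetch Study/Project metadata as XML from the ENA Browser API.
accession = "PRJEB0000"  # hypothetical accession -- substitute your own
xml_text = requests.get(
    f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
).text

root = ET.fromstring(xml_text)
# Project records carry TITLE and DESCRIPTION elements, which seed the
# corresponding sections of the data paper manuscript.
for tag in ("TITLE", "DESCRIPTION"):
    element = root.find(f".//{tag}")
    if element is not None:
        print(f"{tag}: {element.text}")
```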

An important component of the OMICS data paper manuscript is a supplementary table containing MIxS-compliant metadata imported from BioSamples. When available, BioSamples metadata is automatically converted to a long table format and attached to the manuscript. The authors are not permitted to edit or delete it inside the ARPHA Writing Tool. Instead, if desired, they should correct the associated records in the sourced BioSamples database. We have implemented a feature allowing the automatic re-import of corrected BioSamples records inside the supplementary table. In this way, we ensure data integrity and provide a reliable and trusted source for accessing these metadata. 

Fig. 2 Automated generation of omics data paper manuscripts through import and conversion of metadata associated with the Project ID or Study ID at ENA

Here is a step-by-step guide for conversion of ENA metadata into a data paper manuscript:

  1. The author has published a dataset to any of the INSDC databases. They copy its ENA Study or Project accession number.
  2. The author goes to the Biodiversity Data Journal (BDJ) webpage, clicks the “Start a manuscript” button and selects the OMICS Data Paper template in the ARPHA Writing Tool (AWT). Alternatively, the author can start from the AWT website, click “Create a manuscript”, and select “OMICS Data Paper” as the article type; the Biodiversity Data Journal will be selected automatically by the system. The author then clicks the “Import a manuscript” button at the bottom of the webpage.
  3. The author pastes the ENA Study or Project accession number inside the relevant text box (“Import an European Nucleotide Archive (ENA) Study ID or Project ID”) and clicks “Import”.
  4. The Project or Study metadata is converted into an OMICS data paper manuscript, along with the metadata from ArrayExpress and BioSamples if available. The author can start making changes to the manuscript, invite co-authors, and then submit it for technical evaluation, peer review and publication.

For a detailed description of authoring an OMICS data paper, please refer to the Author Guidelines: https://bdj.pensoft.net/about#OmicsDataPapers

Our innovative workflow makes authoring omics data papers much easier and saves authors time and effort when inserting metadata into the manuscript. It takes advantage of existing links between data repositories to unify biodiversity and omics knowledge into a single narrative. This workflow demonstrates the importance of standardisation and interoperability for integrating data and metadata from different scientific fields.

We have established a special collection for OMICS data papers in the Biodiversity Data Journal. Authors are invited to describe their omics datasets by using the novel streamlined workflow for creating a manuscript at a click of a button from metadata deposited in ENA or by following the template to create their manuscript via the non-automated route.

To stimulate omics data paper publishing, the first 10 papers will be published free of charge. Upon submission of an omics data paper manuscript, do not forget to assign it to the collection Next-generation publishing of omics data.

Author: Pensoft Editorial Team | Posted on: June 16, 2020 | Categories: Biodiversity data, Biodiversity Data Journal, Data Publishing | Tags: BDJ, biodiversity, Biodiversity Data Journal, biology, data, ecological metadata, ecology, FAIR data, metadata, omics data paper, scholarly publishing

Call for data papers from European Russia


Partners GBIF, FinBIF and Pensoft to support publication of data papers that describe datasets from Russia west of the Ural Mountains

Original post via GBIF

GBIF (the Global Biodiversity Information Facility), in collaboration with the Finnish Biodiversity Information Facility (FinBIF) and Pensoft Publishers, is happy to issue a call for authors to submit and publish data papers on European Russia (west of the Urals) in an upcoming special issue of Biodiversity Data Journal (BDJ).

Between now and 31 August 2020, the article processing fee (normally €450) will be waived for the first 20 papers, provided that the publications are accepted and the data paper describes a dataset:

  • with more than 5,000 records that are new to GBIF.org in 2020
  • with high-quality data and metadata
  • with geographic coverage in European Russia west of the Ural mountains

The manuscript must be prepared in English and submitted in accordance with BDJ’s instructions to authors by 31 August 2020. Late submissions will not be eligible for APC waivers.

Sponsorship is limited to the first 20 accepted submissions meeting these criteria on a first-come, first-served basis. The call for submissions can therefore close prior to the stated deadline of 31 August. Authors may contribute to more than one manuscript, but artificial division of the logically uniform data and data stories, or “salami publishing”, is not allowed.

BDJ will publish a special issue including the selected papers by the end of 2020. The journal is indexed by Web of Science (Impact Factor 1.029), Scopus (CiteScore: 1.24) and listed in РИНЦ / eLibrary.ru.

If you are a non-native speaker, please ensure that your English is checked by native speakers or professional English-language editors prior to submission. You may credit these individuals as a “Contributor” through the AWT interface. Contributors are not listed as co-authors but can help you improve your manuscript.

In addition to the BDJ instructions to authors, it is required that a dataset referenced from the data paper a) is cited by its DOI and b) appears in the paper’s list of references.

Authors should explore the GBIF.org section on data papers and Strategies and guidelines for scholarly publishing of biodiversity data. Manuscripts and datasets will go through a standard peer-review process.

To see an example, view this dataset on GBIF.org and the corresponding data paper published by BDJ.

Questions may be directed either to Dmitry Schigel, GBIF scientific officer, or Yasen Mutafchiev, managing editor of Biodiversity Data Journal.

Definition of terms

Datasets with more than 5,000 records that are new to GBIF.org

Datasets should contain a minimum of 5,000 records that are new to GBIF.org. While the focus is on additional records for the region, records already published in GBIF may meet the criterion of ‘new’ if they are substantially improved, particularly through the addition of georeferenced locations.

Justification for publishing datasets with fewer records (e.g. sampling-event datasets, sequence-based data, checklists with endemics etc.) will be considered on a case-by-case basis.

Datasets with high-quality data and metadata

Authors should start by publishing a dataset comprised of data and metadata that meets GBIF’s stated data quality requirement. This effort will involve work on an installation of the GBIF Integrated Publishing Toolkit.

Only when the dataset is prepared should authors turn to working on the manuscript text. The extended metadata you enter in the IPT while describing your dataset can be converted into a manuscript with a single click of a button in the ARPHA Writing Tool (see also Creation and Publication of Data Papers from Ecological Metadata Language (EML) Metadata). Authors can then complete, edit and submit manuscripts to BDJ for review.

Datasets with geographic coverage in European Russia west of the Ural mountains

In correspondence with the funding priorities of this programme, at least 80% of the records in a dataset should have coordinates that fall within the priority area of European Russia west of the Ural mountains. However, authors of the paper may be affiliated with institutions anywhere in the world.
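A quick self-check of this criterion might look like the sketch below, which computes the share of records inside a bounding box crudely approximating the priority area. The box coordinates are our rough assumption (the Ural ridge runs near 60°E), not part of the call’s terms; a real check would use a proper polygon of the region:

```python
# Share of records inside a bounding box crudely approximating European
# Russia west of the Urals. The box is an assumption for illustration.
BOX = {"min_lon": 27.0, "max_lon": 60.0, "min_lat": 41.0, "max_lat": 70.0}

def share_in_region(records):
    inside = sum(
        1 for lat, lon in records
        if BOX["min_lat"] <= lat <= BOX["max_lat"]
        and BOX["min_lon"] <= lon <= BOX["max_lon"]
    )
    return inside / len(records) if records else 0.0

# Moscow, St Petersburg, Yekaterinburg (the last lies just east of the Urals)
records = [(55.75, 37.62), (59.94, 30.31), (56.84, 60.65)]
print(f"{share_in_region(records):.0%} of records fall within the region")
```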

#####

Data audit at Pensoft’s biodiversity journals

Data papers submitted to Biodiversity Data Journal, as well as all relevant biodiversity-themed journals in Pensoft’s portfolio, undergo a mandatory data auditing workflow before being passed down to a subject editor.

Learn more about the workflow here:
https://www.eurekalert.org/pub_releases/2019-10/pp-aif101819.php.

Check out the case study below to see how the data audit workflow works in practice.

CASE STUDY: Data audit for the “Vascular plants dataset of the COFC herbarium (University of Cordoba, Spain)”, a data paper in PhytoKeys
Author: Pensoft Editorial Team | Posted on: March 4, 2020 | Categories: Biodiversity Data Journal | Tags: biodiversity, biodiversity data, Data papers, european russia, fauna, FinBIF, GBIF, metadata, open data, russian research, russian science, scholarly data, scholarly publishing, scientific data, Ural Mountains

How to import data papers from GBIF, DataONE and LTER metadata


On October 13, 2015, we published a blog post about the novel functionalities in ARPHA that allow streamlined import of data papers from EML.

Now, this process has been described in the Tips and Tricks section of the ARPHA Authoring Tool. Here, we’ll list the individual workflows:

  • Import a data paper from GBIF IPT metadata (EML)
  • Import a data paper from DataONE metadata (EML)
  • Import a data paper from LTER metadata (EML)

We want to stress at this point that the import functionality itself is agnostic of the data source and any metadata file in EML 2.1.1 or 2.1.0 can be imported. We have listed these three most likely sources of metadata to illustrate the workflow.
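To make the source-agnostic point concrete, here is a minimal sketch of pulling core fields out of an EML 2.1.x file with Python’s standard library; element paths may need adjusting for files that declare namespaces:

```python
import xml.etree.ElementTree as ET

# Pull core fields out of an EML 2.1.x file, regardless of whether it came
# from GBIF, DataONE or LTER. Paths may need adjusting for namespaced files.
tree = ET.parse("Blandy.235.1.xml")  # the metadata file downloaded below
dataset = tree.getroot().find("dataset")

title = dataset.findtext("title", default="")
abstract = dataset.findtext("abstract/para", default="")
creators = [
    name.findtext("surName", default="")
    for name in dataset.findall("creator/individualName")
]

print("Title:", title)
print("Creators:", ", ".join(creators))
print("Abstract:", abstract[:200])
```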

In the remainder of the post, we will go through the original post from October 13, 2015 and highlight the latest updates.

[Fig. 1]

At the time of the writing of the original post, the Biodiversity Information Standards conference, TDWG 2015, was taking place in Kenya. Data sharing, data re-use, and data discovery were being brought up in almost every talk. We might have entered the age of Big Data twenty years ago, but it is now that scientists face the real challenge – storing and searching through the deluge of data to find what they need.

As the rate at which we generate data exceeds the rate at which data storage technologies improve, the field of data management faces a serious challenge. Worse, this means that the more new data is generated, the more of the older data will be lost. In order to know what to keep and what to delete, we need to describe the data as thoroughly as possible and judge the importance of datasets. This post is about a novel way to automatically generate scientific papers describing a dataset, which will be referred to as data papers.

The common characteristics of the records, i.e. descriptions of the object of study, the measurement apparatus and the statistical summaries used to quantify the records, the personal notes of the researcher, and so on, are called metadata. Major web portals such as DataONE, the Global Biodiversity Information Facility (GBIF), or the Long Term Ecological Research Network store metadata in conjunction with a given dataset as one or more text files, usually structured in special formats enabling the parsing of the metadata by algorithms.

To make the metadata and the corresponding datasets discoverable and citable, the concept of the data paper was introduced in the early 2000s by the Ecological Society of America. This concept was brought to the attention of the biodiversity community by Chavan and Penev (2011) with the introduction of a new data paper concept, based on a metadata standard such as the Ecological Metadata Language, and derived from metadata content stored at large data platforms, in this case the Global Biodiversity Information Facility (GBIF). You can read this article for an in-depth discussion of the topic.

Pensoft’s Biodiversity Data Journal (BDJ) is to the best of our knowledge the first academic journal to have implemented a one-hundred-percent online authoring system for data papers, called ARPHA. Moreover, BDJ and the other Pensoft journals, such as ZooKeys, have already published more than seventy data papers.

Therefore, in the remainder of this post we will explain how to use an automated approach to publish a data paper describing an online dataset in Biodiversity Data Journal. The ARPHA system will convert the metadata describing your dataset into a manuscript for you after reading in the metadata. We will illustrate the workflow on the previously mentioned DataONE and GBIF.

The Data Observation Network for Earth (DataONE) is a distributed cyberinfrastructure funded by the U.S. National Science Foundation. It links together over twenty-five nodes, primarily in the U.S., hosting biodiversity and biodiversity-related data, and provides an interface to search for data in all of them. (Note: in the meantime, DataONE has updated their search interface.)

Since butterflies are neat, let’s search for datasets about butterflies on DataONE! Type “Lepidoptera” in the search field and scroll down to the dataset describing “The Effects of Edge Proximity on Butterfly Biodiversity.” You should see something like this:

[Fig. 2]

As you can notice, this resource has two objects associated with it: the metadata, which has been highlighted, and the dataset itself. Let’s download the metadata from the cloud! The resulting text file, “Blandy.235.1.xml”, or whatever you want to call it, can be read by humans, but is somewhat cryptic because of all the XML tags. Now, you can import this file to the ARPHA writing platform and the information stored in it will be used to create a data paper! Go to the ARPHA website, click on “Start a manuscript,” then scroll all the way down and click on “Import manuscript”.

[Fig. 3]

Upload the “blandy” file and you will see an “Authors’ page,” where you can select which of the authors mentioned in the metadata must be included as authors of the data paper itself. Note that the user of ARPHA uploading the metadata is added to the list of the authors even if they are not included in the metadata. After the selection is done, a scholarly article is created by the system with the information from the metadata already in the respective sections of the article:

[Fig. 4]

Now, the authors can add some description, edit out errors, tell a story, cite someone – all of this without leaving ARPHA – i.e. do whatever it takes to produce a high-quality scholarly text. After they are done, they can submit their article for peer-review and it could be published in a matter of hours. Voila!

Let’s look at GBIF. Go to “Data -> Explore by country” and select “Saint Vincent and the Grenadines,” an English-speaking Caribbean island. There are, as of the time of writing of this post, 166 occurrence datasets containing data about the islands. Select the dataset from the Museum of Comparative Zoology at Harvard. If you scroll down, you will see the GBIF annotated EML. Download this as a separate text file (if you are using Chrome, you can view the source, and then use Copy-Paste). Do the exact same steps as before – go to “Import manuscript” in ARPHA and upload the EML file. The result should be something like this, ready to finalize:

[Fig. 5]

To finish up, we want to leave you with some caveats and topics for further discussion. To date, useful and descriptive metadata has not always been present. There are two challenges: metadata completeness and metadata standards. The invention of the EML standard was one of the first efforts to standardize how metadata should be stored in the field of ecology and biodiversity science.

Currently, our import system supports the last two versions of the EML standard, 2.1.1 and 2.1.0, but we hope to further develop this functionality. In an upcoming version of their search interface, DataONE will provide infographics on the prevalence of metadata standards on their site (as illustrated below), so there is still work to be done; but if there is positive feedback from the community, we will definitely keep elaborating this feature.

 

[Fig. 6: prevalence of metadata standards. Image: DataONE]

Regarding metadata completeness, our hope is that by enabling scientists to create scholarly papers from their metadata with a single-step process, they will be incentivized to produce high-quality metadata.

Now, allow us to give a disclaimer here: the authors of this blog post have nothing to do with the two datasets. They have not contributed to any of them, nor do they know the authors. The datasets have been chosen more or less randomly since the authors wanted to demonstrate the functionality with a real-world example. You should only publish data papers if you know the authors or you are the author of the dataset itself. During the actual review process of the paper, the authors that have been included will get an email from the journal.

 

Additional information:

This project has received funding from the European Union’s FP7 project EU BON (Building the European Biodiversity Observation Network), grant agreement No 308454, and the Horizon 2020 research and innovation project BIG4 (Biosystematics, informatics and genomics of the big 4 insect groups: training tomorrow’s researchers and entrepreneurs) under Marie Sklodowska-Curie grant agreement No. 642241, for a PhD project titled Technological Implications of the Open Biodiversity Knowledge Management System.

Author: Pensoft Editorial Team | Posted on: May 18, 2016 (updated June 13, 2016) | Categories: Biodiversity Data Journal | Tags: academic publishing, ARPHA, bio data, biodiversity, biodiversity data, bioinformatics, data import, Data papers, DataONE, EU BON, GBIF, LTER, metadata, metadata standards, open data, open publishing, open science, repository, scholarship, taxa, taxonomy

Streamlining the Import of Specimen or Occurrence Records Into Taxonomic Manuscripts


Repositories and data indexing platforms, such as GBIF, BOLD Systems, or iDigBio, hold documented specimen or occurrence records along with their record IDs. In order to streamline the authoring process, save taxonomists’ time, and provide a workflow for peer review and quality checks of raw occurrence data, the ARPHA team has introduced an innovative feature that makes it possible to easily import specimen occurrence records into a taxonomic manuscript (see Fig. 1).

For the remainder of this post we will refer to specimen data as occurrence records, since an occurrence can be both an observation in the wild, or a museum specimen.


Fig. 1: Workflow for directly importing occurrence records into a taxonomic manuscript.

Until now, when users of the ARPHA writing tool wanted to include occurrence records as materials in a manuscript, they would have had to format the occurrences as an Excel sheet that is uploaded to the Biodiversity Data Journal, or enter the data manually. While the “upload from Excel” approach significantly simplifies the process of importing materials, it still requires a transposition step – the data which is stored in a database needs to be reformatted to the specific Excel format. With the introduction of the new import feature, occurrence data that is stored at GBIF, BOLD systems, or iDigBio, can be directly inserted into the manuscript by simply entering a relevant record identifier.

The functionality shows up when one creates a new “Taxon treatment” in a taxonomic manuscript prepared in the ARPHA Writing Tool. The import functions as follows:

  1. the author locates an occurrence record or records in one of the supported data portals;
  2. the author notes the ID(s) of the records that ought to be imported into the manuscript (see Fig. 2, 3, and 4 for examples);
  3. the author enters the ID(s) of the occurrence records in a form that is to be seen in the materials section of the species treatment, selects a particular database from a list, and then simply clicks ‘Add’ to import the occurrence directly into the manuscript.

In the case of BOLD Systems, the author may also select a given Barcode Identification Number (BIN; for a treatment of BINs, read below), which then pulls all occurrences in the corresponding BIN (see Fig. 5).
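Conceptually, the import step boils down to fetching a record by its identifier from the source portal’s public API. A sketch using GBIF’s occurrence API (the record ID below is hypothetical):

```python
import requests

# Fetch a single occurrence record by its GBIF ID; the ID is hypothetical.
gbif_id = 1234567890
occurrence = requests.get(f"https://api.gbif.org/v1/occurrence/{gbif_id}").json()

# A few Darwin Core fields that would populate the materials section:
for field in ("scientificName", "country", "decimalLatitude", "decimalLongitude"):
    print(field, ":", occurrence.get(field))
```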


Fig. 2: (Left) An occurrence record in iDigBio. The UUID is highlighted; Fig. 3: (Right) An occurrence record in GBIF. The GBIF ID and the Occurrence ID is highlighted. (Click on images to enlarge)


Fig. 4: (Left) An occurrence record in BOLD Systems. The record ID is highlighted; Fig. 5: (Right) All occurrence records corresponding to an OTU. The BIN is highlighted. (Click on images to enlarge)

We will illustrate this workflow by creating a fictitious treatment of the red moss, Sphagnum capillifolium, in a test manuscript. Let’s assume we have started a taxonomic manuscript in ARPHA and know that the occurrence records belonging to S. capillifolium can be found in iDigBio. What we need to do is locate the ID of the occurrence record on the iDigBio webpage. In the case of iDigBio, the ARPHA system supports import via a Universally Unique Identifier (UUID). We have already created a treatment for S. capillifolium and clicked on the pencil to edit materials (Fig. 6). When we scroll all the way down in the pop-up window, we see the form which is displayed in the middle of Fig. 1.


Fig. 6: Edit materials.

From here, the following actions are possible:

  • insert (an) occurrence record(s) from iDigBio by specifying their UUIDs (universally unique identifiers) (Fig. 2);
  • insert (an) occurrence record(s) from GBIF by entering their GBIF IDs (Fig. 3);
  • insert (an) occurrence record(s) from GBIF by entering their occurrence IDs (note that unfortunately not all GBIF records have an occurrence ID, which is to be understood as a sort of universal identifier) (Fig. 3);
  • insert (an) occurrence record(s) from BOLD by entering their record IDs (Fig. 4);
  • insert a set of occurrence records from BOLD belonging to a BIN (barcode index number) (Fig. 5).

In this example, select the iDigBio option, type or paste the UUID b9ff7774-4a5d-47af-a2ea-bdf3ecc78885 and click ‘Add’. This will pull the occurrence record for S. capillifolium from iDigBio and insert it as a material in the current paper (Fig. 7). The same workflow also applies to the aforementioned GBIF and BOLD portals.
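For the curious, the same record can be retrieved outside ARPHA through iDigBio’s public view API; the Darwin Core field names in the returned block are our assumption of the usual layout:

```python
import requests

# Retrieve the S. capillifolium record used above via iDigBio's view API.
uuid = "b9ff7774-4a5d-47af-a2ea-bdf3ecc78885"
record = requests.get(f"https://search.idigbio.org/v2/view/records/{uuid}").json()

# 'data' holds the raw Darwin Core terms as supplied by the provider;
# the exact keys (e.g. 'dwc:scientificName') are assumptions here.
dwc = record.get("data", {})
print(dwc.get("dwc:scientificName"), "|", dwc.get("dwc:country"))
```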


Fig. 7: Materials after they have been imported.

This workflow can be used for a number of purposes, but one of its most exciting future applications is the rapid re-description of Linnaean species, or new morphological descriptions of species together with DNA barcode sequences (a barcode is a taxon-specific, highly conserved gene that provides enough inter-species variation for statistical classification to take place), using the Barcode Identification Numbers (BINs) underlying Operational Taxonomic Units (OTUs). If a taxonomist is convinced that a species hypothesis corresponding to an OTU defined algorithmically at BOLD Systems clearly represents a new species, then they can import all specimen records associated with that OTU by inserting that OTU’s BIN ID in the respective field.

Having imported the specimen occurrence records, the author needs to define one specimen as the holotype of the new species, others as paratypes, and so on. The author can also edit the records in the ARPHA tool, delete some, or add new ones.

By not having to retype or copy and paste species occurrence records, authors save a lot of effort. Moreover, the records are automatically imported in a structured Darwin Core format, which can easily be extracted from the article text by anyone who needs the data for reuse.

Another important aspect of the workflow is that it will serve as a platform for peer review, publication and curation of raw data, that is, of unpublished individual data records coming from collections or observations stored at GBIF, BOLD and iDigBio. Taxonomists are used to publishing only records of specimens they or their co-authors have personally studied. In a sense, the workflow will serve as a “cleaning filter” for portions of data that are passed through the publishing process. Thereafter, the published records can be used to curate raw data at collections, e.g. correct identifications, assign newly described species names to specimens belonging to the respective BIN, and so on.

Additional Information:

The work has been partially supported by the EC-FP7 EU BON project (ENV 308454, Building the European Biodiversity Observation Network) and the Horizon 2020 ITN project BIG4 (Biosystematics, informatics and genomics of the big 4 insect groups: training tomorrow’s researchers and entrepreneurs), under Marie Sklodowska-Curie grant agreement No. 642241.

Author: Pensoft Editorial Team | Posted on: October 20, 2015 (updated October 9, 2018) | Categories: Uncategorized | Tags: ARPHA, AWT, BIG4, biodiversity, BOLD, data, EU BON, GBIF, ID, iDigBio, import, manuscript, metadata, occurrence, specimen, taxonomy, UUID
