FAIR biodiversity data in Pensoft journals thanks to a routine data auditing workflow

Data audit workflow provided for data papers submitted to Pensoft journals.

To avoid publishing openly accessible yet unusable datasets, fated to result in irreproducible and inoperable biodiversity research down the road, Pensoft audits the data described in data paper manuscripts upon their submission to applicable journals in the publisher’s portfolio, including Biodiversity Data Journal, ZooKeys, PhytoKeys, MycoKeys and many others.

Once the dataset is clean and the paper is published, biodiversity data, such as taxa, occurrence records, observations, specimens and related information, become FAIR (findable, accessible, interoperable and reusable), so that they can be merged, reformatted and incorporated into novel and visionary projects, regardless of whether they are accessed by a human researcher or a data-mining computation.

As part of the pre-review technical evaluation of a data paper submitted to a Pensoft journal, the associated datasets undergo a data audit meant to identify any issues that could make the data inoperable. This check is conducted regardless of whether the datasets are provided as supplementary material within the manuscript or linked from the Global Biodiversity Information Facility (GBIF) or another external repository. The features that undergo the audit can be found in a data quality checklist, made available on the website of each journal alongside key recommendations for submitting authors.

Once the check is complete, the submitting author receives an audit report with improvement recommendations, similar to the comments they would receive after the peer review stage of the data paper. If there are major issues with the dataset, the data paper can be rejected prior to assignment to a subject editor, but resubmitted once the necessary corrections have been applied. At this step, authors who have already published their data via an external repository are also reminded to correct those records accordingly.

“It all started back in 2010, when we joined forces with GBIF on a quite advanced idea in the domain of biodiversity: a data paper workflow as a means to recognise both the scientific value of rich metadata and the efforts of the data collectors and curators. Together we figured that those data could be published most efficiently as citable academic papers,” says Pensoft’s founder and Managing director Prof. Lyubomir Penev.
“From there, with the kind help and support of Dr Robert Mesibov, the concept evolved into a data audit workflow, meant to ‘proofread’ the data in those data papers the way a copy editor would go through the text,” he adds.
“The data auditing we do is not a check on whether a scientific name is properly spelled, or a bibliographic reference is correct, or a locality has the correct latitude and longitude”, explains Dr Mesibov. “Instead, we aim to ensure that there are no broken or duplicated records, disagreements between fields, misuses of the Darwin Core recommendations, or any of the many technical issues, such as character encoding errors, that can be an obstacle to data processing.”
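As an illustration of the kinds of technical problems Dr Mesibov describes, here is a minimal sketch in Python. It is not Pensoft's actual audit tooling; the field names, the duplicate key, and the mojibake markers are all assumptions for the example. It flags duplicated records, incomplete geolocations (a disagreement between fields) and character-encoding artefacts:

```python
# Common signs of encoding errors: the Unicode replacement character and
# typical UTF-8-read-as-Latin-1 mojibake sequences.
MOJIBAKE_MARKERS = ("\ufffd", "Ã©", "Ã³", "â€™")

def audit_rows(rows, key_field="occurrenceID"):
    """Return (row_number, problem) tuples for a list of dict rows."""
    problems = []
    seen_keys = {}
    for i, row in enumerate(rows, start=1):
        # Duplicated records: the same identifier appearing twice.
        key = row.get(key_field, "")
        if key in seen_keys:
            problems.append((i, f"duplicate {key_field} (first seen in row {seen_keys[key]})"))
        else:
            seen_keys[key] = i
        # Disagreement between fields: one coordinate without its counterpart.
        if bool(row.get("decimalLatitude")) != bool(row.get("decimalLongitude")):
            problems.append((i, "incomplete geolocation"))
        # Character-encoding artefacts in any field.
        for field, value in row.items():
            if any(marker in value for marker in MOJIBAKE_MARKERS):
                problems.append((i, f"possible encoding error in {field}"))
    return problems
```

A real audit covers far more checks, but even this much catches the classes of error that block downstream processing.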

At Pensoft, the publication of openly accessible, easy-to-find, re-usable and properly archived data is seen as a crucial responsibility of researchers aiming to deliver high-quality, viable scientific output intended to stand the test of time and serve the public good.

CASE STUDY: Data audit for the “Vascular plants dataset of the COFC herbarium (University of Cordoba, Spain)”, a data paper in PhytoKeys

To explain how and why biodiversity data should be published in full compliance with the best (open) science practices, the team behind Pensoft and their long-time collaborators published a guidelines paper, titled “Strategies and guidelines for scholarly publishing of biodiversity data”, in the open science journal Research Ideas and Outcomes (RIO Journal).

Sir Charles Lyell’s historical fossils kept at London’s Natural History Museum accessible online

The Lyell Project team: First row, seated, from left to right: Martha Richter (Principal Curator in Charge of Vertebrates), Consuelo Sendino (in white coat, curator of bryozoans, holding a Lyell fossil gastropod from the Canaries), Noel Morris (Scientific Associate of Invertebrates), Claire Mellish (Senior Curator of arthropods), Sandra Chapman (curator of reptiles) and Emma Bernard (curator of fishes, holding the lectotype of Cephalaspis lyelli). Second row, standing, from left to right: Jill Darrell (curator of cnidarians), Zoe Hughes (curator of brachiopods) and Kevin Webb (science photographer). Photo by Nelly Perez-Larvor.

More than 1,700 animal and plant specimens from the collection of eminent British geologist Sir Charles Lyell – known as the pioneer of modern geology – were organised, digitised and made openly accessible via the NHM Data Portal in a pilot project, led by Dr Consuelo Sendino, curator at the Department of Earth Sciences (Natural History Museum, London). They are described in a data paper published in the open-access Biodiversity Data Journal.

Curator of plants Peta Hayes (left) and curator of bryozoans Consuelo Sendino (right) looking at a Lyell fossil plant from Madeira in the collection area. Photo by Mark Lewis.

The records contain the data from the specimens’ labels (species name, geographical details, geological age and collection details), alongside high-resolution photographs, most of which were ‘stacked’ with the help of specialised software to re-create a 3D model.

Sir Charles Lyell’s fossil collection comprises a total of 1,735 specimens of fossil molluscs, filter-feeding moss animals and fish, as well as 51 more recent shells, including nine specimens originally collected by Charles Darwin from Tierra del Fuego or the Galapagos and later gifted to the geologist. The first specimen of the collection was deposited as long ago as 1846 by Charles Lyell himself, and the last in 1980, by one of his heirs.

With as much as 95% of the specimens having been found in the Macaronesian archipelagos of the Canaries and Madeira and dating to the Cenozoic era, the collection provides key insight into the volcano formation and palaeontology of Macaronesia and the North Atlantic Ocean. By digitising the collection and making it easy for researchers around the globe to find and access, the database is set to serve as a stepping stone for studies in taxonomy, stratigraphy and volcanology alike.

Sites where the Earth Sciences’ Lyell Collection specimens originate.
“The display of this data virtually eliminates the need for specimen handling by researchers and will greatly speed up response time to collection enquiries,” explains Dr Sendino.

Furthermore, the pilot project and its workflow provide an invaluable example to future digitisation initiatives. In her data paper, Dr Sendino lists the limited resources she needed to complete the task in just over a year.

In terms of staff, the curator was joined by MSc student Teresa Máñez (University of Valencia, Spain) for six weeks while locating the specimens and collecting all the information about them; volunteer Jane Barnbrook, who re-boxed 1,500 specimens, working one day per week for a year; NHM’s science photographer Kevin Webb and University of Lisbon’s researcher Carlos Góis-Marques, who imaged the specimens; and a research associate, who provided broad identification of the specimens, working one day per week for two months. Each of the curators of the collections where the Lyell specimens were kept helped Dr Sendino for less than a day. The additional costs comprised consumables such as plastazote, acid-free trays, archival pens and archival paper for new labels.

“The success of this was due to advanced planning and resource tracking,” comments Dr Sendino.
“This is a good example of creating digitisation infrastructure at reduced cost while maintaining a high public profile for digitisation,” she concludes.

 

###

Original source:

Sendino C (2019) The Lyell Collection at the Earth Sciences Department, Natural History Museum, London (UK). Biodiversity Data Journal 7: e33504. https://doi.org/10.3897/BDJ.7.e33504

###

About NHM Data Portal:

Committed to open access and open science, the Natural History Museum (London, UK) has launched the Data Portal to make its research and collections datasets available online. It allows anyone to explore, download and reuse the data for their own research.

The portal’s main dataset consists of specimens from the Museum’s collection database, with 4,224,171 records from the Museum’s Palaeontology, Mineralogy, Botany, Entomology and Zoology collections.

Plazi and the Biodiversity Literature Repository (BLR) awarded EUR 1.1 million from Arcadia Fund to grant free access to biodiversity data

Plazi has received a grant of EUR 1.1 million from Arcadia – the charitable fund of Lisbet Rausing and Peter Baldwin – to liberate data, such as taxonomic treatments and images, trapped in scholarly biodiversity publications.

The project will expand the existing corpus of the Biodiversity Literature Repository (BLR), a joint venture of Plazi and Pensoft, hosted on Zenodo at CERN. The project aims to add hundreds of thousands of figures and taxonomic treatments extracted from publications, and further develop and hone the tools to search through the corpus.

The BLR is an open science community platform to make the data contained in scholarly publications findable, accessible, interoperable and reusable (FAIR). BLR is hosted on Zenodo, the open science repository at CERN, and maintained by the Switzerland-based Plazi association and the open access publisher Pensoft.

In its short existence, BLR has already grown to a considerable size: 35,000+ articles from 600+ journals have been added. From these articles, more than 180,000 images have been extracted and uploaded to BLR, and 225,000+ sub-article components, including biological names and taxonomic treatments or equivalent defined blocks of text, have been deposited at Plazi’s TreatmentBank. Additionally, over a million bibliographic references have been extracted and added to Refbank.

The articles, images and all other sub-article elements are fully FAIR-compliant and citable. If an article is behind a paywall, a user can still access its underlying metadata and the link to the original article, and use the DOI assigned to it by BLR for persistent citation.
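Since BLR deposits live on Zenodo, that metadata can also be retrieved programmatically. The sketch below is illustrative only: the endpoint pattern follows Zenodo's public REST API, while the record shape and the helper names are assumptions, not part of BLR itself.

```python
# Zenodo's documented records endpoint; BLR deposits are Zenodo records.
ZENODO_API = "https://zenodo.org/api/records/{record_id}"

def blr_metadata_url(record_id):
    """Build the Zenodo API URL for a BLR deposit (hypothetical helper)."""
    return ZENODO_API.format(record_id=record_id)

def summarise(record):
    """Pull the citable essentials out of a Zenodo record JSON dict."""
    meta = record.get("metadata", {})
    return {"title": meta.get("title"), "doi": record.get("doi")}
```

Fetching `blr_metadata_url(...)` with any HTTP client returns the JSON that `summarise` expects; the DOI it yields is the one BLR assigns for persistent citation.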

“Generally speaking, scientific illustrations and taxonomic treatments, such as species descriptions, are among the best-kept ‘secrets’ in science, as they are neither indexed, nor citable or accessible. At best, they are implicitly referenced,” said Donat Agosti, president of Plazi. “Meanwhile, their value is undisputed, as shown by the huge effort to create them in standard, comparative ways. From day one, our project has been an eye-opener and a catalyst for the open science scene,” he concluded.

Though the target scientific domain is biodiversity, the Plazi workflow and tools are open source and can be applied to other domains – being a catalyst is one of the project’s goals.

While access to biodiversity images has already proven useful to scientists, and even inspirational to artists, the people behind Plazi are certain that such a well-documented, machine-readable interface will lead to many more innovative uses.

To promote BLR’s approach to making these important data accessible, Plazi seeks collaborations with the community and publishers to remove hurdles in liberating the data contained in scholarly publications and making them FAIR.

The robust legal aspects of the project are a core basis of BLR’s operation. By extracting the non-copyrightable elements from the publications and making them findable, accessible and re-usable for free, the initiative drives the move beyond the PDF and HTML formats to structured data.

###

To participate in the project or for further questions, please contact Donat Agosti, President at Plazi at info@plazi.org

 

Additional information:

About Plazi:

Plazi is an association supporting and promoting the development of persistent and openly accessible digital taxonomic literature. To this end, Plazi maintains TreatmentBank, a digital taxonomic literature repository enabling the archiving of taxonomic treatments; develops and maintains TaxPub, an extension of the National Library of Medicine / National Center for Biotechnology Information Journal Article Tag Suite for taxonomic treatments; is a co-founder of the Biodiversity Literature Repository at Zenodo; participates in the development of new models for publishing taxonomic treatments in order to maximize interoperability with other relevant cyberinfrastructure components, such as name servers and biodiversity resources; and advocates and educates about the vital importance of maintaining free and open access to scientific discourse and data. Plazi is a major contributor to the Global Biodiversity Information Facility.

About Arcadia Fund:

Arcadia is a charitable fund of Lisbet Rausing and Peter Baldwin. It supports charities and scholarly institutions that preserve cultural heritage and the environment. Arcadia also supports projects that promote open access and all of its awards are granted on the condition that any materials produced are made available for free online. Since 2002, Arcadia has awarded more than $500 million to projects around the world.

Pensoft journals integrated with Catalogue of Life to help list the species of the world

While not every taxonomic study is conducted with a nature conservation idea in mind, most ecological initiatives need to be backed by exhaustive taxonomic research. There simply isn’t a way to assess a species’ distributional range, migratory patterns or ecological trends without knowing what this species actually is and where it is coming from.

In order to facilitate taxonomic and other studies, and lay the foundations for effective biodiversity conservation at a time when habitat loss and species extinction are already part of our everyday life, the global organisation Catalogue of Life (CoL) works together with major programmes, including GBIF, Encyclopedia of Life and the IUCN Red List, to collate the names of all species on the planet, set in the context of a taxonomic hierarchy and their distribution.

Recently, the scholarly publisher and technology provider Pensoft implemented a new integration with CoL, joining the effort by encouraging authors who publish global taxonomic reviews in any of the publisher’s journals to upload their taxonomic contributions to the database.

Whenever authors submit a manuscript containing a world revision or checklist of a taxon to a Pensoft journal, they are offered the possibility to upload their datasets in CoL-compliant format, so that they can contribute to CoL, gain more visibility and credit for their work, and support future research and conservation initiatives.

Once the authors upload the dataset, Pensoft will automatically notify CoL about the new contribution, so that the organisation can further process the knowledge and contact the authors, if necessary.

In addition, CoL will also consider indexing global taxonomic checklists that have already been published by Pensoft.

Notably, unlike an automated search engine, CoL does not simply gather and store the uploaded data. All databases in CoL are thoroughly reviewed by experts in the relevant field and must comply with a set of explicit instructions.

“Needless to say that the Species 2000 / Catalogue of Life community is very happy with this collaboration,” says Dr. Peter Schalk, Executive Secretary.

“It is essential that all kinds of data and information sharing initiatives in the realm of taxonomy and biodiversity science get connected, in order to provide integrated quality services to the users in and outside of our community. The players in this field carry a responsibility to forge partnerships and collaborations that create added value for science and society and are mutually reinforcing for the participants. Our collaboration is a fine example of how this can be achieved,” he adds.

“With our extensive experience in biodiversity research, at Pensoft we have already taken various steps to encourage and support data sharing practices,” says Prof. Lyubomir Penev, Pensoft’s founder and CEO. “To better serve this purpose, last year we even published a set of guidelines and strategies for scholarly publishing of biodiversity data, as recommended by our own experience. Furthermore, at our Biodiversity Data Journal, we have not only made the publication of open data mandatory, but were also the first to implement integrated narrative and data publication within a single paper.”

“It only makes sense to collaborate with organisations, such as Catalogue of Life, to make sure that all these global indexers are up-to-date and serve the world’s good in preserving our wonderful biodiversity,” he concludes.

New partnership between Pensoft and BEXIS 2 encourages Data Paper publications

Following the new partnership between the German open source platform BEXIS 2 and the academic publisher Pensoft, scientists are now able to publish data papers in three of the most innovative Pensoft journals: Biodiversity Data Journal (BDJ), One Ecosystem, and Metabarcoding and Metagenomics (MBMG), using EML data packs from BEXIS 2.

In order to encourage and facilitate high-quality data publication, the collaboration allows researchers to easily store, analyse and manage their data via BEXIS 2 before sharing it with the scientific community in a creditable format.

The newly implemented workflow has researchers first download their data from the free open source BEXIS 2 software and then upload the data pack to Pensoft’s ARPHA Journal Publishing Platform, where the data can be further elaborated to comply with the established data paper standards. Within the software, they can work freely on these data.
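The data packs carry their metadata as EML (Ecological Metadata Language), the format BEXIS 2 exports. As a rough, hypothetical illustration of what reading such a pack involves (namespaces omitted for brevity; real EML documents are far richer and the helper name is ours):

```python
import xml.etree.ElementTree as ET

def eml_summary(eml_text):
    """Read the dataset title and creator surname out of a minimal EML document."""
    root = ET.fromstring(eml_text)
    dataset = root.find("dataset")
    return {
        "title": dataset.findtext("title"),
        "creator": dataset.findtext("creator/individualName/surName"),
    }
```

A publishing platform would map fields like these onto the corresponding sections of a data paper manuscript.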

Having selected a journal and a data paper article template, a single click of the ‘Import a manuscript’ button transfers the data into a manuscript in the ARPHA Authoring Tool. Within the collaborative writing tool, the data owner can invite co-authors and peers to help finalise the paper.

Once submitted to a journal, the article undergoes peer review and data auditing and, if accepted for publication, is published with all the perks available at any Pensoft journal, including easy discoverability and increased citability.

“I am delighted to have this new partnership between Pensoft and BEXIS 2 announced,” says Pensoft’s founder and CEO Prof. Lyubomir Penev.

“I believe that workflows like ours do inspire scientists to, firstly, refine their data to the best possible quality, and, secondly, make them available to the world, so that these data can benefit the society much faster and more efficiently through collaborative efforts and constructive feedback.”

“With scientists becoming more and more eager to publish research data in data journals like Pensoft’s BDJ, it is important to provide comprehensive and easy workflows for the transition of data from a data management platform like BEXIS 2 to the repository of the data journal without losing or re-entering any information. So we are absolutely delighted that a first version of such a data publication workflow is now available to users of BEXIS 2,” says Prof. Birgitta König-Ries, Principal Investigator of BEXIS 2.

The collaboration between Pensoft and BEXIS 2 is set to strengthen in the next few months, when a new import workflow is expected to provide an alternative way to publish datasets.

In 2015, Pensoft launched similar workflows for DataONE, the Global Biodiversity Information Facility (GBIF) and the Long Term Ecological Research Network (LTER).

###

Additional information:

About BEXIS 2:

BEXIS 2 is free and open source software that supports researchers in managing their data throughout the entire data life cycle, from data collection and documentation through processing and analysis to sharing and publishing.

BEXIS 2 is a modular scalable platform suitable for working groups and collaborative project consortia with up to several hundred researchers. It has been designed to meet the requirements of researchers in the field of biodiversity, but it is generic enough to serve other communities as well.

BEXIS 2 is developed at Friedrich-Schiller-University Jena together with partners from Max-Planck Institute of Biogeochemistry Jena, Technical University Munich and GWDG Göttingen. The development is funded by the German Science Foundation (DFG).

Data Quality Checklist and Recommendations at Pensoft

While research data sharing and re-use are staples of open science practice, the value of shared data is hugely diminished if their quality is compromised.

At a time when machine readability and the related software are ever more crucial in science, and data are piling up by the minute, it is essential that researchers format, structure and deposit their data efficiently, so that they remain accessible and re-usable for their successors.

Errors that prevent data from being read by computer programs can easily creep into any dataset. They are as diverse as invalid characters, missing brackets, blank fields and incomplete geolocations.
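A hedged sketch of what checklist-style validation can look like in practice, using Python's standard library on a Darwin Core CSV. The required fields and the coordinate ranges below are illustrative choices for the example, not the journals' actual checklist:

```python
import csv
import io

# Illustrative required fields; a journal checklist would name its own.
REQUIRED = ("scientificName", "basisOfRecord")

def check_csv(text):
    """Return (row_number, problem) tuples for a Darwin Core CSV string."""
    problems = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        # Blank fields that the checklist treats as required.
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append((i, f"blank required field: {field}"))
        # Incomplete or out-of-range geolocations.
        lat, lon = row.get("decimalLatitude", ""), row.get("decimalLongitude", "")
        if lat or lon:
            try:
                lat_f, lon_f = float(lat), float(lon)
                if not (-90 <= lat_f <= 90 and -180 <= lon_f <= 180):
                    problems.append((i, "coordinates out of range"))
            except ValueError:
                problems.append((i, "incomplete or unreadable geolocation"))
    return problems
```

Running such checks before submission is exactly the kind of "proofreading your data" the checklist encourages.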

To summarise the lessons learnt from our extensive experience in biodiversity data audit at Pensoft, we have now included a Data Quality Checklist and Recommendations page in the About section of each of our data-publishing journals.

We are hopeful that these guidelines will help authors prepare and publish datasets of higher quality, so that their work can be fully utilised in subsequent research.

At the end of the day, proofreading your data is no different than running through your text looking for typos.

 

We would like to use the occasion to express our gratitude to Dr. Robert Mesibov, who prepared the checklist and whose expertise in biodiversity data audit has contributed greatly to Pensoft through the years.

Five new Pensoft journals integrated with Dryad to improve data discoverability

Academic publisher Pensoft strengthens partnership with Dryad by adding its latest five journals to the list integrated with the digital repository. From now on, all authors who choose any of the journals published under Pensoft’s imprint will be able to opt for uploading their datasets on Dryad. At the click of a button, the authors will have their data additionally discoverable, reusable, and citable.

Started in 2011 as one of the first ever integrated data deposition workflows between a repository (Dryad) and a publisher (Pensoft), the partnership has now been reinforced to cover publications submitted to any of Pensoft’s 21 journals, including recently launched Research Ideas and Outcomes (RIO) and One Ecosystem, as well as BioDiscovery, African Invertebrates and Zoologia, which all moved to Pensoft within the last year.

By agreeing to deposit their datasets to Dryad, authors take advantage of a specialised and highly acknowledged platform to easily showcase and, hence, take credit for their data. On the other hand, the science community, including educators and students, can readily access the data, facilitating verification, citability and even potential collaborations.

“Dedicated to open and reproducible science, at Pensoft we have always strived to encourage our authors to make their research as transparent and, hence, trustworthy as possible, by providing the right infrastructure and support,” says Pensoft’s founder and CEO Prof. Lyubomir Penev. “By strengthening our long-standing partnership with Dryad, I envision more and more authors who publish in our journals adding open data to their list of best practices.”

“Dryad works to promote data that are openly available, integrated with the scholarly literature, and routinely re-used to create knowledge,” said Dryad’s Executive Director, Meredith Morovati. “We are encouraged by the growth of our partnership with Pensoft, one of our earliest supporters. We are honored to provide services to Pensoft authors to ensure their data is openly available, linked to the article, and preserved for future use and for the future of science.”

LifeWatchGreece launches a Special Paper Collection for Greek biodiversity research

Developed in the 1990s and early 2000s, LifeWatch is one of the large-scale European Research Infrastructures (ESFRI) created to support biodiversity science and its developments. Its ultimate goal is to model Earth’s biodiversity based on large-scale data, to build a vast network of partners, and to liaise with other high-quality and viable research infrastructures (RI).

Being one of the founding LifeWatch member states, Greece has not only implemented LifeWatchGreece, but it is all set and ready to “fulfill the vision of the Greek LifeWatch RI and establish it as the biodiversity Centre of Excellence for South-eastern Europe”, according to the authors of the latest Biodiversity Data Journal‘s Editorial: Dr Christos Arvanitidis, Dr Eva Chatzinikolaou, Dr Vasilis Gerovasileiou, Emmanouela Panteri, Dr Nicolas Bailly, all affiliated with the Hellenic Centre for Marine Research (HCMR) and part of the LifeWatchGreece Core Team, together with Nikos Minadakis, Foundation for Research and Technology Hellas (FORTH), Alex Hardisty, Cardiff University, and Dr Wouter Los, University of Amsterdam.

Making use of the technologically advanced open access Biodiversity Data Journal and its Collections feature, the LifeWatchGreece team is publishing a vast collection of peer-reviewed scientific outputs, including software descriptions, data papers, taxonomic checklists and research articles, along with the accompanying datasets and supporting material. Their intention is to demonstrate the availability and applicability of the developed e-Services and Virtual Laboratories (vLabs) to both the scientific community and the broader domain of biodiversity management.

The LifeWatchGreece Special Collection is now available in Biodiversity Data Journal, with a series of articles highlighting key contributions to the large-scale European LifeWatch RI. The Software Description papers explain the LifeWatchGreece Portal, where all the e-Services and the vLabs provided by LifeWatchGreece RI are hosted; the Data Services based on semantic web technologies, which provide detailed and specialized search paths to facilitate data mining; the R vLab which can be used for a series of statistical analyses in ecology, based on an integrated and optimized online R environment; and the Micro-CT vLab, which allows the online exploration, dissemination and interactive manipulation of micro-tomography datasets.

The LifeWatchGreece Special Collection also includes a series of taxonomic checklists (preliminary, updated and/or annotated); a series of data papers presenting historical and original datasets; and a selection of research articles reporting on the outcomes, methodologies and citizen science initiatives developed by collaborating research projects, which have shared human, hardware and software resources with LifeWatchGreece RI.

LifeWatchGreece relies on a multidisciplinary approach, involving several subsidiary initiatives; collaborations with Greek, European and World scientific communities; specialised staff, responsible for continuous updates and developments; and, of course, innovative online tools and already established IT infrastructure.

###

Original source:

Arvanitidis C, Chatzinikolaou E, Gerovasileiou V, Panteri E, Bailly N, Minadakis N, Hardisty A, Los W (2016) LifeWatchGreece: Construction and operation of the National Research Infrastructure (ESFRI). Biodiversity Data Journal 4: e10791. https://doi.org/10.3897/BDJ.4.e10791

Additional information:

This work has been supported by the LifeWatchGreece infrastructure (MIS 384676), funded by the Greek Government under the General Secretariat of Research and Technology (GSRT), ESFRI Projects, National Strategic Reference Framework (NSRF).

How to import occurrence records into manuscripts from GBIF, BOLD, iDigBio and PlutoF

On October 20, 2015, we published a blog post about the novel functionalities in ARPHA that allow streamlined import of specimen or occurrence records into taxonomic manuscripts.

Recently, this process was documented in the “Tips and Tricks” section of the ARPHA authoring tool, where the individual workflows are listed.

Building on our earlier post, we will now go through the latest updates and highlight the new features that have been added since then.

Repositories and data indexing platforms, such as GBIF, BOLD Systems, iDigBio, or PlutoF, hold, among other types of data, specimen or occurrence records. It is now possible to directly import specimen or occurrence records into ARPHA taxonomic manuscripts from these platforms [see Fig. 1]. We’ll refer to specimen or occurrence records simply as occurrence records for the rest of this post.

[Fig. 1] Workflow for directly importing occurrence records into a taxonomic manuscript.
Until now, when users of the ARPHA writing tool wanted to include occurrence records as materials in a manuscript, they had to format the occurrences as an Excel sheet uploaded to the Biodiversity Data Journal, or enter the data manually. While the “upload from Excel” approach significantly simplifies the import of materials, it still requires a transposition step: data stored in a database must be reformatted to the specific Excel format. With the introduction of the new import feature, occurrence data stored at GBIF, BOLD Systems, iDigBio, or PlutoF can be inserted directly into the manuscript by simply entering a relevant record identifier.

The functionality shows up when one creates a new “Taxon treatment” in a taxonomic manuscript in the ARPHA Writing Tool. To import records, the author needs to:

  1. Locate an occurrence record or records in one of the supported data portals;
  2. Note the ID(s) of the records that ought to be imported into the manuscript (see Tips and Tricks for screenshots);
  3. Enter the ID(s) of the occurrence record(s) in a form that is to be seen in the “Materials” section of the species treatment;
  4. Select a particular database from the list, and simply click ‘Add’ to import the occurrence directly into the manuscript.

In the case of BOLD Systems, the author may also enter a Barcode Identification Number (BIN; BINs are discussed below), which pulls in all occurrences in the corresponding BIN.
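For readers curious what pulling a BIN's occurrences looks like outside ARPHA, BOLD exposes a public API; the sketch below only builds the request URL. The endpoint and parameter names follow BOLD's public API_Public interface, but treat them as assumptions rather than a description of ARPHA's internals.

```python
from urllib.parse import urlencode

# BOLD's public specimen-data endpoint (assumed from its public API).
BOLD_SPECIMEN_API = "http://www.boldsystems.org/index.php/API_Public/specimen"

def bold_bin_url(bin_id, fmt="tsv"):
    """Build a request URL for all specimen records in a given BIN."""
    return f"{BOLD_SPECIMEN_API}?{urlencode({'bin': bin_id, 'format': fmt})}"
```

Fetching such a URL returns a tab-separated table of every occurrence in the BIN, which is conceptually what the ‘Add’ button imports in one step.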

We will illustrate this workflow by creating a fictitious treatment of the red moss, Sphagnum capillifolium, in a test manuscript. We have started a taxonomic manuscript in ARPHA and know that occurrence records for S. capillifolium can be found on iDigBio, so we need to locate the ID of the occurrence record on the iDigBio website. For iDigBio, the ARPHA system supports import via a Universally Unique Identifier (UUID). We have already created a treatment for S. capillifolium and clicked on the pencil to edit materials [Fig. 2].

[Fig. 2] Edit materials
In this example, type or paste the UUID (b9ff7774-4a5d-47af-a2ea-bdf3ecc78885), select the iDigBio source and click ‘Add’. This will pull the occurrence record for S. capillifolium from iDigBio and insert it as a material in the current paper [Fig. 3].

[Fig. 3] Materials after they have been imported
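Outside ARPHA, the same UUID resolves the record through iDigBio's public search API. In this sketch the endpoint pattern follows iDigBio's documented API, while the JSON shape assumed by the parser (Darwin Core-prefixed keys under "data") and the helper names are our assumptions:

```python
# iDigBio's record view endpoint, keyed by the same UUID entered above.
IDIGBIO_VIEW = "https://search.idigbio.org/v2/view/records/{uuid}"

def idigbio_record_url(uuid):
    """Build the iDigBio API URL for an occurrence record."""
    return IDIGBIO_VIEW.format(uuid=uuid)

def scientific_name(record):
    """Extract the scientific name from an iDigBio record JSON dict."""
    return record.get("data", {}).get("dwc:scientificName")
```

Any HTTP client can fetch `idigbio_record_url(...)`; the import feature effectively does this lookup and mapping for the author.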
This workflow can be used for a number of purposes. An interesting future application is the rapid re-description of species, but even more exciting is the description of new species from BINs. BINs (Barcode Identification Numbers) delimit Operational Taxonomic Units (OTUs), created algorithmically at BOLD Systems. If a taxonomist decides that an OTU is indeed a new species, they can import all the type information associated with that OTU in order to describe it as a new species.

By not having to retype or copy and paste species occurrence records, authors save a great deal of effort. Moreover, the records are imported automatically in a structured Darwin Core format, which anyone who needs the data for reuse can easily extract from the article text.
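Concretely, "structured Darwin Core format" means each record is a row under standard Darwin Core term names, which is what makes the material trivially machine-readable downstream. A minimal sketch (the term list is abbreviated; real exports carry many more terms):

```python
import csv
import io

# A small subset of standard Darwin Core occurrence terms.
DWC_TERMS = ["occurrenceID", "scientificName", "decimalLatitude", "decimalLongitude"]

def to_dwc_csv(records):
    """Serialise occurrence dicts as a Darwin Core-style CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=DWC_TERMS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```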

Another important aspect of the workflow is that it will serve as a platform for peer review, publication and curation of raw data, that is, of unpublished individual data records coming from collections or observations stored at GBIF, BOLD, iDigBio and PlutoF. Taxonomists are used to publishing only records of specimens they or their co-authors have personally studied. In a sense, the workflow will serve as a “cleaning filter” for the portions of data that pass through the publishing process. Thereafter, the published records can be used to curate the raw data at collections, e.g. to correct identifications, assign newly described species names to specimens belonging to the respective BIN, and so on.

 

Additional Information:

The work has been partially supported by the EC-FP7 EU BON project (ENV 308454, Building the European Biodiversity Observation Network) and the ITN Horizon 2020 project BIG4 (Biosystematics, informatics and genomics of the big 4 insect groups: training tomorrow’s researchers and entrepreneurs), under Marie Skłodowska-Curie grant agreement No. 642241.