New BiCIKL project to build a freeway between pieces of biodiversity knowledge

The new Horizon 2020-funded project BiCIKL (Biodiversity Community Integrated Knowledge Library) is a partnership of 14 European institutions, representing the continent’s and global key players in biodiversity research and data. They have committed to link their infrastructures and technologies, in order to provide flawless access to data across all stages of the research process. Three years in, BiCIKL will have created the first-of-its-kind Biodiversity Knowledge Hub.

In a recently started Horizon 2020-funded project, 14 European institutions from 10 countries, representing both the continent’s and global key players in biodiversity research and natural history, deploy and improve their own and partnering infrastructures to bridge gaps between each other’s biodiversity data types and classes. By linking their technologies, they are set to provide flawless access to data across all stages of the research cycle.

Three years in, BiCIKL (abbreviation for Biodiversity Community Integrated Knowledge Library) will have created the first-of-its-kind Biodiversity Knowledge Hub, where a researcher will be able to retrieve a full set of linked and open biodiversity data, thereby accessing the complete story behind an organism of interest: its name, genetics, occurrences, natural history, as well as authors and publications mentioning any of those.

Ultimately, the project’s products will solidify Open Science and FAIR (Findable, Accessible, Interoperable and Reusable) data practices by empowering and streamlining biodiversity research.

Together, the project partners will redesign the way biodiversity data is found, linked, integrated and re-used across the research cycle. By the end of the project, BiCIKL will provide the community with a more transparent, trustworthy and efficient highly automated research ecosystem, allowing for scientists to access, explore and put into further use a wide range of data with only a few clicks.

“In recent years, we’ve made huge progress on how biodiversity data is located, accessed, shared, extracted and preserved, thanks to a vast array of digital platforms, tools and projects looking after the different types of data, such as natural history specimens, species descriptions, images, occurrence records and genomics data, to name a few. However, we’re still missing an interconnected and user-friendly environment to pull all those pieces of knowledge together. Within BiCIKL, we all agree that it’s only after we puzzle out how to best bridge our existing infrastructures and the information they are continuously sourcing that future researchers will be able to realise their full potential,” 

explains BiCIKL’s project coordinator Prof. Lyubomir Penev, CEO and founder of Pensoft, a scholarly publisher and technology provider company.

Continuously fed with data sourced by the partnering institutions and their infrastructures, BiCIKL’s key final output: the Biodiversity Knowledge Hub, is set to persist with time long after the project has concluded. On the contrary, by accelerating biodiversity research that builds on – rather than duplicates – existing knowledge, it will in fact be providing access to exponentially growing contextualised biodiversity data.

***

Learn more about BiCIKL on the project’s website at: bicikl-project.eu

Follow BiCIKL Project on Twitter and Facebook. Join the conversation on Twitter at #BiCIKL_H2020.

***

The project partners:

Pensoft Annotator – a tool for text annotation with ontologies

By Mariya Dimitrova, Georgi Zhelezov, Teodor Georgiev and Lyubomir Penev

The use of written language to record new knowledge is one of the advancements of civilisation that has helped us achieve progress. However, in the era of Big Data, the amount of published writing greatly exceeds the physical ability of humans to read and understand all written information. 

More than ever, we need computers to help us process and manage written knowledge. Unlike humans, computers are “naturally fluent” in many languages, such as the formats of the Semantic Web. These standards were developed by the World Wide Web Consortium (W3C) to enable computers to understand data published on the Internet. As a result, computers can index web content and gather data and metadata about web resources.

To help manage knowledge in different domains, humans have started to develop ontologies: shared conceptualisations of real-world objects, phenomena and abstract concepts, expressed in machine-readable formats. Such ontologies can provide computers with the necessary basic knowledge, or axioms, to help them understand the definitions and relations between resources on the Web. Ontologies outline data concepts, each with its own unique identifier, definition and human-legible label.

Matching data to its underlying ontological model is called ontology population and involves data handling and parsing that gives it additional context and semantics (meaning). Over the past couple of years, Pensoft has been working on an ontology population tool, the Pensoft Annotator, which matches free text to ontological terms.

The Pensoft Annotator is a web application, which allows annotation of text input by the user, with any of the available ontologies. Currently, they are the Environment Ontology (ENVO) and the Relation Ontology (RO), but we plan to upload many more. The Annotator can be run with multiple ontologies, and will return a table of matched ontological term identifiers, their labels, as well as the ontology from which they originate (Fig. 1). The results can also be downloaded as a Tab-Separated Value (TSV) file and certain records can be removed from the table of results, if desired. In addition, the Pensoft Annotator allows to exclude certain words (“stopwords”) from the free text matching algorithm. There is a list of default stopwords, common for the English language, such as prepositions and pronouns, but anyone can add new stopwords.

Figure 1. Interface of the Pensoft Annotator application

In Figure 1, we have annotated a sentence with the Pensoft Annotator, which yields a single matched term, labeled ‘host of’, from the Relation Ontology (RO). The ontology term identifier is linked to a webpage in Ontobee, which points to additional metadata about the ontology term (Fig. 2).

Figure 2. Web page about ontology term

Such annotation requests can be run to perform text analyses for topic modelling to discover texts which contain host-pathogen interactions. Topic modelling is used to build algorithms for content recommendation (recommender systems) which can be implemented in online news platforms, streaming services, shopping websites and others.

At Pensoft, we use the Pensoft Annotator to enrich biodiversity publications with semantics. We are currently annotating taxonomic treatments with a custom-made ontology based on the Relation Ontology (RO) to discover treatments potentially describing species interactions. You can read more about using the Annotator to detect biotic interactions in this abstract.

The Pensoft Annotator can also be used programmatically through an API, allowing you to integrate the Annotator into your own script. For more information about using the Pensoft Annotator, please check out the documentation.

Data checking for biodiversity collections and other biodiversity data compilers from Pensoft

Guest blog post by Dr Robert Mesibov

Proofreading the text of scientific papers isn’t hard, although it can be tedious. Are all the words spelled correctly? Is all the punctuation correct and in the right place? Is the writing clear and concise, with correct grammar? Are all the cited references listed in the References section, and vice-versa? Are the figure and table citations correct?

Proofreading of text is usually done first by the reviewers, and then finished by the editors and copy editors employed by scientific publishers. A similar kind of proofreading is also done with the small tables of data found in scientific papers, mainly by reviewers familiar with the management and analysis of the data concerned.

But what about proofreading the big volumes of data that are common in biodiversity informatics? Tables with tens or hundreds of thousands of rows and dozens of columns? Who does the proofreading?

Sadly, the answer is usually “No one”. Proofreading large amounts of data isn’t easy and requires special skills and digital tools. The people who compile biodiversity data often lack the skills, the software or the time to properly check what they’ve compiled.

The result is that a great deal of the data made available through biodiversity projects like GBIF is — to be charitable — “messy”. Biodiversity data often needs a lot of patient cleaning by end-users before it’s ready for analysis. To assist end-users, GBIF and other aggregators attach “flags” to each record in the database where an automated check has found a problem. These checks find the most obvious problems amongst the many possible data compilation errors. End-users often have much more work to do after the flags have been dealt with.

In 2017, Pensoft employed a data specialist to proofread the online datasets that are referenced in manuscripts submitted to Pensoft’s journals as data papers. The results of the data-checking are sent to the data paper’s authors, who then edit the datasets. This process has substantially improved many datasets (including those already made available through GBIF) and made them more suitable for digital re-use. At blog publication time, more than 200 datasets have been checked in this way.

Note that a Pensoft data audit does not check the accuracy of the data, for example, whether the authority for a species name is correct, or whether the latitude/longitude for a collecting locality agrees with the verbal description of that locality. For a more or less complete list of what does get checked, see the Data checklist at the bottom of this blog post. These checks are aimed at ensuring that datasets are correctly organised, consistently formatted and easy to move from one digital application to another. The next reader of a digital dataset is likely to be a computer program, not a human. It is essential that the data are structured and formatted, so that they are easily processed by that program and by other programs in the pipeline between the data compiler and the next human user of the data.

Pensoft’s data-checking workflow was previously offered only to authors of data paper manuscripts. It is now available to data compilers generally, with three levels of service:

  • Basic: the compiler gets a detailed report on what needs fixing
  • Standard: minor problems are fixed in the dataset and reported
  • Premium: all detected problems are fixed in collaboration with the data compiler and a report is provided

Because datasets vary so much in size and content, it is not possible to set a price in advance for basic, standard and premium data-checking. To get a quote for a dataset, send an email with a small sample of the data topublishing@pensoft.net.


Data checklist

Minor problems:

  • dataset not UTF-8 encoded
  • blank or broken records
  • characters other than letters, numbers, punctuation and plain whitespace
  • more than one version (the simplest or most correct one) for each character
  • unnecessary whitespace
  • Windows carriage returns (retained if required)
  • encoding errors (e.g. “Dum?ril” instead of “Duméril”)
  • missing data with a variety of representations (blank, “-“, “NA”, “?” etc)

Major problems:

  • unintended shifts of data items between fields
  • incorrect or inconsistent formatting of data items (e.g. dates)
  • different representations of the same data item (pseudo-duplication)
  • for Darwin Core datasets, incorrect use of Darwin Core fields
  • data items that are invalid or inappropriate for a field
  • data items that should be split between fields
  • data items referring to unexplained entities (e.g. “habitat is type A”)
  • truncated data items
  • disagreements between fields within a record
  • missing, but expected, data items
  • incorrectly associated data items (e.g. two country codes for the same country)
  • duplicate records, or partial duplicate records where not needed

For details of the methods used, see the author’s online resources:

***

Find more for Pensoft’s data audit workflow provided for data papers submitted to Pensoft journals on Pensoft’s blog.

Novel research on African bats pilots new ways in sharing and linking published data

A colony of what is apparently a new species of the genus Hipposideros found in an abandoned gold mine in Western Kenya
Photo by B. D. Patterson / Field Museum

Newly published findings about the phylogenetics and systematics of some previously known, but also other yet to be identified species of Old World Leaf-nosed bats, provide the first contribution to a recently launched collection of research articles, whose task is to help scientists from across disciplines to better understand potential hosts and vectors of zoonotic diseases, such as the Coronavirus. Bats and pangolins are among the animals already identified to be particularly potent vehicles of life-threatening viruses, including the infamous SARS-CoV-2.

The article, publicly available in the peer-reviewed scholarly journal ZooKeys, also pilots a new generation of Linked Open Data (LOD) publishing practices, invented and implemented to facilitate ongoing scientific collaborations in times of urgency like those we experience today with the COVID-19 pandemic currently ravaging across over 230 countries around the globe.

In their study, an international team of scientists, led by Dr Bruce PattersonField Museum‘s MacArthur curator of mammals, point to the existence of numerous, yet to be described species of leaf-nosed bats inhabiting the biodiversity hotspots of East Africa and Southeast Asia. In order to expedite future discoveries about the identity, biology and ecology of those bats, they provide key insights into the genetics and relations within their higher groupings, as well as further information about their geographic distribution.

“Leaf-nosed bats carry coronaviruses–not the strain that’s affecting humans right now, but this is certainly not the last time a virus will be transmitted from a wild mammal to humans. If we have better knowledge of what these bats are, we’ll be better prepared if that happens,”

says Dr Terrence Demos, a post-doctoral researcher in Patterson’s lab and a principal author of the paper.
One of the possibly three new to science bat species, previously referred to as Hipposideros caffer or Sundevall’s leaf-nosed bat
Photo by B. D. Patterson / Field Museum

“With COVID-19, we have a virus that’s running amok in the human population. It originated in a horseshoe bat in China. There are 25 or 30 species of horseshoe bats in China, and no one can determine which one was involved. We owe it to ourselves to learn more about them and their relatives,”

comments Patterson.

In order to ensure that scientists from across disciplines, including biologists, but also virologists and epidemiologists, in addition to health and policy officials and decision-makers have the scientific data and evidence at hand, Patterson and his team supplemented their research publication with a particularly valuable appendix table. There, in a conveniently organized table format, everyone can access fundamental raw genetic data about each studied specimen, as well as its precise identification, origin and the natural history collection it is preserved. However, what makes those data particularly useful for researchers looking to make ground-breaking and potentially life-saving discoveries is that all that information is linked to other types of data stored at various databases and repositories contributed by scientists from anywhere in the world.

Furthermore, in this case, those linked and publicly available data or Linked Open Data (LOD) are published in specific code languages, so that they are “understandable” for computers. Thus, when a researcher seeks to access data associated with a particular specimen he/she finds in the table, he/she can immediately access additional data stored at external data repositories by means of a single algorithm. Alternatively, another researcher might want to retrieve all pathogens extracted from tissues from specimens of a specific animal species or from particular populations inhabiting a certain geographical range and so on.

###

The data publication and dissemination approach piloted in this new study was elaborated by the science publisher and technology provider Pensoft and the digitisation company Plazi for the purposes of a special collection of research papers reporting on novel findings concerning the biology of bats and pangolins in the scholarly journal ZooKeys. By targeting the two most likely ‘culprits’ at the roots of the Coronavirus outbreak in 2020: bats and pangolins, the article collection aligns with the agenda of the COVID-19 Joint Task Force, a recent call for contributions made by the Consortium of European Taxonomic Facilities (CETAF), the Distributed System for Scientific Collections (DiSSCo) and the Integrated Digitized Biocollections (iDigBio).

###

Original source:

Patterson BD, Webala PW, Lavery TH, Agwanda BR, Goodman SM, Kerbis Peterhans JC, Demos TC (2020) Evolutionary relationships and population genetics of the Afrotropical leaf-nosed bats (Chiroptera, Hipposideridae). ZooKeys 929: 117-161. https://doi.org/10.3897/zookeys.929.50240

Plazi and Pensoft join forces to let biodiversity knowledge of coronaviruses hosts out

Pensoft’s flagship journal ZooKeys invites free-to-publish research on key biological traits of SARS-like viruses potential hosts and vectors; Plazi harvests and brings together all relevant data from legacy literature to a reliable FAIR-data repository

To bridge the huge knowledge gaps in the understanding of how and which animal species successfully transmit life-threatening diseases to humans, thereby paving the way for global health emergencies, scholarly publisher Pensoft and literature digitisation provider Plazi join efforts, expertise and high-tech infrastructure. 

By using the advanced text- and data-mining tools and semantic publishing workflows they have developed, the long-standing partners are to rapidly publish easy-to-access and reusable biodiversity research findings and data, related to hosts or vectors of the SARS-CoV-2 or other coronaviruses, in order to provide the stepping stones needed to manage and prevent similar crises in the future.

Already, there’s plenty of evidence pointing to certain animals, including pangolins, bats, snakes and civets, to be the hosts of viruses like SARS-CoV-2 (coronaviruses), hence, potential triggers of global health crises, such as the currently ravaging Coronavirus pandemic. However, scientific research on what biological and behavioural specifics of those species make them particularly successful vectors of zoonotic diseases is surprisingly scarce. Even worse, the little that science ‘knows’ today is often locked behind paywalls and copyright laws, or simply ‘trapped’ in formats inaccessible to text- and data-mining performed by search algorithms. 

This is why Pensoft’s flagship zoological open-access, peer-reviewed scientific journal ZooKeys recently announced its upcoming, special issue, titled “Biology of pangolins and bats”, to invite research papers on relevant biological traits and behavioural features of bats and pangolins, which are or could be making them efficient vectors of zoonotic diseases. Another open-science innovation champion in the Pensoft’s portfolio, Research Ideas and Outcomes (RIO Journal) launched another free-to-publish collection of early and/or brief outcomes of research devoted to SARS-like viruses.

Due to the expedited peer review and publication processes at ZooKeys, the articles will rapidly be made public and accessible to scientists, decision-makers and other experts, who could then build on the findings and eventually come up with effective measures for the prevention and mitigation of future zoonotic epidemics. To further facilitate the availability of such critical research, ZooKeys is waiving the publication charges for accepted papers.

Meanwhile, the literature digitisation provider Plazi is deploying its text- and data-mining expertise and tools, to locate and acquire publications related to hosts of coronaviruses – such as those expected in the upcoming “Biology of pangolins and bats” special issue in ZooKeys – and deposit them in a newly formed Coronavirus-Host Community, a repository hosted on the Zenodo platform. There, all publications will be granted persistent open access and enhanced with taxonomy-specific data derived from their sources. Contributions to Plazi can be made at various levels: from sending suggestions of articles to be added to the Zotero bibliographic public libraries on virus-hosts associations and hosts’ taxonomy, to helping the conversion of those articles into findable, accessible, interoperable and reusable (FAIR) knowledge.

Pensoft’s and Plazi’s collaboration once again aligns with the efforts of the biodiversity community, after the natural science collections consortium DiSSCo (Distributed System of Scientific Collections) and the Consortium of European Taxonomic Facilities (CETAF), recently announced the COVID-19 Task Force with the aim to create a network of taxonomists, collection curators and other experts from around the globe.

FAIR biodiversity data in Pensoft journals thanks to a routine data auditing workflow

 

Data audit workflow provided for data papers submitted to Pensoft journals.

To avoid publication of openly accessible, yet unusable datasets, fated to result in irreproducible and inoperable biological diversity research at some point down the road, Pensoft takes care for auditing data described in data paper manuscripts upon their submission to applicable journals in the publisher’s portfolio, including Biodiversity Data JournalZooKeysPhytoKeysMycoKeys and many others.

Once the dataset is clean and the paper is published, biodiversity data, such as taxa, occurrence records, observations, specimens and related information, become FAIR (findable, accessible, interoperable and reusable), so that they can be merged, reformatted and incorporated into novel and visionary projects, regardless of whether they are accessed by a human researcher or a data-mining computation.

As part of the pre-review technical evaluation of a data paper submitted to a Pensoft journal, the associated datasets are subjected to data audit meant to identify any issues that could make the data inoperable. This check is conducted regardless of whether the dataset are provided as supplementary material within the data paper manuscript or linked from the Global Biodiversity Information Facility (GBIF) or another external repository. The features that undergo the audit can be found in a data quality checklist made available from the website of each journal alongside key recommendations for submitting authors.

Once the check is complete, the submitting author receives an audit report providing improvement recommendations, similarly to the commentaries he/she would receive following the peer review stage of the data paper. In case there are major issues with the dataset, the data paper can be rejected prior to assignment to a subject editor, but resubmitted after the necessary corrections are applied. At this step, authors who have already published their data via an external repository are also reminded to correct those accordingly.

“It all started back in 2010, when we joined forces with GBIF on a quite advanced idea in the domain of biodiversity: a data paper workflow as a means to recognise both the scientific value of rich metadata and the efforts of the the data collectors and curators. Together we figured that those data could be published most efficiently as citable academic papers,” says Pensoft’s founder and Managing director Prof. Lyubomir Penev.
“From there, with the kind help and support of Dr Robert Mesibov, the concept evolved into a data audit workflow, meant to ‘proofread’ the data in those data papers the way a copy editor would go through the text,” he adds.
“The data auditing we do is not a check on whether a scientific name is properly spelled, or a bibliographic reference is correct, or a locality has the correct latitude and longitude”, explains Dr Mesibov. “Instead, we aim to ensure that there are no broken or duplicated records, disagreements between fields, misuses of the Darwin Core recommendations, or any of the many technical issues, such as character encoding errors, that can be an obstacle to data processing.”

At Pensoft, the publication of openly accessible, easy to access, find, re-use and archive data is seen as a crucial responsibility of researchers aiming to deliver high-quality and viable scientific output intended to stand the test of time and serve the public good.

To explain how and why biodiversity data should be published in full compliance with the best (open) science practices, the team behind Pensoft and long-year collaborators published a guidelines paper, titled “Strategies and guidelines for scholarly publishing of biodiversity data” in the open science journal Research Ideas and Outcomes (RIO Journal).

Recipe for Reusability: Biodiversity Data Journal integrated with Profeza’s CREDIT Suite

Through their new collaboration, the partners encourage publication of dynamic additional research outcomes to support reusability and reproducibility in science

In a new partnership between open-access Biodiversity Data Journal (BDJ) and workflow software development platform Profeza, authors submitting their research to the scholarly journal will be invited to prepare a Reuse Recipe Document via CREDIT Suite to encourage reusability and reproducibility in science. Once published, their articles will feature a special widget linking to additional research output, such as raw, experimental repetitions, null or negative results, protocols and datasets.

A Reuse Recipe Document is a collection of additional research outputs, which could serve as a guidelines to another researcher trying to reproduce or build on the previously published work. In contrast to a research article, it is a dynamic ‘evolving’ research item, which can be later updated and also tracked back in time, thanks to a revision history feature.

Both the Recipe Document and the Reproducible Links, which connect subsequent outputs to the original publication, are assigned with their own DOIs, so that reuse instances can be easily captured, recognised, tracked and rewarded with increased citability.

With these events appearing on both the original author’s and any reuser’s ORCID, the former can easily gain further credibility for his/her work because of his/her work’s enhanced reproducibility, while the latter increases his/her own by showcasing how he/she has put what he/she has cited into use.

Furthermore, the transparency and interconnectivity between the separate works allow for promoting intra- and inter-disciplinary collaboration between researchers.

“At BDJ, we strongly encourage our authors to use CREDIT Suite to submit any additional research outputs that could help fellow scientists speed up progress in biodiversity knowledge through reproducibility and reusability,” says Prof. Lyubomir Penev, founder of the journal and its scholarly publisher – Pensoft. “Our new partnership with Profeza is in itself a sign that collaboration and integrity in academia is the way to good open science practices.”

“Our partnership with Pensoft is a great step towards gathering crucial feedback and insight concerning reproducibility and continuity in research. This is now possible with Reuse Recipe Documents, which allow for authors and reusers to engage and team up with each other,” says Sheevendra, Co-Founder of Profeza.