Towards the “Biodiversity PMC”: a literature database supporting advanced content queries

The indexing is one of the major outcomes from the partnerships within the Horizon 2020-funded project Biodiversity Community Integrated Knowledge Library (BiCIKL)


One of the major outcomes of the nearly completed Horizon 2020-funded project Biodiversity Community Integrated Knowledge Library (BiCIKL), which is dedicated to making biodiversity data FAIR and bi-directionally linked, brings the SIB Literature Services (SIBiLS) database one step closer to solidifying its working title: the “Biodiversity PMC” portal.

In a joint effort between the Swiss-based Text Mining group of Patrick Ruch at SIB (developing SIBiLS), the text- and data-mining association Plazi and scientific publisher Pensoft, the long-time collaborators have started feeding the SIBiLS database with the full-text content of over 500,000 taxonomic treatments extracted by Plazi and tens of thousands of full-text articles from 40 renowned biodiversity journals published by Pensoft.

What this means is that users at SIBiLS – be they human or AI – have now gained access to advanced text- and data-mining tools, including AI-powered factoid question-answering capacities, to query all this full-text indexed content and seek out, for example, species traits and biotic interactions.

To index and directly feed the content from its 40+ academic outlets to SIBiLS, Pensoft relies on an advanced full-text TaxPub JATS XML journal publication workflow, powered by the ARPHA publishing platform. Meanwhile, Plazi uses its GoldenGate text- and data-mining software to harvest taxon treatments from over 80 journals. These treatments are stored at TreatmentBank and the Biodiversity Literature Repository, and then further re-used by GBIF, OpenBiodiv and now by SIBiLS.
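For readers curious about what machine-readable treatments look like in practice, the following is a minimal sketch of how taxon treatments might be pulled out of a TaxPub-style JATS XML document. It is not the pipeline used by Pensoft, Plazi or SIBiLS. The sample document, the taxon name "Aus bus" and the helper `extract_treatments` are made up for illustration, and the namespace URI and element names (`tp:taxon-treatment`, `tp:nomenclature`, `tp:taxon-name-part`, `tp:treatment-sec`) are assumed to follow the TaxPub extension.

```python
import xml.etree.ElementTree as ET

# Assumed TaxPub namespace, bound to the conventional "tp" prefix.
TP_NS = {"tp": "http://www.plazi.org/taxpub"}

# Hypothetical, heavily simplified article containing one treatment.
SAMPLE = """<article xmlns:tp="http://www.plazi.org/taxpub">
  <body>
    <tp:taxon-treatment>
      <tp:nomenclature>
        <tp:taxon-name>
          <tp:taxon-name-part taxon-name-part-type="genus">Aus</tp:taxon-name-part>
          <tp:taxon-name-part taxon-name-part-type="species">bus</tp:taxon-name-part>
        </tp:taxon-name>
      </tp:nomenclature>
      <tp:treatment-sec sec-type="Distribution">
        <p>Known only from the type locality.</p>
      </tp:treatment-sec>
    </tp:taxon-treatment>
  </body>
</article>"""


def extract_treatments(xml_text):
    """Return one dict per treatment: the taxon name and its text sections."""
    root = ET.fromstring(xml_text)
    treatments = []
    for treatment in root.iterfind(".//tp:taxon-treatment", TP_NS):
        # Join the name parts (genus, species, ...) in document order.
        parts = treatment.findall(
            "./tp:nomenclature/tp:taxon-name/tp:taxon-name-part", TP_NS
        )
        name = " ".join((part.text or "").strip() for part in parts)

        # Collect the plain text of each treatment section, keyed by its type.
        sections = {}
        for sec in treatment.findall("./tp:treatment-sec", TP_NS):
            text = " ".join("".join(sec.itertext()).split())
            sections[sec.get("sec-type", "unknown")] = text

        treatments.append({"name": name, "sections": sections})
    return treatments


treatments = extract_treatments(SAMPLE)
for item in treatments:
    print(item["name"], item["sections"])
```

Because each treatment is an explicitly tagged unit, an indexer can address the name and each section separately rather than guessing at boundaries in a PDF.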

Seen as a pilot, the indexing – the partners believe – could soon be extended with other journals relying on modern publishing or converted legacy publications. 

In fact, ever since its launch in 2020, the queryable database SIBiLS has been retrieving relevant full-text papers directly from the NIH’s PubMed Central, including Pensoft’s ZooKeys, PhytoKeys, MycoKeys, Biodiversity Data Journal and Comparative Cytogenetics.

However, there were still gaps left to bridge before SIBiLS could indeed be dubbed “the Biodiversity PMC”, and those have mostly been about volume and breadth of content. While the above-mentioned five journals by Pensoft had long been indexed by SIBiLS through harvesting PMC, those had been quite an exception since, several years ago, a reorganisation at PMC moved the focus of the database to almost exclusively biomedical content, thus leaving biodiversity journals out of the scope of the database.

In the meantime, while Plazi has been feeding SIBiLS a growing volume of taxonomic treatments and visual data, as it was exponentially increasing the number of publishers and journals it mined data from, a lot of biodiversity data (e.g. genetic, molecular, ecological) published in the article narratives that were not taxon treatments could not make it to the portal.

“We all know the advantages and practical uses PMC offers to its users, so we cannot miss the opportunity to incorporate this well-proven approach to navigate the data deluge in biodiversity science. Undoubtedly, it is an extremely ambitious and demanding task. Yet, I believe that, at the BiCIKL consortium, we have made it pretty clear we have the necessary expertise, know-how and aspiration to take on the challenge,”

said Prof. Lyubomir Penev, founder/CEO at Pensoft and project coordinator of BiCIKL.

“For far too long, scientific knowledge about biodiversity has been imprisoned in a continuously growing corpus of scientific outputs, which – most of the time – are published in unstructured formats, such as PDF, or as paywalled content, and often locked by both! This means that they are – at best – difficult to access and comprehend by computer algorithms. In the meantime, we need all that knowledge, in order to accelerate our understanding of the dynamics of the global biodiversity crisis and to efficiently assess the impact of climate change. This is why the need for advanced workflows and tools to annotate, mine, query and discover new facts from the available literature is more than obvious,”

added Dr. Donat Agosti, President at Plazi.

“In the course of the BiCIKL project, at SIBiLS, we started indexing a larger set of biodiversity-related contents in the broad sense, including environmental sciences and ecology, to build a new literature database, or what we now call ‘Biodiversity PMC’. Now, with the help of Plazi and Pensoft, we provide a unique entry point to half a million taxonomic treatments, which were not included in the original PubMed Central. Next on the list is to expand our network of literature sources and continue this exponential growth of queryable biodiversity knowledge to turn Biodiversity PMC into the ‘One Health’ library. We promise to keep you posted,”

said Dr. Patrick Ruch, Group Leader at SIB and Head of Research at HES-SO, HEG Geneva, Switzerland. 

***

Follow BiCIKL Project on Twitter and Facebook. Join the conversation on Twitter at #BiCIKL_H2020.

***

About the SIB Swiss Institute of Bioinformatics:

SIB is an internationally recognized non-profit organisation, dedicated to biological and biomedical data science. SIB’s data scientists are passionate about creating knowledge and solving complex questions in many fields, from biodiversity and evolution to medicine. They provide essential databases and software platforms as well as bioinformatics expertise and services to academic, clinical, and industry groups. With the recent creation of the Environmental Bioinformatics group, led by Robert Waterhouse, SIB is engaged in an unprecedented effort to streamline data across molecular biology, health and biodiversity. SIB also federates the Swiss bioinformatics community of some 900 scientists, encouraging collaboration and knowledge sharing.

***

About Plazi:

Plazi is an association supporting and promoting the development of persistent and openly accessible digital taxonomic literature. To this end, Plazi maintains TreatmentBank, a digital taxonomic literature repository to enable archiving of taxonomic treatments; develops and maintains TaxPub, an extension of the National Library of Medicine / National Center for Biotechnology Information Journal Article Tag Suite for taxonomic treatments; is co-founder of the Biodiversity Literature Repository at Zenodo; participates in the development of new models for publishing taxonomic treatments in order to maximise interoperability with other relevant cyberinfrastructure components such as name servers and biodiversity resources; and advocates and educates about the vital importance of maintaining free and open access to scientific discourse and data. Plazi is a major contributor to the Global Biodiversity Information Facility.