Pensoft Annotator – a tool for text annotation with ontologies

By Mariya Dimitrova, Georgi Zhelezov, Teodor Georgiev and Lyubomir Penev

The use of written language to record new knowledge is one of the advancements of civilisation that has helped us achieve progress. However, in the era of Big Data, the amount of published writing greatly exceeds the physical ability of humans to read and understand all written information. 

More than ever, we need computers to help us process and manage written knowledge. Unlike humans, computers are “naturally fluent” in many languages, such as the formats of the Semantic Web. These standards were developed by the World Wide Web Consortium (W3C) to enable computers to understand data published on the Internet. As a result, computers can index web content and gather data and metadata about web resources.

To help manage knowledge in different domains, humans have started to develop ontologies: shared conceptualisations of real-world objects, phenomena and abstract concepts, expressed in machine-readable formats. Such ontologies can provide computers with the necessary basic knowledge, or axioms, to help them understand the definitions and relations between resources on the Web. Ontologies outline data concepts, each with its own unique identifier, definition and human-legible label.

Matching data to its underlying ontological model is called ontology population and involves data handling and parsing that gives it additional context and semantics (meaning). Over the past couple of years, Pensoft has been working on an ontology population tool, the Pensoft Annotator, which matches free text to ontological terms.

The Pensoft Annotator is a web application that annotates text input by the user with any of the available ontologies. Currently, these are the Environment Ontology (ENVO) and the Relation Ontology (RO), but we plan to upload many more. The Annotator can be run with multiple ontologies at once and returns a table of matched ontological term identifiers and labels, together with the ontology from which each term originates (Fig. 1). The results can be downloaded as a Tab-Separated Values (TSV) file, and individual records can be removed from the table of results if desired. In addition, the Pensoft Annotator allows users to exclude certain words (“stopwords”) from the free-text matching algorithm. A default list of stopwords common in English, such as prepositions and pronouns, is provided, and anyone can add new stopwords.
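Conceptually, the matching can be pictured as a lookup of ontology term labels in the stopword-filtered input text. The sketch below is a naive illustration of the idea, not Pensoft's actual algorithm, and the term identifier shown is for illustration only:

```python
def match_terms(text, ontology_terms, stopwords=frozenset()):
    """Naively scan text for ontology term labels.

    ontology_terms maps a lowercased label to its term identifier,
    e.g. {"host of": "RO:0002453"} (ID shown for illustration).
    Real annotators use smarter tokenisation and longest-match rules.
    """
    # Drop stopwords, then look for each label as a contiguous phrase.
    words = [w for w in text.lower().split() if w not in stopwords]
    cleaned = " ".join(words)
    return {label: term_id
            for label, term_id in ontology_terms.items()
            if label in cleaned}
```

For a sentence like the one in Figure 1, filtering out stopwords such as "the" leaves a phrase like "host of" detectable as a contiguous ontology label.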

Figure 1. Interface of the Pensoft Annotator application

In Figure 1, we have annotated a sentence with the Pensoft Annotator, which yields a single matched term, labeled ‘host of’, from the Relation Ontology (RO). The ontology term identifier is linked to a webpage in Ontobee, which points to additional metadata about the ontology term (Fig. 2).

Figure 2. Web page about ontology term

Such annotation requests can be used in text analyses, for example in topic modelling to discover texts describing host–pathogen interactions. Topic modelling is used to build content recommendation algorithms (recommender systems), which can be implemented in online news platforms, streaming services, shopping websites and others.

At Pensoft, we use the Pensoft Annotator to enrich biodiversity publications with semantics. We are currently annotating taxonomic treatments with a custom-made ontology based on the Relation Ontology (RO) to discover treatments potentially describing species interactions. You can read more about using the Annotator to detect biotic interactions in this abstract.

The Pensoft Annotator can also be used programmatically through an API, allowing you to integrate the Annotator into your own script. For more information about using the Pensoft Annotator, please check out the documentation.
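As a sketch of such an integration, the snippet below assembles and sends an annotation request. The endpoint URL and parameter names here are assumptions for illustration; the official documentation defines the real API contract.

```python
import json
import urllib.request

# Hypothetical endpoint; see the Pensoft Annotator documentation for the real one.
ANNOTATOR_URL = "https://annotator.pensoft.net/api/annotate"

def build_request(text, ontologies, stopwords=()):
    """Assemble an annotation request payload (field names are illustrative)."""
    return {
        "text": text,
        "ontologies": list(ontologies),  # e.g. ["ENVO", "RO"]
        "stopwords": list(stopwords),    # extra words to exclude from matching
    }

def annotate(text, ontologies, stopwords=()):
    """POST the text and return the parsed JSON response
    (expected: matched term IDs, labels and source ontologies)."""
    payload = json.dumps(build_request(text, ontologies, stopwords)).encode("utf-8")
    req = urllib.request.Request(ANNOTATOR_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```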

Data checking for biodiversity collections and other biodiversity data compilers from Pensoft

Guest blog post by Dr Robert Mesibov

Proofreading the text of scientific papers isn’t hard, although it can be tedious. Are all the words spelled correctly? Is all the punctuation correct and in the right place? Is the writing clear and concise, with correct grammar? Are all the cited references listed in the References section, and vice-versa? Are the figure and table citations correct?

Proofreading of text is usually done first by the reviewers, and then finished by the editors and copy editors employed by scientific publishers. A similar kind of proofreading is also done with the small tables of data found in scientific papers, mainly by reviewers familiar with the management and analysis of the data concerned.

But what about proofreading the big volumes of data that are common in biodiversity informatics? Tables with tens or hundreds of thousands of rows and dozens of columns? Who does the proofreading?

Sadly, the answer is usually “No one”. Proofreading large amounts of data isn’t easy and requires special skills and digital tools. The people who compile biodiversity data often lack the skills, the software or the time to properly check what they’ve compiled.

The result is that a great deal of the data made available through biodiversity projects like GBIF is — to be charitable — “messy”. Biodiversity data often needs a lot of patient cleaning by end-users before it’s ready for analysis. To assist end-users, GBIF and other aggregators attach “flags” to each record in the database where an automated check has found a problem. These checks find the most obvious problems amongst the many possible data compilation errors. End-users often have much more work to do after the flags have been dealt with.

In 2017, Pensoft employed a data specialist to proofread the online datasets that are referenced in manuscripts submitted to Pensoft’s journals as data papers. The results of the data-checking are sent to the data paper’s authors, who then edit the datasets. This process has substantially improved many datasets (including those already made available through GBIF) and made them more suitable for digital re-use. At blog publication time, more than 200 datasets have been checked in this way.

Note that a Pensoft data audit does not check the accuracy of the data, for example, whether the authority for a species name is correct, or whether the latitude/longitude for a collecting locality agrees with the verbal description of that locality. For a more or less complete list of what does get checked, see the Data checklist at the bottom of this blog post. These checks are aimed at ensuring that datasets are correctly organised, consistently formatted and easy to move from one digital application to another. The next reader of a digital dataset is likely to be a computer program, not a human. It is essential that the data are structured and formatted, so that they are easily processed by that program and by other programs in the pipeline between the data compiler and the next human user of the data.

Pensoft’s data-checking workflow was previously offered only to authors of data paper manuscripts. It is now available to data compilers generally, with three levels of service:

  • Basic: the compiler gets a detailed report on what needs fixing
  • Standard: minor problems are fixed in the dataset and reported
  • Premium: all detected problems are fixed in collaboration with the data compiler and a report is provided

Because datasets vary so much in size and content, it is not possible to set a price in advance for basic, standard and premium data-checking. To get a quote for a dataset, send an email with a small sample of the data to publishing@pensoft.net.


Data checklist

Minor problems:

  • dataset not UTF-8 encoded
  • blank or broken records
  • characters other than letters, numbers, punctuation and plain whitespace
  • more than one encoded version of the same character (only the simplest or most correct one should be kept)
  • unnecessary whitespace
  • Windows carriage returns (retained if required)
  • encoding errors (e.g. “Dum?ril” instead of “Duméril”)
  • missing data with a variety of representations (blank, “-”, “NA”, “?” etc.)

Major problems:

  • unintended shifts of data items between fields
  • incorrect or inconsistent formatting of data items (e.g. dates)
  • different representations of the same data item (pseudo-duplication)
  • for Darwin Core datasets, incorrect use of Darwin Core fields
  • data items that are invalid or inappropriate for a field
  • data items that should be split between fields
  • data items referring to unexplained entities (e.g. “habitat is type A”)
  • truncated data items
  • disagreements between fields within a record
  • missing, but expected, data items
  • incorrectly associated data items (e.g. two country codes for the same country)
  • duplicate records, or partial duplicate records where not needed
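A few of the mechanical checks listed above lend themselves to simple scripting. The sketch below flags a handful of the minor problems in a record's field values; it is an illustration only, not the audit tooling actually used, which is far more thorough and context-aware.

```python
# Common representations of missing data (compare the checklist above).
MISSING = {"", "-", "na", "n/a", "?", "null", "none"}

def check_record(fields):
    """Return (field_index, problem) pairs for a list of field values."""
    problems = []
    for i, value in enumerate(fields):
        # Leading/trailing or doubled spaces.
        if value != value.strip() or "  " in value:
            problems.append((i, "unnecessary whitespace"))
        # Stray Windows line endings inside a field.
        if "\r" in value:
            problems.append((i, "Windows carriage return"))
        # One of the many ways compilers write "no data".
        if value.strip().lower() in MISSING:
            problems.append((i, "missing-data representation"))
        # A '?' embedded in a word often marks a mangled character,
        # as in "Dum?ril" for "Duméril".
        if "?" in value and any(c.isalpha() for c in value):
            problems.append((i, "possible encoding error"))
    return problems
```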

For details of the methods used, see the author’s online resources.

***

Find out more about Pensoft’s data audit workflow for data papers submitted to Pensoft journals on Pensoft’s blog.

Special ZooKeys memorial volume open to submissions to commemorate our admirable founding Editor-in-Chief Terry Erwin

In recognition of the love and devotion that Terry expressed for the study of the World’s biodiversity, ZooKeys invites contributions to a memorial issue titled “Systematic Zoology and Biodiversity Science: A tribute to Terry Erwin (1940-2020)”, covering all subjects falling within the area of systematic zoology.

In tribute to our beloved friend and founding Editor-in-Chief, Dr Terry Erwin, who passed away on 11 May 2020, we are planning a special memorial volume to be published on 11 May 2021, one year to the day after Terry left us. Terry will be remembered by all who knew him for his radiant spirit, charming enthusiasm for carabid beetles and never-ceasing exploration of the world of biodiversity!

In recognition of the love and devotion that Terry expressed for the study of the World’s biodiversity, ZooKeys invites contributions to this memorial issue, titled “Systematic Zoology and Biodiversity Science: A tribute to Terry Erwin (1940-2020)”, on all subjects falling within the area of systematic zoology. Of special interest are papers recognising Terry’s dedication to collection-based research, massive biodiversity surveys and the origin of biodiversity hotspots. The special issue will be edited by John Spence, Achille Casale, Thorsten Assmann, James Liebherr and Lyubomir Penev.

Article processing charges (APCs) will be waived for: (1) contributions to the systematic biology and diversity of carabid beetles, (2) contributions from Terry’s students and (3) contributions from his colleagues at the Smithsonian Institution. Articles which do not fall into the above categories will receive a 30% discount on the APC.

The submission deadline is 31st December 2020.

Contributors are also invited to send memories and photos, which will be published in a special addendum to the volume.

The memorial volume will also include a joint project of Plazi, Pensoft and the Biodiversity Literature Repository aimed at extracting taxonomic data from Terry Erwin’s publications and making it easily accessible to the scientific community.

Novel research on African bats pilots new ways in sharing and linking published data

A colony of what is apparently a new species of the genus Hipposideros found in an abandoned gold mine in Western Kenya
Photo by B. D. Patterson / Field Museum

Newly published findings about the phylogenetics and systematics of some previously known, as well as other yet to be identified, species of Old World leaf-nosed bats provide the first contribution to a recently launched collection of research articles intended to help scientists from across disciplines better understand potential hosts and vectors of zoonotic diseases, such as coronaviruses. Bats and pangolins are among the animals already identified as particularly potent vehicles of life-threatening viruses, including the infamous SARS-CoV-2.

The article, publicly available in the peer-reviewed scholarly journal ZooKeys, also pilots a new generation of Linked Open Data (LOD) publishing practices, invented and implemented to facilitate ongoing scientific collaborations in times of urgency like those we experience today with the COVID-19 pandemic currently ravaging across over 230 countries around the globe.

In their study, an international team of scientists, led by Dr Bruce Patterson, Field Museum’s MacArthur Curator of Mammals, points to the existence of numerous, yet to be described species of leaf-nosed bats inhabiting the biodiversity hotspots of East Africa and Southeast Asia. In order to expedite future discoveries about the identity, biology and ecology of those bats, they provide key insights into the genetics and relations within their higher groupings, as well as further information about their geographic distribution.

“Leaf-nosed bats carry coronaviruses – not the strain that’s affecting humans right now, but this is certainly not the last time a virus will be transmitted from a wild mammal to humans. If we have better knowledge of what these bats are, we’ll be better prepared if that happens,”

says Dr Terrence Demos, a post-doctoral researcher in Patterson’s lab and a principal author of the paper.
One of the possibly three bat species new to science, previously referred to as Hipposideros caffer or Sundevall’s leaf-nosed bat
Photo by B. D. Patterson / Field Museum

“With COVID-19, we have a virus that’s running amok in the human population. It originated in a horseshoe bat in China. There are 25 or 30 species of horseshoe bats in China, and no one can determine which one was involved. We owe it to ourselves to learn more about them and their relatives,”

comments Patterson.

In order to ensure that scientists from across disciplines (biologists, but also virologists and epidemiologists), as well as health and policy officials and decision-makers, have the scientific data and evidence at hand, Patterson and his team supplemented their research publication with a particularly valuable appendix table. There, in a conveniently organised format, everyone can access fundamental raw genetic data about each studied specimen, as well as its precise identification, origin and the natural history collection in which it is preserved. What makes those data particularly useful for researchers looking to make ground-breaking and potentially life-saving discoveries, however, is that all that information is linked to other types of data stored in various databases and repositories contributed by scientists from anywhere in the world.

Furthermore, those linked and publicly available data, or Linked Open Data (LOD), are published in specific machine-readable formats, so that they are “understandable” to computers. Thus, when researchers seek data associated with a particular specimen in the table, they can immediately access additional data stored in external repositories. Alternatively, a researcher might retrieve all pathogens extracted from tissues of specimens of a specific animal species, or from particular populations inhabiting a certain geographic range, and so on.
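The mechanics can be pictured with a toy record shaped like a row of such an appendix table: some values are plain data, while others are URIs that resolve to further records held elsewhere. All identifiers below are hypothetical placeholders, not the paper's actual data:

```python
# A toy specimen record; URI-valued fields link out to external repositories.
record = {
    "scientificName": "Hipposideros caffer",
    "country": "Kenya",
    "catalogNumber": "http://example.org/collection/FMNH-123",  # hypothetical
    "sequence": "http://example.org/genbank/XX000000",          # hypothetical
}

def linked_resources(rec):
    """Collect every field whose value is a resolvable link (the LOD part)."""
    return {k: v for k, v in rec.items()
            if isinstance(v, str) and v.startswith(("http://", "https://"))}
```

A harvesting script would then dereference each of these URIs to pull in the externally stored data.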

###

The data publication and dissemination approach piloted in this new study was developed by the science publisher and technology provider Pensoft and the digitisation company Plazi for the purposes of a special collection of research papers reporting novel findings on the biology of bats and pangolins in the scholarly journal ZooKeys. By targeting the two most likely ‘culprits’ at the root of the 2020 coronavirus outbreak, bats and pangolins, the article collection aligns with the agenda of the COVID-19 Joint Task Force, a recent call for contributions made by the Consortium of European Taxonomic Facilities (CETAF), the Distributed System for Scientific Collections (DiSSCo) and the Integrated Digitized Biocollections (iDigBio).

###

Original source:

Patterson BD, Webala PW, Lavery TH, Agwanda BR, Goodman SM, Kerbis Peterhans JC, Demos TC (2020) Evolutionary relationships and population genetics of the Afrotropical leaf-nosed bats (Chiroptera, Hipposideridae). ZooKeys 929: 117-161. https://doi.org/10.3897/zookeys.929.50240

Plazi and Pensoft join forces to let biodiversity knowledge of coronaviruses hosts out

Pensoft’s flagship journal ZooKeys invites free-to-publish research on key biological traits of potential hosts and vectors of SARS-like viruses; Plazi harvests and brings together all relevant data from legacy literature into a reliable FAIR-data repository

To bridge the huge knowledge gaps in the understanding of how and which animal species successfully transmit life-threatening diseases to humans, thereby paving the way for global health emergencies, scholarly publisher Pensoft and literature digitisation provider Plazi join efforts, expertise and high-tech infrastructure. 

By using the advanced text- and data-mining tools and semantic publishing workflows they have developed, the long-standing partners are to rapidly publish easy-to-access and reusable biodiversity research findings and data, related to hosts or vectors of the SARS-CoV-2 or other coronaviruses, in order to provide the stepping stones needed to manage and prevent similar crises in the future.

Already, there is plenty of evidence pointing to certain animals, including pangolins, bats, snakes and civets, as hosts of viruses like SARS-CoV-2 (coronaviruses) and, hence, potential triggers of global health crises such as the currently ravaging coronavirus pandemic. However, scientific research on which biological and behavioural specifics of those species make them particularly successful vectors of zoonotic diseases is surprisingly scarce. Even worse, the little that science ‘knows’ today is often locked behind paywalls and copyright laws, or simply ‘trapped’ in formats inaccessible to the text and data mining performed by search algorithms.

This is why ZooKeys, Pensoft’s flagship zoological open-access, peer-reviewed journal, recently announced an upcoming special issue, titled “Biology of pangolins and bats”, inviting research papers on the biological traits and behavioural features that make, or could make, bats and pangolins efficient vectors of zoonotic diseases. Research Ideas and Outcomes (RIO Journal), another open-science innovation champion in Pensoft’s portfolio, has launched a further free-to-publish collection for early and/or brief outcomes of research devoted to SARS-like viruses.

Due to the expedited peer review and publication processes at ZooKeys, the articles will rapidly be made public and accessible to scientists, decision-makers and other experts, who could then build on the findings and eventually come up with effective measures for the prevention and mitigation of future zoonotic epidemics. To further facilitate the availability of such critical research, ZooKeys is waiving the publication charges for accepted papers.

Meanwhile, the literature digitisation provider Plazi is deploying its text- and data-mining expertise and tools to locate and acquire publications related to hosts of coronaviruses – such as those expected in the upcoming “Biology of pangolins and bats” special issue in ZooKeys – and deposit them in a newly formed Coronavirus-Host Community, a repository hosted on the Zenodo platform. There, all publications will be granted persistent open access and enhanced with taxonomy-specific data derived from their sources. Contributions to Plazi can be made at various levels: from suggesting articles to be added to the Zotero public bibliographic libraries on virus–host associations and host taxonomy, to helping convert those articles into findable, accessible, interoperable and reusable (FAIR) knowledge.

Pensoft’s and Plazi’s collaboration once again aligns with the efforts of the biodiversity community, after the natural science collections consortium DiSSCo (Distributed System of Scientific Collections) and the Consortium of European Taxonomic Facilities (CETAF) recently announced the COVID-19 Task Force, with the aim of creating a network of taxonomists, collection curators and other experts from around the globe.

Fifteen years & 20 million insects later: Sweden’s impressive effort to document its insect fauna in a changing world

The Swedish Malaise Trap Project (SMTP) was launched in 2003 with the aim of making a complete list of the insect diversity of Sweden. Over the past fifteen years, an estimated total of 20 million insects, collected during the project, have been processed for scientific study. Recently, the team behind this effort published the resulting inventory in the open-access journal Biodiversity Data Journal. In their paper, they also document the project all the way from its inception to its current status by reporting on its background, organisation, methodology and logistics.

The SMTP deployed a total of 73 Malaise traps – a Swedish invention designed to capture flying insects – and placed them across the country, where they remained from 2003 to 2006. Subsequently, the samples were sorted by a dedicated team of staff, students and volunteers into over 300 groups of insects ready for further study by expert entomologists. Today, this material can be considered a unique timestamp of the Swedish insect fauna and an invaluable source of baseline data, which is especially relevant as reports of terrifying insect declines keep making headlines across the world.

Dave Karlsson, the first author and Project Manager of the SMTP, began his academic paper on the project’s results years ago by compiling the various tips, tricks, lessons and stories he had accumulated in the role. Fun examples include the time one of the Malaise traps was destroyed by a bull moose rubbing his antlers against it, or when another trap was attacked and eaten by a group of 20 reindeer. The project even had a trap taken out by Sweden’s military! By sharing the details of the project, Karlsson hopes to inspire and encourage similar efforts around the globe.

“Animals were not as kind to our traps as humans,” recall the scientists behind the project. One of the Malaise traps, located in the Brännbergets Nature Reserve in Västerbotten, was destroyed by a bull moose rubbing his antlers against it.
Photo by Anna Wenngren

Karlsson has worked with and trained dozens of workers in the SMTP lab over the past decade and a half. Some were paid staff, some were enthusiastic volunteers and a good number were researchers and students using SMTP material for projects and theses. Thus, he witnessed first-hand how much excitement and enthusiasm the work on insect samples under a microscope can generate, even in those who had been hesitant about “bugs” at first.

Stressing the benefits of traditional morphological approaches to inventory work, he says: “Appreciation for nature is something you miss when you go ‘hi-tech’ with inventory work. We have created a unique resource for specialists in our sorted material while fostering a passion for natural history.”

Sorted SMTP material is now available to specialists. Hundreds of thousands of specimens have already been handed over to experts, resulting in over 1,300 species newly added to the Swedish fauna. A total of 87 species have been recognised as new to science from the project thus far, while hundreds more await description.

The SMTP is part of the Swedish Taxonomy Initiative, from which it also receives its funding. The latter, in turn, is a project of the Swedish Species Information Center: a ground-breaking initiative funded by the Swedish Parliament since 2002 with the aim of documenting all multicellular life in Sweden.

The SMTP is based at Station Linné, a field station named after the famous Swedish naturalist and father of taxonomy, Carl Linnaeus. Situated on the Baltic island of Öland, the station is managed by Dave Karlsson. Co-authors Emily Hartop and Mathias Jaschhof are also based at the station, while Mattias Forshage and Fredrik Ronquist (SMTP Project Co-Founder) are based at the Swedish Museum of Natural History.

###

Original source:

Karlsson D, Hartop E, Forshage M, Jaschhof M, Ronquist F (2020) The Swedish Malaise Trap Project: A 15 Year Retrospective on a Countrywide Insect Inventory. Biodiversity Data Journal 8: e47255. https://doi.org/10.3897/BDJ.8.e47255

On the edge between science & art: historical biodiversity data from Japanese “gyotaku”

The Japanese cultural art of ‘gyotaku’, which means “fish impression” or “fish rubbing”, captures accurate images of fish specimens. It has been used by recreational fishermen and artists since the Edo period. Distributional data were extracted from 261 ‘gyotaku’ rubbings for 218 individual specimens, roughly representing regional fish faunas and common fishing targets in Japan through the years. The results of the research are presented in a paper published by Japanese scientists in the open-access journal ZooKeys.

Historical biodiversity data are being obtained from museum specimens, literature, classic monographs and old photographs, yet these sources can be damaged, lost or incomplete. That brings us to the need to find additional, even if non-traditional, sources.

In Japan, many recreational fishers have recorded their memorable catches as ‘gyotaku’ (魚拓), which means fish impression or fish rubbing in English. ‘Gyotaku’ is made directly from the fish specimen and usually includes information such as the sampling date and locality, the name of the fisherman, witnesses, the fish species (frequently its local name) and the fishing tackle used. The art has existed since the late Edo period; currently, the oldest known ‘gyotaku’ material is the collection of the Tsuruoka City Library, made in 1839.

Traditionally, ‘gyotaku’ is printed using black writing ink, but over the last decades colour versions of ‘gyotaku’ have been developed and are now used for artistic and educational purposes. However, the colour prints are made purely as art and rarely include specimen data such as sampling locality and date.

With modern technological progress, it is becoming rarer and rarer for people to use ‘gyotaku’ to save their “fishing impressions”. The number of personally managed fishing-related shops is decreasing, and the numbers of original ‘gyotaku’ prints and recreational fishermen may start to decline before long.

Smartphones and cameras are significantly reducing the amount of ‘gyotaku’ being produced, while the data from the old art pieces are in danger of being lost or scattered among private collections. That is why research on existing ‘gyotaku’ as a data source is required.

A Japanese research team, led by Mr Yusuke Miyazaki, conducted multiple surveys of recreational fishing shops in different regions of Japan, spanning the country’s full latitudinal range (from subarctic to subtropical regions), in order to establish whether ‘gyotaku’ information is available throughout the country and to gather historical biodiversity data from it.

In total, 261 ‘gyotaku’ rubbings with 325 printed individual specimens were found in the targeted shops, and these data were integrated into the ‘gyotaku’ database. Distributional data for a total of 235 individuals were obtained within the study.

The observed species compositions reflected the biogeography of the regions and are representative enough to identify rare Red-listed species in particular areas. Some of the studied species are listed as endangered in national and prefectural Red Lists, which prohibit capturing, holding, receiving, transferring or otherwise interacting with the species without the prefectural governor’s permission. Given the rarity of these threatened species in some regions, ‘gyotaku’ are probably important vouchers for estimating historical population status and the factors behind decline or extinction.

“Overall, the species composition displayed in the ‘gyotaku’ approximately reflected the fish faunas of each biogeographic region. We suggest that Japanese recreational fishers may be continuing to use the ‘gyotaku’ method in addition to digital photography to record their memorable catches,” concludes the author of the research, Mr Yusuke Miyazaki.


Gyotaku rubbing from the fish store in Miyazaki Prefecture
Credit: Yusuke Miyazaki
License: CC-BY 4.0

Gyotaku rubbing of the specimen from Kanagawa found in the shop in Tokyo
Credit: Yusuke Miyazaki
License: CC-BY 4.0

###

Original source:

Miyazaki Y, Murase A (2019) Fish rubbings, ‘gyotaku’, as a source of historical biodiversity data. ZooKeys 904: 89-101. https://doi.org/10.3897/zookeys.904.47721

Data mining applied to scholarly publications to finally reveal Earth’s biodiversity

At a time when a million species are at risk of extinction, according to a recent UN report, ironically, we don’t know how many species there are on Earth, nor have we noted down all those that we have come to know on a single list. In fact, we don’t even know how many species we would have put on such a list.

Combined research from over 2,000 natural history institutions worldwide has produced an estimated ~500 million pages of scholarly publications and tens of millions of illustrations and species descriptions, comprising all we currently know about the diversity of life. However, most of it is not digitally accessible. Even if it were digital, our current publishing systems could not keep up: about 50 species are described as new to science every day, almost all of them published in plain text and PDF format, where the data cannot be mined by machines and must instead be extracted by a human. Furthermore, those publications often appear in subscription (closed-access) journals.

The Biodiversity Literature Repository (BLR), a joint project of Plazi, Pensoft and Zenodo at CERN, takes on the challenge of opening up access to the data trapped in scientific publications, and of finding out how many species we know so far, what their most important characteristics are (also referred to as descriptions or taxonomic treatments) and how they look in various images. To do so, BLR exploits the highly standardised formats and terminology typical of scientific publications to discover and extract data from text written primarily for human consumption.

By relying on state-of-the-art data mining algorithms, BLR allows for the detection, extraction and enrichment of data, including DNA sequences, specimen collecting data or related descriptions, as well as providing implicit links to their sources: collections, repositories etc. As a result, BLR is the world’s largest public domain database of taxonomic treatments, images and associated original publications.
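The standardised conventions of taxonomic writing are what make such mining tractable. For instance, a treatment typically opens with a Latin binomial followed by an authority and year; a crude, illustrative pattern (far simpler than the parsers BLR actually uses) might look like this:

```python
import re

# Genus + species epithet + authority + year,
# e.g. "Hipposideros caffer Sundevall, 1846".
BINOMIAL = re.compile(
    r"\b([A-Z][a-z]+) ([a-z]+)\s+"       # genus and species epithet
    r"([A-Z][A-Za-z'\-]+),\s*(\d{4})"    # authority name and year
)

def find_nomenclature(text):
    """Return (genus, species, authority, year) tuples found in plain text."""
    return BINOMIAL.findall(text)
```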

Once the data are available, they are immediately distributed to global biodiversity platforms, such as GBIF, the Global Biodiversity Information Facility. As of now, there are about 42,000 species whose original scientific descriptions are accessible only because of BLR.

The basic scientific principle of citing previous information allows us to trace back the history of a particular species, to understand how knowledge about it grew over time, and even whether and how its name has changed through the years. As a result, this service is one avenue to uncovering the catalogue of life by means of simple lookups.

So far, the lessons learned have led to the development of TaxPub, an extension of the United States National Library of Medicine Journal Tag Suite, and its application in a new class of 26 scientific journals. As a result, the data associated with articles in these journals are machine-accessible from the beginning of the publishing process. Thus, as soon as a paper comes out, its data are automatically added to GBIF.
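To give a flavour of what machine-accessible means here: with treatments marked up in XML, a script can address them directly instead of parsing prose. The fragment below uses simplified element names loosely modelled on TaxPub's treatment markup (real TaxPub elements are namespaced and far richer):

```python
import xml.etree.ElementTree as ET

# Simplified markup loosely modelled on TaxPub's treatment structure.
DOC = """
<article>
  <treatment>
    <taxon-name>Hipposideros caffer</taxon-name>
    <status>previously described</status>
  </treatment>
</article>
"""

def list_treatments(xml_text):
    """Pull each treatment's taxon name out of the markup."""
    root = ET.fromstring(xml_text)
    return [t.findtext("taxon-name") for t in root.iter("treatment")]
```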

While BLR is expected to open up millions of scientific illustrations and descriptions, the system is unique in that it makes all the extracted data findable, accessible, interoperable and reusable (FAIR), as well as open to anybody, anywhere, at any time. Most of all, its purpose is to create a novel way to access scientific literature.

To date, BLR has extracted ~350,000 taxonomic treatments and ~200,000 figures from over 38,000 publications. This includes the descriptions of 55,800 new species, 3,744 new genera, and 28 new families. BLR has contributed to the discovery of over 30% of the ~17,000 species described annually.

Prof. Lyubomir Penev, founder and CEO of Pensoft says,

“It is such a great satisfaction to see how the development process of the TaxPub standard, started by Plazi some 15 years ago and implemented as a routine publishing workflow at Pensoft’s journals in 2010, has now resulted in an entire infrastructure that allows automated extraction and distribution of biodiversity data from various journals across the globe. With the recent announcement from the Consortium of European Taxonomic Facilities (CETAF) that their European Journal of Taxonomy is joining the TaxPub club, we are even more confident that we are paving the right way to fully grasping the dimensions of the world’s biodiversity.”

Dr Donat Agosti, co-founder and president of Plazi, adds:

“Finally, information technology allows us to create a comprehensive, extended catalogue of life and bring to light this huge corpus of cultural and scientific heritage – the description of life on Earth – for everybody. The nature of taxonomic treatments as a network of citations and syntheses of what scientists have discovered about a species allows us to link distinct fields such as genomics and taxonomy to specimens in natural history museums.”

Dr Tim Smith, Head of Collaboration, Devices and Applications Group at CERN, comments:

“Moving the focus away from the papers, where concepts are communicated, to the concepts themselves is a hugely significant step. It enables BLR to offer a unique new interconnected view of the species of our world, where the taxonomic treatments, their provenance, histories and their illustrations are all linked, accessible and findable. This is inspirational for the digital liberation of other fields of study!”

###

Additional information:

BLR is a joint project led by Plazi in partnership with Pensoft and Zenodo at CERN.

Currently, BLR is supported by a grant from Arcadia, a charitable fund of Lisbet Rausing and Peter Baldwin.