Mining nature’s knowledge: turning text into data

By using natural language processing, researchers created a reliable system that can automatically read and pull useful data from thousands of articles.

Guest blog post by Joseph Cornelius, Harald Detering, Oscar Lithgow-Serrano, Donat Agosti, Fabio Rinaldi, and Robert M Waterhouse

In a groundbreaking new study, scientists used powerful computer tools to gather key information about arthropods—creatures like insects, spiders, and crustaceans—from the large and growing collection of scientific papers. The research focuses on finding details in published texts about how these animals live and interact with their environment. Using natural language processing (a type of artificial intelligence that helps computers understand human language), the team created a reliable system that can automatically read and pull useful data from thousands of articles. This innovative method not only helps us learn more about the variety of life on Earth, but also supports efforts to solve environmental challenges by making it easier to access important biological information.

Illustration depicting species literature feeding data on arthropod traits into a database, linking researchers and the community.
Mining the literature to identify species, their traits, and associated values.

The challenge

Scientific literature contains vast amounts of essential data about species—like what arthropods eat, where they live, and how big they are. However, this information is often trapped in hard-to-access files and old publications, making large-scale analysis almost impossible. So how can we convert these pages into usable data?

The goal

The team set out to develop an automated text‑mining system that uses Natural Language Processing (NLP) and machine learning to scan thousands of biology papers and extract structured information about insects and other arthropods. The goal: a database linking species names with traits like “leg length”, “forest habitat”, or “predator”.
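To make that target concrete, each extracted fact can be pictured as a species–trait–value triple tied to its source publication. Below is a minimal sketch in Python; the field names are illustrative, not ArTraDB’s actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraitRecord:
    """One mined species-trait-value triple (illustrative schema)."""
    species: str          # e.g. "Apis mellifera", matched against the Catalogue of Life
    trait: str            # e.g. "body length", one of the ~390 curated traits
    value: Optional[str]  # e.g. "12-15 mm"; None when only a species-trait pair is found
    source_doi: str       # the publication the triple was mined from

record = TraitRecord(
    species="Apis mellifera",
    trait="body length",
    value="12-15 mm",
    source_doi="10.3897/BDJ.13.e153070",  # illustrative: the study's own DOI
)
```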

How it works in practice

  1. Collect curated vocabularies of terms to be searched for in the texts:
  • ~1 million species names from the Catalogue of Life
  • 390 traits, categorised into feeding ecology, habitat, and morphology
  2. Create “gold‑standard” data needed to train language models:
  • Experts manually annotated 25 papers—labelling species, traits, values, and their links—to use as a training benchmark
  3. Train NLP models so they “learn” which are the terms of interest (see the sketch below):
  • Named‑Entity Recognition using BioBERT to identify species, trait, and value words or phrases in the texts
  • Relation Extraction using LUKE to link the words/phrases, e.g. “this species has this trait” and “this trait has this value”
  4. Automated extraction of words/phrases and their links:
  • Processed 2,000 open‑access papers from PubMed Central
  • Identified ~656,000 entities (species, traits, values) and ~339,000 links between them
  5. Publish results in an open, searchable online resource:
  • Developed ArTraDB, an interactive web database where users can search, view, and visualise species‑trait pairs and full species‑trait‑value triples
Text-mining is a conceptually and computationally challenging task.
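For readers curious what steps 3 and 4 look like in code, here is a minimal sketch using the Hugging Face transformers library. The checkpoints are public stand-ins, not the study’s fine-tuned models: the base BioBERT below carries an untrained tagging head (so its labels are meaningless until fine-tuned on the gold-standard annotations), and the LUKE model was tuned on the general-purpose TACRED relations rather than “has trait” / “has value”. The entity spans are hard-coded for illustration:

```python
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          LukeForEntityPairClassification, LukeTokenizer)

text = "Adults of Apis mellifera have a body length of 12-15 mm."

# Step 3a: Named-Entity Recognition. The study fine-tuned BioBERT to tag
# SPECIES / TRAIT / VALUE spans; this base checkpoint has a randomly
# initialised head, so treat it as scaffolding only.
ner_tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
ner_model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-v1.1", num_labels=7)  # B-/I- tags for 3 entity types, plus O

with torch.no_grad():
    tag_logits = ner_model(**ner_tok(text, return_tensors="pt")).logits
print(tag_logits.shape)  # (1, sequence_length, 7): one label score per token

# Step 3b: Relation Extraction. LUKE classifies the relation between two
# marked spans; the public TACRED checkpoint merely demonstrates the API.
rel_name = "studio-ousia/luke-large-finetuned-tacred"
rel_tok = LukeTokenizer.from_pretrained(rel_name)
rel_model = LukeForEntityPairClassification.from_pretrained(rel_name)

spans = [(10, 24), (32, 43)]  # character spans of "Apis mellifera" and "body length"
with torch.no_grad():
    rel_logits = rel_model(**rel_tok(text, entity_spans=spans,
                                     return_tensors="pt")).logits
print(rel_model.config.id2label[int(rel_logits.argmax(-1))])
```

In the study’s actual pipeline, both models are fine-tuned on the 25 expert-annotated papers and then run over the 2,000 PubMed Central articles, yielding the ~656,000 entities and ~339,000 links of step 4.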

What is needed for the next steps

  • Annotation complexity: Even experts struggled to agree on boundaries and precise relationships, underscoring the need for clearer guidelines and more training examples to improve the performance of the models
  • Gaps in the vocabularies of terms: Many species, traits, and values went unrecognised due to missing synonyms, outdated species names, and variations in phrasing. Expanding the vocabularies will help improve the ability to find them (see the sketch after this list)
  • Community curation: Planned features in ArTraDB will allow scientists and citizen curators to improve annotations, helping retrain and refine the models over time
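The synonym problem is easy to picture with a toy dictionary matcher: an outdated name slips past an exact-match vocabulary until the vocabulary is expanded. A minimal sketch (Apis mellifica is a real junior synonym of Apis mellifera; the one-entry synonym table is illustrative):

```python
# Toy exact-match species finder: without synonym expansion, outdated names are missed.
vocabulary = {"Apis mellifera"}                  # accepted name only
synonyms = {"Apis mellifica": "Apis mellifera"}  # outdated name -> accepted name

def find_species(sentence: str, vocab: set) -> list:
    """Return every vocabulary entry that appears verbatim in the sentence."""
    return [name for name in vocab if name in sentence]

sentence = "Workers of Apis mellifica forage within a few kilometres of the hive."
print(find_species(sentence, vocabulary))        # [] -- the synonym is missed
expanded = vocabulary | set(synonyms)            # add the synonyms to the vocabulary
print(find_species(sentence, expanded))          # ['Apis mellifica']
```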

How it impacts science

  • Speeds up research: Scientists can find species‑trait data quickly and accurately, boosting studies in ecology, evolution, and biodiversity
  • Scale and scope: This semi‑automated method can eventually be extended well beyond arthropods to other species
  • Supports global biodiversity efforts: Enables creation of large, quantitative trait datasets essential for monitoring ecosystem changes, climate impact, and conservation strategies
Illustration of a butterfly with icons and arrows outlining key biological data: barcode, genome, distribution, nutrition, habitat, and more.
A long-term vision to connect species with knowledge about their biology.

The outcomes

This innovative work demonstrates how combining text mining, expert curation, and interactive databases can unlock centuries of biological research. It lays a scalable foundation for building robust, open-access trait databases—empowering both scientists and the public to explore the living world in unprecedented ways.

Research article:

Cornelius J, Detering H, Lithgow-Serrano O, Agosti D, Rinaldi F, Waterhouse R (2025) From literature to biodiversity data: mining arthropod organismal traits with machine learning. Biodiversity Data Journal 13: e153070. https://doi.org/10.3897/BDJ.13.e153070

Data mining applied to scholarly publications to finally reveal Earth’s biodiversity

At a time when a million species are at risk of extinction, according to a recent UN report, we ironically still don’t know how many species there are on Earth, nor have we compiled all those we do know into a single list. In fact, we don’t even know how many species such a list would hold.

The combined research of over 2,000 natural history institutions worldwide has produced an estimated ~500 million pages of scholarly publications and tens of millions of illustrations and species descriptions, comprising all we currently know about the diversity of life. However, most of it isn’t digitally accessible. Even if it were digital, our current publishing systems couldn’t keep up: about 50 species are described as new to science every day, almost all of them published in plain text and PDF format, where the data cannot be mined by machines and must instead be extracted by a human. Furthermore, those publications often appear in subscription (closed-access) journals.

The Biodiversity Literature Repository (BLR), a joint project of Plazi, Pensoft, and Zenodo at CERN, takes on the challenge of opening up access to the data trapped in scientific publications: finding out how many species we know so far, what their most important characteristics are (also referred to as descriptions or taxonomic treatments), and how they look in various images. To do so, BLR exploits the highly standardised formats and terminology typical of scientific publications to discover and extract data from text written primarily for human consumption.
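Much of that discovery leans on the rigid conventions of taxonomic writing: treatment headings, for example, typically follow a “Genus species Author, Year” pattern. As a rough illustration of pattern-based extraction, here is a toy regex that is far cruder than BLR’s actual rules:

```python
import re

# Simplified pattern for a nomenclatural heading: "Genus species Author, Year",
# optionally followed by the new-species marker "sp. nov.". Real taxonomic
# headings vary far more than this; the pattern is purely illustrative.
treatment_heading = re.compile(
    r"(?P<genus>[A-Z][a-z]+)\s+"         # capitalised genus name
    r"(?P<species>[a-z]+)\s+"            # lowercase species epithet
    r"(?P<author>[A-Z][\w.\s&-]+?),\s*"  # author citation
    r"(?P<year>\d{4})"                   # year of publication
    r"(?:\s+sp\.\s*nov\.)?"              # optional new-species marker
)

# "Aus bus" is the taxonomists' placeholder name, used here as a made-up example.
line = "Aus bus Smith, 1900 sp. nov."
m = treatment_heading.search(line)
if m:
    print(m.group("genus"), m.group("species"), m.group("author"), m.group("year"))
    # -> Aus bus Smith 1900
```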

By relying on state-of-the-art data mining algorithms, BLR allows for the detection, extraction and enrichment of data, including DNA sequences, specimen collecting data or related descriptions, as well as providing implicit links to their sources: collections, repositories etc. As a result, BLR is the world’s largest public domain database of taxonomic treatments, images and associated original publications.

Once the data are available, they are immediately distributed to global biodiversity platforms, such as GBIF–the Global Biodiversity Information Facility. As of now, there are about 42,000 species whose original scientific descriptions are accessible only because of BLR.

The basic scientific principle of citing previous work allows us to trace back the history of a particular species, to understand how knowledge about it grew over time, and even whether and how its name has changed through the years. As a result, this service is one avenue towards uncovering the catalogue of life by means of simple lookups.

So far, the lessons learned have led to the development of TaxPub, an extension of the United States National Library of Medicine Journal Tag Suite, and to its application in a new class of 26 scientific journals. As a result, the data associated with articles in these journals are machine-accessible from the beginning of the publishing process: as soon as a paper comes out, its data are automatically added to GBIF.
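To see why such markup makes data machine-accessible from day one, consider how a harvester reads a TaxPub-tagged article. The fragment below is a heavily simplified sketch; the element names follow the TaxPub convention, but the namespace URI and structure shown here are illustrative, not the full schema:

```python
import xml.etree.ElementTree as ET

# Heavily simplified TaxPub-style fragment: the treatment and the taxon name
# are explicit elements, so no free-text mining is needed to locate them.
xml = """<tp:taxon-treatment xmlns:tp="http://www.plazi.org/taxpub">
  <tp:nomenclature>
    <tp:taxon-name>
      <tp:taxon-name-part taxon-name-part-type="genus">Aus</tp:taxon-name-part>
      <tp:taxon-name-part taxon-name-part-type="species">bus</tp:taxon-name-part>
    </tp:taxon-name>
  </tp:nomenclature>
</tp:taxon-treatment>"""

ns = {"tp": "http://www.plazi.org/taxpub"}
root = ET.fromstring(xml)
parts = root.findall(".//tp:taxon-name-part", ns)
print(" ".join(p.text for p in parts))  # -> "Aus bus" (a placeholder name)
```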

While BLR is expected to open up millions of scientific illustrations and descriptions, the system is unique in that it makes all the extracted data findable, accessible, interoperable and reusable (FAIR), as well as open to anybody, anywhere, at any time. Most of all, its purpose is to create a novel way to access scientific literature.

To date, BLR has extracted ~350,000 taxonomic treatments and ~200,000 figures from over 38,000 publications. This includes the descriptions of 55,800 new species, 3,744 new genera, and 28 new families. BLR has contributed to the discovery of over 30% of the ~17,000 species described annually.

Prof. Lyubomir Penev, founder and CEO of Pensoft, says:

“It is such a great satisfaction to see how the development process of the TaxPub standard, started by Plazi some 15 years ago and implemented as a routine publishing workflow at Pensoft’s journals in 2010, has now resulted in an entire infrastructure that allows automated extraction and distribution of biodiversity data from various journals across the globe. With the recent announcement from the Consortium of European Taxonomic Facilities (CETAF) that their European Journal of Taxonomy is joining the TaxPub club, we are even more confident that we are paving the right way to fully grasping the dimensions of the world’s biodiversity.”

Dr Donat Agosti, co-founder and president of Plazi, adds:

“Finally, information technology allows us to create a comprehensive, extended catalogue of life and bring to light this huge corpus of cultural and scientific heritage – the description of life on Earth – for everybody. The nature of taxonomic treatments as a network of citations and syntheses of what scientists have discovered about a species allows us to link distinct fields such as genomics and taxonomy to specimens in natural history museums.”

Dr Tim Smith, Head of Collaboration, Devices and Applications Group at CERN, comments:

“Moving the focus away from the papers, where concepts are communicated, to the concepts themselves is a hugely significant step. It enables BLR to offer a unique new interconnected view of the species of our world, where the taxonomic treatments, their provenance, histories and their illustrations are all linked, accessible and findable. This is inspirational for the digital liberation of other fields of study!”

###

Additional information:

BLR is a joint project led by Plazi in partnership with Pensoft and Zenodo at CERN.

Currently, BLR is supported by a grant from Arcadia, a charitable fund of Lisbet Rausing and Peter Baldwin.

FAIR biodiversity data in Pensoft journals thanks to a routine data auditing workflow

Data audit workflow provided for data papers submitted to Pensoft journals.

To avoid the publication of openly accessible yet unusable datasets, fated to result in irreproducible and inoperable biodiversity research somewhere down the road, Pensoft audits the data described in data paper manuscripts upon their submission to applicable journals in the publisher’s portfolio, including Biodiversity Data Journal, ZooKeys, PhytoKeys, MycoKeys, and many others.

Once the dataset is clean and the paper is published, biodiversity data, such as taxa, occurrence records, observations, specimens and related information, become FAIR (findable, accessible, interoperable and reusable), so that they can be merged, reformatted and incorporated into novel and visionary projects, regardless of whether they are accessed by a human researcher or a data-mining computation.

As part of the pre-review technical evaluation of a data paper submitted to a Pensoft journal, the associated datasets are subjected to a data audit meant to identify any issues that could make the data inoperable. This check is conducted regardless of whether the datasets are provided as supplementary material within the data paper manuscript or linked from the Global Biodiversity Information Facility (GBIF) or another external repository. The features that undergo the audit can be found in a data quality checklist made available on the website of each journal, alongside key recommendations for submitting authors.

Once the check is complete, the submitting author receives an audit report with improvement recommendations, much like the comments they would receive after the peer review stage of the data paper. If there are major issues with the dataset, the data paper can be rejected prior to assignment to a subject editor, but may be resubmitted once the necessary corrections are applied. At this step, authors who have already published their data via an external repository are also reminded to correct those data accordingly.

“It all started back in 2010, when we joined forces with GBIF on a quite advanced idea in the domain of biodiversity: a data paper workflow as a means to recognise both the scientific value of rich metadata and the efforts of the data collectors and curators. Together we figured that those data could be published most efficiently as citable academic papers,” says Pensoft’s founder and Managing director Prof. Lyubomir Penev.
“From there, with the kind help and support of Dr Robert Mesibov, the concept evolved into a data audit workflow, meant to ‘proofread’ the data in those data papers the way a copy editor would go through the text,” he adds.
“The data auditing we do is not a check on whether a scientific name is properly spelled, or a bibliographic reference is correct, or a locality has the correct latitude and longitude”, explains Dr Mesibov. “Instead, we aim to ensure that there are no broken or duplicated records, disagreements between fields, misuses of the Darwin Core recommendations, or any of the many technical issues, such as character encoding errors, that can be an obstacle to data processing.”
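A flavour of what such an audit catches, as a minimal sketch over a Darwin Core occurrence table. The field names (occurrenceID, decimalLatitude, decimalLongitude) are standard Darwin Core terms, but the checks shown are a small illustrative subset, not Pensoft’s actual checklist:

```python
import csv

def audit_occurrences(path: str) -> list:
    """Flag duplicate records, missing identifiers, and impossible coordinates."""
    problems, seen_ids = [], set()
    with open(path, newline="", encoding="utf-8") as fh:
        # start=2 so reported row numbers match the spreadsheet (row 1 = header)
        for row_no, row in enumerate(csv.DictReader(fh), start=2):
            occ_id = (row.get("occurrenceID") or "").strip()
            if not occ_id:
                problems.append(f"row {row_no}: missing occurrenceID")
            elif occ_id in seen_ids:
                problems.append(f"row {row_no}: duplicate occurrenceID {occ_id!r}")
            seen_ids.add(occ_id)
            try:
                lat = float(row.get("decimalLatitude") or "")
                lon = float(row.get("decimalLongitude") or "")
                if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                    problems.append(f"row {row_no}: coordinates out of range")
            except ValueError:
                problems.append(f"row {row_no}: non-numeric coordinates")
    return problems

# Usage: print one line per problem found in an occurrence CSV export.
# for issue in audit_occurrences("occurrences.csv"):
#     print(issue)
```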

At Pensoft, the publication of data that are open, easy to find, access, re-use, and archive is seen as a crucial responsibility of researchers aiming to deliver high-quality, viable scientific output intended to stand the test of time and serve the public good.

CASE STUDY: Data audit for the “Vascular plants dataset of the COFC herbarium (University of Cordoba, Spain)”, a data paper in PhytoKeys

To explain how and why biodiversity data should be published in full compliance with the best (open) science practices, the team behind Pensoft and long-standing collaborators published a guidelines paper, titled “Strategies and guidelines for scholarly publishing of biodiversity data”, in the open science journal Research Ideas and Outcomes (RIO Journal).