Mining nature’s knowledge: turning text into data

Guest blog post by Joseph Cornelius, Harald Detering, Oscar Lithgow-Serrano, Donat Agosti, Fabio Rinaldi, and Robert M Waterhouse

In a new study, scientists use natural language processing (NLP), a type of artificial intelligence that helps computers interpret human language, to gather key information about arthropods (creatures such as insects, spiders, and crustaceans) from the large and growing body of scientific papers. The research focuses on finding details in published texts about how these animals live and interact with their environment. The team built a system that automatically reads articles and pulls out useful data, and applied it to thousands of them. The method not only helps us learn more about the variety of life on Earth, but also supports efforts to address environmental challenges by making important biological information easier to access.

Figure: illustration of species literature feeding data on arthropod traits into a database, linking researchers and the community. Caption: Mining the literature to identify species, their traits, and associated values.

The challenge

Scientific literature contains vast amounts of essential data about species—like what arthropods eat, where they live, and how big they are. However, this information is often trapped in hard-to-access files and old publications, making large-scale analysis almost impossible. So how can we convert these pages into usable data?

The goal

The team set out to develop an automatic text‑mining system that uses Natural Language Processing (NLP) and machine learning to scan thousands of biology papers and extract structured information about insects and other arthropods. The aim was to build a database linking species names with traits such as “leg length”, “forest habitat”, or “predator”.

How it works in practice

Collect curated vocabularies of terms to be searched for in the texts:

~1 million species names from the Catalogue of Life
390 traits, categorised into feeding ecology, habitat, and morphology
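
To make the idea of searching texts for vocabulary terms concrete, here is a minimal Python sketch of dictionary lookup. It is an illustration only, not the authors' implementation: the tiny `VOCAB` dictionary, the `find_terms` helper, and the example sentence are all hypothetical stand-ins for the real vocabularies of ~1 million species names and 390 traits.

```python
import re

# Hypothetical mini-vocabularies (assumption); the real ones are far larger.
VOCAB = {
    "species": ["Apis mellifera", "Drosophila melanogaster"],
    "trait": ["body length", "predator", "forest"],
}

def build_pattern(terms):
    # Put longer terms first so a long phrase wins over any shorter term inside it.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

PATTERNS = {label: build_pattern(terms) for label, terms in VOCAB.items()}

def find_terms(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group(0)))
    return sorted(hits)

sentence = "Apis mellifera workers have a body length of 12 mm."
matches = find_terms(sentence)
for start, end, label, term in matches:
    print(f"{label:8s} {start:3d}-{end:3d} {term}")
```

Exact lookup like this only finds terms spelled as they appear in the vocabulary, so synonyms, outdated names, and variant phrasing are missed.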

Create “Gold‑standard” data needed to train language models:

Experts manually annotated 25 papers—labelling species, traits, values, and their links—to use as a training benchmark

Train NLP models so they “learn” which are the terms of interest:

Named‑Entity Recognition using BioBERT for identifying species, trait, and value words or phrases in the texts
Relation Extraction using LUKE to link the words/phrases, e.g. “this species has this trait” and “this trait has this value”
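
As a rough illustration of what Named‑Entity Recognition output looks like, the sketch below groups per-token labels into species, trait, and value phrases. It assumes the widely used BIO tagging convention (B = beginning, I = inside, O = outside); the post does not say which scheme the team used, and the `decode_bio` helper, the tokens, and the tags are hypothetical examples rather than real model output.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (label, phrase) entities."""
    entities = []
    current_label, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if there is one.
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tags, and stray "I-" tags with no matching start, end the entity.
            flush()
            current_label, current_tokens = None, []
    flush()
    return entities

# Hypothetical tokens and tags, standing in for the output of a trained model.
tokens = ["Apis", "mellifera", "has", "a", "body", "length", "of", "12", "mm"]
tags = ["B-SPECIES", "I-SPECIES", "O", "O", "B-TRAIT", "I-TRAIT", "O", "B-VALUE", "I-VALUE"]

entities = decode_bio(tokens, tags)
print(entities)
```

Relation extraction then takes pairs of such entities and decides whether they are linked, for example whether “body length” belongs to “Apis mellifera”.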

Automated extraction of words/phrases and their links:

Processed 2,000 open‑access papers from PubMed Central
Identified ~656,000 entities (species, traits, values) and ~339,000 links between them

Publish results in an open searchable online resource:

Developed ArTraDB, an interactive web database where users can search, view, and visualise species‑trait pairs and full species‑trait‑value triples
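
The difference between species‑trait pairs and full species‑trait‑value triples can be sketched as a small join over extracted links. This is a hypothetical data model for illustration, not ArTraDB's actual schema: the `Link` class, the relation names `has_trait` and `has_value`, the document identifiers, and the example records are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    doc_id: str    # hypothetical document identifier
    head: str      # species (for has_trait) or trait (for has_value)
    tail: str      # trait (for has_trait) or value (for has_value)
    relation: str  # "has_trait" or "has_value" (assumed names)

def build_triples(links):
    """Join species-trait and trait-value links from the same document."""
    values_by_trait = defaultdict(list)
    for link in links:
        if link.relation == "has_value":
            values_by_trait[(link.doc_id, link.head)].append(link.tail)

    triples = []
    for link in links:
        if link.relation == "has_trait":
            # A pair with no linked value stays a pair: its value is None.
            values = values_by_trait.get((link.doc_id, link.tail)) or [None]
            for value in values:
                triples.append((link.head, link.tail, value))
    return triples

# Hypothetical example links.
links = [
    Link("doc-1", "Apis mellifera", "body length", "has_trait"),
    Link("doc-1", "body length", "12 mm", "has_value"),
    Link("doc-2", "Araneus diadematus", "predator", "has_trait"),
]

triples = build_triples(links)
for species, trait, value in triples:
    print(species, "|", trait, "|", value)
```

Joining on the document and the trait name is a deliberate simplification. A real system would more likely join on the individual mention, so that two values reported for the same trait in one paper are not mixed up.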

Text-mining is a conceptually and computationally challenging task.

What is needed for the next steps

Annotation complexity: Even experts struggled to agree on where a term begins and ends and on the precise relationships between terms, underscoring the need for clearer guidelines and more training examples to improve the performance of the models
Gaps in the vocabularies of terms: Many terms went unrecognised because of missing synonyms, outdated species names, and variations in phrasing. Expanding the vocabularies will improve the ability to find species, traits, and values
Community curation: Planned features in ArTraDB will allow scientists and citizen curators to improve annotations, helping retrain and refine the models over time
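
The annotation-complexity point above can be illustrated with a small agreement calculation between two annotators. This sketch is not the evaluation used in the study: it assumes annotations are stored as (start, end, label) character spans and compares a strict score (identical boundaries) with a relaxed one (any overlap). The `span_agreement` function and the example spans are hypothetical.

```python
def span_agreement(spans_a, spans_b, relaxed=False):
    """F1-style agreement between two annotators' (start, end, label) spans."""
    def is_match(a, b):
        if a[2] != b[2]:
            return False  # labels must agree in both modes
        if relaxed:
            return a[0] < b[1] and b[0] < a[1]  # spans overlap
        return a[0] == b[0] and a[1] == b[1]    # identical boundaries

    if not spans_a or not spans_b:
        return 0.0
    matched_a = sum(any(is_match(a, b) for b in spans_b) for a in spans_a)
    matched_b = sum(any(is_match(a, b) for a in spans_a) for b in spans_b)
    precision = matched_a / len(spans_a)
    recall = matched_b / len(spans_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations of one sentence: the annotators agree on the species
# but draw the start of the trait phrase in different places.
annotator_a = [(0, 14, "species"), (30, 41, "trait")]
annotator_b = [(0, 14, "species"), (25, 41, "trait")]

strict = span_agreement(annotator_a, annotator_b)
lenient = span_agreement(annotator_a, annotator_b, relaxed=True)
print(f"strict: {strict:.2f}, relaxed: {lenient:.2f}")
```

A large gap between the strict and relaxed scores suggests that annotators mostly agree on what to mark but not on exactly where a phrase starts or ends, which is the kind of disagreement that clearer guidelines can reduce.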

How it impacts science

Speeds up research: Scientists can find species‑trait data quickly and accurately, boosting studies in ecology, evolution, and biodiversity
Scale and scope: This semi‑automated method can eventually be extended well beyond arthropods to other species
Supports global biodiversity efforts: Enables creation of large, quantitative trait datasets essential for monitoring ecosystem changes, climate impact, and conservation strategies

Illustration of a butterfly with icons and arrows outlining key biological data: barcode, genome, distribution, nutrition, habitat, and more. — A long-term vision to connect species with knowledge about their biology.

The outcomes

This innovative work demonstrates how combining text mining, expert curation, and interactive databases can unlock centuries of biological research. It lays a scalable foundation for building robust, open-access trait databases—empowering both scientists and the public to explore the living world in unprecedented ways.

Research article:

Cornelius J, Detering H, Lithgow-Serrano O, Agosti D, Rinaldi F, Waterhouse R (2025) From literature to biodiversity data: mining arthropod organismal traits with machine learning. Biodiversity Data Journal 13: e153070. https://doi.org/10.3897/BDJ.13.e153070
