Guest blog post by Joseph Cornelius, Harald Detering, Oscar Lithgow-Serrano, Donat Agosti, Fabio Rinaldi, and Robert M Waterhouse
In a groundbreaking new study, scientists are using powerful computer tools to gather key information about arthropods—creatures like insects, spiders, and crustaceans—from the large and growing collection of scientific papers. The research focuses on finding details in published texts about how these animals live and interact with their environment. By using natural language processing (a type of artificial intelligence that helps computers understand human language), the team created a reliable system that can automatically read and pull useful data from thousands of articles. This innovative method not only helps us learn more about the variety of life on Earth, but also supports efforts to solve environmental challenges by making it easier to access important biological information.

The challenge
Scientific literature contains vast amounts of essential data about species—like what arthropods eat, where they live, and how big they are. However, this information is often trapped in hard-to-access files and old publications, making large-scale analysis almost impossible. So how can we convert these pages into usable data?
The goal
The team set out to develop an automatic text-mining system that uses Natural Language Processing (NLP) and machine learning to scan thousands of biology papers and extract structured information about insects and other arthropods. The aim is a database that links species names with traits such as “leg length”, “forest habitat”, or “predator”.
How it works in practice
- Collect curated vocabularies of terms to be searched for in the texts:
  - ~1 million species names from the Catalogue of Life
  - 390 traits, categorised into feeding ecology, habitat, and morphology
- Create “gold-standard” data needed to train language models:
  - Experts manually annotated 25 papers, labelling species, traits, values, and their links, to use as a training benchmark
- Train NLP models so they “learn” which are the terms of interest:
  - Named-Entity Recognition using BioBERT for identifying species, trait, and value words or phrases in the texts
  - Relation Extraction using LUKE to link the words/phrases, e.g. “this species has this trait” and “this trait has this value”
- Automated extraction of words/phrases and their links:
  - Processed 2,000 open-access papers from PubMed Central
  - Identified ~656,000 entities (species, traits, values) and ~339,000 links between them
- Publish results in an open searchable online resource:
  - Developed ArTraDB, an interactive web database where users can search, view, and visualise species-trait pairs and full species-trait-value triples
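
To make the idea of entities and species-trait-value triples concrete, here is a minimal Python sketch. It is not the authors' pipeline: the study uses trained BioBERT and LUKE models, whereas this toy version swaps in plain dictionary lookup and a distance rule. The tiny vocabularies, the `VALUE_PATTERN` regular expression, the `MAX_GAP` threshold, and the function names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical miniature vocabularies, for illustration only. The real
# vocabularies hold ~1 million species names and 390 traits.
SPECIES = ["Araneus diadematus", "Apis mellifera"]
TRAITS = ["body length", "predator", "forest"]

# Hypothetical pattern for simple measurements such as "13 mm".
VALUE_PATTERN = re.compile(r"\d+(?:\.\d+)?\s?(?:mm|cm|m)\b")

# Hypothetical heuristic: a value is linked to a trait only if it starts
# within this many characters after the trait mention.
MAX_GAP = 10


@dataclass(frozen=True)
class Entity:
    kind: str   # "species", "trait", or "value"
    text: str   # the matched words as they appear in the sentence
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


def find_entities(sentence: str) -> List[Entity]:
    """Tag vocabulary terms and measurements by exact, case-insensitive lookup."""
    entities = []
    for kind, terms in (("species", SPECIES), ("trait", TRAITS)):
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                entities.append(Entity(kind, m.group(0), m.start(), m.end()))
    for m in VALUE_PATTERN.finditer(sentence):
        entities.append(Entity("value", m.group(0), m.start(), m.end()))
    return sorted(entities, key=lambda e: e.start)


def link_triples(entities: List[Entity]) -> List[Tuple[str, str, Optional[str]]]:
    """Pair each species with each trait in the sentence; attach a nearby value if any."""
    species = [e for e in entities if e.kind == "species"]
    traits = [e for e in entities if e.kind == "trait"]
    values = [e for e in entities if e.kind == "value"]
    triples = []
    for s in species:
        for t in traits:
            value = next(
                (v.text for v in values if 0 <= v.start - t.end <= MAX_GAP),
                None,
            )
            triples.append((s.text, t.text, value))
    return triples


sentence = "Araneus diadematus is a predator with a body length of 13 mm."
for triple in link_triples(find_entities(sentence)):
    print(triple)
```

For the example sentence, the sketch yields a species-trait pair for “predator” with no value, and a full triple linking “body length” to “13 mm”. Exact lookup like this misses synonyms and varied phrasing, which is one reason the study trains language models rather than relying on matching alone.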

What is needed for the next steps
- Annotation complexity: Even experts struggled to agree on where a term begins and ends in the text and on the precise relationships between terms, underscoring the need for clearer guidelines and more training examples to improve the performance of the models
- Gaps in the vocabularies of terms: Many terms went unrecognised due to missing synonyms, outdated species names, and variations in phrasing. Expanding the vocabularies will improve the ability to find species, traits, and values
- Community curation: Planned features in ArTraDB will allow scientists and citizen curators to improve annotations, helping retrain and refine the models over time
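
The annotation difficulty noted above can be illustrated with a small Python sketch that compares two annotators' labels for the same text. This is one common way to quantify agreement, not necessarily the measure used in the study. The span format `(start, end, label)`, the function names, and the example offsets are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def exact_agreement(a: Iterable[Span], b: Iterable[Span]) -> float:
    """F1-style agreement where spans must match in offsets and label."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def overlaps(x: Span, y: Span) -> bool:
    """True if two spans share a label and at least one character."""
    return x[2] == y[2] and x[0] < y[1] and y[0] < x[1]


def relaxed_agreement(a: Iterable[Span], b: Iterable[Span]) -> float:
    """Agreement where any same-label overlap counts as a match."""
    a, b = list(a), list(b)
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    matched_a = sum(any(overlaps(x, y) for y in b) for x in a)
    matched_b = sum(any(overlaps(y, x) for x in a) for y in b)
    return (matched_a + matched_b) / total


# Hypothetical annotations of the phrase "adult body length": one annotator
# marks only "body length", the other marks the whole phrase. Both agree on
# a species mention elsewhere in the text.
annotator_1 = [(6, 17, "trait"), (30, 48, "species")]
annotator_2 = [(0, 17, "trait"), (30, 48, "species")]

print(exact_agreement(annotator_1, annotator_2))
print(relaxed_agreement(annotator_1, annotator_2))
```

In this example the two annotators agree on only half of their spans under exact matching but on all of them under relaxed matching, which shows how much of the disagreement can come from boundaries alone.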
How it impacts science
- Speeds up research: Scientists can find species‑trait data quickly and accurately, boosting studies in ecology, evolution, and biodiversity
- Scale and scope: This semi-automated method can eventually be extended well beyond arthropods to other groups of organisms
- Supports global biodiversity efforts: Enables creation of large, quantitative trait datasets essential for monitoring ecosystem changes, climate impact, and conservation strategies

The outcomes
This innovative work demonstrates how combining text mining, expert curation, and interactive databases can unlock centuries of biological research. It lays a scalable foundation for building robust, open-access trait databases—empowering both scientists and the public to explore the living world in unprecedented ways.
Research article:
Cornelius J, Detering H, Lithgow-Serrano O, Agosti D, Rinaldi F, Waterhouse R (2025) From literature to biodiversity data: mining arthropod organismal traits with machine learning. Biodiversity Data Journal 13: e153070. https://doi.org/10.3897/BDJ.13.e153070