Mining nature’s knowledge: turning text into data

By using natural language processing, researchers created a reliable system that can automatically read and pull useful data from thousands of articles.

Guest blog post by Joseph Cornelius, Harald Detering, Oscar Lithgow-Serrano, Donat Agosti, Fabio Rinaldi, and Robert M Waterhouse

In a groundbreaking new study, scientists are using powerful computer tools to gather key information about arthropods—creatures like insects, spiders, and crustaceans—from the large and growing collection of scientific papers. The research focuses on finding details in published texts about how these animals live and interact with their environment. By using natural language processing (a type of artificial intelligence that helps computers understand human language), the team created a reliable system that can automatically read and pull useful data from thousands of articles. This innovative method not only helps us learn more about the variety of life on Earth, but also supports efforts to solve environmental challenges by making it easier to access important biological information.

Illustration depicting species literature feeding data on arthropod traits into a database, linking researchers and the community.
Mining the literature to identify species, their traits, and associated values.

The challenge

Scientific literature contains vast amounts of essential data about species—like what arthropods eat, where they live, and how big they are. However, this information is often trapped in hard-to-access files and old publications, making large-scale analysis almost impossible. So how can we convert these pages into usable data?

The goal

The team set out to develop an automatic text‑mining system that uses Natural Language Processing (NLP) and machine learning to scan thousands of biology papers and extract structured information about insects and other arthropods. The aim was to build a database linking species names with traits such as “leg length”, “forest habitat”, or “predator”.

How it works in practice

  1. Collect curated vocabularies of terms to be searched for in the texts:
  • ~1 million species names from the Catalogue of Life
  • 390 traits, categorised into feeding ecology, habitat, and morphology
  2. Create “Gold‑standard” data needed to train language models:
  • Experts manually annotated 25 papers (labelling species, traits, values, and their links) to use as a training benchmark
  3. Train NLP models so they “learn” which are the terms of interest:
  • Named‑Entity Recognition using BioBERT for identifying species, trait, and value words or phrases in the texts
  • Relation Extraction using LUKE to link the words/phrases, e.g. “this species has this trait” and “this trait has this value”
  4. Automated extraction of words/phrases and their links:
  • Processed 2,000 open‑access papers from PubMed Central
  • Identified ~656,000 entities (species, traits, values) and ~339,000 links between them
  5. Publish results in an open searchable online resource:
  • Developed ArTraDB, an interactive web database where users can search, view, and visualise species‑trait pairs and full species‑trait‑value triples
Text-mining is a conceptually and computationally challenging task.
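
To make the idea of species‑trait‑value triples more concrete, here is a minimal Python sketch of how extracted entities and the two kinds of links between them could be joined into rows of a table. It is not the study's code: the `Entity` class, the `build_triples` function, and the example mentions (the placeholder species name "Aus bus" and the value "2 mm") are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    ident: str  # identifier of the mention within a document
    kind: str   # "species", "trait", or "value"
    text: str   # the word or phrase found in the paper


def build_triples(entities, has_trait, has_value):
    """Join species->trait and trait->value links into (species, trait, value) rows.

    A species-trait pair with no linked value is kept, with None as its value.
    """
    by_id = {e.ident: e for e in entities}

    # Collect the value texts attached to each trait mention.
    values_of = {}
    for trait_id, value_id in has_value:
        values_of.setdefault(trait_id, []).append(by_id[value_id].text)

    triples = []
    for species_id, trait_id in has_trait:
        species = by_id[species_id].text
        trait = by_id[trait_id].text
        for value in values_of.get(trait_id, [None]):
            triples.append((species, trait, value))
    return triples


# Made-up mentions for illustration only (not data from the study).
entities = [
    Entity("e1", "species", "Aus bus"),
    Entity("e2", "trait", "leg length"),
    Entity("e3", "value", "2 mm"),
    Entity("e4", "trait", "predator"),
]
has_trait = [("e1", "e2"), ("e1", "e4")]  # "this species has this trait"
has_value = [("e2", "e3")]                # "this trait has this value"

triples = build_triples(entities, has_trait, has_value)
for row in triples:
    print(row)
```

In this toy example, "leg length" yields a full triple because it has a linked value, while "predator" stays a species‑trait pair, mirroring the two kinds of records the post says can be browsed in ArTraDB.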

What is needed for the next steps

  • Annotation complexity: Even experts struggled to agree on boundaries and precise relationships, underscoring the need for clearer guidelines and more training examples to improve the performance of the models
  • Gaps in the vocabularies of terms: Many terms went unrecognised due to missing synonyms, outdated species names, and variations in phrasing. Expanding the vocabularies will improve the ability to find species, traits, and values
  • Community curation: Planned features in ArTraDB will allow scientists and citizen curators to improve annotations, helping retrain and refine the models over time

How it impacts science

  • Speeds up research: Scientists can find species‑trait data quickly and accurately, boosting studies in ecology, evolution, and biodiversity
  • Scale and scope: This semi‑automated method can eventually be extended well beyond arthropods to other species
  • Supports global biodiversity efforts: Enables creation of large, quantitative trait datasets essential for monitoring ecosystem changes, climate impact, and conservation strategies
Illustration of a butterfly with icons and arrows outlining key biological data: barcode, genome, distribution, nutrition, habitat, and more.
A long-term vision to connect species with knowledge about their biology.

The outcomes

This innovative work demonstrates how combining text mining, expert curation, and interactive databases can unlock centuries of biological research. It lays a scalable foundation for building robust, open-access trait databases—empowering both scientists and the public to explore the living world in unprecedented ways.

Research article:

Cornelius J, Detering H, Lithgow-Serrano O, Agosti D, Rinaldi F, Waterhouse R (2025) From literature to biodiversity data: mining arthropod organismal traits with machine learning. Biodiversity Data Journal 13: e153070. https://doi.org/10.3897/BDJ.13.e153070

Brand new computer language describes organismal traits to create computable species descriptions

Describing traits with Phenoscript is like programming a computer code for how an organism looks.

The beetle species Grebennikovius basilewskyi. Numbers next to arrows indicate patterns of phenotype statements explained in the section “Phenoscript: main patterns of phenotype statements”. Arrow numbers from T1 to T5 illustrate individual body parts. See more in the research study.

One of the most beautiful aspects of Nature is the endless variety of shapes, colours and behaviours exhibited by organisms. These traits help organisms survive and find mates, like how a male peacock’s colourful tail attracts females or his wings allow him to fly away from danger. Understanding traits is crucial for biologists, who study them to learn how organisms evolve and adapt to different environments.

To do this, scientists first need to describe these traits in words, like saying a peacock’s tail is “vibrant, iridescent, and ornate”. This approach works for small studies, but when looking at hundreds or even millions of different animals or plants, it’s impossible for the human brain to keep track of everything.

Computers could help, but not even the latest AI technology is able to grasp human language to the extent needed by biologists. This hampers research significantly because, although scientists can handle large volumes of DNA data, linking this information to physical traits is still very difficult.

To solve this problem, researchers from the Finnish Museum of Natural History, Giulio Montanaro and Sergei Tarasov, along with collaborators, have created a special language called Phenoscript. This language is designed to describe traits in a way that both humans and computers can understand. Describing traits with Phenoscript is like programming a computer code for how an organism looks.

Phenoscript uses something called semantic technology, which helps computers understand the meaning behind words, much like how modern search engines know the difference between the fruit “apple” and the tech company “Apple” based on the context of your search.

“This language is still being tested, but it shows a lot of promise. As more scientists start using Phenoscript, it will revolutionise biology by making vast amounts of trait data available for large-scale studies, boosting the emerging field of phenomics,”

explains Montanaro.

In their research article, newly published in the open-access, peer-reviewed Biodiversity Data Journal, the researchers make use of the new language for the first time, as they create semantic phenotypes for four species of dung beetles from the genus Grebennikovius. Then, to demonstrate the power of the semantic approach, they apply simple semantic queries to the generated phenotypic descriptions. 
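
To give a feel for what a “simple semantic query” can do, here is a toy Python sketch. It does not use Phenoscript syntax or the ontologies from the study: the body-part hierarchy, the phenotype statements, and the function names are all invented for illustration. The point it tries to show is that, once the computer knows a clypeus is part of the head, a question about the head also finds statements about the clypeus.

```python
# Toy part-of hierarchy: each part points to the structure that contains it.
# All terms and statements below are invented for illustration.
PART_OF = {"clypeus": "head", "head": "body", "elytron": "body"}

# Phenotype statements as (species, body part, attribute, value).
STATEMENTS = [
    ("species A", "clypeus", "shape", "rounded"),
    ("species B", "clypeus", "shape", "notched"),
    ("species B", "elytron", "colour", "black"),
]


def is_within(part, whole):
    """Return True if `part` is `whole` or lies inside it according to PART_OF."""
    while part is not None:
        if part == whole:
            return True
        part = PART_OF.get(part)  # climb one level; None once the top is reached
    return False


def query(whole, attribute):
    """Find statements about `attribute` for any part located within `whole`."""
    return [
        (species, part, value)
        for species, part, attr, value in STATEMENTS
        if attr == attribute and is_within(part, whole)
    ]


head_shapes = query("head", "shape")
body_colours = query("body", "colour")
print(head_shapes)
print(body_colours)
```

A plain keyword search for “head” would miss both clypeus statements; the hierarchy is what lets the query find them.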

Finally, the team looks further ahead to modernising the way scientists work with species information. Their next aim is to integrate semantic species descriptions with the concept of nanopublications, “which encapsulates discrete pieces of information into a comprehensive knowledge graph”. As a result, data that has become part of this graph can be queried directly, thereby ensuring that it remains Findable, Accessible, Interoperable and Reusable (FAIR) through a variety of semantic resources.

***

Research paper:

Montanaro G, Balhoff JP, Girón JC, Söderholm M, Tarasov S (2024) Computable species descriptions and nanopublications: applying ontology-based technologies to dung beetles (Coleoptera, Scarabaeinae). Biodiversity Data Journal 12: e121562. https://doi.org/10.3897/BDJ.12.e121562

***

This study is the latest addition to the special topical collection “Linking FAIR biodiversity data through publications: The BiCIKL approach”, launched and supported by the recently concluded Horizon 2020 project Biodiversity Community Integrated Knowledge Library (BiCIKL). The collection aims to bring together scientific publications that demonstrate the advantages of, and novel approaches to, accessing and (re-)using linked biodiversity data.

***

What expert recommendations did the BiCIKL consortium give to policy makers and research funders to ensure that biodiversity data is FAIR, linked, open and, indeed, future-proof? Find out in the blog post summarising key lessons learnt from the Horizon 2020 project.


Pensoft collaborates with R Discovery to elevate research discoverability

Pensoft and R Discovery’s innovative connection aims to change the way researchers find academic articles.

Leading scholarly publisher Pensoft has announced a strategic collaboration with R Discovery, the AI-powered research discovery platform by Cactus Communications, a renowned science communications and technology company. This partnership aims to revolutionize the accessibility and discoverability of research articles published by Pensoft, making them more readily available on R Discovery to the platform’s more than three million researchers across the globe.

R Discovery, acclaimed for its advanced algorithms and an extensive database boasting over 120 million scholarly articles, empowers researchers with intelligent search capabilities and personalized recommendations. Through its innovative Reading Feed feature, R Discovery delivers tailored suggestions in a format reminiscent of social media, identifying articles based on individual research interests. This not only saves time but also keeps researchers updated with the latest and most relevant studies in their field.

Open Science is much more than cost-free access to research output.

Lyubomir Penev

One of R Discovery’s standout features is its ability to provide paper summaries, audio readings, and language translation, enabling users to quickly assess a paper’s relevance and enhance their research reading experience significantly.

With over 2.5 million app downloads and upwards of 80 million journal articles featured, the R Discovery database is one of the largest scholarly content repositories.

“At Pensoft, we do realise that Open Science is much more than cost-free access to research outputs. It is also about easier discoverability and reusability, or, in other words, how likely it is for the reader to come across a particular scientific publication and, as a result, cite and build on those findings in their own studies. By feeding the content of our journals into R Discovery, we’re further facilitating the discoverability of the research done and shared by the authors who trust us with their work,” said ARPHA’s and Pensoft’s founder and CEO Prof. Lyubomir Penev.

Abhishek Goel, Co-Founder and CEO of Cactus Communications, commented on the collaboration: “We are delighted to work with Pensoft and offer researchers easy access to the publisher’s high-quality research articles on R Discovery. This is a milestone in our quest to support academia in advancing open science that can help researchers improve the world.”

So far, R Discovery has established partnerships with over 20 publishers, enhancing the platform’s extensive repository of scholarly content. By joining forces with R Discovery, Pensoft solidifies its dedication to making scholarly publications from its open-access, peer-reviewed journal portfolio easily discoverable and accessible.

Beware of scientific scams! Tips to avoid predatory publishing in biological journals

Predatory publishing has been growing exponentially, with severe consequences for society and the environment.

Guest blog post by Cássio Cardoso Pereira, Gabriela França Fernandes, and Walisson Kenedy Siqueira

We are bombarded day and night with slot-machine invitations from journals, books, and events such as congresses and lectures. Predatory publishing has reached alarming levels in biology, which is why we published an editorial in the journal Neotropical Biology and Conservation to alert the community, show the modus operandi of these publishers, and pass on good practices so that researchers, especially beginners, can escape this trap.

Piggybacking on the open access movement, numerous predatory publishers have emerged in search of easy profits. These cybercriminals take advantage of the publish-or-perish culture without providing any information about their peer-review protocols, concerned not with the scientific, bibliographic, or ethical aspects of publishing, but with the money received from authors.

The number of predatory publishers has grown exponentially in recent years and spread across all areas of knowledge, including biology. It is a common practice of these journals, often with an equally fake editorial staff, to send electronic invitations to potential authors to publish articles. These invitations are often facilitated by initial screenings of the emails of corresponding authors available on the internet. The emailed invitations from the supposed editors often stress that the author’s work is sound and, since it has already gone through the scrutiny of the editorial board, requires only the payment of a fee to publish it, with no need for further peer review.

Invitations to join the editorial board of these journals are also frequent, mostly intended to take advantage of the scientists’ prestige. Instead of editing articles, these invited editors are used as poster boys, i.e., they have their names published on the journal’s website, thus attracting unsuspecting authors to submit their manuscripts.

These journals are generally not included in the Directory of Open Access Journals (DOAJ) and are not indexed in the main bibliometric databases, such as Google Scholar, SciELO, Scopus, and Web of Science, for the simple reason that they do not meet their inclusion criteria. The websites of these journals often have little information about the editorial board, have a fake International Standard Serial Number (ISSN), lack transparency regarding their scope, provide no indication of a policy of retraction, have no transparency regarding copyright transfer, and provide very vague contact information, often omitting the address of the journal’s office.

In addition to papers, there are also invitations to publish books and book chapters with fake International Standard Book Numbers (ISBNs) and dubious editorial boards. There is also a flood of invitations to predatory meetings, such as online conferences, symposia, workshops, and lectures. These often have websites that are equally confusing and never linked to a university or a postgraduate program. Above all, one should consult advisors, supervisors, or senior colleagues about the invitation and the sender’s academic reputation. In any case, one must look beyond citation metrics and, more importantly, check the publication’s editorial board, ISSN or ISBN, contact information, and relationships with recognized institutions.

When we analyze the impacts of predatory publishing on the scientific community, the worst problems are:

  • the dissemination of erroneous information about scientific problems of interest
  • the facilitation of plagiarism
  • the waste of public resources intended for publication
  • the appointment of researchers at universities and research institutes based on curricula full of doubtful publications, generating negative cascading effects that undermine higher education as a whole.

The damage done to society can be even worse. Governments, large companies, and decision-makers can be misled by false information, resulting in attitudes that undermine responses to major human problems such as climate change, biodiversity loss, and pandemics.

Efforts to fight predatory publishers require collaboration and support at higher levels. Governments need to create regulatory agencies that carefully and systematically evaluate the activities carried out by scientific journals. Science funding agencies should require that publication fees be paid only to publishers that adhere to an internationally recognized set of transparency and ethical rules. We need to discuss our values and incentives in the academic community, so we can start prioritizing quality over quantity. This would provide a reference point for research, help design coherent interventions, and improve information and public policy in favor of society and the environment.

Reference:

Pereira CC, Mello MAR, Negreiros D, Figueiredo JCG, Kenedy-Siqueira W, Maia LR, Fernandes S, Fernandes GFC, Ponce de Leon A, Ashworth L, Oki Y, de Castro GC, Aguilar R, Fearnside PM, Fernandes GW (2023) Beware of scientific scams! Hints to avoid predatory publishing in biological journals. Neotropical Biology and Conservation 18(2): 97-105. https://doi.org/10.3897/neotropical.18.e108887

Interoperable biodiversity data extracted from literature through open-ended queries

OpenBiodiv is a biodiversity database containing knowledge extracted from scientific literature, built as an Open Biodiversity Knowledge Management System. 

The OpenBiodiv contribution to BiCIKL

Apart from coordinating the Horizon 2020-funded project BiCIKL, scholarly publisher and technology provider Pensoft has been the engine behind what is likely to be the first production-stage semantic system to run on top of a reasonably-sized biodiversity knowledge graph.

OpenBiodiv is a biodiversity database containing knowledge extracted from scientific literature, built as an Open Biodiversity Knowledge Management System. 

As of February 2023, OpenBiodiv contains 36,308 processed articles; 69,596 taxon treatments; 1,131 institutions; 460,475 taxon names; 87,876 sequences; 247,023 bibliographic references; 341,594 author names; and 2,770,357 article sections and subsections.

In fact, OpenBiodiv is a whole ecosystem of tools and services that convert biodiversity data into Linked Open Data. The data is extracted from the text of biodiversity articles published in data-minable XML format, as in the journals published by Pensoft (e.g. ZooKeys, PhytoKeys, MycoKeys, Biodiversity Data Journal), and from other taxonomic treatments made available by Plazi through its specialised extraction workflow.

“I believe that OpenBiodiv is a good real-life example of how the outputs and efforts of a research project may and should outlive the duration of the project itself. Something that is – of course – central to our mission at BiCIKL.”

explains Prof Lyubomir Penev, BiCIKL’s Project Coordinator and founder and CEO of Pensoft.

“The basics of what was to become the OpenBiodiv database began to come together back in 2015 within the EU-funded BIG4 PhD project of Victor Senderov, later succeeded by another PhD project by Mariya Dimitrova within IGNITE. It was during those two projects that the backend Ontology-O, the first versions of RDF converters and the basic website functionalities were created,”

he adds.

By the time OpenBiodiv became one of the nine research infrastructures within BiCIKL tasked with the provision of virtual access to open FAIR data, tools and services, it had already evolved into an RDF-based biodiversity knowledge graph, equipped with a fully automated extraction and indexing workflow and user apps.

Currently, Pensoft is working at full speed on new user apps in OpenBiodiv, as the team continuously incorporates invaluable feedback and recommendations from end-users and partners at BiCIKL.

As a result, OpenBiodiv is already capable of answering open-ended queries based on the available data. To do this, OpenBiodiv discovers ‘hidden’ links between data classes, i.e. taxon names, taxon treatments, specimens, sequences, persons/authors and collections/institutions. 

Thus, the system generates new knowledge about taxa, scientific articles and their subsections, the examined materials and their metadata, localities and sequences, amongst others. Additionally, it is able to return information with a relevant visual representation about any one or a combination of those major data classes within a certain scope and semantic context.

Users can explore the database in three ways: by typing any term (even if misspelt!) into the search engine on the OpenBiodiv homepage, by integrating an Application Programming Interface (API), or by running SPARQL queries.

On the OpenBiodiv website, there is also a list of predefined SPARQL queries, which is continuously being expanded.

Sample of predefined SPARQL queries at OpenBiodiv.
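
For readers who have not used SPARQL before, the Python sketch below shows the general shape of such a request using only the standard library. It makes several assumptions: the endpoint URL is a placeholder (the real address should be taken from the OpenBiodiv website), the query is deliberately generic and assumes nothing about OpenBiodiv's ontology, and the sample response is hand-written rather than real OpenBiodiv data. No request is actually sent.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Placeholder endpoint (an assumption): replace it with the SPARQL endpoint
# documented on the OpenBiodiv website.
ENDPOINT = "https://example.org/sparql"

# Generic query that lists any five statements from a knowledge graph.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request, passing the query as a URL parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_results(payload: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON results document into one plain dict per result row."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# Hand-written sample in the standard SPARQL JSON results format.
SAMPLE = json.dumps({
    "head": {"vars": ["s", "p", "o"]},
    "results": {"bindings": [
        {
            "s": {"type": "uri", "value": "http://example.org/taxon/1"},
            "p": {"type": "uri", "value": "http://example.org/hasName"},
            "o": {"type": "literal", "value": "Example name"},
        }
    ]},
})
rows = rows_from_results(SAMPLE)
print(rows)
```

To run a query for real, the request could be passed to `urllib.request.urlopen` and the response body handed to `rows_from_results`, provided the endpoint returns results in the JSON format.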

“OpenBiodiv is an ambitious project of ours, and it’s surely one close to Pensoft’s heart, given our decades-long dedication to biodiversity science and knowledge sharing. Our previous fruitful partnerships with Plazi, BIG4 and IGNITE, as well as the current exciting and inspirational network of BiCIKL are wonderful examples of how far we can go with the right collaborators,”

concludes Prof Lyubomir Penev.


Journal publishing platform ARPHA partners with content recommendation engine TrendMD

Thanks to the new collaboration between content recommendation engine TrendMD and journal publishing platform ARPHA, readers of all journals under Pensoft’s imprint, as well as those using the white-label publishing solution provided by the platform, will be given a useful list of recommended articles related to the study they are reading. The new widget will save users a great deal of time by pointing them to the most relevant papers on the topic from across a constantly expanding network of peer-reviewed articles and research news.

With nearly 8,000 new scholarly articles published each day, it is practically impossible to stay up to date with the news from a single scientific field, let alone do cross-disciplinary research. Sifting out the quality literature is another painstaking task that no academic looks forward to. TrendMD therefore offers a sensible way to help a reader find the most relevant, high-quality studies on a particular topic. The widget’s recommendations are based on the topic a user is currently reading, the papers they have read in the past, and the articles that others with similar interests have sought out, all drawn from the most authoritative, high-quality journals in the world.

“TrendMD is excited to welcome Pensoft, a highly innovative, open access, online publishing platform, to the TrendMD network! This partnership will bring over 5,000 open access articles and books in the field of natural history, predominantly taxonomy and organismal biology, to TrendMD’s ever expanding network,” says Paul Kudlow, CEO and co-founder of TrendMD.

“In our continuing effort to develop and implement the most novel tools and workflows in academic publishing, at Pensoft we are pleased to have integrated our journal publishing platform ARPHA with the new-age scholarly innovation that is TrendMD’s tool, so that our readers have easy and constant access to the most relevant and best-quality research,” says Pensoft’s CEO and founder Prof. Lyubomir Penev.