Brand new computer language describes organismal traits to create computable species descriptions

Describing traits with Phenoscript is like programming a computer code for how an organism looks.

The beetle species Grebennikovius basilewskyi. Numbers next to arrows indicate patterns of phenotype statements explained in the section “Phenoscript: main patterns of phenotype statements”. Arrow numbers from T1 to T5 illustrate individual body parts. See more in the research study.

One of the most beautiful aspects of Nature is the endless variety of shapes, colours and behaviours exhibited by organisms. These traits help organisms survive and find mates, like how a male peacock’s colourful tail attracts females or his wings allow him to fly away from danger. Understanding traits is crucial for biologists, who study them to learn how organisms evolve and adapt to different environments.

To do this, scientists first need to describe these traits in words, like saying a peacock’s tail is “vibrant, iridescent, and ornate”. This approach works for small studies, but when looking at hundreds or even millions of different animals or plants, it’s impossible for the human brain to keep track of everything.

Computers could help, but not even the latest AI technology is able to grasp human language to the extent needed by biologists. This hampers research significantly because, although scientists can handle large volumes of DNA data, linking this information to physical traits is still very difficult.

To solve this problem, researchers from the Finnish Museum of Natural History, Giulio Montanaro and Sergei Tarasov, along with collaborators, have created a special language called Phenoscript. This language is designed to describe traits in a way that both humans and computers can understand. Describing traits with Phenoscript is like programming a computer code for how an organism looks.

Phenoscript uses something called semantic technology, which helps computers understand the meaning behind words, much like how modern search engines know the difference between the fruit “apple” and the tech company “Apple” based on the context of your search.

“This language is still being tested, but it shows a lot of promise. As more scientists start using Phenoscript, it will revolutionise biology by making vast amounts of trait data available for large-scale studies, boosting the emerging field of phenomics,”

explains Montanaro.

In their research article, newly published in the open-access, peer-reviewed Biodiversity Data Journal, the researchers make use of the new language for the first time, as they create semantic phenotypes for four species of dung beetles from the genus Grebennikovius. Then, to demonstrate the power of the semantic approach, they apply simple semantic queries to the generated phenotypic descriptions. 
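To give a flavour of what such a semantic query can look like in practice, here is a hypothetical sketch written in SPARQL, the standard query language for semantic data. The property names and URIs are illustrative placeholders, not the actual identifiers used in the study.

```sparql
# Hypothetical query: find species whose computable descriptions state
# that the pronotum bears the quality "red". All URIs and property
# names are illustrative placeholders, not those used in the study.
PREFIX ex:   <http://example.org/phenoscript/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?species
WHERE {
  ?species   ex:hasPhenotype ?phenotype .
  ?phenotype ex:bearer       ?part ;
             ex:quality      ?quality .
  ?part      rdfs:label      "pronotum" .
  ?quality   rdfs:label      "red" .
}
```

Because every description is expressed in the same machine-readable terms, one query like this can be run across hundreds or millions of species descriptions at once.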

Finally, the team takes a look yet further ahead into modernising the way scientists work with species information. Their next aim is to integrate semantic species descriptions with the concept of nanopublications, “which encapsulates discrete pieces of information into a comprehensive knowledge graph”. As a result, data that has become part of this graph can be queried directly, thereby ensuring that it remains Findable, Accessible, Interoperable and Reusable (FAIR) through a variety of semantic resources.

***

Research paper:

Montanaro G, Balhoff JP, Girón JC, Söderholm M, Tarasov S (2024) Computable species descriptions and nanopublications: applying ontology-based technologies to dung beetles (Coleoptera, Scarabaeinae). Biodiversity Data Journal 12: e121562. https://doi.org/10.3897/BDJ.12.e121562

***

This study is the latest addition to the special topical collection “Linking FAIR biodiversity data through publications: The BiCIKL approach”, launched and supported by the recently concluded Horizon 2020 project Biodiversity Community Integrated Knowledge Library (BiCIKL). The collection aims to bring together scientific publications that demonstrate the advantages and novel approaches in accessing and (re-)using linked biodiversity data.

***

What expert recommendations did the BiCIKL consortium give to policy makers and research funders to ensure that biodiversity data is FAIR, linked, open and, indeed, future-proof? Find out in the blog post summarising key lessons learnt from the Horizon 2020 project.

***

Follow Biodiversity Data Journal on Facebook and X.

Pensoft took a BiCIKL ride to Naturalis to report on a 3-year endeavour towards FAIR data

Three years ago, the BiCIKL consortium set out to traverse the obstacles to wider use and adoption of FAIR and linked biodiversity data.

Leiden – also known as the ‘City of Keys’ and the ‘City of Discoveries’ – was aptly chosen to host the third Empowering Biodiversity Research (EBR III) conference. The two-day conference – this time focusing on the utilisation of biodiversity data as a vehicle for biodiversity research to reach policy – was held in a no less fitting locality: the Naturalis Biodiversity Center.

On 25th and 26th March 2024, the delegates got the chance to learn more about the latest discoveries, trends and innovations from scientists, as well as various stakeholders, including representatives of policy-making bodies, research institutions and infrastructures. The conference also ran a poster session and a Biodiversity Informatics market, where scientists, research teams, project consortia, and providers of biodiversity research-related services and tools could showcase their work and meet like-minded professionals.

BiCIKL stops at the Naturalis Biodiversity Center

The main outcome of the BiCIKL project: the Biodiversity Knowledge Hub, a one-stop knowledge portal to interlinked and machine-readable FAIR data.

The country, famous for its bicycle friendliness, also made a suitable stop for BiCIKL (an acronym for the Biodiversity Community Integrated Knowledge Library): a project funded under the European Commission’s Horizon 2020 programme that aimed to trigger a culture change in the way users access, (re)use, publish and share biodiversity data. To do this, the BiCIKL consortium set off on a 3-year journey to build on the existing biodiversity data infrastructures, workflows and standards, and the linkages between them.

Many of the people who have been involved in the project over the last three years could be seen all around the beautiful venue. Above all, Naturalis is itself one of the partnering institutions in BiCIKL. Then, on Tuesday, on behalf of the BiCIKL consortium and the project’s coordinator – the scientific publisher and technology innovator Pensoft – Iva Boyadzhieva presented the work done within the project, one month ahead of its official conclusion at the end of April.

As she talked about the route the BiCIKL consortium took to traverse the obstacles to wider use and adoption of FAIR and linked biodiversity data, she focused on BiCIKL’s main outcome: the Biodiversity Knowledge Hub (BKH).

Key results from the BiCIKL project three years into its existence presented by Pensoft’s Iva Boyadzhieva at the EBR III conference.

Intended to act as a knowledge broker for users who wish to navigate and access sources of open and FAIR biodiversity data, guidelines, tools and services, the BKH is, in practice, a one-stop portal for understanding the complex but increasingly interconnected landscape of biodiversity research infrastructures in Europe and beyond. It collates information, guidelines, recommendations and best practices in the usage of FAIR and linked biodiversity data, as well as a continuously expanding catalogue of compliant relevant services and tools.

At the core of the BKH is the FAIR Data Place (FDP), where users can familiarise themselves with each of the participating biodiversity infrastructures and network organisations, and also learn about the specific services they provide. There, anyone can explore various biodiversity data tools and services by browsing by their main data type, e.g. specimens, sequences, taxon names, literature.

While the project might be coming to an end, she pointed out, the BKH is here to stay as a navigation system in a universe of interconnected biodiversity research infrastructures.

To do this, not only will the partners continue to maintain it, but it will also remain open to any research infrastructure that wishes to feature its own tools and services compliant with the linked and FAIR data requirements set by the BiCIKL consortium.

On the event’s website you can access the slides of the BiCIKL presentation as given at the EBR III conference.

What else was on at the EBR III?

Indisputably, the ‘hot’ topics at the EBR III were the novel technologies for remote and non-invasive, yet efficient biomonitoring; the utilisation of data and other input sourced by citizen scientists; as well as leveraging different types and sources of biodiversity data, in order to better inform decision-makers, but also future-proof the scientific knowledge we have collected and generated to date.

Project coordinator Dr Quentin Groom presents B-Cubed’s approach towards standardised access to biodiversity data for policy-making at the EBR III conference.

Amongst the other Horizon Europe projects presented at the EBR III conference was B-Cubed (Biodiversity Building Blocks for policy). On Monday, the project’s coordinator Dr Quentin Groom (Meise Botanic Garden) familiarised the conference participants with the project, which aims to standardise access to biodiversity data, in order to empower policymakers to proactively address the impacts of biodiversity change.

You can find more about B-Cubed and Pensoft’s role in it in this blog post.

On the event’s website you can access the slides of the B-Cubed presentation as given at the EBR III conference.

***

Dr France Gerard (UK Centre for Ecology & Hydrology) talks about the challenges in using raw data – including those provided by drones – to derive habitat condition metrics.

MAMBO – another Horizon Europe project where Pensoft contributes expertise in science communication, dissemination and exploitation – was also an active participant at the event. An acronym for Modern Approaches to the Monitoring of BiOdiversity, MAMBO had its own session on Tuesday morning, where Dr Vincent Kalkman (Naturalis Biodiversity Center), Dr France Gerard (UK Centre for Ecology & Hydrology) and Prof. Toke Høye (Aarhus University) each took to the stage to demonstrate how modern technology developed within the project is set to improve biodiversity and habitat monitoring. Learn more about MAMBO and Pensoft’s involvement in this blog post.

MAMBO’s project coordinator Prof. Toke T. Høye talked about smarter technologies for biodiversity monitoring, including camera traps able to count insects at a particular site.

On the event’s website you can access the slides of the MAMBO presentations by Kalkman, Gerard and Høye, as given at the EBR III conference.

***

The EBR III conference also saw a presentation – albeit remote – from Prof. Dr. Florian Leese (Dean at the University of Duisburg-Essen, Germany, and Editor-in-Chief at the Metabarcoding and Metagenomics journal), where he talked about the promise, but also the challenges for DNA-based methods to empower biodiversity monitoring. 

Amongst the key tasks here, he pointed out, are the alignment of DNA-based methods with the Global Biodiversity Framework; central push and funding for standards and guidance; publication of data in portals that adhere to the best data practices and rules; and the mobilisation of existing resources such as the meteorological ones. 

Prof. Dr. Florian Leese talked about the promise, but also the challenges for DNA-based methods to empower biodiversity monitoring. He also referred to the 2022 Forum Paper: “Introducing guidelines for publishing DNA-derived occurrence data through biodiversity data platforms” by R. Henrik Nilsson et al.

He also made a reference to the Forum Paper “Introducing guidelines for publishing DNA-derived occurrence data through biodiversity data platforms” by R. Henrik Nilsson et al., where the international team provided a brief rationale and an overview of guidelines targeting the principles and approaches of exposing DNA-derived occurrence data in the context of broader biodiversity data. In the study, published in the Metabarcoding and Metagenomics journal in 2022, they also introduced a living version of these guidelines, which continues to encourage feedback and interaction as new techniques and best practices emerge.

***

You can find the programme on the conference website and see highlights on the conference hashtag: #EBR2024.

Don’t forget to also explore the Biodiversity Knowledge Hub for yourself at: https://biodiversityknowledgehub.eu/ 

Towards the “Biodiversity PMC”: a literature database supporting advanced content queries

The indexing is one of the major outcomes from the partnerships within the Horizon 2020-funded project Biodiversity Community Integrated Knowledge Library (BiCIKL)

One of the major outcomes of the nearly completed Horizon 2020-funded project Biodiversity Community Integrated Knowledge Library (BiCIKL) – dedicated to making biodiversity data FAIR and bi-directionally linked – brings the SIB Literature Services (SIBiLS) database one step closer to living up to its “Biodiversity PMC” working title.

In a joint effort between the Swiss-based Text Mining group of Patrick Ruch at SIB (which develops SIBiLS), the text- and data-mining association Plazi and the scientific publisher Pensoft, the long-time collaborators have started feeding the full-text content of over 500,000 taxonomic treatments extracted by Plazi, plus tens of thousands of full-text articles from 40 renowned biodiversity journals published by Pensoft, into the SIBiLS database.

What this means is that SIBiLS users – be they human or AI – have now gained access to advanced text- and data-mining tools, including AI-powered factoid question-answering capabilities, to query all this full-text indexed content and seek out, for example, species traits and biotic interactions.

To index and directly feed the content from its 40+ academic outlets into SIBiLS, Pensoft relies on its advanced full-text TaxPub JATS XML publication workflow, powered by the ARPHA publishing platform. Meanwhile, Plazi uses its GoldenGATE text- and data-mining software to harvest taxon treatments from over 80 journals; these are stored in TreatmentBank and the Biodiversity Literature Repository, and further re-used by GBIF, OpenBiodiv and now SIBiLS.

Seen as a pilot, the indexing – the partners believe – could soon be extended to other journals that rely on modern publishing workflows, or to converted legacy publications.

In fact, ever since its launch in 2020, the queryable database SIBiLS has been retrieving relevant full-text papers directly from the NIH’s PubMed Central, including Pensoft’s ZooKeys, PhytoKeys, MycoKeys, Biodiversity Data Journal and Comparative Cytogenetics.

However, there were still gaps to bridge before SIBiLS could indeed be dubbed “the Biodiversity PMC”, and those have mostly concerned the volume and breadth of content. While the above-mentioned five Pensoft journals had long been indexed by SIBiLS through harvesting PMC, they were quite an exception: several years ago, a reorganisation at PMC shifted the focus of the database almost exclusively to biomedical content, thus leaving biodiversity journals out of its scope.

In the meantime, although Plazi was feeding SIBiLS a growing volume of taxonomic treatments and visual data as it exponentially increased the number of publishers and journals it mined, a lot of biodiversity data (e.g. genetic, molecular, ecological) published in article narratives outside taxon treatments could not make it into the portal.

“We all know the advantages and practical uses PMC offers to its users, so we cannot miss the opportunity to incorporate this well-proven approach to navigate the data deluge in biodiversity science. Undoubtedly, it is an extremely ambitious and demanding task. Yet, I believe that, at the BiCIKL consortium, we have made it pretty clear we have the necessary expertise, know-how and aspiration to take on the challenge,”

said Prof. Lyubomir Penev, founder/CEO at Pensoft and project coordinator of BiCIKL.

“For far too long, scientific knowledge about biodiversity has been imprisoned in a continuously growing corpus of scientific outputs, which – most of the time – are published in unstructured formats, such as PDF, or as paywalled content, and often locked by both! This means that they are – at best – difficult to access and comprehend by computer algorithms. In the meantime, we need all that knowledge, in order to accelerate our understanding of the dynamics of the global biodiversity crisis and to efficiently assess the impact of climate change. This is why the need for advanced workflows and tools to annotate, mine, query and discover new facts from the available literature is more than obvious,”

added Dr. Donat Agosti, President at Plazi.

“In the course of the BiCIKL project, at SIBiLS, we started indexing a larger set of biodiversity-related contents in the broad sense, including environmental sciences and ecology, to build a new literature database, or what we now call ‘Biodiversity PMC’. Now, with the help of Plazi and Pensoft, we provide a unique entry point to half a million taxonomic treatments, which were not included into the original PubMed Central. Next on the list is to expand our network of literature sources and continue this exponential growth of queryable biodiversity knowledge to turn Biodiversity PMC into the “One Health” library. We promise to keep you posted,”

said Dr. Patrick Ruch, Group Leader at SIB and Head of Research at HES-SO, HEG Geneva, Switzerland. 

***

Follow BiCIKL Project on Twitter and Facebook. Join the conversation on Twitter at #BiCIKL_H2020.

***

About the SIB Swiss Institute of Bioinformatics:

SIB is an internationally recognized non-profit organisation, dedicated to biological and biomedical data science. SIB’s data scientists are passionate about creating knowledge and solving complex questions in many fields, from biodiversity and evolution to medicine. They provide essential databases and software platforms as well as bioinformatics expertise and services to academic, clinical, and industry groups. With the recent creation of the Environmental Bioinformatics group, led by Robert Waterhouse, SIB is engaged in an unprecedented effort to streamline data across molecular biology, health and biodiversity. SIB also federates the Swiss bioinformatics community of some 900 scientists, encouraging collaboration and knowledge sharing.

***

About Plazi:

Plazi is an association supporting and promoting the development of persistent and openly accessible digital taxonomic literature. To this end, Plazi maintains TreatmentBank, a digital taxonomic literature repository enabling the archiving of taxonomic treatments; develops and maintains TaxPub, an extension of the National Library of Medicine / National Center for Biotechnology Information Journal Article Tag Suite for taxonomic treatments; is a co-founder of the Biodiversity Literature Repository at Zenodo; participates in the development of new models for publishing taxonomic treatments in order to maximise interoperability with other relevant cyberinfrastructure components, such as name servers and biodiversity resources; and advocates and educates about the vital importance of maintaining free and open access to scientific discourse and data. Plazi is a major contributor to the Global Biodiversity Information Facility.

Nanopublications tailored to biodiversity data

Novel nanopublication workflows and templates for associations between organisms, taxa and their environment are the latest outcome of the collaboration between Knowledge Pixels and Pensoft.

First off, why nanopublications?

Nanopublications complement human-created narratives of scientific knowledge with elementary, machine-actionable, simple and straightforward scientific statements that promote sharing, findability, accessibility, citability and interoperability.

By making it easier to trace individual findings back to their origin and/or follow-up updates, nanopublications also help to better understand the provenance of scientific data. 

With the nanopublication format and workflow, authors make sure that key scientific statements – the ones underpinning their research work – are efficiently communicated in both human-readable and machine-actionable manner in line with FAIR principles. Thus, their contributions to science are better prepared for a reality driven by AI technology.

Nanopublications are machine-actionable by design: each assertion comprises a subject, a predicate (the type of relation between the subject and the object) and an object, complemented by provenance, authorship and publication information. A unique feature is that each of these elements is linked to an online resource, such as a controlled vocabulary, ontology or standard.
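As a minimal sketch of that structure – with all example.org URIs as hypothetical placeholders rather than a real published nanopublication – the four parts can be written as four named graphs in a SPARQL 1.1 Update:

```sparql
# A nanopublication bundles four named graphs: a head that links the
# parts, plus the assertion, its provenance and the publication info.
# All example.org URIs are placeholders for illustration only.
PREFIX np:   <http://www.nanopub.org/nschema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX ex:   <http://example.org/np1/>

INSERT DATA {
  GRAPH ex:head {
    ex:pub a np:Nanopublication ;
      np:hasAssertion       ex:assertion ;
      np:hasProvenance      ex:provenance ;
      np:hasPublicationInfo ex:pubinfo .
  }
  GRAPH ex:assertion {   # the single scientific statement
    ex:SpeciesX ex:occursIn ex:HabitatY .
  }
  GRAPH ex:provenance {  # where the assertion comes from
    ex:assertion prov:wasDerivedFrom ex:someStudy .
  }
  GRAPH ex:pubinfo {     # who created the nanopublication
    ex:pub dct:creator ex:someAuthor .
  }
}
```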

Now, what’s new?

As a result of the partnership between high-tech startup Knowledge Pixels and open-access scholarly publisher and technology provider Pensoft, authors in Biodiversity Data Journal (BDJ) can make use of three types of nanopublications:

  1. Nanopublications associated with a manuscript submitted to BDJ. This workflow lets authors add a Nanopublications section within their manuscript while preparing their submission in the ARPHA Writing Tool (AWT). Basically, authors ‘highlight’ and ‘export’ key points from their papers as nanopublications to further ensure the FAIRness of the most important findings from their publications.
  2. Standalone nanopublications related to any scientific publication, regardless of its author or source. This can be done via the Nanopublications page accessible from the BDJ website. The main advantage of a standalone nanopublication is that straightforward scientific statements become available and FAIR early on, and remain ready to be added to a future scholarly paper.
  3. Nanopublications as annotations to existing scientific publications. This feature is available from several journals published on the ARPHA Platform, including BDJ. By attaching an annotation to the entire paper (via the Nanopublication tab) or to a text selection (by first adding an inline comment, then exporting it as a nanopublication), a reader can evaluate and record an opinion about any article using a simple template based on the Citation Typing Ontology (CiTO).

Nanopublications for biodiversity data?

At Biodiversity Data Journal (BDJ), authors can now incorporate nanopublications within their manuscripts to future-proof the most important assertions on biological taxa and organisms, or statements about associations of taxa or organisms and their environments.

On top of being shared and archived by means of a traditional research publication in an open-access peer-reviewed journal, scientific statements using the nanopublication format will also remain ‘at the fingertips’ of automated tools that may be the next to come looking for this information, while mining the Web.

Using the nanopublication workflows and templates available at BDJ, biodiversity researchers can share key assertions from their studies.

So far, the available biodiversity nanopublication templates cover a range of associations, including those between taxa and individual organisms, as well as between those and their environments and nucleotide sequences. 
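For illustration, the assertion graph of such a taxon-environment nanopublication could hold a single machine-readable triple, as in the sketch below; the Relation Ontology (RO) term shown (assumed here to be “has habitat”) and all URIs are illustrative assumptions, not the exact identifiers used by the BDJ templates.

```sparql
# Sketch of the assertion inside a taxon-environment nanopublication.
# obo:RO_0002303 is assumed to be the Relation Ontology term
# "has habitat"; the taxon and habitat URIs are placeholders.
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX ex:  <http://example.org/np2/>

INSERT DATA {
  GRAPH ex:assertion {
    ex:TaxonX obo:RO_0002303 ex:MontaneForest .   # "TaxonX has habitat: montane forest"
  }
}
```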

Nanopublication template customised for biodiversity research publications available from Nanodash.

As a result, those easy-to-digest ‘pixels of knowledge’ can capture and disseminate information about single observations, as well as higher taxonomic ranks. 

The novel domain-specific publication format was launched as part of the collaboration between Knowledge Pixels – an innovative startup tech company aiming to revolutionise scientific publishing and knowledge sharing – and the open-access scholarly publisher Pensoft.

… so, what exactly is a nanopublication?

General structure of a nanopublication:

“the smallest unit of publishable information”,

as explained on nanopub.net.

Basically, a nanopublication – unlike a research article – is a tiny snippet of a precise and structured scientific finding (e.g. medication X treats disease Y), which exists as a reusable and citable piece of a growing knowledge graph stored on a decentralised server network, in a format that is readable for humans, but also “understandable” and actionable for computers and their algorithms.

These semantic statements, expressed in community-agreed terms that are openly available through links to controlled vocabularies, ontologies and standards, are not only freely accessible to everyone in both human-readable and machine-actionable formats, but also easy to digest for computer algorithms and AI-powered assistants.

In short, nanopublications allow us to browse and aggregate such findings as part of a complex scientific knowledge graph. Therefore, nanopublications bring us one step closer to the next revolution in scientific publishing, which started with the emergence and increasing adoption of knowledge graphs. 
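To see why aggregation matters, here is a hedged sketch of how it might look once many nanopublications sit in one triple store: a single SPARQL query pulls together every assertion that a medication treats disease Y, along with who published each claim. The ex: vocabulary is a placeholder for a real domain ontology.

```sparql
# Aggregate findings scattered across many nanopublications: which
# medications have been asserted to treat disease Y, and by whom?
# The ex: vocabulary is a placeholder for a real domain ontology.
PREFIX np:  <http://www.nanopub.org/nschema#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ex:  <http://example.org/vocab/>

SELECT ?medication ?author
WHERE {
  GRAPH ?head {
    ?pub np:hasAssertion       ?assertion ;
         np:hasPublicationInfo ?pubinfo .
  }
  GRAPH ?assertion { ?medication ex:treats ex:DiseaseY . }
  GRAPH ?pubinfo   { ?pub dct:creator ?author . }
}
```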

“As pioneers in the semantic open access scientific publishing field for over a decade now, we at Pensoft are deeply engaged with making research work actually available at anyone’s fingertips. What once started as breaking down paywalls to research articles and adding the right hyperlinks in the right places, is time to be built upon,”

said Prof. Lyubomir Penev, founder and CEO at Pensoft, which had published the very first semantically enhanced research article in the biodiversity domain back in 2010 in the ZooKeys journal.

Why are nanopublications necessary?

By letting computer algorithms access published research findings in a structured format, nanopublications allow for the knowledge snippets that they are intended to communicate to be fully understandable and actionable. With nanopublications, each of those fragments of scientific information is interconnected and traceable back to its author(s) and scientific evidence. 

A nanopublication is a tiny snippet of a precise and structured scientific finding (e.g. medication X treats disease Y), which exists within a growing knowledge graph stored on a decentralised server network in a format that is readable for humans, but also “understandable” and actionable for computers and their algorithms. Illustration by Knowledge Pixels.

By building on shared knowledge representation models, these data become Interoperable (as in the I in FAIR), so that they can be delivered to the right user, at the right time, in the right place, ready to be reused (as per the R in FAIR) in new contexts.

Another issue nanopublications are designed to address is research scrutiny. Today, scientific publications are produced at an unprecedented rate that is unlikely to cease in the years to come, as scholarship embraces the dissemination of early research outputs, including preprints, accepted manuscripts and non-conventional papers.

A network of interlinked nanopublications could also provide a valuable forum for scientists to test, compare, complement and build on each other’s results and approaches to a common scientific problem, while retaining the record of their cooperation each step along the way. 

*** 

We encourage you to try the nanopublications workflow yourself when submitting your next biodiversity paper to Biodiversity Data Journal.

Community feedback on this pilot project and suggestions for additional biodiversity-related nanopublication templates are very welcome!

This Nanopublications for biodiversity workflow was created with partial support from the European Union’s Horizon 2020 BiCIKL project under grant agreement No 101007492, and in collaboration with Knowledge Pixels AG. The tool uses data and API services of ChecklistBank, Catalogue of Life, GBIF, GenBank/ENA, BOLD, Darwin Core, Environmental Ontology (ENVO), Relation Ontology (RO), NOMEN, ZooBank, Index Fungorum, MycoBank, IPNI, TreatmentBank, and other resources.

*** 

On the journal website: https://bdj.pensoft.net/, you can find more about the unique features and workflows provided by the Biodiversity Data Journal (BDJ), including innovative research paper formats (e.g. Data Paper, OMICS Data Paper, Software Description, R Package, Species Conservation Profiles, Alien Species Profile), expert-provided data audit for each data paper submission, automated data export and more.

Don’t forget to also sign up for the BDJ newsletter via the Email alert form on the journal’s homepage and follow it on Twitter and Facebook.

***

Earlier this year, Knowledge Pixels and Pensoft presented several routes for readers and researchers to contribute to research outputs – either produced by themselves or by others – through nanopublications generated through and visualised in Pensoft’s cross-disciplinary Research Ideas and Outcomes (RIO) journal, which uses the same nanopublication workflows.

How it works: Nanopublications linked to articles in RIO Journal

To bridge the gap between authors and their readers or fellow researchers – whether humans or computers – Knowledge Pixels and Pensoft launched workflows to link scientific publications to nanopublications.

A new pilot project by Pensoft and Knowledge Pixels breaks scientific knowledge into FAIR and interlinked snippets of precise information

As you might have already heard, Knowledge Pixels – an innovative startup tech company aiming to revolutionise scientific publishing and knowledge sharing by means of nanopublications – recently launched a pilot project with the similarly pioneering open-science journal Research Ideas and Outcomes (RIO), in the first of several upcoming collaborations between the software developer and the open-access scholarly publisher Pensoft.

“The way how science is performed has dramatically changed with digitalisation, the Internet, and the vast increase in data, but the results are still shared in basically the same form and language as 300 years ago: in narrative text, like a story. These narratives are not precise and not directly interpretable by machines, thereby not FAIR. Even the latest impressive AI tools like ChatGPT can only guess (and sometimes ‘hallucinate’) what the authors meant exactly and how the results compare,”

said Philipp von Essen and Tobias Kuhn, the two founders of Knowledge Pixels in a press announcement.

So, in order to bridge the gap between authors and their readers and fellow researchers – whether humans or computers – the partners launched several workflows to bi-directionally link scientific publications from RIO Journal to nanopublications. We will explain and demonstrate these workflows in a bit.

Now, first, let’s see what nanopublications are and how they contribute to scientific knowledge, researchers and scholarship as a whole.

What are nanopublications?

General structure of a nanopublication:

“the smallest unit of publishable information”,

as explained by Knowledge Pixels on nanopub.net.

Basically, a nanopublication – unlike a research article – is just a tiny snippet of a scientific finding (e.g. medication X treats disease Y), which exists as a complete and straightforward piece of information stored on a decentralised server network in a specially structured format, so that it is readable for humans, but also “understandable” and actionable for computers and their algorithms. 

A nanopublication may also be an assertion related to an existing research article meant to support, comment, update or complement the reported findings.

In fact, nanopublications as a concept have been with us for quite a while now. Ever since the rise of the Semantic Web, to be exact. At the end of the day, it all boils down to providing easily accessible information that is only a click away from additional useful and relevant content. The thing is, technological advancement has only recently begun to catch up with the concept of nanopublications. Today, we are one step closer to another revolution in scientific publishing, thanks to the emergence and increasing adoption of what we call knowledge graphs.

“As pioneers in the semantic open access scientific publishing field for over a decade now, at Pensoft we are deeply engaged with making research work actually available at anyone’s fingertips. What once started as breaking down paywalls to research articles and adding the right hyperlinks in the right places, is time to be built upon,”

said Prof. Lyubomir Penev, founder and CEO at Pensoft: the open-access scholarly publisher behind the very first semantically enhanced research article in the biodiversity domain, published back in 2010 in the ZooKeys journal.

Why nanopublications?

Apart from giving computer algorithms full access to published research findings, nanopublications allow for the knowledge snippets that they are intended to communicate to be fully understandable and actionable. With nanopublications, each byte of knowledge is interconnected and traceable back to its author(s) and scientific evidence.

Nanopublications present a complete and straightforward piece of information stored on a decentralised server network in a specially structured format, so that it is readable for humans, but also “understandable” and actionable for computers and their algorithms. Illustration by Knowledge Pixels.

By granting computers the capability of exchanging information between users and platforms, these data become Interoperable (as in the I in FAIR), so that they can be delivered to the right user, at the right time, in the right place. 

Another issue nanopublications are designed to address is research scrutiny. Today, scientific publications are produced at an unprecedented rate that is unlikely to cease in the years to come, as scholarship embraces the dissemination of early research outputs, including preprints, accepted manuscripts and non-conventional papers.

Linking assertions to a publication by means of nanopublications allows the original authors and their fellow researchers to keep knowledge up to date as new findings emerge, either in support of or in contradiction to previous information.

A network of interlinked nanopublications could also provide a valuable forum for scientists to test, compare, complement and build on each other’s results and approaches to a common scientific problem, while retaining the record of their cooperation each step along the way. 

A scientific issue that would definitely benefit from an additional layer of provenance and, specifically, a workflow allowing for new updates to be linked to previous publications is the biodiversity domain, where species treatments, taxon names, biotic interactions and phylogenies are continuously being updated, reworked and even discarded for good. This is why an upcoming collaboration between Pensoft and Knowledge Pixels will also involve the Biodiversity Data Journal (stay tuned!).

What can you do in RIO?

Now, let’s have a look at the *nano* opportunities already available at RIO Journal.

The integration between RIO and Nanodash – the environment developed by Knowledge Pixels where users edit and publish their nanopublications – is available for any article published in the journal.

Add reaction to article

This function allows any reader to evaluate and record an opinion about any article using a simple template. The opinion is posted as a nanopublication displayed on the article page, bearing the timestamp and the name of the creator.

All one needs to do is go to a paper, locate the Nanopubs tab in the menu on the left and click on the Add reaction command to navigate to the Nanodash workspace, accessible to anyone registered with ORCID.

To access the Nanodash workspace, where you can fill in a ready-to-use, partially filled in nanopublication template, simply go to the Nanopubs tab in the menu of any article published in RIO Journal and click Add reaction to this article (see example).

Within the simple Nanodash workspace, the user can provide the text of the nanopublication; define its relation to the linked paper using the Citation Typing Ontology (CiTO); update its provenance and add information (e.g. licence, extra creators) by inserting extra elements. 

To do this, the Knowledge Pixels team has created a ready-to-use nanopublication template, where the necessary details for the RIO paper and the author that secure the linkage have already been pre-filled.
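At its core, the result of that template is a tiny assertion graph linking the reaction to the paper through a CiTO relation. Here is a hedged sketch: cito:agreesWith is a real CiTO property, but the comment property and all other URIs are placeholders, not the template’s actual fields.

```sparql
# Sketch of the assertion produced by an 'Add reaction' nanopublication:
# the reader's reaction points at the article via a CiTO relation.
# cito:agreesWith is a real CiTO property; other URIs are placeholders.
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX ex:   <http://example.org/reaction/>

INSERT DATA {
  GRAPH ex:assertion {
    ex:myReaction cito:agreesWith ex:theRioArticle ;
                  ex:commentText  "Independent data support this result." .
  }
}
```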

Post an inline comment as a nanopublication

Another opportunity for readers and authors to add further meaningful information or feedback to an already published paper is by attaching an inline comment and then exporting it to Nanodash, so that it becomes a nanopublication. To do this, users will simply need to select some text with a left click, type in the comment, and click OK. Now, their input will be available in the Comment tab designed to host simple comments addressing the authors of the publication. 

While RIO has long supported features allowing readers to publicly share comments and even Crossref-registered post-publication peer reviews alongside the articles, the nanopublications integration adds to this versatile open science-driven arsenal of feedback tools. More precisely, the novel workflow is especially useful for comments that provide a particularly valuable contribution to a research topic.

To make a comment into a nanopublication, the user needs to locate the comment in the tab and click on the Post as Nanopub command to access the Nanodash environment.

Add a nanopublication while writing your manuscript

A functionality available from ARPHA Writing Tool – the online collaborative authoring environment that underpins the manuscript submission process at several journals published by Pensoft, including RIO Journal – allows for researchers to create a list of nanopublications within their manuscripts. 

By doing so, not only do authors get to highlight their key statements in a tabular view within a separate pre-designated Nanopublications section, but they also make it easier for reviewers and scientific editors to focus on and evaluate the very foundations of the paper.

By incorporating a machine algorithm-friendly structure for the main findings of their research paper, authors ensure that AI assistants, for example, will be more likely to correctly ‘read’, ‘interpret’ and deliver the knowledge reported in the publication for the next users and their prompts. Furthermore, fellow researchers who might want to cite the paper will also have an easier time citing the specific statement from within the cited source, so that their own readers – be it human, or AI – will make the right links and conclusions.

Within a pre-designated article template at RIO – regardless of the paper type selected – authors have the option to either paste a link to an already available nanopublication or manage their nanopublication via the Nanodash environment by following a link. Customised for the purposes of RIO, the Nanodash workspace will provide them with all the information needed to guide them through the creation and publication of their nanopublications.

Why Research Ideas and Outcomes, a.k.a. RIO Journal?

Why did Knowledge Pixels and Pensoft opt to run their joint pilot at no other journal within the Pensoft portfolio of open-access scientific journals but the Research Ideas and Outcomes (RIO)?

Well, one may argue that there simply was no better choice than an academic outlet that was designed from the start to serve as “the open-science journal”: something it was honourably recognised for by SPARC in 2016, only a year after its launch.

Innovative since day #1, back in 2015, RIO surfaced as an academic outlet publishing a wide range of article types that report on scientific work from across the research process, starting with research ideas, grant proposals and workshop reports.

After all, back in 2015, when only a handful of funders required Data and Software Management Plans to be made open and public, RIO was already providing a platform to publish those as easily citable research outputs, complete with a DOI and registration on Crossref. In the spirit of transparency, RIO has always operated an open and public-by-default peer review policy.

More recently, RIO introduced a novel collections workflow which allows, for example, project coordinators to provide a one-stop access point for publications and all kinds of valuable outputs resulting from their projects, regardless of their publication source.

Bottom line is, RIO has always stood for innovation, transparency, openness and FAIRness in scholarly publishing and communication, so it was indeed the best fit for the nanopublication pilot with Knowledge Pixels.

*** 

We encourage you to try the nanopublications workflow yourself by going to https://riojournal.com/articles, and posting your own assertion to an article of your choice!

Don’t forget to also sign up for the RIO Journal’s newsletter via the Email alert form on the journal’s website and follow it on Twitter, Facebook, Linkedin and Mastodon.

BiCIKL Project supports article collection in Biodiversity Data Journal about use of linked data

The collection welcomes taxonomic and other biodiversity-related research articles that demonstrate the advantages of and novel approaches to accessing and (re-)using linked biodiversity data

The EU-funded project BiCIKL (Biodiversity Community Integrated Knowledge Library) will support, free of charge, publications* submitted to the dedicated topical collection “Linking FAIR biodiversity data through publications: The BiCIKL approach” in the Biodiversity Data Journal, demonstrating advanced publishing methods for linked biodiversity data, so that they can be easily harvested, distributed and re-used to generate new knowledge.

BiCIKL is dedicated to building a new community of key research infrastructures, researchers and citizen scientists by using linked FAIR biodiversity data at all stages of the research lifecycle, from specimens through sequencing, imaging, identification of taxa, etc. to final publication in novel, re-usable, human-readable and machine-interpretable scholarly articles.

Achieving a culture change in how biodiversity data are being identified, linked, integrated and re-used is the mission of the BiCIKL consortium. By doing so, BiCIKL is to help increase the transparency, trustworthiness and efficiency of the entire research ecosystem.


The new article collection welcomes taxonomic and other biodiversity-related research articles, data papers, software descriptions, and methodological/theoretical papers. These should demonstrate the advantages and novel approaches in accessing and (re-)using linked biodiversity data.

To be eligible for the collection, a manuscript must comply with at least two of the conditions listed below. In the submission form, the author needs to specify the condition(s) applicable to the manuscript. The author should provide the explanation in a cover letter, using the Notes to the editor field.

All submissions must abide by the community-agreed standards for terms, ontologies and vocabularies used in biodiversity informatics. 

The data used in the articles must comply with the Data Quality Checklist and Fair Data Checklist available in the Authors’ instructions of the journal.


Conditions for publication in the article collection:

  • The authors are expected to use explicit Globally Unique Persistent and Resolvable Identifiers (GUPRI) or other persistent identifiers (PIDs), where such are available, for the different types of data they use and/or cite in the manuscripts (specimen IDs, sequence accession numbers, taxon name and taxon treatment IDs, image IDs, etc.)

  • Global taxon reviews in the form of “cyber-catalogues” are welcome if they link the key data elements (specimens, sequences, taxon treatments, images, literature references, etc.) to their respective records in external repositories. Taxon names in the text should not be hyperlinked. Instead, under each taxon name in the catalogue, the authors should add external links to, for example, Catalogue of Life, nomenclators (e.g. IPNI, MycoBank, Index Fungorum, ZooBank), taxon treatments in Plazi’s TreatmentBank or other relevant trusted resources.

  • Taxonomic papers (e.g. descriptions of new species or revisions) must contain persistent identifiers for the holotype, paratypes and at least most of the specimens used in the study.

  • Specimen records that are used for new taxon descriptions or taxonomic revisions and are associated with a particular Barcode Identification Number (BIN) or Species Hypothesis (SH) should be imported directly from BOLD or PlutoF, respectively, via the ARPHA Writing Tool data-import plugin.

  • More generally, individual specimen records used for various purposes in taxonomic descriptions and inventories should be imported directly into the manuscript from GBIF, iDigBio, or BOLD via the ARPHA Writing Tool data-import plugin. 

  • In-text citations of taxon treatments from Plazi’s TreatmentBank are highly welcome in any taxonomic revision or catalogue. The in-text citations should be hyperlinked to the original treatment data at TreatmentBank.

  • Hyperlinking other terms of importance in the article text to their original external data sources or external vocabularies is encouraged.

  • Tables that list gene accession numbers, specimens and taxon names should conform to the Biodiversity Data Journal’s linked data tables guidelines.

  • Theoretical or methodological papers on linking FAIR biodiversity data are eligible for the BiCIKL collection if they provide real examples and use cases.

  • Data papers or software descriptions are eligible if they use linked data from the BiCIKL’s partnering research infrastructures, or describe tools and services that facilitate access to and linking between FAIR biodiversity data.

  • Articles that contain nanopublications created or added during the authoring process in Biodiversity Data Journal. A nanopublication is a scientifically meaningful assertion about anything that can be uniquely identified and attributed to its author, and serves to communicate a single statement, for example, a biotic relationship between taxa or the habitat preference of a taxon. The in-built workflow ensures the linkage and its persistence, while the information is simultaneously human-readable and machine-interpretable.
  • Manuscripts that contain or describe any other novel idea or feature related to linked or semantically enhanced biodiversity data will be considered too.

We recommend that authors get acquainted with these two papers before deciding to submit a manuscript to the collection: 


Here are several examples of research questions that might be explored using semantically enriched and linked biodiversity data: 

(1) How does linking taxon names or Operational Taxonomic Units (OTUs) to related external data (e.g. specimen records, sequences, distributions, ecological & bionomic traits, images) contribute to a better understanding of the functions and regional/local processes within faunas/floras/mycotas or biotic communities?

(2) How could the production and publication of taxon descriptions and inventories – including those based mostly on genomic and barcoding data – be streamlined? 

(3) How could general conclusions, assertions and citations in biodiversity articles be expressed in formal, machine-actionable language, either to update prior work or express new facts (e.g. via nanopublications)? 

(4) How could research data and narratives be re-used to support more extensive and data-rich studies? 

(5) Are there other taxon- or topic-specific research questions that would benefit from richer, semantically enhanced FAIR biodiversity data?


All manuscripts submitted to the Biodiversity Data Journal have their data audited by data scientists prior to the peer review stage.

Once published, specimen record data are exported as Darwin Core Archives to GBIF.

The data and taxon treatments are also exported to several additional data aggregators, such as TreatmentBank, the Biodiversity Literature Repository and SIBiLS, amongst others. The full-text articles are also converted to Linked Open Data indexed in the OpenBiodiv Knowledge Graph.


All articles will need to acknowledge the BiCIKL project, Grant No 101007492 in the Acknowledgements section.

* The publication fee (APC) is waived for standard-sized manuscripts (up to 40,000 characters, including spaces), normally charged by BDJ at €650. Authors of larger manuscripts will need to cover the surplus charge (€10 for each 1,000 characters above 40,000; for example, a 50,000-character manuscript would incur a €100 surcharge). See more about the APC policy at Biodiversity Data Journal, or contact the journal editorial team at: bdj@pensoft.net.

Follow the BiCIKL Project on Twitter and Facebook. Join the conversation via #BiCIKL_H2020.

You can also follow Biodiversity Data Journal on Twitter and Facebook.

Interoperable biodiversity data extracted from literature through open-ended queries

OpenBiodiv is a biodiversity database containing knowledge extracted from scientific literature, built as an Open Biodiversity Knowledge Management System. 

The OpenBiodiv contribution to BiCIKL

Apart from coordinating the Horizon 2020-funded project BiCIKL, scholarly publisher and technology provider Pensoft has been the engine behind what is likely to be the first production-stage semantic system to run on top of a reasonably-sized biodiversity knowledge graph.

As of February 2023, OpenBiodiv contains 36,308 processed articles; 69,596 taxon treatments; 1,131 institutions; 460,475 taxon names; 87,876 sequences; 247,023 bibliographic references; 341,594 author names; and 2,770,357 article sections and subsections.

In fact, OpenBiodiv is a whole ecosystem of tools and services that convert biodiversity data into Linked Open Data, extracted both from the text of articles published in data-minable XML format – as in the journals published by Pensoft (e.g. ZooKeys, PhytoKeys, MycoKeys, Biodiversity Data Journal) – and from the taxonomic treatments made available through Plazi’s specialised extraction workflow.

“I believe that OpenBiodiv is a good real-life example of how the outputs and efforts of a research project may and should outlive the duration of the project itself. Something that is – of course – central to our mission at BiCIKL.”

explains Prof Lyubomir Penev, BiCIKL’s Project Coordinator and founder and CEO of Pensoft.

“The basics of what was to become the OpenBiodiv database began to come together back in 2015 within the EU-funded BIG4 PhD project of Victor Senderov, later succeeded by another PhD project by Mariya Dimitrova within IGNITE. It was during those two projects that the backend ontology OpenBiodiv-O, the first versions of the RDF converters and the basic website functionalities were created,”

he adds.

By the time OpenBiodiv became one of the nine research infrastructures within BiCIKL tasked with providing virtual access to open FAIR data, tools and services, it had already evolved into an RDF-based biodiversity knowledge graph, equipped with a fully automated extraction and indexing workflow and user apps.

Currently, Pensoft is working at full speed on new user apps in OpenBiodiv, as the team continuously brings into play invaluable feedback and recommendations from end-users and partners at BiCIKL.

As a result, OpenBiodiv is already capable of answering open-ended queries based on the available data. To do this, OpenBiodiv discovers ‘hidden’ links between data classes, i.e. taxon names, taxon treatments, specimens, sequences, persons/authors and collections/institutions. 

Thus, the system generates new knowledge about taxa, scientific articles and their subsections, the examined materials and their metadata, localities and sequences, amongst others. Additionally, it is able to return information with a relevant visual representation about any one or a combination of those major data classes within a certain scope and semantic context.

Users can explore the database by typing any term (even if misspelt!) into the search engine available from the OpenBiodiv homepage, through the Application Programming Interface (API), or by using SPARQL queries.

On the OpenBiodiv website, there is also a list of predefined SPARQL queries, which is continuously being expanded.

Sample of predefined SPARQL queries at OpenBiodiv.
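For a flavour of what such a query can look like, here is a hypothetical example in the spirit of those predefined queries; the class and property names only approximate the OpenBiodiv model and should be treated as illustrative, not as its actual vocabulary.

```sparql
# Hypothetical query: list articles and taxonomic treatments that
# mention a given scientific name. Property names approximate the
# OpenBiodiv model and are shown for illustration only.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/openbiodiv/>

SELECT ?article ?treatment
WHERE {
  ?name      rdfs:label      "Apis mellifera" .
  ?treatment ex:mentions     ?name .
  ?article   ex:hasTreatment ?treatment .
}
```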

“OpenBiodiv is an ambitious project of ours, and it’s surely one close to Pensoft’s heart, given our decades-long dedication to biodiversity science and knowledge sharing. Our previous fruitful partnerships with Plazi, BIG4 and IGNITE, as well as the current exciting and inspirational network of BiCIKL are wonderful examples of how far we can go with the right collaborators,”

concludes Prof Lyubomir Penev.

***

Follow BiCIKL on Twitter and Facebook. Join the conversation on Twitter at #BiCIKL_H2020.

You can also follow Pensoft on Twitter, Facebook and Linkedin and use #OpenBiodiv on Twitter.

One Biodiversity Knowledge Hub to link them all: BiCIKL 2nd General Assembly

The FAIR Data Place – the key and final product of the partnership – is meant to provide scientists with all types of biodiversity data “at their fingertips”

The Horizon 2020-funded project BiCIKL has reached its halfway stage, and the partners gathered in Plovdiv (Bulgaria) from the 22nd to the 25th of October for the Second General Assembly, organised by Pensoft.

The BiCIKL project will launch a new European community of key research infrastructures, researchers, citizen scientists and other stakeholders in the biodiversity and life sciences based on open science practices through access to data, tools and services.

BiCIKL’s goal is to create a centralised place to connect all key biodiversity data by interlinking 15 research infrastructures and their databases. The 3-year European Commission-supported initiative kicked off in 2021 and involves 14 key natural history institutions from 10 European countries.

BiCIKL is keeping pace as expected with 16 out of the 48 final deliverables already submitted, another 9 currently in progress/under review and due in a few days. Meanwhile, 21 out of the 48 milestones have been successfully achieved.

Prof. Lyubomir Penev (BiCIKL’s project coordinator and CEO and founder of Pensoft) opens the 2nd General Assembly of BiCIKL in Plovdiv, Bulgaria.

The hybrid format of the meeting enabled a wider range of participants, which resulted in robust discussions on the next steps of the project, such as the implementation of additional technical features of the FAIR Data Place (FAIR being an abbreviation for Findable, Accessible, Interoperable and Reusable).

This FAIR Data Place online platform – the key and final product of the partnership and the BiCIKL initiative – is meant to provide scientists with all types of biodiversity data “at their fingertips”.

This data includes biodiversity information, such as detailed images, DNA, physiology and past studies concerning a specific species and its ‘relatives’, to name a few. Currently, the issue is that all those types of biodiversity data have so far been scattered across various databases, which in turn have been missing meaningful and efficient interconnectedness.

Additionally, the FAIR Data Place, developed within the BiCIKL project, is to give researchers access to plenty of training modules to guide them through the different services.

Halfway through the duration of BiCIKL, the project is at a turning point, where crucial discussions between the partners are playing a central role in the refinement of the FAIR Data Place design. Most importantly, they are tasked with ensuring that their technologies work efficiently with each other, in order to seamlessly exchange, update and share the biodiversity data every one of them is collecting and taking care of.

By Year 3 of the BiCIKL project, the partners agree, when those infrastructures and databases become efficiently interconnected to each other, scientists studying the Earth’s biodiversity across the world will be in a much better position to build on existing research and improve the way and the pace at which nature is being explored and understood. At the end of the day, knowledge is the stepping stone for the preservation of biodiversity and humankind itself.


“Needless to say, it’s an honour and a pleasure to be the coordinator of such an amazing team spanning as many as 14 partnering natural history and biodiversity research institutions from across Europe, but also involving many long-standing global collaborators and their infrastructures, such as Wikidata, GBIF, TDWG and Catalogue of Life, to name a few,”

said BiCIKL’s project coordinator Prof. Lyubomir Penev, CEO and founder of Pensoft.

“I see our meeting in Plovdiv as a practical demonstration of our eagerness and commitment to tackle the long-standing and technically complex challenge of breaking down the silos in the biodiversity data domain. It is time to start building freeways between all biodiversity data, across (digital) space, time and data types. After the last three days that we spent together in inspirational and productive discussions, I am as confident as ever that we are close to providing scientists with much more straightforward routes to not only generate more biodiversity data, but also build on the already existing knowledge to form new hypotheses and information ready to use by decision- and policy-makers. One cannot stress enough how important the role of biodiversity data is in preserving life on Earth. These data are indeed the groundwork for all that we know about the natural world,”

Prof. Lyubomir Penev added.

Christos Arvanitidis (CEO of LifeWatch ERIC) at the 2nd General Assembly of the BiCIKL project.

Christos Arvanitidis, CEO of LifeWatch ERIC, added:

“The point is: do we want an integrated structure or do we prefer federated structures? What are the pros and cons of the two options? It’s essential to keep the community united and allied because we can’t afford any information loss and the stakeholders should feel at home with the Project and the Biodiversity Knowledge Hub.”


Joe Miller, Executive Secretary and Director at GBIF, commented:

“We are a brand new community, and we are in the middle of the growth process. We would like to already have answers, but it’s good to have this kind of robust discussion to build on a good basis. We must find the best solution to have linkages between infrastructures and be able to maintain them in the future, because the Biodiversity Knowledge Hub is the location to gather the community around best practices, data and guidelines on how to use the BiCIKL services… in order to engage even more partners to fill the eventual gaps in our knowledge.”


Joana Pauperio (biodiversity curator at EMBL-EBI) at the 2nd General Assembly of the BiCIKL project.

“BiCIKL is leading data infrastructure communities through some exciting and important developments,”

said Dr Guy Cochrane, Team Leader for Data Coordination and Archiving and Head of the European Nucleotide Archive at EMBL’s European Bioinformatics Institute (EMBL-EBI).

“In an era of biodiversity change and loss, leveraging scientific data fully will allow the world to catalogue what we have now, to track and understand how things are changing and to build the tools that we will use to conserve or remediate. The challenge is that the data come from many streams – molecular biology, taxonomy, natural history collections, biodiversity observation – that need to be connected and intersected to allow scientists and others to ask real questions about the data. In its first year, BiCIKL has made some key advances to rise to this challenge,”

he added.

Deborah Paul, Chair of Biodiversity Information Standards (TDWG), said:

“As a partner, we, at the Biodiversity Information Standards – TDWG, are very enthusiastic that our standards are implemented in BiCIKL and serve to link biodiversity data. We know that joining forces and working together is crucial to building efficient infrastructures and sharing knowledge.”


The project will continue with the first Round Table of experts in December and with the publications of the projects that participated in the Open Call and will be funded at the beginning of next year.

***

Learn more about BiCIKL on the project’s website at: bicikl-project.eu

Follow BiCIKL Project on Twitter and Facebook. Join the conversation on Twitter at #BiCIKL_H2020.

***

All BiCIKL project partners.

#TDWG2022 recap: TDWG and Pensoft welcomed 400 biodiversity information experts from 41 countries in Sofia

For the 37th time, experts from across the world came together to share and discuss the latest developments surrounding biodiversity data and how they are being gathered, used, shared and integrated across time, space and disciplines.

Between 17th and 21st October, about 400 scientists and experts took part in a hybrid meeting dedicated to the development, use and maintenance of biodiversity data, technologies, and standards across the world.

This year, the conference was hosted by Pensoft in collaboration with the National Museum of Natural History (Bulgaria) and the Institute of Biodiversity and Ecosystem Research at the Bulgarian Academy of Science. It ran under the theme “Stronger Together: Standards for linking biodiversity data”.

For the 37th time, the global scientific and educational association Biodiversity Information Standards (TDWG) brought together experts from all over the globe to share and discuss the latest developments surrounding biodiversity data and how they are being gathered, used, shared and integrated across time, space and disciplines.

This was the first time the event was held in a hybrid format. It was attended by 160 people on-site, while another 235 joined online.

The TDWG 2022 conference saw plenty of networking and engaging discussions with as many as 160 on-site attendees and another 235 people, who joined the event remotely.

The conference abstracts, submitted by the event’s speakers ahead of the meeting, provide a sneak peek into their presentations and are all publicly available in the TDWG journal Biodiversity Information Science and Standards (BISS).

“It’s wonderful to be in the Balkans and Bulgaria for our Biodiversity Information Standards (TDWG) 2022 conference! Everyone’s been so welcoming and thoughtfully engaged in conversations about biodiversity information and how we can all collaborate, contribute and benefit,”

said Deborah Paul, Chair of TDWG, a biodiversity informatics specialist and community liaison at the University of Illinois, Prairie Research Institute’s Illinois Natural History Survey and also an active participant in the Society for the Preservation of Natural History Collections (SPNHC), the Entomological Collections Network (ECN), ICEDIG, the Research Data Alliance (RDA), and The Carpentries.

“Our TDWG mission is to create, maintain and promote the use of open, community-driven standards to enable sharing and use of biodiversity data for all,”

she added.

Prof Lyubomir Penev (Pensoft) and Deborah Paul (TDWG) at TDWG 2022.

“We are proud to have been selected to be the hosts of this year’s TDWG annual conference and are definitely happy to have joined and observed so many active experts network and share their know-how and future plans with each other, so that they can collaborate and make further progress in the way scientists and informaticians work with biodiversity information,”  

said Pensoft’s founder and CEO Prof. Lyubomir Penev.

“As a publisher of multiple globally renowned scientific journals and books in the field of biodiversity and ecology, at Pensoft we assume it to be our responsibility to be amongst the first to implement those standards and good practices, and serve as an example in the scholarly publishing world. Let me remind you that it is the scientific publications that present the most reliable knowledge the world and science has, due to the scrutiny and rigour in the review process they undergo before seeing the light of day,”

he added.

***

In a nutshell, the main task and dedication of the TDWG association is to develop and maintain standards and data-sharing protocols that support the infrastructures (e.g., The Global Biodiversity Information Facility – GBIF), which aggregate and facilitate use of these data, in order to inform and expand humanity’s knowledge about life on Earth.

It is the goal of everyone volunteering their time and expertise to TDWG to enable the scientists interested in the world’s biodiversity to do their work efficiently and in a manner that can be understood, shared and reused by others. After all, biodiversity data underlie everything we know about the natural world.

If there are optimised and universal standards in the way researchers store and disseminate biodiversity data, all those biodiversity scientists will be able to find, access and use the knowledge in their own work much more easily. As a result, they will be much better positioned to contribute new knowledge that will later be used in nature and ecosystem conservation by key decision-makers.

On Monday, the event opened with welcoming speeches by Deborah Paul and Prof. Lyubomir Penev in their roles as Chair of TDWG and main host of this year’s conference, respectively.

The opening ceremony continued with a keynote speech by Prof. Pavel Stoev, Director of the Natural History Museum of Sofia and co-host of TDWG 2022. 

Prof. Pavel Stoev (Natural History Museum of Sofia) with a presentation about the known and unknown biodiversity of Bulgaria during the opening plenary session of TDWG 2022.

He walked the participants through the fascinating biodiversity of Bulgaria, but also the worrying trends in the country associated with declining taxonomic expertise. 

He finished his talk on a hopeful note by sharing news of the recently established national unit of DiSSCo, whose aim – even if a tad too optimistic – is to digitise one million natural history items in four years, 250,000 of them with photographs. So far, one year into the project, the Bulgarian team has managed to digitise more than 32,000 specimens and provide images for 10,000 of them.

The plenary session concluded with a keynote presentation by renowned ichthyologist and biodiversity data manager Dr. Richard L. Pyle, who is also a manager of ZooBank – the key international database for newly described species.

Keynote presentation by Dr Richard L. Pyle (Bishop Museum, USA) at the opening plenary session of TDWG 2022.

In his talk, he highlighted gaps in the ways taxonomy is being used, which impede biodiversity research and cut off many opportunities for timely scientific progress.

“There are simple things we can do to change how we use taxonomy as a tool that would dramatically improve our ability to conduct science and understand biodiversity. There is enormous value and utility within existing databases around the world to understand biodiversity, how threatened it is, what impacts human activity has (especially climate change), and how to optimise the protection and preservation of biodiversity,”

he said in a joint interview by the Bulgarian News Agency and Pensoft.

“But we do not have easy access to much of this information because the different databases are not well integrated. Taxonomy offers us the best opportunity to connect this information together, to answer important questions about biodiversity that we have never been able to answer before. The reason meetings like this are so important is that they bring people together to discuss ways of using modern informatics to greatly increase the power of the data we already have, and prioritise how we fill the gaps in data that exist. Taxonomy, and especially taxonomic data integration, is a very important part of the solution.”

Pyle also commented on the work in progress at ZooBank ten years into the platform’s existence and its role in the next (fifth) edition of the International Code of Zoological Nomenclature, which is currently being developed by the International Commission on Zoological Nomenclature (ICZN).

“We already know that ZooBank will play a more important role in the next edition of the Code than it has for these past ten years, so this is exactly the right time to be planning new services for ZooBank. Improvements at ZooBank will include things like better user-interfaces on the web to make it easier and faster to use ZooBank, better data services to make it easier for publishers to add content to ZooBank as part of their publication workflow, additional information about nomenclature and taxonomy that will both support the next edition of the Code, and also help taxonomists get their jobs done more efficiently and effectively. Conferences like the TDWG one are critical for helping to define what the next version of ZooBank will look like, and what it will do.”

***

During the week, the conference participants had the opportunity to enjoy a total of 140 presentations, as well as multiple social activities, including a field trip to Rila Monastery and a traditional Bulgarian dinner.

TDWG 2022 conference participants document their species observations on their way to Rila Monastery.

While going about the conference venue and the field trip localities, the attendees were also actively uploading to iNaturalist the species observations they made during their stay in Bulgaria, as part of a TDWG 2022-dedicated BioBlitz. The challenge concluded with a total of 635 observations and 228 successfully identified species.

Amongst the social activities during TDWG 2022 was a BioBlitz, where the conference participants could upload their observations made in Bulgaria to iNaturalist and help each other successfully identify the specimens.

***

In his interview for the Bulgarian News Agency and Pensoft, Dr Vincent Smith, Head of the Informatics Division at the Natural History Museum, London (United Kingdom), co-founder of DiSSCo, the Distributed System of Scientific Collections, and the Editor-in-Chief of Biodiversity Data Journal, commented: 

“Biodiversity provides the support systems for all life on Earth. Yet the natural world is in peril, and we face biodiversity and climate emergencies. The consequences of these include accelerating extinction, increased risk from zoonotic disease, degradation of natural capital, loss of sustainable livelihoods in many of the poorest yet most biodiverse countries of the world, challenges with food security, water scarcity and natural disasters, and the associated challenges of mass migration and social conflicts.

Solutions to these problems can be found in the data associated with natural science collections. DiSSCo is a partnership of the institutions that digitise their collections to harness their potential. By bringing them together in a distributed, interoperable research infrastructure, we are making them physically and digitally open, accessible, and usable for all forms of research and innovation. 

At present rates, digitising all of the UK collection – which holds more than 130 million specimens collected from across the globe and is being taken care of by over 90 institutions – is likely to take many decades, but new technologies like machine learning and computer vision are dramatically reducing the time it will take, and we are presently exploring how robotics can be applied to accelerate our work.”

Dr Vincent Smith, Head of the Informatics Division at the Natural History Museum, London, co-founder of DiSSCo, and Editor-in-Chief of Biodiversity Data Journal at the TDWG 2022 conference.

In his turn, Dr Donat Agosti, CEO and Managing Director at Plazi – a not-for-profit organisation supporting and promoting the development of persistent and openly accessible digital taxonomic literature – said:

“All the data about biodiversity is in our libraries, which include over 500 million pages, and every day new publications are added. No person can read all this, but machines allow us to mine this huge, very rich source of data. We do not know how many species we know, because even with all the scientists we cannot analyse everything in this library, nor can we follow new publications. Thus, we do not have the best possible information to explore and protect our biological environment.”

Dr Donat Agosti demonstrating the importance of publishing biodiversity data in a structured and semantically enhanced format in one of his presentations at TDWG 2022.

***

At the closing plenary session, Gail Kampmeier – TDWG Executive member and one of the first zoologists to join TDWG in 1996 – joined via Zoom to walk the conference attendees through the 37-year history of the association, originally named the Taxonomic Databases Working Group and later renamed Biodiversity Information Standards, as it expanded its activities to the whole range of biodiversity data.

“While this presentation is about TDWG’s history as an organisation, its focus will be on the heart of TDWG: its people. We would like to show how the organisation has evolved in terms of gender balance, inclusivity actions, and our engagement to promote and enhance diversity at all levels. But more importantly, where do we—as a community—want to go in the future?”,

reads the conference abstract she co-authored with her TDWG colleague Dr Visotheary Ung (CNRS-MNHN).

Then, in the final talk of the session, Deborah Paul took to the stage to present the progress and key achievements by the association from 2022.

She gave a special shout-out to the TDWG journal, Biodiversity Information Science and Standards (BISS), where, for the 6th consecutive year, the participants of the annual conference submitted and published their conference abstracts ahead of the event.

Deborah Paul reminds that – apart from the conference abstracts – the TDWG journal Biodiversity Information Science and Standards (BISS) also welcomes full-length articles that demonstrate the development or application of new methods and approaches in biodiversity informatics.

Launched in 2017 on Pensoft’s publishing platform ARPHA, the journal provides the unique and innovative opportunity to have both abstracts and full-length research papers published in a modern, technologically advanced scholarly journal. In her speech, Deborah Paul reminded the audience that BISS welcomes research articles that demonstrate the development or application of new methods and approaches in biodiversity informatics in the form of case studies.

Amongst the achievements of TDWG and its community, a special place was reserved for the Horizon 2020-funded BiCIKL project (abbreviation for Biodiversity Community Integrated Knowledge Library), involving many of the association’s members. 

Having started in 2021, the 3-year project, coordinated by Pensoft, brings together 14 partnering institutions from 10 countries under the common goal of creating a centralised place to connect all key biodiversity data by interlinking a total of 15 research infrastructures and their databases.

Deborah Paul also reported on the progress of the Horizon 2020-funded project BiCIKL, which involves many of the TDWG members. BiCIKL’s goal is to create a centralised place to connect all key biodiversity data by interlinking 15 key research infrastructures and their databases.

In fact, following the week-long TDWG 2022 conference in Sofia, a good many of the participants set off straight for another Bulgarian city and another event hosted by Pensoft. The Second General Assembly of BiCIKL took place between 22nd and 24th October in Plovdiv.

***

You can also explore highlights and live tweets from TDWG 2022 on Twitter via #TDWG2022.

The Pensoft team at TDWG 2022 were happy to become the hosts of the 37th TDWG conference.

‘Who is in your database and why does it matter?’

The uncertainty about a person’s identity hampers research, hinders the discovery of expertise, and obstructs the ability to give attribution or credit for work performed. 

Collection discovery through disambiguation

Guest blog post by Sabine von Mering, Heather Rogers, Siobhan Leachman, David P. Shorthouse, Deborah Paul & Quentin Groom

Worldwide, natural history institutions house billions of physical objects in their collections; they create and maintain data about these items, and they share their data with aggregators such as the Global Biodiversity Information Facility (GBIF), the Integrated Digitized Biocollections (iDigBio), the Atlas of Living Australia (ALA), GenBank and the European Nucleotide Archive (ENA).

Even though these data often include the names of the people who collected or identified each object, such statements may be ambiguous, as the names frequently lack any globally unique, machine-readable concept of their shared identity.

Despite the data being available online, barriers exist to effectively use the information about who collects or provides the expertise to identify the collection objects. People have similar names, change their name over the course of their lifetime (e.g. through marriage), or there may be variability introduced through the label transcription process itself (e.g. local look-up lists). 

As a result, researchers and collections staff often spend a lot of time deducing who is the person or people behind unknown collector strings while collating or tidying natural history data. The uncertainty about a person’s identity hampers research, hinders the discovery of expertise, and obstructs the ability to give attribution or credit for work performed. 

Disambiguation activities – the act of churning strings into verifiable things using all available evidence – need not be done in isolation. In addition to presenting a workflow on how to disambiguate people in collections, we also make the case that working in collaboration with colleagues and the general public presents new opportunities and introduces new efficiencies. There is tacit knowledge everywhere.

More often than not, data about people involved in biodiversity research are scattered across different digital platforms. However, by linking information sources to each other using person identifiers, we can better trace the connections in these networks and weave a more interoperable narrative about every actor.

That said, inconsistent naming conventions or lack of adequate accreditation often frustrate the realization of this vision. This sliver of natural history could be churned to gold with modest improvements in long-term funding for human resources, adjustments to digital infrastructure, space for the physical objects themselves alongside their associated documents, and sufficient training on how to disambiguate people’s names.

“He aha te mea nui o te ao. He tāngata, he tāngata, he tāngata.”

“What is the most important thing in the world? It is people, it is people, it is people.”

(Māori proverb)

The process of properly disambiguating those who have contributed to natural history collections takes time. 

The disambiguation process involves the extra challenge of trying to deduce “who is who” for legacy data, compared to undertaking this activity for people alive today. Retrospective disambiguation can require considerable detective work, especially for scarcely known people or if the community has a different naming convention. Provided the results of this effort are well-communicated and openly shared, mercifully, it need only be done once.

At the core of our research is the question of how to solve the issue of assigning proper credit

In our recent Methods paper, we discuss several methods for this, as well as available routes for making records available online that include not only the names of people expressed as text, but additionally twinned with their unique, resolvable identifiers. 

Disambiguation is a cycle: enrichment of the data feeds off itself, leading to further disambiguation. As more names are disambiguated and more biographical data are accumulated, it becomes easier to disambiguate more names.

First and foremost, we should maintain our own public biographical data by making full use of ORCID. In addition to preserving our own scientific legacy and that of the institutions that employ us, we have a responsibility to avoid generating unnecessary disambiguation work for others. 

For legacy data, where the people connected to the collections are deceased, Wikidata can be used to openly document rich bibliographic and demographic data, each statement with one or more verifiable references. Wikidata can also act as a bridge to link other sources of authority such as VIAF or ORCID identifiers. It has many tools and services to bulk import, export, and to query information, making it well-suited as a universal democratiser of information about people often walled-off in collection management systems (CMS). 
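To illustrate how such a bridge can be queried in practice, here is a minimal sketch of our own (not part of the original post or the Methods paper). It calls Wikidata’s public SPARQL endpoint for the VIAF (property P214) and ORCID (property P496) identifiers attached to a person item; Q42 (Douglas Adams) is used purely as a well-known demonstration item and should be swapped for the QID of the collector in question.

```python
# Minimal sketch: fetch the VIAF and ORCID identifiers Wikidata records
# for a person item. P214 = VIAF ID, P496 = ORCID iD.
import requests

WDQS = "https://query.wikidata.org/sparql"

def external_ids(qid: str) -> list[tuple[str, str]]:
    """Return (property, value) pairs for the item's VIAF and ORCID IDs."""
    query = f"""
    SELECT ?prop ?value WHERE {{
      VALUES ?prop {{ wdt:P214 wdt:P496 }}
      wd:{qid} ?prop ?value .
    }}
    """
    response = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "disambiguation-demo/0.1"},  # WDQS asks clients to identify themselves
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [(b["prop"]["value"], b["value"]["value"]) for b in bindings]

# Q42 (Douglas Adams) is a demonstration item; replace it with the QID
# of the person you are disambiguating.
for prop, value in external_ids("Q42"):
    print(prop, value)
```

The same pattern extends to any other person identifier modelled on Wikidata, which is exactly what makes it useful as a hub between collection management systems and external authority files.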

A network of the top twenty most used identifiers for biologists on Wikidata.

Once unique identifiers for people are integrated in collection management systems, these may be shared with the global collections and research community using the new Darwin Core terms recordedByID or identifiedByID, along with the well-known yet text-based terms recordedBy or identifiedBy.
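As a hedged illustration (our own sketch, not an excerpt from the Methods paper), a single occurrence record carrying both the text-based and the identifier-based Darwin Core terms might look like the following; every name and identifier below is a placeholder:

```python
# A simplified Darwin Core occurrence record. The text-based terms keep the
# original label strings; the *ID terms carry globally unique, resolvable
# identifiers. Darwin Core recommends " | " to separate multiple values.
occurrence = {
    "occurrenceID": "urn:catalog:EXAMPLE:0001",
    # Ambiguous label string, transcribed exactly as written:
    "recordedBy": "Cockerell & wife",
    # Disambiguated person identifiers (placeholder QIDs, not the real items):
    "recordedByID": "https://www.wikidata.org/entity/Q1 | https://www.wikidata.org/entity/Q2",
    "identifiedBy": "A. N. Expert",
    # For a living person this would typically be an ORCID URI (placeholder):
    "identifiedByID": "https://orcid.org/0000-0000-0000-0000",
}
```

Because the identifier travels with the record, downstream services such as GBIF and Bionomia can resolve the person regardless of how the underlying name string was spelled.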

Approximately 120 datasets published through GBIF now make use of these identifier-based terms, which are additionally resolved in Bionomia every few weeks alongside co-curated attributions newly made there. This roundtrip of data – emerging as ambiguous strings of text from the source, affixed with resolvable identifiers elsewhere, absorbed into the source as new digital annotations, and then re-emerging with these fresh, identifier-based enhancements – is an exciting approach to co-manage collections data.

Round tripping. In Bionomia, people identifiers from Wikidata and ORCID are used to enrich data published via GBIF, thus linking natural history specimens to the world’s collectors.

Disambiguation work is particularly important in recognising contributors who have been historically marginalized. For example, gender bias in specimen data can be seen in the case of Wilmatte Porter Cockerell, a prolific collector of botanical, entomological and fossil specimens. Cockerell’s collections are often attributed to her husband as he was also a prolific collector and the two frequently collected together. 

On some labels, her identity is further obscured, as she is simply recorded as “& wife” (see example on GBIF). Since Wilmatte Cockerell was her husband’s second wife, it can take some effort to confirm whether a specimen should be attributed to her rather than to her husband’s first wife, who was also involved in collecting specimens. By ensuring that Cockerell is disambiguated and her contributions are appropriately attributed, the impact of her work becomes more visible, enabling it to be properly and fairly credited.

Thus, disambiguation work not only helps to give credit where credit is due, making data about people and their biodiversity collections more findable, but also creates an inclusive and representative narrative of the landscape of people involved in scientific knowledge creation, identification, and preservation.

A future – once thought to be a dream – where the complete scientific output of a person is connected as Linked Open Data (LOD) is now

Both the tools and infrastructure are at our disposal and the demand is palpable. All institutions can contribute to this movement by sharing data that include unique identifiers for the people in their collections. We recommend that institutions develop a strategy, perhaps starting with employees and curatorial staff, people of local significance, or those who have been marginalized, and to additionally capitalize on existing disambiguation activities elsewhere. This will have local utility and will make a significant, long-term impact. 

The more we participate in these activities, the greater chance we will uncover positive feedback loops, which will act to lighten the workload for all involved, including our future selves!

The disambiguation of people in collections is an ongoing process, but it becomes easier with practice. We also encourage collections staff to consider modifying their existing workflows and policies to include identifiers for people at the outset, when new data are generated or when new specimens are acquired. 

There is more work required at the global level to define, update, and ratify standards and best practices to help accelerate data exchange or roundtrips of this information; there is room for all contributions. Thankfully, there is a diverse, welcoming, energetic, and international community involved in these activities. 

We see a bright future for you, our collections, and our research products – well within reach – when the identities of people play a pivotal role in the construction of a knowledge graph of life.

Would you like to participate and need support getting started with disambiguating your collection? Please contact our TDWG People in Biodiversity Data Task Group.

Another good start is to check Bionomia to find out what metrics already exist for your institution or collection and its affiliated people.

The next steps for collections: 7 objectives that can help to disambiguate your institution’s collection:

1. Promote the use of person identifiers in local, national or international outreach, publishing and research activities

2. Increase the number of collection management systems that use person identifiers

3. Increase the number of living collectors registered and using an ORCID identifier when contributing to collections

4. Undertake disambiguation in the national languages of many countries

5. Increase the number of identified people on Wikidata linked to collections

6. Increase the number of people in collections with expertise in person disambiguation

7. Collaborate towards an exchange standard for attribution data

A real example of how a name string is disambiguated and the steps taken in documenting it: the Wikidata item of Jean-André Soulié.

***

Methods publication:

Groom Q, Bräuchler C, Cubey RWN, Dillen M, Huybrechts P, Kearney N, Klazenga N, Leachman S, Paul DL, Rogers H, Santos J, Shorthouse DP, Vaughan A, von Mering S, Haston EM (2022) The disambiguation of people names in biological collections. Biodiversity Data Journal 10: e86089. https://doi.org/10.3897/BDJ.10.e86089

***

Follow Biodiversity Data Journal on Twitter and Facebook.