A new dawn for biological collections: The AI revolution in museums and herbaria

There are numerous uses for machine learning in digital collections, including an enormous potential to extract traits of organisms.

Guest blog post by Quentin Groom

Imagine having access to all two billion specimens in the world’s biological collections from your desktop! Not only to browse, but to search with artificial intelligence. We recently published a paper in which we envisage what might be possible, such as searching all specimen labels for a person’s signature, studying the patterns of butterflies’ wings, or reconstructing a historic expedition.

Numbers of digital images from biodiversity collections are increasing exponentially. Herbaria have led the way with tens of millions of images available, but images of pinned insects will soon overtake plants.

Numbers of accessible images of specimens are increasing exponentially. Plants lead the way, but insects are increasing at the fastest rate. This graph was created from snapshots of the Global Biodiversity Information Facility and is undoubtedly an underestimate of the actual number of specimens for which images exist. See how this was created in Groom et al. (2023).

At one time, if you wanted access to biological collections, you had to travel. Now we are used to visiting collections online, where we can view images of specimens and their details on our desktops. Nevertheless, biological collection images are still dispersed, and this limits their effective use, not just for people, but also for computers. One of the promises of making specimens digital is being able to apply machine learning to these images. Yet the real benefits of machine access to specimens can only be realised through massive access to collection images and the ability to apply these techniques to hundreds of collections and millions of specimens.


In our paper in Biodiversity Data Journal, we examined some of the numerous uses for machine learning in digital collections. These include an enormous potential to extract traits of organisms, from the size and shape of different organs, to their colours, patterns, and phenology. Imagine examining collections globally for the variation and evolution of wing coloration in butterflies, or studying the size and shape of leaves in research that traverses habitats and gradients of latitude and altitude. We would not only be able to study the intricacies of evolution, but also practical subjects, such as the mechanics of pollination in insects, adaptations to drought in plants, and adaptations to weediness in invasive species.

Machine access to these images will also provide an unparalleled view of the history of the biological sciences, the specimens used to describe species, the evidence for evolution, the people involved and institutions that contributed. Such transparency may reveal some amazing stories of scientific exploration, but will undoubtedly also shed light on some of the less exemplary actions of colonialism. Yet if we are to redress the injustices of the past we need to have a balanced view of collections, and we should do this openly.

Specimen labels provide numerous clues to their history, often in the form of stamps and emblems. A BR0000013433048, Meise Botanic Garden (CC-BY-SA 4.0). B USCH0030719, A.C. Moore Herbarium at the University of South Carolina (public domain). C E00809288, Royal Botanic Garden Edinburgh (public domain). D USCH0030719, University of South Carolina (public domain). E E00919066, Royal Botanic Garden Edinburgh (public domain). F BR0000017682725, Meise Botanic Garden (CC-BY-SA 4.0). G P00605317, Muséum National d’Histoire Naturelle, Paris (CC-BY 4.0). H LISC036829, Instituto de Investigação Científica Tropical (CC-BY-NC 4.0). I PC0702930, Muséum National d’Histoire Naturelle, Paris (CC-BY 4.0). J same specimen as (B). K PC0702930, Muséum National d’Histoire Naturelle, Paris (CC-BY 4.0). L 101178648, Missouri Botanical Garden (CC-BY-SA 4.0).

With such unparalleled access to collections, we could travel vicariously to times and places that are hard to reach in any other way. Fieldwork is expensive and time-consuming, and can’t provide the historic perspective of collections, let alone the geographic extent. Furthermore, digital resources have the potential to democratise collections, allowing anyone the opportunity to study these collections irrespective of location.

Is such a vision of integrated digital collections possible? It certainly is! The technologies already exist, not just for machine learning, but also to create the infrastructure to provide access to millions of digital images and their metadata. Initiatives such as DiSSCo in Europe and iDigBio in the USA are moving in this direction. Yet we conclude that the main challenge to realising this vision of the future is a sociopolitical one. Can so many institutions and funders work together to pool their resources? Can collections in rich countries share the sovereignty of their collections with the countries where many of the specimens originated?

If you too share the dream, we encourage you to support or contribute to initiatives working in this direction, whether through funding, collaboration, or sharing knowledge. If the full potential of digital collections is to be realised, we need to think big and work together.

Research article:

Groom Q, Dillen M, Addink W, Ariño AHH, Bölling C, Bonnet P, Cecchi L, Ellwood ER, Figueira R, Gagnier P-Y, Grace OM, Güntsch A, Hardy H, Huybrechts P, Hyam R, Joly AAJ, Kommineni VK, Larridon I, Livermore L, Lopes RJ, Meeus S, Miller JA, Milleville K, Panda R, Pignal M, Poelen J, Ristevski B, Robertson T, Rufino AC, Santos J, Schermer M, Scott B, Seltmann KC, Teixeira H, Trekels M, Gaikwad J (2023) Envisaging a global infrastructure to exploit the potential of digitised collections. Biodiversity Data Journal 11: e109439. https://doi.org/10.3897/BDJ.11.e109439

Strategic collaboration agreement signed between ScienceOpen and Pensoft


The research discovery platform ScienceOpen and Pensoft Publishers have entered into a strategic collaboration partnership with the aim of strengthening the companies’ identities as the leaders of innovative content dissemination. The new cooperation will focus on the unified indexation, the integration of Pensoft’s ARPHA Platform content into ScienceOpen and the utilization of novel streams of scientific communication for the published materials.

Pensoft is an independent academic publishing company, well known worldwide for bringing novelty through its cutting-edge publishing tools and for its commitment to open access practices. In 2013, Pensoft launched the first-ever end-to-end, XML-based authoring, reviewing and publishing workflow, now upgraded to the ARPHA Publishing Platform. As of today, ARPHA hosts over 50 open access, peer-reviewed scholarly journals: the whole Pensoft portfolio in addition to titles owned by learned societies, university presses and research institutions.

As part of the strategic collaboration, all Pensoft content and journals hosted on ARPHA are indexed in ScienceOpen’s research and discovery environment, which puts them into the thematic context of over 60 million articles and books. In addition, thousands of articles across more than 20 journals were integrated into a “Pensoft Biodiversity” Collection. Combined this way, the content benefits from the special infrastructure of ScienceOpen Collections, which supports thematic groups of articles and books equipped with a unique landing page, a built-in search engine and an overview of the featured content. The Collections can be reviewed, recommended and shared by users, which facilitates academic debate and increases the discoverability of the research.

The Pensoft Biodiversity collection is available from: https://www.scienceopen.com/collection/PensoftBiodiversity

“It is certainly great news and a much-anticipated milestone for Pensoft, ARPHA and our long-year partners and supporters from ScienceOpen to have brought our collaboration to a new level by indexing the whole ARPHA-hosted content at ScienceOpen,” comments Pensoft’s and ARPHA’s CEO and founder Prof. Lyubomir Penev. “Most of all, the integration between ARPHA and ScienceOpen at an infrastructural level means that we will be able to offer this incredible service and increased visibility to newcoming journals right away. On the other hand, by streaming fresh and valuable publicly accessible content to the ScienceOpen database, these journals will be further adding to the growth of science in the open.”

Stephanie Dawson, CEO of ScienceOpen, says, “I am particularly excited to add new high-quality, open access biodiversity content from Pensoft Publishers to the ScienceOpen discovery environment as we have a very active community of researchers on ScienceOpen creating and sharing Collections in this field. We are looking forward to working with Pensoft’s innovative journals to support their open science goals.”

The collaboration reflects not only the commitment of both Pensoft and ScienceOpen to new methods of knowledge dissemination, but also the joint mission to champion open science through innovation. The two companies will cooperate at a strategic level in order to increase the international outreach of their content and services, and to make them even more accessible to the broad community.

###

About ScienceOpen:

From promotional collections to Open Access hosting and full publishing packages, ScienceOpen provides next-generation services to academic publishers embedded in an interactive discovery platform. ScienceOpen was founded in 2013 in Berlin and Boston by Alexander Grossmann and Tibor Tscheke to accelerate research communication.