A new dawn for biological collections: The AI revolution in museums and herbaria

There are numerous uses for machine learning in digital collections, including an enormous potential to extract traits of organisms.

Guest blog post by Quentin Groom

Imagine having access to all the two billion biological collections of the world from your desktop! Not only to browse, but to search with artificial intelligence. We recently published a paper where we envisage what might be possible, such as searching all specimen labels for a person’s signature, studying the patterns of butterflies’ wings, or reconstructing a historic expedition.

Numbers of digital images from biodiversity collections are increasing exponentially. Herbaria have led the way with tens of millions of images available, but images of pinned insects will soon overtake plants.

Numbers of accessible images of specimens are increasing exponentially. Plants lead the way, but insects are increasing at the fastest rate. This graph was created from snapshots of the Global Biodiversity Information Facility and is undoubtedly an underestimate of the actual number of specimens for which images exist. See how this was created in Groom et al. (2023).
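As a rough illustration of how such counts can be obtained today, the sketch below builds a query against the public GBIF occurrence search API for preserved specimens that have at least one still image. This is not the snapshot-based method used for the graph; the helper names are hypothetical, and the taxon key in the usage note (6, assumed to be the GBIF backbone key for Plantae) is an assumption that should be checked before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_count_url(taxon_key: int) -> str:
    """Build a search URL for imaged, preserved specimens of one taxon."""
    params = {
        "basisOfRecord": "PRESERVED_SPECIMEN",  # physical collection specimens
        "mediaType": "StillImage",              # records with at least one image
        "taxonKey": taxon_key,                  # GBIF backbone key, supplied by the caller
        "limit": 0,                             # only the total is needed, not the records
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


def fetch_count(taxon_key: int) -> int:
    """Ask GBIF for the current number of matching records (needs network access)."""
    with urlopen(build_count_url(taxon_key), timeout=30) as response:
        return json.load(response)["count"]
```

Calling `fetch_count(6)` would return the current total for the assumed Plantae key. Because it reflects the live index rather than historical snapshots, it gives a single point on the curve, not the growth trend itself.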

At one time, if you wanted access to biological collections, you had to travel. Now we are used to visiting collections online, where we can view images of specimens and their details on our desktops. Nevertheless, biological collection images are still dispersed and this limits their effective use, not just for people, but also for computers. One of the promises of making specimens digital is being able to apply machine learning to these images. Yet the real benefits of machine access to specimens can only be realised through massive access to collection images and the ability to apply these techniques to hundreds of collections and millions of specimens.

In our paper in Biodiversity Data Journal, we examined some of the numerous uses for machine learning in digital collections. These include an enormous potential to extract traits of organisms, from the size and shape of different organs, to their colours, patterns, and phenology. Imagine examining collections globally for the variation and evolution of wing coloration in butterflies, or studying the size and shape of leaves in research that traverses habitats and gradients of latitude and altitude. We would not only be able to study the intricacies of evolution, but also practical subjects, such as the mechanics of pollination in insects, adaptations to drought in plants, and adaptations to weediness in invasive species.

Machine access to these images will also provide an unparalleled view of the history of the biological sciences, the specimens used to describe species, the evidence for evolution, the people involved and institutions that contributed. Such transparency may reveal some amazing stories of scientific exploration, but will undoubtedly also shed light on some of the less exemplary actions of colonialism. Yet if we are to redress the injustices of the past we need to have a balanced view of collections, and we should do this openly.

Specimen labels provide numerous clues to their history, often in the form of stamps and emblems. A BR0000013433048, Meise Botanic Garden (CC-BY-SA 4.0). B USCH0030719, A.C. Moore Herbarium at the University of South Carolina (public domain). C E00809288, Royal Botanic Garden Edinburgh (public domain). D USCH0030719, University of South Carolina (public domain). E E00919066, Royal Botanic Garden Edinburgh (public domain). F BR0000017682725, Meise Botanic Garden (CC-BY-SA 4.0). G P00605317, Muséum National d’Histoire Naturelle, Paris (CC-BY 4.0). H LISC036829, Instituto de Investigação Científica Tropical (CC-BY-NC 4.0). I PC0702930, Muséum National d’Histoire Naturelle, Paris (CC-BY 4.0). J same specimen as (B). K PC0702930, Muséum National d’Histoire Naturelle, Paris (CC-BY 4.0). L 101178648, Missouri Botanical Garden (CC-BY-SA 4.0).

With such unparalleled access to collections, we could travel vicariously to times and places that are hard to reach in any other way. Fieldwork is expensive and time-consuming, and can’t provide the historic perspective of collections, let alone the geographic extent. Furthermore, digital resources have the potential to democratise collections, allowing anyone the opportunity to study these collections irrespective of location.

Is such a vision of integrated digital collections possible? It certainly is! The technologies already exist, not just for machine learning, but also to create the infrastructure to provide access to millions of digital images and their metadata. Initiatives such as DiSSCo in Europe and iDigBio in the USA are moving in this direction. Yet, we conclude that the main challenge to realising this vision of the future is a sociopolitical one. Can so many institutions and funders work together to pool their resources? Can collections in rich countries share the sovereignty of their collections with the countries where many of the specimens originated?

If you too share the dream, we encourage you to support or contribute to initiatives working in this direction, whether through funding, collaboration, or sharing knowledge. If the full potential of digital collections is to be realised, we need to think big and work together.

Research article:

Groom Q, Dillen M, Addink W, Ariño AHH, Bölling C, Bonnet P, Cecchi L, Ellwood ER, Figueira R, Gagnier P-Y, Grace OM, Güntsch A, Hardy H, Huybrechts P, Hyam R, Joly AAJ, Kommineni VK, Larridon I, Livermore L, Lopes RJ, Meeus S, Miller JA, Milleville K, Panda R, Pignal M, Poelen J, Ristevski B, Robertson T, Rufino AC, Santos J, Schermer M, Scott B, Seltmann KC, Teixeira H, Trekels M, Gaikwad J (2023) Envisaging a global infrastructure to exploit the potential of digitised collections. Biodiversity Data Journal 11: e109439. https://doi.org/10.3897/BDJ.11.e109439

Invasive alien species? Isn’t there an app for that?

Scientists review 41 invasive species reporting apps and provide recommendations for future development.

Invasive alien species (IAS) are a leading contributor to biodiversity loss, and they cause annual economic damage in the order of hundreds of billions of US dollars in each of many countries around the world. Smartphone apps are one relatively new tool that could help monitor, predict, and ideally prevent their spread. But are they living up to their full potential?

A team of researchers from the University of Montana, the Flathead Lake Biological Station and the University of Georgia River Basin Center tried to answer that in a recent research paper in the open access, peer-reviewed journal NeoBiota. Going through nearly 500 peer-reviewed articles, they identified the key features of the perfect IAS reporting app and then rated all known English-language IAS reporting apps available to North American users against this ideal.

Smartphone apps have the potential to be powerful reporting tools. Citizen scientists around the world have made major contributions to the reporting of biodiversity using apps like iNaturalist and eBird. But apps for reporting invasive species never reached that level of popularity; Howard and his team investigated why.

Smartphone apps like the soon-to-be-released new EDDMapS platform are promising tools for monitoring, predicting, and reducing the spread of invasive species. However, they have not seen the explosion of reports experienced by biodiversity-wide platforms. Howard et al. investigate why invasive species-specific apps have not had the same boom in use. Image by Leif Howard and Charles van Rees.

User uptake and retention are just as important as collecting data. Howard and colleagues found that apps tend to do a good job with one of these, and rarely with both. In their paper, they emphasize that making apps user-friendly and fun to use, with games and useful functions like species identification and social media plug-ins, is a major missing piece among current apps.

“The greatest advancement in IAS early detection would likely result from app gamification,” they write.

Another feature they would like to see more of is artificial intelligence or machine learning for photo identification, which they believe would greatly enhance species identification and might increase public participation.

The authors also make suggestions for future innovations that could make IAS reporting apps even more effective. Their biggest suggestion is coordination. 

“Currently, most invasive species apps are developed by many separate organizations, leading to duplicated effort and inconsistent implementation”, they say. “The valuable data collected by these apps is also sent to different databases, making it harder for scientists to combine them for useful research.”

A more efficient way to implement these technologies might be providing open-source code and app templates, with which local organizations can make regional apps that contribute data to centralized databases. 

Overall, this research shows how with broader participation, more complete and informative reporting forms, and more consistent and structured data management, IAS reporting apps could make much larger contributions to invasive species management worldwide. This, in turn, could save local, regional, and national economies hundreds of millions or billions of dollars annually, while protecting valuable ecological and agricultural systems for future generations.

Research article:

Howard L, van Rees C, Dahlquist Z, Luikart G, Hand B (2022) A review of invasive species reporting apps for citizen science and opportunities for innovation. NeoBiota 71: 165-188. https://doi.org/10.3897/neobiota.71.79597

Data mining applied to scholarly publications to finally reveal Earth’s biodiversity

At a time when a million species are at risk of extinction, according to a recent UN report, ironically, we don’t know how many species there are on Earth, nor have we noted down all those that we have come to know on a single list. In fact, we don’t even know how many species we would have put on such a list.

The combined research of over 2,000 natural history institutions worldwide has produced an estimated 500 million pages of scholarly publications and tens of millions of illustrations and species descriptions, comprising all we currently know about the diversity of life. However, most of it isn’t digitally accessible. Even if it were digital, our current publishing systems wouldn’t be able to keep up: about 50 species are described as new to science every day, and all of these are published in plain text and PDF format, where the data cannot be mined by machines and a human is required to extract them. Furthermore, those publications often appear in subscription (closed access) journals.

The Biodiversity Literature Repository (BLR), a joint project of Plazi, Pensoft and Zenodo at CERN, takes on the challenge to open up the access to the data trapped in scientific publications, and find out how many species we know so far, what are their most important characteristics (also referred to as descriptions or taxonomic treatments), and how they look on various images. To do so, BLR uses highly standardised formats and terminology, typical for scientific publications, to discover and extract data from text written primarily for human consumption.

By relying on state-of-the-art data mining algorithms, BLR allows for the detection, extraction and enrichment of data, including DNA sequences, specimen collecting data or related descriptions, as well as providing implicit links to their sources: collections, repositories etc. As a result, BLR is the world’s largest public domain database of taxonomic treatments, images and associated original publications.

Once the data are available, they are immediately distributed to global biodiversity platforms, such as GBIF–the Global Biodiversity Information Facility. As of now, there are about 42,000 species whose original scientific descriptions are only accessible because of BLR.

The very basic principle in science to cite previous information allows us to trace back the history of a particular species, to understand how the knowledge about it grew over time, and even whether and how its name has changed through the years. As a result, this service is one avenue to uncover the catalogue of life by means of simple lookups.

So far, the lessons learned have led to the development of TaxPub, an extension of the United States National Library of Medicine Journal Tag Suite and its application in a new class of 26 scientific journals. As a result, the data associated with articles in these journals are machine-accessible from the beginning of the publishing process. Thus, as soon as the paper comes out, the data are automatically added to GBIF.

While BLR is expected to open up millions of scientific illustrations and descriptions, the system is unique in that it makes all the extracted data findable, accessible, interoperable and reusable (FAIR), as well as open to anybody, anywhere, at any time. Most of all, its purpose is to create a novel way to access scientific literature.

To date, BLR has extracted ~350,000 taxonomic treatments and ~200,000 figures from over 38,000 publications. This includes the descriptions of 55,800 new species, 3,744 new genera, and 28 new families. BLR has contributed to the discovery of over 30% of the ~17,000 species described annually.

Prof. Lyubomir Penev, founder and CEO of Pensoft, says:

“It is such a great satisfaction to see how the development process of the TaxPub standard, started by Plazi some 15 years ago and implemented as a routine publishing workflow at Pensoft’s journals in 2010, has now resulted in an entire infrastructure that allows automated extraction and distribution of biodiversity data from various journals across the globe. With the recent announcement from the Consortium of European Taxonomic Facilities (CETAF) that their European Journal of Taxonomy is joining the TaxPub club, we are even more confident that we are paving the right way to fully grasping the dimensions of the world’s biodiversity.”

Dr Donat Agosti, co-founder and president of Plazi, adds:

“Finally, information technology allows us to create a comprehensive, extended catalogue of life and bring to light this huge corpus of cultural and scientific heritage – the description of life on Earth – for everybody. The nature of taxonomic treatments as a network of citations and syntheses of what scientists have discovered about a species allows us to link distinct fields such as genomics and taxonomy to specimens in natural history museums.”

Dr Tim Smith, Head of Collaboration, Devices and Applications Group at CERN, comments:

“Moving the focus away from the papers, where concepts are communicated, to the concepts themselves is a hugely significant step. It enables BLR to offer a unique new interconnected view of the species of our world, where the taxonomic treatments, their provenance, histories and their illustrations are all linked, accessible and findable. This is inspirational for the digital liberation of other fields of study!”

###

Additional information:

BLR is a joint project led by Plazi in partnership with Pensoft and Zenodo at CERN.

Currently, BLR is supported by a grant from Arcadia, a charitable fund of Lisbet Rausing and Peter Baldwin.

Scientists use forensic technology to genetically document infanticide in brown bears

Modern open-source software helped the researchers identify the male that killed a female and her two cubs

Scientists used a technology designed for human forensics to provide the first genetically documented case of infanticide in brown bears, following the murder of a female and her two cubs in Trentino, in the Italian Alps, where a small re-introduced population has been genetically monitored for 20 years.

The study, conducted and authored by Francesca Davoli, The Italian Institute for Environmental Protection and Research (ISPRA), Bologna, and her team, is published in the open access journal Nature Conservation.

To secure their own reproduction, males of some social mammalian species, such as lions and bears, exhibit infanticidal behaviour where they kill the offspring of their competitors, so that they can mate with the females which become fertile again soon after they lose their cubs. However, sometimes females are also killed while trying to protect their young, resulting in a survival threat to small populations and endangered species.

“In isolated populations with a small number of reproductive adults, sexually selected infanticide can negatively impact the long-term conservation of the species, especially in the case where the female is killed while protecting her cubs,” point out the researchers.

“Taking this into account, the genetic identification of the perpetrators could give concrete indications for the management of small populations, for example, placing radio-collars on infanticidal males to track them,” they add. “Nevertheless, genetic studies for identifying infanticidal males have received little attention.”

Thanks to a database containing the genotypes of all bears known to inhabit the study site and open-source software used to analyse human forensic genetic profiles, the scientists were able to solve the case much like in a television crime series.

Upon finding the three corpses, the researchers were certain that the animals had not been killed by a human. In the beginning, the suspects were all male brown bears reported from the area in 2015.

Hoping to isolate the DNA of the perpetrator, the researchers collected three samples of hairs and swabbed the female’s wounds in search of saliva. Dealing with a relatively small population, the scientists expected that the animals would share a genotype to an extent, meaning they needed plenty of samples.

However, while the DNA retrieved from the saliva swabs did point to an adult male, at first glance it seemed that it belonged to the cubs’ father. Later, the scientists puzzled out that the attacker must have injured the cubs and the mother alternately, thus spreading blood containing the inherited genetic material from the father bear. Previous knowledge also excluded the father, since there are no known cases of male bears killing their own offspring. In fact, they seem to distinguish their own young, most likely by recognising the mother.

To successfully determine the attacker, the scientists had to use the very small amount of genetic material from the saliva swabs they managed to collect and conduct a highly sophisticated analysis, in order to obtain four genetic profiles largely overlapping with each other. Then, they compared them against each of the males reported from the area that year. Eventually, they narrowed down the options to an individual listed as M7.
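The comparison step can be pictured with a minimal sketch. It is not the authors' forensic software or statistical model: it only assumes that each partial profile lists the alleles detected at some loci, and it counts, for every reference male, the loci where a profile shows an allele that male does not carry. All names, loci and allele values below are hypothetical and invented for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# A genotype maps a locus name to the set of allele sizes observed there.
Genotype = Dict[str, FrozenSet[int]]


def count_mismatches(sample: Genotype, reference: Genotype) -> int:
    """Count loci where the sample shows an allele the reference animal lacks.

    Missing alleles in the sample are tolerated, since low-quantity DNA
    often yields incomplete profiles.
    """
    mismatches = 0
    for locus, alleles in sample.items():
        ref_alleles = reference.get(locus)
        if ref_alleles is None:
            continue  # locus not typed for this reference animal
        if not alleles <= ref_alleles:
            mismatches += 1
    return mismatches


def rank_candidates(
    samples: List[Genotype], references: Dict[str, Genotype]
) -> List[Tuple[str, int]]:
    """Rank reference animals by total mismatches across all sample profiles."""
    totals = {
        name: sum(count_mismatches(sample, ref) for sample in samples)
        for name, ref in references.items()
    }
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


# Hypothetical reference database and two partial profiles from swabs.
references = {
    "bear_A": {"loc1": frozenset({100, 104}), "loc2": frozenset({200}),
               "loc3": frozenset({150, 152})},
    "bear_B": {"loc1": frozenset({100, 102}), "loc2": frozenset({200, 204}),
               "loc3": frozenset({150, 154})},
}
samples = [
    {"loc1": frozenset({100}), "loc2": frozenset({204})},
    {"loc1": frozenset({102}), "loc3": frozenset({150, 154})},
]

ranking = rank_candidates(samples, references)
print(ranking)  # [('bear_B', 0), ('bear_A', 3)]
```

A real analysis would also have to handle mixed samples, such as saliva contaminated with the victims' blood, and would weigh matches by allele frequencies in the population, which this sketch deliberately leaves out.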

“The monitoring of litters is a fundamental tool for the management of bear populations: it has allowed the authors to genetically confirm the existence of cases of infanticide and in the future may facilitate the retrieval of information necessary to assess the impact of SSI [sexually selected infanticide] on demographic trends,” conclude the researchers.

###

Original source:

Davoli F, Cozzo M, Angeli F, Groff C, Randi E (2018) Infanticide in brown bear: a case-study in the Italian Alps – Genetic identification of perpetrator and implications in small populations. Nature Conservation 25: 55-75. https://doi.org/10.3897/natureconservation.25.23776