Pensoft took a BiCIKL ride to Naturalis to report on a 3-year endeavour towards FAIR data

Three years ago, the BiCIKL consortium set out to overcome obstacles to wider use and adoption of FAIR and linked biodiversity data.

Leiden – also known as the ‘City of Keys’ and the ‘City of Discoveries’ – was aptly chosen to host the third Empowering Biodiversity Research (EBR III) conference. The two-day conference – this time focusing on the use of biodiversity data as a vehicle for biodiversity research to reach policy – was held in a no less fitting venue: the Naturalis Biodiversity Center.

On 25th and 26th March 2024, the delegates got the chance to learn more about the latest discoveries, trends and innovations from scientists, as well as various stakeholders, including representatives of policy-making bodies, research institutions and infrastructures. The conference also ran a poster session and a Biodiversity Informatics market, where scientists, research teams, project consortia, and providers of biodiversity research-related services and tools could showcase their work and meet like-minded professionals.

BiCIKL stops at the Naturalis Biodiversity Center

The main outcome of the BiCIKL project: the Biodiversity Knowledge Hub, a one-stop knowledge portal to interlinked and machine-readable FAIR data.

The country, famous for its bicycle friendliness, also made a suitable stop for BiCIKL (an acronym for the Biodiversity Community Integrated Knowledge Library): a project funded under the European Commission’s Horizon 2020 programme that aimed to trigger a culture change in the way users access, (re)use, publish and share biodiversity data. To do this, the BiCIKL consortium set off on a 3-year journey to build on the existing biodiversity data infrastructures, workflows, standards and the linkages between them.

Many of the people who have been involved in the project over the last three years could be seen all around the beautiful venue. After all, Naturalis is itself one of the partnering institutions at BiCIKL. Then, on Tuesday, on behalf of the BiCIKL consortium and the project’s coordinator, the scientific publisher and technology innovator Pensoft, Iva Boyadzhieva presented the work done within the project one month ahead of its official conclusion at the end of April.

As she talked about how the BiCIKL consortium set out to overcome obstacles to wider use and adoption of FAIR and linked biodiversity data, she focused on BiCIKL’s main outcome: the Biodiversity Knowledge Hub (BKH).

Key results from the BiCIKL project three years into its existence presented by Pensoft’s Iva Boyadzhieva at the EBR III conference.

Intended to act as a knowledge broker for users who wish to navigate and access sources of open and FAIR biodiversity data, guidelines, tools and services, the BKH is, in practice, a one-stop portal for understanding the complex but increasingly interconnected landscape of biodiversity research infrastructures in Europe and beyond. It collates information, guidelines, recommendations and best practices in the use of FAIR and linked biodiversity data, as well as a continuously expanding catalogue of relevant, compliant services and tools.

At the core of the BKH is the FAIR Data Place (FDP), where users can familiarise themselves with each of the participating biodiversity infrastructures and network organisations, and also learn about the specific services they provide. There, anyone can explore various biodiversity data tools and services by browsing by their main data type, e.g. specimens, sequences, taxon names, literature.

While the project might be coming to an end, she pointed out, the BKH is here to stay as a navigation system in a universe of interconnected biodiversity research infrastructures.

To this end, not only will the partners continue to maintain it, but it will also remain open to any research infrastructure that wishes to feature its own tools and services compliant with the linked and FAIR data requirements set by the BiCIKL consortium.

On the event’s website you can access the BiCIKL slides, as presented at the EBR III conference.

What else was on at the EBR III?

Indisputably, the ‘hot’ topics at the EBR III were the novel technologies for remote and non-invasive, yet efficient biomonitoring; the utilisation of data and other input sourced by citizen scientists; as well as leveraging different types and sources of biodiversity data, in order to better inform decision-makers, but also future-proof the scientific knowledge we have collected and generated to date.

Project coordinator Dr Quentin Groom presents B-Cubed’s approach towards standardised access to biodiversity data for use in policy-making at the EBR III conference.

Amongst the other Horizon Europe projects presented at the EBR III conference was B-Cubed (Biodiversity Building Blocks for policy). On Monday, the project’s coordinator Dr Quentin Groom (Meise Botanic Garden) familiarised the conference participants with the project, which aims to standardise access to biodiversity data, in order to empower policymakers to proactively address the impacts of biodiversity change.

You can find more about B-Cubed and Pensoft’s role in it in this blog post.

On the event’s website you can access the B-Cubed slides, as presented at the EBR III conference.

***

Dr France Gerard (UK Centre for Ecology & Hydrology) talks about the challenges in using raw data – including those provided by drones – to derive habitat condition metrics.

MAMBO, another Horizon Europe project where Pensoft has been contributing expertise in science communication, dissemination and exploitation, was also an active participant at the event. An acronym for Modern Approaches to the Monitoring of BiOdiversity, MAMBO had its own session on Tuesday morning, where Dr Vincent Kalkman (Naturalis Biodiversity Center), Dr France Gerard (UK Centre for Ecology & Hydrology) and Prof. Toke Høye (Aarhus University) each took to the stage to demonstrate how modern technology developed within the project is set to improve biodiversity and habitat monitoring. Learn more about MAMBO and Pensoft’s involvement in this blog post.

MAMBO’s project coordinator Prof. Toke T. Høye talked about smarter technologies for biodiversity monitoring, including camera traps able to count insects at a particular site.

On the event’s website you can access the MAMBO slides by Kalkman, Gerard and Høye, as presented at the EBR III conference.

***

The EBR III conference also saw a presentation – albeit remote – from Prof. Dr. Florian Leese (Dean at the University of Duisburg-Essen, Germany, and Editor-in-Chief at the Metabarcoding and Metagenomics journal), where he talked about the promise, but also the challenges for DNA-based methods to empower biodiversity monitoring. 

Amongst the key tasks here, he pointed out, are the alignment of DNA-based methods with the Global Biodiversity Framework; central push and funding for standards and guidance; publication of data in portals that adhere to the best data practices and rules; and the mobilisation of existing resources such as the meteorological ones. 

Prof. Dr. Florian Leese talked about the promise, but also the challenges for DNA-based methods to empower biodiversity monitoring. He also referred to the 2022 Forum Paper: “Introducing guidelines for publishing DNA-derived occurrence data through biodiversity data platforms” by R. Henrik Nilsson et al.

He also made a reference to the Forum Paper “Introducing guidelines for publishing DNA-derived occurrence data through biodiversity data platforms” by R. Henrik Nilsson et al., where the international team provided a brief rationale and an overview of guidelines targeting the principles and approaches of exposing DNA-derived occurrence data in the context of broader biodiversity data. In the study, published in the Metabarcoding and Metagenomics journal in 2022, they also introduced a living version of these guidelines, which continues to encourage feedback and interaction as new techniques and best practices emerge.

***

You can find the programme on the conference website and see highlights on the conference hashtag: #EBR2024.

Don’t forget to also explore the Biodiversity Knowledge Hub for yourself at: https://biodiversityknowledgehub.eu/ 

Digitising UK Natural History Collections is vital to understand life on Earth, reports the Natural History Museum

In a paper published in the journal Research Ideas and Outcomes, authors estimate £18 million has been saved in efficiencies by researchers accessing digital specimens rather than physical collections.

· Scientists from the Natural History Museum (NHM) deep-dive into the uses and users of natural history collections held in the UK

· Modest estimates report a saving of £18 million in efficiencies by researchers accessing digital data rather than physical collections

· Today, software can complete in a week what it would take a human two years to achieve

· Call for investment to secure the UK’s stance as a world superpower in science and tech, and for a future in which both people and planet thrive

A new report has evaluated the use and impact of digitised natural science collections held in the UK and how they contribute to scientific, commercial and societal benefits.

UK natural science collections hold more than 137 million items spanning an incredible 4.56-billion-year history of life on Earth. These collections have emerged as a pivotal data resource for understanding the Earth in its past and current state – and will continue to inform the investors and policy-makers of the future.

UK natural science data in demand

GBIF, the Global Biodiversity Information Facility, is an international database providing open access data on all types of life on Earth. In this paper led by the NHM, scientists report that 7.6 million specimens, less than 6% of the total UK natural science collections sampled, are freely accessible on GBIF.

They found that 12% of the total peer-reviewed journal articles citing GBIF data specifically cite UK natural science collections. These data currently make up just 0.3% of total occurrences on GBIF, meaning they punch an incredible 40 times above their weight.
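
The “40 times above their weight” figure follows directly from the two percentages quoted above. As a quick check, here is a minimal Python sketch; the function and variable names are ours, chosen for illustration, and are not taken from the paper.

```python
def representation_ratio(citation_share_pct: float, occurrence_share_pct: float) -> float:
    """How many times more often a data source is cited than its share of records would suggest."""
    return citation_share_pct / occurrence_share_pct


# Figures quoted in the text: UK collections are cited in 12% of the
# peer-reviewed articles citing GBIF data, yet account for 0.3% of occurrences.
ratio = representation_ratio(12.0, 0.3)
print(f"UK collections are cited about {ratio:.0f} times more than their share of records")
```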

When asked previously, over 90% of GBIF users linked their use of these data to advancing the UN Sustainable Development Goals which look to reduce hunger, poverty and inequality, and spur economic growth while tackling climate change and protecting the oceans and forests.

The case for digitising UK natural science collections

The introduction of these collections onto a digital platform has revolutionised scientific research. In this paper published in the journal Research Ideas and Outcomes, the authors estimate £18 million has been saved in efficiencies by researchers accessing digital specimens rather than physical collections, assuming a minimal single physical visit replaced per citation. Of this, £1.4 million has been attributed to UK researchers, money which can be reinvested back into UK science institutions – those at the forefront of finding solutions to real world problems.

Lead author and Deputy Head of Digital, Data and Informatics, Helen Hardy says, ‘The advancement of digitisation has been truly transformational to the scientific community. Today it’s possible to use software that takes a week to achieve the type of information gathering it would take a human over 3,000 hours, or two years, to complete – individuals realising an entire life’s work in just a few months! Anticipation is high for further innovations such as the further integration of artificial intelligence into taxonomic work.’

The UK government wants the UK to be a science and technology superpower, and natural science collections provide a unique opportunity to achieve this. To unlock the true potential of collections data, UK natural science collections are joining forces through the Distributed System of Scientific Collections UK (DiSSCo UK) to make the case for investment of £155 million in a research infrastructure which is expected to unlock at least a seven- to ten-fold economic return on investment. As the partners work alongside the Arts & Humanities Research Council (AHRC) and UK Research and Innovation (UKRI) to digitise the critical mass of collections, the data will be made available through a robust technological infrastructure and continually developed in line with recent innovations.

Ken Norris, Deputy Director of Science at the NHM says, ‘In the midst of a planetary emergency, and what some experts believe to be the Earth’s sixth mass extinction event, estimates say that over 50% of the world’s GDP, which equates to approx. 44 trillion dollars, is dependent on the natural world. By understanding what is in collections now, both on a national and international scale, we can identify trends, necessary actions, and what we need to collect to underpin policy and investment decisions for a future where people and planet thrive.’

Hardy H, Livermore L, Kersey P, Norris K, Smith V, Pullar J (2023) Users and uses of UK Natural History Collections – a Summary, https://doi.org/10.5281/zenodo.8403318

A longer paper on this study including further detail on the methodology and findings is also available:

Hardy H, Livermore L, Kersey P, Norris K, Smith V (2023) Understanding the users and uses of UK Natural History Collections. Research Ideas and Outcomes 9: e113378. https://doi.org/10.3897/rio.9.e113378

Photo credit: Trustees of the Natural History Museum

Follow Research Ideas and Outcomes on Facebook, Twitter, and LinkedIn.

Senckenberg Society for Nature Research’s Director General Prof. Klement Tockner visits the National Museum of Natural History and Pensoft

An important discussion point was the performance of the four Senckenberg journals, which moved to Pensoft’s publishing platform a few years ago. On the agenda was also the opportunity for an Open Access agreement.

Prof. Klement Tockner, Director general of the Senckenberg Society for Nature Research (centre) with Pensoft’s founder and CEO Prof. Lyubomir Penev (right) and Prof. Pavel Stoev, Director of the National Museum of Natural History (Bulgaria) and COO at Pensoft (left).

On 2 June 2023, we welcomed Prof. Klement Tockner, Director general of the Senckenberg Society for Nature Research, who travelled to Bulgaria to meet with Pensoft’s and the National Museum of Natural History’s (NMNHS) senior management to discuss current and future collaborations. 

The visit took place in the NMNHS, where Tockner had fruitful discussions with Pensoft’s founder and CEO Prof. Lyubomir Penev and Prof. Pavel Stoev, Director of the Museum and COO at Pensoft.

An important point in the discussion was the performance of the four scientific journals owned by the Society, which moved to Pensoft’s publishing platform ARPHA a couple of years ago, marking the beginning of a fruitful and highly promising partnership.

On the agenda was also the opportunity for an Open Access agreement to be signed between the Society and the publisher, in order to support researchers who wish to publish in any Pensoft journal. 

Tockner was also curious to learn more about the additional publishing services, provided by Pensoft via the ARPHA platform, including the various and continuously elaborated data publishing workflows, and the opportunities to streamline the description of new marine species, identified from DNA material.

In early 2021, the Senckenberg Society for Nature Research signed with the publisher to move three of its legacy titles from the natural sciences domain: Arthropod Systematics & Phylogeny, Geologica Saxonica and Vertebrate Zoology.

Later the same year, in November, the journal Contributions to Entomology followed suit. All four of them went for the white-label publishing solution available from ARPHA, designed to preserve the exclusive identity of historical journals.

The partners also talked about further extending the collaboration between Senckenberg and Pensoft to European Commission-funded scientific projects. Tockner was particularly fascinated by the progress made by the ongoing project Biodiversity Community Integrated Knowledge Library (BiCIKL), coordinated by Pensoft and involving 14 European institutions from ten countries. Additionally, over the past 20 years, Pensoft has partnered in over 50 different consortia as a publisher, science communicator and technology provider.

Stoev (right) showing Tockner (left) around the collections of the National Museum of Natural History (Sofia, Bulgaria).

In his role as Director of the NMNHS, Stoev used the occasion to show Tockner around the NMNHS collections and tell him more about the Museum’s latest achievements and projects, as well as its traditions in the fields of human evolution research and paleornithology.

Stoev (left) tells Tockner (right) more about the recently launched Bulgarian national unit of DiSSCo.

The two also engaged in a lively discussion about the poorly studied biodiversity in Bulgaria and the region, but also about the recent efforts of the NMNHS team, including the launch of a Bulgarian national unit of DiSSCo tasked with digitising a large proportion of the institution’s collection in the next three years. Tockner and Stoev also talked about the need for additional networking activities and closer collaborations between smaller natural history museums across Europe that could be mediated through the Consortium of European Taxonomic Facilities (CETAF), where Senckenberg is an active member.

***

Follow ARPHA Platform on Twitter and LinkedIn for further updates.

BiCIKL keeps adding project outcomes to its own collection in RIO Journal

The publications so far include the grant proposal, conference abstracts, a workshop report, guidelines papers and deliverables submitted to the Commission.

The dynamic open-science project collection of BiCIKL, titled “Towards interlinked FAIR biodiversity knowledge: The BiCIKL perspective” (doi: 10.3897/rio.coll.105), continues to grow as the project progresses into its third year and its results accumulate at an ever faster pace.

Following the publication of three important BiCIKL deliverables: the project’s Data Management Plan, its Visual identity package and a report, describing the newly built workflow and tools for data extraction, conversion and indexing and the user applications from OpenBiodiv, there are currently 30 research outcomes in the BiCIKL collection that have been shared publicly to the world, rather than merely submitted to the European Commission.

Shortly after the BiCIKL project started in 2021, a project-branded collection was launched in the open-science scholarly journal Research Ideas and Outcomes (RIO). There, the partners have been publishing – and thus preserving – conclusive research papers, as well as early and interim scientific outputs.

The publications so far also include the BiCIKL grant proposal, which earned the support of the European Commission in 2021; conference abstracts, submitted by the partners to two consecutive TDWG conferences; a project report that summarises recommendations on interoperability among infrastructures, as concluded from a hackathon organised by BiCIKL; and two Guidelines papers, aiming to trigger a culture change in the way data is shared, used and reused in the biodiversity field. 

In fact, one of the Guidelines papers, where representatives of the Consortium of European Taxonomic Facilities (CETAF), the Society for the Preservation of Natural History Collections (SPNHC) and the Biodiversity Heritage Library (BHL) came together to publish their joint statement on best practices for the citation of authorities of scientific names, has so far generated about 4,000 views by nearly 3,000 unique readers.

At the time of writing, the grant proposal and the second Guidelines paper complete the top three most read papers in the BiCIKL collection. In the latter, the partners – drawing on their extensive and versatile experience – present recommendations about the use of annotations and persistent identifiers in taxonomy and biodiversity publishing.

Access to data and services along the entire data and research life cycle in biodiversity science.
The figure was featured in the BiCIKL grant proposal, now made available from the BiCIKL project collection in RIO Journal.

What one might find quite odd when browsing the BiCIKL collection is that each publication is marked with its own publication source, even though all contributions are clearly already accessible from RIO Journal.

So, we can see many project outputs marked as RIO publications, but also others that have been published in the likes of F1000Research, the official journal of TDWG: Biodiversity Information Science and Standards, and even preprint servers, such as BiohackrXiv.

This is because one of the unique features of RIO allows for consortia to use their project collection as a one-stop access point for all scientific results, regardless of their publication venue, by means of linking to the original source via metadata. Additionally, projects may also upload their documents in their original format and layout, thanks to the integration between RIO and ARPHA Preprints. This is in fact how BiCIKL chose to share their latest deliverables using the very same files they submitted to the Commission.

“In line with the mission of BiCIKL and our consortium’s dedication to FAIRness in science, we wanted to keep our project’s progress and results fully transparent and easily accessible and reusable to anyone, anywhere,” 

explains Prof Lyubomir Penev, BiCIKL’s Project Coordinator and founder and CEO of Pensoft. 

“This is why we opted to collate the outcomes of BiCIKL in one place – starting from the grant proposal itself, and then progressively adding workshop reports, recommendations, research papers and what not. By the time BiCIKL concludes, not only will we be ready to refer back to any step along the way that we have just walked together, but also rest assured that what we have achieved and learnt remains at the fingertips of those we have done it for and those who come after them,” he adds.

***

You can keep tabs on the BiCIKL project collection in RIO Journal by subscribing to the journal newsletter or following @RIOJournal on Twitter and Facebook.

‘Who is in your database and why does it matter?’

The uncertainty about a person’s identity hampers research, hinders the discovery of expertise, and obstructs the ability to give attribution or credit for work performed. 

Collection discovery through disambiguation

Guest blog post by Sabine von Mering, Heather Rogers, Siobhan Leachman, David P. Shorthouse, Deborah Paul & Quentin Groom

Worldwide, natural history institutions house billions of physical objects in their collections. They create and maintain data about these items, and they share their data with aggregators such as the Global Biodiversity Information Facility (GBIF), the Integrated Digitized Biocollections (iDigBio), the Atlas of Living Australia (ALA), GenBank and the European Nucleotide Archive (ENA).

Even though these data often include the names of the people who collected or identified each object, such statements may be ambiguous, as the names frequently lack any globally unique, machine-readable concept of their shared identity.

Despite the data being available online, barriers remain to the effective use of information about who collected the objects or provided the expertise to identify them. People have similar names, change their name over the course of their lifetime (e.g. through marriage), or there may be variability introduced through the label transcription process itself (e.g. local look-up lists).

As a result, researchers and collections staff often spend a lot of time deducing who is the person or people behind unknown collector strings while collating or tidying natural history data. The uncertainty about a person’s identity hampers research, hinders the discovery of expertise, and obstructs the ability to give attribution or credit for work performed. 

Disambiguation activities – the act of turning strings into verifiable things using all available evidence – need not be done in isolation. In addition to presenting a workflow on how to disambiguate people in collections, we also make the case that working in collaboration with colleagues and the general public presents new opportunities and introduces new efficiencies. There is tacit knowledge everywhere.
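
To give a feel for the “strings” side of the problem, here is a minimal Python sketch that groups variant spellings of a collector’s name under a crude comparison key, so that a person can review them together. It is an illustration only, not the workflow from our Methods paper: the variant strings and the key (surname plus first initial, accents stripped) are assumptions made for this example, and a shared key is only a hint, since different people can have the same name.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(name: str) -> str:
    """Reduce a name string to a crude key: accent-free surname plus first initial."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    if "," in text:
        # "Surname, Given names" form.
        surname, given = (part.strip() for part in text.split(",", 1))
    else:
        # "Given names Surname" form; split on whitespace and full stops.
        parts = [p for p in re.split(r"[\s.]+", text) if p]
        if not parts:
            return ""
        surname, given = parts[-1], " ".join(parts[:-1])
    return f"{surname} {given[:1]}".strip()


def group_variants(strings):
    """Group name strings that share the same key as candidates for human review."""
    groups = defaultdict(list)
    for s in strings:
        groups[normalise(s)].append(s)
    return dict(groups)


# Hypothetical label transcriptions, made up for this example.
labels = ["Soulié, Jean-André", "J. A. Soulié", "Jean-André Soulié", "Cockerell, W. P."]
for key, variants in group_variants(labels).items():
    print(key, "->", variants)
```

Each group is a list of candidates for a curator to confirm or split; the verifiable “thing” only exists once a person has attached an identifier with evidence.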

More often than not, data about people involved in biodiversity research are scattered across different digital platforms. However, by linking information sources to each other using person identifiers, we can better trace the connections in these networks, so that we can weave a more interoperable narrative about every actor.

That said, inconsistent naming conventions or lack of adequate accreditation often frustrate the realization of this vision. This sliver of natural history could be churned to gold with modest improvements in long-term funding for human resources, adjustments to digital infrastructure, space for the physical objects themselves alongside their associated documents, and sufficient training on how to disambiguate people’s names.

“He aha te mea nui o te ao. He tāngata, he tāngata, he tāngata.”

“What is the most important thing in the world? It is people, it is people, it is people.”

(Māori proverb)

The process of properly disambiguating those who have contributed to natural history collections takes time. 

The disambiguation process involves the extra challenge of trying to deduce “who is who” for legacy data, compared to undertaking this activity for people alive today. Retrospective disambiguation can require considerable detective work, especially for scarcely known people or if the community has a different naming convention. Provided the results of this effort are well-communicated and openly shared, mercifully, it need only be done once.

At the core of our research is the question of how to solve the issue of assigning proper credit.

In our recent Methods paper, we discuss several methods for this, as well as available routes for making records available online that include not only the names of people expressed as text, but also their unique, resolvable identifiers.

Disambiguation is a cycle. Enrichment of the data feeds off itself leading to further disambiguation. As more names are disambiguated and more biographical data are accumulated, it becomes easier to disambiguate more names. 

First and foremost, we should maintain our own public biographical data by making full use of ORCID. In addition to preserving our own scientific legacy and that of the institutions that employ us, we have a responsibility to avoid generating unnecessary disambiguation work for others. 

For legacy data, where the people connected to the collections are deceased, Wikidata can be used to openly document rich bibliographic and demographic data, each statement with one or more verifiable references. Wikidata can also act as a bridge to link other sources of authority such as VIAF or ORCID identifiers. It has many tools and services to bulk import, export, and to query information, making it well-suited as a universal democratiser of information about people often walled-off in collection management systems (CMS). 

A network of the top twenty most used identifiers for biologists on Wikidata.

Once unique identifiers for people are integrated in collection management systems, these may be shared with the global collections and research community using the new Darwin Core terms recordedByID and identifiedByID, along with the well-known, yet text-based, terms recordedBy and identifiedBy.
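
As an illustration of how the text-based and identifier-based terms sit side by side, here is a minimal Python sketch of a single occurrence record. Everything in it is hypothetical: the occurrenceID and the names are invented, the ORCID iD is the test identifier ORCID publishes for the fictitious researcher Josiah Carberry, and we assume multiple values are separated by a pipe character. The helper simply counts how many names in a record still lack an identifier.

```python
# Hypothetical occurrence record using Darwin Core term names as keys.
record = {
    "occurrenceID": "urn:example:specimen:0001",  # invented for illustration
    "recordedBy": "J. Carberry | A. N. Other",
    "recordedByID": "https://orcid.org/0000-0002-1825-0097",  # ORCID's test iD
    "identifiedBy": "J. Carberry",
    "identifiedByID": "https://orcid.org/0000-0002-1825-0097",
}


def split_values(value: str) -> list:
    """Split a pipe-delimited value into a list of trimmed, non-empty items."""
    return [part.strip() for part in value.split("|") if part.strip()]


def names_without_identifiers(rec: dict, name_term: str, id_term: str) -> int:
    """Count how many names in name_term have no counterpart in id_term."""
    names = split_values(rec.get(name_term, ""))
    ids = split_values(rec.get(id_term, ""))
    return max(len(names) - len(ids), 0)


print(names_without_identifiers(record, "recordedBy", "recordedByID"))
print(names_without_identifiers(record, "identifiedBy", "identifiedByID"))
```

A count like this is only a rough gauge of how much disambiguation work remains in a dataset; it does not check that each identifier actually belongs to the name beside it.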

Approximately 120 datasets published through GBIF now make use of these identifier-based terms, which are additionally resolved in Bionomia every few weeks alongside co-curated attributions newly made there. This roundtrip of data – emerging as ambiguous strings of text from the source, affixed with resolvable identifiers elsewhere, absorbed into the source as new digital annotations, and then re-emerging with these fresh, identifier-based enhancements – is an exciting approach to co-manage collections data.

Round tripping. In Bionomia, people identifiers from Wikidata and ORCID are used to enrich data published via GBIF, thus linking natural history specimens to the world’s collectors.

Disambiguation work is particularly important in recognising contributors who have been historically marginalized. For example, gender bias in specimen data can be seen in the case of Wilmatte Porter Cockerell, a prolific collector of botanical, entomological and fossil specimens. Cockerell’s collections are often attributed to her husband as he was also a prolific collector and the two frequently collected together. 

On some labels, her identity is further obscured as she is simply recorded as “& wife” (see example on GBIF). Since Wilmatte Cockerell was her husband’s second wife, it can take some effort to confirm if a specimen can be attributed to her and not her husband’s first wife, who was also involved in collecting specimens. By ensuring that Cockerell is disambiguated and her contributions are appropriately attributed, the impact of her work becomes more visible, enabling her work to be properly and fairly credited.

Thus, disambiguation work not only helps to give credit where credit is due, thereby making data about people and their biodiversity collections more findable, but also creates an inclusive and representative narrative of the landscape of people involved with scientific knowledge creation, identification, and preservation.

A future – once thought to be a dream – where the complete scientific output of a person is connected as Linked Open Data (LOD) is now

Both the tools and infrastructure are at our disposal and the demand is palpable. All institutions can contribute to this movement by sharing data that include unique identifiers for the people in their collections. We recommend that institutions develop a strategy, perhaps starting with employees and curatorial staff, people of local significance, or those who have been marginalized, and additionally capitalize on existing disambiguation activities elsewhere. This will have local utility and will make a significant, long-term impact.

The more we participate in these activities, the greater the chance that we will uncover positive feedback loops, which will act to lighten the workload for all involved, including our future selves!

The disambiguation of people in collections is an ongoing process, but it becomes easier with practice. We also encourage collections staff to consider modifying their existing workflows and policies to include identifiers for people at the outset, when new data are generated or when new specimens are acquired. 

There is more work required at the global level to define, update, and ratify standards and best practices to help accelerate data exchange or roundtrips of this information; there is room for all contributions. Thankfully, there is a diverse, welcoming, energetic, and international community involved in these activities. 

We see a bright future for you, our collections, and our research products – well within reach – when the identities of people play a pivotal role in the construction of a knowledge graph of life.

Would you like to participate, or do you need support getting the disambiguation of your collection started? Please contact our TDWG People in Biodiversity Data Task Group.

A good start is also to check Bionomia to find out what metrics exist now for your institution or collection and affiliated people.

The next steps for collections: 7 objectives that can help to disambiguate your institution’s collection:

1. Promote the use of person identifiers in local, national or international outreach, publishing and research activities

2. Increase the number of collection management systems that use person identifiers

3. Increase the number of living collectors registered and using an ORCID identifier when contributing to collections

4. Undertake disambiguation in the national languages of many countries

5. Increase the number of identified people on Wikidata linked to collections

6. Increase the number of people in collections with expertise in person disambiguation

7. Collaborate towards an exchange standard for attribution data
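
As a small illustration of what including person identifiers at the outset can involve, the sketch below checks that an ORCID iD is well formed before it is stored with a specimen record. ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 algorithm; the function names here are hypothetical and not part of any collection management system.

```python
import re


def orcid_check_char(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the 0000-0000-0000-000X layout and the final check character."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_char(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
```

A check like this only catches typing errors. It does not confirm that the iD belongs to the collector in question, which remains the actual disambiguation work.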

A real example of how a name string is disambiguated and the steps taken in documenting it. Wikidata item of Jean-André Soulié

***

Methods publication:

Groom Q, Bräuchler C, Cubey RWN, Dillen M, Huybrechts P, Kearney N, Klazenga N, Leachman S, Paul DL, Rogers H, Santos J, Shorthouse DP, Vaughan A, von Mering S, Haston EM (2022) The disambiguation of people names in biological collections. Biodiversity Data Journal 10: e86089. https://doi.org/10.3897/BDJ.10.e86089

***

Follow Biodiversity Data Journal on Twitter and Facebook.

Image recognition to the rescue of natural history museums by enabling curators to identify specimens on the fly

A new Research Idea, published in RIO Journal, presents a promising machine-learning ecosystem to unite experts around the world and make up for the lack of taxonomic expertise.

In their Research Idea, published in Research Ideas and Outcomes (RIO Journal), a Swiss-Dutch research team presents a promising machine-learning ecosystem to unite experts around the world and make up for the lack of expert staff.

Guest blog post by Luc Willemse, Senior collection manager at Naturalis Biodiversity Centre (Leiden, Netherlands)

Imagine the workday of a curator in a national natural history museum. Having spent several decades learning about a specific subgroup of grasshoppers, that person is now busy working on the identification and organisation of the holdings of the institution. To do this, the curator needs to study in detail a huge number of undescribed grasshoppers collected from all sorts of habitats around the world. 

The problem here, however, is that a curator at a smaller natural history institution is usually responsible for all insects kept at the museum, ranging from butterflies to beetles, flies and so on. In total, we know of around 1 million described insect species worldwide. Meanwhile, another 3,000 are being added each year, while many more are redescribed as a result of further study and new discoveries. Becoming a specialist for grasshoppers was already a laborious activity that took decades. How about knowing all insects of the world? That’s simply impossible.

Then, how could we expect one person to sort and update all collections at a museum: an activity that is the cornerstone of biodiversity research? A part of the solution, hiring and training additional staff, is costly and time-consuming, especially when we know that experts on certain species groups are already scarce on a global scale.

We believe that automated image recognition holds the key to reliable and sustainable practices at natural history institutions.

Today, image recognition tools integrated in mobile apps are already being used even by citizen scientists to identify plants and animals in the field. Based on an image taken by a smartphone, those tools identify specimens on the fly and estimate the accuracy of their results. What’s more, those identifications have proven to be almost as accurate as those done by humans. This gives us hope that we could help curators at museums worldwide take better and more timely care of the collections they are responsible for.

However, specimen identification at natural history institutions is still a much more complex task than the one handled by the tools used in the field. After all, the information these institutions store and should be able to provide is meant to serve as a knowledge hub for educational and reference purposes for present and future generations of researchers around the globe.

This is why we propose a sustainable system where images, knowledge, trained recognition models and tools are exchanged between institutes, and where international collaboration between museums of all sizes is crucial. The aim is to have a system that will benefit the entire community of natural history collections in providing further access to their invaluable collections.

We propose four elements to this system: 

  1. A central library of already trained image recognition models (algorithms) needs to be created. It will be openly accessible, so any other institute can profit from models trained by others.
Mock-up of a Central Library of Algorithms.
  2. A central library of datasets accessing images of collection specimens that have recently been identified by experts. This will provide an indispensable source of images for training new algorithms.
Mock-up of a Central Library of Datasets.
  3. A digital workbench that provides an easy-to-use interface for inexperienced users to customise the algorithms and datasets to the particular needs in their own collections.
  4. As the entire system depends on international collaboration as well as sharing of algorithms and datasets, a user forum is essential to discuss issues, coordinate, evaluate, test or implement novel technologies.

How would this work on a daily basis for curators? We provide two examples of use cases.

First, let’s zoom in on a case where a curator needs to identify a box of insects, for example bush crickets, to a lower taxonomic level. Here, he/she would take an image of the box and split it into segments of individual specimens. Then, image recognition will identify the bush crickets to a lower taxonomic level. The result, which we present in the table below, will be used to update object-level registration or to physically rearrange specimens into more accurate boxes. This entire step can also be done by non-specialist staff.

Mock-up of box with grasshoppers mentioned in the above table

Results of automated image recognition identify specimens to a lower taxonomic level.

Another example is to incorporate image recognition tools into digitisation processes that include imaging specimens. In this case, image recognition tools can be used on the fly to check or confirm the identifications and thus improve data quality.

Mock-up of an interface for automated taxon identification. 
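
The two use cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the box image is taken to be already split into per-specimen crops, the classifier is a stand-in that returns canned answers, and all names (`identify_box`, `check_identification`, `stub_classifier`, the file names and taxon labels) are hypothetical rather than taken from the Research Idea.

```python
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[str, float]          # (suggested taxon, confidence between 0 and 1)
Classifier = Callable[[str], Prediction]


def identify_box(segments: List[str], classify: Classifier,
                 min_confidence: float = 0.8) -> List[Dict[str, object]]:
    """Use case 1: suggest a lower-level taxon for every specimen crop in a box."""
    results = []
    for segment in segments:
        taxon, confidence = classify(segment)
        results.append({
            "segment": segment,
            "suggested_taxon": taxon,
            "confidence": confidence,
            # Low-confidence suggestions are set aside for a human expert.
            "needs_review": confidence < min_confidence,
        })
    return results


def check_identification(registered_taxon: str, segment: str, classify: Classifier,
                         min_confidence: float = 0.8) -> str:
    """Use case 2: compare an existing identification with the model's suggestion."""
    taxon, confidence = classify(segment)
    if confidence < min_confidence:
        return "uncertain"
    return "confirmed" if taxon == registered_taxon else "conflict"


def stub_classifier(segment: str) -> Prediction:
    """Stand-in for a trained model from a shared library (canned answers only)."""
    canned = {"crop_01.jpg": ("Taxon A", 0.95), "crop_02.jpg": ("Taxon B", 0.55)}
    return canned.get(segment, ("unknown", 0.0))


for row in identify_box(["crop_01.jpg", "crop_02.jpg"], stub_classifier):
    print(row)
print(check_identification("Taxon A", "crop_01.jpg", stub_classifier))
```

The confidence threshold is a design choice in this sketch: it decides how much work is handled automatically and how much is passed back to an expert.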

Using image recognition tools to identify specimens in museum collections is likely to become common practice in the future. It is a technical tool that will enable the community to share available taxonomic expertise. 

Using image recognition tools creates the possibility to identify species groups for which there is very limited or no in-house expertise. Such practices would substantially reduce costs and time spent per treated item.

Image recognition applications carry metadata like version numbers and/or datasets used for training. As a result, such an approach would make identification more transparent than identification carried out by humans, whose expertise is, by design, in no way standardised or transparent.
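
To make the point about metadata concrete, here is a minimal sketch of how an automated identification could be stored together with the model version and training datasets behind it. The field names and example values are assumptions made for illustration, not a schema proposed in the Research Idea.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AutomatedIdentification:
    specimen_id: str
    taxon: str
    confidence: float
    model_name: str                      # hypothetical model identifier
    model_version: str
    training_datasets: Tuple[str, ...]   # datasets the model was trained on


def provenance_note(ident: AutomatedIdentification) -> str:
    """Summarise which model, version and datasets produced an identification."""
    datasets = ", ".join(ident.training_datasets)
    return (f"{ident.taxon} (confidence {ident.confidence:.2f}) by "
            f"{ident.model_name} v{ident.model_version}, trained on: {datasets}")


example = AutomatedIdentification(
    specimen_id="SPECIMEN-0001",
    taxon="Taxon A",
    confidence=0.93,
    model_name="example-model",
    model_version="1.2.0",
    training_datasets=("dataset-a", "dataset-b"),
)
print(provenance_note(example))
```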

*

Follow RIO Journal on Twitter and Facebook.

*

Research publication:

Greeff M, Caspers M, Kalkman V, Willemse L, Sunderland BD, Bánki O, Hogeweg L (2022) Sharing taxonomic expertise between natural history collections using image recognition. Research Ideas and Outcomes 8: e79187. https://doi.org/10.3897/rio.8.e79187

Digitising the Natural History Museum London’s entire collection could contribute over £2 billion to the global economy

In a world first, the Natural History Museum, London, has collaborated with economic consultants, Frontier Economics Ltd, to explore the economic and societal value of digitising natural history collections and concluded that digitisation has the potential to see a seven to tenfold return on investment. Whilst significant progress is already being made at the Museum, additional investment is needed in order to unlock the full potential of the Museum’s vast collections – more than 80 million objects. The project’s report is published in the open-science journal Research Ideas and Outcomes (RIO Journal).

One of the Museum’s digitisers imaging a butterfly to join the 4.93 million specimens already available online. 
© The Trustees of the Natural History Museum, London

The societal benefits of digitising natural history collections extend to global advancements in food security, biodiversity conservation, medicine discovery, minerals exploration, and beyond. A brand new, rigorous economic report predicts that investing in digitising natural history museum collections could also result in a tenfold return. The Natural History Museum, London, has so far made over 4.9 million digitised specimens freely available online, and over 28 billion records have been downloaded in over 429,000 download events over the past six years.

Digitisation at the Natural History Museum, London 

Digitisation is the process of creating and sharing the data associated with Museum specimens. To digitise a specimen, all its related information is added to an online database. This typically includes where and when it was collected and who found it, and can include photographs, scans and other molecular data if available. Natural history collections are a unique record of biodiversity dating back hundreds of years, and geodiversity dating back millennia. Creating and sharing data this way enables science that would have otherwise been impossible, and accelerates the rate at which important discoveries are made from our collections.

The Natural History Museum’s collection of 80 million items is one of the largest and most historically and geographically diverse in the world. By unlocking the collection online, the Museum provides free and open access for global researchers, scientists, artists and more. Since 2015, the Museum has made 4.9 million specimens available on the Museum’s Data Portal, which have seen more than 28 billion downloads over 427,000 download events. 

This means the Museum has digitised about 6% of its collections to date. Because digitisation is expensive, costing tens of millions of pounds, it is difficult to make a case for further investment without better understanding the value of this digitisation and its benefits.
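
The 6% figure follows directly from the two numbers quoted above (4.9 million digitised specimens out of a collection of 80 million items). A quick check, using only those rounded figures:

```python
digitised_specimens = 4_900_000   # specimens available on the Data Portal
collection_size = 80_000_000      # items in the Museum's collection

share_digitised = digitised_specimens / collection_size
not_yet_digitised = collection_size - digitised_specimens

print(f"Share digitised: {share_digitised:.1%}")
print(f"Items still to digitise: {not_yet_digitised:,}")
```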

In 2021, the Museum decided to explore the economic impacts of collections data in more depth, and commissioned Frontier Economics to undertake modelling, resulting in this project report, now made publicly available in the open-science journal Research Ideas and Outcomes (RIO Journal), and confirming benefits in excess of £2 billion over 30 years. While the methods in this report are relevant to collections globally, this modelling focuses on benefits to the UK, and is intended to support the Museum’s own digitisation work, as well as a current scoping study funded by the Arts & Humanities Research Council about the case for digitising all UK natural science collections as a research infrastructure.

“Sharing data from our collections can transform scientific research and help find solutions for nature and from nature. Our digitised collections have helped establish the baseline plant biodiversity in the Amazon, find wheat crops that are more resilient to climate change and support research into potential zoonotic origins of Covid-19. The research that comes from sharing our specimens has immense potential to transform our world and help both people and the planet thrive,”

says Helen Hardy, Science Digital Programme Manager at the Natural History Museum.

How does digitisation impact scientific research?

The data from museum collections accelerates scientific research, which in turn creates benefits for society and the economy across a wide range of sectors. Frontier Economics Ltd have looked at the impact of collections data in five of these sectors: biodiversity conservation, invasive species, medicines discovery, agricultural research and development and mineral exploration. 

“The Natural History Museum’s collection is a real treasure trove which, if made easily accessible to scientists all over the world through digitisation, has the potential to unlock ground-breaking research in any number of areas. Predicting exactly how the data will be used in future is clearly very uncertain. We have looked at the potential value that new research could create in just five areas focussing on a relatively narrow set of outcomes. We find that the value at stake is extremely large, running into billions,”

says Dan Popov, Economist at Frontier Economics Ltd.

The new analyses attempt to estimate the economic value of these benefits using a range of approaches, with the results in broad agreement that the benefits of digitisation are at least ten times greater than the costs. This represents a compelling case for investment in museum digital infrastructure without which the many benefits will not be realised.

“This new analysis shows that the data locked up in our collections has significant societal and economic value, but we need investment to help us release it,”

adds Professor Ken Norris, Head of the Life Sciences Department at the Natural History Museum.

Other benefits could include improvements to the resilience of agricultural crops by better understanding their wild relatives, research into invasive species which can cause significant damage to ecosystems and crops, and improving the accuracy of mining.  

Finally, there are other impacts that such work could have on how science itself is conducted. The very act of digitising specimens means that researchers anywhere on the planet can access these collections, saving time and money that would otherwise be spent on scientists travelling to see specific objects.

The value of research enabled by digitisation of natural history collections can be estimated by looking at specific areas where the Museum’s collections contribute towards scientific research and subsequently impact the wider economy. 
© Frontier Economics Ltd.

Original source: 

Popov D, Roychoudhury P, Hardy H, Livermore L, Norris K (2021) The Value of Digitising Natural History Collections. Research Ideas and Outcomes 7: e78844. https://doi.org/10.3897/rio.7.e78844