In a paper published in the journal Research Ideas and Outcomes, authors estimate £18 million has been saved in efficiencies by researchers accessing digital specimens rather than physical collections.
· Scientists from the Natural History Museum (NHM) deep-dive into the uses and users of natural history collections held in the UK
· Modest estimates report a saving of £18 million in efficiencies by researchers accessing digital data rather than physical collections
· Today, software can complete in a week what it would take a human two years to achieve
· Call for investment to secure the UK’s stance as a world superpower in science and tech, and for a future in which both people and planet thrive
A new report has evaluated the use and impact of digitised natural science collections held in the UK and how they contribute to scientific, commercial and societal benefits.
UK natural science collections hold more than 137 million items spanning an incredible 4.56-billion-year history of life on Earth. These collections have emerged as a pivotal data resource to understanding the Earth in its past and current state – and will continue to inform the investors and policy-makers of the future.
UK natural science data in demand
GBIF—the Global Biodiversity Information Facility—is an international database providing open access data on all types of life on Earth. In this paper led by the NHM, scientists report that there are 7.6 million specimens, less than 6% of total UK natural science collections sampled, freely accessible on GBIF.
They found that 12% of the total peer-reviewed journal articles citing GBIF data specifically cite UK natural science collections. These data currently make up just 0.3% of total occurrences on GBIF, meaning they punch an incredible 40 times above their weight.
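The two headline figures above follow from simple ratios. As a rough check, the sketch below recomputes them from the numbers quoted in the text. It assumes that the 137 million items mentioned earlier are the relevant denominator for the digitised share; the variable names are illustrative only.

```python
# Figures quoted in the text above.
uk_specimens_on_gbif = 7.6e6       # UK specimens freely accessible on GBIF
uk_collection_items = 137e6        # assumed denominator: total UK collection items
share_of_citing_articles = 12.0    # % of GBIF-citing articles that cite UK collections
share_of_gbif_occurrences = 0.3    # % of GBIF occurrences that come from UK collections

# Share of UK collections that is already available through GBIF.
digitised_share = uk_specimens_on_gbif / uk_collection_items * 100

# How much more often UK data are cited than their volume alone would suggest.
citation_multiplier = share_of_citing_articles / share_of_gbif_occurrences

print(f"Digitised share: {digitised_share:.1f}%")
print(f"Citation multiplier: about {citation_multiplier:.0f} times")
```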
When asked previously, over 90% of GBIF users linked their use of these data to advancing the UN Sustainable Development Goals which look to reduce hunger, poverty and inequality, and spur economic growth while tackling climate change and protecting the oceans and forests.
The case for digitising UK natural science collections
The introduction of these collections onto a digital platform has revolutionised scientific research. In this paper published in the journal Research Ideas and Outcomes, the authors estimate £18 million has been saved in efficiencies by researchers accessing digital specimens rather than physical collections, assuming a minimal single physical visit replaced per citation. Of this, £1.4 million has been attributed to UK researchers, money which can be reinvested back into UK science institutions – those at the forefront of finding solutions to real world problems.
Lead author and Deputy Head of Digital, Data and Informatics, Helen Hardy says, ‘The advancement of digitisation has been truly transformational to the scientific community. Today it’s possible to use software that takes a week to achieve the type of information gathering it would take a human over 3,000 hours, or two years, to complete – individuals realising an entire life’s work in just a few months! Anticipation is high for further innovations such as the further integration of artificial intelligence into taxonomic work.’
The UK government wants the UK to be a science and technology superpower, and natural science collections provide a unique opportunity to achieve this. To unlock the true potential of collections data, UK natural science collections are joining forces through the Distributed System of Scientific Collections UK (DiSSCo UK) to make the case for investment of £155 million in a research infrastructure, which is expected to unlock at least a seven- to ten-fold economic return on investment. Working alongside the Arts & Humanities Research Council (AHRC) and UK Research and Innovation (UKRI) to digitise the critical mass of collections, the partners will make the data available through a robust technological infrastructure and continually develop it in line with recent innovations.
Ken Norris, Deputy Director of Science at the NHM says, ‘In the midst of a planetary emergency, and what some experts believe to be the Earth’s sixth mass extinction event, estimates say that over 50% of the world’s GDP, which equates to approx. 44 trillion dollars, is dependent on the natural world. By understanding what is in collections now, both on a national and international scale, we can identify trends, necessary actions, and what we need to collect to underpin policy and investment decisions for a future where people and planet thrive.’
Hardy H, Livermore L, Kersey P, Norris K, Smith V, Pullar J (2023) Users and uses of UK Natural History Collections – a Summary, https://doi.org/10.5281/zenodo.8403318
A longer paper on this study including further detail on the methodology and findings is also available:
Hardy H, Livermore L, Kersey P, Norris K, Smith V (2023) Understanding the users and uses of UK Natural History Collections. Research Ideas and Outcomes 9: e113378. https://doi.org/10.3897/rio.9.e113378
Photo credit: Trustees of the Natural History Museum
An important discussion point was the performance of the four Senckenberg journals, which moved to Pensoft’s publishing platform a few years ago. On the agenda was also the opportunity for an Open Access agreement.
The visit took place at the National Museum of Natural History, Sofia (NMNHS), where Tockner had fruitful discussions with Pensoft’s founder and CEO Prof. Lyubomir Penev and with Prof. Pavel Stoev, Director of the Museum and COO at Pensoft.
An important point in the discussion was the performance of the four scientific journals owned by the Society, which moved to Pensoft’s publishing platform ARPHA a couple of years ago, a move that marked the beginning of a fruitful and highly promising partnership.
On the agenda was also the opportunity for an Open Access agreement to be signed between the Society and the publisher, in order to support researchers who wish to publish in any Pensoft journal.
Tockner was also curious to learn more about the additional publishing services provided by Pensoft via the ARPHA platform, including the various and continuously refined data publishing workflows, and the opportunities to streamline the description of new marine species identified from DNA material.
Later the same year, in November, the journal Contributions to Entomology followed suit. All four of them went for the white-label publishing solution available from ARPHA, designed to preserve the exclusive identity of historical journals.
The partners also talked about further extending the collaboration between Senckenberg and Pensoft to European Commission-funded scientific projects. Tockner was particularly fascinated with the progress made by the ongoing project Biodiversity Community Integrated Knowledge Library (BiCIKL), coordinated by Pensoft and involving 14 European institutions from ten countries. Additionally, over the past 20 years, Pensoft has partnered in over 50 different consortia as a publisher, science communicator and technology provider.
In his role as Director of the NMNHS, Stoev used the occasion to give Tockner a tour of the NMNHS collections and tell him more about the Museum’s latest achievements and projects, as well as its traditions in the fields of human evolution research and paleornithology.
The two also engaged in a lively discussion about the poorly studied biodiversity in Bulgaria and the region, as well as about the recent efforts of the NMNHS team, including the launch of a Bulgarian national unit of DiSSCo tasked to digitise a large proportion of the institution’s collection in the next three years. Tockner and Stoev also talked about the need for additional networking activities and closer collaborations between smaller natural history museums across Europe that could be mediated through the Consortium of European Taxonomic Facilities (CETAF), where Senckenberg is an active member.
The dynamic open-science project collection of BiCIKL, titled “Towards interlinked FAIR biodiversity knowledge: The BiCIKL perspective” (doi: 10.3897/rio.coll.105), continues to grow as the project progresses into its third year and its results accumulate at an ever faster pace.
Following the publication of three important BiCIKL deliverables (the project’s Data Management Plan, its Visual identity package and a report describing the newly built workflow and tools for data extraction, conversion and indexing, and the user applications from OpenBiodiv), there are currently 30 research outcomes in the BiCIKL collection that have been shared publicly with the world, rather than merely submitted to the European Commission.
Shortly after the BiCIKL project started in 2021, a project-branded collection was launched in the open-science scholarly journal Research Ideas and Outcomes (RIO). There, the partners have been publishing – and thus preserving – conclusive research papers, as well as early and interim scientific outputs.
The publications so far also include the BiCIKL grant proposal, which earned the support of the European Commission in 2021; conference abstracts, submitted by the partners to two consecutive TDWG conferences; a project report that summarises recommendations on interoperability among infrastructures, as concluded from a hackathon organised by BiCIKL; and two Guidelines papers, aiming to trigger a culture change in the way data is shared, used and reused in the biodiversity field.
At the time of writing, the top three of the most read papers in the BiCIKL collection is completed by the grant proposal and the second Guidelines paper, where the partners – based on their extensive and versatile experience – present recommendations about the use of annotations and persistent identifiers in taxonomy and biodiversity publishing.
What one might find quite odd when browsing the BiCIKL collection is that each publication is marked with its own publication source, even though all contributions are clearly already accessible from RIO Journal.
This is because one of the unique features of RIO allows consortia to use their project collection as a one-stop access point for all scientific results, regardless of their publication venue, by linking to the original source via metadata. Additionally, projects may upload their documents in their original format and layout, thanks to the integration between RIO and ARPHA Preprints. This is in fact how BiCIKL chose to share their latest deliverables, using the very same files they submitted to the Commission.
“In line with the mission of BiCIKL and our consortium’s dedication to FAIRness in science, we wanted to keep our project’s progress and results fully transparent and easily accessible and reusable to anyone, anywhere,” explains Prof Lyubomir Penev, BiCIKL’s Project Coordinator and founder and CEO of Pensoft.
“This is why we opted to collate the outcomes of BiCIKL in one place – starting from the grant proposal itself, and then progressively adding workshop reports, recommendations, research papers and what not. By the time BiCIKL concludes, not only will we be ready to refer back to any step along the way that we have just walked together, but also rest assured that what we have achieved and learnt remains at the fingertips of those we have done it for and those who come after them,” he adds.
Worldwide, natural history institutions house billions of physical objects in their collections. They create and maintain data about these items, and they share their data with aggregators such as the Global Biodiversity Information Facility (GBIF), the Integrated Digitized Biocollections (iDigBio), the Atlas of Living Australia (ALA), GenBank and the European Nucleotide Archive (ENA).
Even though these data often include the names of the people who collected or identified each object, such statements may be ambiguous, as the names frequently lack any globally unique, machine-readable concept of their shared identity.
Although the data are available online, barriers remain to making effective use of the information about who collected the collection objects or provided the expertise to identify them. People have similar names or change their name over the course of their lifetime (e.g. through marriage), and variability may be introduced through the label transcription process itself (e.g. local look-up lists).
As a result, researchers and collections staff often spend a lot of time deducing who is the person or people behind unknown collector strings while collating or tidying natural history data. The uncertainty about a person’s identity hampers research, hinders the discovery of expertise, and obstructs the ability to give attribution or credit for work performed.
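To illustrate why collector strings are hard to reconcile, here is a minimal sketch that groups name variants under a crude (surname, first initial) key. The label strings are made-up transcriptions, and `name_key` and `candidate_groups` are hypothetical helpers rather than part of any existing tool. The grouping only surfaces candidates for a person to review. It cannot confirm that two strings refer to the same individual, and it does nothing for labels such as "& wife".

```python
import re
import unicodedata
from collections import defaultdict


def name_key(raw: str) -> tuple:
    """Reduce a collector string to a crude (surname, first initial) key."""
    # Strip accents and case so that accented and unaccented spellings match.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    if "," in text:
        # "Surname, Given names" order.
        surname, given = (part.strip() for part in text.split(",", 1))
    else:
        # "Given names Surname" order: the last token is taken as the surname.
        tokens = [t for t in re.split(r"[\s.]+", text) if t]
        if not tokens:
            return ("", "")
        surname, given = tokens[-1], " ".join(tokens[:-1])
    given_parts = re.findall(r"[a-z]+", given)
    initial = given_parts[0][0] if given_parts else ""
    return (re.sub(r"[^a-z]", "", surname), initial)


def candidate_groups(strings):
    """Group raw strings that share a key, as candidates for human review."""
    groups = defaultdict(list)
    for raw in strings:
        groups[name_key(raw)].append(raw)
    return dict(groups)


# Made-up label transcriptions, for illustration only.
labels = [
    "Cockerell, W.P.",
    "W. P. Cockerell",
    "Wilmatte Porter Cockerell",
    "T.D.A. Cockerell",
]
groups = candidate_groups(labels)
for key, variants in groups.items():
    print(key, variants)
```

A key this coarse will also merge different people who share a surname and initial, which is exactly why the remaining evidence (dates, places, co-collectors) still has to be weighed by someone who knows the collection.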
Disambiguation activities – the act of churning strings into verifiable things using all available evidence – need not be done in isolation. In addition to presenting a workflow on how to disambiguate people in collections, we also make the case that working in collaboration with colleagues and the general public presents new opportunities and introduces new efficiencies. There is tacit knowledge everywhere.
More often than not, data about people involved in biodiversity research are scattered across different digital platforms. However, by linking information sources to each other using person identifiers, we can better trace the connections in these networks and weave a more interoperable narrative about every actor.
That said, inconsistent naming conventions or lack of adequate accreditation often frustrate the realization of this vision. This sliver of natural history could be churned to gold with modest improvements in long-term funding for human resources, adjustments to digital infrastructure, space for the physical objects themselves alongside their associated documents, and sufficient training on how to disambiguate people’s names.
The process of properly disambiguating those who have contributed to natural history collections takes time.
The disambiguation process involves the extra challenge of trying to deduce “who is who” for legacy data, compared to undertaking this activity for people alive today. Retrospective disambiguation can require considerable detective work, especially for scarcely known people or if the community has a different naming convention. Provided the results of this effort are well-communicated and openly shared, mercifully, it need only be done once.
At the core of our research is the question of how to solve the issue of assigning proper credit.
In our recent Methods paper, we discuss several methods for this, as well as available routes for making records available online that include not only the names of people expressed as text, but also their unique, resolvable identifiers.
First and foremost, we should maintain our own public biographical data by making full use of ORCID. In addition to preserving our own scientific legacy and that of the institutions that employ us, we have a responsibility to avoid generating unnecessary disambiguation work for others.
For legacy data, where the people connected to the collections are deceased, Wikidata can be used to openly document rich bibliographic and demographic data, each statement with one or more verifiable references. Wikidata can also act as a bridge to link other sources of authority such as VIAF or ORCID identifiers. It has many tools and services to bulk import, export, and to query information, making it well-suited as a universal democratiser of information about people often walled-off in collection management systems (CMS).
Once unique identifiers for people are integrated in collection management systems, these may be shared with the global collections and research community using the new Darwin Core terms recordedByID and identifiedByID, along with the well-known, yet text-based, terms recordedBy and identifiedBy.
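As a minimal sketch of what such a record might look like, the snippet below pairs the text-based recordedBy term with recordedByID. The helper functions and the catalogue number are hypothetical. The ORCID iD is the fictitious sample record ORCID documents for Josiah Carberry, and the Wikidata QID is a made-up placeholder. Joining multiple values with " | " follows the Darwin Core recommendation for list-valued terms.

```python
from typing import Dict, List, Tuple

ORCID_BASE = "https://orcid.org/"
WIKIDATA_BASE = "http://www.wikidata.org/entity/"


def person_uri(identifier: str) -> str:
    """Turn a bare ORCID iD or Wikidata QID into a resolvable URI (hypothetical helper)."""
    identifier = identifier.strip()
    if identifier.upper().startswith("Q") and identifier[1:].isdigit():
        return WIKIDATA_BASE + identifier.upper()
    parts = identifier.split("-")
    if len(parts) == 4 and all(len(part) == 4 for part in parts):
        return ORCID_BASE + identifier
    raise ValueError(f"Unrecognised person identifier: {identifier!r}")


def add_collectors(record: Dict[str, str], people: List[Tuple[str, str]]) -> Dict[str, str]:
    """Return a copy of the record with recordedBy and recordedByID filled in."""
    enriched = dict(record)
    # Keep names and identifiers in the same order so they can be read side by side.
    enriched["recordedBy"] = " | ".join(name for name, _ in people)
    enriched["recordedByID"] = " | ".join(person_uri(pid) for _, pid in people)
    return enriched


record = add_collectors(
    {"catalogNumber": "EXAMPLE-0001"},            # hypothetical specimen record
    [
        ("Josiah Carberry", "0000-0002-1825-0097"),  # ORCID's fictitious sample iD
        ("Example Collector", "Q123456789"),         # placeholder QID, not a real person
    ],
)
for term, value in record.items():
    print(f"{term}: {value}")
```

The same pattern would apply to identifiedBy and identifiedByID for the people who determined the specimen.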
Approximately 120 datasets published through GBIF now make use of these identifier-based terms, which are additionally resolved in Bionomia every few weeks alongside co-curated attributions newly made there. This roundtrip of data – emerging as ambiguous strings of text from the source, affixed with resolvable identifiers elsewhere, absorbed into the source as new digital annotations, and then re-emerging with these fresh, identifier-based enhancements – is an exciting approach to co-manage collections data.
Disambiguation work is particularly important in recognising contributors who have been historically marginalized. For example, gender bias in specimen data can be seen in the case of Wilmatte Porter Cockerell, a prolific collector of botanical, entomological and fossil specimens. Cockerell’s collections are often attributed to her husband as he was also a prolific collector and the two frequently collected together.
On some labels, her identity is further obscured as she is simply recorded as “& wife” (see example on GBIF). Since Wilmatte Cockerell was her husband’s second wife, it can take some effort to confirm if a specimen can be attributed to her and not her husband’s first wife, who was also involved in collecting specimens. By ensuring that Cockerell is disambiguated and her contributions are appropriately attributed, the impact of her work becomes more visible, enabling her work to be properly and fairly credited.
Thus, disambiguation work not only helps to give credit where credit is due, thereby making data about people and their biodiversity collections more findable, but also creates an inclusive and representative narrative of the landscape of people involved with scientific knowledge creation, identification, and preservation.
A future – once thought to be a dream – where the complete scientific output of a person is connected as Linked Open Data (LOD) is now.
Both the tools and infrastructure are at our disposal and the demand is palpable. All institutions can contribute to this movement by sharing data that include unique identifiers for the people in their collections. We recommend that institutions develop a strategy, perhaps starting with employees and curatorial staff, people of local significance, or those who have been marginalized, and additionally capitalize on existing disambiguation activities elsewhere. This will have local utility and will make a significant, long-term impact.
The more we participate in these activities, the greater the chance that we will uncover positive feedback loops, which will act to lighten the workload for all involved, including our future selves!
The disambiguation of people in collections is an ongoing process, but it becomes easier with practice. We also encourage collections staff to consider modifying their existing workflows and policies to include identifiers for people at the outset, when new data are generated or when new specimens are acquired.
There is more work required at the global level to define, update, and ratify standards and best practices to help accelerate data exchange or roundtrips of this information; there is room for all contributions. Thankfully, there is a diverse, welcoming, energetic, and international community involved in these activities.
We see a bright future for you, our collections, and our research products – well within reach – when the identities of people play a pivotal role in the construction of a knowledge graph of life.
A good start is also to check Bionomia to find out what metrics exist now for your institution or collection and affiliated people.
The next steps for collections: 7 objectives that can help to disambiguate your institution’s collection:
Groom Q, Bräuchler C, Cubey RWN, Dillen M, Huybrechts P, Kearney N, Klazenga N, Leachman S, Paul DL, Rogers H, Santos J, Shorthouse DP, Vaughan A, von Mering S, Haston EM (2022) The disambiguation of people names in biological collections. Biodiversity Data Journal 10: e86089. https://doi.org/10.3897/BDJ.10.e86089
In their Research Idea, published in Research Ideas and Outcomes (RIO Journal), a Swiss-Dutch research team presents a promising machine-learning ecosystem to unite experts around the world and make up for the shortage of expert staff.
Guest blog post by Luc Willemse, Senior collection manager at Naturalis Biodiversity Centre (Leiden, Netherlands)
Imagine the workday of a curator in a national natural history museum. Having spent several decades learning about a specific subgroup of grasshoppers, that person is now busy working on the identification and organisation of the holdings of the institution. To do this, the curator needs to study in detail a huge number of undescribed grasshoppers collected from all sorts of habitats around the world.
The problem here, however, is that a curator at a smaller natural history institution is usually responsible for all insects kept at the museum, ranging from butterflies to beetles, flies and so on. In total, we know of around 1 million described insect species worldwide. Meanwhile, another 3,000 are being added each year, while many more are redescribed as a result of further study and new discoveries. Becoming a specialist for grasshoppers was already a laborious activity that took decades. How about knowing all insects of the world? That’s simply impossible.
Then, how could we expect one person to sort and update all collections at a museum, an activity that is the cornerstone of biodiversity research? A part of the solution, hiring and training additional staff, is costly and time-consuming, especially when we know that experts on certain species groups are already scarce on a global scale.
We believe that automated image recognition holds the key to reliable and sustainable practices at natural history institutions.
Today, image recognition tools integrated in mobile apps are already being used even by citizen scientists to identify plants and animals in the field. Based on an image taken by a smartphone, those tools identify specimens on the fly and estimate the accuracy of their results. What’s more, those identifications have proven to be almost as accurate as those done by humans. This gives us hope that we could help curators at museums worldwide take better and more timely care of the collections they are responsible for.
However, specimen identification for natural history institutions is still much more complex than identification in the field. After all, the information these institutions store and should be able to provide is meant to serve as a knowledge hub for educational and reference purposes for present and future generations of researchers around the globe.
This is why we propose a sustainable system where images, knowledge, trained recognition models and tools are exchanged between institutes, and where international collaboration between museums of all sizes is crucial. The aim is to have a system that will benefit the entire community of natural history collections in providing further access to their invaluable collections.
We propose four elements to this system:
A central library of already trained image recognition models (algorithms) needs to be created. It will be openly accessible, so any other institute can profit from models trained by others.
A central library of datasets accessing images of collection specimens that have recently been identified by experts. This will provide an indispensable source of images for training new algorithms.
A digital workbench that provides an easy-to-use interface for inexperienced users to customise the algorithms and datasets to the particular needs in their own collections.
As the entire system depends on international collaboration as well as sharing of algorithms and datasets, a user forum is essential to discuss issues, coordinate, evaluate, test or implement novel technologies.
How would this work on a daily basis for curators? We provide two examples of use cases.
First, let’s zoom in to a case where a curator needs to identify a box of insects, for example bush crickets, to a lower taxonomic level. Here, they would take an image of the box and split it into segments of individual specimens. Then, image recognition will identify the bush crickets to a lower taxonomic level. The result, which we present in the table below, will be used to update object-level registration or to physically rearrange specimens into more accurate boxes. This entire step can also be done by non-specialist staff.
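The steps in this first use case can be sketched as a small pipeline. This is only a skeleton under stated assumptions: the regular grid split, the 0.8 review threshold and `fake_classifier` are all invented for illustration. A real system would use a proper segmentation step and a trained recognition model in place of the stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


@dataclass
class Identification:
    segment: Box
    taxon: str
    confidence: float
    needs_review: bool


def grid_segments(width: int, height: int, columns: int, rows: int) -> List[Box]:
    """Split a box image into equal cells (assumes specimens are pinned on a regular grid)."""
    cell_w, cell_h = width // columns, height // rows
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(columns)
    ]


def identify_box(
    segments: List[Box],
    classify: Callable[[Box], Tuple[str, float]],
    review_below: float = 0.8,  # assumed threshold for flagging a result for an expert
) -> List[Identification]:
    """Run the classifier on every segment and flag low-confidence results."""
    results = []
    for segment in segments:
        taxon, confidence = classify(segment)
        results.append(Identification(segment, taxon, confidence, confidence < review_below))
    return results


def fake_classifier(segment: Box) -> Tuple[str, float]:
    """Stand-in for a trained model: returns canned answers based on position only."""
    if segment[0] == 0:
        return ("Tettigonia viridissima", 0.93)
    return ("Tettigoniidae", 0.55)


segments = grid_segments(width=200, height=100, columns=2, rows=1)
results = identify_box(segments, fake_classifier)
for row in results:
    print(row.segment, row.taxon, row.confidence, "review" if row.needs_review else "ok")
```

Rows flagged for review are the ones a specialist would still need to look at. The rest could feed directly into registration updates or the physical rearrangement of specimens.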
Another example is to incorporate image recognition tools into digitisation processes that include imaging specimens. In this case, image recognition tools can be used on the fly to check or confirm the identifications and thus improve data quality.
Using image recognition tools to identify specimens in museum collections is likely to become common practice in the future. It is a technical tool that will enable the community to share available taxonomic expertise.
Using image recognition tools makes it possible to identify species groups for which there is very limited or no in-house expertise. Such practices would substantially reduce costs and time spent per treated item.
Image recognition applications carry metadata like version numbers and/or datasets used for training. Additionally, such an approach would make identification more transparent than the one carried out by humans whose expertise is, by design, in no way standardised or transparent.
Greeff M, Caspers M, Kalkman V, Willemse L, Sunderland BD, Bánki O, Hogeweg L (2022) Sharing taxonomic expertise between natural history collections using image recognition. Research Ideas and Outcomes 8: e79187. https://doi.org/10.3897/rio.8.e79187
In a world first, the Natural History Museum, London, has collaborated with economic consultants, Frontier Economics Ltd, to explore the economic and societal value of digitising natural history collections and concluded that digitisation has the potential to see a seven to tenfold return on investment. Whilst significant progress is already being made at the Museum, additional investment is needed in order to unlock the full potential of the Museum’s vast collections – more than 80 million objects. The project’s report is published in the open science scientific journal Research Ideas and Outcomes (RIO Journal).
· The societal benefits of digitising natural history collections extend to global advancements in food security, biodiversity conservation, medicine discovery, minerals exploration, and beyond
· Brand new, rigorous economic report predicts investing in digitising natural history museum collections could also result in a tenfold return
· The Natural History Museum, London, has so far made over 4.9 million digitised specimens available freely online – over 28 billion records have been downloaded in over 429,000 download events over the past six years
Digitisation at the Natural History Museum, London
Digitisation is the process of creating and sharing the data associated with Museum specimens. To digitise a specimen, all its related information is added to an online database. This typically includes where and when it was collected and who found it, and can include photographs, scans and molecular data if available. Natural history collections are a unique record of biodiversity dating back hundreds of years, and geodiversity dating back millennia. Creating and sharing data this way enables science that would have otherwise been impossible, and accelerates the rate at which important discoveries are made from our collections.
The Natural History Museum’s collection of 80 million items is one of the largest and most historically and geographically diverse in the world. By unlocking the collection online, the Museum provides free and open access for global researchers, scientists, artists and more. Since 2015, the Museum has made 4.9 million specimens available on the Museum’s Data Portal, which have seen more than 28 billion downloads over 427,000 download events.
This means the Museum has digitised about 6% of its collections to date. Because digitisation is expensive, costing tens of millions of pounds, it is difficult to make a case for further investment without better understanding the value of this digitisation and its benefits.
In 2021, the Museum decided to explore the economic impacts of collections data in more depth, and commissioned Frontier Economics to undertake modelling, resulting in this project report, now made publicly available in the open-science journal Research Ideas and Outcomes (RIO Journal), and confirming benefits in excess of £2 billion over 30 years. While the methods in this report are relevant to collections globally, this modelling focuses on benefits to the UK, and is intended to support the Museum’s own digitisation work, as well as a current scoping study funded by the Arts & Humanities Research Council about the case for digitising all UK natural science collections as a research infrastructure.
How does digitisation impact scientific research?
The data from museum collections accelerate scientific research, which in turn creates benefits for society and the economy across a wide range of sectors. Frontier Economics Ltd have looked at the impact of collections data in five of these sectors: biodiversity conservation, invasive species, medicines discovery, agricultural research and development, and mineral exploration.
The new analyses attempt to estimate the economic value of these benefits using a range of approaches, with the results in broad agreement that the benefits of digitisation are at least ten times greater than the costs. This represents a compelling case for investment in museum digital infrastructure without which the many benefits will not be realised.
Other benefits could include improvements to the resilience of agricultural crops by better understanding their wild relatives, research into invasive species which can cause significant damage to ecosystems and crops, and improving the accuracy of mining.
Finally, such work could also change how science itself is conducted. The very act of digitising specimens means that researchers anywhere on the planet can access these collections, saving time and money that might otherwise have been spent travelling to see specific objects.
Popov D, Roychoudhury P, Hardy H, Livermore L, Norris K (2021) The Value of Digitising Natural History Collections. Research Ideas and Outcomes 7: e78844. https://doi.org/10.3897/rio.7.e78844