Pensoft journals integrated with Catalogue of Life to help list the species of the world

While not every taxonomic study is conducted with a nature conservation idea in mind, most ecological initiatives need to be backed by exhaustive taxonomic research. There simply isn’t a way to assess a species’ distributional range, migratory patterns or ecological trends without knowing what this species actually is and where it is coming from.

In order to facilitate taxonomic and other studies, and lay the foundations for effective biodiversity conservation at a time when habitat loss and species extinction are already part of our everyday life, the global organisation Catalogue of Life (CoL) works together with major programmes, including GBIF, Encyclopedia of Life and the IUCN Red List, to collate the names of all species on the planet, set in the context of a taxonomic hierarchy and their distribution.

Recently, the scholarly publisher and technology provider Pensoft implemented a new integration with CoL, joining the effort to encourage authors who publish global taxonomic reviews in any of the publisher’s journals to upload their taxonomic contributions to the database.

Whenever authors submit a manuscript containing a world revision or checklist of a taxon to a Pensoft journal, they are offered the possibility to upload their datasets in CoL-compliant format, so that they can contribute to CoL, gain more visibility and credit for their work, and support future research and conservation initiatives.

Once the authors upload the dataset, Pensoft will automatically notify CoL about the new contribution, so that the organisation can further process the knowledge and contact the authors, if necessary.

In addition, CoL will consider indexing global taxonomic checklists that have already been published by Pensoft.

It is worth noting that, unlike an automated search engine, CoL does not simply gather the uploaded data and store them. All databases in CoL are thoroughly reviewed by experts in the relevant field and comply with a set of explicit instructions.

“Needless to say that the Species 2000 / Catalogue of Life community is very happy with this collaboration,” says Dr. Peter Schalk, Executive Secretary.

“It is essential that all kinds of data and information sharing initiatives in the realm of taxonomy and biodiversity science get connected, in order to provide integrated quality services to the users in and outside of our community. The players in this field carry responsibility to forge partnerships and collaborations that create added value for science and society and are mutually reinforcing for the participants. Our collaboration is a fine example how this can be achieved,” he adds.

“With our extensive experience in biodiversity research, at Pensoft we have already taken various steps to encourage and support data sharing practices,” says Prof. Lyubomir Penev, Pensoft’s founder and CEO. “To better serve this purpose, last year, we even published a set of guidelines and strategies for scholarly publishing of biodiversity data as recommended by our own experience. Furthermore, at our Biodiversity Data Journal, we have not only made the publication of open data mandatory, but we were also the first to implement integrated narrative and data publication within a single paper.”

“It only makes sense to collaborate with organisations, such as Catalogue of Life, to make sure that all these global indexers are up-to-date and serve the world’s good in preserving our wonderful biodiversity,” he concludes.

How the names of organisms help to turn ‘small data’ into ‘Big Data’

Innovation in ‘Big Data’ helps address problems that were previously overwhelming. What we know about organisms is spread across hundreds of millions of pages published over 250 years. New software tools from the Global Names project find scientific names, index digital documents quickly, and correct and update names. These advances help in “making small data big” by linking together the content of many research efforts. The study was published in the open access journal Biodiversity Data Journal.

The ‘Big Data’ vision of science is transformed by computing resources to capture, manage, and interrogate the deluge of information coming from new technologies, infrastructural projects to digitise physical resources (such as our literature from the Biodiversity Heritage Library), or digital versions of specimens and records about specimens by museums.

Increased bandwidth has made dialogue among distributed data centres feasible and this is how new insights into biology are arising. In the case of biodiversity sciences, data centres range in size from the large GenBank for molecular records and the Global Biodiversity Information Facility for records of occurrences of species, to a long tail of tens of thousands of smaller datasets and web-sites which carry information compiled by individuals, research projects, funding agencies, local, state, national and international governmental agencies.

The large biological repositories do not yet approach the scale of astronomy and nuclear physics, but the very large number of sources in the long tail of useful resources does present biodiversity informaticians with a major challenge: how to discover, index, organize and interconnect the information contained in a very large number of locations.

In this regard, biology is fortunate that, from the middle of the 18th century, the community has accepted the use of Latin binomials such as Homo sapiens or Ba humbugi for species. All names are listed by taxonomists. Name recognition tools can call on large expert compilations of names (Catalogue of Life, Zoobank, Index Fungorum, Global Names Index) to find matches in sources of digital information. This allows for the rapid indexing of content.

Even when we do not know a name, we can ‘discover’ it because scientific names have certain distinctive characteristics: they are written in italics and most often consist of two successive words in a latinised form, with the first one capitalised. These properties allow names not yet present in compilations of names to be discovered in digital data sources.
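The discovery heuristic described above can be illustrated with a rough sketch. This is not the Global Names software; the function name `find_candidate_names`, the regular expression and the short list of Latin-looking endings are all assumptions made for illustration, and italics are ignored because plain text does not carry them.

```python
import re

# Assumed pattern: a capitalised genus-like word followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")

# Hypothetical, deliberately short list of latinised endings used to
# filter out ordinary word pairs such as "Records mention".
LATIN_ENDINGS = ("us", "a", "um", "is", "i", "ae", "ens")


def find_candidate_names(text):
    """Return candidate binomials found in text, in order of appearance."""
    candidates = []
    for genus, epithet in BINOMIAL.findall(text):
        if epithet.endswith(LATIN_ENDINGS):
            candidates.append(f"{genus} {epithet}")
    return candidates


print(find_candidate_names("Records mention Ba humbugi and Homo sapiens."))
```

A filter this simple will still produce false positives (for example, "The data") and will miss names with unusual endings, which is one reason real tools also consult expert compilations of names.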

The idea of a names-based cyberinfrastructure is to use names to interconnect large and small sites of expert knowledge distributed across the Internet. This is the concept behind the Global Names project, which carried out the work described in this paper.

The effectiveness of such an infrastructure is compromised by the changes to names over time because of taxonomic and phylogenetic research. Names are often misspelled, or there might be errors in the way names are presented. Meanwhile, increasing numbers of species have no names, but are distinguished by their molecular characteristics.

To gauge the challenge that these problems may present to the realization of a names-based cyberinfrastructure, we compared names from GenBank and DRYAD (a digital data repository) with names from Catalogue of Life to see how well they match.

We found that fewer than 15% of the names in pair-wise comparisons of these data sources could be matched. However, with a names parser to break the scientific names into all of their component parts, those parts that present the greatest number of problems could be removed to produce a simplified or canonical version of the name. Thanks to such tools, name-matching was improved to almost 85%, and in some cases to 100%.
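To make the idea of a canonical form concrete, here is a minimal sketch of how stripping the troublesome parts of a name can raise the match rate. It is a simplified stand-in, not the parser used in the study: the functions `canonical_form` and `match_rate` and the small list of author particles are hypothetical, and real name strings are far more varied than this handles.

```python
import re

# Hypothetical list of lowercase words that belong to authorship, not the name.
AUTHOR_PARTICLES = {"et", "and", "ex", "in", "de", "von", "van", "der", "al"}


def canonical_form(name):
    """Reduce a name string to the genus plus its lowercase epithets."""
    name = re.sub(r"\([^)]*\)", " ", name)   # drop parenthesised authorship
    tokens = re.findall(r"[A-Za-z-]+\.?", name)  # years and commas fall away
    if not tokens:
        return ""
    genus = tokens[0].capitalize()
    epithets = [
        t for t in tokens[1:]
        if t.islower() and not t.endswith(".") and t not in AUTHOR_PARTICLES
    ]
    return " ".join([genus] + epithets)


def match_rate(names_a, names_b):
    """Share of names_a whose canonical form also occurs in names_b."""
    canon_b = {canonical_form(n) for n in names_b}
    hits = sum(1 for n in names_a if canonical_form(n) in canon_b)
    return hits / len(names_a) if names_a else 0.0


print(canonical_form("Homo sapiens Linnaeus, 1758"))
```

Compared as raw strings, "Homo sapiens Linnaeus, 1758" and "Homo sapiens L." do not match; after reduction both become "Homo sapiens", which is the kind of improvement the comparison above reports at scale.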

The study confirms the potential for the use of names to link distributed data and to make small data big. Nonetheless, it is clear that we need to continue to invest in more and better names-management software specially designed to address the problems in the biodiversity sciences.

###

Original source:

Patterson D, Mozzherin D, Shorthouse D, Thessen A (2016) Challenges with using names to link digital biodiversity information. Biodiversity Data Journal, doi: 10.3897/BDJ.4.e8080.

Additional information:

The study was supported by the National Science Foundation.