New BiCIKL project to build a freeway between pieces of biodiversity knowledge

Within Biodiversity Community Integrated Knowledge Library (BiCIKL), 14 key research and natural history institutions commit to link infrastructures and technologies to provide flawless access to biodiversity data.

In a recently started Horizon 2020-funded project, 14 European institutions from 10 countries, representing both the continent’s and global key players in biodiversity research and natural history, deploy and improve their own and partnering infrastructures to bridge gaps between each other’s biodiversity data types and classes. By linking their technologies, they are set to provide flawless access to data across all stages of the research cycle.

Three years in, BiCIKL (abbreviation for Biodiversity Community Integrated Knowledge Library) will have created the first-of-its-kind Biodiversity Knowledge Hub, where a researcher will be able to retrieve a full set of linked and open biodiversity data, thereby accessing the complete story behind an organism of interest: its name, genetics, occurrences, natural history, as well as authors and publications mentioning any of those.

Ultimately, the project’s products will solidify Open Science and FAIR (Findable, Accessible, Interoperable and Reusable) data practices by empowering and streamlining biodiversity research.

Together, the project partners will redesign the way biodiversity data is found, linked, integrated and re-used across the research cycle. By the end of the project, BiCIKL will provide the community with a more transparent, trustworthy and efficient, highly automated research ecosystem, allowing scientists to access, explore and put into further use a wide range of data with only a few clicks.

“In recent years, we’ve made huge progress on how biodiversity data is located, accessed, shared, extracted and preserved, thanks to a vast array of digital platforms, tools and projects looking after the different types of data, such as natural history specimens, species descriptions, images, occurrence records and genomics data, to name a few. However, we’re still missing an interconnected and user-friendly environment to pull all those pieces of knowledge together. Within BiCIKL, we all agree that it’s only after we puzzle out how to best bridge our existing infrastructures and the information they are continuously sourcing that future researchers will be able to realise their full potential,” 

explains BiCIKL’s project coordinator Prof. Lyubomir Penev, CEO and founder of Pensoft, a scholarly publisher and technology provider company.

Continuously fed with data sourced by the partnering institutions and their infrastructures, BiCIKL’s key final output, the Biodiversity Knowledge Hub, is set to persist long after the project has concluded. Rather than fading, it will accelerate biodiversity research that builds on, rather than duplicates, existing knowledge, and so provide access to exponentially growing, contextualised biodiversity data.

***

Learn more about BiCIKL on the project’s website at: bicikl-project.eu

Follow BiCIKL Project on Twitter and Facebook. Join the conversation on Twitter at #BiCIKL_H2020.

***

The project partners:

One water bucket to find them all: Detecting fish, mammals, and birds from a single sample

Revolutionary environmental DNA analysis holds great potential for the future of biodiversity monitoring, concludes a new study.

Collection of water samples for eDNA metabarcoding bioassessment.
Photo by Till-Hendrik Macher.

In times of exacerbating biodiversity loss, reliable data on species occurrence are essential, in order for prompt and adequate conservation actions to be initiated. This is especially true for freshwater ecosystems, which are particularly vulnerable and threatened by anthropogenic impacts. Their ecological status has already been highlighted as a top priority by multiple national and international directives, such as the European Water Framework Directive.

However, traditional monitoring methods, such as electrofishing, trapping methods, or observation-based assessments, which are the current status quo in fish monitoring, are often time-consuming and costly. As a result, over the last decade, scientists have increasingly agreed that we need a more comprehensive and holistic method to assess freshwater biodiversity.

Meanwhile, recent studies have repeatedly demonstrated that eDNA metabarcoding analysis, where DNA traces found in the water are used to identify what organisms live there, is an efficient method to capture aquatic biodiversity in a fast, reliable, non-invasive and relatively low-cost manner. In such metabarcoding studies, scientists sample, collect and sequence DNA, so that they can compare it with existing databases and identify the source organisms.

Furthermore, as eDNA metabarcoding assessments use samples from water, often streams, located at the lowest point, one such sample usually contains not only traces of specimens that come into direct contact with water, for example, by swimming or drinking, but also collects traces of terrestrial species indirectly via rainfalls, snowmelt, groundwaters etc. 

In standard fish eDNA metabarcoding assessments, these ‘bycatch data’ are typically left aside. Yet, from a viewpoint of a more holistic biodiversity monitoring, they hold immense potential to also detect the presence of terrestrial and semi-terrestrial species in the catchment.

In their new study, reported in the open-access scholarly journal Metabarcoding and Metagenomics, German researchers from the University of Duisburg-Essen and the German Environment Agency successfully detected an astonishing quantity of the local mammals and birds native to the Saxony-Anhalt state by collecting as much as 18 litres of water from across a two-kilometre stretch along the river Mulde.

After water filtration the eDNA filter is preserved in ethanol until further processing in the lab.
Photo by Till-Hendrik Macher.

In fact, it took only one day for the team, led by Till-Hendrik Macher, PhD student in the German Federal Environmental Agency-funded GeDNA project, to collect the samples. Using metabarcoding to analyse the DNA from the samples, the researchers identified as much as 50% of the fishes, 22% of the mammal species, and 7.4% of the breeding bird species in the region. 

However, the team also concluded that while it would normally take only 10 litres of water to assess the aquatic and semi-terrestrial fauna, terrestrial species required significantly more sampling.

Unlocking data from the increasingly available fish eDNA metabarcoding information enables synergies among terrestrial and aquatic biodiversity monitoring programs, adding further important information on species diversity in space and time. 

“We thus encourage to exploit fish eDNA metabarcoding biodiversity monitoring data to inform other conservation programs,”

says lead author Till-Hendrik Macher. 

“For that purpose, however, it is essential that eDNA data is jointly stored and accessible for different biodiversity monitoring and biodiversity assessment campaigns, either at state, federal, or international level,”

concludes Florian Leese, who coordinates the project.

Original source:

Macher T-H, Schütz R, Arle J, Beermann AJ, Koschorreck J, Leese F (2021) Beyond fish eDNA metabarcoding: Field replicates disproportionately improve the detection of stream associated vertebrate species. Metabarcoding and Metagenomics 5: e66557. https://doi.org/10.3897/mbmg.5.66557

48 years of Australian collecting trips in one data package

From 1973 to 2020, Australian zoologist Dr Robert Mesibov kept careful records of the “where” and “when” of his plant and invertebrate collecting trips. Now, he has made those valuable biodiversity data freely and easily accessible via the Zenodo open-data repository, so that future researchers can rely on this “authority file” when using museum specimens collected from those events in their own studies. The new dataset is described in the open-access, peer-reviewed Biodiversity Data Journal.

While checking museum records, Dr Robert Mesibov found there were occasional errors in the dates and places for specimens he had collected many years before. He was not surprised.

“It’s easy to make mistakes when entering data on a computer from paper specimen labels”, said Mesibov. “I also found specimen records that said I was the collector, but I know I wasn’t!”

One solution to this problem was what librarians and others have long called an “authority file”.

“It’s an authoritative reference, in this case with the correct details of where I collected and when”, he explained.

“I kept records of almost all my collecting trips from 1973 until I retired from field work in 2020. The earliest records were on paper, but I began storing the key details in digital form in the 1990s.”

The 48-year record has now been made publicly available via the Zenodo open-data repository after conversion to the Darwin Core data format, which is widely used for sharing biodiversity information. With this “authority file”, described in detail in the open-access, peer-reviewed Biodiversity Data Journal, future researchers will be able to rely on sound, interoperable and easy-to-access data when using those museum specimens in their own studies, instead of repeating and further spreading unintentional errors.

“There are 3829 collecting events in the authority file”, said Mesibov, “from six Australian states and territories. For each collecting event there are geospatial and date details, plus notes on the collection.”

Mesibov hopes the authority file will be used by museums to correct errors in their catalogues.

“It should also save museums a fair bit of work in future”, he explained. “No need to transcribe details on specimen labels into digital form in a database, because the details are already in digital form in the authority file.”

Mesibov points out that in the 19th and 20th centuries, lists of collecting events were often included in the reports of major scientific expeditions.

“Those lists were authority files, but in the pre-digital days it was probably just as easy to copy collection data from specimen labels.”

“In the 21st century there’s a big push to digitise museum specimen collections”, he said. “Museum databases often have lookup tables with scientific names and the names of collectors. These lookup tables save data entry time and help to avoid errors in digitising.”

“Authority files for collecting events are the next logical step,” said Mesibov. “They can be used as lookup tables for all the important details of individual collections: where, when, by whom and how.”
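As a rough illustration of the lookup-table idea, the sketch below compares one museum record against an authority entry and reports the fields that disagree. It is not taken from the published dataset: the event identifier, the values and the function name are invented for the example, and only the field names (eventID, eventDate, decimalLatitude, decimalLongitude, recordedBy) are borrowed from Darwin Core.

```python
# Hypothetical authority file: one entry per collecting event, keyed by an
# invented event identifier. Field names follow Darwin Core terms; the values
# are made up for illustration.
AUTHORITY = {
    "EVT-0001": {
        "eventDate": "1987-03-14",
        "decimalLatitude": "-41.2000",
        "decimalLongitude": "146.5000",
        "recordedBy": "A. Collector",
    },
}


def find_discrepancies(record, authority=AUTHORITY):
    """Return {field: (museum_value, authority_value)} for fields that differ.

    Returns None when the record's eventID is not in the authority file.
    """
    event = authority.get(record.get("eventID"))
    if event is None:
        return None
    return {
        field: (record.get(field), value)
        for field, value in event.items()
        if record.get(field) != value
    }


# A museum record with a transposed year in the date (a typical typing slip).
museum_record = {
    "eventID": "EVT-0001",
    "eventDate": "1978-03-14",
    "decimalLatitude": "-41.2000",
    "decimalLongitude": "146.5000",
    "recordedBy": "A. Collector",
}

discrepancies = find_discrepancies(museum_record)
print(discrepancies)
```

In practice a museum would more likely match on collector, date and locality than on a shared identifier; the keyed lookup is only the simplest case.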

###

Research paper:

Mesibov RE (2021) An Australian collector’s authority file, 1973–2020. Biodiversity Data Journal 9: e70463. https://doi.org/10.3897/BDJ.9.e70463

###

Robert Mesibov’s webpage: https://www.datafix.com.au/mesibov.html

Robert Mesibov’s ORCID page: https://orcid.org/0000-0003-3466-5038

Unlocking Australia’s biodiversity, one dataset at a time

Illustration by CSIRO

Australia’s unique and highly endemic flora and fauna are threatened by rapid losses in biodiversity and ecosystem health, caused by human influence and environmental challenges. To monitor and respond to these trends, scientists and policy-makers need reliable data.

Biodiversity researchers and managers often don’t have the necessary information, or access to it, to tackle some of the greatest environmental challenges facing society, such as biodiversity loss or climate change. Data can be a powerful tool for the development of science and decision-making, which is where the Atlas of Living Australia (ALA) comes in.

ALA – Australia’s national biodiversity database – uses cutting-edge digital tools which enable people to share, access and analyse data about local plants, animals and fungi. It brings together millions of sightings as well as environmental data like rainfall and temperature in one place to be searched and analysed. All data are made publicly available – ALA was established in line with open-access principles and uses an open-source code base.

The impressive set of databases on Australia’s biodiversity includes information on species occurrence, animal tracking, specimens, biodiversity projects, and Australia’s Natural History Collections. The ALA also manages a wide range of other data, including information on spatial layers, indigenous ecological knowledge, taxonomic profiles and biodiversity literature. Together with its partner tools, the ALA has radically enhanced ease of access to biodiversity data. A forum paper recently published with the open-access, peer-reviewed Biodiversity Data Journal details its history, current state and future directions.

Established in 2010 under the Australian Government’s National Collaborative Research Infrastructure Strategy (NCRIS) to support the research sector with trusted biodiversity data, the ALA now delivers data and related services to more than 80,000 users every year, helping scientists, policy makers, environmental planners, industry, and the general public to work more efficiently. It also supports the international community as the Australian node of the Global Biodiversity Information Facility and the code base for the successful international Living Atlases community.

With thousands of records being added daily, the ALA currently contains nearly 95 million occurrence records of over 111,000 species, the earliest of them being from the late 1600s. Among them, 1.7 million are observation records harvested by computer algorithms, and the trend is that their share will keep growing.

An ALA staff member. Photo by CSIRO

Recognising the potential of citizen science for contributing valuable information to Australia’s biodiversity, the ALA became a member of the iNaturalist Network in 2019 and established an Australian iNaturalist node to encourage people to submit their species observations. Projects like DigiVol and BioCollect were also born from ALA’s interest in empowering citizen science.

The ALA BioCollect platform supports biodiversity-related projects by capturing both descriptive metadata and raw primary field data. BioCollect has a strong citizen science emphasis, with 524 citizen science projects that are open to involvement by anyone. The platform also provides information on projects related to ecoscience and natural resource management activities.

Hosted by the Australian Museum, DigiVol is a volunteer portal where over 6,000 public volunteers have transcribed over 800,000 specimen labels and 124,000 pages of field notes. Harnessing the power and passion of volunteers, the tool makes more information available to science by digitising specimens, images, field notes and archives from collections all over the world.

Built on a decade of partnerships with biodiversity data partners, government departments, community and citizen science organisations, the ALA provides a robust suite of services, including a range of data systems and software applications that support both the research sector and decision makers. Well regarded both domestically and internationally, it has built a national community that is working to improve the availability and accessibility of biodiversity data.

Original source:

Belbin L, Wallis E, Hobern D, Zerger A (2021) The Atlas of Living Australia: History, current state and future directions. Biodiversity Data Journal 9: e65023. https://doi.org/10.3897/BDJ.9.e65023

Call for data papers describing datasets from Russia to be published in Biodiversity Data Journal

GBIF partners with FinBIF and Pensoft to support publication of new datasets about biodiversity from across Russia

Original post via GBIF

In collaboration with the Finnish Biodiversity Information Facility (FinBIF) and Pensoft Publishers, GBIF has announced a new call for authors to submit and publish data papers on Russia in a special collection of Biodiversity Data Journal (BDJ). The call extends and expands upon a successful effort in 2020 to mobilize data from European Russia.

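Between now and 15 September 2021, the article processing fee (normally €550) will be waived for the first 36 papers, provided that the publications are accepted and that the data paper describes a dataset meeting the following criteria (see the definition of terms below):

  • with more than 5,000 records that are new to GBIF.org
  • with high-quality data and metadata
  • with geographic coverage in Russia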

The manuscript must be prepared in English and submitted in accordance with BDJ’s instructions to authors by 15 September 2021. Late submissions will not be eligible for APC waivers.

Sponsorship is limited to the first 36 accepted submissions meeting these criteria on a first-come, first-served basis. The call for submissions can therefore close prior to the stated deadline of 15 September 2021. Authors may contribute to more than one manuscript, but artificial division of the logically uniform data and data stories, or “salami publishing”, is not allowed.

BDJ will publish a special issue including the selected papers by the end of 2021. The journal is indexed by Web of Science (Impact Factor 1.331), Scopus (CiteScore: 2.1) and listed in РИНЦ / eLibrary.ru.

For non-native speakers, please ensure that your English is checked either by native speakers or by professional English-language editors prior to submission. You may credit these individuals as a “Contributor” through the AWT interface. Contributors are not listed as co-authors but can help you improve your manuscripts.

In addition to the BDJ instructions to authors, it is required that datasets referenced from the data paper a) are cited with the dataset’s DOI, b) appear in the paper’s list of references, and c) have “Russia 2021” in Project Data: Title and “N-Eurasia-Russia2021” in Project Data: Identifier in the dataset’s metadata.

Authors should explore the GBIF.org section on data papers and Strategies and guidelines for scholarly publishing of biodiversity data. Manuscripts and datasets will go through a standard peer-review process. When submitting a manuscript to BDJ, authors are requested to select the Biota of Russia collection.

To see an example, view this dataset on GBIF.org and the corresponding data paper published by BDJ.

Questions may be directed either to Dmitry Schigel, GBIF scientific officer, or Yasen Mutafchiev, managing editor of Biodiversity Data Journal.

The 2021 extension of the collection of data papers will be edited by Vladimir Blagoderov, Pedro Cardoso, Ivan Chadin, Nina Filippova, Alexander Sennikov, Alexey Seregin, and Dmitry Schigel.

This project is a continuation of the successful call for data papers from European Russia in 2020. The funded papers are available in the Biota of Russia special collection and the datasets are shown on the project page.

***

Definition of terms

Datasets with more than 5,000 records that are new to GBIF.org

Datasets should contain at least 5,000 records that are new to GBIF.org. While the focus is on additional records for the region, records already published in GBIF may meet the criteria of ‘new’ if they are substantially improved, particularly through the addition of georeferenced locations. Artificial reduction of records from otherwise uniform datasets to the necessary minimum (“salami publishing”) is discouraged and may result in rejection of the manuscript. New submissions describing updates of datasets already presented in earlier published data papers will not be sponsored.

Justification for publishing datasets with fewer records (e.g. sampling-event datasets, sequence-based data, checklists with endemics etc.) will be considered on a case-by-case basis.

Datasets with high-quality data and metadata

Authors should start by publishing a dataset comprising data and metadata that meet GBIF’s stated data quality requirement. This effort will involve work on an installation of the GBIF Integrated Publishing Toolkit.

Only when the dataset is prepared should authors then turn to working on the manuscript text. The extended metadata you enter in the IPT while describing your dataset can be converted into a manuscript with a single click of a button in the ARPHA Writing Tool (see also Creation and Publication of Data Papers from Ecological Metadata Language (EML) Metadata). Authors can then complete, edit and submit manuscripts to BDJ for review.

Datasets with geographic coverage in Russia

In correspondence with the funding priorities of this programme, at least 80% of the records in a dataset should have coordinates that fall within the priority area of Russia. However, authors of the paper may be affiliated with institutions anywhere in the world.
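Authors who want a quick self-check of the two numeric thresholds before submitting could use something like the sketch below. It is only an approximation of the criteria under stated assumptions: the function names are invented, the test for "inside the priority area" is left to a caller-supplied predicate (a real check would need a proper boundary polygon), and the sketch does not test whether records are actually new to GBIF.org.

```python
def share_in_area(records, in_priority_area):
    """Fraction of records whose coordinates satisfy the given predicate.

    Records without coordinates count against the share.
    """
    if not records:
        return 0.0
    hits = 0
    for record in records:
        lat = record.get("decimalLatitude")
        lon = record.get("decimalLongitude")
        if lat is None or lon is None:
            continue
        if in_priority_area(float(lat), float(lon)):
            hits += 1
    return hits / len(records)


def meets_call_thresholds(records, in_priority_area,
                          min_records=5000, min_share=0.8):
    """True if the dataset is large enough and mostly inside the area."""
    return (len(records) >= min_records
            and share_in_area(records, in_priority_area) >= min_share)


# Toy predicate standing in for a real boundary test (an assumption, not a
# definition of the priority area).
def toy_area(lat, lon):
    return lat > 50.0


toy_records = [
    {"decimalLatitude": "55.7", "decimalLongitude": "37.6"},
    {"decimalLatitude": "59.9", "decimalLongitude": "30.3"},
    {"decimalLatitude": "56.8", "decimalLongitude": "60.6"},
    {"decimalLatitude": "61.0", "decimalLongitude": "69.0"},
    {"decimalLatitude": "41.0", "decimalLongitude": "29.0"},
]

print(share_in_area(toy_records, toy_area))
```

Datasets with fewer records may still be considered case by case, so a False result here is a prompt to prepare a justification, not a verdict.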

***

Check out the Biota of Russia dynamic data paper collection so far.

Follow Biodiversity Data Journal on Twitter and Facebook to keep yourself posted about the new research published.

New DNA barcoding project aims at tracking down the “dark taxa” of Germany’s insect fauna

New dynamic article collection at Biodiversity Data Journal is already accumulating the project’s findings

About 1.4 million species of animals are currently known, but it is generally accepted that this figure grossly underestimates the actual number of species in existence, which likely ranges between five and thirty million species, or even 100 million. 

Meanwhile, a far less well-known fact is that even in countries with a long history of taxonomic research, such as Germany, which is currently known to be inhabited by about 48,000 animal species, there are thousands of insect species still awaiting discovery. In particular, the orders Diptera (flies) and Hymenoptera (especially the parasitoid wasps) are insect groups suspected to contain a strikingly large number of undescribed species. With almost 10,000 known species each, these two insect orders account for approximately two-thirds of Germany’s insect fauna, underlining the importance of these insects in many ways.

The conclusion that there are not only a few, but so many unknown species in Germany is a result of the earlier German Barcode of Life projects: GBOL I and GBOL II, both supported by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung, BMBF) and the Bavarian Ministry of Science under the project Barcoding Fauna Bavarica. 

In its previous phases, GBOL aimed to identify all German species reliably, quickly and inexpensively using DNA barcodes. Since the first project was launched twelve years ago, more than 25,000 German animal species have been barcoded. Among them, the comparatively well-known groups, such as butterflies, moths, beetles, grasshoppers, spiders, bees and wasps, showed an almost complete coverage of the species inventory.

In 2020, another BMBF-funded DNA barcoding project, titled GBOL III: Dark Taxa, was launched, in order to focus on the lesser-known groups of Diptera and parasitoid Hymenoptera, which are often referred to as “dark taxa”. The new project commenced at three major German natural history institutions: the Zoological Research Museum Alexander Koenig (Bonn), the Bavarian State Collection of Zoology (SNSB, Munich) and the State Museum of Natural History Stuttgart, in collaboration with the University of Würzburg and the Entomological Society Krefeld. Together, the project partners are to join efforts and skills to address a range of questions related to the taxonomy of the “dark taxa” in Germany.

As part of the initiative, the project partners are invited to submit their results and outcomes in the dedicated GBOL III: Dark Taxa article collection in the peer-reviewed, open-access Biodiversity Data Journal. There, the contributions will be published dynamically, as soon as approved and ready for publication. The articles will include taxonomic revisions, checklists, data papers, and contributions to methods and protocols employed in DNA barcoding studies, with a focus on the target taxa of the project.

“The collection of articles published in the Biodiversity Data Journal is an excellent approach to achieving the consortium’s goals and project partners are encouraged to take advantage of the journal’s streamlined publication workflows to publish and disseminate data and results that were generated during the project,”

says the collection’s editor Dr Stefan Schmidt of the Bavarian State Collection of Zoology.

***

Find and follow the dynamic article collection GBOL III: Dark Taxa in Biodiversity Data Journal.

Follow Biodiversity Data Journal on Twitter and Facebook.

Pensoft Annotator – a tool for text annotation with ontologies

By Mariya Dimitrova, Georgi Zhelezov, Teodor Georgiev and Lyubomir Penev

The use of written language to record new knowledge is one of the advancements of civilisation that has helped us achieve progress. However, in the era of Big Data, the amount of published writing greatly exceeds the physical ability of humans to read and understand all written information. 

More than ever, we need computers to help us process and manage written knowledge. Unlike humans, computers are “naturally fluent” in many languages, such as the formats of the Semantic Web. These standards were developed by the World Wide Web Consortium (W3C) to enable computers to understand data published on the Internet. As a result, computers can index web content and gather data and metadata about web resources.

To help manage knowledge in different domains, humans have started to develop ontologies: shared conceptualisations of real-world objects, phenomena and abstract concepts, expressed in machine-readable formats. Such ontologies can provide computers with the necessary basic knowledge, or axioms, to help them understand the definitions and relations between resources on the Web. Ontologies outline data concepts, each with its own unique identifier, definition and human-legible label.

Matching data to its underlying ontological model is called ontology population and involves data handling and parsing that gives it additional context and semantics (meaning). Over the past couple of years, Pensoft has been working on an ontology population tool, the Pensoft Annotator, which matches free text to ontological terms.

The Pensoft Annotator is a web application that annotates text entered by the user with any of the available ontologies. Currently, these are the Environment Ontology (ENVO) and the Relation Ontology (RO), but we plan to upload many more. The Annotator can be run with multiple ontologies, and will return a table of matched ontological term identifiers, their labels, as well as the ontology from which they originate (Fig. 1). The results can also be downloaded as a Tab-Separated Value (TSV) file and certain records can be removed from the table of results, if desired. In addition, the Pensoft Annotator allows users to exclude certain words (“stopwords”) from the free text matching algorithm. There is a list of default stopwords, common for the English language, such as prepositions and pronouns, but anyone can add new stopwords.
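To make the general idea concrete, here is a minimal sketch of label matching with stopwords and TSV output. It is not the Annotator's actual algorithm or API: the matching is a plain whole-phrase search, the term identifiers with the "EX:" prefix are placeholders and not real ENVO or RO identifiers, and all function names are invented for the example.

```python
import csv
import io
import re

# Hypothetical mini-ontology: (identifier, label, ontology name).
# The "EX:" identifiers are placeholders, not real ontology term IDs.
ONTOLOGY_TERMS = [
    ("EX:0000001", "host of", "example-relations"),
    ("EX:0000002", "forest", "example-environments"),
    ("EX:0000003", "is", "example-relations"),
]

# A few default stopwords; callers can pass their own set instead.
DEFAULT_STOPWORDS = {"is", "of", "the", "a", "in"}


def annotate(text, terms=ONTOLOGY_TERMS, stopwords=DEFAULT_STOPWORDS):
    """Return (identifier, label, ontology) rows for labels found in text.

    A label is skipped only when the whole label is a stopword, so a
    multi-word label such as "host of" is still matched.
    """
    lowered = text.lower()
    rows = []
    for term_id, label, ontology in terms:
        if label.lower() in stopwords:
            continue
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        if re.search(pattern, lowered):
            rows.append((term_id, label, ontology))
    return rows


def to_tsv(rows):
    """Serialise result rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "label", "ontology"])
    writer.writerows(rows)
    return buffer.getvalue()


matches = annotate("The beetle is a host of the mite in the forest.")
print(to_tsv(matches))
```

Adding a word to the stopword set, for instance `annotate(text, stopwords=DEFAULT_STOPWORDS | {"forest"})`, removes the corresponding single-word matches from the result table.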

Figure 1. Interface of the Pensoft Annotator application

In Figure 1, we have annotated a sentence with the Pensoft Annotator, which yields a single matched term, labeled ‘host of’, from the Relation Ontology (RO). The ontology term identifier is linked to a webpage in Ontobee, which points to additional metadata about the ontology term (Fig. 2).

Figure 2. Web page about ontology term

Such annotation requests can be run to perform text analyses for topic modelling to discover texts which contain host-pathogen interactions. Topic modelling is used to build algorithms for content recommendation (recommender systems) which can be implemented in online news platforms, streaming services, shopping websites and others.

At Pensoft, we use the Pensoft Annotator to enrich biodiversity publications with semantics. We are currently annotating taxonomic treatments with a custom-made ontology based on the Relation Ontology (RO) to discover treatments potentially describing species interactions. You can read more about using the Annotator to detect biotic interactions in this abstract.

The Pensoft Annotator can also be used programmatically through an API, allowing you to integrate the Annotator into your own script. For more information about using the Pensoft Annotator, please check out the documentation.

Data checking for biodiversity collections and other biodiversity data compilers from Pensoft

Guest blog post by Dr Robert Mesibov

Proofreading the text of scientific papers isn’t hard, although it can be tedious. Are all the words spelled correctly? Is all the punctuation correct and in the right place? Is the writing clear and concise, with correct grammar? Are all the cited references listed in the References section, and vice-versa? Are the figure and table citations correct?

Proofreading of text is usually done first by the reviewers, and then finished by the editors and copy editors employed by scientific publishers. A similar kind of proofreading is also done with the small tables of data found in scientific papers, mainly by reviewers familiar with the management and analysis of the data concerned.

But what about proofreading the big volumes of data that are common in biodiversity informatics? Tables with tens or hundreds of thousands of rows and dozens of columns? Who does the proofreading?

Sadly, the answer is usually “No one”. Proofreading large amounts of data isn’t easy and requires special skills and digital tools. The people who compile biodiversity data often lack the skills, the software or the time to properly check what they’ve compiled.

The result is that a great deal of the data made available through biodiversity projects like GBIF is — to be charitable — “messy”. Biodiversity data often needs a lot of patient cleaning by end-users before it’s ready for analysis. To assist end-users, GBIF and other aggregators attach “flags” to each record in the database where an automated check has found a problem. These checks find the most obvious problems amongst the many possible data compilation errors. End-users often have much more work to do after the flags have been dealt with.

In 2017, Pensoft employed a data specialist to proofread the online datasets that are referenced in manuscripts submitted to Pensoft’s journals as data papers. The results of the data-checking are sent to the data paper’s authors, who then edit the datasets. This process has substantially improved many datasets (including those already made available through GBIF) and made them more suitable for digital re-use. At blog publication time, more than 200 datasets have been checked in this way.

Note that a Pensoft data audit does not check the accuracy of the data, for example, whether the authority for a species name is correct, or whether the latitude/longitude for a collecting locality agrees with the verbal description of that locality. For a more or less complete list of what does get checked, see the Data checklist at the bottom of this blog post. These checks are aimed at ensuring that datasets are correctly organised, consistently formatted and easy to move from one digital application to another. The next reader of a digital dataset is likely to be a computer program, not a human. It is essential that the data are structured and formatted, so that they are easily processed by that program and by other programs in the pipeline between the data compiler and the next human user of the data.

Pensoft’s data-checking workflow was previously offered only to authors of data paper manuscripts. It is now available to data compilers generally, with three levels of service:

  • Basic: the compiler gets a detailed report on what needs fixing
  • Standard: minor problems are fixed in the dataset and reported
  • Premium: all detected problems are fixed in collaboration with the data compiler and a report is provided

Because datasets vary so much in size and content, it is not possible to set a price in advance for basic, standard and premium data-checking. To get a quote for a dataset, send an email with a small sample of the data to publishing@pensoft.net.


Data checklist

Minor problems:

  • dataset not UTF-8 encoded
  • blank or broken records
  • characters other than letters, numbers, punctuation and plain whitespace
  • more than one version of the same character (only the simplest or most correct one should be used)
  • unnecessary whitespace
  • Windows carriage returns (retained if required)
  • encoding errors (e.g. “Dum?ril” instead of “Duméril”)
  • missing data with a variety of representations (blank, “-”, “NA”, “?” etc)

Major problems:

  • unintended shifts of data items between fields
  • incorrect or inconsistent formatting of data items (e.g. dates)
  • different representations of the same data item (pseudo-duplication)
  • for Darwin Core datasets, incorrect use of Darwin Core fields
  • data items that are invalid or inappropriate for a field
  • data items that should be split between fields
  • data items referring to unexplained entities (e.g. “habitat is type A”)
  • truncated data items
  • disagreements between fields within a record
  • missing, but expected, data items
  • incorrectly associated data items (e.g. two country codes for the same country)
  • duplicate records, or partial duplicate records where not needed
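A few of the minor problems in the checklist are simple enough to detect with a short script. The sketch below is an illustration only and not the workflow used in Pensoft's data audits: it assumes a tab-separated table with a header line, the function name and messages are invented, and it covers just five of the checks (encoding, carriage returns, blank or broken records, unnecessary whitespace, mixed missing-data markers).

```python
# Representations of "no data" to look for (blank, "-", "NA", "?").
MISSING_MARKERS = {"", "-", "NA", "?"}


def check_table(raw_bytes, delimiter="\t"):
    """Return (line_number, message) pairs; line 0 means the whole file."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return [(0, "dataset not UTF-8 encoded")]

    problems = []
    if "\r\n" in text:
        problems.append((0, "Windows carriage returns"))

    lines = text.splitlines()
    if not lines:
        return problems

    # The header line sets the expected number of fields per record.
    n_fields = len(lines[0].split(delimiter))
    markers_seen = set()
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            problems.append((number, "blank record"))
            continue
        fields = line.split(delimiter)
        if len(fields) != n_fields:
            problems.append((number, "broken record (wrong number of fields)"))
        # Leading, trailing or doubled spaces inside any data item.
        if any(v != v.strip() or "  " in v for v in fields):
            problems.append((number, "unnecessary whitespace"))
        markers_seen.update(
            v.strip() for v in fields if v.strip() in MISSING_MARKERS
        )

    if len(markers_seen) > 1:
        problems.append((0, "missing data represented in several ways"))
    return problems


# Small invented sample with several of the problems listed above.
sample = (
    "id\tname\tdate\r\n"
    "1\tDuméril \t2001-05-03\r\n"
    "\r\n"
    "2\tNA\t?\r\n"
    "3\t-\r\n"
).encode("utf-8")

report = check_table(sample)
for line_number, message in report:
    print(line_number, message)
```

The major problems in the list, such as data shifted between fields or disagreements within a record, generally need field-by-field inspection and domain knowledge, which is why a script like this can only be a first pass.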

For details of the methods used, see the author’s online resources:

***

Find more about Pensoft’s data audit workflow provided for data papers submitted to Pensoft journals on Pensoft’s blog.

Special ZooKeys memorial volume open to submissions to commemorate our admirable founding Editor-in-Chief Terry Erwin

In recognition of the love and devotion that Terry expressed for the study of the World’s biodiversity, ZooKeys invites contributions to this memorial issue, titled “Systematic Zoology and Biodiversity Science: A tribute to Terry Erwin (1940-2020)”, covering all subjects falling within the area of systematic zoology.

In tribute to our beloved friend and founding Editor-in-Chief, Dr Terry Erwin, who passed away on 11th May 2020, we are planning a special memorial volume to be published on 11 May 2021, the first anniversary of the day Terry left us. Terry will be remembered by all who knew him for his radiant spirit, charming enthusiasm for carabid beetles and never-ceasing exploration of the world of biodiversity!

In recognition of the love and devotion that Terry expressed for the study of the World’s biodiversity, ZooKeys invites contributions to this memorial issue, titled “Systematic Zoology and Biodiversity Science: A tribute to Terry Erwin (1940-2020)”, on all subjects falling within the area of systematic zoology. Of special interest are papers recognising Terry’s dedication to collection-based research, massive biodiversity surveys and the origin of biodiversity hot spot areas. The special volume will be edited by John Spence, Achille Casale, Thorsten Assmann, James Liebherr and Lyubomir Penev.

Article processing charges (APCs) will be waived for: (1) Contributions to systematic biology and diversity of carabid beetles, (2) Contributions from Terry’s students and (3) Contributions from his colleagues from the Smithsonian Institution. The APC for articles which do not fall in the above categories will be discounted at 30%.

The submission deadline is 31st December 2020.

Contributors are also invited to send memories and photos which shall be published in a special addendum to the volume.

The memorial volume will also include a joint project of Plazi, Pensoft and the Biodiversity Literature Repository aimed at extracting taxonomic data from Terry Erwin’s publications and making it easily accessible to the scientific community.