Call for Expression of Interest for biodiversity data-related scientific projects from BiCIKL

The purpose of this call is to solicit, select and implement four to six biodiversity data-related scientific projects that will make use of the added value services developed by the leading Research Infrastructures that make the BiCIKL project.

The BiCIKL project invites submissions of Expression of Interest (EoI) to the First BiCIKL Open Call for projects.

By opening this call, BiCIKL aims to better understand how it could support scientific questions that arise from across the biodiversity world in the future, while addressing specific scientific or technical biodiversity data challenges presented by the applicants.

We need and want to assess real-world problems and make the best possible use of our data and technical capabilities. This will greatly assist in defining the long-term development goals of the participating Research Infrastructures and improve the way they can technically and operationally work together to deliver greater scientific value.


The BiCIKL project – a Horizon 2020-funded project involving 14 European institutions, representing major global players in biodiversity research and natural history, and coordinated by Pensoft – establishes a European starting community of key research infrastructures, researchers, citizen scientists and other biodiversity and life sciences stakeholders based on open science practices through access to data, tools and services.

Find out more about the Call and submit your Expression of Interest.

***

Follow BiCIKL on Twitter and Facebook.

Join the conversation on Twitter via #BiCIKL_H2020.

Call for data papers describing datasets from Russia to be published in Biodiversity Data Journal

GBIF partners with FinBIF and Pensoft to support publication of new datasets about biodiversity from across Russia

Original post via GBIF

In collaboration with the Finnish Biodiversity Information Facility (FinBIF) and Pensoft Publishers, GBIF has announced a new call for authors to submit and publish data papers on Russia in a special collection of Biodiversity Data Journal (BDJ). The call extends and expands upon a successful effort in 2020 to mobilize data from European Russia.

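Between now and 15 September 2021, the article processing fee (normally €550) will be waived for the first 36 papers, provided that the publications are accepted and meet the following criteria: the data paper describes a dataset with more than 5,000 records that are new to GBIF.org, with high-quality data and metadata, and with geographic coverage in Russia. Each criterion is explained under Definition of terms.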

The manuscript must be prepared in English and submitted in accordance with BDJ’s instructions to authors by 15 September 2021. Late submissions will not be eligible for APC waivers.

Sponsorship is limited to the first 36 accepted submissions meeting these criteria on a first-come, first-served basis. The call for submissions can therefore close prior to the stated deadline of 15 September 2021. Authors may contribute to more than one manuscript, but artificial division of the logically uniform data and data stories, or “salami publishing”, is not allowed.

BDJ will publish a special issue including the selected papers by the end of 2021. The journal is indexed by Web of Science (Impact Factor 1.331), Scopus (CiteScore: 2.1) and listed in РИНЦ / eLibrary.ru.

Non-native speakers should ensure that their English is checked either by native speakers or by professional English-language editors prior to submission. You may credit these individuals as a “Contributor” through the AWT interface. Contributors are not listed as co-authors but can help you improve your manuscripts.

In addition to the BDJ instructions to authors, it is required that datasets referenced from the data paper a) are cited with the dataset’s DOI, b) appear in the paper’s list of references, and c) have “Russia 2021” in Project Data: Title and “N-Eurasia-Russia2021” in Project Data: Identifier in the dataset’s metadata.

Authors should explore the GBIF.org section on data papers and Strategies and guidelines for scholarly publishing of biodiversity data. Manuscripts and datasets will go through a standard peer-review process. When submitting a manuscript to BDJ, authors are requested to select the Biota of Russia collection.

To see an example, view this dataset on GBIF.org and the corresponding data paper published by BDJ.

Questions may be directed either to Dmitry Schigel, GBIF scientific officer, or Yasen Mutafchiev, managing editor of Biodiversity Data Journal.

The 2021 extension of the collection of data papers will be edited by Vladimir Blagoderov, Pedro Cardoso, Ivan Chadin, Nina Filippova, Alexander Sennikov, Alexey Seregin, and Dmitry Schigel.

This project is a continuation of the successful call for data papers from European Russia in 2020. The funded papers are available in the Biota of Russia special collection and the datasets are shown on the project page.

***

Definition of terms

Datasets with more than 5,000 records that are new to GBIF.org

Datasets should contain at least 5,000 records that are new to GBIF.org. While the focus is on additional records for the region, records already published in GBIF may meet the criteria of ‘new’ if they are substantially improved, particularly through the addition of georeferenced locations. Artificial reduction of records from otherwise uniform datasets to the necessary minimum (“salami publishing”) is discouraged and may result in rejection of the manuscript. New submissions describing updates of datasets already presented in earlier published data papers will not be sponsored.

Justification for publishing datasets with fewer records (e.g. sampling-event datasets, sequence-based data, checklists with endemics etc.) will be considered on a case-by-case basis.

Datasets with high-quality data and metadata

Authors should start by publishing a dataset comprised of data and metadata that meet GBIF’s stated data quality requirement. This effort will involve work on an installation of the GBIF Integrated Publishing Toolkit (IPT).

Only when the dataset is prepared should authors then turn to working on the manuscript text. The extended metadata you enter in the IPT while describing your dataset can be converted into a manuscript with a single click of a button in the ARPHA Writing Tool (see also Creation and Publication of Data Papers from Ecological Metadata Language (EML) Metadata). Authors can then complete, edit and submit manuscripts to BDJ for review.

Datasets with geographic coverage in Russia

In correspondence with the funding priorities of this programme, at least 80% of the records in a dataset should have coordinates that fall within the priority area of Russia. However, authors of the paper may be affiliated with institutions anywhere in the world.

***

Check out the Biota of Russia dynamic data paper collection so far.

Follow Biodiversity Data Journal on Twitter and Facebook to keep yourself posted about the new research published.

Pensoft Annotator – a tool for text annotation with ontologies

By Mariya Dimitrova, Georgi Zhelezov, Teodor Georgiev and Lyubomir Penev

The use of written language to record new knowledge is one of the advancements of civilisation that has helped us achieve progress. However, in the era of Big Data, the amount of published writing greatly exceeds the physical ability of humans to read and understand all written information. 

More than ever, we need computers to help us process and manage written knowledge. Unlike humans, computers are “naturally fluent” in many languages, such as the formats of the Semantic Web. These standards were developed by the World Wide Web Consortium (W3C) to enable computers to understand data published on the Internet. As a result, computers can index web content and gather data and metadata about web resources.

To help manage knowledge in different domains, humans have started to develop ontologies: shared conceptualisations of real-world objects, phenomena and abstract concepts, expressed in machine-readable formats. Such ontologies can provide computers with the necessary basic knowledge, or axioms, to help them understand the definitions and relations between resources on the Web. Ontologies outline data concepts, each with its own unique identifier, definition and human-legible label.

Matching data to its underlying ontological model is called ontology population and involves data handling and parsing that gives it additional context and semantics (meaning). Over the past couple of years, Pensoft has been working on an ontology population tool, the Pensoft Annotator, which matches free text to ontological terms.

The Pensoft Annotator is a web application that annotates text entered by the user with any of the available ontologies. Currently, these are the Environment Ontology (ENVO) and the Relation Ontology (RO), but we plan to upload many more. The Annotator can be run with multiple ontologies, and will return a table of matched ontological term identifiers, their labels, as well as the ontology from which they originate (Fig. 1). The results can also be downloaded as a Tab-Separated Value (TSV) file and certain records can be removed from the table of results, if desired. In addition, the Pensoft Annotator allows users to exclude certain words (“stopwords”) from the free text matching algorithm. There is a list of default stopwords that are common in the English language, such as prepositions and pronouns, but anyone can add new stopwords.

Figure 1. Interface of the Pensoft Annotator application
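
To make the idea of matching free text to ontology terms more concrete, here is a minimal Python sketch of dictionary-based annotation with stopwords and TSV output. It is not the Pensoft Annotator's actual algorithm or API: the term table, the placeholder identifiers (EX:0001 and so on), the stopword list and the function names are all hypothetical and serve only as an illustration.

```python
import csv
import io
import re

# Hypothetical mini term table: label -> (identifier, ontology).
# The identifiers are placeholders, not real ontology identifiers.
TERMS = {
    "host of": ("EX:0001", "RO"),
    "forest": ("EX:0002", "ENVO"),
    "in": ("EX:0003", "RO"),
}

# Hypothetical default stopwords (common English function words).
DEFAULT_STOPWORDS = {"in", "of", "the", "a"}


def annotate(text, terms=TERMS, stopwords=DEFAULT_STOPWORDS):
    """Return (identifier, label, ontology) for every label found in text.

    Labels that are themselves stopwords are skipped, so common words
    such as prepositions do not produce matches.
    """
    lowered = text.lower()
    rows = []
    for label, (identifier, ontology) in sorted(terms.items()):
        if label in stopwords:
            continue
        # Match the label as whole words only.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            rows.append((identifier, label, ontology))
    return rows


def to_tsv(rows):
    """Serialise the result table as tab-separated values."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["identifier", "label", "ontology"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = annotate("The bat is a host of a virus found in the forest.")
print(to_tsv(rows))
```

In this sketch the term labelled "in" never matches because it is on the stopword list, which mirrors the purpose of stopwords described above.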

In Figure 1, we have annotated a sentence with the Pensoft Annotator, which yields a single matched term, labeled ‘host of’, from the Relation Ontology (RO). The ontology term identifier is linked to a webpage in Ontobee, which points to additional metadata about the ontology term (Fig. 2).

Figure 2. Web page about ontology term

Such annotation requests can be run to perform text analyses for topic modelling to discover texts which contain host-pathogen interactions. Topic modelling is used to build algorithms for content recommendation (recommender systems) which can be implemented in online news platforms, streaming services, shopping websites and others.

At Pensoft, we use the Pensoft Annotator to enrich biodiversity publications with semantics. We are currently annotating taxonomic treatments with a custom-made ontology based on the Relation Ontology (RO) to discover treatments potentially describing species interactions. You can read more about using the Annotator to detect biotic interactions in this abstract.

The Pensoft Annotator can also be used programmatically through an API, allowing you to integrate the Annotator into your own script. For more information about using the Pensoft Annotator, please check out the documentation.

How to get data from research articles back into the research cycle at no additional cost?

Pensoft’s journals introduce a standard appendix template for primary biodiversity data to provide direct harvesting and conversion to interlinked FAIR data

by Lyubomir Penev, Mariya Dimitrova, Iva Kostadinova, Teodor Georgiev, Donat Agosti, Jorrit Poelen

Linking open data has been far from a “new” or “innovative” concept ever since Tim Berners-Lee published his “5-Star Rating of Linked Open Data (LOD)” in 2006. The real question is how to implement it in practice, especially when most data are still waiting to be liberated from the narratives of more than 2.5 million scholarly articles published annually. We are still far from the dream world of linked and re-usable open data, not least because the inertia in academic publishing practices appears much stronger than the necessary cultural changes.

Already, there are many exciting tools and projects that harvest data from large corpora of published literature, including historical papers, such as PubMedCentral in biomedicine or Biodiversity Heritage Library in biodiversity science. Yet, finding data elements within the text of these corpora and linking data to external resources, even with the help of AI tools, is still in its infancy and is presently only half way there.

Data should not only be extracted; they should also be semantically enriched and linked to their original resources (e.g. accession numbers for sequences need to be linked to GenBank), to each other and to data from other domains. Only then can the data be made FAIR: Findable, Accessible, Interoperable and Re-usable. There are already research infrastructures that provide extraction, liberation and semantic enrichment of data from the published narratives, for example, the Biodiversity Literature Repository, established at Zenodo by the digitisation company Plazi and the science publisher and technology provider Pensoft.

Quick access to high-quality Linked Open Data can become vitally important in cases like the current COVID-19 pandemic, when scientists need re-usable data from different research fields to come up with healthcare solutions. To complete the puzzle, they need data related to the taxonomy and biology of viruses, but also data taken from their potential hosts and vectors in the animal world, like bats or pangolins. Therefore, what could publishers do to facilitate the re-usability and interoperability of data they publish? 

In a recently published paper by Patterson et al. (2020) on the phylogenetics of Old World Leaf-nosed bats in the journal ZooKeys, the authors and the publisher worked together to present the data on the studied voucher specimens of bats in an Appendix table, where each row represents a set of valuable links between the different data related to a specimen (see Fig. 1). 

Fig. 1. Screenshot of the Appendix table with data on 324 specimens of bats (Patterson et al. 2020).


Specimens in natural history collections, for instance, have so-called human-readable Specimen codes. For example, FMNH 221308 translates to a specimen with Catalogue No 221308, which is preserved in the collection of the Field Museum of Natural History Chicago (FMNH). When added to a collection, such voucher specimens are also assigned Globally Unique Identifiers (GUIDs). For example, the GUID of the above-mentioned specimen looks like this:

25634cae-5a0c-490b-b380-9cabe456316a 

and is available from the Global Biodiversity Information Facility (GBIF) under Original Occurrence ID (Fig. 2), from where computer algorithms can locate various types of data associated with the GUID of a particular specimen, regardless of where these data are stored. Examples of data types and relevant repositories, besides the occurrence record of the specimen available from GBIF, are specimen data stored at the US-based natural history collection network iDigBio, the specimen’s genetic sequences at GenBank, and images or sound recordings stored in other third-party databases (e.g. MorphoSource, BioAcoustica).
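
As a rough illustration of how a script might go from a human-readable Specimen code to a GBIF lookup, the Python sketch below splits the code, checks that a string has the shape of a GUID and builds a query URL for GBIF's occurrence search API. The helper names are hypothetical, and it is an assumption that the specimen is indexed in GBIF under exactly this institution code and catalogue number.

```python
import uuid
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint.
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def split_specimen_code(code):
    """Split a code such as 'FMNH 221308' into institution code and catalogue number."""
    institution, _, catalogue = code.strip().partition(" ")
    if not institution or not catalogue:
        raise ValueError(f"unexpected specimen code: {code!r}")
    return institution, catalogue.strip()


def is_guid(value):
    """Return True if value parses as a UUID, like the GUID quoted above."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def occurrence_search_url(code):
    """Build a GBIF occurrence search URL for a specimen code (assumed mapping)."""
    institution, catalogue = split_specimen_code(code)
    query = urlencode({"institutionCode": institution, "catalogNumber": catalogue})
    return f"{GBIF_OCCURRENCE_SEARCH}?{query}"


print(occurrence_search_url("FMNH 221308"))
print(is_guid("25634cae-5a0c-490b-b380-9cabe456316a"))
```

Fetching the printed URL (for example with urllib.request) should return JSON in which each matching record may carry an occurrenceID field. How reliably that field is filled in depends on the data publisher.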

The complex digital environment of various information linked to the globally unique identifier of a physical specimen in a collection together constitutes its “openDS digital specimen” representation, recently formulated within the EU project ICEDIG. Nevertheless, this complex linking could occur more easily and at a modest cost if only the GUIDs were always submitted to the respective data repositories together with the data about that particular specimen. Unfortunately, this is too rarely the case, hence we have to look for other ways to link these fragmented data.

Fig. 2. The representation of the specimen FMNH 221308 on GBIF. The Global Unique Identifier (GUID) of the specimen is shown in the Original Occurrence ID field.

Next to the Specimen code in the table (Fig. 1), there are one or more columns containing accession numbers of different gene sequences from that specimen, linked to their entries in GenBank. There is also a column for the species names associated with the specimens, linked through the Pensoft Taxon Profile (PTP) tool to several trusted international resources in whose data holdings each name appears, such as GBIF, GenBank, Biodiversity Heritage Library, PubMedCentral and many more (see example for the bat species Hipposideros ater). The next column contains the country where the specimen has been collected. The last columns contain the geographic coordinates of the collecting locality.

The structure of such a specimen-based table is not fixed and can also have several other data elements, for example, resolvable persistent identifiers for the deposition of micro-CT or other images of the specimen at a repository (e.g. MorphoSource) or of a tissue sample from where a virus has been isolated (see the sample table template below).

So far, so good, but what would the true value of those interlinked data be, besides that a reader could easily click on to a linked data item and see immediately more information about that particular element? What other missing links can we include to bring extra value to the data, so that these can be put together and re-used by the research community? Moreover, from where do we take these missing links?

The missing links are present in the table rows!

Firstly, if we open the GBIF record for the specimen in question (FMNH 221308), we see a lot of additional information there (Fig. 2), which can be read by humans and retrieved by computers through GBIF’s Application Programming Interface (API). However, the link to the GenBank accession number KT583829 of the cyt-b gene sequenced from that specimen is missing, probably because, at the time of deposition of this specimen data in GBIF, its sequences had not yet been submitted to GenBank.

Now, we would probably wish to determine the specimen from which a particular gene has been sequenced and deposited in GenBank, and where this specimen is preserved. We can easily click on any accession number in the table but, again, while we find a lot of useful information about the gene, for example, about the methods of sequencing, its taxon name etc., the voucher specimen’s GUID is actually missing (see KT583829 accession number of the specimen FMNH 221308, Fig. 3). How could we then locate the GUID of that specimen and the additional information linked to it? By publishing all this information in the Appendix in the way described here, we can easily locate this missing link between the specimen’s GUID and its sequence, either “by hand” or through API calls made by computers.

Fig. 3. GenBank record for the accession number KT583829 of the voucher specimen FMNH 221308. The GUID for the voucher specimen is not present in the record. 
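
Although the GUID is absent from the GenBank record, GenBank flat files can carry the human-readable specimen code in a /specimen_voucher qualifier, which a script could use as a bridge back to the specimen. The Python sketch below builds an NCBI E-utilities URL for an accession number and extracts that qualifier from flat-file text. The function names are hypothetical, and the sample text is a hand-written excerpt for illustration, not a copy of the real KT583829 record.

```python
import re
from urllib.parse import urlencode

# NCBI E-utilities endpoint for fetching full records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession):
    """Build an efetch URL that returns a nucleotide record as a GenBank flat file."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def specimen_voucher(flatfile_text):
    """Return the value of the /specimen_voucher qualifier, or None if absent."""
    match = re.search(r'/specimen_voucher="([^"]+)"', flatfile_text)
    return match.group(1) if match else None


# Hand-written excerpt in GenBank flat-file style (illustrative only).
SAMPLE = '''
     source          1..100
                     /organism="Genus species"
                     /specimen_voucher="FMNH 221308"
'''

# One row of the "missing link": accession number -> specimen code.
link = {"accession": "KT583829", "specimen_code": specimen_voucher(SAMPLE)}
print(genbank_flatfile_url(link["accession"]))
print(link)
```

The specimen code recovered this way can then be resolved to a GUID through GBIF, which is the link that the Appendix table records explicitly.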


While biodiversity researchers are used to working with taxon names, these names are far from being stable entities. Names can change over time, several different names can be associated with the same “thing” (synonyms), or identical names (homonyms) may be used for different “things”. The biodiversity community needs to resolve this problem by agreeing in the future Catalogue of Life on taxon names that are unambiguously identified with GUIDs through their taxon concepts (the content behind each name, according to a particular author who has already used that name in a publication; for example, Hipposideros vittatus (Peters, 1852) as used in the work of Patterson et al. (2020)). Here comes another missing link that the table could provide: the link between the specimen, the taxon name to which it belongs and the taxon concept of that name, according to the article in which this name has been used and published.

Now, once we have listed all available linked information about several specimens belonging to a number of different species in a table, we can continue by adding some other important data, such as the biotic interactions between specimens or species. For example, we can name the table we have already constructed “Source specimens/species” and add to it some more columns under the heading “Target specimens/species”. The linking between the two groups of specimens or species in the extended biotic interaction table can be modelled using the OBO Relation Ontology, especially its list of terms, in a drop-down menu provided in the table template. Observed biotic interactions between specimens or species of the type “pathogen of”, “preys on”, “has vector” etc. can then be easily harvested and recorded in the Global Biotic Interactions database GloBI (see example on interactions between specimens).

As a result, we could have a table like the one below, where column names and data elements linked in the rows follow simple but strict rules: 

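Appendix A. Specimen data table template, shown here with one line per column of the template. Each line gives five fields separated by " | ": 1. grouping of specimen/species data (Source, Biotic interactions or Target); 2. data type group (not changeable, linked to the appropriate ontology terms whenever possible); 3. column name (not changeable, linked to the appropriate ontology terms whenever possible); 4. linked to; 5. linked by. A hyphen marks a cell that is empty in the template.

Grouping | Data type group | Column name | Linked to | Linked by
Source specimens/species | Preserved specimen (Specimen code) | Institution Code | GRSciColl | Pensoft
Source specimens/species | Preserved specimen (Specimen code) | Collection Code | GRSciColl | Pensoft
Source specimens/species | Preserved specimen (Specimen code) | Catalogue ID | GBIF, iDigBio, or DiSSCo | Author
Source specimens/species | Associated sequences | Gene #1 | INSDC (GenBank, ENA or DDBJ) | Pensoft
Source specimens/species | Associated sequences | Gene #2 | INSDC (GenBank, ENA or DDBJ) | Pensoft
Source specimens/species | Taxon name/MOTU | - | Pensoft Taxon Profile | Pensoft
Source specimens/species | Other thematic repositories | PID (e.g. images dataset) | Image repository | Author
Source specimens/species | Other thematic repositories | PID (e.g. sound recordings) | - | Author
Source specimens/species | Location | Latitude | Google Maps | Pensoft
Source specimens/species | Location | Longitude | Google Maps | Pensoft
Source specimens/species | Habitat / Environment (after ENVO Ontology) | - | ENVO vocabulary | Pensoft
Biotic interactions (after OBO Relation Ontology) | - | - | OBO term vocabulary | Author
Target specimens/species | Preserved specimen (Specimen code) | Institution Code | GRSciColl | Pensoft
Target specimens/species | Preserved specimen (Specimen code) | Collection Code | GRSciColl | Pensoft
Target specimens/species | Preserved specimen (Specimen code) | Catalogue ID | GBIF, iDigBio, or DiSSCo | Author
Target specimens/species | Associated sequences | Gene #1 | INSDC (GenBank, ENA or DDBJ) | Pensoft
Target specimens/species | Taxon name/MOTU | - | Pensoft Taxon Profile | Pensoft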

(Google spreadsheet format: https://docs.google.com/spreadsheets/d/1AWf75FSHppTifNpmhpvWNgtTJJGu-vFtFudYrhbMOuY/edit#gid=0)

As one can imagine, some columns or cells provided in the table could be empty, as the full completion of this kind of data is rarely possible. For the purposes of a publication, the author can remove all empty columns or add additional columns, for example, for listing more genes or other types of data repository records containing data about a particular specimen. What should not be changed, though, are the column names, because they give the semantic meaning of the data in the table, which allows computers to transform them into machine-readable formats.

At the end of the publishing process, this table is published not only for humans, but also in a markup language called Extensible Markup Language (XML), which makes the data in the table “understandable” for computers. At the moment of publication, tables published in XML contain not only data, but also information about what these data mean (semantics) and how they could be identified. Thanks to these two features, an algorithm can automatically convert the data into another machine-readable language: Resource Description Framework (RDF), which, in turn, makes the data compatible (interoperable) with other data that can be linked together, using any of the identifiers of the data elements in the table. Such converted data are represented as simple statements, called “RDF triples”, and stored in special triple stores, such as OpenBiodiv or Ozymandias, from where knowledge graphs can be created and used further. As an example, one can search and access data that are associated with a particular specimen but deposited at various data repositories. Other research groups might be interested in bringing together all pathogens that have been extracted from particular tissues of specimens belonging to a particular host species within a specific geographical location, and so on.
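
To show what “RDF triples” derived from a table row might look like, here is a minimal Python sketch that serialises one row as N-Triples using Darwin Core term IRIs as predicates. It is an illustration only and not the conversion used by Pensoft or OpenBiodiv: the function name, the dictionary keys, the choice of predicates and the use of a urn:uuid subject are all assumptions.

```python
# Namespace of Darwin Core terms.
DWC = "http://rs.tdwg.org/dwc/terms/"

# Darwin Core terms used as predicates in this sketch (an assumed selection).
TERMS = ("institutionCode", "catalogNumber", "scientificName", "associatedSequences")


def row_to_ntriples(row):
    """Turn one appendix-table row into a list of N-Triples lines.

    The specimen GUID becomes the subject; every filled-in cell
    becomes one triple with a Darwin Core predicate and a literal object.
    """
    subject = f"<urn:uuid:{row['guid']}>"
    lines = []
    for term in TERMS:
        value = row.get(term)
        if not value:
            continue  # empty cells are allowed in the table
        # Escape backslashes and double quotes for the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{DWC}{term}> "{escaped}" .')
    return lines


row = {
    "guid": "25634cae-5a0c-490b-b380-9cabe456316a",
    "institutionCode": "FMNH",
    "catalogNumber": "221308",
    "associatedSequences": "KT583829",
    "scientificName": "",  # left empty on purpose
}

triples = row_to_ntriples(row)
print("\n".join(triples))
```

Because every statement shares the specimen GUID as its subject, triples produced from different articles about the same specimen can be merged in a triple store without further matching.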

Finding and preserving links between the different data elements, for example, between a Specimen, Tissue, Sequence, Taxon name and Location, is by itself a task deserving special attention and investments. How could such bi- and multilateral linking work? Having the table above in place alongside all relevant tools and competences, one can run, for example, the following operations via scripts and APIs:

  1. Locate the GUID for Specimen Code at GBIF (= OccurrenceID).
  2. Look up sequence data associated with that GUID at GenBank.
  3. Represent the link between the GUID and Sequence accession numbers in a series of RDF triples.
  4. Link and express in RDF the presentation of the specimen on GBIF with the article where it has been published.
  5. Automatically inform institutions/collections about published materials containing data on their holdings (specimens, authors, publications, links to other research infrastructures, etc.).

Semantic representation of data found in such an Appendix Specimen Data Table allows the utilisation of the Linked Open Data model to map and link several data elements to each other, including the provenance record, that is the original source (article) from where these links have been extracted (Fig. 4). 

Fig. 4. Example of a semantic representation between some of the data elements from the Appendix Specimen Data Table. The proposed schema for mapping these elements uses mostly Darwin Core terms to maintain interoperability across different platforms. The link between the specimen GUID, GBIF occurrence, GenBank sequence and scientific name is marked in red.

At the very end, we will be able to construct a new “virtual super-table” of semantic links between the data elements associated with a specimen, which, in the ideal case, would provide the fully-linked information on data and metadata along and across the lines:

Species A: Specimen <> Tissue sample <> Sequence <> Location <> Taxon name <> Habitat <> Publication source 

↑↓

Species B: Specimen <> Tissue sample <> Sequence <> Location <> Taxon name <> Habitat <> Publication source

Retrieving such additional information, for example, about an occurrence from GBIF or sequence information from GenBank through APIs and linking these pieces of information together in one dataset opens new possibilities for data discovery and re-use, as well as for the reproducibility of the research results.

An example of how data from different resources could be put and represented together is the visualisation of the host-parasite interactions between species, such as those between bats and coronaviruses, indexed by Global Biotic Interactions (GloBI) (Fig. 5). Various other interactions, such as pollination, feeding, co-existence and others, are stored in GloBI’s database, which is also available in the form of a Linked Open Dataset, openly accessible through files or through a SPARQL endpoint.

Fig. 5. Visualisation resulting from querying biotic interactions existing between a bat species from order Chiroptera (Plecotus auritus) and bat coronavirus.

The technology of Linked Open Data is already widely used across many fields, so data scientists will not be tremendously impressed by the fact that all of the above is possible. The problem is how to get there. One of the most obvious ways seems to be for publishers to start publishing data in a standard, community-agreed format so that these can easily be handled by machines with little or no human intervention. Will they do that? Some will, but until it becomes routine practice, most of the published data, i.e. high-quality, peer-reviewed data vetted by the act of publishing, will remain hardly accessible, hence unusable.

This pilot was elaborated as a use case published as the first article in a free-to-publish special issue on the biology of bats and pangolins as potential vectors for coronaviruses in the journal ZooKeys. An additional benefit of the use case is the digitisation of, and liberation of data from, many articles on bats listed in the bibliography of the Patterson et al. article, carried out by Plazi. The use case is also a contribution to the recently opened COVID-19 Joint Task Force of the Consortium of European Taxonomic Facilities (CETAF), the Distributed System for Scientific Collections (DiSSCo) and the Integrated Digitized Biocollections (iDigBio).

To facilitate the quick adoption of the improved data table standards, Pensoft invites all who would like to test and see how their data are distributed and re-used after publication to submit manuscripts containing specimen data and biotic interaction tables, following the standard described above. The authors would be provided with a template table for completion of all fields relevant to their study while conforming to the standard used by Pensoft.

This initiative was supported in part by the IGNITE project.

Information: 

Pensoft Publishers

Field Museum of Natural History Chicago

References:

Patterson BD, Webala PW, Lavery TH, Agwanda BR, Goodman SM, Kerbis Peterhans JC, Demos TC (2020) Evolutionary relationships and population genetics of the Afrotropical leaf-nosed bats (Chiroptera, Hipposideridae). ZooKeys 929: 117-161. https://doi.org/10.3897/zookeys.929.50240

Poelen JH, Simons JD, Mungall CJ (2014) Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2014.08.005

Hardisty A, Ma K, Nelson G, Fortes J (2019) ‘openDS’ – A New Standard for Digital Specimens and Other Natural Science Digital Object Types. Biodiversity Information Science and Standards 3: e37033. https://doi.org/10.3897/biss.3.37033