Pensoft’s journals introduce a standard appendix template for primary biodiversity data to provide direct harvesting and conversion to interlinked FAIR data
by Lyubomir Penev, Mariya Dimitrova, Iva Kostadinova, Teodor Georgiev, Donat Agosti, Jorrit Poelen
Linking open data has been far from a “new” or “innovative” concept ever since Tim Berners-Lee published his “5-Star Rating of Linked Open Data (LOD)” in 2006. The real question is how to implement it in practice, especially when most data are still waiting to be liberated from the narratives of more than 2.5 million scholarly articles published annually. We are still far from the dream world of linked and re-usable open data, not least because the inertia in academic publishing practices appears much stronger than the necessary cultural changes.
Already, there are many exciting tools and projects that harvest data from large corpora of published literature, including historical papers, such as PubMed Central in biomedicine or the Biodiversity Heritage Library in biodiversity science. Yet, finding data elements within the text of these corpora and linking them to external resources, even with the help of AI tools, is still in its infancy and, at best, only halfway there.
Data should not only be extracted; they should also be semantically enriched and linked to their original resources (e.g. accession numbers for sequences need to be linked to GenBank), to each other and to data from other domains. Only then can the data be made FAIR: Findable, Accessible, Interoperable and Re-usable. Research infrastructures that extract, liberate and semantically enrich data from published narratives already exist, for example, the Biodiversity Literature Repository, established at Zenodo by the digitisation company Plazi and the science publisher and technology provider Pensoft.
Quick access to high-quality Linked Open Data can become vitally important in cases like the current COVID-19 pandemic, when scientists need re-usable data from different research fields to come up with healthcare solutions. To complete the puzzle, they need data related to the taxonomy and biology of viruses, but also data taken from their potential hosts and vectors in the animal world, like bats or pangolins. Therefore, what could publishers do to facilitate the re-usability and interoperability of data they publish?
In a recently published paper by Patterson et al. (2020) on the phylogenetics of Old World Leaf-nosed bats in the journal ZooKeys, the authors and the publisher worked together to present the data on the studied voucher specimens of bats in an Appendix table, where each row represents a set of valuable links between the different data related to a specimen (see Fig. 1).
Fig. 1. Screenshot of the Appendix table with data on 324 specimens of bats (Patterson et al. 2020).
Specimens in natural history collections, for instance, have their so-called human-readable Specimen codes, for example, FMNH 221308 translates to a specimen with Catalogue No 221308, which is preserved in the collection of the Field Museum of Natural History Chicago (FMNH). When added to a collection, such voucher specimens are also assigned Globally Unique Identifiers (GUIDs). For example, the GUID of the above-mentioned specimen looks like this:
and is available from the Global Biodiversity Information Facility (GBIF) under Original Occurrence ID (Fig. 2), from where computer algorithms can locate various types of data associated with the GUID of a particular specimen, regardless of where these data are stored. Examples of data types and relevant repositories, besides the occurrence record of the specimen available from GBIF, are specimen data stored at the US-based natural history collection network iDigBio, the specimen’s genetic sequences at GenBank, and images or sound recordings stored in third-party databases (e.g. MorphoSource, BioAcoustica).
The complex digital environment of various information linked to the globally unique identifier of a physical specimen in a collection together constitutes its “openDS digital specimen” representation, recently formulated within the EU project ICEDIG. Nevertheless, this complex linking could occur more easily and at a modest cost if only the GUIDs were always submitted to the respective data repositories together with the data about that particular specimen. Unfortunately, this is too rarely the case, hence we have to look for other ways to link these fragmented data.
Fig. 2. The representation of the specimen FMNH 221308 on GBIF. The Global Unique Identifier (GUID) of the specimen is shown in the Original Occurrence ID field.
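The split of a human-readable Specimen code into its parts can be sketched in a few lines of Python. The function name `parse_specimen_code` is hypothetical, and the pattern assumes that a code is always an upper-case institution acronym followed by a catalogue number; real collections use more varied patterns, so this is only an illustration.

```python
import re
from typing import NamedTuple


class SpecimenCode(NamedTuple):
    institution_code: str
    catalogue_number: str


# Assumed pattern: upper-case institution acronym, whitespace, catalogue number.
_CODE_PATTERN = re.compile(r"^\s*([A-Z]+)\s+(\S+)\s*$")


def parse_specimen_code(code: str) -> SpecimenCode:
    """Split a code such as 'FMNH 221308' into its two parts."""
    match = _CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"Unrecognised specimen code: {code!r}")
    return SpecimenCode(match.group(1), match.group(2))


if __name__ == "__main__":
    print(parse_specimen_code("FMNH 221308"))
```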
Next to the Specimen code in the table (Fig. 1), there are one or more columns containing accession numbers of different gene sequences from that specimen, linked to their entries in GenBank. There is also a column for the species names associated with the specimens, linked through the Pensoft Taxon Profile (PTP) tool to several trusted international resources in whose data holdings each name appears, such as GBIF, GenBank, Biodiversity Heritage Library, PubMed Central and many more (see example for the bat species Hipposideros ater). The next column contains the country where the specimen was collected. The last columns contain the geographic coordinates of the collecting locality.
The structure of such a specimen-based table is not fixed and can also include several other data elements, for example, resolvable persistent identifiers for micro-CT or other images of the specimen deposited at a repository (e.g. MorphoSource) or for a tissue sample from which a virus has been isolated (see the sample table template below).
So far, so good, but what would the true value of those interlinked data be, besides that a reader could easily click on to a linked data item and see immediately more information about that particular element? What other missing links can we include to bring extra value to the data, so that these can be put together and re-used by the research community? Moreover, from where do we take these missing links?
The missing links are present in the table rows!
Firstly, if we open the GBIF record for the specimen in question (FMNH 221308), we see a lot of additional information there (Fig. 2), which can be read by humans and retrieved by computers through GBIF’s Application Programming Interface (API). However, the link to the GenBank accession number KT583829 of the cyt-b gene sequenced from that specimen is missing, probably because, at the time of deposition of this specimen data in GBIF, its sequences had not yet been submitted to GenBank.
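As a rough sketch of such a retrieval, the Python snippet below builds a query against GBIF’s public occurrence search endpoint and pulls the `occurrenceID` out of the JSON reply. The helper names are hypothetical, and the snippet assumes that filtering by `institutionCode` and `catalogNumber` is enough to single out the record, which may not hold for every collection.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def gbif_search_url(institution_code: str, catalog_number: str) -> str:
    """Build the search URL for one specimen."""
    query = urlencode(
        {"institutionCode": institution_code, "catalogNumber": catalog_number}
    )
    return f"{GBIF_OCCURRENCE_SEARCH}?{query}"


def first_occurrence_id(response: dict) -> Optional[str]:
    """Return the first non-empty occurrenceID in a search reply, if any."""
    for record in response.get("results", []):
        occurrence_id = record.get("occurrenceID")
        if occurrence_id:
            return occurrence_id
    return None


def fetch_occurrence_id(
    institution_code: str, catalog_number: str, timeout: float = 30.0
) -> Optional[str]:
    """Query GBIF over the network (not executed on import)."""
    url = gbif_search_url(institution_code, catalog_number)
    with urlopen(url, timeout=timeout) as reply:
        return first_occurrence_id(json.load(reply))
```

Calling `fetch_occurrence_id("FMNH", "221308")` would issue the request; the two pure functions can be checked without network access.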
Now, we would probably wish to determine the specimen from which a particular gene has been sequenced and deposited in GenBank, and where this specimen is preserved. We can easily click on any accession number in the table but, again, while we find a lot of useful information about the gene, for example, about the methods of sequencing, its taxon name etc., the voucher specimen’s GUID is actually missing (see KT583829 accession number of the specimen FMNH 221308, Fig. 3). How could we then locate the GUID of that specimen and the additional information linked to it? By publishing all this information in the Appendix in the way described here, we can easily locate this missing link between the specimen’s GUID and its sequence, either “by hand” or through API calls made by computer programs.
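One way to close this gap programmatically is sketched below in Python. It assumes that the GenBank flat file carries the human-readable voucher code in a `/specimen_voucher` qualifier and that the Appendix table supplies the mapping from Specimen code to GUID. The sample fragment and the placeholder GUID are invented for illustration and do not reproduce the actual record.

```python
import re
from typing import Dict, Optional
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

_VOUCHER = re.compile(r'/specimen_voucher="([^"]+)"')


def genbank_flatfile_url(accession: str) -> str:
    """URL of the GenBank flat file for an accession (NCBI E-utilities)."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def specimen_voucher(flatfile_text: str) -> Optional[str]:
    """Extract the voucher code from flat-file text, if present."""
    match = _VOUCHER.search(flatfile_text)
    return match.group(1) if match else None


def guid_for_sequence(
    flatfile_text: str, guid_by_specimen_code: Dict[str, str]
) -> Optional[str]:
    """Use the table's Specimen code -> GUID mapping to find the missing GUID."""
    voucher = specimen_voucher(flatfile_text)
    return guid_by_specimen_code.get(voucher) if voucher else None


# Invented fragment, for illustration only.
SAMPLE_FLATFILE = '''
     source          1..100
                     /specimen_voucher="FMNH 221308"
'''

if __name__ == "__main__":
    table = {"FMNH 221308": "urn:example:placeholder-guid"}  # placeholder GUID
    print(guid_for_sequence(SAMPLE_FLATFILE, table))
```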
While biodiversity researchers are used to working with taxon names, these names are far from being stable entities. Names can change over time, several different names can be associated with the same “thing” (synonyms), or identical names (homonyms) may be used for different “things”. The biodiversity community needs to resolve this problem by agreeing in the future Catalogue of Life on taxon names that are unambiguously identified with GUIDs through their taxon concepts (the content behind each name, according to a particular author who has already used that name in a publication; for example, Hipposideros vittatus (Peters, 1852) as used in the work of Patterson et al. (2020)). Here comes another missing link that the table could provide – the link between the specimen, the taxon name to which it belongs and the taxon concept of that name, according to the article in which this name has been used and published.
Now, once we have listed all available linked information about several specimens belonging to a number of different species in a table, we can continue by adding some other important data, such as the biotic interactions between specimens or species. For example, we can name the table we have already constructed “Source specimens/species” and add to it some more columns under the heading “Target specimens/species”. The linking between the two groups of specimens or species in the extended biotic interaction table can be modelled using the OBO Relations Ontology, especially its list of terms, in a drop-down menu provided in the table template. Observed biotic interactions between specimens or species of the type “pathogen of”, “preys on”, “has vector” etc. can then be easily harvested and recorded in the Global Biotic Interactions database GloBI (see example on interactions between specimens).
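A single row of such an extended table can be thought of as a source, an interaction term and a target. The Python sketch below is only an illustration: the class name, the placeholder specimen labels and the short list of allowed terms (the three quoted above) are assumptions, whereas a real template would offer the full Relation Ontology vocabulary.

```python
from dataclasses import dataclass

# Only the labels quoted in the text; the real drop-down list is longer.
ALLOWED_INTERACTIONS = frozenset({"pathogen of", "preys on", "has vector"})


@dataclass(frozen=True)
class Interaction:
    source: str
    interaction_type: str
    target: str

    def __post_init__(self) -> None:
        # Mimic the drop-down menu: reject free-text interaction types.
        if self.interaction_type not in ALLOWED_INTERACTIONS:
            raise ValueError(f"Unknown interaction type: {self.interaction_type!r}")


def as_statement(interaction: Interaction) -> str:
    """Render the row as a readable source/relation/target statement."""
    return (
        f"{interaction.source} -- {interaction.interaction_type} "
        f"--> {interaction.target}"
    )


if __name__ == "__main__":
    print(as_statement(Interaction("specimen A", "pathogen of", "specimen B")))
```

Restricting the interaction type to a controlled list is what makes the rows harvestable without manual interpretation.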
As a result, we could have a table like the one below, where column names and data elements linked in the rows follow simple but strict rules:
Appendix A. Specimen data table. Legend: 1 – Two groupings of specimen/species data (Source and Target); 2 – Data type groups – not changeable, linked to the appropriate ontology terms, whenever possible; 3 – Column names – not changeable, linked to the appropriate ontology terms, whenever possible; 4 – Linked to; 5 – Linked by.
|1||Source specimens/species||Biotic interactions (after OBO Relation Ontology)||Target specimens/species|
|2||Preserved specimen (Specimen code)||Associated sequences||Taxon name/MOTU||Other thematic repositories||Location||Habitat / Environment (after ENVO Ontology)||Preserved specimen (Specimen code)||Associated sequences||Taxon name/MOTU|
|3||Institution Code||Collection Code||Catalogue ID||Gene #1||Gene #2|| ||PID (e.g. images dataset)||PID (e.g. sound recordings)||Latitude||Longitude|| ||Institution Code||Collection Code||Catalogue ID||Gene #1|| |
|4||GRSciColl||GRSciColl||GBIF, iDigBio, or DiSSCo||INSDC (GenBank, ENA or DDBJ)||INSDC (GenBank, ENA or DDBJ)||Pensoft Taxon Profile||Image repository|| ||Google Maps||Google Maps||ENVO vocabulary||OBO term vocabulary||GRSciColl||GRSciColl||GBIF, iDigBio, or DiSSCo||INSDC (GenBank, ENA or DDBJ)||Pensoft Taxon Profile|
(Google spreadsheet format: https://docs.google.com/spreadsheets/d/1AWf75FSHppTifNpmhpvWNgtTJJGu-vFtFudYrhbMOuY/edit#gid=0)
As one can imagine, some columns or cells provided in the table could be empty, as the full completion of this kind of data is rarely possible. For the purposes of a publication, the author can remove all empty columns or add additional columns, for example, for listing more genes or other types of data repository records containing data about a particular specimen. What should not be changed, though, are the column names, because they give the semantic meaning of the data in the table, which allows computers to transform them into machine-readable formats.
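The rule that columns may be removed or added but not renamed can be checked mechanically. The Python sketch below is a simplified illustration and not Pensoft’s actual validation: it uses only a subset of the template’s column names, and it assumes that additional gene columns follow the “Gene #N” pattern.

```python
import re
from typing import Dict, List

# Subset of the template's fixed column names, for illustration.
TEMPLATE_COLUMNS = {
    "Institution Code",
    "Collection Code",
    "Catalogue ID",
    "Latitude",
    "Longitude",
}
# Assumption: extra gene columns are named "Gene #1", "Gene #2", ...
GENE_COLUMN = re.compile(r"^Gene #\d+$")


def unknown_columns(header: List[str]) -> List[str]:
    """Return the column names that do not match the template."""
    return [
        name
        for name in header
        if name not in TEMPLATE_COLUMNS and not GENE_COLUMN.match(name)
    ]


def drop_empty_columns(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Remove columns that are empty in every row, keeping column order."""
    if not rows:
        return []
    keep = [
        name for name in rows[0] if any(row.get(name, "").strip() for row in rows)
    ]
    return [{name: row.get(name, "") for name in keep} for row in rows]


if __name__ == "__main__":
    print(unknown_columns(["Institution Code", "Gene #3", "Cat. No"]))
```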
At the end of the publishing process, this table is published, not only for humans, but also in a markup language, called Extensible Markup Language (XML), which makes the data in the table “understandable” for computers. At the moment of publication, tables published in XML contain not only data, but also information about what these data mean (semantics) and how they could be identified. Thanks to these two features, an algorithm can automatically convert the data into another machine-readable language: Resource Description Framework (RDF), which, in turn, makes the data compatible (interoperable) with other data that can be linked together, using any of the identifiers of the data elements in the table. Such converted data are represented as simple statements, called “RDF triples”, and stored in special triple stores, such as OpenBiodiv or Ozymandias, from where knowledge graphs can be created and used further. As an example, one can search and access data that are associated with a particular specimen, but deposited at various data repositories. Other research groups might be interested in bringing together all pathogens that have been extracted from particular tissues of specimens belonging to a particular host species within a specific geographical location, and so on.
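The conversion from a semantically tagged XML table to RDF triples can be illustrated with a minimal Python sketch. The XML layout below (a `row` element carrying a `guid` attribute and `cell` elements carrying a `term` attribute) is invented for this example and is not the schema used in the published articles; the predicates are Darwin Core terms, and the output is serialised as N-Triples.

```python
import xml.etree.ElementTree as ET
from typing import List

DWC = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core term namespace

# Hypothetical XML layout with a placeholder GUID.
SAMPLE_XML = """
<table>
  <row guid="urn:example:specimen-1">
    <cell term="institutionCode">FMNH</cell>
    <cell term="catalogNumber">221308</cell>
    <cell term="associatedSequences">KT583829</cell>
    <cell term="scientificName"></cell>
  </row>
</table>
"""


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def table_to_ntriples(xml_text: str) -> List[str]:
    """Emit one triple per non-empty cell: <row GUID> <dwc term> "value" ."""
    triples = []
    for row in ET.fromstring(xml_text).iter("row"):
        subject = row.get("guid")
        for cell in row.iter("cell"):
            term = cell.get("term")
            value = (cell.text or "").strip()
            if subject and term and value:  # empty cells are skipped
                predicate = DWC + term
                triples.append(
                    f'<{subject}> <{predicate}> "{escape_literal(value)}" .'
                )
    return triples


if __name__ == "__main__":
    print("\n".join(table_to_ntriples(SAMPLE_XML)))
```

Because every statement shares the specimen GUID as its subject, triples produced from different articles or repositories can later be merged on that identifier.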
Finding and preserving links between the different data elements, for example, between a Specimen, Tissue, Sequence, Taxon name and Location, is by itself a task deserving special attention and investments. How could such bi- and multilateral linking work? Having the table above in place alongside all relevant tools and competences, one can run, for example, the following operations via scripts and APIs:
- Locate the GUID for Specimen Code at GBIF (= OccurrenceID)
- Look up sequence data associated with that GUID at GenBank
- Represent the link between the GUID and Sequence accession numbers in a series of RDF triples
- Link and express in RDF the presentation of the specimen on GBIF with the article where it has been published.
- Automatically inform institutions/collections for published materials containing data on their holdings (specimens, authors, publications, links to other research infrastructures, etc.).
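The first four operations can be sketched as a small Python function that takes one table row and returns RDF triples in N-Triples form. This is an assumption-laden illustration: the GUID lookup is passed in as a function so that it can be a GBIF query in practice or a stub here, the GUID shown is a placeholder, and the choice of the Darwin Core terms `associatedSequences` and `associatedReferences` is one possible mapping rather than the schema shown in Fig. 4. Notifying institutions is not covered.

```python
from typing import Callable, Iterable, List, Optional

DWC = "http://rs.tdwg.org/dwc/terms/"
NUCCORE = "https://www.ncbi.nlm.nih.gov/nuccore/"


def link_row(
    specimen_code: str,
    accessions: Iterable[str],
    article_doi: str,
    resolve_guid: Callable[[str], Optional[str]],
) -> List[str]:
    """Link a specimen GUID to its sequences and to the source article."""
    guid = resolve_guid(specimen_code)  # step 1: Specimen code -> GUID
    if guid is None:
        return []  # nothing can be linked without a GUID
    # Steps 2-3: one triple per accession number listed in the row.
    triples = [
        f"<{guid}> <{DWC}associatedSequences> <{NUCCORE}{accession}> ."
        for accession in accessions
    ]
    # Step 4: provenance, i.e. the article that published the link.
    triples.append(f"<{guid}> <{DWC}associatedReferences> <{article_doi}> .")
    return triples


if __name__ == "__main__":
    # Stub resolver with a placeholder GUID instead of a live GBIF query.
    stub = {"FMNH 221308": "urn:example:specimen-guid"}.get
    for triple in link_row(
        "FMNH 221308",
        ["KT583829"],
        "https://doi.org/10.3897/zookeys.929.50240",
        stub,
    ):
        print(triple)
```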
Semantic representation of data found in such an Appendix Specimen Data Table allows the utilisation of the Linked Open Data model to map and link several data elements to each other, including the provenance record, that is the original source (article) from where these links have been extracted (Fig. 4).
Fig. 4. Example of a semantic representation between some of the data elements from the Appendix Specimen Data Table. The proposed schema for mapping these elements uses mostly Darwin Core terms to maintain interoperability across different platforms. The link between the specimen GUID, GBIF occurrence, GenBank sequence and scientific name is marked in red.
At the very end, we will be able to construct a new “virtual super-table” of semantic links between the data elements associated with a specimen, which, in the ideal case, would provide the fully-linked information on data and metadata along and across the lines:
Species A: Specimen <> Tissue sample <> Sequence <> Location <> Taxon name <> Habitat <> Publication source
Species B: Specimen <> Tissue sample <> Sequence <> Location <> Taxon name <> Habitat <> Publication source
Retrieving such additional information, for example, about an occurrence from GBIF or sequence information from GenBank through APIs, and linking these pieces of information together in one dataset opens new possibilities for data discovery and re-use, as well as for the reproducibility of research results.
An example of how data from different resources could be brought together and represented is the visualisation of host-parasite interactions between species, such as those between bats and coronaviruses, indexed by Global Biotic Interactions (GloBI) (Fig. 5). Various other interactions, such as pollination, feeding, co-existence and others, are stored in GloBI’s database, which is also available in the form of a Linked Open Dataset, openly accessible through files or through a SPARQL endpoint.
Fig. 5. Visualisation resulting from querying biotic interactions existing between a bat species from order Chiroptera (Plecotus auritus) and bat coronavirus.
The technology of Linked Open Data is already widely used across many fields, so data scientists will not be tremendously impressed by the fact that all of the above is possible. The problem is how to get there. One of the most obvious ways seems to be for publishers to start publishing data in a standard, community-agreed format so that these can easily be handled by machines with little or no human intervention. Will they do that? Some will, but until it becomes routine practice, most of the published data, i.e. high-quality, peer-reviewed data vetted by the act of publishing, will remain hardly accessible, hence unusable.
This pilot was developed as a use case and published as the first article in a free-to-publish special issue on the biology of bats and pangolins as potential vectors for coronaviruses in the journal ZooKeys. An additional benefit of the use case is the digitisation by Plazi of many articles on bats listed in the bibliography of the Patterson et al. article, and the liberation of the data they contain. The use case is also a contribution to the recently opened COVID-19 Joint Task Force of the Consortium of European Taxonomic Facilities (CETAF), the Distributed System for Scientific Collections (DiSSCo) and the Integrated Digitized Biocollections (iDigBio).
To facilitate the quick adoption of the improved data table standards, Pensoft invites all who would like to test and see how their data are distributed and re-used after publication to submit manuscripts containing specimen data and biotic interaction tables, following the standard described above. Authors will be provided with a template table in which to complete all fields relevant to their study while conforming to the standard used by Pensoft.
This initiative was supported in part by the IGNITE project.
Field Museum of Natural History Chicago
Patterson BD, Webala PW, Lavery TH, Agwanda BR, Goodman SM, Kerbis Peterhans JC, Demos TC (2020) Evolutionary relationships and population genetics of the Afrotropical leaf-nosed bats (Chiroptera, Hipposideridae). ZooKeys 929: 117-161. https://doi.org/10.3897/zookeys.929.50240
Poelen JH, Simons JD, Mungall CJ (2014) Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2014.08.005
Hardisty A, Ma K, Nelson G, Fortes J (2019) ‘openDS’ – A New Standard for Digital Specimens and Other Natural Science Digital Object Types. Biodiversity Information Science and Standards 3: e37033. https://doi.org/10.3897/biss.3.37033