Developed in the 1990s and early 2000s, LifeWatch is one of the large-scale European Research Infrastructures (ESFRI) created to support biodiversity science and its developments. Its ultimate goal is to model Earth’s biodiversity based on large-scale data, to build a vast network of partners, and to liaise with other high-quality and viable research infrastructures (RI).
Making use of the technologically advanced open access Biodiversity Data Journal and its Collections feature, the LifeWatchGreece team is publishing a vast collection of peer-reviewed scientific outputs, including software descriptions, data papers, taxonomic checklists and research articles, along with the accompanying datasets and supporting material. Their intention is to demonstrate the availability and applicability of the developed e-Services and Virtual Laboratories (vLabs) to both the scientific community and the broader domain of biodiversity management.
The LifeWatchGreece Special Collection is now available in Biodiversity Data Journal, with a series of articles highlighting key contributions to the large-scale European LifeWatch RI. The Software Description papers explain the LifeWatchGreece Portal, where all the e-Services and the vLabs provided by LifeWatchGreece RI are hosted; the Data Services based on semantic web technologies, which provide detailed and specialized search paths to facilitate data mining; the R vLab which can be used for a series of statistical analyses in ecology, based on an integrated and optimized online R environment; and the Micro-CT vLab, which allows the online exploration, dissemination and interactive manipulation of micro-tomography datasets.
The LifeWatchGreece Special Collection also includes a series of taxonomic checklists (preliminary, updated and/or annotated); a series of data papers presenting historical and original datasets; and a selection of research articles reporting on the outcomes, methodologies and citizen science initiatives developed by collaborating research projects, which have shared human, hardware and software resources with LifeWatchGreece RI.
LifeWatchGreece relies on a multidisciplinary approach, involving several subsidiary initiatives; collaborations with Greek, European and World scientific communities; specialised staff, responsible for continuous updates and developments; and, of course, innovative online tools and already established IT infrastructure.
Original source:
Arvanitidis C, Chatzinikolaou E, Gerovasileiou V, Panteri E, Bailly N, Minadakis N, Hardisty A, Los W (2016) LifeWatchGreece: Construction and operation of the National Research Infrastructure (ESFRI). Biodiversity Data Journal 4: e10791. https://doi.org/10.3897/BDJ.4.e10791
Additional information:
This work has been supported by the LifeWatchGreece infrastructure (MIS 384676), funded by the Greek Government under the General Secretariat of Research and Technology (GSRT), ESFRI Projects, National Strategic Reference Framework (NSRF).
We want to stress at this point that the import functionality itself is agnostic of the data source and any metadata file in EML 2.1.1 or 2.1.0 can be imported. We have listed these three most likely sources of metadata to illustrate the workflow.
In the remainder of the post, we will go through the original post from October 13, 2015 and highlight the latest updates.
At the time of the writing of the original post, the Biodiversity Information Standards conference, TDWG 2015, was taking place in Kenya. Data sharing, data re-use, and data discovery were being brought up in almost every talk. We might have entered the age of Big Data twenty years ago, but it is now that scientists face the real challenge – storing and searching through the deluge of data to find what they need.
Because the rate at which we generate data grows exponentially and exceeds the rate at which data storage technologies improve, the field of data management is greatly challenged. Worse, this means that the more new data is generated, the more of the older data will be lost. In order to know what to keep and what to delete, we need to describe the data as thoroughly as possible and judge the importance of each dataset. This post is about a novel way to automatically generate scientific papers that describe a dataset, which we will refer to as data papers.
The common characteristics of the records, i.e. descriptions of the object of study, the measurement apparatus and the statistical summaries used to quantify the records, the personal notes of the researcher, and so on, are called metadata. Major web portals such as DataONE, the Global Biodiversity Information Facility (GBIF), or the Long Term Ecological Research Network store metadata in conjunction with a given dataset as one or more text files, usually structured in special formats that allow the metadata to be parsed by algorithms.
To make the metadata and the corresponding datasets discoverable and citable, the concept of the data paper was introduced in the early 2000s by the Ecological Society of America. It was brought to the attention of the biodiversity community by Chavan and Penev (2011), who introduced a new kind of data paper that is based on a metadata standard, such as the Ecological Metadata Language (EML), and derived from metadata content stored at large data platforms, in this case the Global Biodiversity Information Facility (GBIF). You can read this article for an in-depth discussion of the topic.
Therefore, in the remainder of this post we will explain how to use an automated approach to publish a data paper describing an online dataset in Biodiversity Data Journal. The ARPHA system will convert the metadata describing your dataset into a manuscript for you after reading in the metadata. We will illustrate the workflow on the previously mentioned DataONE and GBIF.
The Data Observation Network for Earth (DataONE) is a distributed cyberinfrastructure funded by the U.S. National Science Foundation. It links together over twenty-five nodes, primarily in the U.S., hosting biodiversity and biodiversity-related data, and provides an interface to search for data in all of them. (Note: In the meantime, DataONE has updated their search interface.)
Since butterflies are neat, let’s search for datasets about butterflies on DataONE! Type “Lepidoptera” in the search field and scroll down to the dataset describing “The Effects of Edge Proximity on Butterfly Biodiversity.” You should see something like this:
As you can see, this resource has two objects associated with it: the metadata, which has been highlighted, and the dataset itself. Let’s download the metadata from the cloud! The resulting text file, “Blandy.235.1.xml”, or whatever you want to call it, can be read by humans, but is somewhat cryptic because of all the XML tags. Now, you can import this file into the ARPHA writing platform and the information stored in it will be used to create a data paper! Go to the ARPHA website, and click on “Start a manuscript,” then scroll all the way down and click on “Import manuscript”.
Upload the “blandy” file and you will see an “Authors’ page,” where you can select which of the authors mentioned in the metadata must be included as authors of the data paper itself. Note that the user of ARPHA uploading the metadata is added to the list of the authors even if they are not included in the metadata. After the selection is done, a scholarly article is created by the system with the information from the metadata already in the respective sections of the article:
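If you are curious about what the import step has to work with, the title and the creators can be read from an EML file with a few lines of Python. This is only a minimal sketch: the sample document below is a made-up, heavily trimmed EML 2.1.1 file (not the “Blandy” metadata), the function name `summarize_eml` is our own, and we make no claim that ARPHA parses the file this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML 2.1.1 document. Real files carry many
# more sections (abstract, coverage, methods, contacts, and so on).
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example.1.1" system="example">
  <dataset>
    <title>Example butterfly survey</title>
    <creator>
      <individualName>
        <givenName>Ada</givenName>
        <surName>Example</surName>
      </individualName>
    </creator>
    <creator>
      <individualName>
        <surName>Sample</surName>
      </individualName>
    </creator>
  </dataset>
</eml:eml>"""


def summarize_eml(xml_text):
    """Return (title, [creator names]) from an EML 2.1.x document."""
    root = ET.fromstring(xml_text)
    # In EML 2.1.x only the root element is namespace-qualified;
    # <dataset> and its children are plain, unprefixed elements.
    dataset = root.find("dataset")
    title = dataset.findtext("title", default="").strip()
    creators = []
    for creator in dataset.findall("creator"):
        given = creator.findtext("individualName/givenName", default="").strip()
        sur = creator.findtext("individualName/surName", default="").strip()
        # givenName is optional in EML, so join only the parts present.
        creators.append(" ".join(part for part in (given, sur) if part))
    return title, creators


if __name__ == "__main__":
    title, creators = summarize_eml(SAMPLE_EML)
    print(title)
    for name in creators:
        print(" -", name)
```

The creator names found this way correspond to the people offered for selection on the “Authors’ page”.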
Now, the authors can add some description, edit out errors, tell a story, cite someone – all of this without leaving ARPHA – i.e. do whatever it takes to produce a high-quality scholarly text. After they are done, they can submit their article for peer-review and it could be published in a matter of hours. Voila!
Let’s look at GBIF. Go to “Data -> Explore by country” and select “Saint Vincent and the Grenadines,” an English-speaking Caribbean island country. There are, as of the time of writing of this post, 166 occurrence datasets containing data about the islands. Select the dataset from the Museum of Comparative Zoology at Harvard. If you scroll down, you will see the GBIF annotated EML. Download this as a separate text file (if you are using Chrome, you can view the source, and then use Copy-Paste). Follow exactly the same steps as before: go to “Import manuscript” in ARPHA and upload the EML file. The result should be something like this, ready to finalize:
To finish up, we want to leave you with some caveats and topics for further discussion. To date, useful and descriptive metadata has not always been present. There are two challenges: metadata completeness and metadata standards. The invention of the EML standard was one of the first efforts to standardize how metadata should be stored in the field of ecology and biodiversity science.
Currently, our import system supports the last two versions of the EML standard, 2.1.1 and 2.1.0, but we hope to further develop this functionality. In an upcoming version of their search interface, DataONE will provide infographics on the prevalence of the metadata standards on their site (as illustrated below), so there is still work to be done, but if there is positive feedback from the community, we will definitely keep elaborating this feature.
Image: DataONE
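Whether a given metadata file falls within the two supported EML versions can be checked before uploading. The sketch below relies on the fact that EML 2.1.x documents declare a namespace of the form `eml://ecoinformatics.org/eml-2.1.1` on the root element; the function names and the sample documents are our own illustrations and not part of ARPHA.

```python
import re
import xml.etree.ElementTree as ET

# The two EML versions that the import is stated to support.
SUPPORTED_EML_VERSIONS = {"2.1.0", "2.1.1"}

# ElementTree renders a namespaced tag as "{namespace-uri}localname".
_EML_ROOT = re.compile(r"\{eml://ecoinformatics\.org/eml-(\d+\.\d+\.\d+)\}eml$")


def eml_version(xml_text):
    """Return the EML version declared by the root namespace, or None."""
    root = ET.fromstring(xml_text)
    match = _EML_ROOT.match(root.tag)
    return match.group(1) if match else None


def is_importable(xml_text):
    """True if the document declares one of the supported EML versions."""
    return eml_version(xml_text) in SUPPORTED_EML_VERSIONS


if __name__ == "__main__":
    # Minimal made-up documents, just enough to carry a namespace.
    new = '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"><dataset/></eml:eml>'
    old = '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"><dataset/></eml:eml>'
    print(eml_version(new), is_importable(new))
    print(eml_version(old), is_importable(old))
```

Files in other metadata standards, or in EML versions outside this set, simply yield `False` here.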
Regarding metadata completeness, our hope is that by enabling scientists to create scholarly papers from their metadata with a single-step process, they will be incentivized to produce high-quality metadata.
Now, allow us to give a disclaimer here: the authors of this blog post have nothing to do with the two datasets. They have not contributed to any of them, nor do they know the authors. The datasets have been chosen more or less randomly since the authors wanted to demonstrate the functionality with a real-world example. You should only publish data papers if you know the authors or you are the author of the dataset itself. During the actual review process of the paper, the authors that have been included will get an email from the journal.
Additional information:
This project has received funding from the European Union’s FP7 project EU BON (Building the European Biodiversity Observation Network), grant agreement No 308454, and Horizon 2020 research and innovation project BIG4 (Biosystematics, informatics and genomics of the big 4 insect groups: training tomorrow’s researchers and entrepreneurs) under the Marie Skłodowska-Curie grant agreement No. 642241 for a PhD project titled Technological Implications of the Open Biodiversity Knowledge Management System.
On October 20, 2015, we published a blog post about the novel functionalities in ARPHA that allow streamlined import of specimen or occurrence records into taxonomic manuscripts.
Recently, this process was reflected in the “Tips and Tricks” section of the ARPHA authoring tool. Here, we’ll list the individual workflows:
Based on our earlier post, we will now go through our latest updates and highlight the new features that have been added since then.
Repositories and data indexing platforms, such as GBIF, BOLD systems, iDigBio, or PlutoF, hold, among other types of data, specimen or occurrence records. It is now possible to directly import specimen or occurrence records into ARPHA taxonomic manuscripts from these platforms [see Fig. 1]. We’ll refer to specimen or occurrence records as simply occurrence records for the rest of this post.
[Fig. 1] Workflow for directly importing occurrence records into a taxonomic manuscript.
Until now, when users of the ARPHA writing tool wanted to include occurrence records as materials in a manuscript, they had to format the occurrences as an Excel sheet to be uploaded to the Biodiversity Data Journal, or enter the data manually. While the “upload from Excel” approach significantly simplifies the process of importing materials, it still requires a transposition step: the data stored in a database needs to be reformatted to the specific Excel format. With the introduction of the new import feature, occurrence data that is stored at GBIF, BOLD systems, iDigBio, or PlutoF can be directly inserted into the manuscript by simply entering a relevant record identifier.
The functionality shows up when one creates a new “Taxon treatment” in a taxonomic manuscript in the ARPHA Writing Tool. To import records, the author needs to:
Locate an occurrence record or records in one of the supported data portals;
Note the ID(s) of the records that ought to be imported into the manuscript (see Tips and Tricks for screenshots);
Enter the ID(s) of the occurrence record(s) in a form that is to be seen in the “Materials” section of the species treatment;
Select a particular database from a list, and then simply click ‘Add’ to import the occurrence directly into the manuscript.
In the case of BOLD Systems, the author may also select a given Barcode Index Number (BIN; for a treatment of BINs read below), which then pulls all occurrences in the corresponding BIN.
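Since the import hinges on entering the right identifier for the right database, it can help to sanity-check an identifier before pasting it into the form. The sketch below is only an illustration under stated assumptions: the source labels, the function name `looks_plausible`, and the BIN pattern (the prefix `BOLD:` followed by three capital letters and four digits) are ours, and PlutoF is left out because we do not want to guess its identifier format. The check says nothing about whether the record actually exists.

```python
import re
import uuid

# Assumed shape of a BIN identifier, e.g. "BOLD:AAA0001" (hypothetical value).
BIN_PATTERN = re.compile(r"^BOLD:[A-Z]{3}\d{4}$")


def looks_plausible(source, identifier):
    """Rough format check of a record identifier for a given source."""
    identifier = identifier.strip()
    if source == "iDigBio":
        # iDigBio records are imported via a UUID.
        try:
            uuid.UUID(identifier)
        except ValueError:
            return False
        return True
    if source == "GBIF":
        # GBIF occurrence keys are plain integers.
        return identifier.isdigit()
    if source == "BOLD BIN":
        return BIN_PATTERN.match(identifier) is not None
    raise ValueError("unknown source label: " + source)


if __name__ == "__main__":
    print(looks_plausible("iDigBio", "b9ff7774-4a5d-47af-a2ea-bdf3ecc78885"))
    print(looks_plausible("GBIF", "12a"))
```

The UUID in the usage lines is the one used for the iDigBio example later in this post.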
We will illustrate this workflow by creating a fictitious treatment of the red moss, Sphagnum capillifolium, in a test manuscript. We have started a taxonomic manuscript in ARPHA and know that the occurrence records belonging to S. capillifolium can be found on iDigBio. What we need to do is to locate the ID of the occurrence record in the iDigBio webpage. In the case of iDigBio, the ARPHA system supports import via a Universally Unique Identifier (UUID). We have already created a treatment for S. capillifolium and clicked on the pencil to edit materials [Fig. 2].
[Fig. 2] Edit materials
In this example, type or paste the UUID (b9ff7774-4a5d-47af-a2ea-bdf3ecc78885), select the iDigBio source and click ‘Add’. This will pull the occurrence record for S. capillifolium from iDigBio and insert it as a material in the current paper [Fig. 3].
[Fig. 3] Materials after they have been imported
This workflow can be used for a number of purposes. An interesting future application is the rapid re-description of species, but even more exciting is the description of new species from BINs. BINs (Barcode Index Numbers) delimit Operational Taxonomic Units (OTUs), created algorithmically at BOLD Systems. If a taxonomist decides that an OTU is indeed a new species, then they can import all the type information associated with that OTU for the purposes of describing it as a new species.
By not having to retype or copy and paste species occurrence records, the authors save a lot of effort. Moreover, the records are automatically imported in a structured Darwin Core format, which can easily be downloaded from the article text as structured data by anyone who needs the data for reuse.
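To give an idea of what “structured” means here, the following sketch writes an occurrence record keyed by Darwin Core terms to CSV. The term names (occurrenceID, scientificName, basisOfRecord, country, eventDate) are genuine Darwin Core terms, but the selection of terms, the record values and the function name `to_csv` are made up for illustration; this is not the export format of ARPHA or the journal.

```python
import csv
import io

# A small, arbitrary selection of Darwin Core terms used as column names.
DWC_FIELDS = ["occurrenceID", "scientificName", "basisOfRecord", "country", "eventDate"]


def to_csv(records):
    """Serialize Darwin Core style dictionaries to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=DWC_FIELDS,
        extrasaction="ignore",   # silently skip terms outside DWC_FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)    # missing terms are written as empty cells
    return buffer.getvalue()


if __name__ == "__main__":
    # Entirely hypothetical record values.
    records = [
        {
            "occurrenceID": "example-0001",
            "scientificName": "Sphagnum capillifolium",
            "basisOfRecord": "PreservedSpecimen",
            "country": "Exampleland",
            "eventDate": "2000-01-01",
        }
    ]
    print(to_csv(records), end="")
```

Because every value sits under a standard term name, a record like this can be loaded into a spreadsheet or a database without any manual re-keying.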
Another important aspect of the workflow is that it will serve as a platform for peer-review, publication and curation of raw data, that is, of unpublished individual data records coming from collections or observations stored at GBIF, BOLD, iDigBio and PlutoF. Taxonomists are used to publishing only records of specimens they or their co-authors have personally studied. In a sense, the workflow will serve as a “cleaning filter” for portions of data that are passed through the publishing process. Thereafter, the published records can be used to curate raw data at collections, e.g. to correct identifications, assign newly described species names to specimens belonging to the respective BIN, and so on.
Additional Information:
The work has been partially supported by the EC-FP7 EU BON project (ENV 308454, Building the European Biodiversity Observation Network) and the ITN Horizon 2020 project BIG4 (Biosystematics, informatics and genomics of the big 4 insect groups: training tomorrow’s researchers and entrepreneurs), under Marie Skłodowska-Curie grant agreement No. 642241.