While digital curation, publication and dissemination of data have been steadily picking up in recent years in scientific fields ranging from biodiversity and ecology to chemistry and aeronautics, so have pressing concerns about their quality, availability and reusability. What’s the use of any dataset if it isn’t FAIR (i.e. findable, accessible, interoperable and reusable)?
With the all-too-fresh memory of researchers like Elizabeth “Lizzie” Wolkovich, who would spend a great deal of time chasing down crucial, impossible-to-replicate data by pleading with colleagues (or their successors) via long-inactive email addresses and literally dusting off card folders and floppy disks, it is easy to imagine that we are bound to witness history repeating itself. In today’s “Big Data” world, data loss caused by accidental entry errors or misused data standards seems even more plausible than an outdated contact or a drawer that has suddenly caught fire.
When a 2013 study of 516 papers published between 1991 and 2011 reported that the odds of the associated datasets being available for reuse fell by 17% per year after publication, it mostly cited data that had simply been lost over the years or stored on media that could no longer be accessed. Today, however, researchers increasingly deposit their data in external repositories, where datasets are promptly assigned a persistent link via a unique digital object identifier (DOI), while more and more publishers and funders require authors and project investigators to make their research data openly available upon publication of the associated paper. We have also seen the emergence of the Data Paper, a research article type launched to describe datasets and bring them to a wider audience, and later customised for the needs of various fields, including biodiversity. So, aren’t data finally here to last?
Today, biodiversity scientists publish and deposit biodiversity data at an unprecedented rate, and the pace is only increasing, spurred by the harrowing effects of climate change, species loss, pollution and habitat degradation, among others. Meanwhile, the field has yet to adopt universal practices and standards for efficiently linking all those data together – currently scattered across rather isolated platforms – so that researchers can actually navigate the available knowledge and build on it, rather than unknowingly duplicate the efforts of multiple teams from across the globe. Given the limits of human capacity against the unrestricted amounts of data piling up by the minute, biodiversity science is bound to stagnate unless researchers hand over the “data chase” to their computers.
Truth be told, a machine that stumbles across ‘messy’ data – data whose format and structure have been compromised so that the dataset is no longer interoperable, meaning it cannot be transferred reliably from one application to another – differs little from a researcher whose personal request to a colleague is being ignored. Easily missed details such as line breaks within data items, invalid characters or empty fields can lead to data loss, eventually compromising future research that would otherwise build on those same records. Unfortunately, institutionally available data collections are just as prone to ‘messiness’, as evidenced by data expert and auditor Dr Robert Mesibov.
“Proofreading data takes at least as much time and skill as proofreading text,” says Dr Mesibov. “Just as with text, mistakes easily creep into data files, and the bigger the data file, the more likely it has duplicates, disagreements between data fields, misplaced and truncated (cut-off) data items, and an assortment of formatting errors.”
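To make the problem concrete, here is a minimal sketch, in Python, of the kind of automated checks such proofreading can involve: scanning a tabular (CSV) file for empty fields, embedded line breaks, stray control characters and duplicate records. The file name, column handling and rules below are illustrative assumptions only, not the audit workflow used by Dr Mesibov or Pensoft.

```python
# A minimal sketch of automated checks for a few of the problems described
# above: empty fields, line breaks inside data items, control characters and
# duplicate records in a CSV file. File and field names are hypothetical.
import csv
from collections import Counter

def audit_csv(path, delimiter=","):
    problems = []
    seen = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        expected = reader.fieldnames or []
        for row_no, row in enumerate(reader, start=2):  # header is line 1
            # Too many or too few fields usually means a misplaced delimiter.
            if None in row or any(v is None for v in row.values()):
                problems.append((row_no, "row has too many or too few fields"))
            for field, value in row.items():
                if value is None or field is None:
                    continue
                if value.strip() == "":
                    problems.append((row_no, f"empty value in '{field}'"))
                if "\n" in value or "\r" in value:
                    problems.append((row_no, f"line break inside '{field}'"))
                if any(ord(ch) < 32 and ch not in "\t\n\r" for ch in value):
                    problems.append((row_no, f"control character in '{field}'"))
            seen[tuple(row.get(c) or "" for c in expected)] += 1
    duplicates = [record for record, count in seen.items() if count > 1]
    return problems, duplicates

if __name__ == "__main__":
    issues, dupes = audit_csv("occurrences.csv")  # hypothetical file name
    for line, message in issues:
        print(f"line {line}: {message}")
    print(f"{len(dupes)} duplicated record(s) found")
```

Real audits go much further, of course – checking agreement between related fields, controlled vocabularies and coordinate formats, for instance – but even a simple pass like this catches the kind of errors that quietly break interoperability.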
Just as research findings and conclusions cannot be considered trustworthy until they are backed up by substantial evidence, that evidence (i.e. a dataset) is itself of questionable relevance and credibility if its components cannot easily be retrieved by anyone wishing to follow them up, be it a human researcher or a machine. To make their research contribution responsibly and in line with good scientific practice, scientists should not only make their datasets openly available online, but also ensure they are clean and tidy, and therefore truly FAIR.
With the kind help of Dr Robert Mesibov, Pensoft has implemented a mandatory data audit for all data paper manuscripts submitted to the relevant journals in its exclusively open access portfolio, to support responsibility, efficiency and FAIRness in biodiversity science. Learn more about the workflow here. The workflow is illustrated in a case study describing the experience of the University of Cordoba’s Dr Gloria Martínez-Sagarra and Prof Juan Antonio Devesa while preparing their data paper, later published in PhytoKeys. A “Data Quality Checklist and Recommendations” is accessible on the websites of the relevant Pensoft journals, including Biodiversity Data Journal.