Open scientific data

From Wikipedia, the free encyclopedia

Open scientific data or open research data is a type of open data focused on making observations and results of scientific activities available for anyone to analyze and reuse. A major purpose of the drive for open data is to allow the verification of scientific claims, by allowing others to look at the reproducibility of results,[1] and to allow data from many sources to be integrated to give new knowledge.[2]

The modern concept of scientific data emerged in the second half of the 20th century, with the development of large knowledge infrastructures to compute scientific information and observations. The sharing and distribution of data was identified early on as an important issue, but was impeded by the technical limitations of the infrastructure and the lack of common standards for data communication. The World Wide Web was immediately conceived as a universal protocol for the sharing of scientific data, especially data coming from high-energy physics.


Scientific data

The concept of open scientific data has developed in parallel with the concept of scientific data.

Scientific data was not formally defined until the late 20th century. Before the generalization of computational analysis, data was mostly an informal term, frequently used interchangeably with knowledge or information.[3] Institutional and epistemological discourses favored alternative concepts and outlooks on scientific activities: "Even histories of science and epistemology comments, mention data only in passing. Other foundational works on the making of meaning in science discuss facts, representations, inscriptions, and publications, with little attention to data per se."[4]

The first influential policy definition of scientific data appeared as late as 1999, when the National Academies of Science described data as "facts, letters, numbers or symbols that describe an object, condition, situation or other factors".[5] Terminologies have continued to evolve: in 2011, the National Academies updated the definition to include a large variety of datafied objects such as "spectrographic, genomic sequencing, and electron microscopy data; observational data, such as remote sensing, geospatial, and socioeconomic data; and other forms of data either generated or compiled, by humans or machines" as well as "digital representation of literature".[5]

While the forms and shapes of data remain expansive and unsettled, standard definitions and policies have recently tended to restrict scientific data to computational or digital data.[6] The open data pilot of Horizon 2020 has been deliberately restricted to digital research: "‘Digital research data’ is information in digital form (in particular facts or numbers), collected to be examined and used as a basis for reasoning, discussion or calculation; this includes statistics, results of experiments, measurements, observations resulting from fieldwork, survey results, interview recordings and images".

Overall, the status of scientific data remains a flexible point of discussion among individual researchers, communities and policy-makers: "In broader terms, whatever ‘data’ is of interest to researchers should be treated as ‘research data’".[6] Important policy reports, like the 2011 collective synthesis of the National Academies of Science on data citation, have intentionally adopted a relative and nominalist definition of data: "we will devote little time to definitional issues (e.g., what are data?), except to acknowledge that data often exist in the eyes of the beholder."[7] For Christine Borgman, the main issue is not to define scientific data ("what are data") but to contextualize the point where data became a focal point of discussion within a discipline, an institution or a national research program ("when are data").[8] In the 2010s, the expansion of available data sources and the sophistication of data analysis methods have expanded the range of disciplines primarily affected by data management issues to "computational social science, digital humanities, social media data, citizen science research projects, and political science."[9]

Open scientific data

Opening and sharing have both been major topics of discussion in regard to scientific data management, but also a motivation for making data emerge as a relevant issue within an institution, a discipline or a policy framework.

For Paul Edwards, whether or not to share the data, to what extent it should be shared and with whom have been major causes of data friction, which revealed the otherwise hidden infrastructures of science: "Edwards’ metaphor of data friction describes what happens at the interfaces between data ‘surfaces’: the points where data move between people, substrates, organizations, or machines (...) Every movement of data across an interface comes at some cost in time, energy, and human attention. Every interface between groups and organizations, as well as between machines, represents a point of resistance where data can be garbled, misinterpreted, or lost. In social systems, data friction consumes energy and produces turbulence and heat – that is, conflicts, disagreements, and inexact, unruly processes."[10] The opening of scientific data is both a data friction in itself and a way to collectively manage data frictions by weakening complex issues of data ownership. Scientific or epistemic cultures have been acknowledged as primary factors in the adoption of open data policies: "data sharing practices would be expected to be community-bound and largely determined by epistemic culture."[11]

In the 2010s, new concepts were introduced by scientists and policy-makers to define more accurately what open scientific data is. Since its introduction in 2016, FAIR Data has become a major focus of open research policies. The acronym describes an ideal type of Findable, Accessible, Interoperable, and Reusable data. Open scientific data has been categorized as a commons or a public good, which is primarily maintained, enriched and preserved by collective rather than individual action: "What makes collective action useful in understanding scientific data sharing is its focus on how the appropriation of individual gains is determined by adjusting the costs and benefits that accrue with contributions to a common resource".[12]


Development of knowledge infrastructures (1945-1960)

Punch-card storage in the US National Weather Records Center in Asheville (early 1960s). Data holdings had expanded so much that the entrance hall had to be used as a storage facility.

The emergence of scientific data is associated with a semantic shift in the way core scientific concepts like data, information and knowledge are commonly understood.[13] Following the development of computing technologies, data and information are increasingly described as "things":[14] "Like computation, data always have a material aspect. Data are things. They are not just numbers but also numerals, with dimensionality, weight, and texture".[15]

After the Second World War, large scientific projects increasingly relied on knowledge infrastructures to collect, process and analyze large amounts of data. Punch-card systems were first used experimentally on climate data in the 1920s and were applied on a large scale in the following decade: "In one of the first Depression-era government make-work projects, Civil Works Administration workers punched some 2 million ship log observations for the period 1880–1933."[16] By 1960, the meteorological data collections of the US National Weather Records Center had expanded to 400 million cards and had a global reach. The physicality of scientific data was by then fully apparent and threatened the stability of entire buildings: "By 1966 the cards occupied so much space that the Center began to fill its main entrance hall with card storage cabinets (figure 5.4). Officials became seriously concerned that the building might collapse under their weight".[17]

By the end of the 1960s, knowledge infrastructures had been embedded in a varied set of disciplines and communities. The first initiative to create an electronic bibliographic database of open access data was the Educational Resources Information Center (ERIC) in 1966. In the same year, MEDLINE was created: a free-access online database managed by the National Library of Medicine and the National Institutes of Health (USA) with bibliographical citations from journals in the biomedical area, which would later be called PubMed, currently with over 14 million complete articles.[18] Knowledge infrastructures were also set up in space engineering (with NASA/RECON), library search (with OCLC Worldcat) and the social sciences: "The 1960s and 1970s saw the establishment of over a dozen services and professional associations to coordinate quantitative data collection".[19]

Opening and sharing data: early attempts (1960-1990)

Early discourses and policy frameworks on open scientific data emerged immediately in the wake of the creation of the first large knowledge infrastructures. The World Data Center system (now the World Data System) aimed to make observation data more readily available in preparation for the International Geophysical Year of 1957–1958.[20] The International Council of Scientific Unions (now the International Council for Science) established several World Data Centers to minimize the risk of data loss and to maximize data accessibility, further recommending in 1955 that data be made available in machine-readable form.[21] In 1966, the International Council for Science created CODATA, an initiative to "promote cooperation in data management and use".[22]

These early forms of open scientific data did not develop much further. There were too many data frictions and too much technical resistance to the integration of external data to implement a durable ecosystem of data sharing. Data infrastructures were mostly invisible to researchers, as most searches were performed by professional librarians. Not only were the search operating systems complicated to use, but the search had to be performed very efficiently given the prohibitive cost of long-distance telecommunication.[23] While their designers had originally anticipated direct use by researchers, this could not really emerge because of technical and economic impediments:

The designers of the first online systems had presumed that searching would be done by end users; that assumption undergirded system design. MEDLINE was intended to be used by medical researchers and clinicians, NASA/RECON was designed for aerospace engineers and scientists. For many reasons, however, most users through the seventies were librarians and trained intermediaries working on behalf of end users. In fact, some professional searchers worried that even allowing eager end users to get at the terminals was a bad idea.[24]

Christine Borgman does not recall any significant policy debates over the meaning, the production and the circulation of scientific data save for a few specific fields (like climatology) after 1966.[22] The insulated scientific infrastructures could hardly be connected before the advent of the web.[25] Projects and communities relied on their own unconnected networks at a national or institutional level: "the Internet was nearly invisible in Europe because people there were pursuing a separate set of network protocols".[26] Communication between scientific infrastructures was not only challenging across space, but also across time. Whenever a communication protocol was no longer maintained, the data and knowledge it disseminated were likely to disappear as well: "the relationship between historical research and computing has been durably affected by aborted projects, data loss and unrecoverable formats".[27]

Sharing scientific data on the web (1990-1995)

The World Wide Web was originally conceived as an infrastructure for open scientific data. Sharing of data and data documentation was a major focus in the initial communication of the World Wide Web when the project was first unveiled in August 1991: "The WWW project was started to allow high energy physicists to share data, news, and documentation. We are very interested in spreading the web to other areas, and having gateway servers for other data".[28]

The project stemmed from a closed knowledge infrastructure, ENQUIRE, an information management software commissioned from Tim Berners-Lee by CERN for the specific needs of high energy physics. The structure of ENQUIRE was closer to an internal web of data: it connected "nodes" that "could refer to a person, a software module, etc. and that could be interlined with various relations such as made, include, describes and so forth".[29] While it "facilitated some random linkage between information", ENQUIRE was not able to "facilitate the collaboration that was desired for in the international high-energy physics research community".[30] Like any significant scientific computing infrastructure before the 1990s, the development of ENQUIRE was ultimately impeded by the lack of interoperability and the complexity of managing network communications: "although Enquire provided a way to link documents and databases, and hypertext provided a common format in which to display them, there was still the problem of getting different computers with different operating systems to communicate with each other".[26]

The web rapidly superseded pre-existing closed infrastructures for scientific data, even when they included more advanced computing features. From 1991 to 1994, users of the Worm Community System, a major biology database on worms, switched to the Web and Gopher. While the Web did not include many advanced functions for data retrieval and collaboration, it was easily accessible. Conversely, the Worm Community System could only be browsed on specific terminals shared across scientific institutions: "To take on board the custom-designed, powerful WCS (with its convenient interface) is to suffer inconvenience at the intersection of work habits, computer use, and lab resources (…) The World-Wide Web, on the other hand, can be accessed from a broad variety of terminals and connections, and Internet computer support is readily available at most academic institutions and through relatively inexpensive commercial services."[31]

Defining open scientific data (1995-2010)

The development and the generalization of the World Wide Web lifted numerous technical barriers and frictions that had constrained the free circulation of data. Scientific data, however, had yet to be defined, and new research policies had to be implemented to realize the original vision of a web of data laid out by Tim Berners-Lee. At this point, scientific data was largely defined through the process of opening scientific data, as the implementation of open policies created new incentives for setting up actionable guidelines, principles and terminologies.

Climate research has been a pioneering field in the conceptual definition of open scientific data, as it was in the construction of the first large knowledge infrastructures in the 1950s and the 1960s. In 1995 the GCDIS articulated a clear commitment in On the Full and Open Exchange of Scientific Data: "International programs for global change research and environmental monitoring crucially depend on the principle of full and open data exchange (i.e., data and information are made available without restriction, on a non-discriminatory basis, for no more than the cost of reproduction and distribution)."[32] The expansion of the scope and the management of knowledge infrastructures also created incentives to share data, as the "allocation of data ownership" between a large number of individual and institutional stakeholders became increasingly complex.[33] Open data creates a simplified framework to ensure that all contributors and users of the data have access to it.[33]

Open data was rapidly identified as a key objective of the emerging open science movement. While initially focused on publications and scholarly articles, the international initiatives in favor of open access expanded their scope to all the main scientific productions.[34] In 2003 the Berlin Declaration supported the diffusion of "original scientific research results, raw data and metadata, source materials and digital representations of pictorial and graphical and scholarly multimedia materials".

After 2000, international organizations, like the OECD (Organisation for Economic Co-operation and Development), have played an instrumental role in devising generic and transdisciplinary definitions of scientific data, as open data policies have to be implemented beyond the specific scale of a discipline or a country.[5] One of the first influential definitions of scientific data was coined in 1999[5] by a report of the National Academies of Science: "Data are facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors".[35] In 2004, the Science Ministers of all nations of the OECD signed a declaration which essentially states that all publicly funded archive data should be made publicly available.[36] In 2007 the OECD "codified the principles for access to research data from public funding"[37] through the Principles and Guidelines for Access to Research Data from Public Funding, which defined scientific data as "factual records (numerical scores, textual records, images and sounds) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings."[38] The Principles acted as a soft-law recommendation and affirmed that "access to research data increases the returns from public investment in this area; reinforces open scientific inquiry; encourages diversity of studies and opinion; promotes new areas of work and enables the exploration of topics not envisioned by the initial investigators."[39]

Policy implementations (2010-…)

After 2010, national and supra-national institutions took a more interventionist stance. New policies have been implemented to ensure and incentivize the opening of scientific data, usually in continuation of existing open data programs. In Europe, the "European Union Commissioner for Research, Science, and Innovation, Carlos Moedas made open research data one of the EU’s priorities in 2015."[9]

First published in 2016, the FAIR Guiding Principles[2] have become an influential framework for opening scientific data.[9] The principles were originally designed two years earlier during a policy and research workshop at the Lorentz Center, Jointly Designing a Data FAIRport.[40] During the deliberations of the workshop, "the notion emerged that, through the definition of, and widespread support for, a minimal set of community-agreed guiding principles and practice".[41]

The principles do not attempt to define scientific data, which remains a relatively plastic concept, but strive to describe "what constitutes ‘good data management’".[42] They cover four foundational principles, "that serve to guide data producer": Findability, Accessibility, Interoperability, and Reusability,[42] and also aim to provide a step toward machine-actionability by making the underlying semantics of data explicit.[41] As they fully acknowledge the complexity of data management, the principles do not claim to introduce a set of rigid recommendations but rather "degrees of FAIRness", which can be adjusted depending on the organizational costs but also on external restrictions with regard to copyright or privacy.[43]
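
As a rough illustration of what "degrees of FAIRness" could mean in practice, the sketch below scores a dataset description against the four principles. The field names (`identifier`, `access_url`, `format`, `license`) and the one-field-per-principle mapping are assumptions made for this example only; they are not criteria taken from the FAIR Guiding Principles.

```python
# Hypothetical mapping: one illustrative metadata field per FAIR principle.
# These field names are assumptions for the example, not part of FAIR itself.
FAIR_CHECKS = {
    "findable": "identifier",     # e.g. a persistent identifier
    "accessible": "access_url",   # where the data can be retrieved
    "interoperable": "format",    # a documented, shared format
    "reusable": "license",        # explicit reuse conditions
}


def fairness_degree(record: dict) -> float:
    """Return the fraction of the four illustrative checks the record satisfies."""
    satisfied = sum(1 for field in FAIR_CHECKS.values() if record.get(field))
    return satisfied / len(FAIR_CHECKS)


def missing_principles(record: dict) -> list:
    """List the principles whose illustrative field is absent or empty."""
    return [name for name, field in FAIR_CHECKS.items() if not record.get(field)]


# Placeholder record: only two of the four illustrative fields are filled in.
example_record = {"identifier": "example-id-001", "format": "text/csv"}
degree = fairness_degree(example_record)
gaps = missing_principles(example_record)
```

A graded score of this kind, rather than a pass/fail test, reflects the idea that FAIRness can be adjusted to organizational costs and to legal restrictions.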

The FAIR principles were quickly adopted by major international organizations: "FAIR experienced rapid development, gaining recognition from the European Union, G7, G20 and US-based Big Data to Knowledge (BD2K)".[44] In August 2016, the European Commission set up an expert group to turn "FAIR Data into reality".[45] As of 2020, the FAIR principles remain "the most advanced technical standards for open scientific data to date".[46]

By the end of the 2010s, open data policies were well supported by scientific communities. Two large surveys commissioned by the European Commission in 2016 and 2018 found a commonly perceived benefit: "74% of researchers say that having access to other data would benefit them".[47] Yet, more qualitative observations gathered in the same investigation also showed that "what scientists proclaim ideally, versus what they actually practice, reveals a more ambiguous situation."[47]

Scientific data management

Data management has recently become a primary focus of the policy and research debate on open scientific data. The influential FAIR principles are deliberately centered on the key features of "good data management" in a scientific context.[42]

In a research context, data management is frequently associated with data lifecycles. Various lifecycle models, with different stages, have been theorized by institutions, infrastructures and scientific communities, although "such lifecycles are a simplification of real life, which is far less linear and more iterative in practice."[48]

Plan and governance

Research data management can be laid out in a data management plan or DMP.

Data management plans were introduced in 1966 for the specific needs of aeronautic and engineering research, which already faced increasingly complex data frictions.[49] These first examples were focused on material issues associated with the access, transfer and storage of the data: "Until the early 2000s, DMPs were utilised in this manner: in limited fields, for projects of great technical complexity, and for limited mid-study data collection and processing purposes".[50] After 2000, the implementation of large research infrastructures and the development of open science dramatically changed the scope and the purpose of data management plans. Policy-makers, rather than scientists, have been instrumental in this development: "The first publications to provide general advice and guidance to researchers around the creation of DMPs were published from 2009 following the publications from JISC and the OECD (…) DMP use, we infer, has been imposed onto the research community through external forces".[51]

The involvement of external stakeholders in research projects creates significant potential tensions with the principles of sharing open data. Contributions from commercial actors in particular can rely on some form of exclusivity and appropriation of the final research results. In 2022, Pujol Priego, Wareham and Romasanta outlined several accommodation strategies to overcome these issues, such as data modularity (with sharing limited to some part of the data) and time delay (with year-long embargoes before the final release of the data).[52]

Scientific culture

The management of scientific data is rooted in scientific cultures or communities of practice. As digital tools have become widespread, the infrastructures, the practices and the common representations of research communities have increasingly relied on shared meanings of what data is and what can be done with it.[11] Pre-existing epistemic machineries can be more or less predisposed to data sharing. Important factors may include shared values (individualistic or collective), data ownership allocation and frequent collaborations with external actors who may be reluctant to share data.[53]

In 2022, Pujol Priego, Wareham and Romasanta stressed that incentives for the sharing of scientific data were primarily collective and include reproducibility, scientific efficiency and scientific quality, along with more individual rewards such as personal credit.[54]


In an effort to address issues with the reproducibility of research results, some scholars are asking that authors agree to share their raw data as part of the scholarly peer review process.[55] Since as early as 1962, for example, a number of psychologists have attempted to obtain raw data sets from other researchers, with mixed results, in order to reanalyze them. A recent attempt resulted in only seven data sets out of fifty requests. The notion of obtaining, let alone requiring, open data as a condition of peer review remains controversial.[56]


Preservation and archiving were identified early on as critical issues, especially in relation to observational data, which are considered essential to preserve because they are the most difficult to replicate.[33]

First published in 2012, the reference model of the Open Archival Information System states that scientific infrastructures should seek long-term preservation, that is "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community".[57] Consequently, good practices of data management rely both on storage (to materially preserve the data) and, even more crucially, on curation, "to preserve knowledge about the data to facilitate reuse".[58]

The opening of scientific data has helped to mitigate preservation risks, as the data are no longer maintained by only one or a few producers.

Diffusion of scientific data

Publication and edition

Until the 2010s, the publication of scientific data referred mostly to "the release of datasets associated with an individual journal article".[59] As associated files, datasets had an ambiguous status between public and non-public, since they were meant to be raw documents giving access to the background of research. Yet, in practice, the released datasets often have to be specially curated for publication, especially when they may contain personal data.

Scientific datasets have been increasingly acknowledged as autonomous scientific publications. The assimilation of data to academic articles aimed to increase the prestige and recognition of published datasets: "implicit in this argument is that familiarity will encourage data release".[59] This approach has been favored by several publishers and repositories, as it made it possible to easily integrate data in existing publishing infrastructure and to extensively reuse editorial concepts initially created around articles.[59] Data papers were explicitly introduced as "a mechanism to incentivize data publishing in biodiversity science".[60]

Citation and indexation

The first digital databases of the 1950s and the 1960s immediately raised issues of citability and bibliographic description.[61] The mutability of computer memory was especially challenging: in contrast with printed publications, digital data could not be expected to remain stable in the long run. In 1965, Ralph Blasco underlined that this uncertainty affected all the associated documents like code notebooks, which may become increasingly out of date. Data management has to find a middle ground between continuous enhancements and some form of generic stability: "the concept of a fluid, changeable, continually improving data archive means that study cleaning and other processing must be carried to such a point that changes will not significantly affect prior analyses".[62]

Structured bibliographic metadata for databases has been a debated topic since the 1960s.[61] In 1977, the American Standard for Bibliographic Reference adopted a definition of "data file" with a strong focus on the materiality and the mutability of the dataset: neither dates nor authors were indicated, but the medium or "Packaging Method" had to be specified.[63] Two years later, Sue Dodd introduced an alternative convention that brought the citation of data closer to the standard of references of other scientific publications.[61] Dodd's recommendation included the use of titles, author, editions and date, as well as alternative mentions for sub-documentations like code notebooks.[64]
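
The elements listed in Dodd's recommendation (author, title, edition, date) can be pictured as a small record type. The sketch below is only an illustration: the class name, the field layout and the output format are assumptions made for this example and do not reproduce the actual 1979 convention, and the sample values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Illustrative record holding the elements named in the text above."""
    author: str
    title: str
    date: str
    edition: Optional[str] = None  # optional, since not every dataset has editions

    def format(self) -> str:
        # Hypothetical layout: "Author (date). Title. Edition."
        parts = [f"{self.author} ({self.date}). {self.title}."]
        if self.edition:
            # Normalize the trailing period so "2nd ed." does not become "2nd ed..".
            parts.append(self.edition.rstrip(".") + ".")
        return " ".join(parts)


# Placeholder values, not a real dataset.
citation = DataCitation(
    author="Example Author",
    title="Example Survey Data File",
    date="1979",
    edition="2nd ed.",
)
formatted = citation.format()
```

Treating author and date as required fields marks the contrast with the 1977 "data file" definition, where neither was indicated.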

The indexation of datasets has been radically transformed by the development of the web, as barriers to data sharing were substantially reduced.[61] In this process, data archiving, sustainability and persistence have become critical issues. Permanent digital object identifiers (or DOIs) were introduced for scientific articles to avoid broken links, as website structures continuously evolved. In the early 2000s, pilot programs started to allocate DOIs to datasets as well.[65] While it solves concrete issues of link sustainability, the creation of data DOIs and norms of data citation is also part of a legitimization process that assimilates datasets to standard scientific publications and can draw from similar sources of motivation (like the bibliometric indexes).[66]
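
To illustrate how a DOI avoids broken links, the sketch below turns a DOI into a link through the public `https://doi.org/` resolver, which redirects to wherever the object currently lives. The helper name `doi_to_url` and the example DOI are hypothetical, and the validation is deliberately minimal (it only checks for the `10.` prefix and a `/` separator).

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver link from a DOI given in any of a few common spellings."""
    doi = doi.strip()
    # Strip common prefixes so that only the bare DOI remains.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Minimal sanity check: DOIs start with "10." and contain a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return RESOLVER + quote(doi, safe="/")


# Hypothetical DOI used purely as an example.
link = doi_to_url("doi:10.1234/example-dataset")
```

The citation stores only the identifier; if the hosting website is reorganized, the resolver record is updated and the link in the citation stays valid.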

As of 2022, the recognition of open scientific data is still an ongoing process. The leading reference software Zotero does not yet have a specific item type for datasets.

Reuse and economic impact

Analyses of the uses of open scientific data run into the same issues as for any open content: while free, universal and indiscriminate access has demonstrably expanded the scope, range and intensity of the reception, it has also made it harder to track, due to the lack of a transaction process.

These issues are further complicated by the novelty of data as a scientific publication: "In practice, it can be difficult to monitor data reuse, mainly because researchers rarely cite the repository".[67]

In 2018, a report of the European Commission estimated the cost of not opening scientific data in accordance with the FAIR principles: it amounted to €10.2 billion annually in direct impact and €16 billion in indirect impact over the entire innovation economy.[68] Implementing open scientific data at a global scale "would have a considerable impact on the time we spent manipulating data and the way we store data."[68]

In 2022, Nature reported that many biomedical and health researchers who had already agreed to share their data "do not respond to access requests or hand over the data."[69]

Legal status

The opening of scientific data has raised a variety of legal issues with regard to ownership rights, copyrights, privacy and ethics. While it is commonly considered that researchers "own the data they collect in the course of their research", this "view is incorrect":[70] the creation of a dataset potentially involves the rights of numerous additional actors such as institutions (research agencies, funders, public bodies), associated data producers and private citizens whose personal data are included.[70] The legal situation of digital data has consequently been described as a "bundle of rights" due to the fact that the "legal category of "property" (...) is not a suitable model for dealing with the complexity of data governance problems".[71]


Copyright was the primary focus of the legal literature on open scientific data until the 2010s. The legality of data sharing was identified early on as a crucial issue. In contrast with the sharing of scientific publications, the main impediment was not copyright but uncertainty: "the concept of ‘data’ [was] a new concept, created in the computer age, while copyright law emerged at the time of printed publications."[72] In theory, copyright and author rights provisions do not apply to simple collections of facts and figures. In practice, the notion of data is much more expansive and could include protected content or creative arrangements of non-copyrightable content.

The status of data in international conventions on intellectual property is ambiguous. According to Article 2 of the Berne Convention, "every production in the literary, scientific and artistic domain" is protected.[73] Yet, research data is often not an original creation entirely produced by one or several authors, but rather a "collection of facts, typically collated using automated or semiautomated instruments or scientific equipment."[73] Consequently, there is no universal convention on data copyright and debates over "the extent to which copyright applies" are still prevalent, with different outcomes depending on the jurisdiction or the specifics of the dataset.[73] This lack of harmonization stems logically from the novelty of "research data" as a key concept of scientific research: "the concept of ‘data’ is a new concept, created in the computer age, while copyright law emerged at the time of printed publications."[73]

In the United States, the European Union and several other jurisdictions, copyright laws have acknowledged a distinction between the data itself (which can be an unprotected "fact") and the compilation of the data (which can be a creative arrangement).[73] This principle largely predates the contemporary policy debate over scientific data, as the earliest court cases ruling in favor of compilation rights date back to the 19th century.

In the United States, compilation rights were defined in the Copyright Act of 1976 with an explicit mention of datasets: "a work formed by the collection and assembling of pre-existing materials or of data" (§ 101).[74] In its 1991 decision Feist Publications, Inc., v. Rural Telephone Service Co., the Supreme Court clarified the extent and limitations of database copyrights: the "assembling" must be demonstrably original, and the "raw facts" contained in the compilation remain unprotected.[74]

The European Union provides one of the strongest intellectual property frameworks for data, with a double layer of rights: copyright for original compilations (similarly to the United States) and sui generis database rights.[75] Criteria for the originality of compilations have been harmonized across the member states by the 1996 Database Directive and by several major cases settled by the European Court of Justice, such as Infopaq International A/S v Danske Dagblades Forening or Football Dataco Ltd et al. v Yahoo! UK Ltd. Overall, it has been acknowledged that significant effort in the making of a dataset is not sufficient to claim compilation rights, as the structure has to "express his creativity in an original manner".[76] The Database Directive also introduced an original framework of protection for datasets, the sui generis rights, which are conferred on any dataset that required a "substantial investment".[77] While they last 15 years, sui generis rights have the potential to become permanent, as they can be renewed with every update of the dataset. Due to their large scope in duration and protection, sui generis rights were initially not widely acknowledged in European jurisprudence, which set a high bar for their enforcement. This cautious approach was reversed in the 2010s, as the 2013 decision Innoweb BV v Wegener ICT Media BV and Wegener Mediaventions strengthened the position of database owners and condemned the reuse of non-protected data in web search engines.[78] The consolidation and expansion of database rights remain a controversial topic in European regulation, as they are partly at odds with the European Union's commitment to a data-driven economy and open science.[78] While a few exceptions exist for scientific and pedagogic uses, they are limited in scope (no rights for further reutilization) and have not been activated in all member states.[78]

Overall, even in jurisdictions where the application of copyright to data outputs remains unsettled and partly theoretical, it has created significant legal uncertainty. The frontier between a set of raw facts and an original compilation is not clearly delineated.[75] Although scientific organizations are usually well aware of copyright laws, the complexity of data rights creates unprecedented challenges.[79]

After 2010, national and supranational jurisdictions have partly changed their stance with regard to the copyright protection of research data. As sharing is encouraged, scientific data has also been acknowledged as an informal public good: "policymakers, funders, and academic institutions are working to increase awareness that, while the publications and knowledge derived from research data pertain to the authors, research data needs to be considered a public good so that its potential social and scientific value can be realised".[11]


Copyright issues with scientific datasets have been further complicated by uncertainties regarding ownership. Research is largely a collaborative activity that involves a wide range of contributions. Initiatives like CRediT (Contributor Roles Taxonomy) have identified 14 different roles, of which 4 are explicitly related to data management (Formal Analysis, Investigation, Data curation and Visualization).[80]

In the United States, ownership of research data is usually "determined by the employer of the researcher", with the principal investigator acting as the caretaker of the data rather than the owner.[81] Until the development of open research data, US institutions were usually more reluctant to waive copyright on data than on publications, as data is considered a strategic asset.[82] In the European Union, there is no widely agreed framework on the ownership of data.[83]

The additional rights of external stakeholders have also been raised, especially in the context of medical research. Since the 1970s, patients have claimed some form of ownership of the data produced in the context of clinical trials, notably with important controversies concerning "whether research subjects and patients actually own their own tissue or DNA."[82]


Numerous scientific projects rely on collecting data about persons, notably in medical research and the social sciences. In such cases, any data sharing policy must be balanced against the preservation and protection of personal data.[84]

Researchers and, more specifically, principal investigators have been subject to obligations of confidentiality in several jurisdictions.[84] Health data has been increasingly regulated since the late 20th century, either by law or by sectoral agreements. In 2014, the European Medicines Agency introduced important changes to the sharing of clinical trial data, in order to prevent the release of all personal details and all commercially relevant information. Such developments in European regulation "are likely to influence the global practice of sharing clinical trial data as open data".[85]

Research management plans and practices have to be open, transparent and confidential by design.

Free licenses

Open licenses have been the preferred legal framework to clear the restrictions and ambiguities in the legal definition of scientific data. In 2003, the Berlin Declaration called for a universal waiver of reuse rights on scientific contributions that explicitly included "raw data and metadata".[86]

In contrast with the development of open licenses for publications, which occurred over a short time frame, the creation of licenses for open scientific data has been a complicated process. Specific rights, like the sui generis database rights in the European Union, and specific legal principles, like the distinction between simple facts and original compilations, were not initially anticipated. Until the 2010s, free licenses could paradoxically add more restrictions to the reuse of datasets, especially with regard to attribution (which is not required for non-copyrighted objects like raw facts): "in such cases, when no rights are attached to research data, then there is no ground for licencing the data".[87]

To circumvent the issue, several institutions like the Harvard-MIT Data Center started to share their data in the public domain.[88] This approach ensures that no right is applied to non-copyrighted items. Yet the public domain and some associated tools like the Public Domain Mark are not properly defined legal contracts and vary significantly from one jurisdiction to another.[88] First introduced in 2009, the Creative Commons Zero (or CC0) license was immediately considered for data licensing.[89] It has since become "the recommended tool for releasing research data into the public domain".[90] In accordance with the principles of the Berlin Declaration, it is not a license but a waiver, as the producer of the data "overtly, fully, permanently, irrevocably and unconditionally waives, abandons, and surrenders all of Affirmer’s Copyright and Related Rights".

Alternative approaches have included the design of new free licenses to disentangle the attribution stacking specific to database rights. In 2009, the Open Knowledge Foundation published the Open Database License, which has been adopted by major online projects like OpenStreetMap. Since 2015, all the different Creative Commons licenses have been updated to become fully effective on datasets, as database rights have been explicitly anticipated in the 4.0 version.[87]


  1. ^ Spiegelhalter, D. Open data and trust in the literature. The Scholarly Kitchen. Retrieved 7 September 2018.
  2. ^ a b Wilkinson et al. 2016.
  3. ^ Lipton 2021, p. 19.
  4. ^ Borgman 2015, p. 18.
  5. ^ a b c d Lipton 2020, p. 59.
  6. ^ a b Lipton 2020, p. 61.
  7. ^ National Academies 2011, p. 1.
  8. ^ Borgman 2015, pp. 4–5.
  9. ^ a b c Pujol Priego, Wareham & Romasanta 2022, p. 220.
  10. ^ Edwards et al. 2011, p. 669.
  11. ^ a b c Pujol Priego, Wareham & Romasanta 2022, p. 224.
  12. ^ Pujol Priego, Wareham & Romasanta 2022, p. 225.
  13. ^ Rosenberg 2018, pp. 557–558.
  14. ^ Buckland 1991.
  15. ^ Edwards 2010, p. 84.
  16. ^ Edwards 2010, p. 99.
  17. ^ Edwards 2010, p. 102.
  18. ^ Machado, Jorge. "Open data and open science". In Albagli, Maciel, Abdo. "Open Science, Open Questions", 2015
  19. ^ Shankar et al. 2016, p. 63.
  20. ^ Committee on Scientific Accomplishments of Earth Observations from Space, National Research Council (2008). Earth Observations from Space: The First 50 Years of Scientific Achievements. The National Academies Press. p. 6. ISBN 978-0-309-11095-2. Retrieved 2010-11-24.
  21. ^ World Data Center System (2009-09-18). "About the World Data Center System". NOAA, National Geophysical Data Center. Retrieved 2010-11-24.
  22. ^ a b Borgman 2015, p. 7.
  23. ^ Regazzi 2015, p. 128.
  24. ^ Bourne & Hahn 2003, p. 397.
  25. ^ Campbell-Kelly & Garcia-Swartz 2013.
  26. ^ a b Berners-Lee & Fischetti 2008, p. 17.
  27. ^ Dacos 2013.
  28. ^ Tim Berners-Lee, "Qualifiers on Hypertext Links", mail sent on August 6, 1991 to the alt.hypertext
  29. ^ Hogan 2014, p. 20.
  30. ^ Bygrave & Bing 2009, p. 30.
  31. ^ Star & Ruhleder 1996, p. 131.
  32. ^ National Research Council (1995). On the Full and Open Exchange of Scientific Data. Washington, DC: The National Academies Press. doi:10.17226/18769. ISBN 978-0-309-30427-6.
  33. ^ a b c Pujol Priego, Wareham & Romasanta 2022, p. 223.
  34. ^ Lipton 2020, p. 16.
  35. ^ National Research Council 1999, p. 16.
  36. ^ OECD Declaration on Open Access to publicly funded data Archived 20 April 2010 at the Wayback Machine
  37. ^ Lipton 2020, p. 17.
  38. ^ OECD 2007, p. 13.
  39. ^ OECD 2007, p. 4.
  40. ^ Wilkinson et al. 2016, p. 8.
  41. ^ a b Wilkinson et al. 2016, p. 3.
  42. ^ a b c Wilkinson et al. 2016, p. 1.
  43. ^ Wilkinson et al. 2016, p. 4.
  44. ^ van Reisen et al. 2020.
  45. ^ Horizon 2020 Commission expert group on Turning FAIR data into reality (E03464)
  46. ^ Lipton 2020, p. 66.
  47. ^ a b Pujol Priego, Wareham & Romasanta 2022, p. 241.
  48. ^ Cox & Verbaan 2018, pp. 26–27.
  49. ^ Smale et al. 2018, p. 3.
  50. ^ Smale et al. 2018, p. 4.
  51. ^ Smale et al. 2018, p. 9.
  52. ^ Pujol Priego, Wareham & Romasanta 2022, pp. 239–240.
  53. ^ Pujol Priego, Wareham & Romasanta 2022, pp. 224–225.
  54. ^ Pujol Priego, Wareham & Romasanta 2022, p. 226.
  55. ^ "The PRO Initiative for Open Science". Peer Reviewers' Openness Initiative. Retrieved 15 September 2018.
  56. ^ Wiktowski et al. 2017.
  57. ^ CCSDS 2012, p. 1.
  58. ^ Lipton 2020, p. 73.
  59. ^ a b c Borgman 2015, p. 48.
  60. ^ Chavan & Penev 2011.
  61. ^ a b c d Crosas 2014, p. 63.
  62. ^ Blasco 1965, p. 148.
  63. ^ Dodd 1979, p. 78.
  64. ^ Dodd 1979.
  65. ^ Brase et al. 2004.
  66. ^ Borgman 2015, p. 47.
  67. ^ Lipton 2020, p. 65.
  68. ^ a b European Commission 2018, p. 31.
  69. ^ Watson, Clare (2022-06-21). "Many researchers say they'll share data — but don't". Nature. doi:10.1038/d41586-022-01692-1. PMID 35725829. S2CID 249886978.
  70. ^ a b Lipton 2020, p. 127.
  71. ^ Kerber 2021, p. 1.
  72. ^ Lipton 2020, p. 119.
  73. ^ a b c d e Lipton 2020, p. 119.
  74. ^ a b Lipton 2020, p. 122.
  75. ^ a b Lipton 2020, p. 123.
  76. ^ Article 6, Directive 2006/116/EC
  77. ^ Lipton 2020, p. 124.
  78. ^ a b c Lipton 2020, p. 125.
  79. ^ Lipton 2020, p. 126.
  80. ^ Allen et al. 2019, p. 73.
  81. ^ Lipton 2020, p. 129.
  82. ^ a b Lipton 2020, p. 130.
  83. ^ Lipton 2020, p. 131.
  84. ^ a b Lipton 2020, p. 138.
  85. ^ Lipton 2020, p. 139.
  86. ^ Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities
  87. ^ a b Lipton 2020, p. 133.
  88. ^ a b Lipton 2020, p. 134.
  89. ^ Schofield et al. 2009.
  90. ^ Lipton 2020, p. 132.




  1. ^ Besançon, Lonni; Peiffer-Smadja, Nathan; Segalas, Corentin; Jiang, Haiting; Masuzzo, Paola; Smout, Cooper; Billy, Eric; Deforet, Maxime; Leyrat, Clémence (2020). "Open Science Saves Lives: Lessons from the COVID-19 Pandemic". BMC Medical Research Methodology. 21 (1): 117. doi:10.1186/s12874-021-01304-y. PMC 8179078. PMID 34090351.