This tool responds to the GC Data Strategy for the Federal Public Service (2023-2026), Priority 2.2.b (development of a FAIR
principles assessment tool and guidance on the assessment of existing data for reuse). It is suitable for a first,
screening-level assessment of whether a particular data asset is FAIRER.
There are two companion tools that you may also find useful.
Do you work with data? Are you looking to make it future-proof? The
FAIRER data principles will help you!
In 2023, the international consortium Common
Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA)
stated that: “While the FAIR principles have become a guiding technical resource for data sharing, legal
and socio-ethical considerations are equally important for a fair data ecosystem. ... FAIR
data should be FAIRER, including also ethical and reproducible as key components.”
FAIRER principles refer to the Findability, Accessibility, Interoperability, Reusability,
Ethics and Reproducibility of data assets, including related code. Applying these
principles to your data assets will help others to find, verify, cite, and reuse your data
and code more easily.
This tool helps you to assess the FAIRERness of a data asset, and get tips on
how you could increase its value and impact.
The tool is discipline-agnostic, making it relevant to any scientific field.
The FAIRER data assessment consists of 25 questions with additional guidance.
The assessment will take 30-60 minutes, after which you will receive a quantitative
summary of the level of FAIRERness of your data asset and tips on how to improve it.
No information is saved on our servers, but you will be able to save the results of the
assessment, including tips for improvement, to your local computer and add notes for future
reference.
This new tool is inspired by and builds upon the online SATIFYD tool. Please see the
author statement below.
Add any notes you may have here.
These notes will be included when you print and save your results to
your local computer. No information will be saved to our server.
Feel free to capture any thoughts or insights that you'd like to
remember or revisit later.
FINDABLE
Now that you have finished your research project, you are on the brink of depositing your research
data in a trustworthy long-term repository.
Findability is one of the six pillars of the FAIRER Principles.
If you take care of the findability of your data, you will enable search engines to find it and
possibly also link it to related sources on the web.
Moreover, you will improve the exposure of your research and help researchers to find and
potentially reuse your data.
Findability generally comes down to giving a proper description of your dataset.
This description can be divided into three elements:
Rich and detailed metadata and additional information
Persistent links / Persistent Identifiers
Standards: the more standardized terms you use, the more findable your data are. Some
domains have specific standards;
for other domains there are more generic standards, like the Getty Thesaurus of Geographical
Names.
Using standards will enable peers to find your data through (domain-specific) search
engines.
DATA are a set of values of subjects with respect to qualitative or quantitative variables
representing facts, statistics, or items of information in a formalized manner suitable for
communication, reinterpretation, or processing (TBS Policy on
Service and Digital, 2019).
Data are facts, measurements, recordings, records, or observations about the world collected by
scientists and others with a minimum of contextual interpretation. Data may be in any format or
medium taking the form of writings, notes, numbers, symbols, text, images, films, video, sound
recordings, pictorial reproductions, drawings, designs or other graphical representations,
procedural manuals, forms, diagrams, workflow charts, equipment descriptions, data files, data
processing algorithms/code/scripts, or statistical records (CODATA-IRiDiuM (2018) – International Research Data Management glossary).
The word “data” may be used very broadly to comprise data (in the strict sense) and the ecosystem of
digital things that relate to data, including metadata, software and algorithms, as well as physical
samples and analogue artefacts - and the digital representations and metadata relating to these
things (CODATA 2019 – Beijing Declaration on Research Data). There are dozens of other definitions of data that
may be useful depending on the context.
SCIENTIFIC DATA are data that are used by scientists as primary sources to support technical
and regulatory development or scientific enquiry, research, or scholarship, and that are used as
evidence in the scientific process and/or are commonly accepted in the scientific community as
necessary to validate scientific findings and results. All other digital and non-digital content
has the potential of becoming research data. Examples of scientific data include data arising from:
experiments, research and development, ‘citizen science’, surveys, operations, surveillance,
monitoring, field analyzers or data-loggers, instruments, laboratory analyses, inventories, modeling
and simulation output, processed data, and repurposed data (IRiDiuM - International Research Data Management glossary). The scientific nature of the data is demonstrated
when the processes of creating, maintaining, and quality-proofing the data comply with commonly
recognized scientific standards.
Although scientific data share many aspects in common with other types of data (e.g., administrative
data, financial data, business data), their processing frequently requires more complex software and
infrastructure. The data themselves may also be:
more complex (e.g., associated accuracy, precision, detection limits, confidence intervals,
quality assurance/quality control procedures, etc.);
more tightly controlled;
held to higher standards;
retained for a longer period of time, often indefinitely;
documented more carefully and in greater detail (e.g., description of methods used to obtain
measurements);
used as evidence, including in court, and therefore require a higher level of credibility,
reliability, and accessibility.
This work reproduces the online [SATIFYD Data Assessment tool](https://satifyd.dans.knaw.nl/)
for FAIR principles (with minor edits), modifies the layout somewhat, replaces the introduction,
adds a definition of 'Data', adds information on retention and disposition, adds new sections on
‘Ethical’ and ‘Reproducible’ principles, and adds a 'Checklist for Reproducibility’.
All authors reviewed, discussed, and agreed to all aspects of the final
work.
All views and opinions expressed are those of the co-authors, and do not necessarily reflect the
official policy or position of their respective employers, or of any government, agency or
organization.
Cite as: Liu E, Charles D, and Austin CC (2024). FAIRER Aware Data Assessment
Tool.
Metadata is information that describes an object such as a dataset. It gives context to the research data,
providing information about the URL where the dataset is stored,
creator, provenance, purpose, time, geographic locations, access conditions, and terms of use of a data
collection. The extent to which metadata is provided for a dataset
can vary greatly and affects how findable a dataset is. The following
list covers the items that should be addressed
when aiming for sufficient metadata:
Other related people who contributed to the dataset
Date on which the dataset was completed
A description of how the data were created (contextual information)
Target group for the dataset deposited (e.g., scientific disciplines)
Keywords that describe your data (use controlled vocabularies if available for your field)
A licence that clearly states the extent to which the data is accessible
Temporal coverage: the period of time to which the data relate
Spatial coverage: Geographical location of the research area or site
Related datasets, resources like publications, websites etc. (digital or analogue)
File formats used in the dataset
Many of the items on this list also relate to the accessibility,
interoperability and reusability of the dataset.
These aspects will be dealt with in the respective sections of this tool.
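To make this concrete, the items above could be captured in a machine-readable record. The sketch below is a minimal, hypothetical example in Python: every field name and value is illustrative (loosely modelled on common schemas such as DataCite), not a prescribed standard, so adapt it to the metadata schema of the repository you deposit in.

```python
# A minimal, hypothetical metadata record covering several items from the
# list above. Field names and values are illustrative only; use the schema
# required by your repository.
import json

metadata = {
    "title": "Lake water quality survey, 2022-2023",
    "creators": [{"name": "Doe, Jane", "orcid": "https://orcid.org/0000-0000-0000-0000"}],
    "contributors": [{"name": "Roe, Richard", "role": "DataCollector"}],
    "date_completed": "2023-11-30",
    "description": "Monthly sensor readings collected at 12 lake sites.",
    "target_disciplines": ["limnology", "environmental science"],
    "keywords": ["water quality", "dissolved oxygen", "phosphorus"],
    "license": "CC-BY-4.0",
    "temporal_coverage": {"start": "2022-01-01", "end": "2023-12-31"},
    "spatial_coverage": "Lake Ontario, Canada",
    "related_resources": [{"relation": "IsSupplementTo", "identifier": "10.1234/example-doi"}],
    "file_formats": ["text/csv", "application/json"],
}

# Serializing as JSON keeps the record both human- and machine-readable.
print(json.dumps(metadata, indent=2))
```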
You can document your research at the metadata level and at the dataset level. To make your metadata
interoperable and machine-actionable, use standardized controlled vocabularies,
thesauri, and ontologies. At the dataset level you should provide a project description and a dataset
description. For example, add a codebook to make your data understandable
for other researchers, and add provenance information and a data/workflow process description. If you want to
learn more about standards, see the second question under Findable.
To make your (meta)data findable we encourage the use of controlled vocabularies, taxonomies and/or
ontologies.
A controlled vocabulary is an organized and standardized list of terms and can be used to describe
data. Controlled vocabularies are mostly discipline-specific and
therefore very useful for describing your data. By using controlled vocabularies, your metadata becomes much
more understandable for machines and users, which improves the findability of your data.
A taxonomy is a classification of entities in an ordered system. A taxonomy is mostly domain-specific
and is used to identify the content/data by adding terms from the
taxonomy to the content/data description. Identifying content in a structured way gives search engines the
opportunity to optimize their search functionality.
In this way, more relevant data can be found based on a single search query.
Adding taxonomy terms to your dataset description will therefore improve its findability.
An ontology is a formal description of knowledge. This knowledge is described as a set of concepts
and relations between these concepts within a specific domain.
Ontologies are created to organize information into data and knowledge. An ontology attempts to represent
entities, ideas and events, with all their interdependent properties and relations, according to a system of
categories.
By applying existing ontologies to describe your data, your data becomes more understandable for machines,
which improves its findability.
From ontologies it is a small step to linked open data. Making use of linked open data means that
your data is interlinked with other data, that your data is openly accessible and that your data can be
shared within the semantic web.
In this way your data is published in a structured and understandable way. Linked (open) data is described
as a set of triples following the RDF data model. A triple is a basic statement consisting of a subject, a predicate, and
an object.
For example, the subject is “grass”, its corresponding predicate is “has color”, and the object is “green”. By
linking your data to other data, more knowledge, information, and links to your data become available.
This will help to increase the findability of your data.
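The grass/has-color/green triple above can be written out directly as RDF. Here is a minimal sketch using the rdflib Python library; the http://example.org/ namespace is a placeholder used purely for illustration.

```python
# Build the triple ("grass" -- "has color" --> "green") as an RDF graph.
# The example.org namespace is a placeholder, not a real vocabulary.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.grass, EX.hasColor, Literal("green")))

# Turtle is a common, human-readable text serialization for linked data.
print(g.serialize(format="turtle"))
```

In practice you would replace the placeholder namespace with terms from a published vocabulary or ontology, so that machines elsewhere can resolve and interpret your predicates.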
It is true that standardized controlled vocabularies, taxonomies, and ontologies are not equally developed across
disciplines. For some disciplines a broad range of standards is available, whilst others have none yet.
There are, however, general standards, such as the Getty Thesaurus of Geographical Names, which can be used across disciplines.
Additional information is information that helps users to assess the content and the relevance of the
dataset they are viewing. The most important means to provide additional information is a so-called readme
file in which topics like the structure of the dataset are addressed.
Questions such as: How many files does the dataset contain, and how are they related to each other? Which
software is needed to access the data? How many versions of the data are contained in the dataset? Answers to
these questions help users to assess and contextualize the dataset.
Other topics to address include but are by no means limited to methodologies used, a detailed summary of the
project in which the data was collected, information about whether and how the data was cleaned, how many
versions of the dataset were made etc.
Information about the provenance and the versioning of your data can, moreover, be added in addition to the
readme file.
If you have covered most of the items on the metadata list (see explanatory text Question 1) you already
provide a satisfactory amount of additional information. Nevertheless, it is important to supplement your
metadata with more contextual information.
This question also relates to the letter R (reusability) of FAIRER.
You should comply with your organization’s retention and disposition schedule as described in the
organization’s File Plan. Different rules apply to different categories of data. The File Plan, and the
retention and disposition schedules therein, are developed and maintained by the organization’s record
keeping services in collaboration with the business units. Consult the disposition rules and ensure that the
data are moved to appropriate long-term curation and preservation platforms at the end of the retention
period. Keep in mind that, in the case of destruction of the data at the end of the retention period, this
is performed only by designated corporate information specialists according to well-defined rules, not by
scientists or the business units.
For more information, see Q1 in the self-assessment tool.
CONTRADICTION
You answered question 1 with “no metadata”. This won’t allow you to answer the following
two questions in F.
Advice to improve Findability
You filled in all or almost all of the optional fields on the Content Description page. This makes your data
findable for other researchers and users. Question 2 concerns the use of standards to describe your data
which enables machines to find and interlink your data.
You filled in some information on the Content Description page. In order to make your dataset more findable
for researchers and users, check again if you can fill in more of the optional fields on the page. The more
metadata you provide, the more findable (and reusable) your data will be.
On the Content Description page, additional fields like Relations to projects, internet pages or
researchers, Format types, Languages, Sources on which the dataset is based can be filled in. Adding
additional, rich metadata to your dataset will help other researchers to find but also to reuse (see
questions under letter R) your data.
Fill in the required fields on the primary information page. Then, go to the Content Description page and
check which additional metadata you could add to make your dataset more findable. The more metadata you
provide, the more findable (and reusable, see question under letter R) your data will be.
Check whether there are standards in your domain or field or generic standards that you can use to describe
your dataset. Use them in the description (metadata). It is possible that there are no standards available
in your field. If that is the case, make use of generic standards.
Using ontologies and taxonomies will improve the automated findability of your dataset. To increase the
findability of your dataset, you can also use domain specific ontologies and linked open data if they are
available.
Be aware that there are generic and domain-specific controlled ontologies, vocabularies and taxonomies.
You included the most important standards to make your dataset findable. Be aware of the fact that, within
your domain, there could be specific controlled vocabularies, taxonomies or ontologies.
Using controlled vocabularies and ontologies will improve the automated findability of your dataset. You can
increase the findability of your dataset even more by also making use of taxonomies, if available for your
specific domain.
Using domain-specific controlled vocabularies and ontologies will improve the automated findability of your
dataset.
Using taxonomies and ontologies to describe your data will improve the automated findability of your
dataset. You can also add terms from (domain-specific) controlled vocabularies to your data description to
increase the findability of your dataset.
Using domain-specific controlled vocabularies and taxonomies will improve the automated findability of your
dataset.
Adding documentation about the dataset will improve its findability. Think of a readme file,
versioning information, or the provenance of the data.
Next to the readme file, consider adding information about the provenance of the data and the versioning.
Consider also adding information about the provenance of your data.
You added rich and detailed information to your dataset by not only providing a readme file but also giving
information about the provenance and the versioning of your data. If seen as an addition to rich and
detailed metadata (question F2), it makes your dataset more findable and reusable.
Consider also adding information about the versioning.
Next to the versioning, consider adding information about the provenance of the data and a readme file.
Consider also adding a readme file to your dataset.
Next to the provenance, consider adding a readme file and information about the versioning.
Findable Question 4 Advice Under Development.
The accessibility of a dataset and its corresponding metadata is essential for researchers to assess and
potentially reuse a dataset.
The questions that you will find under accessibility concern the accessibility of the metadata over
time, meaning that the repository guarantees that the metadata will be available even if the data itself
is no longer available, and the usage license chosen for the dataset.
The latter determines to what extent or under which circumstances the dataset can be accessed.
In the FAIRER Principles, the automated accessibility of metadata and data by machines is also covered
under Accessibility.
There is no question about this technical aspect in this part.
Metadata as described in Question 1 is the description of your data. As such it is associated with your
dataset. For the accessibility but also for the findability of your data it is essential that the metadata
of the dataset remains accessible even if the data itself is no longer available. It is the repository you
deposit in that should ensure this is the case.
With this question we would like to encourage you to check whether the metadata is publicly
accessible even if the dataset is no longer available.
The extent to which you can make your dataset openly available depends on whether your dataset contains
personal data. If it contains personal data, it is clear that you will have to restrict the access to your
dataset. In question six you can further specify which usage license you intend to choose.
Appropriately handle restricted, confidential, and sensitive data (e.g., implement user authentication and
controlled access to the data and/or implement data anonymization and de-identification).
For more information, see Q11 in the self-assessment tool.
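As one illustration of de-identification, direct identifiers can be dropped and replaced with irreversible pseudonyms before sharing. The sketch below is a minimal Python example using pandas and a salted hash; the column names, data, and salt are all hypothetical, and real de-identification must follow your organization's approved procedures and a proper disclosure risk assessment.

```python
# Minimal de-identification sketch: replace a direct identifier with a
# salted, irreversible pseudonym, then drop all direct identifiers.
# Column names, values, and the salt are hypothetical.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "name": ["Jane Doe", "Richard Roe"],
    "health_card_no": ["1234-567-890", "0987-654-321"],
    "measurement": [7.2, 6.8],
})

SALT = "replace-with-a-secret-salt"  # keep separate from the shared data

def pseudonymize(value: str) -> str:
    """Return a short, irreversible pseudonym for an identifier."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

df["participant_id"] = df["health_card_no"].map(pseudonymize)
df = df.drop(columns=["name", "health_card_no"])  # remove direct identifiers
print(df)
```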
For example, depending on the data and on whether or not the data contains personal data (see question 5)
you can choose:
Open Access (everyone): CC0 Waiver, accessible to everyone. Choose this
license if your dataset doesn’t contain personal data and if you are allowed to publish it openly.
Open Access (registered users): accessible to registered users in accordance with the General
Conditions of Use. Choose this license if your data doesn’t contain personal data but you would like
users to identify themselves before downloading your data.
Restricted Access (request permission): with your prior consent, users can view and download data in
accordance with the General Conditions of Use. You can also impose additional conditions. Choose
this license if your data contains personal data.
Restricted Access (archaeology group): accessible to registered archaeologists and archaeology
students in accordance with the General Conditions of Use. Choose this license if you are in the
field of archaeology.
Other Access: the data will be accessible through another repository. Choose this license if your
data is deposited and available in another repository. Contact DANS (info@dans.knaw.nl) if you
would like to use this license.
CONTRADICTION
If your data contains personal data, you won’t be able to choose the CC0 licence for your
dataset.
Advice to improve Accessibility
Adding information to the metadata on your affiliation at the time of your research provides
users with a contact point to consult if they would like to track the availability of your data.
Check the source of your information about the availability of metadata again. Adding
information to the metadata on your affiliation at the time of your research provides users with a contact
point to consult if they would like to track the availability of your data.
Accessible Question 3 Advice Under Development.
You have chosen the right license if your dataset doesn't contain personal data. Personal data is any
information that relates to an identified or identifiable living individual. Different pieces of
information, which together can lead to the identification of a particular person, also constitute personal
data. Personal data that has been anonymised in such a way that the individual is not or no longer
identifiable is no longer considered personal data. For data to be truly anonymised, the anonymisation must
be irreversible.
You have chosen the right license if your data contains personal data. Personal data is
any information that relates to an identified or identifiable living individual. Different pieces of
information, which together can lead to the identification of a particular person, also constitute personal
data. Personal data that has been de-identified, encrypted or pseudonymised but can be used to re-identify a
person remains personal data and falls within the scope of the GDPR.
You have chosen the right license if your dataset cannot be accessed yet due to unpublished papers or an
ongoing project. Your dataset will score low under Accessibility, as
users cannot access it.
You have chosen the right license if your dataset cannot be accessed yet due to unpublished papers or an
ongoing project. You should have a good reason for choosing it; the embargo may last a maximum of 24 months.
Your dataset will score low under Accessibility, as users cannot access it.
Go to legal information and read the information about
the different license types. Then assess which of these licenses apply to your type of data. See also the
website of the European Commission for information about what personal data is: data protection.
If you want other researchers to reuse your data, it is important that your data can be integrated in other
data(sets).
This process of exchanging information between different information systems such as applications, storage
or workflows is called interoperability.
The following actions will improve the interoperability of your data:
Use standardized controlled vocabularies, taxonomies and/or ontologies (see Question 2) both in
describing your data (metadata level) and in your dataset (data level)
Use preferred formats (see Question 7) in your dataset
Link to other/relevant (meta)data that are resolvable online
Add contextual information to your dataset
Add files that explain the context in which the research was performed. You can think of
documentation in the form of notebooks, version logs,
software documentation, documentation about the data collection describing the hypotheses,
project history and objectives of the project,
documentation of methods used such as sampling, data collection process, etc. and
information on access and terms of use
Add documentation about the structure of the dataset, for instance a readme.txt file
Add documentation about the content of the dataset. Provide a description on the data level
such as a codebook
Add scientific links (e.g., links to datasets/research papers used within your project, ORCIDs to
identify people who worked on the project,
persistent identifiers (PIDs) to related research/datasets) between your dataset and other datasets
Preferred formats not only give a higher certainty that your data can be read in the future, they will also
help to increase the reusability and interoperability.
Preferred formats are formats that are widely used and supported by the most commonly used software and
tools. Using preferred formats enables data to be loaded directly
into the software and tools used for data analysis. It makes it possible to easily integrate your data with
other data using the same preferred format.
The use of preferred formats will also make it easier to migrate to a newer format should a preferred
format become outdated.
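For example, a spreadsheet held in a proprietary format can often be converted to an open, preferred format before deposit. The sketch below assumes a hypothetical input file survey.xlsx and uses the pandas library; always check the list of preferred formats published by your repository first.

```python
# Convert a hypothetical proprietary spreadsheet to open formats before
# deposit. Check your repository's preferred-format list for exact targets.
import pandas as pd

df = pd.read_excel("survey.xlsx")     # reading .xlsx requires openpyxl
df.to_csv("survey.csv", index=False)  # plain-text CSV: widely readable
df.to_parquet("survey.parquet")       # open columnar format; needs pyarrow
```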
The more interoperable your dataset is, the better it will be understood and processed by machines.
Complementary information about your dataset can be stored in multiple other datasets.
Therefore, it is essential to add context or contextual knowledge to your dataset by adding meaningful links
to relevant resources. For instance, you should specify if your
dataset builds on any other dataset or whether other, external datasets are needed to complete your dataset.
If present, use Persistent Identifiers
(see Question 2) to link to these (meta)data available online.
In order to increase the interoperability of your dataset, you should enrich its contextual knowledge.
Contextual knowledge is information about how your data(set) was created and
how it is composed. You can describe the contextual knowledge by adding links to all other (meta)data you
have used when you collected your data. With the help of these links,
other researchers will know which other datasets are needed in order to have the complete set of your data.
It is also possible that complementary information is stored somewhere
else or in another dataset. You need to describe all these scientific links by properly citing related
datasets. If these datasets have unique and persistent identifiers, use them for linking.
Advice to improve Interoperability
Using preferred formats does increase the interoperability of your data!
Before depositing your data, try to convert your data (if possible) to preferred formats. Not only will this
increase the interoperability of your dataset but also the accessibility and reusability.
Preferred formats are file formats of which DANS is confident that they will be stable enough in the long
term to ensure accessibility, interoperability and reusability.
To increase the interoperability of your dataset, we advise using preferred formats.
Linking to other metadata will increase the interoperability of your dataset. If you link to other metadata
that is available online, make use of a Persistent Identifier (PID) to refer to this
metadata whenever possible.
It is advisable to link to other (meta)data even if these are not accessible online. You can add a
description of the (meta)data you have linked to in your own dataset.
Linking your dataset with other metadata / datasets will increase the interoperability of your dataset. You
can add a link via a Persistent Identifier (PID). Examples for PIDs are DOI, URN, ORCiD.
Adding contextual information to your dataset will increase the interoperability of your dataset. You can
think of adding references to related / own publications and / or datasets. Moreover, you can add links via
Persistent Identifiers (PIDs). Examples for PIDs are DOI, URN, ORCiD.
It is useful to enrich your metadata with your ORCiD (a persistent digital identifier for people). Adding
links to related publications, if possible with their Persistent Identifier (PID), will also improve the
quality of the contextual information. When referring to other publications, always use a
Persistent Identifier (PID) where possible.
To increase the interoperability of your dataset, add references to relevant publications and related
datasets. Also add Persistent Identifiers such as your ORCiD (a persistent digital identifier for people).
Adding as much contextual information as possible will increase the interoperability of your dataset.
To increase the interoperability of your dataset, add references to relevant related datasets. Also add
Persistent Identifiers such as your ORCiD (a persistent digital identifier for people).
It is useful to enrich your metadata with your ORCiD (persistent digital identifier for people). Also,
adding links to related publications, if possible with their Persistent Identifier (PID), will improve the
quality of the contextual information.
It is useful to enrich your metadata with your ORCiD (persistent digital identifier for people) and other
Persistent Identifiers that relate to your research and your dataset.
It is highly recommended to refer to other publications. If possible, always make use of a Persistent
Identifier (PID) to link to publications. You can also add ORCiDs (persistent digital identifier for people)
to your dataset.
The ultimate goal in making data FAIRER is to foster reusability. Whether or not datasets are reusable by
other researchers is dependent on a number of aspects.
One of the preconditions is that the dataset has a usage license which clarifies under which circumstances
the data may be reused. Because of the importance of this aspect,
the question about the licenses, which you already answered under Accessible, is repeated here. In order to
gain insight into the process of data generation,
it is important to describe the data and metadata in as much detail as possible. Think of questions like:
Under which circumstances did I/we collect the data?
Where does the data come from? Moreover, similar to aspects in Findable, Accessible and Interoperable, it is
important that you meet the standards in your discipline when
describing your data and metadata.
To let other researchers make use of your dataset, it is essential to explain the origin of your data and
what steps you have taken to produce the dataset.
Therefore it is very important to provide provenance information with your dataset. This provenance
information can consist of, for instance, a description of the
origin of the data: How did you collect your data? Did you reuse other data? In that case, add the right
citations to your dataset. Or did you create your own data?
Describe the workflow for the data creation and describe the processing of the data. If you have used any
versioning in your data, add this versioning information to your dataset.
You already answered this question under Accessible. Nevertheless, we consider it important that choosing
the right usage license is highlighted under Reusable,
too, as it is one of the key elements that may or may not allow other researchers to reuse a dataset. For
example, depending on the data and on whether or not the
data contains personal data (see question 5) you can choose:
Open Access (everyone): CC0 Waiver, accessible to everyone.
Choose this license if your dataset doesn’t contain personal data and if you are allowed to publish
it openly.
Open Access (registered users): accessible to registered users in accordance with the General
Conditions of Use.
Choose this license if your data doesn’t contain personal data but you would like users to identify
themselves before downloading your data.
Restricted Access (request permission): with your prior consent, users can view and download data in
accordance with the General Conditions of Use.
You can also impose additional conditions. Choose this license if your data contains personal data.
Restricted Access (archaeology group): accessible to registered archaeologists and archaeology
students in accordance with the General Conditions of Use.
Choose this license if you are in the field of archaeology.
Other Access: the data will be accessible through another repository. Choose this license if your
data is deposited and available in another repository.
It is more likely that other researchers will reuse your data if the metadata conforms to domain-specific standards,
i.e. (meta)data has the same type,
is organised in a standardized way, follows a commonly used or community accepted template, etc. Within
different communities and domains minimal standards
have been described but, unfortunately, not every domain has standards yet. More generic standards that you
could use if there are no domain-specific standards
are described in Question 2. Most of the standards come with instructions on how to use them.
A Data Management Plan (DMP) reduces information technology (IT) footprint and costs; reduces legal and
security risks to data assets; reduces manual processes; reduces redundancies; reduces manpower; improves
planning (e.g., IT provisioning, triggering actions, disposition and retention, etc.); prepares for the
Cloud; improves compliance with mandatory requirements; enables automated services; enables automated
exchange of information between machines; facilitates collaboration; provides a concrete basis for approvals;
supports Open Science, Open Data, Big Data, and Artificial Intelligence (AI); improves data sharing;
safeguards Indigenous data sovereignty; makes FAIRER (Findable, Accessible, Interoperable,
Reusable, Ethical, and Reproducible) data easier to manage; and can be leveraged to build infrastructure.
Machine-actionable Data Management Plans (maDMP) are living documents that relay all the information and
metadata related to a data asset, thereby enabling FAIRER data management throughout the scientific data
lifecycle. This information, but not the actual data asset, is stored in a tiny file (e.g., a text-based
JSON file) in an maDMP repository where it can be found, updated, tested, and queried by other systems.
An maDMP would be the centralized go-to place for any person or machine to find any dataset, and to find any
type of information that they might want to know about a dataset at any point in the data lifecycle.
Machines can mine that information from the maDMP.
If that information exists elsewhere, the maDMP would provide a link to the information, not duplicate it
within the DMP.
Ultimately, standardized, machine-enabled DMPs will reduce the burden on scientists while simultaneously
increasing efficiency in providing information and insight about our data holdings. A minimal JSON sketch of
such a record follows the list below.
General information should include: access, classification level, date created/modified, identifier,
download url, language, and title.
Approvals should include: approval status, approved by, and approval date.
Project description should include: title, start/end dates, partner organization(s), and partner
agreements.
People should include: contacts, authors, contributors, ORCID IDs, affiliations, roles, and a
succession plan.
Resources should include: technical resources, and computing environment including code.
Funding should include: funding type, funding status, funder identifier, cost.
Legal or ethical issues should include: security, privacy, intellectual property, copyright/licence,
and Indigenous considerations.
Dataset metadata should include: a metadata standard.
Dataset should include: title, description, download url, byte size, data type, available until,
last updated, data priority, quality control level, and status.
Dataset distribution means a particular instance of a dataset that has been, or is intended to be,
made available in some fashion. It is important to separate the logical notion of a 'dataset' from
its distributions, of which there may be several.
Host should include: host name, host url, availability, backup type/frequency, certification, geo
location, pid system, storage type, and versioning.
Retention and disposition should include: archival value, legal issues, required destruction,
required perpetual use, retention review trigger date, disposition action authorization, and
disposition action completed date.
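The sketch below shows how a record like this might look when serialized as a small JSON file. It is a hypothetical, heavily abridged example with field names loosely inspired by the RDA DMP Common Standard; it is not a complete or authoritative schema.

```python
# A hypothetical, heavily abridged maDMP serialized as JSON. Field names are
# loosely inspired by the RDA DMP Common Standard but are illustrative only.
import json

madmp = {
    "dmp": {
        "title": "Lake water quality survey DMP",
        "created": "2024-01-15",
        "modified": "2024-06-01",
        "language": "eng",
        "contact": {"name": "Jane Doe", "mbox": "jane.doe@example.org"},
        "dataset": [{
            "title": "Monthly sensor readings",
            "type": "observational",
            "distribution": [{
                "download_url": "https://repository.example.org/ds/123",
                "byte_size": 10485760,
                "format": "text/csv",
                "license": [{"license_ref": "https://creativecommons.org/licenses/by/4.0/"}],
                "host": {"title": "Example Repository", "pid_system": ["doi"]},
            }],
        }],
    }
}

# A file this small can be stored, indexed, and queried by other systems.
with open("madmp.json", "w", encoding="utf-8") as f:
    json.dump(madmp, f, indent=2)
```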
Advice to improve Reusability
Adding provenance information to your dataset will increase the reusability. The more provenance information
you provide, the better. Information about provenance includes but is not limited to: Origin of data,
citations for reused data, workflow description for collecting data (machine readable), processing and
version history of data.
Not only has it become easier to find your data (see under 'F'), your metadata now also meets the
requirements for proper and correct reuse. This is an important step for your data to become FAIRER.
Your data meet domain standards to a certain extent. FAIRER data should at least meet minimal information
standards. Try to look for ways to improve. The more your data and metadata are organized in a standardized
way, the better they are suited for reuse! Always try to keep in mind the user perspective.
Generic metadata standards are widely adopted. Domain standards, however, are much richer in vocabulary and
structure and therefore will help researchers within your discipline to reuse your data. Check whether your
domain has specific metadata standards.
FAIRER data should at least meet minimal information standards. Did you check whether there are metadata
standards available for your domain? The more your metadata and data are organized in a standardized way,
the better they are suited for reuse. Always try to keep in mind the user perspective.
Reusable Question 3 Advice Under Development
Data are ethical when:
(a) Data are collected and managed in compliance with relevant government and professional codes of
conduct, values and ethics, scientific integrity and responsible conduct of research;
(b) Restricted, confidential, and sensitive data are handled appropriately, for example by implementing
user authentication and controlled access to the data and/or data anonymization and
de-identification;
(c) A statement is made as to whether or not Indigenous considerations exist and, where applicable,
Indigenous data sovereignty is respected and data are managed in accordance with
CARE, OCAP, and UNDRIP principles;
(d) Data assets are managed in a manner such that data used as input to Big Data or Artificial
Intelligence applications can be confirmed to be relevant, accurate, and up-to-date, and can be tested
for unintended biases
(TBS Directive, 2019);
(e) Contributor and contact person information is provided.
Transparency and reproducibility are fundamental to the scientific integrity and utility of the data and
code, and are a core component of open science by default and by design. At the same time, it is essential
to respect autonomy, privacy, and confidentiality, and to protect the public, individuals, and communities.
For additional information, see Q11 in the self-assessment tool.
Advice to improve Ethics
Ethical Question 1 Advice Under Development
Indigenous data sovereignty must be respected, and data should be managed in accordance with CARE, OCAP and
UNDRIP principles. Indigenous data management
protocols should aim to ensure community consent, access and ownership of Indigenous data, and protection of
Indigenous intellectual property rights.
For additional information, see Q12 in the self-assessment tool.
Ethical Question 2 Advice Under Development
Ensure that input data and information are complete, high quality, and well documented:
Implement good data governance practices;
Use advanced tools for data validation, cleaning, processing, and analysis (a minimal validation sketch follows below);
Develop methods to identify and mitigate biases in datasets;
Audit the data regularly and make any necessary updates or corrections; and,
Document the methods and code.
For more information, see Q13 in the self-assessment tool.
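As a minimal illustration of scripted validation, simple completeness, range, and duplication checks can be run as part of a regular audit. The sketch below uses pandas; the file name, column names, and thresholds are hypothetical and should be replaced by rules appropriate to your data.

```python
# Minimal data-validation sketch: completeness, range, and duplication
# checks. File name, column names, and thresholds are hypothetical.
import pandas as pd

df = pd.read_csv("survey.csv")

problems = []
if df["measurement"].isna().any():
    problems.append("missing values in 'measurement'")
if not df["measurement"].between(0.0, 14.0).all():
    problems.append("'measurement' outside expected range 0-14")
if df.duplicated().any():
    problems.append("duplicate rows present")

for p in problems:
    print("VALIDATION WARNING:", p)
```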
Ethical Question 3 Advice Under Development
At the beginning of the design phase of your project and again before going into production, use an
algorithmic assessment tool to determine the impact level of an automated decision system (e.g., GC Algorithmic Impact Assessment Tool).
Rely on sound statistical practices (e.g., designing the collection of data; summarizing, processing,
analyzing, interpreting, and presenting data; and developing and deploying models or algorithms).
Test data and information, as well as the underlying model, for unintended biases.
Safeguard against unintentional outcomes.
Verify compliance with institutional and program legislation.
For additional information, see Q13 in the self-assessment tool.
Ethical Question 4 Advice Under Development
Reproducible data and code means that the final data and code are computationally reproducible within
some tolerance interval or defined limits of precision and accuracy, i.e. a 3rd party will be able to
verify the data lineage and processing, reanalyze the data and obtain consistent computational results
using the same input raw data, computational steps, methods, computer software and code, and conditions of analysis
in order to determine if the same result emerges from the reprocessing and reanalysis. “Same result” can
mean different things in different contexts: identical measures in a fully deterministic context, the
same numeric results but differing in some irrelevant detail, statistically similar results in a
non-deterministic context, or validation of a hypothesis. All data and code are made available for
3rd-party verification of reproducibility. Note that reproducibility is a different concept from
replicability. In the latter case, the final published data are linked to sufficiently detailed methods
and information for a 3rd party to be able to verify the results based on the independent collection of
new raw data using similar or different methods but leading to comparable results.
(See also NASEM 2019.)
Advice to improve Reproducibility
Question 20 CONTRADICTION
If a statement is provided, then one and only one of the following 4 options must be
selected.
Reproducible data and code means that the final data and code are computationally reproducible within some
tolerance interval or defined limits of precision and accuracy, i.e. a 3rd-party will be able to verify the
data lineage and processing, reanalyze the data and obtain consistent computational results using the same
input raw data, computational steps, methods, computer code, and conditions of analysis in order to
determine if the same result emerges from the reprocessing and reanalysis. “Same result” can mean different
things in different contexts: identical measures in a fully deterministic context, the same numeric results
but differing in some irrelevant detail, statistically similar results in a non-deterministic context, or
validation of a hypothesis.
For additional information, see the information provided for Q14 in the self-assessment tool.
Reproducible Question 1 Advice Under Development
Computational steps means detailed documentation of the processes applied to the raw data, including any
transformations, filtering, or aggregation. Methods include detailed descriptions of data collection and
processing, statistical techniques used, data mining techniques, machine learning models, and any other
methods used. Computer code is the actual code or scripts used to process and analyze the data. Conditions
of analysis means describing the environment in which the analysis was conducted (e.g., software versions,
hardware specifications, and any other conditions that might influence the results).
For additional information, see Q15 in the self-assessment tool.
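One lightweight way to document the conditions of analysis is to record the software environment at run time. The sketch below uses only the Python standard library and writes a small JSON file; treat it as a starting point and extend it with whatever else might influence your results (hardware details, random seeds, configuration files).

```python
# Record the conditions of analysis: Python version, platform, and the
# versions of all installed packages. Standard library only (Python 3.8+).
import json
import platform
import sys
from importlib import metadata

environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "machine": platform.machine(),
    "packages": {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
    },
}

with open("analysis_environment.json", "w", encoding="utf-8") as f:
    json.dump(environment, f, indent=2)
```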
Reproducible Question 2 Advice Under Development
Reproducible Question 3 Info Under Development
Reproducible Question 3 Advice Under Development
Reproducible Question 4 Info Under Development
Reproducible Question 4 Advice Under Development
Reproducible Question 5 Info Under Development
Reproducible Question 5 Advice Under Development
Reproducibility Checklist
My code is:
A DETERMINISTIC algorithm that, given a particular input, always
produces the same output. The behavior of the algorithm is entirely predictable and does not
involve any randomness or decision-making that can lead to different outcomes on different
executions with the same input.
Yes
No
A NON-DETERMINISTIC algorithm that, given the same input, can produce different
outcomes on different executions. The algorithm makes choices that can vary between runs
because it involves randomness or a selection among multiple possibilities without a
specific rule for which option to choose.
Yes
No
Executable only on a high-performance computing (HPC) system or a supercomputer.
Yes
No
QUANTUM CODE, executable only in a quantum computing environment that uses quantum
bits (qubits) exhibiting quantum mechanical behavior, which can represent and
store information in a more complex way than classical computing, which uses bits (switches
represented by 1s and 0s) as the smallest unit of information.
Yes
No
For all models and algorithms, I provided a link to:
A clear description of the mathematical setting, algorithm, and/or model.
Yes
No
Partial
N/A
A clear explanation of any assumptions.
Yes
No
Partial
N/A
An analysis of the complexity (time, space, sample size) of the algorithm.
Yes
No
Partial
N/A
A conceptual outline and/or pseudocode description.
Yes
No
Partial
N/A
For any theoretical claims, I provided a link to:
A clear statement of the claim.
Yes
No
Partial
N/A
A complete proof of the claim.
Yes
No
Partial
N/A
A clear formal statement of all assumptions.
Yes
No
Partial
N/A
A clear formal statement of all restrictions.
Yes
No
Partial
N/A
Proofs of all novel claims.
Yes
No
Partial
N/A
Proof sketches or intuitions for complex and/or novel results.
Yes
No
Partial
N/A
Appropriate citations to theoretical tools used are given.
Yes
No
Partial
N/A
An empirical demonstration that all theoretical claims hold.
Yes
No
Partial
N/A
All experimental code used to eliminate or disprove claims.
Yes
No
Partial
N/A
For all datasets used, I provided a link to:
A downloadable version of the dataset or simulation environment.
Yes
No
Partial
N/A
The relevant statistics (e.g., the number of examples).
Yes
No
Partial
N/A
The details of train / validation / test splits.
Yes
No
Partial
N/A
An explanation of any data that were excluded, and all pre-processing steps.
Yes
No
Partial
N/A
A complete description of the data collection process for any new data collected, including
instructions to annotators and methods for quality control.
Yes
No
Partial
N/A
A motivation statement for why the experiments are conducted on the selected datasets.
Yes
No
Partial
N/A
A licence that allows free usage of the datasets for research purposes.
Yes
No
Partial
N/A
All datasets drawn from the existing literature (potentially including authors’ own
previously published work) are publicly available.
Yes
No
Partial
N/A
A detailed explanation, where applicable, as to why datasets used are not publicly
available, and why publicly available alternatives were not used.
Yes
No
Partial
N/A
A complete description of the data collection process (e.g., experimental setup, device(s) used,
image acquisition parameters, subjects/objects involved, instructions to annotators, and
QA/QC methods).
Yes
No
Partial
N/A
Ethics approval.
Yes
No
Partial
N/A
For all code used, I provided a link to:
The specification of dependencies.
Yes
No
Partial
N/A
The training code.
Yes
No
Partial
N/A
The evaluation code.
Yes
No
Partial
N/A
The (pre-)trained model(s).
Yes
No
Partial
N/A
A ReadMe file that includes a table of results accompanied by the precise commands to run to
produce those results.
Yes
No
Partial
N/A
Any code required for pre-processing data.
Yes
No
Partial
N/A
All source code required for conducting and analyzing the experiment(s).
Yes
No
Partial
N/A
A licence that allows free usage of the code for research purposes.
Yes
No
Partial
N/A
A document with comments detailing the implementation of new methods, with references to the
paper where each step comes from.
Yes
No
Partial
N/A
The method used for setting seeds (if an algorithm depends on randomness) described in a way
sufficient to allow replication of results.
Yes
No
Partial
N/A
A description of the computing infrastructure used (hardware and software), including
GPU/CPU models; memory; OS; names/versions of software libraries and frameworks.
Yes
No
Partial
N/A
A formal description of evaluation metrics used and an explanation of the motivation for
choosing these metrics.
Yes
No
Partial
N/A
A statement of the number of algorithm runs used to compute each reported result.
Yes
No
Partial
N/A
An analysis of experiments that goes beyond single-dimensional summaries of performance
(e.g., average; median) to include measures of variation, confidence, or other
distributional information.
Yes
No
Partial
N/A
A description of the significance of any improvement or decrease in performance, judged
using appropriate statistical tests (e.g., Wilcoxon signed-rank).
Yes
No
Partial
N/A
A list of all final (hyper-)parameters used for each model/algorithm in each of the
experiments.
Yes
No
Partial
N/A
A statement of the number and range of values tried per (hyper-) parameter during
development, along with the criterion used for selecting the final parameter setting.
Yes
No
Partial
N/A
A ReadMe file with a table of results accompanied by the precise commands to produce those
results.
Yes
No
Partial
N/A
For all reported experimental results, I provided a link to:
The range of hyper-parameters considered, method to select the best
hyper-parameter configuration, and specification of all hyper-parameters used to generate
results.
Yes
No
Partial
N/A
The exact number of training and evaluation runs.
Yes
No
Partial
N/A
A clear definition of the specific measure or statistics used to report results.
Yes
No
Partial
N/A
A description of results with central tendency (e.g. mean) & variation (e.g. error bars).
Yes
No
Partial
N/A
The average runtime for each result, or estimated energy cost.
Yes
No
Partial
N/A
A description of the computing infrastructure used.
Yes
No
Partial
N/A
A document that clearly delineates statements that are opinions, hypotheses, and speculation
from objective facts and results.
Yes
No
Partial
N/A
A description of the range of hyper-parameters considered, method to select the best
hyper-parameter configuration, and specification of all hyper-parameters used to generate
results.
Yes
No
Partial
N/A
Information on sensitivity regarding parameter changes.
Yes
No
Partial
N/A
Details on how baseline methods were implemented and tuned.
Yes
No
Partial
N/A
A clear definition of the specific evaluation metrics and/or statistics used to report
results.
Yes
No
Partial
N/A
A description of results with central tendency (e.g. mean) and variation (e.g. error bars).
Yes
No
Partial
N/A
An analysis of statistical significance of reported differences in performance between
methods.
Yes
No
Partial
N/A
A description of the average runtime for each result, or estimated energy cost.
Yes
No
Partial
N/A
A description of the memory footprint.
Yes
No
Partial
N/A
An analysis of situations in which the method failed.