This tool responds to the GC Data Strategy for the Federal Public Service (2023-2026), Priority 2.2.b (development of a FAIR
principles assessment tool and guidance on the assessment of existing data for reuse). It is suitable for a first,
screening-level assessment of whether a particular data asset is FAIRER.
There are two companion tools that you may also find useful.
Do you work with data? Are you looking to make it future-proof? The
FAIRER data principles will help you!
In 2023, the international consortium Common
Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA)
stated that: “While the FAIR principles have become a guiding technical resource for data sharing, legal
and socio-ethical considerations are equally important for a fair data ecosystem. ... FAIR
data should be FAIRER, including also ethical and reproducible as key components.”
FAIRER principles refer to the Findability, Accessibility, Interoperability, Reusability,
Ethics and Reproducibility of data assets, including related code. Applying these
principles to your data assets will help others to find, verify, cite, and reuse your data
and code more easily.
This tool helps you to assess the FAIRERness of a data asset, and get tips on
how you could increase its value and impact.
The tool is discipline-agnostic, making it relevant to any scientific field.
The FAIRER data assessment consists of 25 questions with additional guidance.
The assessment will take 30-60 minutes, after which you will receive a quantitative
summary of the level of FAIRERness of your data asset and tips on how to improve it.
No information is saved on our servers, but you will be able to save the results of the
assessment, including tips for improvement, to your local computer and add notes for future
reference.
This new tool is inspired by and builds upon the online SATIFYD tool. Please see the
author statement below.
Add any notes you may have here.
These notes will be included when you print and save your results to
your local computer. No information will be saved to our server.
Feel free to capture any thoughts or insights that you'd like to
remember or revisit later.
FINDABLE
Now that you have finished your research project, you are on the brink of depositing your research
data in a trustworthy long-term repository.
Findability is one of the six pillars of the FAIRER Principles.
If you take care of the findability of your data, you will enable search engines to find it and
possibly also link it to related sources on the web.
Moreover, you will improve the exposure of your research and help researchers to find and
potentially reuse your data.
Findability generally comes down to giving a proper description of your dataset.
This description can be divided into three elements:
Rich and detailed metadata and additional information
Persistent links / Persistent Identifiers
Standards: the more standardized terms you use, the more findable your data are. Some
domains have specific standards;
for other domains there are more generic standards, like the Getty Thesaurus of Geographical
Names.
Using standards will enable peers to find your data through (domain-specific) search
engines.
DATA are a set of values of subjects with respect to qualitative or quantitative variables
representing facts, statistics, or items of information in a formalized manner suitable for
communication, reinterpretation, or processing (TBS Policy on
Service and Digital, 2019).
Data are facts, measurements, recordings, records, or observations about the world collected by
scientists and others with a minimum of contextual interpretation. Data may be in any format or
medium taking the form of writings, notes, numbers, symbols, text, images, films, video, sound
recordings, pictorial reproductions, drawings, designs or other graphical representations,
procedural manuals, forms, diagrams, workflow charts, equipment descriptions, data files, data
processing algorithms/code/scripts, or statistical records (CODATA-IRiDiuM (2018) – International Research Data Management glossary).
The word “data” may be used very broadly to comprise data (in the strict sense) and the ecosystem of
digital things that relate to data, including metadata, software and algorithms, as well as physical
samples and analogue artefacts - and the digital representations and metadata relating to these
things (CODATA 2019 – Beijing Declaration on Research Data). There are dozens of other definitions of data that
may be useful depending on the context.
SCIENTIFIC DATA are data that are used by scientists as primary sources to support technical
and regulatory development or scientific enquiry, research, or scholarship, and that are used as
evidence in the scientific process and/or are commonly accepted in the scientific community as
necessary to validate scientific findings and results. All other digital and non-digital content
has the potential of becoming research data. Examples of scientific data include data arising from:
experiments, research and development, ‘citizen science’, surveys, operations, surveillance,
monitoring, field analyzers or data-loggers, instruments, laboratory analyses, inventories, modeling
and simulation output, processed data, and repurposed data (IRiDiuM - International Research Data Management glossary). The scientific nature of the data is demonstrated
when the processes of creating, maintaining, and quality-proofing the data comply with commonly
recognized scientific standards.
Although scientific data share many aspects in common with other types of data (e.g., administrative
data, financial data, business data), their processing frequently requires more complex software and
infrastructure. The data themselves may also be:
more complex (e.g., associated accuracy, precision, detection limits, confidence intervals,
quality assurance/quality control procedures, etc.);
more tightly controlled;
held to higher standards;
retained for a longer period of time, often indefinitely;
documented more carefully and in greater detail (e.g., description of methods used to obtain
measurements);
used as evidence, including in court, and therefore require a higher level of credibility,
reliability, and accessibility.
This work reproduces the online [SATIFYD Data Assessment tool](https://satifyd.dans.knaw.nl/)
for FAIR principles (with minor edits), modifies the layout somewhat, replaces the introduction,
adds a definition of 'Data', adds information on retention and disposition, adds new sections on
‘Ethical’ and ‘Reproducible’ principles, and adds a 'Checklist for Reproducibility’.
All authors reviewed, discussed, and agreed to all aspects of the final
work.
All views and opinions expressed are those of the co-authors, and do not necessarily reflect the
official policy or position of their respective employers, or of any government, agency or
organization.
Cite as: Liu E, Charles D, and Austin CC (2024). FAIRER Aware Data Assessment
Tool.
Metadata is information that describes an object such as a dataset. It gives context to the research data,
providing information about the URL where the dataset is stored,
creator, provenance, purpose, time, geographic locations, access conditions, and terms of use of a data
collection. The extent to which metadata is provided for a dataset
can vary greatly and affects how findable a dataset is. The following
list covers the items that should be addressed
when aiming for sufficient metadata:
Other related people who contributed to the dataset
Date on which the dataset was completed
A description of how the data were created (contextual information)
Target group for the dataset deposited (e.g., scientific disciplines)
Keywords that describe your data (use controlled vocabularies if available for your field)
A licence that clearly states the extent to which the data is accessible
Temporal coverage: the period of time to which the data relate
Spatial coverage: Geographical location of the research area or site
Related datasets, resources like publications, websites etc. (digital or analogue)
File formats used in the dataset
Many of the items on this list also relate to the accessibility,
interoperability and reusability of the dataset.
These aspects will be dealt with in the respective sections of this tool.
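To make this concrete, the items above could be captured in a machine-readable record. The sketch below is a minimal, hypothetical example in Python: every field name and value is illustrative (loosely modelled on common schemas such as DataCite), not a prescribed standard, so adapt it to the metadata schema of the repository you deposit in.

```python
# A minimal, hypothetical metadata record covering several items from the
# list above. Field names and values are illustrative only; use the schema
# required by your repository.
import json

metadata = {
    "title": "Lake water quality survey, 2022-2023",
    "creators": [{"name": "Doe, Jane", "orcid": "https://orcid.org/0000-0000-0000-0000"}],
    "contributors": [{"name": "Roe, Richard", "role": "DataCollector"}],
    "date_completed": "2023-11-30",
    "description": "Monthly sensor readings collected at 12 lake sites.",
    "target_disciplines": ["limnology", "environmental science"],
    "keywords": ["water quality", "dissolved oxygen", "phosphorus"],
    "license": "CC-BY-4.0",
    "temporal_coverage": {"start": "2022-01-01", "end": "2023-12-31"},
    "spatial_coverage": "Lake Ontario, Canada",
    "related_resources": [{"relation": "IsSupplementTo", "identifier": "10.1234/example-doi"}],
    "file_formats": ["text/csv", "application/json"],
}

# Serializing as JSON keeps the record both human- and machine-readable.
print(json.dumps(metadata, indent=2))
```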
You can document your research at the metadata level and at the dataset level. To make your metadata
interoperable and machine-actionable, use standardized controlled vocabularies,
thesauri, and ontologies. At the dataset level you should provide a project description and a dataset
description. For example, add a codebook to make your data understandable
for other researchers, and add provenance information and a data/workflow process description. If you want to
learn more about standards, see the second question under Findable.
To make your (meta)data findable we encourage the use of controlled vocabularies, taxonomies and/or
ontologies.
A controlled vocabulary is an organized and standardized list of terms and can be used to describe
data. Controlled vocabularies are mostly discipline-specific and
therefore very useful for describing your data. By using controlled vocabularies, your metadata becomes much
more understandable for machines and users, which improves the findability of your data.
A taxonomy is a classification of entities in an ordered system. A taxonomy is mostly domain-specific
and is used to identify the content/data by adding terms from the
taxonomy to the content/data description. Identifying content in a structured way gives search engines the
opportunity to optimize their search functionality.
In this way, more relevant data can be found based on a single search query.
Adding taxonomy terms to your dataset description will therefore improve its findability.
An ontology is a formal description of knowledge. This knowledge is described as a set of concepts
and relations between these concepts within a specific domain.
Ontologies are created to organize information into data and knowledge. An ontology attempts to represent
entities, ideas and events, with all their interdependent properties and relations, according to a system of
categories.
By applying existing ontologies to describe your data, your data becomes more understandable for machines,
which improves its findability.
From ontologies it is a small step to linked open data. Making use of linked open data means that
your data is interlinked with other data, that your data is openly accessible and that your data can be
shared within the semantic web.
In this way your data is published in a structured and understandable way. Linked (open) data is described
as a set of triples following the RDF data model. A triple is a basic statement consisting of a subject, a predicate, and
an object.
For example, the subject is “grass”, its corresponding predicate is “has color”, and the object is “green”. By
linking your data to other data, more knowledge, information, and links to your data become available.
This will help to increase the findability of your data.
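The grass/has-color/green triple above can be written out directly as RDF. Here is a minimal sketch using the rdflib Python library; the http://example.org/ namespace is a placeholder used purely for illustration.

```python
# Build the triple ("grass" -- "has color" --> "green") as an RDF graph.
# The example.org namespace is a placeholder, not a real vocabulary.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.grass, EX.hasColor, Literal("green")))

# Turtle is a common, human-readable text serialization for linked data.
print(g.serialize(format="turtle"))
```

In practice you would replace the placeholder namespace with terms from a published vocabulary or ontology, so that machines elsewhere can resolve and interpret your predicates.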
It is true that standardized controlled vocabularies, taxonomies, and ontologies are not equally developed across
disciplines. For some disciplines a broad range of standards is available, whilst others have none yet.
There are, however, general standards, such as the Getty Thesaurus of Geographical Names, which can be used across disciplines.
Additional information is information that helps users to assess the content and the relevance of the
dataset they are viewing. The most important means to provide additional information is a so-called readme
file in which topics like the structure of the dataset are addressed.
Questions such as: How many files does the dataset contain, and how are they related to each other? Which
software is needed to access the data? How many versions of the data are contained in the dataset? Answers to
these questions help users to assess and contextualize the dataset.
Other topics to address include but are by no means limited to methodologies used, a detailed summary of the
project in which the data was collected, information about whether and how the data was cleaned, how many
versions of the dataset were made etc.
Information about the provenance and the versioning of your data can, moreover, be added in addition to the
readme file.
If you have covered most of the items on the metadata list (see explanatory text Question 1) you already
provide a satisfactory amount of additional information. Nevertheless, it is important to supplement your
metadata with more contextual information.
This question also relates to the letter R (reusability) of FAIRER.
You should comply with your organization’s retention and disposition schedule as described in the
organization’s File Plan. Different rules apply to different categories of data. The File Plan, and the
retention and disposition schedules therein, are developed and maintained by the organization’s record
keeping services in collaboration with the business units. Consult the disposition rules and ensure that the
data are moved to appropriate long-term curation and preservation platforms at the end of the retention
period. Keep in mind that, in the case of destruction of the data at the end of the retention period, this
is performed only by designated corporate information specialists according to well-defined rules, not by
scientists or the business units.
For more information, see Q1 in the self-assessment tool.
CONTRADICTION
You answered question 1 with “no metadata”. This won’t allow you to answer the following
two questions in F.
Advice to improve Findability
You filled in all or almost all of the optional fields on the Content Description page. This makes your data
findable for other researchers and users. Question 2 concerns the use of standards to describe your data
which enables machines to find and interlink your data.
You filled in some information on the Content Description page. In order to make your dataset more findable
for researchers and users, check again if you can fill in more of the optional fields on the page. The more
metadata you provide, the more findable (and reusable) your data will be.
On the Content Description page, additional fields like Relations to projects, internet pages or
researchers, Format types, Languages, Sources on which the dataset is based can be filled in. Adding
additional, rich metadata to your dataset will help other researchers to find but also to reuse (see
questions under letter R) your data.
Fill in the required fields on the primary information page. Then, go to the Content Description page and
check which additional metadata you could add to make your dataset more findable. The more metadata you
provide, the more findable (and reusable, see question under letter R) your data will be.
Check whether there are standards in your domain or field or generic standards that you can use to describe
your dataset. Use them in the description (metadata). It is possible that there are no standards available
in your field. If that is the case, make use of generic standards.
Using ontologies and taxonomies will improve the automated findability of your dataset. To increase the
findability of your dataset, you can also use domain specific ontologies and linked open data if they are
available.
Be aware that there are generic and domain-specific controlled ontologies, vocabularies and taxonomies.
You included the most important standards to make your dataset findable. Be aware of the fact that, within
your domain, there could be specific controlled vocabularies, taxonomies or ontologies.
Using controlled vocabularies and ontologies will improve the automated findability of your dataset. You can
increase the findability of your dataset even more by also making use of taxonomies, if available for your
specific domain.
Using domain-specific controlled vocabularies and ontologies will improve the automated findability of your
dataset.
Using taxonomies and ontologies to describe your data will improve the automated findability of your
dataset. You can also add terms from (domain-specific) controlled vocabularies to your data description to
increase the findability of your dataset.
Using domain-specific controlled vocabularies and taxonomies will improve the automated findability of your
dataset.
Adding documentation about the dataset will improve its findability. Think of a readme file,
versioning information, or the provenance of the data.
Next to the readme file, consider adding information about the provenance of the data and the versioning.
Consider also adding information about the provenance of your data.
You added rich and detailed information to your dataset by not only providing a readme file but also giving
information about the provenance and the versioning of your data. If seen as an addition to rich and
detailed metadata (question F2), it makes your dataset more findable and reusable.
Consider also adding information about the versioning.
Next to the versioning, consider adding information about the provenance of the data and a readme file.
Consider also adding a readme file to your dataset.
Next to the provenance, consider adding a readme file and information about the versioning.
Findable Question 4 Advice Under Development.
The accessibility of a dataset and its corresponding metadata is essential for researchers to assess and
potentially reuse a dataset.
The questions that you will find under accessibility concern the accessibility of the metadata over
time, meaning that the repository guarantees that the metadata will be available even if the data itself
is no longer available, and the usage license chosen for the dataset.
The latter determines to what extent or under which circumstances the dataset can be accessed.
In the FAIRER Principles, the automated accessibility of metadata and data by machines is also covered
under Accessibility.
There is no question about this technical aspect in this part.
Metadata as described in Question 1 is the description of your data. As such it is associated with your
dataset. For the accessibility but also for the findability of your data it is essential that the metadata
of the dataset remains accessible even if the data itself is no longer available. It is the repository you
deposit in that should ensure this is the case.
With this question we would like to encourage you to check whether the metadata is publicly
accessible even if the dataset is no longer available.
The extent to which you can make your dataset openly available depends on whether your dataset contains
personal data. If it contains personal data, it is clear that you will have to restrict the access to your
dataset. In question six you can further specify which usage license you intend to choose.
Appropriately handle restricted, confidential, and sensitive data (e.g., implement user authentication and
controlled access to the data and/or implement data anonymization and de-identification).
For more information, see Q11 in the self-assessment tool.
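As one illustration of de-identification, direct identifiers can be dropped and replaced with irreversible pseudonyms before sharing. The sketch below is a minimal Python example using pandas and a salted hash; the column names, data, and salt are all hypothetical, and real de-identification must follow your organization's approved procedures and a proper disclosure risk assessment.

```python
# Minimal de-identification sketch: replace a direct identifier with a
# salted, irreversible pseudonym, then drop all direct identifiers.
# Column names, values, and the salt are hypothetical.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "name": ["Jane Doe", "Richard Roe"],
    "health_card_no": ["1234-567-890", "0987-654-321"],
    "measurement": [7.2, 6.8],
})

SALT = "replace-with-a-secret-salt"  # keep separate from the shared data

def pseudonymize(value: str) -> str:
    """Return a short, irreversible pseudonym for an identifier."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

df["participant_id"] = df["health_card_no"].map(pseudonymize)
df = df.drop(columns=["name", "health_card_no"])  # remove direct identifiers
print(df)
```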
For example, depending on the data and on whether or not the data contains personal data (see question 5)
you can choose:
Open Access (everyone): CC0 Waiver, accessible to everyone. Choose this
license if your dataset doesn’t contain personal data and if you are allowed to publish it openly.
Open Access (registered users): accessible to registered users in accordance with the General
Conditions of Use. Choose this license if your data doesn’t contain personal data but you would like
users to identify themselves before downloading your data.
Restricted Access (request permission): with your prior consent, users can view and download data in
accordance with the General Conditions of Use. You can also impose additional conditions. Choose
this license if your data contains personal data.
Restricted Access (archaeology group): accessible to registered archaeologists and archaeology
students in accordance with the General Conditions of Use. Choose this license if you are in the
field of archaeology.
Other Access: the data will be accessible through another repository. Choose this license if your
data is deposited and available in another repository. Contact DANS (info@dans.knaw.nl) if you
would like to use this license.
CONTRADICTION
If your data contains personal data, you won’t be able to choose the CC0 licence for your
dataset.
Advice to improve Accessibility
Adding information to the metadata on your affiliation at the time of your research provides
users with a contact point to consult if they would like to track the availability of your data.
Check the source of your information about the availability of metadata again. Adding
information to the metadata on your affiliation at the time of your research provides users with a contact
point to consult if they would like to track the availability of your data.
Accessible Question 3 Advice Under Development.
You have chosen the right license if your dataset doesn't contain personal data. Personal data is any
information that relates to an identified or identifiable living individual. Different pieces of
information, which together can lead to the identification of a particular person, also constitute personal
data. Personal data that has been anonymised in such a way that the individual is not or no longer
identifiable is no longer considered personal data. For data to be truly anonymised, the anonymisation must
be irreversible.
You have chosen the right license if your data contains personal data. Personal data is
any information that relates to an identified or identifiable living individual. Different pieces of
information, which together can lead to the identification of a particular person, also constitute personal
data. Personal data that has been de-identified, encrypted or pseudonymised but can be used to re-identify a
person remains personal data and falls within the scope of the GDPR.
You have chosen the right license if your dataset cannot be accessed yet due to unpublished papers or an
ongoing project. Your dataset will score low under Accessibility, as
users cannot access it.
You have chosen the right license if your dataset cannot be accessed yet due to unpublished papers or an
ongoing project. You should have a good reason for choosing it; the embargo may last a maximum of 24 months.
Your dataset will score low under Accessibility, as users cannot access it.
Go to legal information and read the information about
the different license types. Then assess which of these licenses apply to your type of data. See also the
website of the European Commission for information about what personal data is: data protection.
If you want other researchers to reuse your data, it is important that your data can be integrated in other
data(sets).
This process of exchanging information between different information systems such as applications, storage
or workflows is called interoperability.
The following actions will improve the interoperability of your data:
Use standardized controlled vocabularies, taxonomies and/or ontologies (see Question 2) both in
describing your data (metadata level) and in your dataset (data level)
Use preferred formats (see Question 7) in your dataset
Link to other/relevant (meta)data that are resolvable online
Add contextual information to your dataset
Add files that explain the context in which the research was performed. You can think of
documentation in the form of notebooks, version logs,
software documentation, documentation about the data collection describing the hypotheses,
project history and objectives of the project,
documentation of methods used such as sampling, data collection process, etc. and
information on access and terms of use
Add documentation about the structure of the dataset, for instance a readme.txt file
Add documentation about the content of the dataset. Provide a description on the data level
such as a codebook
Add scientific links (e.g., links to datasets/research papers used within your project, ORCIDs to
identify people who worked on the project,
persistent identifiers (PIDs) to related research/datasets) between your dataset and other datasets
Preferred formats not only give a higher certainty that your data can be read in the future, they will also
help to increase the reusability and interoperability.
Preferred formats are formats that are widely used and supported by the most commonly used software and
tools. Using preferred formats enables data to be loaded directly
into the software and tools used for data analysis. It makes it possible to easily integrate your data with
other data using the same preferred format.
The use of preferred formats will also make it easier to migrate to a newer format should a preferred
format become outdated.
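For example, a spreadsheet held in a proprietary format can often be converted to an open, preferred format before deposit. The sketch below assumes a hypothetical input file survey.xlsx and uses the pandas library; always check the list of preferred formats published by your repository first.

```python
# Convert a hypothetical proprietary spreadsheet to open formats before
# deposit. Check your repository's preferred-format list for exact targets.
import pandas as pd

df = pd.read_excel("survey.xlsx")     # reading .xlsx requires openpyxl
df.to_csv("survey.csv", index=False)  # plain-text CSV: widely readable
df.to_parquet("survey.parquet")       # open columnar format; needs pyarrow
```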
The more interoperable your dataset is, the better it will be understood and processed by machines.
Complementary information about your dataset can be stored in multiple other datasets.
Therefore, it is essential to add context or contextual knowledge to your dataset by adding meaningful links
to relevant resources. For instance, you should specify if your
dataset builds on any other dataset or whether other, external datasets are needed to complete your dataset.
If present, use Persistent Identifiers
(see Question 2) to link to these (meta)data available online.
In order to increase the interoperability of your dataset, you should enrich its contextual knowledge.
Contextual knowledge is information about how your data(set) was created and
how it is composed. You can describe the contextual knowledge by adding links to all other (meta)data you
have used when you collected your data. With the help of these links,
other researchers will know which other datasets are needed in order to have the complete set of your data.
It is also possible that complementary information is stored somewhere
else or in another dataset. You need to describe all these scientific links by properly citing related
datasets. If these datasets have unique and persistent identifiers, use them for linking.
Advice to improve Interoperability
Using preferred formats does increase the interoperability of your data!
Before depositing your data, try to convert your data (if possible) to preferred formats. Not only will this
increase the interoperability of your dataset but also the accessibility and reusability.
Preferred formats are file formats of which DANS is confident that they will be stable enough in the long
term to ensure accessibility, interoperability and reusability.
To increase the interoperability of your dataset, we advise using preferred formats.
Linking to other metadata will increase the interoperability of your dataset. If you link to other metadata
that is available online, make use of a Persistent Identifier (PID) to refer to this
metadata whenever possible.
It is advisable to link to other (meta)data even if these are not accessible online. You can add a
description of the (meta)data you have linked to in your own dataset.
Linking your dataset with other metadata / datasets will increase the interoperability of your dataset. You
can add a link via a Persistent Identifier (PID). Examples for PIDs are DOI, URN, ORCiD.
Adding contextual information to your dataset will increase the interoperability of your dataset. You can
think of adding references to related / own publications and / or datasets. Moreover, you can add links via
Persistent Identifiers (PIDs). Examples for PIDs are DOI, URN, ORCiD.
It is useful to enrich your metadata with your ORCiD (a persistent digital identifier for people). Adding
links to related publications, if possible with their Persistent Identifier (PID), will also improve the
quality of the contextual information. When referring to other publications, always use a
Persistent Identifier (PID) where possible.
To increase the interoperability of your dataset, add references to relevant publications and related
datasets. Also add Persistent Identifiers such as your ORCiD (a persistent digital identifier for people).
Adding as much contextual information as possible will increase the interoperability of your dataset.
To increase the interoperability of your dataset, add references to relevant related datasets. Also add
Persistent Identifiers such as your ORCiD (a persistent digital identifier for people).
It is useful to enrich your metadata with your ORCiD (persistent digital identifier for people). Also,
adding links to related publications, if possible with their Persistent Identifier (PID), will improve the
quality of the contextual information.
It is useful to enrich your metadata with your ORCiD (persistent digital identifier for people) and other
Persistent Identifiers that relate to your research and your dataset.
It is highly recommended to refer to other publications. If possible, always make use of a Persistent
Identifier (PID) to link to publications. You can also add ORCiDs (persistent digital identifier for people)
to your dataset.
The ultimate goal in making data FAIRER is to foster reusability. Whether or not datasets are reusable by
other researchers is dependent on a number of aspects.
One of the preconditions is that the dataset has a usage license which clarifies under which circumstances
the data may be reused. Because of the importance of this aspect,
the question about the licenses, which you already answered under Accessible, is repeated here. In order to
gain insight into the process of data generation,
it is important to describe the data and metadata in as much detail as possible. Think of questions like:
Under which circumstances did I/we collect the data?
Where does the data come from? Moreover, similar to aspects in Findable, Accessible and Interoperable, it is
important that you meet the standards in your discipline when
describing your data and metadata.
To let other researchers make use of your dataset, it is essential to explain the origin of your data and
what steps you have taken to produce the dataset.
Therefore it is very important to provide provenance information with your dataset. This provenance
information can consist of, for instance, a description of the
origin of the data: How did you collect your data? Did you reuse other data? In that case, add the right
citations to your dataset. Or did you create your own data?
Describe the workflow for the data creation and describe the processing of the data. If you have used any
versioning in your data, add this versioning information to your dataset.
You already answered this question under Accessible. Nevertheless, we consider it important that choosing
the right usage license is highlighted under Reusable,
too, as it is one of the key elements that may or may not allow other researchers to reuse a dataset. For
example, depending on the data and on whether or not the
data contains personal data (see question 5) you can choose:
Open Access (everyone): CC0 Waiver, accessible to everyone.
Choose this license if your dataset doesn’t contain personal data and if you are allowed to publish
it openly.
Open Access (registered users): accessible to registered users in accordance with the General
Conditions of Use.
Choose this license if your data doesn’t contain personal data but you would like users to identify
themselves before downloading your data.
Restricted Access (request permission): with your prior consent, users can view and download data in
accordance with the General Conditions of Use.
You can also impose additional conditions. Choose this license if your data contains personal data.
Restricted Access (archaeology group): accessible to registered archaeologists and archaeology
students in accordance with the General Conditions of Use.
Choose this license if you are in the field of archaeology.
Other Access: the data will be accessible through another repository. Choose this license if your
data is deposited and available in another repository.
It is more likely that other researchers will reuse your data if the metadata conforms to domain-specific standards,
i.e. (meta)data has the same type,
is organised in a standardized way, follows a commonly used or community accepted template, etc. Within
different communities and domains minimal standards
have been described but, unfortunately, not every domain has standards yet. More generic standards that you
could use if there are no domain-specific standards
are described in Question 2. Most of the standards come with instructions on how to use them.
A Data Management Plan (DMP) reduces information technology (IT) footprint and costs; reduces legal and
security risks to data assets; reduces manual processes; reduces redundancies; reduces manpower; improves
planning (e.g., IT provisioning, triggering actions, disposition and retention, etc.); prepares for the
Cloud; improves compliance with mandatory requirements; enables automated services; enables automated
exchange of information between machines; facilitates collaboration; provides a concrete basis for approvals;
supports Open Science, Open Data, Big Data, and Artificial Intelligence (AI); improves data sharing;
safeguards Indigenous data sovereignty; makes FAIRER (Findable, Accessible, Interoperable,
Reusable, Ethical, and Reproducible) data easier to manage; and can be leveraged to build infrastructure.
Machine-actionable Data Management Plans (maDMP) are living documents that relay all the information and
metadata related to a data asset, thereby enabling FAIRER data management throughout the scientific data
lifecycle. This information, but not the actual data asset, is stored in a tiny file (e.g., a text-based
JSON file) in an maDMP repository where it can be found, updated, tested, and queried by other systems.
An maDMP would be the centralized go-to place for any person or machine to find any dataset, and to find any
type of information that they might want to know about a dataset at any point in the data lifecycle.
Machines can mine that information from the maDMP.
If that information exists elsewhere, the maDMP would provide a link to the information, not duplicate it
within the DMP.
Ultimately, standardized, machine-enabled DMPs will reduce the burden on scientists while simultaneously
increasing efficiency in providing information and insight about our data holdings. A minimal JSON sketch of
such a record follows the list below.
General information should include: access, classification level, date created/modified, identifier,
download url, language, and title.
Approvals should include: approval status, approved by, and approval date.
Project description should include: title, start/end dates, partner organization(s), and partner
agreements.
People should include: contacts, authors, contributors, ORCID IDs, affiliations, roles, and a
succession plan.
Resources should include: technical resources, and computing environment including code.
Funding should include: funding type, funding status, funder identifier, cost.
Legal or ethical issues should include: security, privacy, intellectual property, copyright/licence,
and Indigenous considerations.
Dataset metadata should include: a metadata standard.
Dataset should include: title, description, download url, byte size, data type, available until,
last updated, data priority, quality control level, and status.
Dataset distribution means a particular instance of a dataset that has been, or is intended to be,
made available in some fashion. It is important to separate the logical notion of a 'dataset' from
its distributions, of which there may be several.
Host should include: host name, host url, availability, backup type/frequency, certification, geo
location, pid system, storage type, and versioning.
Retention and disposition should include: archival value, legal issues, required destruction,
required perpetual use, retention review trigger date, disposition action authorization, and
disposition action completed date.
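The sketch below shows how a record like this might look when serialized as a small JSON file. It is a hypothetical, heavily abridged example with field names loosely inspired by the RDA DMP Common Standard; it is not a complete or authoritative schema.

```python
# A hypothetical, heavily abridged maDMP serialized as JSON. Field names are
# loosely inspired by the RDA DMP Common Standard but are illustrative only.
import json

madmp = {
    "dmp": {
        "title": "Lake water quality survey DMP",
        "created": "2024-01-15",
        "modified": "2024-06-01",
        "language": "eng",
        "contact": {"name": "Jane Doe", "mbox": "jane.doe@example.org"},
        "dataset": [{
            "title": "Monthly sensor readings",
            "type": "observational",
            "distribution": [{
                "download_url": "https://repository.example.org/ds/123",
                "byte_size": 10485760,
                "format": "text/csv",
                "license": [{"license_ref": "https://creativecommons.org/licenses/by/4.0/"}],
                "host": {"title": "Example Repository", "pid_system": ["doi"]},
            }],
        }],
    }
}

# A file this small can be stored, indexed, and queried by other systems.
with open("madmp.json", "w", encoding="utf-8") as f:
    json.dump(madmp, f, indent=2)
```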
Advice to improve Reusability
Adding provenance information to your dataset will increase the reusability. The more provenance information
you provide, the better. Information about provenance includes but is not limited to: Origin of data,
citations for reused data, workflow description for collecting data (machine readable), processing and
version history of data.
Not only has it become easier to find your data (see under 'F'), your metadata now also meets the
requirements for proper and correct reuse. This is an important step for your data to become FAIRER.
Your data meet domain standards to a certain extent. FAIRER data should at least meet minimal information
standards. Try to look for ways to improve. The more your data and metadata are organized in a standardized
way, the better they are suited for reuse! Always try to keep in mind the user perspective.
Generic metadata standards are widely adopted. Domain standards, however, are much richer in vocabulary and
structure and therefore will help researchers within your discipline to reuse your data. Check whether your
domain has specific metadata standards.
FAIRER data should at least meet minimal information standards. Did you check whether there are metadata
standards available for your domain? The more your metadata and data are organized in a standardized way,
the better they are suited for reuse. Always try to keep in mind the user perspective.
Reusable Question 3 Advice Under Development
Data are ethical when:
(a) Data are collected and managed in compliance with relevant government and professional codes of
conduct, values and ethics, scientific integrity and responsible conduct of research;
(b) Restricted, confidential, and sensitive data are handled appropriately, for example by implementing
user authentication and controlled access to the data and/or data anonymization and
de-identification;
(c) A statement is made as to whether or not Indigenous considerations exist and, where applicable,
Indigenous data sovereignty is respected and data are managed in accordance with
CARE, OCAP, and UNDRIP principles;
(d) Data assets are managed in a manner such that data used as input to Big Data or Artificial
Intelligence applications can be confirmed to be relevant, accurate, and up-to-date, and can be tested
for unintended biases
(TBS Directive, 2019);
(e) Contributor and contact person information is provided.
Transparency and reproducibility are fundamental to the scientific integrity and utility of the data and
code, and are a core component of open science by default and by design. At the same time, it is essential
to respect autonomy, privacy, and confidentiality, and to protect the public, individuals, and communities.
For additional information, see Q11 in the self-assessment tool.
Advice to improve Ethics
Ethical Question 1 Advice Under Development
Indigenous data sovereignty must be respected, and data should be managed in accordance with CARE, OCAP and
UNDRIP principles. Indigenous data management
protocols should aim to ensure community consent, access and ownership of Indigenous data, and protection of
Indigenous intellectual property rights.
For additional information, see Q12 in the self-assessment tool.
Ethical Question 2 Advice Under Development
Ensure that input data and information are complete, high quality, and well documented:
Implement good data governance practices;
Use advanced tools for data validation, cleaning, processing, and analysis (a minimal validation sketch follows below);
Develop methods to identify and mitigate biases in datasets;
Audit the data regularly and make any necessary updates or corrections; and,
Document the methods and code.
For more information, see Q13 in the self-assessment tool.
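As a minimal illustration of scripted validation, simple completeness, range, and duplication checks can be run as part of a regular audit. The sketch below uses pandas; the file name, column names, and thresholds are hypothetical and should be replaced by rules appropriate to your data.

```python
# Minimal data-validation sketch: completeness, range, and duplication
# checks. File name, column names, and thresholds are hypothetical.
import pandas as pd

df = pd.read_csv("survey.csv")

problems = []
if df["measurement"].isna().any():
    problems.append("missing values in 'measurement'")
if not df["measurement"].between(0.0, 14.0).all():
    problems.append("'measurement' outside expected range 0-14")
if df.duplicated().any():
    problems.append("duplicate rows present")

for p in problems:
    print("VALIDATION WARNING:", p)
```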
Ethical Question 3 Advice Under Development
At the beginning of the design phase of your project and again before going into production, use an
algorithmic assessment tool to determine the impact level of an automated decision system (e.g., GC Algorithmic Impact Assessment Tool).
Rely on sound statistical practices (e.g., designing the collection of data; summarizing, processing,
analyzing, interpreting, and presenting data; and developing and deploying models or algorithms).
Test data and information, as well as the underlying model, for unintended biases.
Safeguard against unintentional outcomes.
Verify compliance with institutional and program legislation.
For additional information, see Q13 in the self-assessment tool.
Ethical Question 4 Advice Under Development
Reproducible data and code means that the final data and code are computationally reproducible within
some tolerance interval or defined limits of precision and accuracy, i.e. a 3rd party will be able to
verify the data lineage and processing, reanalyze the data and obtain consistent computational results
using the same input raw data, computational steps, methods, computer software and code, and conditions of analysis
in order to determine if the same result emerges from the reprocessing and reanalysis. “Same result” can
mean different things in different contexts: identical measures in a fully deterministic context, the
same numeric results but differing in some irrelevant detail, statistically similar results in a
non-deterministic context, or validation of a hypothesis. All data and code are made available for
3rd-party verification of reproducibility. Note that reproducibility is a different concept from
replicability. In the latter case, the final published data are linked to sufficiently detailed methods
and information for a 3rd party to be able to verify the results based on the independent collection of
new raw data using similar or different methods but leading to comparable results.
(See also NASEM 2019.)
Advice to improve Reproducibility
Question 20 CONTRADICTION
If a statement is provided, then one and only one of the following 4 options must be
selected.
Reproducible data and code means that the final data and code are computationally reproducible within some
tolerance interval or defined limits of precision and accuracy, i.e. a 3rd-party will be able to verify the
data lineage and processing, reanalyze the data and obtain consistent computational results using the same
input raw data, computational steps, methods, computer code, and conditions of analysis in order to
determine if the same result emerges from the reprocessing and reanalysis. “Same result” can mean different
things in different contexts: identical measures in a fully deterministic context, the same numeric results
but differing in some irrelevant detail, statistically similar results in a non-deterministic context, or
validation of a hypothesis.
For additional information, see the information provided for Q14 in the self-assessment tool.
Reproducible Question 1 Advice Under Development
Computational steps means detailed documentation of the processes applied to the raw data, including any
transformations, filtering, or aggregation. Methods include detailed descriptions of data collection and
processing, statistical techniques used, data mining techniques, machine learning models, and any other
methods used. Computer code is the actual code or scripts used to process and analyze the data. Conditions
of analysis means describing the environment in which the analysis was conducted (e.g., software versions,
hardware specifications, and any other conditions that might influence the results).
For additional information, see Q15 in the self-assessment tool.
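One lightweight way to document the conditions of analysis is to record the software environment at run time. The sketch below uses only the Python standard library and writes a small JSON file; treat it as a starting point and extend it with whatever else might influence your results (hardware details, random seeds, configuration files).

```python
# Record the conditions of analysis: Python version, platform, and the
# versions of all installed packages. Standard library only (Python 3.8+).
import json
import platform
import sys
from importlib import metadata

environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "machine": platform.machine(),
    "packages": {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
    },
}

with open("analysis_environment.json", "w", encoding="utf-8") as f:
    json.dump(environment, f, indent=2)
```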
Reproducible Question 2 Advice Under Development
Reproducible Question 3 Info Under Development
Reproducible Question 3 Advice Under Development
Reproducible Question 4 Info Under Development
Reproducible Question 4 Advice Under Development
Reproducible Question 5 Info Under Development
Reproducible Question 5 Advice Under Development
Reproducibility Checklist
My code is:
A DETERMINISTIC algorithm that, given a particular input, always
produces the same output. The behavior of the algorithm is entirely predictable and does not
involve any randomness or decision-making that can lead to different outcomes on different
executions with the same input.
Yes
No
A NON-DETERMINISTIC algorithm that, given the same input, can produce different
outcomes on different executions. The algorithm makes choices that can vary between runs
because it involves randomness or a selection among multiple possibilities without a
specific rule for which option to choose.
Yes
No
Executable only on a high-performance computing (HPC) system or a supercomputer.
Yes
No
QUANTUM CODE, executable only in a quantum computing environment that uses quantum
bits (qubits) exhibiting quantum mechanical behavior, which can represent and
store information in a more complex way than classical computing, which uses bits (switches
represented by 1s and 0s) as the smallest unit of information.
Yes
No
For all models and algorithms, I provided a link to:
A clear description of the mathematical setting, algorithm, and/or model.
Yes
No
Partial
N/A
A clear explanation of any assumptions.
Yes
No
Partial
N/A
An analysis of the complexity (time, space, sample size) of the algorithm.
Yes
No
Partial
N/A
A conceptual outline and/or pseudocode description.
Yes
No
Partial
N/A
For any theoretical claims, I provided a link to:
A clear statement of the claim.
Yes
No
Partial
N/A
A complete proof of the claim.
Yes
No
Partial
N/A
A clear formal statement of all assumptions.
Yes
No
Partial
N/A
A clear formal statement of all restrictions.
Yes
No
Partial
N/A
Proofs of all novel claims.
Yes
No
Partial
N/A
Proof sketches or intuitions for complex and/or novel results.
Yes
No
Partial
N/A
Appropriate citations to theoretical tools used are given.
Yes
No
Partial
N/A
An empirical demonstration that all theoretical claims hold.
Yes
No
Partial
N/A
All experimental code used to eliminate or disprove claims.
Yes
No
Partial
N/A
For all datasets used, I provided a link to:
A downloadable version of the dataset or simulation environment.
Yes
No
Partial
N/A
The relevant statistics (e.g., the number of examples).
Yes
No
Partial
N/A
The details of train / validation / test splits.
Yes
No
Partial
N/A
An explanation of any data that were excluded, and all pre-processing steps.
Yes
No
Partial
N/A
A complete description of the data collection process for any new data collected, including
instructions to annotators and methods for quality control.
Yes
No
Partial
N/A
A motivation statement for why the experiments are conducted on the selected datasets.
Yes
No
Partial
N/A
A licence that allows free usage of the datasets for research purposes.
Yes
No
Partial
N/A
All datasets drawn from the existing literature (potentially including authors’ own
previously published work) are publicly available.
Yes
No
Partial
N/A
A detailed explanation, where applicable, as to why datasets used are not publicly
available, and why publicly available alternatives were not used.
Yes
No
Partial
N/A
A complete description of the data collection process (e.g., experimental setup, device(s) used,
image acquisition parameters, subjects/objects involved, instructions to annotators, and
QA/QC methods).
Yes
No
Partial
N/A
Ethics approval.
Yes
No
Partial
N/A
For all code used, I provided a link to:
The specification of dependencies.
Yes
No
Partial
N/A
The training code.
Yes
No
Partial
N/A
The evaluation code.
Yes
No
Partial
N/A
The (pre-)trained model(s).
Yes
No
Partial
N/A
A ReadMe file that includes a table of results accompanied by the precise commands to run to
produce those results.
Yes
No
Partial
N/A
Any code required for pre-processing data.
Yes
No
Partial
N/A
All source code required for conducting and analyzing the experiment(s).
Yes
No
Partial
N/A
A licence that allows free usage of the code for research purposes.
Yes
No
Partial
N/A
A document with comments detailing the implementation of new methods, with references to the
paper where each step comes from.
Yes
No
Partial
N/A
The method used for setting seeds (if an algorithm depends on randomness) described in a way
sufficient to allow replication of results.
Yes
No
Partial
N/A
A description of the computing infrastructure used (hardware and software), including
GPU/CPU models; memory; OS; names/versions of software libraries and frameworks.
Yes
No
Partial
N/A
A formal description of evaluation metrics used and an explanation of the motivation for
choosing these metrics.
Yes
No
Partial
N/A
A statement of the number of algorithm runs used to compute each reported result.
Yes
No
Partial
N/A
An analysis of experiments that goes beyond single-dimensional summaries of performance
(e.g., average; median) to include measures of variation, confidence, or other
distributional information.
Yes
No
Partial
N/A
A description of the significance of any improvement or decrease in performance, judged
using appropriate statistical tests (e.g., Wilcoxon signed-rank).
Yes
No
Partial
N/A
A list of all final (hyper-)parameters used for each model/algorithm in each of the
experiments.
Yes
No
Partial
N/A
A statement of the number and range of values tried per (hyper-) parameter during
development, along with the criterion used for selecting the final parameter setting.
Yes
No
Partial
N/A
A ReadMe file with a table of results accompanied by the precise commands to produce those
results.
Yes
No
Partial
N/A
For all reported experimental results, I provided a link to:
The range of hyper-parameters considered, method to select the best
hyper-parameter configuration, and specification of all hyper-parameters used to generate
results.
Yes
No
Partial
N/A
The exact number of training and evaluation runs.
Yes
No
Partial
N/A
A clear definition of the specific measure or statistics used to report results.
Yes
No
Partial
N/A
A description of results with central tendency (e.g. mean) & variation (e.g. error bars).
Yes
No
Partial
N/A
The average runtime for each result, or estimated energy cost.
Yes
No
Partial
N/A
A description of the computing infrastructure used.
Yes
No
Partial
N/A
A document that clearly delineates statements that are opinions, hypotheses, and speculation
from objective facts and results.
Yes
No
Partial
N/A
A description of the range of hyper-parameters considered, method to select the best
hyper-parameter configuration, and specification of all hyper-parameters used to generate
results.
Yes
No
Partial
N/A
Information on sensitivity regarding parameter changes.
Yes
No
Partial
N/A
Details on how baseline methods were implemented and tuned.
Yes
No
Partial
N/A
A clear definition of the specific evaluation metrics and/or statistics used to report
results.
Yes
No
Partial
N/A
A description of results with central tendency (e.g. mean) and variation (e.g. error bars).
Yes
No
Partial
N/A
An analysis of statistical significance of reported differences in performance between
methods.
Yes
No
Partial
N/A
A description of the average runtime for each result, or estimated energy cost.
Yes
No
Partial
N/A
A description of the memory footprint.
Yes
No
Partial
N/A
An analysis of situations in which the method failed.