This new tool responds to Priority 4.2.b of the GC Data Strategy for the Federal Public Service (2023-2026) (tools to support assessment of data skill needs): it supports self-assessment of a respondent’s knowledge of the FAIRER principles and provides information useful for raising awareness and for training.

There are two companion tools which you may also find useful:




Thank you for your participation!



[Results charts: your Awareness score and your Willingness-to-comply score for each FAIRER dimension (F, A, I, R, E, P).]

Guidance:

Based on your answers, you will find guidance below to improve your awareness of some FAIRER issues.



Summary of your responses:

FAIRER questions

FINDABLE (F)

ACCESSIBLE (A)

INTEROPERABLE (I)

REUSABLE (R)

ETHICAL (E)

REPRODUCIBLE (P)

[In the interactive tool, each question under these headings is followed by a response scale: “How likely are you to comply with this?” — Unlikely … Likely.]


Your Notes

  • This is a place to capture any thoughts or insights that you'd like to remember or revisit later. These notes will be included when you print and save your results to your local computer. No information will be saved to our server.











Do you work with data? Are you looking to make it future-proof? The FAIRER data principles will help you!

In 2023, the international consortium Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) stated that: "While the FAIR principles have become a guiding technical resource for data sharing, legal and socio-ethical considerations are equally important for a fair data ecosystem. ... FAIR data should be FAIRER, including also ethical and reproducible as key components."

FAIRER principles refer to the Findability, Accessibility, Interoperability, Reusability, Ethics and Reproducibility of data assets, including related code. Applying these principles to your data assets will help others to find, verify, cite, and reuse your data and code more easily.

FAIRER-Aware helps you to assess your knowledge of the FAIRER Principles, and better understand how making your data and code FAIRER will increase their value and impact.

The tool is discipline-agnostic, making it relevant to any scientific field.

FAIRER-Aware consists of 15 questions with additional guidance to help you make your data assets as FAIRER as possible.

The self-assessment will take 30-60 minutes, after which you will receive a quantitative summary of your awareness level and tips on how you can improve your FAIRER skills. No information is saved on our servers, but you will be able to save the results of the assessment, including tips for improvement, to your local computer and add notes for future reference.

This new tool is inspired by and builds upon the FAIRsFAIR online tool. Please see the author statement below.

Version 0.1.0 (alpha) 2024-03-13

CRediT Author statement

Data(set)

FAIRER-Aware addresses data, but other research outputs can be made FAIRER as well. If you are interested in what FAIRER means for research software, take a look at the

Willingness to comply

Based on what you have just learned, how likely is it that you will follow this FAIRER practice in the future? To answer this question, imagine a situation where there are no barriers (e.g. financial, technical, practical) for you to put this into practice.

If you are using this tool as part of a course, please enter the identification code provided by your trainer here.

If you don’t work in a specific research domain, please choose “Other”. Please note that you can select multiple research domains.

What does this mean?

The first requirement for findability is to ensure your data don’t get lost. You do this by storing and protecting the data in compliance with your enterprise recordkeeping policy. This means that the data must be saved in an approved corporate or institutional repository for the period dictated by your organization’s retention schedule and then either destroyed or archived. Keep in mind that research, scientific, and monitoring data may need to be retained indefinitely to preserve scientific integrity. Making sure the data are not lost is not the same thing as ensuring that they are findable; you must ensure that they remain findable even when they move to another repository or archive. This is done by providing a persistent identifier (PID).

A persistent identifier is a long-lasting reference to a resource. The data(set) you deposit in a data repository should be assigned a globally unique, persistent and resolvable identifier (PID) so that both humans and machines can find it. Persistent identifiers are maintained and governed so that they remain stable and direct the users to the same relevant object consistently over time. Examples of PIDs include Digital Object Identifier (DOI), Handle, and Archival Resource Key (ARK).

Why is this important?

If your data(set) or metadata does not have a PID, you run the risk of "link rot" (also known as “link death”). When your data(set) or metadata is moved, updated to a new version, or deleted, older hyperlinks will no longer refer to an active page. Without a PID, others will not be able to find or reuse your data(set) or metadata in the long-term.

How to do this?

Maintain compliance with your organization’s retention and disposition schedule as described in the organization’s File Plan. Different rules apply to different categories of data. The File Plan, and the retention and disposition schedules therein, are developed and maintained by the organization’s recordkeeping services in collaboration with the business units. Consult the disposition rules and ensure that the data are moved to appropriate long-term curation and preservation platforms at the end of the retention period. Keep in mind that destruction of data at the end of the retention period is performed only by designated corporate information specialists according to well-defined rules, not by scientists or the business units. If you have concerns about the File Plan rules, reach out to your recordkeeping unit.

When you upload your data(set) or metadata to a data repository, the data repository (or other service providers) usually assigns a PID. Repositories ensure that the identifier continues to point to the same data or metadata, according to access terms and conditions you specified.

There are many different types of PIDs, each with their own advantages, disadvantages, and disciplines in which they are typically used. Generally speaking, the data repository will have thought about these aspects before deciding which PID type to use. Your organization may maintain a data repository and issue persistent identifiers, including DOIs. In case you have to choose the PID type yourself, you can visit the Knowledge Hub on the PID Forum for guidance. Some disciplines or organisations also provide tools to help you make this choice; see for example this Persistent Identifier Guide for cultural heritage researchers. Once you have chosen a PID type, you can search for data repositories providing that specific PID in registries such as Re3data or FAIRsharing (see related databases). When choosing a repository and a persistent identifier, keep in mind that identifiers are only as persistent as the service providing them.

Not all data you produce during your research will need a PID. In general, those that underpin published findings or have longer term value are worth assigning a PID. If in doubt about which data should be allocated a PID, speak to your local research data management support team or the data repository.

Want to know more?

Did you know that a PID can refer to any kind of resource? Besides publications or datasets, a PID can also refer to, for example, a person, a scientific sample, a funding body, a set of geographical coordinates, an unpublished report, or a piece of software. Depending on what you find important to link to, you might want to consider using a PID for one or more of these resource types.

Persistent identifiers may point to a data file, a web service response that contains data values, or ideally to an online page that contains metadata for context and the link to access the actual data or details about how to request access. The technical process of translating the identifier to a location is called ‘resolving’ an identifier.
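To make the resolving step concrete, here is a minimal Python sketch (using the requests library) that resolves a DOI through doi.org and then asks the same PID for machine-readable metadata via HTTP content negotiation. The DOI shown is a placeholder; substitute a real one.

```python
# Minimal sketch: resolving a DOI and retrieving machine-readable metadata
# for it via HTTP content negotiation. The DOI below is a placeholder.
import requests

doi = "10.1234/example-doi"  # hypothetical DOI, for illustration only

# Resolving: doi.org redirects the persistent identifier to the current
# landing page, wherever the data(set) happens to live today.
landing = requests.get(f"https://doi.org/{doi}", timeout=30)
print("Resolves to:", landing.url)

# Many PID providers also serve metadata for the same identifier when asked
# for it in a machine-readable format via the HTTP Accept header.
meta = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
if meta.ok:
    record = meta.json()
    print(record.get("title"), record.get("publisher"))
```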

What does this mean?

Metadata is “data about data”, meaning that this type of data only contains information that describes or characterizes other data. There are different types of metadata that underlie different FAIRER aspects. The focus of this question is on making sure your data(set) has a set of minimum descriptive information elements (also known as “discovery metadata”) to adequately communicate the content of your data(set) to others.

Why is this important?

By providing the minimum descriptive information about your data(set), you will be assured that potential users, including those from other research domains, will be able to find and cite your data(set).

It is worth spending time on providing a good description of your data(set). By sharing more details, you will make your data not just findable but also easier to understand for others. The more extensive, accurate, and clear the discovery metadata, the easier it is for potential reusers to determine whether or not they want to access your data(set).

How to do this?

When depositing your data(set), the data repository will show the metadata fields they support. The more fields you fill in, the easier it will be for others to find your data(set). You can use the following list as guidance on which minimum metadata elements to include:

  • Descriptive information about the data(set) (e.g., creator, title, publisher, creation and publication date, summary and keywords describing the data).
  • The unique, persistent, and resolvable identifier (PID) for the data(set).
  • Data content (e.g., resource type, variable(s) measured or observed, method, data format and size) to accurately reflect the deposited data and increase its reusability.
  • Access rights (e.g., information on how to request access in case the data(set) cannot be shared openly for ethical, legal, or commercial reasons). You should also include information about the rights holder and contact details here (see Q4).
  • Meaningful and explicit links to other research outputs (e.g., prior versions of the data(set), other relevant data(sets), related publications, data source, relevant people (data creators or collectors), relevant organisations (the funder or host institution), ideally with their PIDs) to increase the interoperability and the potential for reuse of your data(set).
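To make the list above concrete, here is a minimal sketch of a discovery-metadata record expressed as JSON from Python. The field names are illustrative (loosely modelled on Dublin Core and DataCite elements) and every value is a placeholder; use the fields your chosen repository actually supports.

```python
# A minimal, hypothetical discovery-metadata record covering the elements
# listed above. Field names are illustrative, not a formal standard.
import json

record = {
    "title": "Hourly water temperature, Lake Example, 2020-2023",
    "creator": "Jane Researcher",
    "publisher": "Example Institutional Repository",
    "created": "2023-12-31",
    "published": "2024-03-01",
    "summary": "Hourly temperature measurements from moored sensors.",
    "keywords": ["limnology", "water temperature", "time series"],
    "identifier": "https://doi.org/10.1234/example-doi",  # PID (placeholder)
    "resource_type": "dataset",
    "variables": ["water_temperature_celsius"],
    "format": "text/csv",
    "size_mb": 42,
    "access_rights": "public",
    "rights_holder": "Example Institution",
    "contact": "data-steward@example.org",
    "related_identifiers": [
        # Links to related outputs, ideally also given as PIDs.
        {"relation": "IsSupplementTo",
         "identifier": "https://doi.org/10.1234/related-paper"},
    ],
}
print(json.dumps(record, indent=2))
```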

What does this mean?

This question refers to the capability to make the metadata accessible online in a standard and machine-readable format. Machine-readability means that the data is presented in a structured format that computers can read and process. Facilitating this process is a responsibility of the data repository you deposit your data(set) in.

Why is this important?

By ensuring that the metadata describing your data(set) is machine-readable, it will be findable by the systems that collect (also known as harvesting) and aggregate data for search engines or databases (e.g., Google Search, Web of Science, or a university library collection). This improves your chances of having your data(set) cited and reused, because it will reach a larger audience. Without machine-readable metadata, your data(set) will only be found by people searching the data collection of the specific repository you deposited in, or by those who have a direct link to your data(set).

How to do this?

Most digital data repositories will have some kind of protocol for making metadata machine-readable. Two protocols that support FAIRER (because they are open, free, and universally implementable) are OAI-PMH and REST API. Therefore, even though it is the responsibility of the data repository to carry out this task, it is your responsibility to select the right data repository to meet this requirement. You can search for such a data repository on a registry such as Re3data by filtering on ‘API’.
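As an illustration, the sketch below harvests Dublin Core records from a repository's OAI-PMH endpoint in Python. 'ListRecords' is a standard OAI-PMH verb and 'oai_dc' (unqualified Dublin Core) is the metadata format every OAI-PMH repository must support; the endpoint shown is Zenodo's public OAI-PMH base URL at the time of writing, so substitute your repository's own.

```python
# Minimal sketch: checking that a repository exposes machine-readable
# metadata over OAI-PMH, the way a harvester would.
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://zenodo.org/oai2d"  # example endpoint

resp = requests.get(
    OAI_ENDPOINT,
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
    timeout=30,
)
root = ET.fromstring(resp.content)

# Titles of the harvested records, read from the Dublin Core namespace.
DC = {"dc": "http://purl.org/dc/elements/1.1/"}
for title in root.iterfind(".//dc:title", DC):
    print(title.text)
```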

Want to know more?

Metadata may also be exposed as structured data embedded within a webpage. This makes the metadata more machine-actionable. The Schema.org standard is one approach that helps to ensure indexing by web search engines such as Google and Bing. This facilitates the Google Dataset Search which you can use to check if data hosted by the repository of your choice is indexed.

If you are interested in understanding more about how a machine reads a webpage, you can enable the developer view in your web browser (instructions vary between browsers) and take a look at a website. Zenodo is an example of a website that uses the Schema.org metadata standard.
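For example, the following Python sketch pulls any Schema.org JSON-LD blocks out of a landing page, which is essentially what a search-engine harvester reads. The record URL is a placeholder, and the parser assumes each JSON-LD script body arrives as a single text chunk.

```python
# Minimal sketch: extracting embedded Schema.org JSON-LD metadata from a
# data repository landing page.
import json
import requests
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Assumes the whole script body is delivered in one chunk.
        if self._in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

page = requests.get("https://example.org/record/1234", timeout=30)  # placeholder
parser = JSONLDExtractor()
parser.feed(page.text)
for block in parser.blocks:
    print(block.get("@type"), "-", block.get("name"))
```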

What does this mean?

Ideally, data(sets) should be public domain and openly accessible without restrictions. However, there can be legitimate reasons not to share data (e.g., privacy protection, ethical, legal, or commercial constraints). As such, it is your responsibility to be aware of what can be shared, with whom and when, and to take appropriate steps to ensure that the data is as open as possible and as closed as necessary.

Why is this important?

As explained in Q2, the metadata describing your data(set) should include details about who can access the data(set), as well as any conditions that must be met in order to gain access. By clearly specifying these details, you will be assured of the most appropriate access level for your data(set). A data(set) can have a public, embargoed, restricted, or closed access level. You can read more about access levels in the ‘Want to know more?’ section.

In some cases, you may even need to apply a variety of access levels to different parts of the same data(set). It is important to consider these issues early on for all your data(sets), so that the access levels are clearly defined before you upload the data(set) to a data repository.

Data should also be accompanied by a clear licence so that other people can legally reuse it. It is recommended to add a licence to all kinds of data(sets) and access levels. Without an explicit licence or waiver, potential reusers do not have a clear sense of what can be done with your data. It is easiest to use a standard licence for your data(set), since the many available types cover most basic legal situations (e.g., Creative Commons). It is also possible to create your own bespoke licence, though it is recommended to seek help from a legal expert if you wish to pursue this. Your chosen licence should also be part of your (machine-readable) metadata to effectively inform any human or machine that comes across your data(set) about what they’re allowed to do with it.

How to do this?

You should determine the access level(s) and licence of your data(set) before depositing in a data repository. If you are not sure about the right access level or the licence for your data(set), check the institutional or funder policies or speak to your local research data management support team. Also be sure to choose a data repository that supports your desired access level and licence. You can search for such a data repository on a registry such as Re3data, by filtering on ‘Database access’ and ‘Database license’.

If you are depositing data associated with a publication, it is recommended to include a data availability statement in your publication. This statement communicates to readers of your publication where the data(set) is available and how it can be accessed. It also includes a link to the data(set). Most journals, especially those with a data sharing policy, will have templates for data availability statements available.

Want to know more?

Access levels can be categorised as follows:

  • Public access refers to data which everyone can access without any restrictions.
  • Restricted access refers to data that one can access under certain conditions (e.g. because of commercial, sensitive, or other confidentiality reasons or the data is only accessible via a subscription or a fee). Restricted data may be available to a particular group of users or after permission is granted. For restricted data, the metadata should include the conditions of access to the data (e.g., point of contact or instructions to access the data). In case you need to restrict access to your data(set), you should check if your data repository of choice supports access requests.
  • Closed access refers to data that is not made publicly available and for which only metadata is publicly available.
  • Embargoed access refers to data that will become accessible – either publicly or under restrictions – at a specific date, which should be specified in the metadata. For example, a researcher may release their data only after having published their findings from the data.

You can learn more about standard and bespoke licenses in this “How to License Research Data” guide from DCC.

What does this mean?

Even if a data(set) is no longer available, published references and links should always point to its metadata for transparency and integrity. In other words, this question is about whether the metadata will be preserved even when the data(set) it describes may no longer be available.

Why is this important?

Even when your data(set) is not accessible, the metadata in itself can be very valuable for future reuse, especially when it is rich enough. Rich metadata means that it is elaborate enough to adequately inform someone with no prior knowledge about your data(set). Keeping metadata accessible is a way to assure that your work can still be of use for future (replication) studies. If you want others to continue to discover and cite your work, it is essential to make sure the metadata describing your data(set) always remains available. As long as this is the case, links to your data(set) will not become invalid, but will simply point to the metadata only.

How to do this?

Whether or not a data repository provides continued access to metadata depends on the repository’s preservation practices, which are usually documented in a preservation plan, policy, or strategy. You should check whether the data repository you deposit your data(set) in has a policy of maintaining metadata when data are removed. Make sure the remaining metadata includes the persistent identifier and a statement on why the data(set) is no longer available.

Want to know more?

After the data retention period has passed, there are different justifiable reasons not to keep (all of) your data(set) available over time:

In case you have used consent forms which specified a certain preservation period after which your data must be destroyed, we recommend:

  • to update your metadata record after destroying the rest of the data(set)
  • to indicate why the data(set) is no longer available
  • to make sure the metadata is rich enough for others to understand what the data(set) was about

You should discuss with your local research data management support whether it is necessary to incorporate a data destruction statement in your consent form. Try to aim for the longest data retention and sharing period possible for you. If data destruction is unavoidable, make sure you choose a data repository that can adequately handle this.

Maintaining (all of) your data(set) may be too costly in the long term. By removing (parts of) your data(set), you can lower these costs and keep your data(set) accessible for longer. Make sure the remaining (meta)data is rich enough for others to understand what the data that is no longer accessible was about.

Data retention standards vary greatly in scientific fields. You should follow the data retention guidelines from the field you work in. In some cases, this could mean that you preserve a smaller part of your data(set) or that you preserve it for a shorter period of time. Do make sure that your metadata is rich enough both for humans and machines to understand it.

What does this mean?

There are many different ways you can describe the same information when filling out the metadata for your deposit. To prevent ambiguity and facilitate better findability, interoperability, and machine-readability, you should use a controlled vocabulary to enter your metadata.

Controlled vocabularies are lists of terms that are created for specific uses or contexts. They are a type of semantic artefact and can take the form of, for example, an ontology, thesaurus, or taxonomy. Each type of vocabulary comes with a different degree of sophistication (e.g. in their level of expressiveness, structure, and inferential power).

Why is this important?

Using controlled vocabularies improves the discovery, linking, understanding, and reuse of research data. Controlled vocabularies in metadata enhance data search because people will not have to guess the exact terms you used to describe your data(set) in order to find it. They also facilitate better interoperability of data from different sources, since it will be clear that data(sets) using the same terms cover the same information.

Data repositories should provide support for the use of controlled vocabularies in metadata by offering relevant functionalities. They will often display which controlled vocabularies they support on their website. When controlled vocabularies are included in the metadata, your data repository of choice may be able to publish the metadata in machine-readable format, thus greatly increasing their machine actionability.

How to do this?

Controlled vocabularies are often domain-specific. It is recommended to use the vocabulary that is used most often in your field or specific line of research (see Q8). If you are unsure about this, you can contact your research support staff or look up some data(sets) from colleagues in your field.

You can find data repositories supporting your preferred controlled vocabulary in registries such as FAIRsharing or Re3data by filtering on ‘metadata standards’. Below is a non-exhaustive list of some registries or look-up services for vocabularies. You can use these resources to search for a vocabulary that covers terms relevant for your research.

Want to know more?

If your field has no common controlled vocabularies (yet), you can search for one you personally find most suitable. It is recommended to do this in collaboration with your research support staff. Before using a controlled vocabulary, you should establish the following:

  • Whether it is available online and is open to other users
  • Whether it contains the relevant terms for your line of research
  • Whether you know who curates and makes the vocabulary available to other users
  • Whether it is a nationally or internationally recognized vocabulary and whether it is used extensively
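As a small illustration of what this looks like in practice, the sketch below records keywords as resolvable term URIs from a controlled vocabulary rather than as free text. The vocabulary name and URIs are hypothetical; substitute terms from the vocabulary used in your field.

```python
# Minimal sketch: keywords as controlled-vocabulary terms instead of free
# text. The vocabulary and URIs below are hypothetical placeholders.
keywords = [
    {
        "label": "water temperature",
        "scheme": "Example Environmental Thesaurus",    # hypothetical vocabulary
        "uri": "https://vocab.example.org/term/00123",  # hypothetical term URI
    },
    {
        "label": "lake",
        "scheme": "Example Environmental Thesaurus",
        "uri": "https://vocab.example.org/term/00456",
    },
]

# Because each keyword resolves to one well-defined concept, two data(sets)
# tagged with the same URI are unambiguously about the same thing, even if
# their free-text descriptions differ.
for kw in keywords:
    print(f'{kw["label"]} -> {kw["uri"]}')
```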

What does this mean?

Data provenance (also known as lineage) is a type of metadata that represents the history of your data(set), including information about the people, entities, and processes involved in the data creation. You can also describe and/or link previous versions of your data(set) in the provenance information. Aside from conveying important information about your data(set) to potential reusers, you can also communicate how you wish to be cited.

Why is this important?

By providing provenance information about the data(set) (e.g. sources, date, contributor, version), you make it possible for users to determine whether to trust the authenticity of the data(set) and enable its (re)use. It is a transparent way to communicate why, how, when, where, and by whom your data(set) was created.

How to do this?

The provenance information that is necessary for your data(set) depends on the data type (e.g., measurement, observation, derived data, or data product) and the research domain of your work. For that reason, it is difficult to define a finite set of provenance records adequate for all domains. At a minimum, it is recommended that the following provenance properties of data generation or collection be supplied as part of the metadata (this is not an exhaustive list):

  • Sources of data generation or collection (e.g., model, instrument, methodology)
  • The date of data creation or collection
  • The contributor(s) involved
  • Data versioning information (indicate relations to other versions and describe changes)

Want to know more?

There are various ways in which provenance information may be included in a metadata record. Some of the provenance properties (e.g., instrument, contributor) may be best represented using persistent identifiers (such as DOI for data, ORCID for researchers, GRID for organisations). This way, humans and machines can retrieve more information about each of the properties by resolving the PID.

Alternatively, the provenance information can be given as a linked provenance record, expressed explicitly using a controlled vocabulary (e.g., PROV-O, PAV, or the Vocabulary of Interlinked Datasets (VoID)). For further information on which provenance data is necessary for the research community of your domain, contact your research support staff.
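As an illustration, the following Python sketch (using the rdflib library, which ships with a PROV-O namespace) expresses the core who/how/when provenance links for a data(set) as a machine-readable record. All identifiers are placeholders.

```python
# Minimal sketch: basic provenance as a linked PROV-O record with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import PROV, RDF, XSD

EX = Namespace("https://example.org/")  # hypothetical namespace

g = Graph()
g.bind("prov", PROV)

dataset = URIRef("https://doi.org/10.1234/example-doi")       # placeholder DOI
activity = EX["field-campaign-2023"]                          # generating activity
researcher = URIRef("https://orcid.org/0000-0000-0000-0000")  # placeholder ORCID

g.add((dataset, RDF.type, PROV.Entity))
g.add((activity, RDF.type, PROV.Activity))
g.add((researcher, RDF.type, PROV.Agent))

# The core provenance links: how, by whom, and when the data(set) was made.
g.add((dataset, PROV.wasGeneratedBy, activity))
g.add((dataset, PROV.wasAttributedTo, researcher))
g.add((dataset, PROV.generatedAtTime,
       Literal("2023-11-30T00:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```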

What does this mean?

To ensure that your metadata can be broadly shared and understood within your research domain, we recommend using a community-endorsed standard. A standard is any semantic artefact (see Q6), format (see Q3) or other information structure that is widely used in a specific group. If there are any agreements on best practices or standards in your community, you should always give preference to this option.

Community standards may be formally recognized or less formal in their occurrence, but they will always be used and endorsed by the majority of the given community. What “community” means in this context can vary. It is highly recommended to follow domain- or discipline-specific standards, but there are also general purpose standards and smaller scale standards for sub-disciplines or organisations.

Why is this important?

The main purpose of a community standard is to prevent misunderstandings due to ambiguity by making sure everyone is speaking the same language. By following the specifications of a community-endorsed standard for your metadata, you make sure that others can clearly understand and (re)use your data. Data that is described in the same way as your own, or as that of others in the field, is easier to understand and reuse. Moreover, using a clearly defined structure and wording in your metadata can facilitate replication studies or meta-analyses.

How to do this?

You can use metadata registries such as the RDA or DCC to find more information on community-endorsed standards. For example, some well-established metadata standards are:

  • Dublin Core Metadata Initiative (DCMI): general-purpose standard for which elements and vocabulary to use in metadata.
  • DDI: standard for the social, behavioral, economic, and health sciences.
  • Darwin Core: a body of standards for the life sciences.
  • ABCDEFG: an extension of the standard for biological sciences to support the geosciences.
  • NeXus: a data format standard specifically for X-ray, neutron, and muon science.

A domain- or discipline-specific data repository should be your preferred choice for depositing your data(set). Such a repository will support a metadata standard for your relevant community. You can search for a suitable data repository by using Re3data and browsing on domain or subject.

In case your domain has limited standards or standards that are still under development, you should follow general purpose standards. In case no dedicated data repositories are available for your domain (yet), we recommend contacting a research data management expert in your area to identify possible solutions.

Want to know more?

If you are interested in learning how metadata elements are converted into machine-readable files, you can take a look at the “Dublin Core Generator”. This tool can transform your Dublin Core metadata into a machine-readable format.
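As a minimal sketch of what such a conversion produces, the Python code below renders a few Dublin Core elements as namespaced XML, the kind of machine-readable output a generator tool creates for you. The element values are placeholders.

```python
# Minimal sketch: rendering Dublin Core metadata elements as XML.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
ET.register_namespace("dc", DC_NS)

elements = {
    "title": "Hourly water temperature, Lake Example, 2020-2023",
    "creator": "Jane Researcher",
    "date": "2024-03-01",
    "type": "Dataset",
}

root = ET.Element("metadata")
for name, value in elements.items():
    ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value

print(ET.tostring(root, encoding="unicode"))
```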

Many standards undergo continuous evaluation and development. The community-endorsed standards aim to be relevant for the users in the community which is why many of the organisations and initiatives that develop and maintain these standards seek community feedback. If you are interested in the topic of community standards and want to share your experiences or provide input, you can visit the website of any specific standard to get in touch.

What does this mean?

File formats refer to methods for encoding digital information. You can recognize a file format by the extension (the three or four letters at the end of a filename). A file can either be open or proprietary in its format. It is highly recommended to use an open file format for your data(set), so that others can easily access and reuse it.

Why is this important?

A proprietary file format is created and owned by a company and often accompanies specific software. This means that people who don’t have a licence for that software will not be able to open your file(s). Moreover, the development and updates of such a file format depend on the owner of the company: older versions of the format may not be supported in the future, or the format may cease to exist entirely. This doesn’t facilitate long-term preservation, as people may not be able to open your file(s) in the future.

Open file formats are publicly accessible and often undergo fewer changes over time. They can be opened by anyone without a licence, which means your data(set) can be used more widely.

How to do this?

Data should be made available in a recommended file format that is accepted by the research community to enable data sharing, interoperability, and reuse. It is also important that the file format is supported by the data repository to enable long-term preservation. Repositories usually display an overview of their supported or preferred (and sometimes also non-preferred) file formats on their website, and will generally have a couple of different options per data type (see for example the UK Data Service or DANS). If you deposit your data in such a file format, the repository should have the necessary information, procedures, and expert knowledge to migrate your file(s) to a new format once the old one becomes outdated.

If there is no open file format available for your data type, you may use a proprietary format. Try to find formats that are widely used and well-established, as the chances of those formats becoming obsolete are much smaller.

Want to know more?

The choice you make for short-term data processing may differ from the choices you make for long-term data preservation. During your research, you may want to adhere to the proprietary file formats that suit your chosen software or measurement instrument. To meet this FAIRER requirement, you should convert your file(s) before depositing in a data repository. The software will often have some options built-in for this. During this process, it is important to be mindful of the risk of data loss during conversion. If you are unsure about which file format to convert your data to or how to do it, you should consult your data repository of choice or research support staff.
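For example, a conversion step before deposit might look like the following sketch (using the pandas library; reading .xlsx files also requires the openpyxl package, and the file names are placeholders). The round-trip check at the end is one simple way to watch for data loss during conversion.

```python
# Minimal sketch: converting a software-specific spreadsheet to an open,
# plain-text format before depositing it in a repository.
import pandas as pd

df = pd.read_excel("measurements.xlsx")      # software-specific source format
df.to_csv("measurements.csv", index=False)   # open, plain-text format

# Spot-check for conversion loss: same shape and columns after a round trip.
roundtrip = pd.read_csv("measurements.csv")
assert roundtrip.shape == df.shape
assert list(roundtrip.columns) == list(df.columns)
```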

What does this mean?

Data curation is the active and ongoing management of data to ensure that it’s available for discovery and reuse. This process covers the entire lifecycle of the data(set), starting at the selection or creation and continuing on for as long as the data(set) exists. Digital preservation is a part of this data curation process and refers to the series of managed activities necessary to ensure continued access to and reusability of the data(set) for as long as necessary (i.e. keeping the data(set) FAIRER over time). These actions are a collaborative effort of the researcher and the data repository. Data stewardship also plays an important role throughout the process by managing and overseeing an organization’s data assets.

Why is this important?

Data curation and digital preservation require people, skills, and technology. All the steps you take towards making your data(set) FAIRER and of good quality contribute to the data curation process. You should be aware of the role a data repository plays in this process from the start and determine what kind of care and expertise you expect from a data repository to make sure your data(set) is preserved and kept FAIRER over time. As other questions in this tool have already emphasized, your choice of data repository has a great impact not only on the findability, accessibility, interoperability, and reusability, but also on the general value and impact of your data(set).

How to do this?

To make sure your data(set) is in good hands, we recommend depositing it in a trusted digital repository (TDR). TDRs have an explicit mission to provide access to and preserve data. They play a critical role in making data FAIRER, providing support, and preserving data over time in a FAIRER manner. You can find certified data repositories in registries such as Re3data, by filtering on ‘certificates’, or on the website of the certifying organisation.

To make sure your data(set) can become as FAIRER as possible and receive the best care over time, you should plan sufficient financial and/or human resources early in your research project. This should include the costs of data stewardship throughout the project and the costs of long-term preservation of the data, to make sure your data is accessible for as long as possible. You can use the FAIR-Aware Additional guidance to the Science Europe DMP assessment rubric to be more FAIRER-explicit in your Data Management Plan. In addition, the Government of Canada’s interdepartmental data management plan working group (DMP WG) has developed an extension to the RDA maDMP Standard to capture additional information required in a government context.

Want to know more?

There are different community-endorsed repository certification standards, such as CoreTrustSeal, DIN 31644/NESTOR, and ISO 16363. Each of these standards has different requirements that a repository should meet to receive certification. You can find these requirements on their respective websites.

More recently, the digital repository community has developed and endorsed a set of guiding principles to demonstrate general digital repository trustworthiness: Transparency, Responsibility, User focus, Sustainability, and Technology (TRUST). Certification requirements are based on these principles, to make sure certification is an indication of trustworthiness.

In the absence of formal certification, a data repository can still be FAIRER-enabling by facilitating some of the important qualities mentioned in the FAIRER and TRUST principles; however, this is more difficult to discover from the outside. Other organisations, including funders and publishers, can also enable or support FAIRERness by upholding requirements for data management or FAIRER practices (e.g., mandating Data Management Plans). In doing so, these organisations contribute to a FAIRER scientific landscape.

What does this mean?

Stakeholder trust in research and scientific information depends upon the integrity of the process by which such information is produced, managed, and communicated; so too does trust in the decision-making processes that make use of such information. Relevant legislation, government policies, directives, and codes of conduct must be adhered to, as must the professional codes of conduct incumbent upon doctors, engineers, chemists, and other professionals engaging in research or scientific activity. Follow best practices by maintaining complete and accurate records of your data, code, methodologies, and findings, including graphs and images. You should: do no harm; promote sovereignty, fairness, and transparency; respect autonomy, privacy, confidentiality, the public, individuals, and communities; conduct all your research and scientific activities with the highest scientific rigour; address bias and inequality; hold yourself and others accountable; and act with honesty, integrity, and humility.

Why is this important?

Excellence in the design and delivery of policy, programs, and services is beneficial to every aspect of public life. By upholding the highest ethical standards, researchers, scientists, and public servants conserve and enhance public confidence in the honesty, fairness, and impartiality of science and the public sector. Curation is essential to verification (reproducibility) and/or replication of the work.

How to do this?

  • Obtain authorship consent from all those, and only those, who have made a substantial conceptual and/or material contribution to, and who accept responsibility for, the data and/or code, and include an author contribution statement (CRediT, ANSI/NISO Standard 2022) with the work;
  • Recognize and acknowledge all and only those who participated in the research or scientific activity, and their contributions; acknowledge and describe the individuals and organizations that sponsored and/or funded it;
  • Report and appropriately manage any real, perceived, or potential conflicts of interest;
  • Reference all data, code, source material, methodologies, findings, and images (obtain permissions where applicable);
  • Appropriately handle restricted, confidential, and sensitive data (e.g., implement user authentication and controlled access to the data and/or implement data anonymization and de-identification);
  • Describe any ethical issues that exist related to the data and/or code, and comply with all policies, guidelines, and codes of conduct relevant to research involving humans or animals;
  • When blending data from multiple sources:
    • Describe the auspice and purpose of blending data, anticipated final products, potential downstream uses, potential considerations for disclosure risks and harms, and data usefulness;
    • Describe the ingredient data files, the data sources used to accomplish blending, the interests of data holders, and the steps taken to reduce disclosure risks and enhance usefulness when compiling ingredient files;
    • Describe the disclosure risk/usefulness trade-offs for accessing ingredient files;
    • Describe linkage strategies and determine if the resultant blended data are useful and meet the objective;
    • Describe stakeholder engagement in decision-making and mitigation plans for confidentiality breaches;
    • Describe how data provenance and update files are tracked, the decision-making process for continuing access to or sunsetting the blended data product, how participating partners contribute to those decisions, and how decisions about disclosure management policies are communicated to stakeholders.

What does this mean?

Indigenous data management protocols should aim to ensure community consent, access and ownership of Indigenous data, and protection of Indigenous intellectual property rights (Tri-Agency).

Why is this important?

Given the history of the Government of Canada in the collection of information that has ultimately harmed Indigenous people, distrust in government systems and processes may impact the ability of agencies to collect such information (GC 2020). As such, agencies should adopt a transparent, clear, and secure approach to the collection and storage of such information, including a clear articulation of the purpose of its collection and storage, as well as the importance of limiting access to personal information to those strictly requiring it as part of the intake process. It is also important to develop a data sovereignty process that aligns with key existing commitments to ensure protections for the ability to change or withdraw personal information and the ethical protection of those data (GC 2024).

We recognize that data related to research by and with Indigenous communities must be managed in accordance with data management principles developed and approved by these communities, and on the basis of free, prior and informed consent (GC 2021). 

How to do this?

Include the following information in your machine-actionable Data Management Plan (maDMP).

If there are Indigenous considerations related to the data or code, indicate whether or not (and provide details):

  • the project and data management plan have received approval from the Indigenous community;
  • there is an Indigenous information sharing agreement in place (and describe the agreement);
  • there is Indigenous control over the data;
  • Indigenous governance exists for these data; and,
  • Indigenous traditional knowledge was used.

Identify the:

  • Indigenous government group the contributor belongs to;
  • Indigenous non-government organization (NGO) the contributor belongs to;
  • Indigenous research method that was used;
  • Indigenous knowledge classification that was used;
  • Indian band number that is associated with the dataset distribution; and,
  • method of Indigenous data identification that is used with the dataset distribution.

What does this mean?

The data and information should be pertinent to the context or problem they are being used to address. They need to be current, correct, and reliable to avoid incorrect conclusions or faulty machine learning models. They need to be unbiased to avoid unfair or prejudiced outcomes.

Why is this important?

It is important that Big Data and AI used to make or support decisions be deployed in a manner that is compatible with core principles of administrative law, such as transparency, accountability, legality, and procedural fairness. It is essential to ensure that automated decision systems are deployed in a manner that reduces risks to clients, institutions, and society, and that leads to more efficient, accurate, consistent, and interpretable decisions made pursuant to law.

Big Data and Artificial Intelligence outcomes are only as good as the input data (“garbage in, garbage out”). High-quality data improves the performance of AI models, leading to more accurate and effective predictions and decision-making processes.

How to do this?

Ensure ethical AI:

  1. Promote, protect and respect human rights.
  2. Recognize and minimize actual or potential harms of AI.
  3. Rely on sound statistical practices (e.g., in designing data collection and in summarizing, processing, analyzing, interpreting, and presenting data, as well as in model or algorithm development and deployment).

At the beginning of the design phase of your project and again before going into production, use an algorithmic assessment tool to determine the impact level of an automated decision-system (e.g., GC Algorithmic Impact Assessment Tool).

Ensure that input data and information are complete, high quality, and well documented:

  • Implement good data governance practices;
  • Use advanced tools for data validation, cleaning, processing, and analysis;
  • Develop methods to identify and mitigate biases in datasets;
  • Audit the data regularly and make any necessary updates or corrections; and,
  • Document the methods and code.
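As a small illustration of the validation and bias-audit steps above, here is a minimal Python sketch using pandas. It assumes a tabular dataset with a demographic column named "group" and a binary outcome column named "approved"; the file and column names are hypothetical.

```python
# Minimal sketch of a routine data audit: completeness, duplicates, and a
# first-pass check for outcome gaps between groups.
import pandas as pd

df = pd.read_csv("decisions.csv")  # placeholder file name

# 1. Completeness: flag columns with missing values.
missing = df.isna().mean().sort_values(ascending=False)
print("Share of missing values per column:\n", missing[missing > 0])

# 2. Validity: flag duplicate records.
print("Duplicate rows:", int(df.duplicated().sum()))

# 3. First-pass bias check: compare outcome rates across groups. A large
#    gap is not proof of bias, but it is a signal to investigate further.
rates = df.groupby("group")["approved"].mean()
print("Outcome rate by group:\n", rates)
print("Largest gap between groups:", float(rates.max() - rates.min()))
```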

Test and monitor outcomes (Directive on Automated Decision-Making):

  • Test data and information, as well as the underlying model, for unintended biases;
  • Safeguard against unintentional outcomes; and,
  • Verify compliance with institutional and program legislation.

Figure. How ethical AI arises from ethical statistical and ethical computing practices.
Source: Tractenberg, Statistical learning and ethical artificial intelligence.

What does this mean?

Reproducible data and code means that the final data and code are computationally reproducible within some tolerance interval or defined limits of precision and accuracy; i.e., a 3rd party will be able to verify the data lineage and processing, reanalyze the data, and obtain consistent computational results using the same input raw data, computational steps, methods, computer code, and conditions of analysis, in order to determine whether the same result emerges from the reprocessing and reanalysis. “Same result” can mean different things in different contexts: identical measures in a fully deterministic context, the same numeric results but differing in some irrelevant detail, statistically similar results in a non-deterministic context, or validation of a hypothesis.

Note that reproducibility is a different concept from replicability. In the latter case, the final published data are linked to sufficiently detailed methods and information for a 3rd party to be able to verify the results based on the independent collection of new raw data, using similar or different methods but leading to comparable results.

Why is this important?

Reproducibility of results is the cornerstone of the scientific method and the minimum standard for assessing the value of scientific claims and conclusions. Lack of reproducibility undermines trust in the results.

How to do this?

Here is a non-exhaustive list of things you can do to enhance reproducibility of your deterministic and non-deterministic code:

  • Make your data and code available;
  • Where possible, use open-source software and tools;
  • Ensure that your code acts in a predictable way regardless of the timing of the input, the presence of noise, garbage collector runs at critical moments (even if only for a few milliseconds), etc.;
  • Design the code to accommodate and recover from unknown states or particularities (e.g., particularities of the data structures, libraries, or APIs) with minimal damage and interruption;
  • Verify that indexes are not run on column expressions that contain non-deterministic functions;
  • Consider the compiler’s ability to optimize recursion;
  • In a complex timeseries model, consider permutation entropy (ordinal calculation of forward information transfer); or, where information is the limit, be sure to gather and use sufficient information so that the level of complexity does not effectively overwhelm the predictive power of the deterministic forecast model;
  • In a complex model, ensure that the time required to compute the required data length (e.g., 0.5 sec) does not exceed the real-time control of a computer system (e.g., a MHz clock rate) so that the level of complexity does not effectively overwhelm the predictive power of the deterministic forecast model;
  • Optionally, use tools like pipenv, conda, or Docker to create and manage virtual environments that isolate your dependencies from your system and ensure they are consistent across different platforms;
  • Use relative paths (relative to the current working directory) rather than absolute paths (defined from the root directory) to define the location of a file or directory;
  • Test and debug your code on different platforms (e.g., Windows, Linux, Mac OS);
  • Test your web or mobile applications on different browsers and devices; and,
  • Complete a reproducibility checklist (see, for example, the checklist below).
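As a small illustration, the following Python sketch combines three of the practices above: explicit seeding of every random number generator used, relative paths, and a saved record of the computing environment. The paths and seed value are placeholders.

```python
# Minimal sketch: seeded randomness, relative paths, and a recorded
# environment, three easy wins for reproducibility.
import json
import platform
import random
import sys
from pathlib import Path

import numpy as np

SEED = 20240313  # fix and report the seed so runs can be repeated exactly
random.seed(SEED)
rng = np.random.default_rng(SEED)  # seeded NumPy generator

# Relative paths, defined from the project root rather than the machine's
# root directory, so the project runs unchanged on another computer.
DATA = Path("data") / "input.csv"
RESULTS = Path("results")
RESULTS.mkdir(exist_ok=True)

# Record the conditions of analysis alongside the results.
env = {
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "seed": SEED,
}
(RESULTS / "environment.json").write_text(json.dumps(env, indent=2))

print(rng.normal(size=3))  # identical on every run with the same seed
```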

Reproducibility Checklist for Deterministic and Non-deterministic Data and Code

My code is (Yes / No):

  • A DETERMINISTIC algorithm that, given a particular input, always produces the same output. The behavior of the algorithm is entirely predictable and does not involve any randomness or decision-making that can lead to different outcomes on different executions with the same input.
  • A NON-DETERMINISTIC algorithm that, given the same input, can produce different outcomes on different executions. The algorithm makes choices that can vary between runs because it involves randomness or a selection among multiple possibilities without a specific rule for which option to choose.
  • Executable only on a high-performance computing (HPC) system or a supercomputer.
  • QUANTUM CODE, executable only in a quantum computing environment that uses quantum bits (qubits) in a system exhibiting quantum mechanical behavior, which can represent and store information in a more complex way than classical computing, which uses bits (switches represented by 1s and 0s) as the smallest unit of information.
For all models and algorithms, I provided a link to (Yes / No / Partial / N/A):

  • A clear description of the mathematical setting, algorithm, and/or model.
  • A clear explanation of any assumptions.
  • An analysis of the complexity (time, space, sample size) of the algorithm.
  • A conceptual outline and/or pseudocode description.

For any theoretical claims, I provided a link to (Yes / No / Partial / N/A):

  • A clear statement of the claim.
  • A complete proof of the claim.
  • A clear formal statement of all assumptions.
  • A clear formal statement of all restrictions.
  • Proofs of all novel claims.
  • Proof sketches or intuitions for complex and/or novel results.
  • Appropriate citations to the theoretical tools used.
  • An empirical demonstration that all theoretical claims hold.
  • All experimental code used to eliminate or disprove claims.
For all datasets used, I provided a link to (Yes / No / Partial / N/A):

  • A downloadable version of the dataset or simulation environment.
  • The relevant statistics (e.g., the number of examples).
  • The details of train / validation / test splits.
  • An explanation of any data that were excluded, and all pre-processing steps.
  • A complete description of the data collection process for any new data collected (e.g., experimental setup, device(s) used, image acquisition parameters, subjects/objects involved, instructions to annotators, and QA/QC methods).
  • A motivation statement for why the experiments are conducted on the selected datasets.
  • A licence that allows free usage of the datasets for research purposes.
  • Public availability of all datasets drawn from the existing literature (potentially including authors’ own previously published work).
  • A detailed explanation, where applicable, as to why datasets used are not publicly available, and why publicly available alternatives were not used.
  • Ethics approval.
For all code used, I provided a link to (Yes / No / Partial / N/A):

  • The specification of dependencies.
  • The training code.
  • The evaluation code.
  • The (pre-)trained model(s).
  • A ReadMe file that includes a table of results accompanied by the precise commands to run to produce those results.
  • Any code required for pre-processing data.
  • All source code required for conducting and analyzing the experiment(s).
  • A licence that allows free usage of the code for research purposes.
  • A document with comments detailing the implementation of new methods, with references to the paper where each step comes from.
  • The method used for setting seeds (if an algorithm depends on randomness), described in a way sufficient to allow replication of results.
  • A description of the computing infrastructure used (hardware and software), including GPU/CPU models, memory, OS, and the names/versions of software libraries and frameworks.
  • A formal description of the evaluation metrics used and an explanation of the motivation for choosing these metrics.
  • A statement of the number of algorithm runs used to compute each reported result.
  • An analysis of experiments that goes beyond single-dimensional summaries of performance (e.g., average, median) to include measures of variation, confidence, or other distributional information.
  • A description of the significance of any improvement or decrease in performance, judged using appropriate statistical tests (e.g., Wilcoxon signed-rank).
  • A list of all final (hyper-)parameters used for each model/algorithm in each of the experiments.
  • A statement of the number and range of values tried per (hyper-)parameter during development, along with the criterion used for selecting the final parameter setting.
For all reported experimental results, I provided a link to (Yes / No / Partial / N/A):

  • The range of hyper-parameters considered, the method used to select the best hyper-parameter configuration, and the specification of all hyper-parameters used to generate results.
  • The exact number of training and evaluation runs.
  • A clear definition of the specific measures, evaluation metrics, and/or statistics used to report results.
  • A description of results with central tendency (e.g., mean) and variation (e.g., error bars).
  • The average runtime for each result, or estimated energy cost.
  • A description of the computing infrastructure used.
  • A document that clearly delineates statements that are opinions, hypotheses, and speculation from objective facts and results.
  • Information on sensitivity regarding parameter changes.
  • Details on how baseline methods were implemented and tuned.
  • An analysis of the statistical significance of reported differences in performance between methods.
  • A description of the memory footprint.
  • An analysis of situations in which the method failed.

What does this mean?

All data and code are made available for 3rd-party verification of reproducibility. Making the original, unprocessed data available allows others to examine what was initially collected or generated before any manipulation or analysis. Computational steps means detailed documentation of the processes applied to the raw data, including any transformations, filtering, or aggregation. Methods include detailed descriptions of data collection and processing, statistical techniques used, data mining techniques, machine learning models, and any other methods used. Computer code is the actual code or scripts used to process and analyze the data. Conditions of analysis means describing the environment in which the analysis was conducted (e.g., software versions, hardware specifications, and any other conditions that might influence the results).
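One simple, concrete aid to 3rd-party verification is to publish checksums alongside the raw data files, so a reanalysis can confirm it starts from exactly the bytes you analyzed. Below is a minimal Python sketch; the directory layout is a placeholder.

```python
# Minimal sketch: computing SHA-256 checksums for raw data files so others
# can verify they received the files unaltered.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream the file through SHA-256 so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Publish these lines (e.g., as CHECKSUMS.txt) with the deposited data.
for raw_file in sorted(Path("data/raw").glob("*.csv")):  # placeholder layout
    print(f"{sha256(raw_file)}  {raw_file.name}")
```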

Why is this important?

  • Transparency and reproducibility are fundamental to the scientific integrity and utility of the data and code.
  • Accessibility of the data and code ensures that data providers are accountable for the quality and integrity of the data and reported results, and discourages questionable practices.
  • Access to the data and code facilitates collaboration, allows others to build upon your work, apply your methods to new datasets, and explore alternative analyses, thereby increasing the value of your data and the potential for innovation.
  • Facilitating collaboration leads to increased co-authorship, more publications and citations, and wider dissemination.

How to do this?

  • Provide infrastructure to support data and code sharing.
  • Provide clear data-sharing workflows and step-by-step procedures.
  • Deposit the data and code in established repositories that support data sharing, and ensure that data are accessible in a standardized format.
  • Use a version control system (e.g., Git) to track and manage changes over time.
  • Document your code using embedded comments and external ReadMe files, and document code dependencies (i.e., external packages, libraries, or modules that your code relies on).
  • Use portable file formats for input and output files (e.g., CSV, JSON, XML), which can be read and written by different programs and platforms; see the sketch after this list.
  • Provide the most permissive license possible under which the data and code are made available (e.g., one of the Creative Commons licenses).
  • Include a machine-actionable Data Management Plan (maDMP). For the minimum requirements of an maDMP, see the international Research Data Alliance (RDA) Common Standard for maDMPs.
  • Documenting, organizing, and making available data and code to the degree needed for 3rd-party verification of reproducibility is a lot of work, requiring additional training, time, and infrastructure, all of which increase costs. To incentivise researchers, scientists, and programmers to ensure reproducibility, encourage publication in peer-reviewed data journals and code journals, and give these publications the same weight as traditional research publications when reviewing promotion packages.
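
To illustrate the portable-formats point above, a minimal Python sketch (hypothetical data and file names) writes the same records to two program-independent formats:

  import csv
  import json

  records = [{"site": "A", "temperature_c": 21.4},
             {"site": "B", "temperature_c": 19.8}]

  # CSV: a widely readable tabular text format
  with open("observations.csv", "w", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=["site", "temperature_c"])
      writer.writeheader()
      writer.writerows(records)

  # JSON: a portable structured format that also handles nested data
  with open("observations.json", "w") as f:
      json.dump(records, f, indent=2)

Either file can be opened later on any platform without proprietary software, which is the point of choosing portable formats.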

Want to know more?

Data - Definition and Context

DATA are a set of values of subjects with respect to qualitative or quantitative variables representing facts, statistics, or items of information in a formalized manner suitable for communication, reinterpretation, or processing (TBS Policy on Service and Digital, 2019).

Data are facts, measurements, recordings, records, or observations about the world collected by scientists and others with a minimum of contextual interpretation. Data may be in any format or medium taking the form of writings, notes, numbers, symbols, text, images, films, video, sound recordings, pictorial reproductions, drawings, designs or other graphical representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, data processing algorithms/code/scripts, or statistical records (CODATA-IRiDiuM (2018) - International Research Data Management glossary).

The word “data” may be used very broadly to comprise data (in the strict sense) and the ecosystem of digital things that relate to data, including metadata, software and algorithms, as well as physical samples and analogue artefacts, and the digital representations and metadata relating to these things (CODATA 2019 - Beijing Declaration on Research Data). There are dozens of other definitions of data that may be useful depending on the context.

SCIENTIFIC DATA are data that are used by scientists as primary sources to support technical and regulatory development or scientific enquiry, research, or scholarship, and that are used as evidence in the scientific process and/or are commonly accepted in the scientific community as necessary to validate scientific findings and results. All other digital and non-digital content have the potential of becoming research data. Examples of scientific data include data arising from: experiments, research and development, ‘citizen science’, surveys, operations, surveillance, monitoring, field analyzers or data-loggers, instruments, laboratory analyses, inventories, modeling and simulation output, processed data, and repurposed data (IRiDiuM - International Research Data Management glossary). The scientific nature of the data is demonstrated when the process of creating, maintaining, and quality-proofing the data complies with commonly recognized scientific standards.

Although scientific data share many aspects in common with other types of data (e.g., administrative data, financial data, business data), their processing frequently requires more complex software and infrastructure. The data themselves may also be:

  • more complex (e.g., associated accuracy, precision, detection limits, confidence intervals, quality assurance/quality control procedures, etc.);
  • more tightly controlled;
  • held to higher standards;
  • retained for a longer period of time, often indefinitely;
  • documented more carefully and in greater detail (e.g., description of methods used to obtain measurements);
  • used as evidence, including in court, and therefore require a higher level of credibility, reliability, and accessibility.
Author Statement

Esther Liu (Data curation, Software, Visualization);
Dominique Charles (Methodology, Validation); and
Claire C. Austin (Conceptualization, Supervision, Writing).

In keeping with the FAIRsFAIR MIT licence, this work reproduces (with minor edits) the online FAIR-Aware Self-Assessment tool for the Findable, Accessible, Interoperable, and Reusable principles; adds a definition of 'Data'; adds new sections on 'Ethical' and 'Reproducible' principles; and adds a new 'Checklist for Reproducibility'. Its layout is based on and inspired by the FAIRsFAIR FAIR-Aware Self-Assessment tool.

All authors reviewed, discussed, and agreed to all aspects of the final work.

All views and opinions expressed are those of the co-authors, and do not necessarily reflect the official policy or position of their respective employers, or of any government, agency or organization.

Cite as: Liu E, Charles D, and Austin CC (2024). FAIRER-Aware Self-Assessment Tool.

GLOSSARY
Access

Continued availability and ongoing usability of a digital resource, retaining all qualities of authenticity, accuracy, and functionality deemed essential for the purposes for which the digital material was created and/or acquired. Users who have access can retrieve, understand, manipulate, and store copies.

source

Controlled vocabulary

A controlled vocabulary is a flat, normalised, restricted list of terms for a specific use or context. Thesauri and taxonomies are types of controlled vocabularies, but not all controlled vocabularies are thesauri or taxonomies.

source

Data

Facts, measurements, recordings, records, or observations about the world, collected by researchers, that are yet to be processed/interpreted/analysed. Data may be in any format or medium taking the form of writings, notes, numbers, symbols, text, images, films, video, sound recordings, pictorial reproductions, drawings, designs or other graphical representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, data processing algorithms, or statistical records.

  • Research Data: Data that are used as primary sources to support technical or scientific enquiry, research, scholarship, or artistic activity, and that are used as evidence in the research process and/or are commonly accepted in the research community as necessary to validate research findings and results. All other digital and non-digital content have the potential of becoming research data. Research data may be experimental data, observational data, operational data, third-party data, public sector data, monitoring data, processed data, or repurposed data.
  • Dataset: Organised collection of data or objects in a computational format that are generated or collected by researchers in the course of their investigations, regardless of their form or method, and that form the object on which researchers test a hypothesis. This includes the full range of data: raw, unprocessed datasets; proprietary generated and processed data; and secondary data obtained from third parties. The presentation of the data in the application is enabled through metadata.

source

Data availability statement

A statement accompanying an article published in a scientific journal about the availability of the data underlying the article. What such a statement looks like is determined by each journal in their data sharing policy. The statement will usually describe where the data can be found and what access and reuse conditions there are.

source

Data curation

Managed process throughout the data lifecycle, by which data/data collections are cleansed, documented, standardised, formatted, and inter-related. This includes versioning data, or forming a new collection from several data sources, annotating with metadata, and adding codes to raw data (e.g., classifying a galaxy image with a galaxy type such as “spiral”). Higher levels of curation involve maintaining links with annotation and with other published materials. Thus a dataset may include a citation link to a publication whose analysis was based on the data. The goal of curation is to manage and promote the use of data from its point of creation to ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic datasets, this may mean continuous enrichment or updating to keep them fit for purpose. Special forms of curation may be available in data repositories. The data curation process itself must be documented as part of curation. Thus curation and provenance are highly related.

source

Data destruction

Process of destroying data stored on tapes, hard disks and other forms of electronic media so that it is completely unreadable and cannot be accessed or used.

source

Data management plan (DMP)

Statement describing how research data will be managed throughout a specified research project's life cycle - during and after the active phase of the research project - including terms regarding archiving and potential preservation of the data in a data repository. The DMP is considered to be a 'living' document, i.e., one which can be updated when necessary.

source

(Data) repository

Physical or digital storage location that can house, preserve, manage, and provide access to many types of digital and physical materials in a variety of formats. Materials in online repositories are curated to enable search, discovery, and reuse. There must be sufficient control for the physical and digital material to be authentic, reliable, accessible and usable on a continuing basis.

source

Data retention policy

Established protocol of an organisation for retaining information for operational or regulatory compliance needs. The objectives of a data retention policy are to keep important information for future use or reference, to organise information so it can be searched and accessed at a later date, and to dispose of information that is no longer needed. A data retention policy must consider both the value of data over time, and regulations to which the data may be subject.

source

Data stewardship

Course of action taken by a person or group to manage and supervise organisational data assets with responsibility and commitment. Good stewardship involves adequate care, making use of the FAIR principles, and holding ownership and regulation to provide high-quality data (including metadata), combining trust and ethical practice.

source

Data type

Data type (or simply type), in computer science and computer programming, is a classification identifying one of various types of data that determines the possible values for that type; the operations that can be done on values of that type; the meaning of the data; and the way values of that type can be stored.

Common data types include: integers, booleans, characters, floating-point numbers, alphanumeric strings.
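
As a brief illustration (a Python sketch; the values are arbitrary), the type of a value constrains the operations that can validly be performed on it:

  count: int = 42          # integer: supports arithmetic
  done: bool = True        # boolean: supports logical operations
  label: str = "FAIRER"    # string: supports concatenation and slicing
  ratio: float = 0.95      # floating-point number

  print(count + 1)   # arithmetic on an integer -> 43
  print(label[:4])   # slicing a string -> "FAIR"
  # count + label    # would raise TypeError: the types forbid this operation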

source

Digital preservation

Series of managed activities necessary to ensure continued access to digital materials for as long as necessary. All of the actions required to maintain access to digital materials beyond the limits of media failure or technological change. Those materials may be records created during the day-to-day business of an organisation; born-digital materials created for a specific purpose (such as teaching resources); or the products of digitisation projects. This definition specifically excludes the potential use of digital technology to preserve the original artefacts through digitisation.

source

Ethical data

Data are ethical when: (a) Data are collected and managed in compliance with relevant government and professional codes of conduct, values and ethics, scientific integrity, and responsible conduct of research; (b) Restricted, confidential, and sensitive data are handled appropriately, for example by implementing user authentication and controlled access to the data, and/or data anonymization and de-identification; (c) A statement is made as to whether or not Indigenous considerations exist and, where applicable, Indigenous data sovereignty is respected and data are managed in accordance with CARE, OCAP, and UNDRIP principles; (d) Data assets are managed in a manner such that data used as input to Big Data or Artificial Intelligence applications can be confirmed to be relevant, accurate, and up-to-date, and can be tested for unintended biases (TBS Directive 2019); (e) Contributor and contact person information is provided.

FAIRER (Data Principles)

A set of guiding principles to make data Findable, Accessible, Interoperable, Reusable, Ethical and Reproducible.

  • Findable: (meta)data should be richly described to enable attribute-based search
  • Accessible: (meta)data should be retrievable in a variety of formats that are sensible to humans and machines using persistent identifiers
  • Interoperable: the description of (meta)data should follow community guidelines that use an open, well-defined vocabulary
  • Reusable: the description of essential, recommended, and optional metadata elements should be machine processable and verifiable. Use should be easy and data should be citable to sustain data sharing and recognize the value of data
  • Ethical: Ethical data means that: (a) Data are collected and managed in compliance with applicable legislation and codes; (b) Restricted, confidential, and sensitive data are handled appropriately; (c) Indigenous data sovereignty is respected and data are managed in accordance with relevant legislation and principles; (d) Data assets can be confirmed to be relevant, accurate, and up-to-date, and can be tested for unintended biases for use as input to Big Data or Artificial Intelligence applications; (e) Contributor and contact person information is provided. See Ethical data for more details.
  • Reproducible: Ability to reproduce the results of a study using the same input data and procedures used by the original investigator. See Reproducibility for more details.

source

source

File format

The layout of a file in terms of how the data within the file are organized. A program that uses the data in a file must be able to recognize and possibly access data within the file. A particular file format is often indicated as part of a file’s name by a filename extension (suffix). Conventionally, the extension is separated by a period from the name and contains three or four letters that identify the format.

A proprietary file format is owned and copyrighted by a specific company. Open file formats are publicly available.

source

source

Harvesting

Metadata harvesting is an automated, regular process of collecting metadata descriptions from different sources to create useful aggregations, so that services can be built using metadata from many repositories.

source

Licence

A legal document that specifies what a user can do with a resource.

  • Standard licence: A licence that is defined in a published and recognised specification. Examples of standard licences are: Creative Commons licences, Open Data Commons.
  • Machine-understandable licence: A licence that is expressed in such a way that a machine can take a decision on further actions.

source

Long-term preservation

Continued access to digital materials, or at least to the information contained in them, indefinitely.

source

Machine-actionable

A continuum of possible states wherein a digital object provides increasingly more detailed information to an autonomously acting, computational data explorer. This information enables the agent, to a degree dependent on the amount of detail provided, to have the capacity, when faced with a digital object never encountered before, to:

  1. identify the type of object (with respect to both structure and intent),
  2. determine if it is useful within the context of the agent’s current task by interrogating metadata and/or data elements,
  3. determine if it is usable, with respect to license, consent, or other accessibility or use constraints, and
  4. take appropriate action, in much the same manner that a human would.
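
A hypothetical sketch of steps 2-4 (in Python; the metadata field names and the licence allow-list are illustrative assumptions, not a real standard) shows how an agent might interrogate a metadata record before acting:

  import json

  metadata_record = json.loads("""{
      "type": "dataset",
      "format": "text/csv",
      "license": "CC-BY-4.0",
      "keywords": ["temperature", "monitoring"]
  }""")

  allowed_licenses = {"CC0-1.0", "CC-BY-4.0"}

  is_useful = (metadata_record["type"] == "dataset"
               and "temperature" in metadata_record["keywords"])  # step 2
  is_usable = metadata_record["license"] in allowed_licenses      # step 3

  if is_useful and is_usable:
      print("Retrieve and process the object.")                   # step 4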

source

Machine-readable format

In a form that can be used and understood by a computer.

source

Metadata

Data about data. It is data (or information) that defines and describes the characteristics of other data. It is used to improve the understanding and use of the data.

source

Persistent Identifier

Long-lasting digital reference to an object that gives information about that object regardless of what happens to that object. Developed to address link rot, a persistent identifier can be resolved to provide an appropriate representation of an object whether that object changes its online location or goes offline.

source

Provenance

A type of historical information or metadata about the origin, location, or source of something, or the history of the ownership or location of an object or resource, including digital objects. For example, information about the Principal Investigator who recorded the data, and information concerning its storage, handling, and migration.

source

Reproducibility

Reproducible data and code means that the final data and code are computationally reproducible within some tolerance interval or defined limits of precision and accuracy; i.e., a 3rd party will be able to verify the data lineage and processing, reanalyze the data, and obtain consistent computational results using the same input raw data, computational steps, methods, computer software and code, and conditions of analysis, in order to determine whether the same result emerges from the reprocessing and reanalysis.

“Same result” can mean different things in different contexts: identical measures in a fully deterministic context, the same numeric results differing only in some irrelevant detail, statistically similar results in a non-deterministic context, or validation of a hypothesis. All data and code are made available for 3rd-party verification of reproducibility.

Note that reproducibility is a different concept from replicability. In the latter case, the final published data are linked to sufficiently detailed methods and information for a 3rd party to be able to verify the results based on the independent collection of new raw data, using similar or different methods but leading to comparable results. (See also NASEM 2019.)
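
As a minimal illustration (a Python sketch; the tolerance value is an assumption that each study should state explicitly), verifying “the same result” in a non-deterministic or floating-point context usually means comparison within defined limits rather than exact equality:

  import math

  original_result = 0.8731    # value reported by the original investigator
  reproduced_result = 0.8729  # value obtained from a 3rd-party rerun

  TOLERANCE = 1e-3  # assumed limit of precision; state this in the study

  reproducible = math.isclose(original_result, reproduced_result,
                              abs_tol=TOLERANCE)
  print(f"Reproducible within {TOLERANCE}: {reproducible}")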

Semantic artefact

Semantic artefacts are machine-readable models of knowledge, such as controlled vocabularies, thesauri, and ontologies, which facilitate the extraction and representation of knowledge within datasets using annotations or assertions.

source

Standard

1) A level of quality, achievement, etc., that is considered acceptable or desirable.
2) Ideas about morally correct and acceptable behavior.

A standard can be mandatory due to government statute or regulation, or less formally required in the form of a consensus or de facto standard. In this context, it is important that the standard is relevant for and used by a specific community.

source

Structured data

Data whose elements have been organized into a consistent format and data structure within a defined data model such that the elements can be easily addressed, organized and accessed in various combinations to make better use of the information, such as in a relational database.

source

Trusted Digital Repository

Infrastructure component that provides reliable, long-term access to managed digital resources. It stores, manages, and curates digital objects and returns their bit streams when a request is issued. Trusted repositories undergo regular assessments according to a set of rules such as those defined by CoreTrustSeal or TRAC (ISO 16363). Such an assessment has the potential to increase trust from its depositors and users. Certain quality criteria need to be met to distinguish trusted repositories from other entities that store data, such as notebooks or lab servers.

source

TRUST principles

A set of guiding principles to demonstrate digital repository trustworthiness:

  • Transparency: to be transparent about specific repository services and data holdings that are verifiable by publicly accessible evidence.
  • Responsibility: to be responsible for ensuring the authenticity of data holdings and for the reliability and persistence of its service.
  • User focus: to ensure that the data management norms and expectations of target user communities are met.
  • Sustainability: to sustain services and preserve data holdings for the long-term.
  • Technology: to provide infrastructure and capabilities to support secure, persistent, and reliable services.

source