Data Organization and Documentation

A guide for improving the organization, documentation, and long-term preservation of digital research data.

The ALCOA Standard

The ALCOA standard has been used by the Food and Drug Administration since the 1990s to define and enforce data integrity with regard to drug manufacturing practice. This acronym outlines five aspects of well-managed data that can help to ensure its authenticity and integrity.

  • A: Attributable - creation of and all updates to data should be attributable to a known person. Audit trails or versioning in data creation software can help to ensure attribution.
  • L: Legible - data should be easily readable and responsibly recorded. This is as important with digital data (e.g., making data 'tidy') as it is with handwritten analog data.
  • C: Contemporaneous - data should be recorded and authenticated (with signatures, timestamps, etc.) at the moment of its creation.
  • O: Original - the location of the original source data should always be recorded, and subsequent modifications should not overwrite the original source.
  • A: Accurate - the data should accurately reflect the true observations resulting from the study, whether they reflect expected results or not. Data also require documentation of all supplemental information that supports accuracy, such as data dictionaries and readme files.
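
As an illustration of the Attributable, Contemporaneous, and Original aspects, the following Python sketch copies a source file to a working copy and appends a timestamped log entry naming the person responsible. It is a minimal sketch rather than a validated audit-trail system: the function names, the log format (one JSON object per line), and the idea of passing the person's name as a plain string are all assumptions made for the example.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def start_working_copy(original: Path, working: Path, log: Path, person: str) -> dict:
    """Copy the original to a working file and log who did it and when.

    Hypothetical helper: the original file is only read, never modified.
    """
    if working.exists():
        # Refuse to overwrite, so an earlier working copy is never lost silently.
        raise FileExistsError(f"{working} already exists; refusing to overwrite")
    shutil.copy2(original, working)
    entry = {
        "action": "created working copy",
        "person": person,                                      # Attributable
        "timestamp": datetime.now(timezone.utc).isoformat(),   # Contemporaneous
        "original": str(original),                             # Original: location recorded
        "original_sha256": sha256_of(original),                # lets later readers detect changes
        "working_copy": str(working),
    }
    # Append-only log: one JSON object per line.
    with log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Recomputing the checksum of the original at a later date and comparing it with the logged value is one simple way to confirm that the source file has not been altered.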

FAIR Principles for Scientific Data Management

The FAIR principles represent a consensus among stakeholders in academia, industry, funding agencies, and scholarly publishing about best practices for making data discoverable and reusable by both people and machines. A dataset that is stored, curated, and shared according to the FAIR principles is:

Findable: described richly with metadata and assigned a unique identifier (often a URI)

Accessible: retrievable by its identifier through free and open Internet protocols that allow authentication and authorization where necessary

Interoperable: described with commonly used metadata standards and controlled vocabularies

Re-usable: assigned a license that makes the terms of re-use clear, and accompanied by a clear provenance

Reference:

Wilkinson, Mark D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018, March 15, 2016. https://doi.org/10.1038/sdata.2016.18

Metadata

Metadata is commonly defined as "data about data," and can also be thought of as tags added to existing resources in order to describe them. If you may eventually want to make your datasets findable, whether in an institutional repository or in a repository where your funder requires deposit, keep basic descriptors for each dataset in mind from the start. Datasets can be described either minimally or in great detail, using either free-text or controlled terms.

The following metadata elements, at a minimum, should be appended to any dataset stored in a repository:

  • Title of study
  • PI/Lead investigator
  • Project description
  • Resource type (generally file type)
  • Keywords
  • Rights/Licensing statement
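
The minimum elements above can be sketched as a simple record with a completeness check. This is an illustrative Python sketch only: the field names, the placeholder values, and the `missing_elements` helper are assumptions made for the example, not a repository schema. A real repository will define its own field names and formats.

```python
import json

# Hypothetical field names standing in for the six minimum elements listed above.
REQUIRED_ELEMENTS = (
    "title",
    "lead_investigator",
    "description",
    "resource_type",
    "keywords",
    "rights",
)


def missing_elements(record: dict) -> list:
    """Return the required elements that are absent or empty in the record."""
    return [name for name in REQUIRED_ELEMENTS if not record.get(name)]


# Placeholder values for illustration only.
record = {
    "title": "Example study title",
    "lead_investigator": "Example Investigator",
    "description": "One-paragraph summary of the project and how the data were collected.",
    "resource_type": "CSV",
    "keywords": ["example keyword", "another keyword"],
    "rights": "",  # left empty on purpose to show the check
}

print(missing_elements(record))  # prints ['rights']
print(json.dumps(record, indent=2))
```

Storing the record as a JSON or plain-text file next to the data keeps the description with the dataset until it is deposited.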

To meet the Attributable requirement of ALCOA-compliant data, each action taken on data should be clearly attributable to one actor. Attribution to a particular person is easier to establish if, when describing the dataset, the tags for the PI and other investigators are linked to a service that verifies researchers' identities, such as ORCiD. To create a Northwestern-linked ORCiD account, visit: https://orcid.it.northwestern.edu/
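
When ORCiD identifiers are typed into metadata by hand, a format check can catch transcription errors before deposit. The final character of an ORCiD identifier is a check digit computed from the first 15 digits (ISO 7064 MOD 11-2). The Python sketch below implements that check; the function names are hypothetical. Note that it only tests whether an identifier is well formed. It does not confirm that the identifier is registered or that it belongs to a particular person.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid: str) -> bool:
    """Return True if the identifier has 16 characters and a valid check digit.

    Accepts the usual hyphenated form, e.g. 0000-0002-1825-0097.
    """
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15].upper()
    if not all(ch in "0123456789" for ch in base):
        return False
    return orcid_check_digit(base) == check


# 0000-0002-1825-0097 is the sample identifier used in ORCID's own documentation.
print(is_well_formed_orcid("0000-0002-1825-0097"))  # prints True
print(is_well_formed_orcid("0000-0002-1825-0098"))  # prints False
```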

To increase datasets' findability and interoperability, consider including among the keywords terms from controlled vocabularies such as MeSH or the Library of Congress Subject Headings.

To ensure that the terms of re-use are clear for any dataset you share, assign a license to datasets deposited or described in repositories. A license for your dataset is different from a license or embargo period that a publisher may enforce for a journal submission (consult the website SHERPA/RoMEO for journal licensing requirements). It also does not amount to copyrighting raw data, which is not commonly considered a 'creative work.' However, your processed data may have required unique or creative input, and you may license that work upon sharing in order to place clear restrictions on what re-users can and cannot do with the dataset. Data.world, a for-profit company that works to reduce barriers to dataset access, has published a helpful list of "Common license types for datasets." A flowchart created by Creative Commons Australia can help you to select the most appropriate Creative Commons license for your data works.