
Data Organization and Documentation

A guide for improving the organization, documentation, and long-term preservation of digital research data.

File Naming Conventions

A file naming convention is a set of rules that your research team establishes to assign names to research files consistently. Most file systems limit a file name to about 255 characters, including the file extension, so treat that as an upper bound rather than a target. Best practice is to make each file name as descriptive as possible while keeping it as short as possible.

The following recommendations for developing a file naming convention are adapted from Stanford Libraries Data Management Services, “Data Best Practices and Case Studies - How to Name Files.”

  • Make file names descriptive. This will usually require the use of several different elements in a file name, which can either be run together or separated by underscores. Elements to consider including in file names for research data are:
    • Project or experiment name or acronym
    • Subject (participant) code
    • Researcher name/initials
    • Date or date range of observation
    • Type of data
    • Conditions
    • Version number of file

An example file name: 20180817_StrsHlth_Survey1_43584_KGH_verbal_v01.pdf

  • Use a consistent form of expressing dates. YYYYMMDD conforms to a widely used international standard (ISO 8601). Put the date in this format at the beginning of a file name if you would like files to sort by date.
  • Avoid special, non-alphanumeric characters. The underscore is the only special character recommended in file names.
  • Use leading zeroes in sequential numbering systems. Writing 09 or 009 instead of 9 leaves room for expansion if there will be many versions of a file, and keeps the files sorting in the correct order.
  • Avoid spaces in file names, since some software will not recognize them.
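The recommendations above can be sketched as a small helper. This is a minimal illustration in Python rather than part of the adapted guidance: the function name `build_file_name`, its parameters, and the element order (which simply mirrors the example file name) are assumptions made for the example.

```python
import re
from datetime import date

# Only letters, digits, and underscores are allowed in the name (an assumption
# drawn from the recommendations above; adjust to your team's convention).
_ALLOWED = re.compile(r"^[A-Za-z0-9_]+$")


def build_file_name(observed, elements, version, extension, width=2):
    """Join a date, descriptive elements, and a zero-padded version number."""
    parts = [observed.strftime("%Y%m%d")]      # YYYYMMDD sorts by date
    parts.extend(str(e) for e in elements)
    parts.append(f"v{version:0{width}d}")      # leading zeroes: v01, v02, ...
    stem = "_".join(parts)
    if not _ALLOWED.match(stem):
        raise ValueError(f"disallowed character in file name: {stem!r}")
    return f"{stem}.{extension}"


name = build_file_name(
    date(2018, 8, 17),
    ["StrsHlth", "Survey1", 43584, "KGH", "verbal"],
    version=1,
    extension="pdf",
)
print(name)
```

Rejecting spaces and special characters when the name is built, rather than cleaning them up later, keeps every file that enters the project consistent with the documented convention.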

Document and post your team's agreed-upon file naming conventions in a place where other operational documents are stored.

Tidy Data

Data stored in spreadsheets serve as fuel for analyses. To move more effectively from source worksheet to data manipulation software, it helps to adopt tidy data practices.

The precepts behind tidy data are simple, but if applied correctly they allow for complicated manipulations. The basic guidelines for tidy data are:

1. Make variables (related qualities measured across units) the columns in a spreadsheet.

2. Make observations (representing units or instances of study) the rows in a spreadsheet.

A final golden rule: don't combine variables.

An example of combined variables can be seen in the table below:

Participants   Male<30  Male>30  HrtRate
Participant 1           1        80
Participant 2  1                 90

In the table above, the variables for gender and age range were combined. A tidy version of this table is below:

Participants Gender Age RestingHeartRate ExerciseHeartRate
Participant 1 M 49 80 150
Participant 2 M 21 90 170
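Separating combined variables can also be done programmatically. The following is a minimal sketch in Python, assuming the untidy table has been read into a list of dictionaries; the names `COMBINED` and `separate_variables` are hypothetical. Because the untidy table records only an age range, the result has an AgeRange column; the exact ages cannot be recovered from it.

```python
# Untidy rows: gender and age range are fused into the column headers.
untidy = [
    {"Participants": "Participant 1", "Male<30": "", "Male>30": "1", "HrtRate": "80"},
    {"Participants": "Participant 2", "Male<30": "1", "Male>30": "", "HrtRate": "90"},
]

# Map each combined header to the two variables it hides.
COMBINED = {
    "Male<30": ("M", "<30"),
    "Male>30": ("M", ">30"),
}


def separate_variables(rows):
    """Return rows with Gender and AgeRange as their own columns."""
    tidy = []
    for row in rows:
        flagged = [col for col in COMBINED if row.get(col) == "1"]
        if len(flagged) != 1:
            raise ValueError(f"expected exactly one group for {row['Participants']}")
        gender, age_range = COMBINED[flagged[0]]
        tidy.append({
            "Participants": row["Participants"],
            "Gender": gender,
            "AgeRange": age_range,
            "HrtRate": int(row["HrtRate"]),
        })
    return tidy


tidy_rows = separate_variables(untidy)
for r in tidy_rows:
    print(r)
```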

Additional best practices when working with data in spreadsheets:

  • Name things consistently. Use the official variable and observation names consistently at all times, never abbreviating, in order to avoid confusion during computation or in resulting publications.
  • Fill all cells. Blank cells imply that a measurement was not taken, or was mistakenly not recorded. Define and use a single null value (either 0 or a string of alphabetic characters), choosing one that cannot be mistaken for a real measurement.
  • Do not store different types of data (data representing different variables) in the same column.
  • Do not embed calculations or graphs into a worksheet. They can become corrupted over time, or their original meaning may be lost. Use the spreadsheet to store data in its raw form, and store processed data elsewhere.
  • Do not use highlighting or conditional formatting as a variable. Though colored cells can help to point out information when reviewing a spreadsheet, highlighting should not be used as a way of storing information that will later need to be mathematically analyzed.
  • Use one table per observation type, and only one table per worksheet.
  • Create a data dictionary. The data dictionary contains the official spelling of each variable (or its abbreviation) and a concise definition. 
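Two of these practices, consistent naming and filled cells, lend themselves to an automated check. The sketch below is one possible approach in Python, assuming the worksheet has been exported as CSV text; the dictionary entries, their definitions, and the function name `check_table` are hypothetical. The sample data contains a deliberately misspelled column name and one blank cell.

```python
import csv
import io

# Hypothetical data dictionary: official variable name -> concise definition.
DATA_DICTIONARY = {
    "Participants": "Participant identifier",
    "Gender": "Gender code",
    "Age": "Age in years",
    "RestingHeartRate": "Beats per minute at rest",
    "ExerciseHeartRate": "Beats per minute during exercise",
}


def check_table(csv_text, dictionary):
    """Return a list of problems: unknown column names and blank cells."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for name in reader.fieldnames or []:
        if name not in dictionary:
            problems.append(f"column {name!r} is not in the data dictionary")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, value in row.items():
            if name is None:  # the row has more cells than the header
                problems.append(f"line {line_no}: more cells than columns")
            elif value is None or not value.strip():
                problems.append(f"line {line_no}: blank cell in {name!r}")
    return problems


sample = (
    "Participants,Gender,Age,RestingHeartRate,ExerciseHeatRate\n"
    "Participant 1,M,49,80,150\n"
    "Participant 2,M,21,,170\n"
)
problems = check_table(sample, DATA_DICTIONARY)
for problem in problems:
    print(problem)
```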


Folder Tips

Creating an organization system for your digital folders can be a challenging task, especially if team or project folders will be shared with others. Follow these basic tips to maximize your use of folders:

  • Start with the top-level hierarchy. Agree with your team members about what the highest level should be in the folder hierarchy. Each high-level folder is likely to have several sub and sub-sub folders, but the highest level should be a grouping of concepts or processes that are most important to the project.
  • Decide on a naming convention. Much like files, your folders should be named consistently according to a documented standard. The same naming standard should also apply to lower levels of folders.
  • Incorporate dates as needed. As with file names, incorporating a date in YYYYMMDD format (ISO 8601) into folder names can help you keep track of when a project took place or when significant updates occurred. Put the date at the beginning of the folder name to make folders line up properly by date.
  • File everything. Once you have created a folder structure, don't leave any files sitting outside of this structure. If there is no existing folder or sub-folder into which loose files can fit, create a new folder at the necessary level in the hierarchy, or use this opportunity to re-examine the hierarchy to determine whether the existing organization still accurately reflects the types of files being produced.
  • Include a README file. Just as a README file included with data files can help to explain how the data was created and used and the names of the variables, a README file placed on its own at the top of the folder hierarchy can be used to explain folder organization.
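As an illustration of the tips above, the following Python sketch creates a top-level hierarchy with a README file and then lists any files left outside the structure. The folder names in `TOP_LEVEL` and the function names are hypothetical, and the example runs in a temporary directory so that it does not touch real project files.

```python
import tempfile
from pathlib import Path

# Hypothetical top-level folders; replace with the groupings your team agrees on.
TOP_LEVEL = ["20180817_DataCollection", "20180912_Analysis", "Documentation"]


def create_structure(root, folders):
    """Create the agreed top-level folders and a README describing them."""
    root = Path(root)
    for name in folders:
        (root / name).mkdir(parents=True, exist_ok=True)
    lines = ["Folder organization", ""] + [f"{name}/" for name in folders]
    (root / "README.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


def loose_files(root, allowed=("README.txt",)):
    """List files sitting at the top level outside any folder."""
    return sorted(
        p.name for p in Path(root).iterdir()
        if p.is_file() and p.name not in allowed
    )


with tempfile.TemporaryDirectory() as tmp:
    create_structure(tmp, TOP_LEVEL)
    # Simulate a file that was saved without being filed.
    (Path(tmp) / "notes_draft.txt").write_text("unfiled", encoding="utf-8")
    unfiled = loose_files(tmp)
    print(unfiled)
```

Running a check like `loose_files` periodically is one way to notice when the hierarchy no longer fits the files being produced.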


Data Versioning

Working with data in digital files requires careful versioning to ensure the accuracy and authenticity of both original and modified data. File versioning can be achieved through either manual methods or web-based services.

Manual versioning

The simplest way to practice versioning through manual file management is to save a completely new file, which incorporates all the changes made to the original, with a slightly modified filename that allows recording of the version. If you know at the beginning of a research project that files may go through many versions, you may title the original document with "v01" at the end of the filename. Example filenames are:

  • Wren_Obs_20180912_FBE_v01.xlsx
  • Wren_Obs_20180912_FBE_v02.xlsx

The version marker ("v" plus a number) can be placed at the beginning or end of the filename, depending on preference, as long as the preference is documented and followed in the researcher's data management procedures. Starting with "01" allows for the possibility of version numbers reaching the double digits.
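Working out the next filename by hand is error-prone once there are many versions. A minimal Python sketch of the manual scheme follows, assuming the version marker sits at the end of the filename just before the extension; the function name `next_version` is hypothetical.

```python
import re

# Matches a trailing version marker such as "_v01" just before the extension.
_VERSION = re.compile(r"^(?P<stem>.+_v)(?P<num>\d+)(?P<ext>\.[^.]+)$")


def next_version(filename):
    """Return the filename with its version number increased by one."""
    match = _VERSION.match(filename)
    if match is None:
        raise ValueError(f"no version marker found in {filename!r}")
    width = len(match["num"])              # keep the existing zero padding
    number = int(match["num"]) + 1
    return f"{match['stem']}{number:0{width}d}{match['ext']}"


print(next_version("Wren_Obs_20180912_FBE_v01.xlsx"))
print(next_version("Wren_Obs_20180912_FBE_v09.xlsx"))
```

The function only computes the new name; saving the changed data under that name, and leaving the earlier file untouched, remains a manual step.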

Automated versioning

Several online collaboration and filesharing services offer versioning or version control as part of the service. Three popular services are listed below:

Northwestern OneDrive: Northwestern OneDrive is a cloud-based storage and collaboration system available to faculty, staff, and students for storing or sharing files. Changes to files stored in OneDrive are saved as they are made, and version history gives access to all previous versions of every file, regardless of who made the changes (the document owner or a collaborator).

Google Drive: Like OneDrive, Google Drive offers unlimited collaboration on stored files and access to all prior versions, regardless of who updated a file. In addition, files in Drive can be edited in real time.

GitHub: Git is at its core a version control tool that lets developers working on the same software project change and update code without overwriting previous versions. GitHub hosts Git repositories online, and project and data managers can use it for the same purpose. Changes are made by cloning a copy of the repository from GitHub, committing the changed files, and pushing them back to the repository, where they are recorded as a new version. For more details see "Understanding the GitHub Flow."