Making Your Data Reusable
The purpose of recent federal guidance on open data is to make data more accessible to the research community for the purposes of reuse. Specifically, the goal is to make research data more Findable, Accessible, Interoperable, and Reusable (FAIR).
- Findable: Others can learn that your data exist, either through a persistent link (like a DOI) in a paper or by searching a data repository.
- Accessible: Others can obtain a copy of the data, either automatically through an open repository or through a defined authorization procedure.
- Interoperable: The data can be combined with other, similar data.
- Reusable: The data are documented well enough that another expert in your field could understand how to use them.
Though reusability is critical at the end of a project when you make your data publicly available, the process will be much easier if you consider reusability throughout your project.
Deciding what to document about both your process and the data itself before you start collecting data will save you time in the long run. Write down everything someone in your field would need to know to understand how the data were collected and how to analyze them to obtain the same results. Several types of documentation can be useful for research data.
Writing a README file is an excellent way to document your data in a human-readable format that is flexible enough to accommodate any dataset. A README file should contain all the information someone needs to get started understanding your project.
Things to put in your README include:
- A description of your project in general terms; think of this as an abstract for your project
- An outline of the overall structure of your project
- Documentation of your file naming convention
- A simplified data inventory that does not list every file but gives the general scope of what is there or what you expect to collect
- Any other information that someone who is familiar with your field would need to understand your project
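As a starting point, the items above can be turned into a README skeleton that you fill in as the project develops. The sketch below is one possible approach, not a required template: the section headings, the `README.txt` filename, and the function names are assumptions chosen for illustration.

```python
from pathlib import Path

# Section headings mirror the README items listed above; the wording is illustrative.
README_SECTIONS = [
    ("Project description", "An abstract-style summary of the project."),
    ("Project structure", "How folders and files are organized."),
    ("File naming convention", "The pattern used to name files, with one example."),
    ("Data inventory", "The general scope of the data collected or expected."),
    ("Other information", "Anything else a researcher in the field would need."),
]


def build_readme(title, sections=README_SECTIONS):
    """Return plain-text README content with one placeholder per section."""
    lines = [title, "=" * len(title), ""]
    for heading, prompt in sections:
        lines += [heading, "-" * len(heading), f"TODO: {prompt}", ""]
    return "\n".join(lines)


def write_readme(folder, title):
    """Write README.txt into `folder`, refusing to overwrite an existing file."""
    path = Path(folder) / "README.txt"
    if path.exists():
        raise FileExistsError(f"{path} already exists")
    path.write_text(build_readme(title), encoding="utf-8")
    return path
```

Calling `write_readme("my_project", "Example Project")` would create the skeleton in an existing `my_project` folder. A plain-text file is used here so the README stays readable without special software.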
Codebooks for Tabular Data
Raw tabular data generally consists of tables in which each row holds an individual observation and each column holds one variable, or one type of data about each observation. A codebook describes what each variable is, what type of data it contains (e.g., numeric or text), what the acceptable range of values is, and which codes mark missing values. See the variable information section of this page for an example of a codebook for a tabular dataset.
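A codebook can also be kept in a machine-readable form so that a dataset can be checked against it. The following is a minimal sketch under stated assumptions: the variable names (`site_id`, `temp_c`), the value range, the `-999` missing-value code, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical codebook: every variable name, range, and missing-value code is an example.
CODEBOOK = {
    "site_id": {"type": "text", "description": "Sampling site identifier"},
    "temp_c": {
        "type": "numeric",
        "description": "Water temperature in degrees Celsius",
        "min": -5.0,
        "max": 40.0,
        "missing": "-999",
    },
}


def check_rows(rows, codebook):
    """Return a list of (row_number, variable, problem) tuples."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for name, spec in codebook.items():
            if name not in row:
                problems.append((row_number, name, "variable missing from row"))
                continue
            value = row[name]
            if value == spec.get("missing"):
                continue  # documented missing-value code, nothing to check
            if spec["type"] == "numeric":
                try:
                    number = float(value)
                except (TypeError, ValueError):
                    problems.append((row_number, name, f"not numeric: {value!r}"))
                    continue
                low, high = spec.get("min"), spec.get("max")
                if (low is not None and number < low) or (
                    high is not None and number > high
                ):
                    problems.append((row_number, name, f"out of range: {value}"))
    return problems


# Invented sample data standing in for a real CSV file.
sample = io.StringIO("site_id,temp_c\nA1,12.5\nA2,-999\nA3,85\nA4,warm\n")
issues = check_rows(csv.DictReader(sample), CODEBOOK)
```

With this sample, `issues` flags the third row as out of range and the fourth as non-numeric, while the `-999` in the second row is accepted because the codebook documents it as a missing-value code.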
Describing your data using established metadata standards is useful, especially if you submit your data to a discipline-specific data repository. These repositories generally have a specific set of metadata fields that must be provided before you can submit your data. A good way to find a relevant metadata standard is to look at repositories that accept your kind of data. For example, ecologists use Ecological Metadata Language (EML) to describe their data. Repositories like the Environmental Data Initiative use this standard to make their holdings searchable. See the open data-sharing page for more information.
To check whether a relevant metadata standard exists for your data, you can search the standards registry at FAIRsharing. If you need help choosing a metadata standard for your data, email email@example.com to speak to a data management specialist.
In general, use data formats that are common in your field. For example, researchers using GIS data will often use a combination of shapefile formats developed by Esri, the makers of ArcGIS. Most researchers who use GIS data in their research would be able to use these formats without trouble.
However, not all researchers have the resources to purchase licenses for software that uses proprietary formats, and vendor formats can change without warning. To make sure that all researchers, regardless of resources, can use your data, use open formats when you reach the open-sharing phase of your research process. Open alternatives to shapefiles include GeoJSON and the OGC GeoPackage Encoding Standard. Other examples of moving from proprietary to open formats include storing data in .csv files instead of Excel files, SPSS .sav files, or Stata .dta files.
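To illustrate why open formats are easy to reuse, the sketch below writes point locations as GeoJSON using only Python's standard library. It does not read shapefiles; converting an existing shapefile would need a GIS library or tool. The function name, the `lon` and `lat` keys, and the sample sites are assumptions made up for this example.

```python
import json


def points_to_geojson(records):
    """Build a GeoJSON FeatureCollection from dicts with 'lon' and 'lat' keys."""
    features = []
    for record in records:
        # Everything except the coordinates is kept as a feature property.
        properties = {k: v for k, v in record.items() if k not in ("lon", "lat")}
        features.append({
            "type": "Feature",
            # GeoJSON (RFC 7946) orders coordinates as longitude, then latitude.
            "geometry": {
                "type": "Point",
                "coordinates": [record["lon"], record["lat"]],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sampling sites, invented for illustration.
sites = [
    {"site_id": "A1", "lon": -93.26, "lat": 44.98},
    {"site_id": "A2", "lon": -93.09, "lat": 44.95},
]
geojson_text = json.dumps(points_to_geojson(sites), indent=2)
```

Because the result is plain text, anyone can open it in a text editor or load it into free GIS software without a license.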
If you would like advice about file formats to use in your research, email firstname.lastname@example.org to speak with a data management specialist.