Prepare a clean four-part data package for a statistician using the guide's checklist so analysis can start immediately.
Structure a research dataset as a tidy CSV with one variable per column and one observation per row.
Write a code book that documents every variable's units, collection method, and meaning so an analyst can interpret the numbers correctly.
Avoid silent Excel date reformatting errors by saving and sharing data as plain CSV or tab-delimited files.
This repository contains a written guide explaining how to prepare and hand off data to a statistician or data scientist. It is aimed at researchers, students, and collaborators who collect data in one field and need someone else to analyze it. The document does not contain software. It is a practical reference document focused on making that handoff go smoothly. The guide recommends delivering four things: the original, untouched raw data exactly as it came from the source, a cleaned, structured version called a tidy data set, a code book that describes every variable and how it was measured, and a script or step-by-step written description of exactly how the raw data was turned into the tidy data set. Each of these plays a different role in helping the analyst understand the data without having to guess or backtrack. The concept of tidy data comes from work by data scientist Hadley Wickham and follows a few basic rules: each measured variable gets its own column, each observation gets its own row, and similar kinds of data stay in one table rather than spread across multiple sheets. The guide goes through a genomics example to show what this looks like in practice, and explains that sharing data as a plain CSV or tab-delimited file is safer than using Excel, which can silently alter date values. The code book section explains that variable names in a spreadsheet are rarely self-explanatory enough on their own. Units, how measurements were collected, and how the study was designed all need to be written down somewhere so the analyst can interpret the numbers correctly. The guide was written by Jeff Leek and reflects the experience of his research group at Johns Hopkins, where slow data handoffs were consistently the biggest delay in getting from raw measurements to finished analysis.
← jtleek on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.