explaingit

jtleek/datasharing

6,735 Audience · researcher Complexity · 1/5 Setup · easy

TLDR

A practical written guide explaining how to prepare and hand off data to a statistician or data scientist, covering raw data, tidy datasets, code books, and the transformation script that connects them.

Mindmap

mindmap
  root((datasharing))
    What it does
      Data handoff guide
      Tidy data rules
      Analyst collaboration
    Four Deliverables
      Raw data
      Tidy dataset
      Code book
      Processing script
    Key Concepts
      Tidy data format
      Variable documentation
      CSV over Excel
    Audience
      Researchers
      Students
      Collaborators

Things people build with this

USE CASE 1

Prepare a clean four-part data package for a statistician using the guide's checklist so analysis can start immediately.

USE CASE 2

Structure a research dataset as a tidy CSV with one variable per column and one observation per row.

USE CASE 3

Write a code book that documents every variable's units, collection method, and meaning so an analyst can interpret the numbers correctly.

USE CASE 4

Avoid silent Excel date reformatting errors by saving and sharing data as plain CSV or tab-delimited files.
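
As a rough illustration of the last use case, the sketch below writes dates to a plain CSV as ISO 8601 text and reads them back unchanged. The column names `sample_id` and `collected_on` and the sample values are hypothetical, not taken from the guide; the point is only that a text file with explicit date strings leaves nothing for a spreadsheet program to reinterpret.

```python
import csv
import io
from datetime import date

# Hypothetical observations; the column names are illustrative only.
rows = [
    {"sample_id": "s01", "collected_on": date(2015, 3, 1)},
    {"sample_id": "s02", "collected_on": date(2015, 12, 3)},
]


def write_csv(records):
    """Serialize records to CSV text, storing dates as ISO 8601 strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "collected_on"])
    for record in records:
        writer.writerow([record["sample_id"], record["collected_on"].isoformat()])
    return buffer.getvalue()


def read_csv(text):
    """Parse the CSV text back into records with real date objects."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "sample_id": record["sample_id"],
            "collected_on": date.fromisoformat(record["collected_on"]),
        }
        for record in reader
    ]


csv_text = write_csv(rows)
round_trip = read_csv(csv_text)
print(csv_text)
```

Because the dates are stored as unambiguous `YYYY-MM-DD` strings, the analyst can parse them explicitly instead of trusting whatever a spreadsheet guessed.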

Getting it running

Difficulty · easy Time to first run · 5min

In plain English

This repository contains a written guide explaining how to prepare and hand off data to a statistician or data scientist. It is aimed at researchers, students, and collaborators who collect data in one field and need someone else to analyze it. The repository contains no software; it is a practical reference focused on making that handoff go smoothly.

The guide recommends delivering four things: the raw data, original and untouched, exactly as it came from the source; a cleaned, structured version called a tidy data set; a code book that describes every variable and how it was measured; and a script, or a step-by-step written description, of exactly how the raw data was turned into the tidy data set. Each of these plays a different role in helping the analyst understand the data without having to guess or backtrack.

The concept of tidy data comes from work by data scientist Hadley Wickham and follows a few basic rules: each measured variable gets its own column, each observation gets its own row, and similar kinds of data stay in one table rather than being spread across multiple sheets. The guide works through a genomics example to show what this looks like in practice, and explains that sharing data as a plain CSV or tab-delimited file is safer than using Excel, which can silently alter date values.

The code book section explains that variable names in a spreadsheet are rarely self-explanatory on their own. Units, how measurements were collected, and how the study was designed all need to be written down so the analyst can interpret the numbers correctly.

The guide was written by Jeff Leek and reflects the experience of his research group at Johns Hopkins, where slow data handoffs were consistently the biggest delay in getting from raw measurements to finished analysis.
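
To make the four deliverables described above concrete, here is a minimal sketch under stated assumptions, not the guide's own method. The raw table, the column names (`patient_id`, `treatment_group`, `bp_week1`, `bp_week2`), and the code book entries are all hypothetical. The script plays the role of the processing step: it reshapes a wide raw export into one row per observation and checks that every tidy column is documented in the code book.

```python
import csv
import io

# Hypothetical raw export: one row per patient, one blood-pressure column per week.
RAW_CSV = """patient_id,treatment_group,bp_week1,bp_week2
p01,control,120,118
p02,treated,135,128
"""

# Hypothetical code book: every tidy column gets units and a description.
CODE_BOOK = {
    "patient_id": {"units": "none", "description": "Anonymized patient identifier"},
    "treatment_group": {"units": "none", "description": "Study arm: control or treated"},
    "week": {"units": "weeks", "description": "Visit week, counted from enrollment"},
    "blood_pressure_mmhg": {"units": "mmHg", "description": "Blood pressure at the visit"},
}

PREFIX = "bp_week"


def tidy_rows(raw_text):
    """Reshape the wide raw table into one row per (patient, week) observation."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        for column, value in record.items():
            if column.startswith(PREFIX):
                rows.append({
                    "patient_id": record["patient_id"],
                    "treatment_group": record["treatment_group"],
                    "week": int(column[len(PREFIX):]),
                    "blood_pressure_mmhg": int(value),
                })
    return rows


def undocumented_columns(rows, code_book):
    """Return tidy columns that the code book does not describe."""
    return [name for name in rows[0] if name not in code_book]


def to_csv(rows):
    """Serialize the tidy rows as plain CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


tidy = tidy_rows(RAW_CSV)
missing = undocumented_columns(tidy, CODE_BOOK)
tidy_csv = to_csv(tidy)
print(tidy_csv)
```

Keeping the raw text untouched and deriving the tidy table only through the script means the analyst can rerun every step, and an empty `missing` list is a quick check that the code book covers each column.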

Copy-paste prompts

Prompt 1
I have a genomics experiment with 50 samples and 200 measured gene values each. Show me what a tidy CSV layout looks like following the datasharing guide's rules.
Prompt 2
Write a code book template in markdown for a dataset with columns: patient_id, age_years, blood_pressure_mmhg, and treatment_group.
Prompt 3
I collected survey responses in an Excel sheet with multiple tabs. Give me step-by-step instructions for converting it to a tidy CSV that a statistician can open in R.
Prompt 4
Show me a minimal R script that reads raw_data.csv, renames columns to snake_case, removes rows with any missing values, and saves the result as tidy_data.csv.

Verify against the repo before relying on details.