explaingit

treeverse/dvc

15,595 · Python · Audience: data · Complexity: 3/5 · Setup: moderate

TLDR

DVC adds Git-style version control for large data files, ML models, and experiments to machine learning projects, syncing big files to cloud storage while keeping lightweight pointers in your Git repo.

Mindmap

mindmap
  root((DVC))
    What It Does
      Data version control
      Git for large files
      Experiment tracking
    How It Works
      Cloud storage sync
      Git pointer files
      Pipeline stages
    Remotes Supported
      S3 and Azure
      Google Cloud
      SSH on-premise
    Use Cases
      Reproducible ML
      Team data sharing
      Experiment comparison

Things people build with this

USE CASE 1

Version large training datasets and trained models the same way you version code, so any past experiment is fully reproducible.

USE CASE 2

Share a machine learning project with teammates so they can pull the exact same data version and recreate any experiment from scratch.

USE CASE 3

Track and compare multiple experiment runs with parameters, metrics, and plots stored locally in Git, no separate server needed.

Tech stack

Python · Git

Getting it running

Difficulty: moderate · Time to first run: 30 min

Sharing large files with teammates requires configuring a storage remote (S3, Azure, GCS, or SSH); local-only use works without a remote.
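As a rough sketch of what that setup can look like, the Python snippet below assembles the DVC and Git commands for a first run and prints them without executing anything by default. It assumes DVC is installed and the project is already a Git repository. The bucket URL `s3://example-bucket/dvc-store`, the remote name `storage`, and the helper names `setup_commands` and `run_all` are hypothetical, and the exact files you need to stage with Git may differ in your project.

```python
import shlex
import subprocess
from pathlib import PurePosixPath


def setup_commands(data_path: str, remote_url: str, remote_name: str = "storage") -> list:
    """Build the command sequence for tracking data_path and pushing it to a remote."""
    path = PurePosixPath(data_path)
    pointer = f"{path}.dvc"                    # placeholder file written by `dvc add`
    gitignore = str(path.parent / ".gitignore")  # `dvc add` ignores the data next to it
    return [
        ["dvc", "init"],
        ["dvc", "remote", "add", "-d", remote_name, remote_url],
        ["dvc", "add", str(path)],
        ["git", "add", pointer, gitignore, ".dvc/config"],
        ["git", "commit", "-m", f"Track {path} with DVC"],
        ["dvc", "push"],
    ]


def run_all(commands, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical bucket; replace with your own remote before running for real.
    run_all(setup_commands("data", "s3://example-bucket/dvc-store"))
```

A teammate who clones the Git repository would then run `dvc pull` to fetch the data that matches the checked-out commit.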

In plain English

DVC, short for Data Version Control, is a command-line tool and a VS Code extension that helps people working on machine learning projects keep track of their data, models, and experiments. Git is great for code but awkward for large data files; DVC fills that gap by letting you version data the same way you version code.

It works roughly as a "Git for data": you keep your code in a normal Git repository and tell DVC to track the larger files, such as images, datasets, and trained models. DVC stores those big files in a cache outside of Git and uploads them to a remote of your choice: any major cloud storage such as S3, Azure, or Google Cloud, or on-premise storage over SSH. In your Git repo it leaves small placeholder files that point at the cached versions.

On top of that, DVC works like a Makefile for machine learning: you describe pipeline stages that say which inputs produce which outputs, and when something changes, only the affected steps rerun. Experiment tracking also lives in your local Git repo with no separate server, letting you run many experiments and compare their parameters, metrics, and plots.

You would use DVC when your ML project has outgrown what fits comfortably in Git and you want reproducibility: the ability to share a project so that someone else can recreate any given experiment by pulling code from Git and data from the configured remote.

DVC is a Python tool installed through pip, conda, Homebrew, Chocolatey, snap, or the VS Code marketplace, with optional extras like dvc-s3 or dvc-azure for specific remotes. The full README is longer than what was provided.
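To make the placeholder-and-cache idea concrete, here is a toy sketch in Python. It is not DVC's implementation: the function names `track` and `restore`, the JSON pointer format, and the cache layout are assumptions chosen for illustration (DVC's real placeholder files are YAML files ending in `.dvc`). The point is only that a small text file holding a content hash is enough to find the big file again in a cache.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]  # illustrative layout, keyed by hash
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    meta = {"path": data_file.name, "md5": digest, "size": data_file.stat().st_size}
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file next to its pointer, using the hash to find it in the cache."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["md5"][:2] / meta["md5"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cached, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        data.write_text("x,y\n1,2\n")
        pointer = track(data, root / "cache")
        data.unlink()  # simulate a fresh clone that has only the pointer
        restored = restore(pointer, root / "cache")
        print(pointer.name, "->", restored.name, restored.stat().st_size, "bytes")
```

In this picture, only the pointer file would be committed to Git, while the cache plays the role of the storage that holds the actual bytes.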

Copy-paste prompts

Prompt 1
Set up DVC in my existing Git repo to track a 2GB training dataset stored in S3, so my teammates can pull the exact data version I used for each experiment with `dvc pull`.
Prompt 2
Write a DVC pipeline with three stages (data preprocessing, model training, and evaluation) that reruns only the changed stages when the raw data is updated.
Prompt 3
Using DVC experiment tracking, run my train.py script 5 times with different learning rates and show me how to compare the resulting accuracy metrics in a table.
Prompt 4
How do I configure DVC to use Google Cloud Storage as the remote so the team can share large model files without committing them to Git?
Prompt 5
What is the difference between `dvc repro` and `dvc run`, and when should I use each one in a machine learning pipeline?
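
The pipeline prompts above rely on the idea that a stage reruns only when its inputs change. A minimal Python sketch of that idea follows. It is a toy model, not DVC's code: the `run_stage` helper, the JSON lock file, and the stage names are hypothetical, and DVC itself describes stages in `dvc.yaml` and records hashes in `dvc.lock`.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def file_hash(path: Path) -> str:
    """Hash a file's contents so changes can be detected."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def run_stage(name: str, deps: Sequence[Path], outs: Sequence[Path],
              command: Callable[[], object], lock_file: Path) -> bool:
    """Run command() only if a dependency changed or an output is missing.

    Returns True if the stage executed, False if it was skipped.
    """
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = {str(dep): file_hash(dep) for dep in deps}
    up_to_date = lock.get(name) == current and all(out.exists() for out in outs)
    if up_to_date:
        return False
    command()
    lock[name] = current  # remember the dependency hashes this run used
    lock_file.write_text(json.dumps(lock, indent=2))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw, clean, model = root / "raw.txt", root / "clean.txt", root / "model.txt"
        lock = root / "stages.lock.json"
        raw.write_text("  Hello  \n")

        def preprocess():
            clean.write_text(raw.read_text().strip().lower())

        def train():
            model.write_text(f"model trained on {len(clean.read_text())} characters")

        stages = [("preprocess", [raw], [clean], preprocess),
                  ("train", [clean], [model], train)]
        for attempt in (1, 2):
            for name, deps, outs, cmd in stages:
                ran = run_stage(name, deps, outs, cmd, lock)
                print(f"pass {attempt}: {name} {'executed' if ran else 'skipped'}")
```

On the second pass nothing has changed, so both stages are skipped; editing `raw.txt` would make `preprocess` run again, and `train` would follow only if `clean.txt` ended up different.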

Verify against the repo before relying on details.