treeverse/dvc

★ 15,595
Python
Audience · data
Complexity · 3/5
Setup · moderate

Mindmap

mindmap
  root((DVC))
    What It Does
      Data version control
      Git for large files
      Experiment tracking
    How It Works
      Cloud storage sync
      Git pointer files
      Pipeline stages
    Remotes Supported
      S3 and Azure
      Google Cloud
      SSH on-premise
    Use Cases
      Reproducible ML
      Team data sharing
      Experiment comparison

Things people build with this

USE CASE 1

Version large training datasets and trained models the same way you version code, so any past experiment is fully reproducible.

USE CASE 2

Share a machine learning project with teammates so they can pull the exact same data version and recreate any experiment from scratch.

USE CASE 3

Track and compare multiple experiment runs with parameters, metrics, and plots stored locally in Git, no separate server needed.

Tech stack

Python, Git

Getting it running

Difficulty · moderate
Time to first run · 30 min

Sharing large files with teammates requires configuring a storage remote (S3, Azure, GCS, or SSH); local-only use works without one.

In plain English

DVC, short for Data Version Control, is a command-line tool and a VS Code extension that helps people working on machine learning projects keep track of their data, models, and experiments. Git is great for code but awkward for large data files; DVC fills that gap by letting you version data the same way you version code.

It works roughly as a "Git for data": you keep your code in a normal Git repository and tell DVC to track the larger files, such as images, datasets, and trained models. DVC stores those big files in a cache outside of Git and uploads them to a remote of your choice: any major cloud storage such as S3, Azure, or Google Cloud, or on-premise storage over SSH. In your Git repo it leaves small placeholder files that point at the cached versions.

On top of that, DVC works like a Makefile for machine learning: you describe pipeline stages that say which inputs produce which outputs, and when something changes, only the affected steps rerun. There is also experiment tracking that lives in your local Git repo with no separate server, letting you run many experiments and compare their parameters, metrics, and plots.

You would use DVC when your ML project has outgrown what fits comfortably in Git and you want reproducibility: someone else can recreate any given experiment by pulling code from Git and data from the configured remote.

DVC is a Python tool installed through pip, conda, Homebrew, Chocolatey, snap, or the VS Code marketplace, with optional extras like dvc-s3 or dvc-azure for specific remotes.
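The cache-and-placeholder idea described above can be sketched in a few lines of standard-library Python. This is an illustration of content-addressed storage in general, not DVC's actual code: the function names `track` and `restore`, the `.pointer` suffix, and the cache directory layout are all assumptions made for the example. MD5 is used only because DVC's placeholder files are understood to record an md5 checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Hypothetical layout: cache_dir/<first 2 hex chars>/<remaining chars>.
    """
    checksum = file_md5(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    # The pointer is a tiny text file, which is what would be committed to Git.
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(f"md5: {checksum}\npath: {data_file.name}\n")
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file named in a pointer by copying it from the cache."""
    fields = dict(line.split(": ", 1) for line in pointer.read_text().splitlines())
    checksum = fields["md5"]
    target = pointer.with_name(fields["path"])
    shutil.copy2(cache_dir / checksum[:2] / checksum[2:], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        dataset = workspace / "train.csv"
        dataset.write_text("label,pixel\n1,0.5\n")
        pointer = track(dataset, workspace / "cache")
        print(pointer.read_text())  # the small file Git would version
        dataset.unlink()            # simulate a fresh clone with no data
        print(restore(pointer, workspace / "cache").read_text())
```

Because the cache path is derived from the file's content, two versions of a dataset occupy two cache entries while the Git history only ever sees the small pointer change.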

Copy-paste prompts

Prompt 1

Set up DVC in my existing Git repo to track a 2GB training dataset stored in S3, so my teammates can pull the exact data version I used for each experiment with `dvc pull`.

Prompt 2

Write a DVC pipeline with three stages (data preprocessing, model training, and evaluation) that only reruns changed stages when the raw data is updated.
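To make "only reruns changed stages" concrete, here is a minimal Python sketch of the underlying idea: fingerprint each stage's inputs, remember the fingerprint, and skip the stage when nothing has changed. It is not DVC's implementation. The `Stage` class, the `repro` function, and the in-memory `lock` dictionary are hypothetical names for this example, and the stages are assumed to be listed in dependency order rather than resolved from a graph.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    deps: List[Path]            # input files the stage reads
    outs: List[Path]            # output files the stage writes
    run: Callable[[], object]   # the command, modelled as a Python callable


def fingerprint(paths: List[Path]) -> str:
    """Combine the names and contents of all dependencies into one checksum."""
    digest = hashlib.md5()
    for path in paths:
        digest.update(path.name.encode())
        digest.update(path.read_bytes() if path.exists() else b"<missing>")
    return digest.hexdigest()


def repro(stages: List[Stage], lock: Dict[str, str]) -> List[str]:
    """Run stages in order, skipping any whose deps are unchanged.

    Returns the names of the stages that actually ran.
    """
    ran = []
    for stage in stages:
        current = fingerprint(stage.deps)
        outs_exist = all(path.exists() for path in stage.outs)
        if lock.get(stage.name) == current and outs_exist:
            continue  # inputs identical to the last run, outputs still present
        stage.run()
        lock[stage.name] = current
        ran.append(stage.name)
    return ran


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw, clean = root / "raw.txt", root / "clean.txt"
        model, metrics = root / "model.txt", root / "metrics.txt"
        raw.write_text("3 1 2")
        stages = [
            Stage("preprocess", [raw], [clean],
                  lambda: clean.write_text(" ".join(sorted(raw.read_text().split())))),
            Stage("train", [clean], [model],
                  lambda: model.write_text(str(len(clean.read_text().split())))),
            Stage("evaluate", [model], [metrics],
                  lambda: metrics.write_text("items=" + model.read_text())),
        ]
        lock: Dict[str, str] = {}
        print(repro(stages, lock))  # first run: every stage executes
        print(repro(stages, lock))  # nothing changed: no stage executes
```

A downstream stage is skipped whenever its inputs hash to the same value as before, even if an upstream stage reran, which is the property that saves time on long training pipelines.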

Prompt 3

Using DVC experiment tracking, run my train.py script 5 times with different learning rates and show me how to compare the resulting accuracy metrics in a table.

Prompt 4

How do I configure DVC to use Google Cloud Storage as the remote so the team can share large model files without committing them to Git?

Prompt 5

What is the difference between `dvc repro` and `dvc stage add` (the replacement for the older `dvc run`), and when should I use each one in a machine learning pipeline?

Verify against the repo before relying on details.