explaingit

snorkel-team/snorkel

5,960 · Python · Audience: data · Complexity: 3/5 · Setup: easy

TLDR

Python framework for building ML training datasets by writing heuristic labeling functions instead of hand-labeling examples, then combining those imperfect labels statistically into a clean training set.

Mindmap

mindmap
  root((snorkel))
    What it does
      Auto-labels data
      Weak supervision
      Statistical label combining
    Techniques
      Labeling functions
      Data augmentation
      Data slicing
      Multi-task learning
    Status
      Stanford origin
      Not actively developed
      Community maintained
    Use cases
      NLP datasets
      Model debugging

Things people build with this

USE CASE 1

Label thousands of examples automatically by writing Python functions that apply rules or heuristics, instead of hand-annotating each one.
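As a rough illustration of what such rule-based functions can look like, here is a plain-Python sketch. It deliberately does not import Snorkel; the function names, keyword lists, and the billing/technical label scheme are all hypothetical. Using -1 for "abstain" follows the convention used in Snorkel's tutorials.

```python
import re

# Label constants (hypothetical scheme); -1 means "this rule has no opinion".
ABSTAIN, BILLING, TECHNICAL = -1, 0, 1


def lf_billing_keywords(text):
    """Vote BILLING if the text mentions a billing-related keyword."""
    keywords = ("invoice", "refund", "charge")
    return BILLING if any(k in text.lower() for k in keywords) else ABSTAIN


def lf_error_code(text):
    """Vote TECHNICAL if the text contains something like 'error 503'."""
    return TECHNICAL if re.search(r"\berror\s+\d+\b", text, re.IGNORECASE) else ABSTAIN


def lf_crash_words(text):
    """Vote TECHNICAL if the text mentions crashes, freezes, or bugs."""
    pattern = r"\b(crash\w*|freez\w*|bug)\b"
    return TECHNICAL if re.search(pattern, text, re.IGNORECASE) else ABSTAIN


LABELING_FUNCTIONS = [lf_billing_keywords, lf_error_code, lf_crash_words]


def apply_lfs(texts):
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]


emails = [
    "I was charged twice, please refund me",
    "The app crashes with error 503 on login",
    "Hello, just saying thanks",
]
votes = apply_lfs(emails)
for email, row in zip(emails, votes):
    print(row, email)
```

Each rule is allowed to be wrong or to abstain; the value comes from having several overlapping rules whose votes are reconciled later.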

USE CASE 2

Build an ML training dataset in hours rather than weeks by encoding domain knowledge as labeling functions.

USE CASE 3

Debug a model's failure modes on specific data subsets using Snorkel's data slicing tools.
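The idea behind slicing can be sketched in plain Python: define a predicate that picks out a subset of the data, then compare the model's accuracy on that subset with its overall accuracy. This is only an illustration and does not use Snorkel's slicing API; the 20-word threshold, the helper names, and the evaluation data are made up for the example.

```python
def is_short_review(text, max_words=20):
    """Slice predicate: True for reviews with fewer than max_words words."""
    return len(text.split()) < max_words


def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that agree, or None if there are none."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


# Hypothetical evaluation set: (review text, gold label, model prediction).
long_text = "solid build quality and fast shipping " * 5  # 30 words
examples = [
    ("Great", 1, 0),
    ("Not bad at all", 1, 0),
    ("Terrible", 0, 0),
    (long_text, 1, 1),
    (long_text, 1, 1),
]

overall_acc = accuracy((gold, pred) for _, gold, pred in examples)
slice_acc = accuracy(
    (gold, pred) for text, gold, pred in examples if is_short_review(text)
)
print(f"overall accuracy: {overall_acc:.2f}")
print(f"short-review slice accuracy: {slice_acc:.2f}")
```

A large gap between the two numbers suggests the slice deserves targeted attention, such as more training data or extra labeling rules for that subset.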

USE CASE 4

Generate additional training examples from an existing dataset using data augmentation transforms.
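A transform of this kind can be sketched without Snorkel as a function that takes one example and returns a label-preserving variant. The synonym table, function names, and parameters below are hypothetical and chosen only to keep the example small.

```python
import random

# Tiny hypothetical synonym table; a real one would come from a thesaurus.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor"], "movie": ["film"]}


def swap_synonyms(text, rng, swap_prob=0.5):
    """Replace each known word with a random synonym with probability swap_prob."""
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < swap_prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


def augment(dataset, copies=2, seed=0):
    """Return the original examples plus `copies` transformed variants of each."""
    rng = random.Random(seed)  # seeded so repeated runs give the same variants
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(copies):
            augmented.append((swap_synonyms(text, rng), label))
    return augmented


labeled = [("a good movie", 1), ("a bad movie", 0)]
augmented = augment(labeled)
for text, label in augmented:
    print(label, text)
```

Because each swap is random, some variants may come out identical to the original; deduplicating the result is a reasonable follow-up step.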

Tech stack

Pythonpipconda

Getting it running

Difficulty · easy
Time to first run · 30 min

The original team now focuses on the commercial Snorkel Flow platform; the open-source library is stable but no longer actively developed.

In plain English

Snorkel is a Python framework for building machine learning training datasets using code instead of manual labeling. The core problem it addresses is that training an AI model requires large amounts of labeled data, and labeling data by hand is slow and expensive. Snorkel's approach, called weak supervision, lets you write simple functions that apply rough heuristics or rules to automatically label examples. Those imperfect labels are then combined statistically to produce a cleaner training set.

The project started at Stanford in 2015 and was used in production deployments at Google, Intel, and Stanford Medicine, among others. The idea was that the quality of training data matters more to a model's success than most architectural choices, and that treating data creation as a programming problem rather than a manual annotation task could make machine learning faster and more adaptable.

Beyond labeling, Snorkel supported related techniques including data augmentation (generating new examples by transforming existing ones), data slicing (identifying and debugging specific subsets where a model performs poorly), and multi-task learning. Over several years the team published more than sixty research papers on these techniques.

The README includes an important note: the core team has shifted focus to Snorkel Flow, a commercial end-to-end platform that builds on these ideas. The open-source repository is no longer under active development from the original team. It remains available, installable via pip or conda with Python 3.11 or later, and the tutorials and documentation are still online. Community contributions are accepted through pull requests.

For anyone exploring the library, the recommended starting point is the getting-started page on the Snorkel website followed by the tutorials repository, which covers a range of labeling tasks and domains. Windows users are advised to use Docker or the Linux subsystem due to limited testing on that platform.
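To make the "combined statistically" step described earlier more concrete, here is a deliberately simple stand-in: a majority vote over labeling-function outputs that ignores abstentions and returns a probability per class. This is an assumption-laden sketch, not Snorkel's method; Snorkel's label model goes further by estimating how reliable each labeling function is. The function name and the vote matrix are hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # a labeling function that has no opinion returns this


def vote_distribution(row, n_classes):
    """Turn one row of votes into class probabilities, ignoring abstentions."""
    counts = Counter(v for v in row if v != ABSTAIN)
    total = sum(counts.values())
    if total == 0:
        # Every function abstained: fall back to a uniform distribution.
        return [1.0 / n_classes] * n_classes
    return [counts[c] / total for c in range(n_classes)]


# Hypothetical votes: one row per example, one column per labeling function.
votes = [
    [1, 1, ABSTAIN],
    [0, 1, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
soft_labels = [vote_distribution(row, n_classes=2) for row in votes]
for row, probs in zip(votes, soft_labels):
    print(row, [round(p, 2) for p in probs])
```

Majority voting treats every rule as equally trustworthy, which is exactly the limitation a learned label model is meant to address.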

Copy-paste prompts

Prompt 1
Show me a minimal Snorkel script with three labeling functions for a sentiment classification task, then combine them with a LabelModel and export a pandas DataFrame of probabilistic labels.
Prompt 2
I have 50,000 unlabeled customer support emails. Show me how to write Snorkel labeling functions using keyword lists and regex to label them as billing or technical issues.
Prompt 3
My model performs poorly on short product reviews. Show me how to define a Snorkel slice for reviews under 20 words and identify where the model fails.
Prompt 4
How do I use Snorkel transformation functions to augment a small labeled NLP dataset by randomly swapping synonyms?

Verify against the repo before relying on details.