Label thousands of examples automatically by writing Python functions that apply rules or heuristics, instead of hand-annotating each one.
Build an ML training dataset in hours rather than weeks by encoding domain knowledge as labeling functions.
Debug a model's failure modes on specific data subsets using Snorkel's data slicing tools.
Generate additional training examples from an existing dataset using data augmentation transforms.
The original team now focuses on the commercial Snorkel Flow platform; the open-source library is stable but no longer actively developed.
Snorkel is a Python framework for building machine learning training datasets using code instead of manual labeling. The core problem it addresses is that training a model requires large amounts of labeled data, and labeling data by hand is slow and expensive. Snorkel's approach, called weak supervision, lets you write simple functions that apply rough heuristics or rules to label examples automatically. Those imperfect labels are then combined statistically to produce a cleaner training set.
The project started at Stanford in 2015 and has been used in production deployments at Google, Intel, and Stanford Medicine, among others. The guiding idea is that the quality of training data matters more to a model's success than most architectural choices, and that treating data creation as a programming problem rather than a manual annotation task can make machine learning faster and more adaptable.
Beyond labeling, Snorkel supports related techniques: data augmentation (generating new examples by transforming existing ones), data slicing (identifying and debugging specific subsets where a model performs poorly), and multi-task learning. Over several years the team published more than sixty research papers on these techniques.
The README includes an important note: the core team has shifted focus to Snorkel Flow, a commercial end-to-end platform that builds on these ideas. The open-source repository is no longer under active development by the original team. It remains available and installable via pip or conda with Python 3.11 or later, and the tutorials and documentation are still online. Community contributions are accepted through pull requests.
For anyone exploring the library, the recommended starting point is the getting-started page on the Snorkel website, followed by the tutorials repository, which covers a range of labeling tasks and domains. Windows users are advised to use Docker or the Linux subsystem because testing on that platform is limited.
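To make the weak supervision idea described above concrete, here is a minimal sketch in plain Python that does not import Snorkel. The spam/ham task, the label constants, the three heuristic functions, and the `weak_label` helper are all hypothetical names chosen for illustration. The votes are combined by simple majority vote, which is a deliberately simpler stand-in for the statistical combination Snorkel performs.

```python
from collections import Counter

# Hypothetical label constants for this sketch.
ABSTAIN = -1  # the heuristic has no opinion on this example
HAM = 0
SPAM = 1


def lf_contains_link(text: str) -> int:
    """Rough heuristic: messages containing a URL are likely spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Rough heuristic: prize or winner language suggests spam."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_message(text: str) -> int:
    """Rough heuristic: very short messages are usually ordinary chat."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_vote(votes: list[int]) -> int:
    """Combine noisy votes; abstain when nobody votes or the top vote is tied."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_label(texts: list[str]) -> list[int]:
    """Apply every labeling function to every text and combine the votes."""
    return [majority_vote([lf(t) for lf in LABELING_FUNCTIONS]) for t in texts]


if __name__ == "__main__":
    examples = [
        "You are a winner, claim your prize at http://example.com",
        "see you at lunch",
        "The quarterly report is attached for your review",
    ]
    for text, label in zip(examples, weak_label(examples)):
        print(label, text)
```

Examples on which every heuristic abstains, or on which the heuristics disagree evenly, stay unlabeled and would simply be left out of the training set. In the library itself, labeling functions are declared with the `labeling_function` decorator and the votes are combined by a `LabelModel` that estimates how reliable each function is rather than counting votes equally; consult the Snorkel documentation for the exact interface.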
Verify against the repo before relying on details.