Write Expectations that automatically flag null values or out-of-range numbers in incoming data before it enters your pipeline
Auto-generate human-readable data documentation from your Expectations that stays accurate as data and checks evolve
Use the automated profiler to scan an existing dataset and get a suggested starting set of data quality checks to review
Connect Great Expectations to Databricks or BigQuery to validate large cloud datasets at scale
Cloud data sources like Databricks and BigQuery require credentials configured before running validations.
Great Expectations is a Python library for testing and documenting data. The idea behind it is that data pipelines, much like software code, need automated checks to catch problems before they cause real damage downstream. When a data file arrives with unexpected nulls, duplicate values, or the wrong date format, Great Expectations can catch that before the bad data spreads through a system.

The core concept is something called an Expectation. An Expectation is a statement about what your data should look like: values in a column should not be null, a numeric column should fall within a certain range, a timestamp column should match a specific format. You write a collection of these statements, and then run them against your data each time new data arrives. If the data does not match the stated expectations, you get an alert.

Beyond just running checks, the library also generates human-readable documentation directly from those Expectations. Because the documentation comes from the same tests that run against real data, it stays accurate automatically. As data changes over time, the documentation updates along with the test results.

There is also an automated profiling feature. You can point the profiler at an existing dataset and it will examine the data and suggest a starting set of Expectations based on what it finds, which you can then review and adjust. This is described in the README as a beta feature.

The framework is designed to connect with data stored in many places, including Databricks, Google BigQuery, and other cloud data systems. Each component, including how results are stored, how alerts are sent, and how documentation is rendered, is built to be extended or replaced.

Installation is through pip or conda, and getting started involves running a single init command after installing.
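To make the Expectation idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations API, which differs between releases and should be checked against the repo. Every name in it (`Expectation`, `expect_not_null`, `expect_between`, `expect_strftime_format`, `validate`, `document`) and the sample rows are hypothetical, chosen only to illustrate the three kinds of checks described above and how documentation can be rendered from the same statements.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Expectation:
    """A named statement about one column (hypothetical type, not the library's)."""
    name: str
    column: str
    check: Callable[[Any], bool]


def expect_not_null(column: str) -> Expectation:
    return Expectation(f"{column} is not null", column, lambda v: v is not None)


def expect_between(column: str, low: float, high: float) -> Expectation:
    # Nulls are left to expect_not_null, so they pass this check.
    return Expectation(
        f"{column} is between {low} and {high}",
        column,
        lambda v: v is None or low <= v <= high,
    )


def expect_strftime_format(column: str, fmt: str) -> Expectation:
    def matches(value: Any) -> bool:
        if value is None:
            return True
        try:
            datetime.strptime(value, fmt)
            return True
        except (TypeError, ValueError):
            return False

    return Expectation(f"{column} matches {fmt}", column, matches)


def validate(rows: list[Row], suite: list[Expectation]) -> dict[str, list[int]]:
    """For each expectation name, return the indexes of the rows that fail it."""
    return {
        exp.name: [i for i, row in enumerate(rows) if not exp.check(row.get(exp.column))]
        for exp in suite
    }


def document(suite: list[Expectation]) -> str:
    """Render the suite as plain-text documentation, one line per check."""
    return "\n".join(f"- {exp.name}" for exp in suite)


# Made-up sample data: row 1 has a null id, row 2 has a bad age and date format.
rows = [
    {"user_id": 1, "age": 34, "signup": "2024-01-15"},
    {"user_id": None, "age": 29, "signup": "2024-02-01"},
    {"user_id": 3, "age": 212, "signup": "15/03/2024"},
]
suite = [
    expect_not_null("user_id"),
    expect_between("age", 0, 120),
    expect_strftime_format("signup", "%Y-%m-%d"),
]

failures = validate(rows, suite)
for name, bad_rows in failures.items():
    print(name, "->", "ok" if not bad_rows else f"failed at rows {bad_rows}")

print(document(suite))
```

Because `document` reads from the same `suite` that `validate` runs, the rendered description cannot drift from the checks, which is the property the paragraph on documentation describes. The real library adds storage of results, alerting, and connections to external data sources on top of this basic shape.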
Verify against the repo before relying on details.