explaingit

datahub-project/datahub

11,909 | Python | Audience · data | Complexity · 4/5 | License · Apache 2.0 | Setup · hard

TLDR

An open-source data catalog platform that connects to 80+ data tools to give companies a searchable map of all their data, where it lives, who owns it, where it came from, and how it flows between systems.

Mindmap

mindmap
  root((DataHub))
    Core Purpose
      Data catalog
      Lineage mapping
      Ownership tracking
    Integrations
      80+ connectors
      Warehouses
      BI dashboards
      ML systems
    Governance
      Access policies
      Audit trail
      Compliance
    AI Features
      Analytics Agent
      Plain English queries
      MCP integration

Things people build with this

USE CASE 1

Build a searchable catalog of all your company's databases and dashboards so engineers can find the right dataset without asking on Slack.

USE CASE 2

Trace data lineage to see exactly which upstream tables feed a broken dashboard so you can fix the root cause quickly.

USE CASE 3

Ask questions about your data in plain English using the Analytics Agent, which generates and runs SQL on your behalf.

USE CASE 4

Enforce data ownership and access policies across your entire data stack to help with GDPR or HIPAA compliance.

Tech stack

Python · Java · React · Elasticsearch · Kafka · MySQL · Docker

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires Docker Compose with multiple services (Elasticsearch, Kafka, MySQL). Self-hosted setup is non-trivial; the managed cloud version is simpler.

Free to use, modify, and distribute for any purpose, including commercial, as long as you keep the Apache 2.0 license notice.

In plain English

DataHub is an open-source platform for keeping track of all the data assets a company uses. It was originally built at LinkedIn to handle their internal data at large scale, and it has since been adopted by thousands of organizations.

The core problem it solves is that modern companies store and process data across many different tools: data warehouses, databases, business intelligence dashboards, machine learning systems, and data pipelines. Understanding what data exists, where it came from, and who is responsible for it becomes difficult when it is scattered across dozens of systems. DataHub acts as a central catalog for all of that.

It connects to your existing tools through a collection of more than 80 connectors, pulling in information about tables, columns, dashboards, pipelines, and other data objects. Once that information is collected, it builds a searchable index so that people can find the data they need, and it maps out lineage, meaning which datasets came from which other datasets and which transformations happened along the way. This helps teams understand the impact of a change before making it, and helps trace quality problems back to their source.

The platform also handles governance tasks: tracking ownership, applying tags and categories, managing data access policies, and recording an audit trail of how data has been used. These features are aimed at helping organizations comply with regulations and maintain quality standards across their data.

A recent addition is an open-source Analytics Agent that lets users ask questions about their data in plain English. The agent uses the DataHub catalog as context, generates SQL queries, runs them, and returns results along with charts. It also supports connecting to AI coding assistants like Claude Desktop or Cursor via the Model Context Protocol.

DataHub can be self-hosted or used as a managed cloud service. It is licensed under Apache 2.0.

The full README is longer than what was shown.
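The lineage idea described in this section (impact analysis going downstream, root-cause tracing going upstream) can be illustrated with a small graph walk. This is only a toy sketch under stated assumptions: it does not use DataHub's API or storage model, and every asset name, the `DOWNSTREAM` mapping, and the helpers `invert` and `reachable` are hypothetical names introduced for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each key is a data asset and each value lists
# the assets that read from it directly. All names are made up.
DOWNSTREAM = {
    "raw.orders": ["staging.orders_clean"],
    "raw.customers": ["staging.customers_clean"],
    "staging.orders_clean": ["analytics.revenue_summary"],
    "staging.customers_clean": ["analytics.revenue_summary"],
    "analytics.revenue_summary": ["dashboard.revenue", "dashboard.exec_kpis"],
}


def invert(edges):
    """Build the reverse mapping: asset -> its direct upstream assets."""
    upstream = {}
    for source, targets in edges.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


def reachable(start, edges):
    """Breadth-first walk returning every asset reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


UPSTREAM = invert(DOWNSTREAM)

# Impact analysis: what could break if this staging table changes?
impacted = reachable("staging.orders_clean", DOWNSTREAM)

# Root-cause tracing: which assets feed a broken dashboard?
root_causes = reachable("dashboard.revenue", UPSTREAM)

print(sorted(impacted))
print(sorted(root_causes))
```

The same traversal answers both questions; only the direction of the edges changes. A real catalog would add column-level edges and transformation metadata on top of this basic shape.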

Copy-paste prompts

Prompt 1
I set up DataHub and ingested our Snowflake warehouse. How do I write a metadata ingestion recipe YAML file to also pull in our Looker dashboards and link them to the underlying Snowflake tables?
Prompt 2
Using DataHub's lineage graph, walk me through how to find which ETL pipeline feeds the 'revenue_summary' table and which dashboards depend on it.
Prompt 3
How do I use the DataHub Analytics Agent to ask 'which tables have not been queried in 90 days?' and get a chart back?
Prompt 4
Set up DataHub's MCP server so I can query our data catalog from Cursor or Claude Desktop, and show me the config file I need.

Verify against the repo before relying on details.