explaingit

amundsen-io/amundsen

4,768 · Python · Audience · data · Complexity · 4/5 · License · Setup · hard

TLDR

Amundsen is an open-source data catalog that works like Google search for your company's internal datasets. Employees search by keyword and see who owns a table, how often it is used, and what its columns mean, all in one place.

Mindmap

mindmap
  root((Amundsen))
    What it does
      Data catalog search
      Usage-based ranking
      Metadata management
    Data Sources
      Redshift and BigQuery
      Snowflake and Hive
      Dashboards and ML features
    Components
      Web search interface
      Elasticsearch backend
      Graph database metadata
      Ingestion pipeline
    Audience
      Data analysts
      Data engineers
      Platform teams

Things people build with this

USE CASE 1

Let analysts search for tables by keyword and discover datasets ranked by how frequently colleagues query them

USE CASE 2

Display column descriptions, ownership, and usage stats for every table in your data warehouse automatically

USE CASE 3

Ingest metadata from Redshift, BigQuery, Snowflake, or Hive into one unified catalog using the ingestion pipeline

USE CASE 4

Help data engineers trace relationships between tables, dashboards, and machine learning features across the org
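Tracing relationships between tables, dashboards, and machine learning features, as in use case 4, amounts to walking a graph. The sketch below is only an illustration of that idea in plain Python: the resource names and the `DOWNSTREAM` mapping are hypothetical, and it does not use Amundsen's own API or its graph database.

```python
from collections import deque

# Hypothetical edges: resource -> resources that consume it.
# Every name here is made up for illustration.
DOWNSTREAM = {
    "table:orders": ["table:daily_revenue", "feature:avg_order_value"],
    "table:daily_revenue": ["dashboard:finance_overview"],
    "feature:avg_order_value": [],
    "dashboard:finance_overview": [],
}


def trace_downstream(start, edges):
    """Return every resource reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


affected = trace_downstream("table:orders", DOWNSTREAM)
```

Here `affected` lists the derived table, the feature, and the dashboard that would be touched by a change to the hypothetical `orders` table.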

Tech stack

Python · Elasticsearch · Neo4j

Getting it running

Difficulty · hard · Time to first run · 1 day+

Requires running multiple services concurrently: web UI, Elasticsearch, Neo4j graph DB, and the metadata service. Docker quickstart available but still non-trivial.
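Because several services have to be up at once, a small check that each one accepts connections can save time on a first run. The following is a minimal sketch, not part of Amundsen. The host and port values in `SERVICES` are assumptions about a local setup and may not match yours, so check your own Docker configuration.

```python
import socket

# Assumed local ports, for illustration only; verify against your own setup.
SERVICES = {
    "frontend": ("localhost", 5000),
    "search service": ("localhost", 5001),
    "metadata service": ("localhost", 5002),
    "elasticsearch": ("localhost", 9200),
    "neo4j": ("localhost", 7687),
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services, probe=is_listening):
    """Map each service name to True or False depending on the probe result."""
    return {name: probe(host, port) for name, (host, port) in services.items()}
```

Calling `check_services(SERVICES)` returns a dictionary such as `{"frontend": True, ...}`. An open port only shows that something is listening, not that the service behind it is healthy.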

Apache 2.0: use freely for any purpose, including commercial projects, with attribution.

In plain English

Amundsen is an open-source data catalog that helps people inside an organization find the data they need. The project describes itself as Google search for data: you type a table name or keyword, and it shows you relevant datasets ranked by how often others in your company have used them. It was built at Lyft to solve the common problem of analysts and engineers not knowing what data exists or which datasets are trustworthy.

The system is made up of several services that work together. One service handles the web interface, where users search and browse. Another runs the search engine, backed by Elasticsearch. A third stores the metadata about datasets, using a graph database to track relationships between tables, columns, owners, and consumers. A fourth is a data ingestion tool that reads from your existing databases and data warehouses and populates the catalog.

Amundsen supports a long list of data sources: Amazon Redshift, BigQuery, Snowflake, PostgreSQL, MySQL, Apache Hive, and many others. It can also pull in metadata about dashboards and machine learning features, not just database tables. For each table, it can display column descriptions, usage statistics, a sample data preview, and who in the organization owns or frequently queries it.

The project is hosted by the Linux Foundation AI and Data organization, which means it has formal governance and a community of contributors beyond its original creators at Lyft. It is released under the Apache 2.0 license.

Setting up Amundsen requires running multiple services, so it is aimed at teams with some infrastructure experience rather than individual users. A quick-start guide using Docker is available in the documentation to try it locally with sample data before deploying it to production.
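The idea of ranking search results by usage, described above, can be shown with a toy in-memory catalog. This is a sketch under stated assumptions, not Amundsen's implementation: the real system delegates search to Elasticsearch, and the `TableEntry` class, the `query_count` field, and the sample tables are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    name: str
    description: str
    query_count: int  # hypothetical usage figure, e.g. queries in the last 30 days


def search(catalog, keyword):
    """Return tables whose name or description contains `keyword`, most-used first."""
    needle = keyword.lower()
    matches = [
        t for t in catalog
        if needle in t.name.lower() or needle in t.description.lower()
    ]
    return sorted(matches, key=lambda t: t.query_count, reverse=True)


# Made-up sample data for illustration.
CATALOG = [
    TableEntry("orders_raw", "Unprocessed order events", 12),
    TableEntry("orders", "Cleaned orders, one row per order", 840),
    TableEntry("customers", "Customer dimension table", 310),
    TableEntry("orders_backup_2019", "Old snapshot of orders", 1),
]

results = [t.name for t in search(CATALOG, "order")]
```

With this sample data, the heavily queried `orders` table comes first and the rarely used backup comes last. That matches the intent of usage-based ranking: pointing people at the dataset their colleagues actually rely on.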

Copy-paste prompts

Prompt 1
I want to set up Amundsen locally to catalog our Snowflake tables. Walk me through the Docker quickstart and how to configure the Snowflake metadata ingestion.
Prompt 2
How do I customize Amundsen search ranking so tables used by more than 10 analysts in the past month appear higher in results?
Prompt 3
I need to add Airflow DAG run lineage into Amundsen. How do I write a custom ingestion extractor to pull DAG metadata?
Prompt 4
Our team wants to bulk-load column descriptions from a CSV into Amundsen without using the UI. How do I do that via the metadata service API?

Verify against the repo before relying on details.