Let analysts search for tables by keyword and discover datasets ranked by how frequently colleagues query them
Display column descriptions, ownership, and usage stats for every table in your data warehouse automatically
Ingest metadata from Redshift, BigQuery, Snowflake, or Hive into one unified catalog using the ingestion pipeline
Help data engineers trace relationships between tables, dashboards, and machine learning features across the org
Requires running multiple services concurrently: the web UI, Elasticsearch, the Neo4j graph database, and the metadata service. A Docker quick-start is available, but setup is still non-trivial.
Amundsen is an open-source data catalog that helps people inside an organization find the data they need. The project describes itself as Google search for data: you type a table name or keyword, and it shows you relevant datasets ranked by how often others in your company have used them. It was built at Lyft to solve the common problem of analysts and engineers not knowing what data exists or which datasets are trustworthy.

The system is made up of several services that work together. One service handles the web interface, where users search and browse. Another runs the search engine, backed by Elasticsearch. A third stores the metadata about datasets, using a graph database to track relationships between tables, columns, owners, and consumers. A fourth is a data ingestion tool that reads from your existing databases and data warehouses and populates the catalog.

Amundsen supports a long list of data sources: Amazon Redshift, BigQuery, Snowflake, PostgreSQL, MySQL, Apache Hive, and many others. It can also pull in metadata about dashboards and machine learning features, not just database tables. For each table, it can display column descriptions, usage statistics, a sample data preview, and who in the organization owns or frequently queries it.

The project is hosted by the Linux Foundation AI and Data organization, which means it has formal governance and a community of contributors beyond its original creators at Lyft. It is released under the Apache 2.0 license.

Setting up Amundsen requires running multiple services, so it is aimed at teams with some infrastructure experience rather than individual users. A quick-start guide using Docker is available in the documentation to try it locally with sample data before deploying it to production.
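The search behaviour described above (match a keyword, then rank results by how often colleagues query each table) can be illustrated with a small toy sketch. This is not Amundsen's code or API: the `TableEntry` record, the `search` function, and the sample catalog are all hypothetical, and a real deployment delegates matching and ranking to Elasticsearch rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Hypothetical catalog record; not Amundsen's actual data model."""
    name: str
    description: str
    owner: str
    query_count: int  # assumed usage signal: how often the table is queried
    columns: dict[str, str] = field(default_factory=dict)  # column -> description


def search(catalog: list[TableEntry], keyword: str) -> list[TableEntry]:
    """Return entries whose name, description, or columns contain the keyword,
    ordered from most-queried to least-queried."""
    needle = keyword.lower()
    matches = [
        t for t in catalog
        if needle in t.name.lower()
        or needle in t.description.lower()
        or any(needle in col.lower() for col in t.columns)
    ]
    return sorted(matches, key=lambda t: t.query_count, reverse=True)


# Made-up sample data, purely for illustration.
catalog = [
    TableEntry("rides_daily", "Daily ride totals", "analytics", 120,
               {"ride_count": "Rides per day"}),
    TableEntry("rides_raw", "Raw ride events", "platform", 450),
    TableEntry("drivers", "Driver profiles", "ops", 300),
]

results = [t.name for t in search(catalog, "ride")]
print(results)  # ['rides_raw', 'rides_daily']
```

Sorting by a usage count is the simplest possible stand-in for popularity ranking; it shows why a heavily queried table surfaces above a rarely used one with an equally good keyword match.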
Verify against the repo before relying on details.