Data Catalog
Definition
A centralized repository that stores metadata and helps users discover, understand, and manage data assets across the organization.
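To make the definition concrete, here is a minimal in-memory sketch of the idea: metadata records describing data assets, plus a keyword search over them. The names `CatalogEntry` and `DataCatalog`, and the fields chosen, are illustrative assumptions and not the schema of any particular product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data asset (hypothetical, simplified schema)."""
    name: str
    asset_type: str  # e.g. "table", "file", "dashboard"
    owner: str
    description: str = ""
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """Toy catalog: register metadata entries, then search them by keyword."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Entries are keyed by name; re-registering a name overwrites it.
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        """Case-insensitive substring match on name, description, or tags."""
        needle = keyword.lower()
        hits = [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]
        return sorted(hits, key=lambda e: e.name)


catalog = DataCatalog()
catalog.register(CatalogEntry("sales.orders", "table", "commerce-team",
                              "One row per customer order", {"finance", "pii"}))
catalog.register(CatalogEntry("web.clickstream", "file", "growth-team",
                              "Raw page-view events", {"events"}))

# Discover assets related to orders and see who owns them.
print([(e.name, e.owner) for e in catalog.search("order")])
# prints [('sales.orders', 'commerce-team')]
```

A production catalog would typically add persistence, lineage, and access control on top of this; the sketch only covers the store-and-discover core.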
Use Cases
- Airbnb: Analysts and data scientists needed to discover trustworthy datasets and understand ownership, definitions, and usage across a large internal data ecosystem. Airbnb built and operates an internal data discovery portal (Dataportal) that indexes dataset metadata, including descriptions, owners, and links to documentation, and makes it searchable for self-service discovery. Outcome: better dataset discoverability and clearer ownership and definitions, which reduced the time spent finding the right data and supported more consistent analytics across teams.
- Netflix: Teams needed self-service discovery of data assets and metadata to support analytics and experimentation at scale. Netflix developed a metadata and discovery platform (Metacat) that manages table and schema metadata and makes it searchable for users and tools across the data platform. Outcome: streamlined access to metadata for many teams, supporting faster analysis and more reliable use of shared datasets.
- Uber: Employees needed a central place to find datasets, understand lineage, and identify owners to support analytics and machine learning workflows. Uber created an internal data discovery and governance platform (Databook) that aggregates metadata, ownership, and usage signals to help users find and evaluate data assets. Outcome: less friction in data discovery and more confidence in selecting the right datasets for analytics and ML use cases.
Provider Equivalents
- AWS: AWS Glue Data Catalog
- Azure: Microsoft Purview Data Catalog
- GCP: Dataplex Universal Catalog (formerly Data Catalog)
- OCI: OCI Data Catalog
Frequently Asked Questions
- What's the difference between a Data Catalog and a Data Dictionary?
- A data dictionary usually documents fields and definitions within a specific database or system (for example, what each column means). A data catalog is broader: it indexes many data assets across the organization (tables, files, dashboards, streams), adds searchable metadata (owners, tags, classifications), and often includes governance features like lineage, access policies, and data quality signals.
- When should I use a Data Catalog?
- Use a data catalog when you have many datasets across teams or platforms and people struggle to find the right data, understand what it means, or know who owns it. It’s especially useful for data lakes and analytics platforms where data is spread across object storage, warehouses, and BI tools, and you need consistent metadata, governance, and self-service discovery.
- How much does a Data Catalog cost?
- Costs depend on the provider and how much metadata you store and scan. Common pricing factors include: number of cataloged assets or metadata objects, frequency and scope of metadata scanning (crawlers/connectors), API requests, and any governance add-ons (classification, lineage, policy enforcement). Some platforms bundle catalog features into broader governance products, so total cost may also depend on users and enabled capabilities.
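The pricing factors listed in the cost answer can be combined into a rough back-of-the-envelope estimate. The sketch below is only an illustration of that arithmetic: `CatalogRates`, `estimate_monthly_cost`, and every unit price shown are hypothetical placeholders, not any provider's actual pricing, so substitute the rates from your provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogRates:
    """Placeholder unit prices. These are NOT real provider prices."""
    per_1k_objects_month: float   # storage of cataloged metadata objects
    per_scan_hour: float          # crawler / connector scanning time
    per_million_requests: float   # catalog API requests


def estimate_monthly_cost(rates: CatalogRates, stored_objects: int,
                          scan_hours: float, api_requests: int,
                          addons_flat: float = 0.0) -> dict[str, float]:
    """Break a monthly estimate down by the common pricing factors."""
    storage = stored_objects / 1_000 * rates.per_1k_objects_month
    scanning = scan_hours * rates.per_scan_hour
    requests = api_requests / 1_000_000 * rates.per_million_requests
    total = storage + scanning + requests + addons_flat
    return {
        "storage": storage,
        "scanning": scanning,
        "requests": requests,
        "addons": addons_flat,  # governance add-ons, modeled as a flat fee
        "total": total,
    }


# Hypothetical rates and usage, chosen only to make the arithmetic easy to follow.
rates = CatalogRates(per_1k_objects_month=0.25, per_scan_hour=0.5,
                     per_million_requests=1.0)
estimate = estimate_monthly_cost(rates, stored_objects=200_000,
                                 scan_hours=40, api_requests=5_000_000)
print(estimate["total"])  # prints 75.0
```

Modeling add-ons as a flat fee is a simplification; platforms that bundle the catalog into a broader governance product may instead price by user or by enabled capability.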
Category: big-data
Difficulty: intermediate
Related Terms