Data Lakehouse
Definition
A modern data architecture that combines the flexibility of data lakes with the structured querying and reliability of data warehouses.
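One way to picture the "reliability" half of this definition is a table that rejects malformed writes and commits each batch all-or-nothing. The sketch below is a toy, in-memory illustration only: `ToyTable`, its schema dictionary, and the version counter are hypothetical names invented for this example, and real lakehouse table formats implement these guarantees very differently (for example, with transaction logs over files in object storage).

```python
class ToyTable:
    """In-memory stand-in for a lakehouse table: a fixed schema plus committed rows."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []              # committed rows only
        self.version = 0            # incremented once per successful commit

    def append(self, batch):
        """Validate the whole batch first, then commit it in a single step."""
        for row in batch:
            # Schema enforcement: the row must have exactly the declared columns.
            if set(row) != set(self.schema):
                raise ValueError(
                    f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
                )
            # Schema enforcement: each value must have the declared type.
            for column, expected_type in self.schema.items():
                if not isinstance(row[column], expected_type):
                    raise TypeError(
                        f"column {column!r} expects {expected_type.__name__}, "
                        f"got {type(row[column]).__name__}"
                    )
        # Only reached if every row passed, so a bad batch leaves no partial data.
        self.rows.extend(batch)
        self.version += 1


table = ToyTable({"order_id": int, "amount": float})
table.append([{"order_id": 1, "amount": 9.99}])

try:
    # The second row has a string order_id, so the whole batch is rejected.
    table.append([
        {"order_id": 2, "amount": 5.0},
        {"order_id": "3", "amount": 1.0},
    ])
except TypeError as error:
    print("rejected:", error)

print(len(table.rows), table.version)  # 1 1
```

Validating before committing is the design point: readers of the table never see a half-written batch, which is the property that makes BI queries on lake data dependable.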
Use Cases
- Shell: Enterprise analytics and data science across large volumes of operational and business data. Adopted the Databricks Lakehouse to unify data engineering and analytics on a common platform, using open lakehouse storage and scalable compute for ETL and analytics workloads. Outcome: broader self-service analytics and faster time-to-insight, achieved by reducing data silos and consolidating pipelines and analytics on a shared architecture.
- H&M Group: Retail analytics and personalization using large-scale data (e.g., transactions and digital signals). Used the Databricks Lakehouse to process and analyze data at scale, supporting data engineering and machine learning workflows on a unified data platform. Outcome: analytics and ML became easier to operationalize because data was centralized and the path from raw data to analytics-ready datasets was simplified.
- Comcast: Large-scale analytics across diverse data sources for business and operational reporting. Implemented a lakehouse-style platform with Databricks to consolidate data processing and analytics on scalable compute and a shared data layer. Outcome: less fragmentation between data engineering and analytics teams, and faster delivery of analytics use cases through standardization on a common platform.
Provider Equivalents
- Azure: Microsoft Fabric (OneLake + Lakehouse) / Azure Databricks (Delta Lake on ADLS)
Frequently Asked Questions
- What’s the difference between a data lakehouse and a data warehouse?
  - A data warehouse stores curated, structured data optimized for SQL analytics and strong governance. A data lakehouse keeps data in low-cost object storage like a data lake (including raw and semi-structured data) but adds warehouse-like features (ACID transactions, schema enforcement, and performance optimizations) so you can run reliable SQL analytics and BI directly on the lake data.
- When should I use a data lakehouse?
  - Use a lakehouse when you need one platform for both BI/SQL analytics and data science/ML, especially if you have large volumes of semi-structured data (logs, events, IoT) and want to avoid copying data between a lake and a warehouse. It’s also a good fit when you want open table formats (e.g., Delta Lake or Apache Iceberg) and centralized governance over many data types.
- How much does a data lakehouse cost?
  - Cost depends on (1) storage (object storage for raw and curated data), (2) compute for ingestion, transformation, and queries (often billed per vCPU-hour, per DBU, or in capacity units), (3) concurrency and workload patterns (interactive BI vs. batch ETL), and (4) data governance and networking. Lakehouse costs are typically kept down by separating storage and compute, using autoscaling, choosing efficient file and table layouts, and avoiding unnecessary data copies.
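
The cost factors in the last answer can be combined into a rough monthly estimate. The following is a minimal sketch, not a pricing tool: `MonthlyUsage`, `UnitPrices`, `estimate_monthly_cost`, and every number in it are hypothetical placeholders rather than any provider's real rates, and governance tooling costs are left out for simplicity. Batch ETL and interactive query compute are priced separately to reflect the different workload patterns.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    """Hypothetical monthly usage figures for one lakehouse deployment."""
    storage_tb: float           # raw + curated data held in object storage
    etl_compute_hours: float    # batch ingestion/transformation compute hours
    query_compute_hours: float  # interactive BI/SQL compute hours
    egress_tb: float            # data transferred out (networking)


@dataclass
class UnitPrices:
    """Placeholder unit prices in USD; substitute your provider's actual rates."""
    storage_per_tb_month: float
    etl_per_compute_hour: float
    query_per_compute_hour: float
    egress_per_tb: float


def estimate_monthly_cost(usage, prices):
    """Return a per-component cost breakdown plus the total, in USD."""
    breakdown = {
        "storage": usage.storage_tb * prices.storage_per_tb_month,
        "etl_compute": usage.etl_compute_hours * prices.etl_per_compute_hour,
        "query_compute": usage.query_compute_hours * prices.query_per_compute_hour,
        "networking": usage.egress_tb * prices.egress_per_tb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Made-up inputs chosen only to keep the arithmetic easy to follow.
usage = MonthlyUsage(storage_tb=50, etl_compute_hours=400,
                     query_compute_hours=300, egress_tb=2)
prices = UnitPrices(storage_per_tb_month=20.0, etl_per_compute_hour=2.0,
                    query_per_compute_hour=3.0, egress_per_tb=90.0)

example = estimate_monthly_cost(usage, prices)
print(example)
# storage 1000.0 + etl_compute 800.0 + query_compute 900.0 + networking 180.0
# gives a total of 2880.0
```

Because storage and compute are separate line items, the breakdown makes it easy to see which lever (autoscaling, file layout, fewer copies) would affect which component.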
Category: data
Difficulty: advanced