Data Lake
Definition
A centralized repository that stores structured, semi-structured, and unstructured data in its raw form at any scale, so that it can later be used for analytics and machine learning.
Use Cases
- Netflix: Centralizing large volumes of operational, application, and viewing-related data for analytics and machine learning. Netflix has publicly described using Amazon S3 as a core part of its data platform, with the stored data processed by cloud analytics tools and data pipelines. (This approach supports large-scale analytics, experimentation, and data-driven product decisions across the company.)
- Shell: Analyzing industrial and operational data from energy assets. Shell has publicly shared Azure-based data platform work that brings together large volumes of operational data for analytics and AI use cases. (The company improved access to data for engineering and business teams and enabled faster analysis of operational information.)
- The Home Depot: Combining customer, store, and digital commerce data for analytics. The Home Depot has publicly discussed using Google Cloud data services to centralize and analyze enterprise data at scale. (This helped teams gain faster insights and improve decision-making across retail operations and customer experience.)
Provider Equivalents
- AWS: Amazon Simple Storage Service (Amazon S3) with AWS Lake Formation and AWS Glue
- Azure: Azure Data Lake Storage Gen2
- GCP: Google Cloud Storage with BigLake and Dataplex
- OCI: Oracle Cloud Infrastructure Object Storage with OCI Data Catalog and OCI Data Flow
Frequently Asked Questions
- What's the difference between a data lake and a data warehouse?
- A data lake stores raw data in its original format, including structured, semi-structured, and unstructured data such as logs, images, videos, and JSON files. A data warehouse stores cleaned, structured data that has been prepared for reporting and business intelligence. Use a data lake when you need flexibility and want to keep all data for future analysis. Use a data warehouse when you need fast, consistent reporting on well-defined business metrics.
- When should I use a data lake?
- Use a data lake when you need to store large amounts of different types of data cheaply and analyze them later. It is a good fit for machine learning, log analytics, IoT data, clickstream data, and long-term archival of raw business data. It is especially useful when you do not yet know all the questions you will ask of the data in the future.
- How much does a data lake cost?
- The main cost factors are storage volume, data transfer, data ingestion, metadata management, security features, and the compute used to process or query the data. Raw object storage is usually cheaper per gigabyte than database storage, but costs can grow if you scan large datasets often, move data between regions, or run frequent transformation jobs. Lifecycle policies that move cold data to cheaper storage tiers, partitioning that limits how much data each query scans, and columnar file formats such as Parquet can all reduce cost.
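Example: how partitioning limits scanned data
The cost answer above mentions partitioning. The following is a minimal Python sketch of the idea, not any provider's API. It assumes objects are laid out with Hive-style key=value folders (a common convention), and the DataFile type, the helper functions, the paths, and the file sizes are all hypothetical. Real query engines perform this pruning internally; the sketch only shows why a filter on the partition key reduces the bytes read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical record for one object in the lake."""
    path: str        # object key using Hive-style key=value folders
    size_bytes: int


def partition_values(path: str) -> dict[str, str]:
    """Extract key=value partition segments from an object key."""
    return dict(
        segment.split("=", 1)
        for segment in path.split("/")
        if "=" in segment
    )


def bytes_scanned(files: list[DataFile], filters: dict[str, str]) -> int:
    """Sum the sizes of files whose partition values match every filter."""
    total = 0
    for f in files:
        parts = partition_values(f.path)
        if all(parts.get(key) == value for key, value in filters.items()):
            total += f.size_bytes
    return total


# Hypothetical listing: one 1 GiB Parquet file per day for three days.
GIB = 1024 ** 3
files = [
    DataFile("events/dt=2024-01-01/part-0.parquet", GIB),
    DataFile("events/dt=2024-01-02/part-0.parquet", GIB),
    DataFile("events/dt=2024-01-03/part-0.parquet", GIB),
]

full_scan = bytes_scanned(files, {})                  # no filter: every file is read
one_day = bytes_scanned(files, {"dt": "2024-01-02"})  # filter: one partition is read

print(f"full scan: {full_scan / GIB:.0f} GiB, one day: {one_day / GIB:.0f} GiB")
```

In this sketch the filtered query reads one third of the data. With services that bill by bytes scanned, that reduction carries over to query cost, which is why the partition key is usually chosen to match the most common filter, often a date.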
Category: data
Difficulty: advanced
Related Terms
See Also