Data Lake

Definition

A centralized repository that stores raw data of any type (structured, semi-structured, or unstructured) at any scale, so it can be used for advanced analytics and machine learning applications.

Use Cases

Provider Equivalents

Frequently Asked Questions

What's the difference between a data lake and a data warehouse?
A data lake stores raw data in its original format, including structured, semi-structured, and unstructured data such as logs, images, videos, and JSON files. A data warehouse stores cleaned, structured data that has been prepared for reporting and business intelligence. Use a data lake when you need flexibility and want to keep all data for future analysis. Use a data warehouse when you need fast, consistent reporting on well-defined business metrics.
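One way to picture the difference is that a data lake keeps records exactly as they arrived and applies structure only when they are read, often called schema-on-read. The sketch below is a minimal illustration under stated assumptions: the event records, the `read_with_schema` helper, and `PURCHASE_SCHEMA` are all hypothetical names invented for this example, and a real lake would read from object storage rather than an in-memory list.

```python
import json

# Hypothetical raw events, stored as-is in their original JSON form.
# Fields vary from record to record, which a data lake tolerates.
RAW_EVENTS = [
    '{"user": "a1", "action": "click", "ts": "2024-05-01T10:00:00Z"}',
    '{"user": "b2", "action": "view", "ts": "2024-05-01T10:00:05Z", "page": "/home"}',
    '{"user": "a1", "action": "purchase", "amount": "19.99", "ts": "2024-05-01T10:01:00Z"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields, cast them, default missing ones to None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for name, cast in schema.items():
            value = record.get(name)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Hypothetical schema chosen by the analyst at query time, not at ingestion time.
PURCHASE_SCHEMA = {"user": str, "action": str, "amount": float}

rows = read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA)
purchases = [r for r in rows if r["action"] == "purchase"]
print(purchases)
```

A data warehouse would instead enforce a schema like this before loading, so records that do not fit are cleaned or rejected up front rather than interpreted later.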
When should I use a data lake?
Use a data lake when you need to store large amounts of different types of data cheaply and analyze them later. It is a good fit for machine learning, log analytics, IoT data, clickstream data, and long-term archival of raw business data. It is especially useful when you do not yet know all the questions you will ask of the data in the future.
How much does a data lake cost?
The main cost factors are storage volume, data transfer, data ingestion, metadata management, security features, and the compute used to process or query the data. Raw object storage is usually low cost compared with database storage, but costs can grow if you scan large datasets often, move data between regions, or run frequent transformation jobs. Lifecycle policies that move or expire old data, partitioning that lets queries skip irrelevant files, and columnar file formats such as Parquet that let queries read only the columns they need can all reduce cost.
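The effect of partitioning on scan volume can be sketched with a small calculation. This is an illustration only: the object keys, the sizes, and the `dt=YYYY-MM-DD` directory convention (a common Hive-style layout) are assumptions made for the example, and `partition_date` and `bytes_scanned` are hypothetical helpers, not part of any provider's API.

```python
from datetime import date

# Hypothetical object keys and sizes in bytes, laid out one partition per day.
OBJECTS = {
    "events/dt=2024-05-01/part-0.parquet": 400_000_000,
    "events/dt=2024-05-02/part-0.parquet": 350_000_000,
    "events/dt=2024-05-03/part-0.parquet": 500_000_000,
}


def partition_date(key):
    """Extract the date from the 'dt=' segment of an object key."""
    for segment in key.split("/"):
        if segment.startswith("dt="):
            return date.fromisoformat(segment[3:])
    raise ValueError(f"no dt= partition in key: {key}")


def bytes_scanned(objects, start, end):
    """Sum the sizes of objects whose partition date falls within [start, end]."""
    return sum(
        size
        for key, size in objects.items()
        if start <= partition_date(key) <= end
    )


# Without partition pruning, a query reads every object.
full_scan = sum(OBJECTS.values())

# With pruning, a query for a single day reads only that day's partition.
pruned_scan = bytes_scanned(OBJECTS, date(2024, 5, 2), date(2024, 5, 2))

print(full_scan, pruned_scan)
```

Because many query engines bill or consume compute in proportion to the data read, a query that touches one day's partition instead of the whole dataset costs correspondingly less.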

Category: data

Difficulty: advanced

Related Terms

See Also