Big Data
Definition
Extremely large datasets that require specialized tools to store, process, and analyze. The challenge is comparable to organizing all the books in every library in the world.
Use Cases
- Netflix: Personalized recommendations and content decisions using large-scale viewing and interaction data. Collects and processes large volumes of user playback events and service telemetry, and uses distributed data processing and analytics to build recommendation and experimentation pipelines. Result: improved personalization and user engagement, plus data-driven decisions on content and product features.
- Uber: Real-time marketplace optimization (matching riders and drivers) and operational analytics. Ingests high-volume event streams (trip requests, location updates, pricing signals) and processes them with distributed streaming and batch analytics to power forecasting and marketplace models. Result: faster matching, better ETAs, and improved marketplace efficiency at high scale.
- CERN: Scientific analysis of particle physics experiment data from the Large Hadron Collider. Uses globally distributed computing and storage to process and analyze extremely large experimental datasets with parallel processing frameworks. Result: large-scale scientific discovery, because massive datasets become searchable and analyzable by researchers worldwide.
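The use cases above share one processing pattern: split the data into partitions, process each partition independently, then merge the partial results. The following is a minimal sketch of that idea in plain Python, not how any of these companies implement it. The event data, the function names (`partition`, `count_views`, `merge`, `views_per_title`), and the use of threads in place of a real cluster are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical playback events as (user_id, title) pairs. A real system would
# read these from distributed storage or an event stream, not a Python list.
EVENTS = [
    ("u1", "Title A"), ("u2", "Title B"), ("u1", "Title A"),
    ("u3", "Title C"), ("u2", "Title A"), ("u3", "Title B"),
]


def partition(events, n):
    """Split events into n partitions by taking every n-th element."""
    return [events[i::n] for i in range(n)]


def count_views(part):
    """Map step: count views per title within a single partition."""
    return Counter(title for _, title in part)


def merge(partials):
    """Reduce step: combine per-partition counts into one global count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def views_per_title(events, workers=3):
    """Count views per title by processing partitions in parallel."""
    parts = partition(events, workers)
    # Threads stand in for the worker nodes of a distributed framework.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_views, parts))
    return merge(partials)


if __name__ == "__main__":
    print(views_per_title(EVENTS).most_common())
```

Because each partition is counted independently, the map step can scale out to more workers, and only the small per-partition summaries need to be combined at the end.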
Frequently Asked Questions
- What's the difference between Big Data and a data warehouse?
- Big Data describes datasets that are too large, fast, or varied for traditional tools. A data warehouse is a structured system optimized for analytics (usually curated, cleaned, and modeled data). Big Data systems often start with raw or semi-structured data (logs, events, images) and may feed a warehouse after processing.
- When should I use Big Data tools instead of a traditional database?
- Use Big Data tools when you have very large volumes (terabytes to petabytes), high-velocity data (streams of events), or diverse formats (JSON logs, clickstreams, sensor data) and you need scalable batch or streaming processing. If your workload is mostly transactional (orders, accounts) or moderate-size analytics, a relational database or standard analytics stack is often simpler and cheaper.
- How much does Big Data cost?
- Costs depend on storage volume, data retention, compute time for processing, data transfer/egress, and managed service pricing. The major drivers are: (1) how often you process data (daily vs. real-time), (2) how much data you keep and for how long, (3) whether you use managed services or self-managed clusters, and (4) query patterns (frequent ad-hoc queries can increase compute costs). Cost control typically involves lifecycle policies, partitioning, compression, right-sizing compute, and using spot/preemptible capacity where appropriate.
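The cost drivers in the last answer can be combined into a rough back-of-the-envelope estimate. The sketch below assumes a simple linear model (storage plus compute plus egress). The function names and every price in `PLACEHOLDER_PRICES` are hypothetical values chosen for illustration, not quotes from any provider.

```python
# Placeholder unit prices (assumptions, not real provider pricing).
PLACEHOLDER_PRICES = {
    "storage_per_tb": 20.0,    # cost per TB stored per month
    "compute_per_hour": 2.0,   # cost per cluster-hour of processing
    "egress_per_tb": 90.0,     # cost per TB transferred out
}


def stored_tb(daily_ingest_tb, retention_days, compression_ratio=1.0):
    """Steady-state data kept: daily ingest x retention x compression.

    compression_ratio is compressed size / raw size (0.5 halves storage).
    """
    return daily_ingest_tb * retention_days * compression_ratio


def estimate_monthly_cost(stored, compute_hours, egress_tb=0.0,
                          prices=PLACEHOLDER_PRICES):
    """Return a per-driver cost breakdown and the total for one month."""
    breakdown = {
        "storage": stored * prices["storage_per_tb"],
        "compute": compute_hours * prices["compute_per_hour"],
        "egress": egress_tb * prices["egress_per_tb"],
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    # Compare two retention policies for the same hypothetical workload.
    for days in (90, 30):
        kept = stored_tb(daily_ingest_tb=1.0, retention_days=days,
                         compression_ratio=0.5)
        print(days, "days:", estimate_monthly_cost(kept, compute_hours=100,
                                                   egress_tb=1.0))
```

Varying one input at a time (retention days, compression ratio, compute hours) shows which driver dominates a given workload, which in turn indicates whether lifecycle policies, compression, or right-sizing compute is likely to save the most.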
Category: data
Difficulty: advanced
Related Terms
See Also