Partitioning
Definition
Dividing a database table or dataset into smaller, more manageable pieces based on a partition key, such as a date range or geographic region, so that queries and maintenance operations can target only the relevant pieces.
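As a rough illustration of the definition, the sketch below groups in-memory rows into monthly partitions. The row fields (`id`, `event_date`) and the function name `partition_by_month` are hypothetical and chosen only for this example; a real database performs the equivalent routing internally.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(rows):
    """Group rows into partitions keyed by (year, month) of their event date."""
    partitions = defaultdict(list)
    for row in rows:
        d = row["event_date"]
        partitions[(d.year, d.month)].append(row)
    return dict(partitions)


# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"id": 1, "event_date": date(2024, 1, 15)},
    {"id": 2, "event_date": date(2024, 1, 31)},
    {"id": 3, "event_date": date(2024, 2, 1)},
]
partitions = partition_by_month(rows)

# A query for February only has to read one partition, not every row.
february = partitions[(2024, 2)]
```

The table still behaves as one logical dataset; the partition key only decides which piece each row lands in.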
Use Cases
- Netflix: Analyzes large volumes of streaming and application logs for troubleshooting and analytics. Log data is stored in Amazon S3 using partitioned folder layouts (commonly by time, such as day/hour) and queried with engines that can prune partitions (e.g., Presto/Trino-based analytics and AWS query services). Result: faster queries that scan only the relevant time partitions, lower compute cost from less data scanned, and quicker operational troubleshooting.
- The New York Times: Runs analytics on large datasets (e.g., event and content interaction data) while controlling query time and cost. Uses Google BigQuery partitioned tables (typically by ingestion time or event date) so analysts can filter by date range and avoid scanning older partitions. Result: more predictable query performance and lower cost, because partition pruning reduces the data processed by common time-bounded queries.
- Uber: Queries large-scale trip and event datasets for reporting and experimentation, with strong performance on time-based filters. Data is organized in a data lake with partitions (commonly by date and sometimes by region/city) and read by SQL engines that support partition pruning, which limits scans to the relevant partitions. Result: lower query latency and infrastructure spend from fewer unnecessary reads of historical data, and faster iteration for analytics and experimentation.
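The partition pruning that these use cases rely on can be sketched as follows. This is a minimal illustration, assuming Hive-style `dt=YYYY-MM-DD/hour=HH` folder names; the object keys and the helper names `partition_date` and `prune` are hypothetical and are not taken from any of the companies or engines named above.

```python
from datetime import date


def partition_date(key):
    """Extract the dt=YYYY-MM-DD partition value from an object key, or None."""
    for segment in key.split("/"):
        if segment.startswith("dt="):
            return date.fromisoformat(segment[len("dt="):])
    return None


def prune(keys, start, end):
    """Keep only keys whose dt partition lies in the inclusive range [start, end]."""
    selected = []
    for key in keys:
        d = partition_date(key)
        if d is not None and start <= d <= end:
            selected.append(key)
    return selected


# Hypothetical object keys laid out by day and hour.
keys = [
    "logs/dt=2024-03-01/hour=00/part-0000.parquet",
    "logs/dt=2024-03-01/hour=01/part-0000.parquet",
    "logs/dt=2024-03-02/hour=00/part-0000.parquet",
    "logs/dt=2024-03-03/hour=00/part-0000.parquet",
]

# A query filtered to March 2-3 never opens the March 1 files.
to_scan = prune(keys, date(2024, 3, 2), date(2024, 3, 3))
```

The key point is that the decision is made from the path alone, before any file contents are read, which is why the filter column has to match the partition key for pruning to help.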
Provider Equivalents
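- AWS: Amazon Redshift (no native table partitioning for local tables; sort keys and distribution styles serve a similar data-skipping role, and Redshift Spectrum supports partitioned external tables), Amazon Athena (partitioned tables), Amazon RDS for PostgreSQL (native table partitioning)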
- Azure: Azure Synapse Analytics (table partitioning in dedicated SQL pools), Azure SQL Database (partitioned tables), Azure Data Lake Storage with Synapse serverless SQL (partitioned folder layouts)
- GCP: BigQuery (partitioned tables), Cloud SQL for PostgreSQL (native table partitioning)
- OCI: Oracle Autonomous Data Warehouse (table partitioning), Oracle Database on OCI (native table partitioning)
Frequently Asked Questions
- What's the difference between partitioning and sharding?
- Partitioning splits a table into parts that typically stay within the same database system and are managed as one logical table (e.g., partitions by month). Sharding splits data across multiple database instances/servers, which adds operational complexity but can scale write throughput and storage beyond a single system.
- When should I use partitioning?
- Use partitioning when your queries commonly filter on a predictable key (most often time, like date ranges) and the table is large enough that scanning everything is slow or expensive. It’s especially useful for logs, events, orders, IoT telemetry, and any dataset where recent data is queried far more than old data.
- How much does partitioning cost?
- Partitioning itself usually has no direct line-item cost, but it affects cost through performance and storage. In data warehouses and query engines that charge by data scanned (e.g., BigQuery on-demand, Athena), good partitioning can lower cost by reducing bytes scanned. In databases, partitioning can reduce CPU/IO for queries but may add overhead for writes, maintenance, and index management depending on the engine and partition strategy.
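The effect on scan-based pricing described in the cost answer above can be worked through with simple arithmetic. All numbers below are assumptions made for illustration: the partition size, the number of days, and especially `price_per_tib`, which is a placeholder and not a quoted rate for any service.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def scan_cost(bytes_scanned, price_per_tib):
    """Cost of a query billed purely by the number of bytes it scans."""
    return bytes_scanned / TIB * price_per_tib


# Assumed workload: one 10 GiB partition per day, one year of history.
partition_bytes = 10 * 1024 ** 3
days_total = 365
days_queried = 7

# Placeholder price per TiB scanned; check the provider's current pricing.
price_per_tib = 5.0

full = scan_cost(partition_bytes * days_total, price_per_tib)      # no pruning
pruned = scan_cost(partition_bytes * days_queried, price_per_tib)  # last 7 days only
savings_ratio = pruned / full
```

Under these assumptions the pruned query reads 7 of 365 partitions, so it costs roughly 2% of the full scan. The ratio depends only on how many partitions the filter eliminates, not on the price.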
Category: data
Difficulty: advanced
Related Terms
See Also