Apache Kafka
Definition
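Distributed event streaming platform for building real-time data pipelines and streaming applications. Producers append events to durable, replicated logs (topics), and consumers read them at their own pace, which enables high-throughput data processing and integration between systems.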
Use Cases
- LinkedIn: Activity stream and event pipeline for near real-time data movement across many internal systems. LinkedIn created and open-sourced Kafka to handle high-throughput event ingestion, durable storage, and fan-out to multiple consumers: producers publish events to topics, and downstream services consume them independently. Outcome: scalable, decoupled real-time data pipelines and reliable event distribution across many applications.
- Netflix: Real-time event streaming for monitoring, analytics, and operational telemetry across microservices. Kafka serves as an event backbone to which services publish operational events and metrics, and multiple consumer applications process the streams for alerting, dashboards, and analytics. Outcome: improved observability and faster detection of issues, because high-volume events are processed in near real time.
- Uber: Event-driven architecture for moving operational events (e.g., service events and logs) between systems in real time. Kafka serves as a central pub/sub log to which producers emit events, and multiple downstream consumers process them for analytics and operational workflows. Outcome: scalable real-time data distribution, with producers decoupled from many independent consumers.
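The pattern shared by these use cases (producers append to a topic, and each consumer reads independently and can replay) can be illustrated with a toy in-memory model. This is a minimal sketch, not the Kafka client API: the `TopicLog` class, its method names, and the group names are hypothetical, and it models a single partition with no persistence or replication.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a single-partition topic (not a Kafka client)."""

    def __init__(self):
        self._events = []                 # append-only list; list index == offset
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, event):
        """Append an event and return the offset it was stored at."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group, max_events=10):
        """Return the next unread events for `group` and advance its offset."""
        start = self._offsets[group]
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group to an arbitrary offset, which is what makes replay possible."""
        self._offsets[group] = offset


log = TopicLog()
for event in ("page_view", "click", "purchase"):
    log.publish(event)

alerting = log.poll("alerting")    # first group reads every event
analytics = log.poll("analytics")  # second group sees the same events independently
log.seek("analytics", 0)           # rewind one group without affecting the other
replayed = log.poll("analytics")
```

Reading does not remove anything from the log; each group only tracks its own position, which is why adding a new consumer does not require any change on the producer side.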
Provider Equivalents
- AWS: Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Azure: Azure Event Hubs (Kafka endpoint) / Azure HDInsight Kafka
- GCP: Cloud Pub/Sub (comparable messaging service, not Kafka protocol-compatible) / Confluent Cloud on Google Cloud
- OCI: OCI Streaming
Frequently Asked Questions
- What's the difference between Apache Kafka and RabbitMQ?
- Kafka is a distributed event streaming platform built for very high throughput and durable event storage (events are kept in a log for a configurable time and can be replayed). RabbitMQ is a traditional message broker focused on flexible routing and per-message delivery semantics, typically with messages removed once consumed. Use Kafka when you need event streams, replay, and many consumers reading the same data; use RabbitMQ when you need complex routing patterns and classic queue-based messaging.
- When should I use Apache Kafka?
- Use Kafka when you need to ingest and distribute large volumes of events in real time, decouple producers from multiple consumers, and optionally replay historical events. Common scenarios include clickstream and telemetry ingestion, microservice event buses, CDC (change data capture) pipelines, log aggregation, real-time analytics, and feeding data lakes/warehouses.
- How much does Apache Kafka cost?
- Kafka is open source, so the software license cost is $0, but you pay for infrastructure and operations: compute for brokers, storage for logs, network egress, and the engineering time to run and monitor the cluster. Older deployments also need a separate ZooKeeper ensemble for cluster metadata; newer Kafka versions use KRaft mode, which removes that dependency. Managed options (e.g., Amazon MSK or Confluent Cloud) charge based on broker capacity/throughput, storage, and data transfer, and can reduce operational overhead.
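Because log storage is one of the cost drivers named in the last answer, a back-of-the-envelope estimate can help when sizing a cluster. The sketch below assumes retained storage is roughly write rate times retention period times replication factor; it ignores compression, index files, and free-space headroom, and the workload numbers are hypothetical rather than taken from any real deployment or price list.

```python
def retained_storage_bytes(write_bytes_per_sec, retention_hours, replication_factor):
    """Rough cluster-wide log storage estimate, before compression and overhead."""
    retention_seconds = retention_hours * 3600
    return write_bytes_per_sec * retention_seconds * replication_factor


# Hypothetical workload: 5 MB/s ingest, 7 days of retention, replication factor 3.
total_bytes = retained_storage_bytes(
    write_bytes_per_sec=5 * 10**6,
    retention_hours=7 * 24,
    replication_factor=3,
)
total_tb = total_bytes / 10**12  # decimal terabytes
print(f"Estimated retained storage: {total_tb:.3f} TB")
```

Shortening retention or lowering the replication factor reduces the estimate proportionally, at the cost of a shorter replay window or less fault tolerance.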
Category: data
Difficulty: advanced