
Web Crawler System


A web crawler systematically downloads and indexes web pages, forming the backbone of search engines. This GCP-native design handles URL frontier management (prioritizing which URLs to crawl next via Pub/Sub topics), politeness policies (respecting robots.txt and per-domain rate limits), content deduplication (detecting near-duplicate pages via SimHash fingerprints in Memorystore), and distributed coordination across hundreds of Cloud Run crawler instances. It suits search engine teams building crawlers that must scale horizontally while staying polite to individual hosts.

Data Flow

URL Frontier (Pub/Sub) → Crawler Workers (Cloud Run), which consult the URL Dedup Filter (Memorystore) → Crawled Content (Cloud Storage) + Crawl Metadata (Firestore) → Content Parser → URL Discovery Stream (Dataflow) → back to the URL Frontier


Service Breakdown (7 services)

URL Frontier (Pub/Sub)
  • Maintains a priority queue of URLs to crawl
  • Enforces politeness delays per domain
  • Deduplicates URLs to avoid redundant fetches
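The frontier's three responsibilities compose naturally. Below is a minimal in-memory sketch (the class name and delay value are illustrative; the real frontier is backed by Pub/Sub, not a local heap): a priority heap handles ordering, a per-domain cooldown enforces politeness, and a seen-set gives URL-level dedup.

```python
import heapq
import time
from urllib.parse import urlparse

class URLFrontier:
    """In-memory sketch of a priority frontier with per-domain politeness.
    In the full design this state lives in Pub/Sub and Memorystore."""

    def __init__(self, politeness_delay=1.0):
        self._heap = []       # (priority, seq, url); lower priority sorts first
        self._seen = set()    # simple URL-level dedup
        self._next_fetch = {} # domain -> earliest allowed fetch time
        self._delay = politeness_delay
        self._seq = 0

    def add(self, url, priority=10):
        if url in self._seen:
            return False      # skip redundant fetches
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1
        return True

    def pop(self, now=None):
        """Return the next crawlable URL, or None if every eligible
        domain is still inside its politeness window."""
        now = time.monotonic() if now is None else now
        deferred, url = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            domain = urlparse(item[2]).netloc
            if self._next_fetch.get(domain, 0.0) <= now:
                self._next_fetch[domain] = now + self._delay
                url = item[2]
                break
            deferred.append(item)  # domain cooling down; retry later
        for item in deferred:
            heapq.heappush(self._heap, item)
        return url
```

Popping a second URL from the same domain before the delay elapses yields `None`, which is exactly the behavior the Pub/Sub-backed frontier must reproduce with ack deadlines.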
Crawler Workers (Cloud Run)
  • Fetches and parses web pages in parallel
  • Extracts links, text, and metadata from HTML
  • Respects robots.txt and rate-limits per host
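Python's standard library already covers robots.txt parsing, so a worker's compliance check can be small. A hedged sketch, assuming the worker has already fetched the robots.txt body (the helper name and sample file are illustrative):

```python
from urllib.robotparser import RobotFileParser

def check_robots(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, crawl_delay) for a URL under a fetched robots.txt.
    crawl_delay is None when the file has no Crawl-delay directive; when
    present, it feeds the per-host rate limiter."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)

# Illustrative robots.txt body, as a worker might have fetched it.
robots = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
```

With this file, a URL under `/private/` is refused and the parser reports a two-second crawl delay for the host.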
URL Dedup Filter (Memorystore)
  • Caches data in-memory with sub-millisecond latency
  • Supports Redis protocol for broad compatibility
  • Scales vertically without downtime
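Per the scaling strategy below, the filter itself is a Bloom filter held in Memorystore. A minimal sketch of the mechanics; a local `bytearray` stands in for the Redis bitmap (`SETBIT`/`GETBIT`) that Memorystore would actually hold, and the sizes are illustrative rather than tuned for a target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Bloom-filter sketch for URL dedup. A Python bytearray stands in
    for the Redis bitmap a real deployment keeps in Memorystore."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from salted SHA-256 digests of the URL.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # False means definitely unseen; True means seen (or a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

The one-sided error is what makes this safe for crawling: a false positive only skips a fetch, never corrupts the store.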
Crawled Content (Cloud Storage)
  • Stores objects with configurable redundancy classes
  • Supports lifecycle rules for automatic archival
  • Integrates with analytics services for direct querying
Crawl Metadata (Firestore)
  • Stores page metadata, checksums, and crawl timestamps
  • Tracks URL freshness for recrawl scheduling
  • Supports deduplication via content fingerprints
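The content fingerprints are SimHash values, as noted in the overview. A simplified sketch over whitespace tokens (production crawlers typically hash weighted shingles, but the mechanics are the same): near-duplicate pages produce fingerprints within a small Hamming distance.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: each token votes +1/-1 on every bit of the result."""
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Deduplication then reduces to comparing the stored fingerprint against new pages with a small threshold (commonly around 3 bits for 64-bit SimHash).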
Content Parser
  • Runs event-driven code without servers
  • Scales instantly from zero to peak load
  • Cost-effective for sporadic workloads
URL Discovery Stream (Dataflow)
  • Runs Apache Beam pipelines for batch and stream
  • Auto-scales workers for pipeline throughput
  • Integrates with BigQuery and Pub/Sub
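Before the discovery pipeline republishes links to the frontier topic, URLs are typically canonicalized so that trivially different spellings collapse to one frontier entry. A hypothetical normalization step (the exact rule set is an assumption, not part of this design):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url):
    """Normalize a discovered URL: lowercase the host, drop default ports,
    sort query parameters, strip the fragment, default the path to '/'."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    if scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[:-4]
    params = sorted(parse_qsl(query))  # stable query-parameter order
    return urlunsplit((scheme, netloc, path or "/", urlencode(params), ""))
```

Applied inside the Beam pipeline, this keeps the dedup filter from treating `HTTP://Example.COM:80/a?b=2&a=1#frag` and `http://example.com/a?a=1&b=2` as different pages.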

Scaling Strategy

Crawler workers run on Cloud Run and autoscale on Pub/Sub subscription backlog while respecting per-domain politeness constraints. Pub/Sub manages the URL frontier, with ordered delivery and per-priority topics for priority-based crawling. Memorystore holds the Bloom filter for URL deduplication and the per-domain rate-limit counters. Crawled content goes to Cloud Storage, with Firestore tracking crawl metadata. Dataflow feeds discovered URLs back into the frontier.
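The backlog-driven scaling rule can be made concrete with a back-of-envelope calculation; all figures here (per-worker throughput, drain window, instance caps) are illustrative assumptions, not measured values:

```python
import math

def target_workers(backlog, pages_per_sec_per_worker=50,
                   drain_seconds=60, min_workers=1, max_workers=300):
    """Pick a worker count that drains the Pub/Sub backlog within the
    target window, clamped to the deployment's instance limits."""
    capacity_per_worker = pages_per_sec_per_worker * drain_seconds
    needed = math.ceil(backlog / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))
```

Under these assumptions a backlog of 300,000 URLs calls for 100 workers, while an empty queue idles down to the floor of one instance; per-domain politeness still caps effective throughput regardless of worker count.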
