Lyft System Design: Distributed Wiki Archiving Bots

Topics: Web Crawling, Distributed File Systems, Job Scheduling
Roles: Software Engineer, Backend Engineer, Site Reliability Engineer
Experience: Mid Level, Senior, Staff

Question Description

You must design a backend system that builds and maintains a complete, up-to-date archive of Wikipedia using one primary server and 1,000 small bots. The goals are an initial full copy of every page (hundreds of millions) and continuous updates that reflect edits.

Core requirements: each bot fetches pages over HTTP, parses content and assets (text, images), extracts new URLs, and reports the parsed content and metadata back to the primary server. The system must prevent overlapping work during the initial crawl, support periodic re-checks for updates, and store both content and indexing metadata for later retrieval.
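A good answer makes the re-check step concrete. A minimal sketch of change detection on the bot side, assuming the primary stores per-URL metadata (the `meta` dict shape, `user_agent` string, and function names here are illustrative): conditional headers let the server answer 304 Not Modified, and a content hash catches changes when conditional headers are not honored.

```python
import hashlib


def conditional_headers(meta: dict, user_agent: str = "wiki-archiver-bot/1.0") -> dict:
    """Build headers for a conditional re-fetch from previously stored metadata."""
    headers = {"User-Agent": user_agent}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]          # server may reply 304
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers


def is_changed(meta: dict, body: bytes) -> bool:
    """Fallback change check: compare a content hash against the stored one."""
    return hashlib.sha256(body).hexdigest() != meta.get("sha256")
```

Either signal lets a bot skip re-uploading an unchanged page, saving bandwidth on both sides.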

High-level flow/stages:

  • URL partitioning and assignment: partition the URL namespace and assign shards to bots (consistent hashing or range partitioning) to avoid duplicate work.
  • Fetch & parse: bots perform HTTP requests (respecting robots.txt, rate limits, ETag/Last-Modified), parse pages, and extract links and assets.
  • Reporting & storage: bots push compressed content and metadata to the primary server or object storage; the primary server updates the index and deduplicates content.
  • Continuous updates: schedule re-crawl jobs based on edit frequency, change detection, and priority queues.

Skill signals you should demonstrate: distributed systems design (sharding, consistent hashing), job scheduling & orchestration (leader election, queues, backoff), storage design (object store vs DB, indexing), network reliability (retries, timeouts, rate limiting), and data consistency (idempotency, deduplication, versioning). Discuss failure modes, monitoring, and resource optimization (bandwidth, compression, delta updates).
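As one concrete signal for idempotency and deduplication, the primary's ingestion path can be sketched as below (in-memory dicts stand in for an object store and a metadata DB; all names are illustrative). Because re-delivering the same report is a no-op, bot retries after a failed acknowledgment are safe.

```python
import hashlib


class Archive:
    """Primary-side ingestion: idempotent, versioned, content-deduplicated."""

    def __init__(self):
        self.blobs = {}     # sha256 -> content bytes, stored exactly once
        self.versions = {}  # url -> list of sha256 digests, newest last

    def ingest(self, url: str, content: bytes) -> bool:
        """Record a bot's report; return False if it duplicates the latest version."""
        digest = hashlib.sha256(content).hexdigest()
        history = self.versions.setdefault(url, [])
        if history and history[-1] == digest:
            return False  # duplicate delivery (e.g. a retried report): no-op
        self.blobs.setdefault(digest, content)  # identical pages share one blob
        history.append(digest)
        return True
```

Keeping the full digest history per URL also gives you versioning for free, which supports the delta-storage discussion in the follow-ups.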

Common Follow-up Questions

  • How would you shard or partition Wikipedia's URL namespace across 1,000 bots to minimize duplication and balance load? Discuss consistent hashing vs range-based partitioning.
  • Design a strategy to detect and store only changed content (delta updates). How would you use ETag/Last-Modified, content hashing, or diff storage to save bandwidth and space?
  • How do you handle bot failures, network partitions, and duplicate work? Describe checkpointing, idempotent tasks, retries with backoff, and leader election for task reassignment.
  • Explain how you'd respect Wikipedia rate limits and politeness (API quotas, robots.txt). How would you implement global and per-bot throttling while meeting freshness targets?
  • If you need to support fast search over the archive, how would you design indexing and retrieval (metadata DB, search index, CDN for static assets)?
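For the throttling follow-up, a per-bot token bucket is one standard building block (a global limit can be layered on top by having the primary hand out per-bot rates). A minimal sketch, with rate and capacity values left to the politeness budget you negotiate:

```python
import time


class TokenBucket:
    """Allow up to `rate` requests/sec with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: permit an initial burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise the caller should wait."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With 1,000 bots, a global budget of R requests/sec becomes roughly R/1000 per bot, adjusted dynamically as bots fail or rejoin.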

Related Questions

1. Design a scalable web crawler and URL frontier for large-scale indexing
2. Build a distributed backup/replication system for frequently changing content
3. Design a job scheduler for thousands of worker VMs with fault tolerance and dynamic scaling
4. Design change data capture (CDC) and incremental sync for large document stores
5. Architect a cost-efficient object storage + metadata service for petabyte archives
