Lyft System Design: Distributed Wiki Archiving Bots

Topics: Web Crawling, Distributed File Systems, Job Scheduling
Roles: Software Engineer, Backend Engineer, Site Reliability Engineer
Experience: Mid Level, Senior, Staff

Question Description

You must design a backend system that builds and maintains a complete, up-to-date archive of Wikipedia using one primary server and 1,000 small bots. The goals are an initial full copy of every page (hundreds of millions) and continuous updates that reflect edits.

Core requirements: each bot fetches pages over HTTP, parses content and assets (text, images), extracts new URLs, and reports the parsed content and metadata back to the primary server. The system must prevent overlapping work during the initial crawl, support periodic re-checks for updates, and store both content and indexing metadata for later retrieval.
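A good answer makes the re-check step concrete. A minimal sketch of change detection on the bot side, assuming the primary stores per-URL metadata (the `meta` dict shape, `user_agent` string, and function names here are illustrative): conditional headers let the server answer 304 Not Modified, and a content hash catches changes when conditional headers are not honored.

```python
import hashlib


def conditional_headers(meta: dict, user_agent: str = "wiki-archiver-bot/1.0") -> dict:
    """Build headers for a conditional re-fetch from previously stored metadata."""
    headers = {"User-Agent": user_agent}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]          # server may reply 304
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers


def is_changed(meta: dict, body: bytes) -> bool:
    """Fallback change check: compare a content hash against the stored one."""
    return hashlib.sha256(body).hexdigest() != meta.get("sha256")
```

Either signal lets a bot skip re-uploading an unchanged page, saving bandwidth on both sides.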

High-level flow/stages:

  • URL partitioning and assignment: partition the URL namespace and assign shards to bots (consistent hashing or range partitioning) to avoid duplicate work.
  • Fetch & parse: bots perform HTTP requests (respecting robots.txt, rate limits, ETag/Last-Modified), parse pages, and extract links and assets.
  • Reporting & storage: bots push compressed content and metadata to the primary server or object storage; the primary server updates the index and deduplicates content.
  • Continuous updates: schedule re-crawl jobs based on edit frequency, change detection, and priority queues.

Skill signals you should demonstrate: distributed systems design (sharding, consistent hashing), job scheduling & orchestration (leader election, queues, backoff), storage design (object store vs DB, indexing), network reliability (retries, timeouts, rate limiting), and data consistency (idempotency, deduplication, versioning). Discuss failure modes, monitoring, and resource optimization (bandwidth, compression, delta updates).
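As one concrete signal for idempotency and deduplication, the primary's ingestion path can be sketched as below (in-memory dicts stand in for an object store and a metadata DB; all names are illustrative). Because re-delivering the same report is a no-op, bot retries after a failed acknowledgment are safe.

```python
import hashlib


class Archive:
    """Primary-side ingestion: idempotent, versioned, content-deduplicated."""

    def __init__(self):
        self.blobs = {}     # sha256 -> content bytes, stored exactly once
        self.versions = {}  # url -> list of sha256 digests, newest last

    def ingest(self, url: str, content: bytes) -> bool:
        """Record a bot's report; return False if it duplicates the latest version."""
        digest = hashlib.sha256(content).hexdigest()
        history = self.versions.setdefault(url, [])
        if history and history[-1] == digest:
            return False  # duplicate delivery (e.g. a retried report): no-op
        self.blobs.setdefault(digest, content)  # identical pages share one blob
        history.append(digest)
        return True
```

Keeping the full digest history per URL also gives you versioning for free, which supports the delta-storage discussion in the follow-ups.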

Common Follow-up Questions

  • How would you shard or partition Wikipedia's URL namespace across 1,000 bots to minimize duplication and balance load? Discuss consistent hashing vs range-based partitioning.
  • Design a strategy to detect and store only changed content (delta updates). How would you use ETag/Last-Modified, content hashing, or diff storage to save bandwidth and space?
  • How do you handle bot failures, network partitions, and duplicate work? Describe checkpointing, idempotent tasks, retries with backoff, and leader election for task reassignment.
  • Explain how you'd respect Wikipedia rate limits and politeness (API quotas, robots.txt). How would you implement global and per-bot throttling while meeting freshness targets?
  • If you need to support fast search over the archive, how would you design indexing and retrieval (metadata DB, search index, CDN for static assets)?
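For the throttling follow-up, a per-bot token bucket is one standard building block (a global limit can be layered on top by having the primary hand out per-bot rates). A minimal sketch, with rate and capacity values left to the politeness budget you negotiate:

```python
import time


class TokenBucket:
    """Allow up to `rate` requests/sec with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: permit an initial burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise the caller should wait."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With 1,000 bots, a global budget of R requests/sec becomes roughly R/1000 per bot, adjusted dynamically as bots fail or rejoin.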

Related Questions

1. Design a scalable web crawler and URL frontier for large-scale indexing
2. Build a distributed backup/replication system for frequently changing content
3. Design a job scheduler for thousands of worker VMs with fault tolerance and dynamic scaling
4. Design change data capture (CDC) and incremental sync for large document stores
5. Architect a cost-efficient object storage + metadata service for petabyte archives
