Anthropic Coding Interview: Domain-Scoped Web Crawler
Question Description
You are asked to implement a domain-scoped web crawler driven only by a provided HtmlParser.getUrls(url) interface (no other network fetches). The exercise requires three variants — single-threaded, multi-threaded, and asyncio/non-blocking — each accepting concurrency and politeness parameters. Each seed URL is treated as depth 0 and discovery is restricted to the seed's originating domain.
The typical flow you should implement: start from seeds, normalize and deduplicate candidate URLs, schedule visits while enforcing per-host politeness (concurrency limits and crawl delays), parse discovered links, and stop when max depth or max pages is reached. You must measure and report runtimes for the single-threaded and concurrent runs and compute speedup (T_single / T_concurrent).
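The flow above can be sketched as a single-threaded BFS. This is a minimal illustration, not the interview's exact signatures: the `HtmlParser` stub, the parameter names, and the `FakeParser` used for testing are all assumptions.

```python
import time
from collections import deque
from urllib.parse import urlsplit

class HtmlParser:
    """Stand-in for the provided interface; the real one performs the fetch."""
    def getUrls(self, url):
        raise NotImplementedError

def crawl_single(seed, parser, max_depth=3, max_pages=1000):
    """BFS from the seed, restricted to the seed's host, with depth/page caps."""
    host = urlsplit(seed).netloc
    seen = {seed}                  # dedup set: every URL is enqueued at most once
    queue = deque([(seed, 0)])     # (url, depth); the seed is depth 0
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages.append(url)
        if depth >= max_depth:
            continue               # stop expanding past max depth
        for link in parser.getUrls(url):
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

# Timing for the speedup report: measure t_single here, then
# speedup = t_single / t_concurrent for the threaded/async runs.
# t0 = time.perf_counter(); pages = crawl_single(seed, parser)
# t_single = time.perf_counter() - t0
```

The same `time.perf_counter()` bracketing around each variant yields the T_single / T_concurrent ratio the question asks for.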
Skills you need to show: thread-safe and async-safe deduplication, URL canonicalization (resolve relative URLs, remove fragments, canonicalize scheme/host/port), per-host rate limiting, graceful shutdown when limits are hit, and strategies for keeping an asyncio loop non-blocking (e.g., offloading blocking parsing to threadpools). Optionally describe robots.txt caching, near-duplicate filtering, and how you'd scale the crawler across multiple servers with distributed deduplication and coordinated politeness.
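Two of those pieces, canonicalization and per-host rate limiting, can be sketched as follows. The names `canonicalize` and `HostLimiter` are illustrative; only stdlib `urllib.parse` behavior is assumed.

```python
import threading
import time
from urllib.parse import urljoin, urlsplit, urlunsplit

def canonicalize(base, href):
    """Resolve href against base, drop the fragment, lowercase scheme/host,
    and strip default ports (http:80, https:443)."""
    parts = urlsplit(urljoin(base, href))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))

class HostLimiter:
    """Thread-safe per-host politeness: at most one request per `delay` seconds
    per host. The lock only guards the bookkeeping; sleeping happens outside it."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self._lock = threading.Lock()
        self._next_ok = {}  # host -> earliest monotonic time a request may start
    def wait(self, host):
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_ok.get(host, now))
            self._next_ok[host] = start + self.delay
        time.sleep(max(0.0, start - time.monotonic()))
```

Reserving the next slot inside the lock but sleeping outside it means concurrent callers for the same host queue up at `delay`-second intervals without serializing requests to other hosts.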
Focus on clear APIs, correct concurrency semantics, and measurable performance reporting rather than full production infra. Demonstrate trade-offs between strict global coordination and eventual consistency for large-scale crawls.
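For the concurrency-semantics part, one shape the multi-threaded variant can take is a shared work queue, a locked dedup set, and sentinel-based graceful shutdown. This is a sketch under assumed names and parameters, not a prescribed solution.

```python
import threading
from queue import Queue
from urllib.parse import urlsplit

def crawl_threaded(seed, parser, n_workers=8, max_depth=3):
    """Workers pull (url, depth) items; Queue.join() detects quiescence,
    then None sentinels shut the workers down gracefully."""
    host = urlsplit(seed).netloc
    seen = {seed}
    seen_lock = threading.Lock()
    work = Queue()
    work.put((seed, 0))
    pages = []

    def worker():
        while True:
            item = work.get()
            if item is None:
                break
            url, depth = item
            pages.append(url)  # list.append is thread-safe in CPython
            if depth < max_depth:
                for link in parser.getUrls(url):
                    if urlsplit(link).netloc != host:
                        continue
                    with seen_lock:        # dedup must be atomic across workers
                        if link in seen:
                            continue
                        seen.add(link)
                    work.put((link, depth + 1))
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    work.join()            # returns once every queued URL has been processed
    for _ in threads:
        work.put(None)     # one sentinel per worker
    for t in threads:
        t.join()
    return pages
```

Children are enqueued before the parent's `task_done()`, so the unfinished-task count never reaches zero while discoverable work remains; that is what makes `Queue.join()` a correct quiescence test here.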
Common Follow-up Questions
- How would you scale this crawler across multiple servers while preserving per-host politeness and avoiding duplicate visits (cross-process deduplication)?
- Design a concurrency-safe, high-throughput deduplication layer: would you use a centralized datastore, distributed Bloom filters, or a hybrid? Discuss consistency and performance trade-offs.
- How would you integrate robots.txt handling and caching across multiple hosts and processes, and what refresh/invalidation strategy would you use?
- If HtmlParser.getUrls is synchronous and CPU-heavy, how would you keep the asyncio variant non-blocking? Describe concrete integration patterns and their costs.
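One concrete pattern for the last follow-up is `loop.run_in_executor`: the synchronous `getUrls` runs in a thread pool while the event loop keeps scheduling other visits, and because the dedup set is only touched from the event-loop thread it needs no lock. The crawler and parser names below are assumptions for illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

async def crawl_async(seed, parser, max_depth=3, n_workers=8):
    """Asyncio crawl that offloads the blocking parser.getUrls to a thread pool.
    For truly CPU-bound parsing, a ProcessPoolExecutor sidesteps GIL contention
    at the cost of pickling inputs and results across processes."""
    loop = asyncio.get_running_loop()
    host = urlsplit(seed).netloc
    seen = {seed}   # mutated only on the event-loop thread: no lock needed
    pages = []

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        async def visit(url, depth):
            pages.append(url)
            if depth >= max_depth:
                return
            # run_in_executor keeps the loop responsive while getUrls blocks
            links = await loop.run_in_executor(pool, parser.getUrls, url)
            children = []
            for link in links:
                if urlsplit(link).netloc == host and link not in seen:
                    seen.add(link)
                    children.append(visit(link, depth + 1))
            if children:
                await asyncio.gather(*children)

        await visit(seed, 0)
    return pages
```

Run it with `asyncio.run(crawl_async(seed, parser))`. The executor's `max_workers` caps concurrent parser calls, which doubles as a crude global concurrency limit.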