Anthropic Coding Interview: Domain-Scoped Web Crawler
Question Description
You are asked to implement a domain-scoped web crawler driven only by a provided HtmlParser.getUrls(url) interface (no other network fetches). The exercise requires three variants — single-threaded, multi-threaded, and asyncio/non-blocking — each accepting concurrency and politeness parameters. Each seed URL is treated as depth 0 and discovery is restricted to the seed's originating domain.
The typical flow you should implement: start from seeds, normalize and deduplicate candidate URLs, schedule visits while enforcing per-host politeness (concurrency limits and crawl delays), parse discovered links, and stop when max depth or max pages is reached. You must measure and report runtimes for the single-threaded and concurrent runs and compute speedup (T_single / T_concurrent).
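The flow above can be sketched as a single-threaded BFS. This is a minimal illustration, not the interview's exact signatures: the `HtmlParser` stub, the parameter names, and the `FakeParser` used for testing are all assumptions.

```python
import time
from collections import deque
from urllib.parse import urlsplit

class HtmlParser:
    """Stand-in for the provided interface; the real one performs the fetch."""
    def getUrls(self, url):
        raise NotImplementedError

def crawl_single(seed, parser, max_depth=3, max_pages=1000):
    """BFS from the seed, restricted to the seed's host, with depth/page caps."""
    host = urlsplit(seed).netloc
    seen = {seed}                  # dedup set: every URL is enqueued at most once
    queue = deque([(seed, 0)])     # (url, depth); the seed is depth 0
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages.append(url)
        if depth >= max_depth:
            continue               # stop expanding past max depth
        for link in parser.getUrls(url):
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

# Timing for the speedup report: measure t_single here, then
# speedup = t_single / t_concurrent for the threaded/async runs.
# t0 = time.perf_counter(); pages = crawl_single(seed, parser)
# t_single = time.perf_counter() - t0
```

The same `time.perf_counter()` bracketing around each variant yields the T_single / T_concurrent ratio the question asks for.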
Skills you need to show: thread-safe and async-safe deduplication, URL canonicalization (resolve relative URLs, remove fragments, canonicalize scheme/host/port), per-host rate limiting, graceful shutdown when limits are hit, and strategies for keeping an asyncio loop non-blocking (e.g., offloading blocking parsing to threadpools). Optionally describe robots.txt caching, near-duplicate filtering, and how you'd scale the crawler across multiple servers with distributed deduplication and coordinated politeness.
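Two of those pieces, canonicalization and per-host rate limiting, can be sketched as follows. The names `canonicalize` and `HostLimiter` are illustrative; only stdlib `urllib.parse` behavior is assumed.

```python
import threading
import time
from urllib.parse import urljoin, urlsplit, urlunsplit

def canonicalize(base, href):
    """Resolve href against base, drop the fragment, lowercase scheme/host,
    and strip default ports (http:80, https:443)."""
    parts = urlsplit(urljoin(base, href))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))

class HostLimiter:
    """Thread-safe per-host politeness: at most one request per `delay` seconds
    per host. The lock only guards the bookkeeping; sleeping happens outside it."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self._lock = threading.Lock()
        self._next_ok = {}  # host -> earliest monotonic time a request may start
    def wait(self, host):
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_ok.get(host, now))
            self._next_ok[host] = start + self.delay
        time.sleep(max(0.0, start - time.monotonic()))
```

Reserving the next slot inside the lock but sleeping outside it means concurrent callers for the same host queue up at `delay`-second intervals without serializing requests to other hosts.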
Focus on clear APIs, correct concurrency semantics, and measurable performance reporting rather than full production infra. Demonstrate trade-offs between strict global coordination and eventual consistency for large-scale crawls.
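For the concurrency-semantics part, one shape the multi-threaded variant can take is a shared work queue, a locked dedup set, and sentinel-based graceful shutdown. This is a sketch under assumed names and parameters, not a prescribed solution.

```python
import threading
from queue import Queue
from urllib.parse import urlsplit

def crawl_threaded(seed, parser, n_workers=8, max_depth=3):
    """Workers pull (url, depth) items; Queue.join() detects quiescence,
    then None sentinels shut the workers down gracefully."""
    host = urlsplit(seed).netloc
    seen = {seed}
    seen_lock = threading.Lock()
    work = Queue()
    work.put((seed, 0))
    pages = []

    def worker():
        while True:
            item = work.get()
            if item is None:
                break
            url, depth = item
            pages.append(url)  # list.append is thread-safe in CPython
            if depth < max_depth:
                for link in parser.getUrls(url):
                    if urlsplit(link).netloc != host:
                        continue
                    with seen_lock:        # dedup must be atomic across workers
                        if link in seen:
                            continue
                        seen.add(link)
                    work.put((link, depth + 1))
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    work.join()            # returns once every queued URL has been processed
    for _ in threads:
        work.put(None)     # one sentinel per worker
    for t in threads:
        t.join()
    return pages
```

Children are enqueued before the parent's `task_done()`, so the unfinished-task count never reaches zero while discoverable work remains; that is what makes `Queue.join()` a correct quiescence test here.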
Common Follow-up Questions
- How would you scale this crawler across multiple servers while preserving per-host politeness and avoiding duplicate visits (cross-process deduplication)?
- Design a concurrency-safe, high-throughput deduplication layer: would you use a centralized datastore, distributed Bloom filters, or a hybrid? Discuss consistency and performance trade-offs.
- How would you integrate robots.txt handling and caching across multiple hosts and processes, and what refresh/invalidation strategy would you use?
- If HtmlParser.getUrls is synchronous and CPU-heavy, how would you keep the asyncio variant non-blocking? Describe concrete integration patterns and their costs.
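One concrete pattern for the last follow-up is `loop.run_in_executor`: the synchronous `getUrls` runs in a thread pool while the event loop keeps scheduling other visits, and because the dedup set is only touched from the event-loop thread it needs no lock. The crawler and parser names below are assumptions for illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

async def crawl_async(seed, parser, max_depth=3, n_workers=8):
    """Asyncio crawl that offloads the blocking parser.getUrls to a thread pool.
    For truly CPU-bound parsing, a ProcessPoolExecutor sidesteps GIL contention
    at the cost of pickling inputs and results across processes."""
    loop = asyncio.get_running_loop()
    host = urlsplit(seed).netloc
    seen = {seed}   # mutated only on the event-loop thread: no lock needed
    pages = []

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        async def visit(url, depth):
            pages.append(url)
            if depth >= max_depth:
                return
            # run_in_executor keeps the loop responsive while getUrls blocks
            links = await loop.run_in_executor(pool, parser.getUrls, url)
            children = []
            for link in links:
                if urlsplit(link).netloc == host and link not in seen:
                    seen.add(link)
                    children.append(visit(link, depth + 1))
            if children:
                await asyncio.gather(*children)

        await visit(seed, 0)
    return pages
```

Run it with `asyncio.run(crawl_async(seed, parser))`. The executor's `max_workers` caps concurrent parser calls, which doubles as a crude global concurrency limit.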