Atlassian System Design: Scalable Image Crawler Service
Question Description
Design a scalable backend service that accepts user-submitted crawling jobs and recursively discovers image URLs and metadata without downloading the image bytes. You’ll expose RESTful job-management APIs to create, monitor, cancel, and query jobs; an internal scheduling layer must prioritize and distribute work across crawlers; and a parsing pipeline must extract <img> tags, CSS image references, and JS-rendered images where needed.
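One way the job-management surface could look, sketched as a minimal in-memory store behind the REST layer. The `JobStore` class, endpoint mapping, and field names are illustrative assumptions, not a prescribed design:

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    CANCELLED = "cancelled"
    DONE = "done"


@dataclass
class Job:
    job_id: str
    start_urls: list
    max_depth: int = 3
    state: JobState = JobState.PENDING
    images_found: int = 0


class JobStore:
    """Hypothetical backing store for POST /jobs, GET /jobs/{id}, DELETE /jobs/{id}."""

    def __init__(self):
        self._jobs = {}

    def create(self, start_urls, max_depth=3):
        # POST /jobs: register a new crawl with its constraints.
        job = Job(job_id=uuid.uuid4().hex, start_urls=start_urls, max_depth=max_depth)
        self._jobs[job.job_id] = job
        return job

    def status(self, job_id):
        # GET /jobs/{id}: report state and progress counters.
        return self._jobs[job_id]

    def cancel(self, job_id):
        # DELETE /jobs/{id}: cancellation is a state transition, not a hard kill;
        # workers observe the state and stop claiming frontier entries.
        job = self._jobs[job_id]
        if job.state in (JobState.PENDING, JobState.RUNNING):
            job.state = JobState.CANCELLED
        return job
```

In a real deployment this state would live in a durable store so that cancellation and progress survive scheduler restarts.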
Flow & stages
- Job intake: user submits a job with starting URLs and optional constraints (depth, domain allowlist, urgency).
- Frontier & scheduler: a distributed URL frontier enqueues targets, applies domain-aware politeness (robots.txt, rate limits), and prioritizes by job weight/latency requirements.
- Worker pool & parsers: horizontally scaled crawlers fetch pages (with retries, backoff, and optional JS rendering), normalize URLs, detect cycles (visited set / Bloom filters), and extract image URLs + metadata (alt text, dimensions, format, source, timestamp).
- Storage & search: write deduplicated image records and provenance into a persistent store (indexed for keyword and source lookups). Provide search APIs for filtering by metadata.
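The frontier's domain-aware politeness from the stages above can be sketched as a per-domain token bucket: each host gets its own bucket, so throttling one slow site never limits parallelism elsewhere. The `PolitenessGate` name and the rate/burst defaults are assumptions for illustration:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


class PolitenessGate:
    """Per-domain token bucket: at most `rate` fetches/sec per host,
    with independent buckets so cross-domain parallelism is unconstrained."""

    def __init__(self, rate=1.0, burst=2.0):
        self.rate = rate          # tokens refilled per second, per host
        self.burst = burst        # maximum bucket size (allowed burst)
        self._tokens = defaultdict(lambda: burst)
        self._last = {}           # host -> timestamp of last refill

    def try_acquire(self, url, now=None):
        """Non-blocking check: True means the fetch may proceed now;
        False means the scheduler should requeue the URL for later."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self._last.setdefault(host, now)
        self._last[host] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self._tokens[host] = min(self.burst, self._tokens[host] + (now - last) * self.rate)
        if self._tokens[host] >= 1.0:
            self._tokens[host] -= 1.0
            return True
        return False
```

A production frontier would layer robots.txt rules and Crawl-delay hints on top of this, and share the bucket state across workers (e.g. in Redis) rather than keeping it in-process.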
Skill signals
You should demonstrate knowledge of distributed queues and job schedulers, URL normalization and duplicate-detection techniques, rate-limiting and politeness, storage/indexing trade-offs (SQL vs. NoSQL, secondary indexes), fault tolerance and idempotency, and observability for progress tracking and debugging.
Practical trade-offs to discuss include JS rendering costs, dedupe strategy (URL vs. content-hash), consistency model for job progress, and throttling to avoid overloading target sites.
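As a concrete look at the URL side of that dedupe trade-off, a canonicalization pass collapses trivially distinct URLs to one key before hashing. The specific rules here (lowercasing, dropping default ports and fragments, stripping a small tracking-parameter list, sorting query params) are a common illustrative set, not an exhaustive one:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list; real crawlers maintain a larger, tunable set.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def normalize_url(url):
    """Canonicalize a URL so equivalent forms dedupe to the same key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # Keep the port only if it is non-default for the scheme.
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    # Drop tracking params and sort the rest for a stable ordering.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    path = parts.path or "/"
    # Fragments never reach the server, so they are always dropped.
    return urlunsplit((scheme, host, path, query, ""))
```

URL normalization alone misses the same image served from different paths or CDNs; catching those requires content or perceptual hashing, which costs byte downloads this service otherwise avoids.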
Common Follow-up Questions
- How would you implement per-domain politeness and global rate limiting while maximizing parallelism?
- Describe strategies to discover images loaded by JavaScript (SPAs) and the trade-offs of using headless browsers vs. lightweight heuristics.
- How do you deduplicate image records when the same image appears under multiple URLs (URL normalization, content hashing, perceptual hashing)?
- Design the monitoring and retry strategy: how do you track job progress, report errors, and ensure exactly-once/at-least-once processing semantics?
- How would you prioritize work across jobs (fairness vs. latency), and design backpressure when the system is overloaded?
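For the visited-set part of the dedupe discussion, a Bloom filter keeps frontier membership checks memory-bounded at the cost of rare false positives (a URL wrongly skipped), with no false negatives. The sizing formulas and double-hashing scheme below are the standard textbook choice, shown here as a sketch:

```python
import hashlib
import math


class BloomFilter:
    """Approximate set for 'have we seen this URL?': `contains` may
    rarely return a false positive, but never a false negative."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = max(8, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.k = max(1, int(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Because entries cannot be deleted, long-running crawls typically rotate or partition filters per job; an occasional false positive is acceptable for a frontier, since it only means one URL goes unvisited.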