Atlassian System Design: Scalable Image Crawler Service

Topics:
  • Web Crawling
  • Distributed File Systems
  • Job Scheduling

Roles:
  • Software Engineer
  • Backend Engineer
  • Site Reliability Engineer

Experience:
  • Mid Level
  • Senior
  • Staff

Question Description

Design a scalable backend service that accepts user-submitted crawling jobs and recursively discovers image URLs and metadata without downloading the image bytes. You’ll expose RESTful job-management APIs to create, monitor, cancel, and query jobs; an internal scheduling layer must prioritize and distribute work across crawlers; and a parsing pipeline must extract <img> tags, CSS background-image references, and JS-rendered images where needed.
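To make the job-management surface concrete, here is a minimal in-memory sketch of the job lifecycle behind those REST endpoints. All names (`JobStore`, `CrawlJob`, the state values) are illustrative assumptions, not part of the question; a real service would back this with a durable store and expose it via HTTP handlers.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CANCELLED = "cancelled"
    DONE = "done"


@dataclass
class CrawlJob:
    seed_urls: list
    max_depth: int = 2
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: JobState = JobState.QUEUED


class JobStore:
    """In-memory stand-in for the job-management backing store."""

    def __init__(self):
        self._jobs = {}

    def create(self, seed_urls, max_depth=2):   # would back POST /jobs
        job = CrawlJob(seed_urls=seed_urls, max_depth=max_depth)
        self._jobs[job.job_id] = job
        return job.job_id

    def status(self, job_id):                   # would back GET /jobs/{id}
        return self._jobs[job_id].state

    def cancel(self, job_id):                   # would back DELETE /jobs/{id}
        job = self._jobs[job_id]
        if job.state in (JobState.QUEUED, JobState.RUNNING):
            job.state = JobState.CANCELLED
        return job.state
```

Note that cancel is modeled as a state transition rather than a deletion, so provenance and partial results for a cancelled job remain queryable.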

Flow & stages

  • Job intake: user submits a job with starting URLs and optional constraints (depth, domain allowlist, urgency).
  • Frontier & scheduler: a distributed URL frontier enqueues targets, applies domain-aware politeness (robots.txt, rate limits), and prioritizes by job weight/latency requirements.
  • Worker pool & parsers: horizontally scaled crawlers fetch pages (with retries, backoff, and optional JS rendering), normalize URLs, detect cycles (visited set / Bloom filters), and extract image URLs and metadata (alt text, dimensions, format, source, timestamp).
  • Storage & search: write deduplicated image records and provenance into a persistent store (indexed for keyword and source lookups). Provide search APIs for filtering by metadata.
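The frontier-and-scheduler stage above can be sketched as a priority queue gated by per-domain politeness: pop the highest-priority URL whose domain has waited long enough since its last fetch. This is a single-process simplification (the class name and fixed delay are assumptions); a production frontier would shard by domain across nodes and read the delay from robots.txt crawl-delay hints.

```python
import heapq
import time
from urllib.parse import urlparse


class PoliteFrontier:
    """Priority URL frontier with a fixed per-domain politeness delay."""

    def __init__(self, per_domain_delay=1.0):
        self._heap = []       # entries: (priority, seq, url); lower = sooner
        self._seq = 0         # insertion counter breaks priority ties FIFO
        self._next_ok = {}    # domain -> earliest allowed fetch time
        self._delay = per_domain_delay

    def push(self, url, priority=0):
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1

    def pop(self, now=None):
        """Return the best eligible URL, or None if every queued domain
        is still inside its politeness window."""
        now = time.monotonic() if now is None else now
        deferred = []
        chosen = None
        while self._heap:
            prio, seq, url = heapq.heappop(self._heap)
            domain = urlparse(url).netloc
            if self._next_ok.get(domain, 0.0) <= now:
                self._next_ok[domain] = now + self._delay
                chosen = url
                break
            deferred.append((prio, seq, url))  # domain still cooling down
        for item in deferred:                  # restore skipped entries
            heapq.heappush(self._heap, item)
        return chosen
```

Returning None instead of blocking lets the worker pull from another frontier shard, which is how parallelism is preserved while politeness is enforced.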

Skill signals

You should demonstrate knowledge of distributed queues and job schedulers, URL normalization and duplicate-detection techniques, rate-limiting and politeness, storage/indexing trade-offs (SQL vs. NoSQL, secondary indexes), fault tolerance and idempotency, and observability for progress tracking and debugging.
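URL normalization, one of the duplicate-detection techniques named above, can be sketched with the standard library alone: lowercase the scheme and host, drop the fragment and default ports, and sort query parameters so trivially different spellings collapse to one canonical key. The exact rule set is an assumption; real crawlers also strip known tracking parameters, which is deliberately omitted here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url):
    """Canonicalize a URL so equivalent spellings dedupe to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port and not (scheme == "http" and port == 80) \
            and not (scheme == "https" and port == 443):
        host = f"{host}:{port}"
    path = parts.path or "/"                       # empty path -> "/"
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped
```

Normalizing before the visited-set check is what keeps the Bloom filter from treating `HTTP://Example.com/a?b=1&a=2#x` and `http://example.com/a?a=2&b=1` as two different pages.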

Practical trade-offs to discuss include JS rendering costs, dedupe strategy (URL vs. content-hash), consistency model for job progress, and throttling to avoid overloading target sites.
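The URL-vs-content-hash trade-off reduces to two dedupe keys with different costs. A sketch, with hypothetical function names: a URL-based key is free but misses the same image mirrored under different URLs, while a content-based key catches mirrors but requires downloading bytes this service avoids by default (and still misses re-encoded copies, which is where perceptual hashing comes in).

```python
import hashlib


def url_key(canonical_url):
    """Cheap dedupe key: digest of the canonical URL.
    Misses identical images served from different URLs (CDNs, mirrors)."""
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()


def content_key(image_bytes):
    """Stronger dedupe key: digest of the raw image bytes.
    Catches mirrors, but requires fetching the image; byte-exact only,
    so re-encoded copies need perceptual hashing instead."""
    return hashlib.sha256(image_bytes).hexdigest()
```

A common compromise is to dedupe on `url_key` at ingest and compute `content_key` lazily, only for records a downstream consumer actually fetches.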

Common Follow-up Questions

  • How would you implement per-domain politeness and global rate limiting while maximizing parallelism?
  • Describe strategies to discover images loaded by JavaScript (SPA) and the trade-offs of using headless browsers vs. lightweight heuristics.
  • How do you deduplicate image records when the same image appears under multiple URLs (URL normalization, content hashing, perceptual hashing)?
  • Design the monitoring and retry strategy: how do you track job progress, report errors, and ensure exactly-once/at-least-once processing semantics?
  • How would you prioritize work across jobs (fairness vs. latency), and design backpressure when the system is overloaded?
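For the rate-limiting and backpressure questions above, a token bucket is the standard building block: one bucket per domain enforces politeness, one shared bucket caps global throughput, and a fetch proceeds only when both grant a token. The sketch below is deliberately single-threaded and takes the clock as a parameter; a production version needs locking (or a shared store like Redis) and should return unused tokens when the second check fails.

```python
class TokenBucket:
    """Token-bucket limiter: refill is proportional to elapsed time,
    and each allowed request consumes one token."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full
        self.last = 0.0                  # timestamp of last refill

    def allow(self, now):
        # Refill based on time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                     # caller should defer or shed load
```

Returning False (rather than blocking) is what gives the scheduler a backpressure signal: denied work stays in the frontier and the worker moves on to another domain or job.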

Related Questions

1. Design a distributed URL frontier for a high-throughput web crawler
2. How to build a scalable job scheduler for heterogeneous crawling workloads
3. Design an image metadata storage and search system for billions of records
4. How to implement duplicate detection and canonicalization for web resources
5. Design a polite crawler: robots.txt, rate limiting and domain isolation strategies


Atlassian Scalable Image Crawler Design & Job APIs | Voker