
Atlassian System Design: Scalable Image Crawler Service

Question Description

Design a scalable backend service that accepts user-submitted crawling jobs and recursively discovers image URLs and metadata without downloading the image bytes. You’ll expose RESTful job-management APIs to create, monitor, cancel and query jobs; an internal scheduling layer must prioritize and distribute work across crawlers; and a parsing pipeline must extract <img> tags, CSS/image references, and JS-rendered images where needed.

Flow & stages

  • Job intake: user submits a job with starting URLs and optional constraints (depth, domain allowlist, urgency).
  • Frontier & scheduler: a distributed URL frontier enqueues targets, applies domain-aware politeness (robots.txt, rate limits), and prioritizes by job weight/latency requirements.
  • Worker pool & parsers: horizontally scaled crawlers fetch pages (with retries, backoff, and optional JS rendering), normalize URLs, detect cycles (visited set / Bloom filters), and extract image URLs plus metadata (alt text, dimensions, format, source, timestamp).
  • Storage & search: write deduplicated image records and provenance into a persistent store (indexed for keyword and source lookups). Provide search APIs for filtering by metadata.
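The worker-side parsing step above can be sketched with the stdlib alone. This is a minimal illustration, not a production parser: `ImageExtractor` and `parse_page` are hypothetical names, and a real crawler would also handle CSS references, `srcset`, and JS-rendered content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageExtractor(HTMLParser):
    """Collects <img> URLs/metadata and outgoing links from one fetched page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []   # extracted image records (metadata only, no bytes)
        self.links = []    # <a href> targets to feed back into the frontier

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img" and a.get("src"):
            self.images.append({
                "url": urljoin(self.base_url, a["src"]),  # resolve relative src
                "alt": a.get("alt", ""),
                "width": a.get("width"),
                "height": a.get("height"),
            })
        elif tag == "a" and a.get("href"):
            self.links.append(urljoin(self.base_url, a["href"]))

def parse_page(base_url, html):
    """Return (image records, discovered links) for one page of HTML."""
    parser = ImageExtractor(base_url)
    parser.feed(html)
    return parser.images, parser.links
```

The returned links go back to the frontier (after normalization and visited-set checks), while the image records flow to the storage layer.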

Skill signals

You should demonstrate knowledge of distributed queues and job schedulers, URL normalization and duplicate-detection techniques, rate-limiting and politeness, storage/indexing trade-offs (SQL vs. NoSQL, secondary indexes), fault tolerance and idempotency, and observability for progress tracking and debugging.
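One of the duplicate-detection techniques mentioned above, the Bloom filter, is worth being able to sketch from scratch. This is an illustrative single-process version (a distributed crawler would typically shard it or use a backing store); the sizing formulas are the standard ones for a target false-positive rate.

```python
import hashlib
import math

class BloomFilter:
    """Probabilistic visited-URL set: no false negatives, tunable false-positive rate."""
    def __init__(self, capacity, error_rate=0.01):
        # standard sizing: m bits and k hash functions for the target error rate
        self.m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # derive k bit positions from two halves of one SHA-256 (double hashing)
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

The trade-off to call out in an interview: a "seen" answer may be a false positive (the URL is skipped though never visited), but "not seen" is always correct, which is the right asymmetry for a crawler's visited set.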

Practical trade-offs to discuss include JS rendering costs, dedupe strategy (URL vs. content-hash), consistency model for job progress, and throttling to avoid overloading target sites.
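The URL-vs-content-hash dedupe trade-off can be made concrete with a small sketch (function names are illustrative). URL normalization is cheap and fits this service's no-download constraint; content hashing catches the same bytes under different URLs but requires fetching them.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    """Canonicalize a URL so trivially different forms dedupe to one key."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # drop default ports (http:80, https:443)
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    # sort query parameters for a stable key; drop the fragment entirely
    query = urlencode(sorted(parse_qsl(query)))
    return urlunsplit((scheme, netloc, path or "/", query, ""))

def dedupe_key(url, content_bytes=None):
    """URL-level key by default; exact content hash when bytes are available."""
    if content_bytes is not None:
        return "sha256:" + hashlib.sha256(content_bytes).hexdigest()
    return "url:" + normalize_url(url)
```

Perceptual hashing (for near-duplicate images) sits a step beyond this and needs the pixel data, so it is usually discussed as an optional offline pass rather than part of the crawl loop.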

Common Follow-up Questions

  • How would you implement per-domain politeness and global rate limiting while maximizing parallelism?
  • Describe strategies to discover images loaded by JavaScript (SPA) and the trade-offs of using headless browsers vs. lightweight heuristics.
  • How do you deduplicate image records when the same image appears under multiple URLs (URL normalization, content hashing, perceptual hashing)?
  • Design the monitoring and retry strategy: how do you track job progress, report errors, and ensure exactly-once/at-least-once processing semantics?
  • How would you prioritize work across jobs (fairness vs. latency), and design backpressure when the system is overloaded?
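For the first follow-up, a reasonable starting answer is a token bucket per domain nested under a global bucket: a fetch proceeds only when both allow it. Below is a single-process sketch with an injectable clock for testability; the class name and parameters are illustrative, and a distributed system would keep the buckets in shared state (e.g. Redis).

```python
import time
from urllib.parse import urlsplit

class DomainRateLimiter:
    """Token bucket per domain plus a global bucket; fetch only when both allow."""
    def __init__(self, per_domain_rps=1.0, global_rps=100.0, clock=time.monotonic):
        self.per_domain_rps = per_domain_rps
        self.global_rps = global_rps
        self.clock = clock
        self.domains = {}  # domain -> (tokens, last_refill_time)
        self.global_bucket = (global_rps, clock())

    def _take(self, tokens, last, rate, burst):
        # refill proportionally to elapsed time, capped at the burst size
        now = self.clock()
        tokens = min(burst, tokens + (now - last) * rate)
        if tokens >= 1.0:
            return True, tokens - 1.0, now
        return False, tokens, now

    def try_acquire(self, url):
        domain = urlsplit(url).netloc.lower()
        d_tokens, d_last = self.domains.get(domain, (1.0, self.clock()))
        ok_d, d_tokens, d_last = self._take(
            d_tokens, d_last, self.per_domain_rps, 1.0)
        if not ok_d:
            self.domains[domain] = (d_tokens, d_last)
            return False
        g_tokens, g_last = self.global_bucket
        ok_g, g_tokens, g_last = self._take(
            g_tokens, g_last, self.global_rps, self.global_rps)
        if ok_g:
            # commit the domain token only when the global bucket also allowed it
            self.domains[domain] = (d_tokens, d_last)
        self.global_bucket = (g_tokens, g_last)
        return ok_g
```

Parallelism is preserved because denial on one domain does not block others; a denied URL goes back into the frontier rather than busy-waiting in the worker.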

Related Questions

1. Lyft System Design: Distributed Wiki Archiving Bots
2. Anthropic Coding Interview: Domain-Scoped Web Crawler
3. Design a distributed URL frontier for a high-throughput web crawler
4. How to build a scalable job scheduler for heterogeneous crawling workloads
5. Design an image metadata storage and search system for billions of records
6. How to implement duplicate detection and canonicalization for web resources
7. Design a polite crawler: robots.txt, rate limiting and domain isolation strategies

