Atlassian System Design: Scalable Image Crawler Service

Topics:
  • Web Crawling
  • Distributed File Systems
  • Job Scheduling

Roles:
  • Software Engineer
  • Backend Engineer
  • Site Reliability Engineer

Experience:
  • Mid Level
  • Senior
  • Staff

Question Description

Design a scalable backend service that accepts user-submitted crawling jobs and recursively discovers image URLs and metadata without downloading the image bytes. You’ll expose RESTful job-management APIs to create, monitor, cancel, and query jobs; an internal scheduling layer must prioritize and distribute work across crawlers; and a parsing pipeline must extract <img> tags, CSS background-image references, and JS-rendered images where needed.
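To make the job-management surface concrete, here is a minimal in-memory sketch of the job lifecycle behind those REST endpoints. All names (`JobStore`, `CrawlJob`, the state values) are illustrative assumptions, not part of the question; a real service would back this with a durable store and expose it via HTTP handlers.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CANCELLED = "cancelled"
    DONE = "done"


@dataclass
class CrawlJob:
    seed_urls: list
    max_depth: int = 2
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: JobState = JobState.QUEUED


class JobStore:
    """In-memory stand-in for the job-management backing store."""

    def __init__(self):
        self._jobs = {}

    def create(self, seed_urls, max_depth=2):   # would back POST /jobs
        job = CrawlJob(seed_urls=seed_urls, max_depth=max_depth)
        self._jobs[job.job_id] = job
        return job.job_id

    def status(self, job_id):                   # would back GET /jobs/{id}
        return self._jobs[job_id].state

    def cancel(self, job_id):                   # would back DELETE /jobs/{id}
        job = self._jobs[job_id]
        if job.state in (JobState.QUEUED, JobState.RUNNING):
            job.state = JobState.CANCELLED
        return job.state
```

Note that cancel is modeled as a state transition rather than a deletion, so provenance and partial results for a cancelled job remain queryable.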

Flow & stages

  • Job intake: user submits a job with starting URLs and optional constraints (depth, domain allowlist, urgency).
  • Frontier & scheduler: a distributed URL frontier enqueues targets, applies domain-aware politeness (robots.txt, rate limits), and prioritizes by job weight/latency requirements.
  • Worker pool & parsers: horizontally scaled crawlers fetch pages (with retries, backoff, and optional JS rendering), normalize URLs, detect cycles (visited set / Bloom filters), and extract image URLs and metadata (alt text, dimensions, format, source, timestamp).
  • Storage & search: write deduplicated image records and provenance into a persistent store (indexed for keyword and source lookups). Provide search APIs for filtering by metadata.
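The frontier-and-scheduler stage above can be sketched as a priority queue gated by per-domain politeness: pop the highest-priority URL whose domain has waited long enough since its last fetch. This is a single-process simplification (the class name and fixed delay are assumptions); a production frontier would shard by domain across nodes and read the delay from robots.txt crawl-delay hints.

```python
import heapq
import time
from urllib.parse import urlparse


class PoliteFrontier:
    """Priority URL frontier with a fixed per-domain politeness delay."""

    def __init__(self, per_domain_delay=1.0):
        self._heap = []       # entries: (priority, seq, url); lower = sooner
        self._seq = 0         # insertion counter breaks priority ties FIFO
        self._next_ok = {}    # domain -> earliest allowed fetch time
        self._delay = per_domain_delay

    def push(self, url, priority=0):
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1

    def pop(self, now=None):
        """Return the best eligible URL, or None if every queued domain
        is still inside its politeness window."""
        now = time.monotonic() if now is None else now
        deferred = []
        chosen = None
        while self._heap:
            prio, seq, url = heapq.heappop(self._heap)
            domain = urlparse(url).netloc
            if self._next_ok.get(domain, 0.0) <= now:
                self._next_ok[domain] = now + self._delay
                chosen = url
                break
            deferred.append((prio, seq, url))  # domain still cooling down
        for item in deferred:                  # restore skipped entries
            heapq.heappush(self._heap, item)
        return chosen
```

Returning None instead of blocking lets the worker pull from another frontier shard, which is how parallelism is preserved while politeness is enforced.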

Skill signals

You should demonstrate knowledge of distributed queues and job schedulers, URL normalization and duplicate-detection techniques, rate-limiting and politeness, storage/indexing trade-offs (SQL vs. NoSQL, secondary indexes), fault tolerance and idempotency, and observability for progress tracking and debugging.
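URL normalization, one of the duplicate-detection techniques named above, can be sketched with the standard library alone: lowercase the scheme and host, drop the fragment and default ports, and sort query parameters so trivially different spellings collapse to one canonical key. The exact rule set is an assumption; real crawlers also strip known tracking parameters, which is deliberately omitted here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url):
    """Canonicalize a URL so equivalent spellings dedupe to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port and not (scheme == "http" and port == 80) \
            and not (scheme == "https" and port == 443):
        host = f"{host}:{port}"
    path = parts.path or "/"                       # empty path -> "/"
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped
```

Normalizing before the visited-set check is what keeps the Bloom filter from treating `HTTP://Example.com/a?b=1&a=2#x` and `http://example.com/a?a=2&b=1` as two different pages.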

Practical trade-offs to discuss include JS rendering costs, dedupe strategy (URL vs. content-hash), consistency model for job progress, and throttling to avoid overloading target sites.
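The URL-vs-content-hash trade-off reduces to two dedupe keys with different costs. A sketch, with hypothetical function names: a URL-based key is free but misses the same image mirrored under different URLs, while a content-based key catches mirrors but requires downloading bytes this service avoids by default (and still misses re-encoded copies, which is where perceptual hashing comes in).

```python
import hashlib


def url_key(canonical_url):
    """Cheap dedupe key: digest of the canonical URL.
    Misses identical images served from different URLs (CDNs, mirrors)."""
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()


def content_key(image_bytes):
    """Stronger dedupe key: digest of the raw image bytes.
    Catches mirrors, but requires fetching the image; byte-exact only,
    so re-encoded copies need perceptual hashing instead."""
    return hashlib.sha256(image_bytes).hexdigest()
```

A common compromise is to dedupe on `url_key` at ingest and compute `content_key` lazily, only for records a downstream consumer actually fetches.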

Common Follow-up Questions

  • How would you implement per-domain politeness and global rate limiting while maximizing parallelism?
  • Describe strategies to discover images loaded by JavaScript (SPA) and the trade-offs of using headless browsers vs. lightweight heuristics.
  • How do you deduplicate image records when the same image appears under multiple URLs (URL normalization, content hashing, perceptual hashing)?
  • Design the monitoring and retry strategy: how do you track job progress, report errors, and ensure exactly-once/at-least-once processing semantics?
  • How would you prioritize work across jobs (fairness vs. latency), and design backpressure when the system is overloaded?
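For the rate-limiting and backpressure questions above, a token bucket is the standard building block: one bucket per domain enforces politeness, one shared bucket caps global throughput, and a fetch proceeds only when both grant a token. The sketch below is deliberately single-threaded and takes the clock as a parameter; a production version needs locking (or a shared store like Redis) and should return unused tokens when the second check fails.

```python
class TokenBucket:
    """Token-bucket limiter: refill is proportional to elapsed time,
    and each allowed request consumes one token."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full
        self.last = 0.0                  # timestamp of last refill

    def allow(self, now):
        # Refill based on time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                     # caller should defer or shed load
```

Returning False (rather than blocking) is what gives the scheduler a backpressure signal: denied work stays in the frontier and the worker moves on to another domain or job.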

Related Questions

1. Design a distributed URL frontier for a high-throughput web crawler
2. How to build a scalable job scheduler for heterogeneous crawling workloads
3. Design an image metadata storage and search system for billions of records
4. How to implement duplicate detection and canonicalization for web resources
5. Design a polite crawler: robots.txt, rate limiting and domain isolation strategies


Atlassian Scalable Image Crawler Design & Job APIs | Voker