Google System Design: Twitter Hashtag Aggregator Guide
Question Description
Designing a Twitter-like hashtag aggregator tests your ability to build high-throughput, low-latency analytics pipelines that support real-time and historical queries.
You are asked to ingest live tweet events (hashtags, timestamps, engagement metrics), compute aggregated metrics over configurable windows (minute/hour/day), and serve queries such as top trending hashtags, time-series counts, and engagement summaries. In the interview, walk through an end-to-end flow: ingest → stream processing & windowing → aggregation/storage → query API and dashboard.
Start by explaining ingestion (Pub/Sub/Kafka), partitioning by hashtag or user, and how you’ll shard for scale. Then describe stateful stream processing: tumbling/sliding windows, watermarking to handle late events, and fault-tolerant state backends (e.g., Kafka Streams/Flink with durable checkpoints). Discuss aggregation storage: a real-time store (Druid/ClickHouse/Redis timeseries) for fast top-K queries and a cold OLAP store (BigQuery/S3 + batch jobs) for long-range analytics.
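The windowing and watermarking ideas above can be sketched in a few lines. This is a minimal, illustrative Python model (not Flink or Kafka Streams code): the `TumblingWindowAggregator` class, its parameters, and the one-minute window size are all hypothetical choices for demonstration. The watermark trails the maximum observed event time by an allowed-lateness bound; windows whose end falls behind the watermark are closed and emitted, and events arriving for already-closed windows are counted as late.

```python
from collections import defaultdict

class TumblingWindowAggregator:
    """Hypothetical sketch: per-hashtag counts in event-time tumbling windows.

    The watermark is (max event time seen) - (allowed lateness). A window
    [start, start + window_ms) is closed once its end <= watermark.
    """

    def __init__(self, window_ms=60_000, allowed_lateness_ms=30_000):
        self.window_ms = window_ms
        self.allowed_lateness_ms = allowed_lateness_ms
        self.max_event_time = 0
        # window_start -> hashtag -> count
        self.open_windows = defaultdict(lambda: defaultdict(int))
        self.late_events = 0  # a real system might route these to a side output

    def watermark(self):
        return self.max_event_time - self.allowed_lateness_ms

    def ingest(self, hashtag, event_time_ms):
        """Add one event; return any windows closed by the advancing watermark."""
        window_start = event_time_ms - (event_time_ms % self.window_ms)
        if window_start + self.window_ms <= self.watermark():
            # The event's window was already closed: too late to include.
            self.late_events += 1
            return []
        self.open_windows[window_start][hashtag] += 1
        self.max_event_time = max(self.max_event_time, event_time_ms)
        return self._flush_closed()

    def _flush_closed(self):
        closed = []
        for ws in sorted(self.open_windows):
            if ws + self.window_ms <= self.watermark():
                closed.append((ws, dict(self.open_windows.pop(ws))))
        return closed
```

In a real deployment this state would live in a fault-tolerant backend (e.g. RocksDB behind Flink) with durable checkpoints, so the aggregator can be restored and the stream replayed after a failure.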
Cover read patterns and database trade-offs: range queries and heavy aggregations favor columnar, time-series, or OLAP systems, while write-heavy counters fit wide-column stores with materialized aggregates. Finally, highlight non-functional considerations: partitioning, replication, backpressure, monitoring, and SLOs such as a 100 ms read-latency target. Demonstrating these skills shows you can reason about throughput, correctness, durability, and operational complexity.
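To make the materialized-aggregate idea concrete, here is a small Python sketch of the read path, assuming a hypothetical store that keeps pre-aggregated per-minute hashtag counts (the kind of view a stream job would maintain). A top-K query then merges a handful of minute buckets instead of scanning raw events, which is how a 100 ms read SLO stays feasible.

```python
from collections import Counter
import heapq

def top_k(buckets, start_minute, end_minute, k):
    """Return the k most frequent hashtags over [start_minute, end_minute].

    `buckets` is a hypothetical materialized view: minute -> {hashtag: count}.
    Merging pre-aggregated buckets is O(minutes * hashtags), independent of
    the raw tweet volume that produced them.
    """
    merged = Counter()
    for minute in range(start_minute, end_minute + 1):
        merged.update(buckets.get(minute, {}))
    return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])

# Example: two minute-buckets of pre-aggregated counts.
buckets = {0: {"#ai": 5, "#go": 2}, 1: {"#ai": 1, "#rust": 4}}
print(top_k(buckets, 0, 1, 2))
```

In a production system the same pattern is served by a real-time OLAP store (Druid/ClickHouse) or a Redis sorted set per window, with a cache in front for hot queries.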
Common Follow-up Questions
- How would you handle out-of-order and late-arriving tweets when computing time-windowed aggregates (explain watermarks and event-time windowing)?
- Design a near real-time top-K hashtags service: how do you compute heavy hitters at scale (discuss Count-Min Sketch, Top-K, and exact vs approximate trade-offs)?
- Compare storage choices for fast reads and large-scale aggregation: why choose Druid/ClickHouse vs Cassandra vs a TSDB for this use case?
- How would you ensure durability and no data loss in the face of broker or processor failures (discuss replication, checkpointing, and replay strategies)?
- What strategies would you use to meet a 100ms read-latency SLO for common queries while keeping costs reasonable (caching, pre-aggregation, and materialized views)?
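For the heavy-hitters follow-up, a Count-Min Sketch answers frequency queries in sublinear space at the cost of a bounded overestimate. The sketch below is a minimal illustrative Python implementation (the `width`/`depth` defaults and the use of salted BLAKE2 digests as the hash family are assumptions, not a reference design); estimates never undercount, because each counter can only be inflated by colliding items.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: depth rows of width counters.

    estimate() takes the minimum across rows, so it never underestimates;
    the overestimate shrinks as width grows relative to total insertions.
    """

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent hash per row, derived by salting BLAKE2b with the row id.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row, idx in self._indexes(item):
            self.table[row][idx] += count

    def estimate(self, item):
        return min(self.table[row][idx] for row, idx in self._indexes(item))
```

To serve top-K, such a sketch is typically paired with a small heap of candidate hashtags: on each insert, re-estimate the item's count and keep only the K largest, trading exactness for constant memory per window.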