Top-K in Time Window - Coding Question, Walmart Labs
Question Description
Problem summary
You must build a component that records events (item_id, timestamp) and answers frequency-based top‑k queries over a closed interval [t_start, t_end]. You will implement record and query APIs, then extend the design to support explicit eviction of old events, out‑of‑order (late) arrivals, and an optimized version when timestamps come from a bounded discrete domain.
What you'll be asked to do
You should show an implementation that maintains per‑timestamp or per‑bucket counts and can return up to k item IDs with the highest access counts in [t_start, t_end]. Expect to discuss data structures such as hash maps for counts, time buckets (sliding-window aggregates), and a min‑heap (priority queue) to compute top‑k efficiently. You should explain complexity tradeoffs for update and query paths and how tie‑breaking is handled.
Flow & extension stages
- Basic API: implement record(item_id, timestamp) and query(t_start, t_end, k) with correct closed‑interval semantics.
- Eviction: add evict_older_than(cutoff) to drop data < cutoff and document correctness guarantees for queries >= cutoff.
- Out‑of‑order events: introduce an allowed lateness parameter or accept full reaggregation; state whether late arrivals can retroactively change previous query answers.
- Bounded timestamps: if timestamps lie in a fixed range, switch to array/circular buffer buckets for O(1) eviction and lower memory overhead.
Skill signals interviewers look for
You should demonstrate: time/space complexity analysis, selection of appropriate aggregates vs raw logs, correctness under inclusive intervals, handling late data semantics (bounded lateness vs eventual consistency), and tradeoffs between exact and approximate top‑k (e.g., Space‑Saving / Count‑Min Sketch for high throughput).
Common Follow-up Questions
- •How would you change the design to support continuous streaming queries (always-on sliding window top-k) with low latency?
- •Describe an approximate variant using Space‑Saving or Count‑Min Sketch: how do you bound error and memory, and when is it acceptable?
- •How would you shard or partition the system to handle very high write throughput and still compute global top-k over a time window?
- •If events carry weights or durations (not just counts), how would you adapt recording, aggregation, and top-k computation?
Related Questions
Explore More Questions
Practice This Question with AI
Get real-time hints, detailed requirements, and insightful analysis of the question.