Problem. Given k event streams, each sorted by timestamp, merge them into a single globally sorted timeline. Each event is (timestamp, data). Break timestamp ties by original stream order.
Examples:
[[(1,"a1"),(4,"a2"),(7,"a3")],[(2,"b1"),(5,"b2")],[(3,"c1"),(6,"c2"),(8,"c3")]]→[(1,"a1"),(2,"b1"),(3,"c1"),(4,"a2"),(5,"b2"),(6,"c2"),(7,"a3"),(8,"c3")][[(1,"x")],[]]→[(1,"x")][[(1,"a"),(2,"b")],[(1,"c"),(2,"d")]]→[(1,"a"),(1,"c"),(2,"b"),(2,"d")](stream 0 before stream 1 on ties)
Solution. Same min-heap approach as merge-k-sorted, but keyed on (timestamp, stream_index) to break ties by stream order.
import heapq
def merge_streams(streams):
heap = []
for i, s in enumerate(streams):
if s:
ts, data = s[0]
heapq.heappush(heap, (ts, i, 0, data))
result = []
while heap:
ts, si, ei, data = heapq.heappop(heap)
result.append((ts, data))
if ei + 1 < len(streams[si]):
nts, ndata = streams[si][ei + 1]
heapq.heappush(heap, (nts, si, ei + 1, ndata))
return result
O(N log k) time, O(k) space.
Gotchas.
- Include
stream_indexin the heap tuple to get stable tie-breaking — otherwise Python will try to comparedatastrings, which may not be meaningful. - Empty streams: skip during initialisation.
- For real-time (online) merge, the heap naturally handles new events pushed as they arrive.
- If data is not comparable, use
(timestamp, stream_index, element_index)and store data separately.