Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search

Saturday, February 7, 2009

Finding the the more frequent symbol in a stream of numbers

You are given an infinite stream of numbers, find the symbol that appears more frequently. At any instant, you are interested to symbols that appears more than 50% of times.

Assuming that the symbols in the stream are from an infinite vocabulary (else you can just collect the frequencies of each symbol and pick the max), the idea is the following:

1) keep a map with 2 keys 2) process the next element N in the stream 3) if N is in the map then map[N]++ 4) if N is not in the map then replace the element with the lowest value in the map with map[N]=1 5) at any time you can return the value with the highest value in the map

To make the implementation easier you could use a 2 nodes max-heap to keep track of the element with highest value in the map.

The idea is that if at any time there has been an element that appeared more the 50% of the times in the stream, then this element will stay in the map with the highest value. If no element appeared more than 50% of the times so far then the result is meaningless.

Data stream is a very interesting topic. I recommend reading [Data Streams: Algorithms and Applications] by Muthukrishna

ReplyDeleteAssuming that the symbols in the stream are from an infinite vocabulary (else you can just collect the frequencies of each symbol and pick the max), the idea is the following:

ReplyDelete1) keep a map with 2 keys

2) process the next element N in the stream

3) if N is in the map then map[N]++

4) if N is not in the map then replace the element with the lowest value in the map with map[N]=1

5) at any time you can return the value with the highest value in the map

To make the implementation easier you could use a 2 nodes max-heap to keep track of the element with highest value in the map.

The idea is that if at any time there has been an element that appeared more the 50% of the times in the stream, then this element will stay in the map with the highest value. If no element appeared more than 50% of the times so far then the result is meaningless.