Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search

Tuesday, November 9, 2010

Repeating numbers

All numbers in an array repeat twice except two numbers which repeat only once. All the numbers are placed randomly. Find out efficiently the two numbers that have not repeated with O(1) extra memory.

Opposite problem is harder: you have a list with N unique numbers except one duplicate pair. Find the duplicate using only O(1) storage. (Hash table would be easy in practice, but that uses O(N) storage!)

XORig everything won't give you the two numbers... it will cancel out all the dupes, but leave you with A^B, not the individual A and B.

A multipass algorithm would work but I don't think it's optimal. Allocate a fixed table size, lets say 256. Stream the list if integers and fill the table with a histogram of the low order byte of each entry. Examine the table at the end and you'll find every entry is even except two.. the bins that have the unduplicated values. This gives you the lowest significant byte of the two values. (In 1 in 256 chance, they happen to have the same LSB, all values in the table are even and you can try with a higher byte.) So now we've reduced the problem down to 2/256 in size and we can repeat it again with the next higher byte, each time filtering out about 99% of the numbers, until we're down to just a handful of possibilities that can be examined with brute force.

Got it.. a better version of the histogram counter I posted. Basically use the binning approach again, creating say 256 entries. Run through the list, figure which bin each value goes into based on any hash you like. XOR the number with the bin's entry. At the end, all of the bins will be 0 except two, which hold the two numbers A and B. This CAN rarely fail if A & B collide in the same bin, in which case you'll have only 1 nonzero accumulator.. just repeat with a different hash function. Finally, if A or B is itself 0, it will look like a collision failure. Easy solution is if you have to repeat the test, accumulate (x+1) and not x so if a 0 caused the first failure, you're no longer using 0. (You can repeat again with another hash and x+2 in the unlikely event of a double failure.)

XOR all the numbers

ReplyDeleteOpposite problem is harder: you have a list with N unique numbers except one duplicate pair. Find the duplicate using only O(1) storage. (Hash table would be easy in practice, but that uses O(N) storage!)

ReplyDeleteXORig everything won't give you the two numbers... it will cancel out all the dupes, but leave you with A^B, not the individual A and B.

ReplyDeleteA multipass algorithm would work but I don't think it's optimal. Allocate a fixed table size, lets say 256. Stream the list if integers and fill the table with a histogram of the low order byte of each entry. Examine the table at the end and you'll find every entry is even except two.. the bins that have the unduplicated values. This gives you the lowest significant byte of the two values. (In 1 in 256 chance, they happen to have the same LSB, all values in the table are even and you can try with a higher byte.) So now we've reduced the problem down to 2/256 in size and we can repeat it again with the next higher byte, each time filtering out about 99% of the numbers, until we're down to just a handful of possibilities that can be examined with brute force.

Got it.. a better version of the histogram counter I posted. Basically use the binning approach again, creating say 256 entries. Run through the list, figure which bin each value goes into based on any hash you like. XOR the number with the bin's entry. At the end, all of the bins will be 0 except two, which hold the two numbers A and B. This CAN rarely fail if A & B collide in the same bin, in which case you'll have only 1 nonzero accumulator.. just repeat with a different hash function. Finally, if A or B is itself 0, it will look like a collision failure. Easy solution is if you have to repeat the test, accumulate (x+1) and not x so if a 0 caused the first failure, you're no longer using 0. (You can repeat again with another hash and x+2 in the unlikely event of a double failure.)

ReplyDelete