Antonio Gulli's coding playground: February 2009

Saturday, February 28, 2009

Viterbi Algorithm in Boost and C++

The Viterbi algorithm was conceived by Andrew Viterbi in 1967 as an error-correction scheme for noisy digital communication links. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a sequence of observed events.

In 2008, Viterbi won the Millennium Technology Prize "for the invention of the Viterbi algorithm, the key building element in modern wireless and digital communications systems, touching the lives of people everywhere.". It's interesting to note that in 1960s nobody could imagine a general application for the algorithm, so Viterbi followed his lawyer's advice and did not patent it. More information about Viterbi are given by this nice article by DrDobbs.

Here you have the Viterbi code in C++ adapted by the one of Božský Filip for using boost and a bit cleaner interface.

Friday, February 27, 2009

A beautiful solution to the ordered generation of hamming numbers

The hamming problem has an interesting variant. That is generating the numbers in ascending order (e.g. quoting wikipedia "regular numbers are often called Hamming numbers, after Richard Hamming, who proposed the problem of finding computer algorithms for generating these numbers in order.").

It turns out that a beautiful and compact solution is obtained by using 3 queues (Q2, Q3, Q5). The solution produces numbers in order and without duplicates. Time complexity is linear in the size output space. Hmm ... intuitively space should be constant in the output space, but still I need to think a bit about it... Here you have the C++ code for Hamming numbers in order and without duplicates. No need to explain the algorithm here since he code is self-explaining.

Thursday, February 26, 2009

Myriad virtues of (Generic) Patricia Trees

Patricia Trees is a interesting variant of tries with a large range of applications ranging from IP lookup, to Web query suggestion, to word completion and spelling suggestion on your mobile phone. Anytime you have a string prefix and need to complete it with all the known variants available in your pre-built dictionary, well .. in this cases, you may want to consider Patricia trees.

Here I found a nice C++ generic implementation of patricia trees. This should be added to Boost template library!

Old but gold: Wavelet transform (and image clustering)

This series of three articles from DDJ are old, but still very interesting. The Fast Wavelet Transform (FWT) introduces this O(N) transformation which improves "old" FFT O(N logN). A Wavelet Analyzer introduces the recursive computation of wavelet coefficients tree. Here one of the most impressive property of wavelets is expressed in all its power. Each level of the tree is a more detailed approximation of the original data. The Wavelet Packet Transform is an interesting evolution of FWT. If you want to get more information, Wikipedia has more information on wavelets.

In short, wavelet transform enables analysis of data at multiple levels of resolution. This is useful in many contexts. For instance, it has been succesufully apploed to Similarity detection and clustering of images for fast ranking and clustering of similar images.

Wednesday, February 25, 2009

Web search ranking and spamming as Nash equilibrium game

The paper Network Reputation Games by Hopcroft, J. and Sheldon, D. October 2008 gives a nice interpretation of link creation process and Pagerank score as a game with Nash equilibrium.

Generalizing the above, I wonder if Web spam and search ranking process can be jointly modeled as a game with Nash equilibrium. In other words, let's suppose that search engines make public their search ranking algorithms (ehmm) together with their demotion strategies for web spammers.

The benefit for search engines can be measured in terms of increase of market share. The benefit for content providers can be measure in terms of positions in the ten blue links or in terms of referral traffic generated by search engines. There is a non zero probability that a spammer is identified. In this case, its ranking is demoted and the referral traffic will have a drammatic decrease.

Under what conditions this is a game with a Nash equilibrium?

Tuesday, February 24, 2009

Betting with advisors

There are 3 men/advisors: one of them always says the truth, while the other two can lie. A man writes either 0 or 1 on a piece of paper. Then, he shows the number to the other men who will tell to you the number.

You are given a fixed amount of money M. If you bet x money and you win you will increase your richness of x units, otherwise you will decrease your richness of x units.

What is the maximum amount of money you can guarantee to win?

Web search ranking and GPU Nvidia CUDA many core computation

Many search ranking algorithms involve a multiplication between a [very large and sparse] matrix and a full vector (e.g. PageRank with power method) or can be re-casted into this type of problem (e.g. HITS, some of the new learning to rank methodologies, etc).

GPU computation performs parallel blocks based multiplication, where each block is assigned to a grid of threads running on separate core. I have 18 cores on my sony notebook (16 on GPU and 2 on CPU) and that is a lot of power!

GPGPU: CUDPP provides a library for parallel multiplication based on classical parallel prefix approach. I tested some code (adapted to run on windows vista by myself) and this approach looks quite promising;
IBM proposes a closed-source solution for sparse matrix - vector multiplication. Declared performance is interesting but I don't know the internals of their proposed approach;
This technical report from NVIDIA tackles the problem with an interesting mix of algorithmic and engineering solutions: N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, December 2008. [PDF] [Online].

In short, I believe that having low cost GPUs open a new exciting scenario for linear algebra computations. This model can make search engines save a lot of money.

Monday, February 23, 2009

CUDA and Sony vaio FZ21M configuration (part II)

Many people sent me mails asking information about how to run CUDA on a sony vaio FZ21M. Here you have the requested steps:

Install Visual C++ Express 2005. Cuda 2.0/2.1 needs express 2005 in order to compile. If you have Express 2008 you need to install 2005 as well;
Install CUDA drivers from http://www.laptopvideo2go.com/. I installed 185.20 Vista 32bit - 2008-12-26 version with Modded INF;
Install CUDA toolkit and CUDA SDK. I run v2.0 because CUDPP does not run on 2.1;

When you compile deviceQuery you get the result shown in the picture.

Sunday, February 22, 2009

Cuda: massive parallelism on my notebook sony vaio FZ21M

Cuda is a parallel computing architecture developed by NVIDIA. Nvidia GPUs have a parallel "many-core" architecture, each core capable of running thousands of threads simultaneously. I installed CUDA 2.1 on my Sony notebook Vaio FZ21M with Geoforce 8400M by using the modded drivers available at http://www.laptopvideo2go.com/

CUDA supports C, C++, Python, and Fortran. The paper Fast N-body Simulation with CUDA describes an a CUDA parallel program which runs 50 times as fast as a highly tuned serial implementation for the N-body simulation program. Wow! this is amazing. I still remeber my parallel implementation on Cray T3E, with 128 alpha processors. More than 10 years ago.

Well, now I have thousands of threads running on a multi-core GPU under my fingers (plus the dual-core CPU). Cuda documentation is pretty extensive. I am right now studying the CuBlas libraries for linear algebra and starting to code.

Here you have an introduction to cuda by the way of Dr. Dobbs

Saturday, February 21, 2009

Optimal prefix-suffix matches

You are given a finite set of string S = { s1, s2, .... sn } over a finite alphabet Σ. Two strings s1 and s2 have a prefix-suffix match if s2 has a prefix α which is a suffix of s1 (i.e. s1 = φα, s2 = αλ ). In this case the strings can be joined in a superstring φαλ and the value of this join is | α |. In this case s1 is tail-joined and s2 is head-joined. If a string is tail(head)-joined, it can not be tail(head)-joined again with other strings. Find the combination of prefix-suffix matches in S with maximum value.

Friday, February 20, 2009

Cheating wifes

Every man in a village of 100 married couples has cheated on his wife. Every wife in the village instantly knows when a man other than her husband has cheated, but does not know when her own husband has. The village has a law that does not allow for adultery. Any wife who can prove that her husband is unfaithful must kill him that very day. The women of the village would never disobey this law. One day, the queen of the village visits and announces that at least one husband has been unfaithful. What happens?

Thursday, February 19, 2009

Coin changes and dynamic programming

Coin Change is the problem of finding different ways of making changes for a particular amount of cents,

n

, having an infinite amount of coins available. For instance, how can you change 10 dollars having an infinite amount of nickels (1 cent), pennies (5 cents), quarter (10 cents), dimes (25 cents), one dollars?

Well this is the typical problem that you can solve by using dynamic programming.

Wednesday, February 18, 2009

Generating Web Graphs with RMAT and Eigen template c++ library

R-MAT: is a recursive model for graph mining. It has some useful properties:

On-line property. The number of nodes and edges can change with times;
Power law degree distribution property;
Small world property: diameter is much smaller than the order of the graph;
Many dense bipartite subgraphs. The number of distinct bipartite cliques is large when compared to a random graph.

RMAT recursively subdivides the adjacency matrix in four equal quadrants and selects one of the four quadrants with unequal probability (a, b, c or d). When the selected portion is a 1x1 cell, the corresponding direct edge is set.

Here you have the RMAT C++ code implemented with Eigen template library. As an example, I report the timings taken on my VMware virtual machine (guest: Linux Ubuntu with 1GB memory, host: Windows Vista Professional, core duo, 4Gb)

2^16 nodes, 8*2^16 edges; real 0m6.453s
2^18 nodes, 8*2^18 edges; real 0m53.253s
2^20 nodes, 8*2^20 edges; real 2m17.383s
2^21 nodes, 8*2^21 edges; real 7m26.611s

I believe that these performance can be further improved by adopting fixed-size matrices, with size defined at compile time. Well, running on a dedicated server and not a virtual machine may help as well. But this is not bad, the library is pretty good for rapid coding!

Eigen -- A generic matrix library for linear algebra computation

Eigen is an interesting library for Matrix computation in C++. It supports all the common operations for linear algebra with vectors, matrices, fixed-size (known at compile time) and dynamic (allocated at compile time), full or sparse. Coding with Eigen is pretty fast and intuitive. Implementing an algorithm on top of Eigen feels like just copying pseudocode. I think that it would be great to have it included in Boost. Benchmarks show that the resulting code is pretty fast and optimized at compile time.

I am currently coding a model for Web graph generation (see R-MAT: A recursive model for graph mining). I think that I will post the code pretty soon.

Tuesday, February 17, 2009

Overlapping rectangles

A set of rectangles has one border parallel to the x-axis of a Cartesian plan. Find the set of overlapping rectangles in optimal time.

Monday, February 16, 2009

Hamming numbers III

It is possible to generate the hamming numbers without duplicates. The key intuition it to generate the sets:

M2 = { h : h=2^l , l \in N, l < max }
M3 = { h : h=3^o , o \in N, o < max }
M5 = { h : h=5^p , p \in N, p < max }

Then we generate:

M23 = { i : i = h*k, h \in M2, k \in M3}
M25 = { i : i = h*k, h \in M2, k \in M5}
M35 = { i : i = h*k, h \in M3, k \in M5}
M235 = { i : i = h*k, h \in M23, k \in M5}

Here is the code. Let's have a look to the complexity in terms of analyzed numbers. Let P the maximum integer you can represent. Then, generating the sets M2, M3, M5 takes O(logP), since |M2| <= log2(P) , |M3| <= log3(P), |M5| <= log5(P). As a consequence, generating M23, M25, M35, M235 takes (log(P))^2. We are sub-linear in terms of input and cubic in terms of output.

Sunday, February 15, 2009

Hammin numbers II

One problem with the previous solution is the amount of duplicates we generate. For instance, starting from 2 we can generate 2x3=6. Starting from 3 we can generate 3x2=6. We need to remove these duplicates as soon as possible to avoid unneeded elements in H. A solution is to use set operations, which is "logarithmic in general, but amortized constant if t is inserted right after p". So the complexity of this solution is O(nlogn) in worst case, and O(n) in average since hamming numbers are sparse among large integers.


#include <set>
#include <iostream>
#include <limits.h>

int main(){

 typedef unsigned long long num;
 typedef std::set<num> Nums;
 typedef std::set<num>::const_iterator Nums_iterator;
 Nums nums;
 Nums_iterator it;

 nums.insert(1);

 for (it = nums.begin(); it != nums.end(); ++it){

   nums.insert(nums.end(), *it*2);
   nums.insert(nums.end(), *it*3);
   nums.insert(nums.end(), *it*5);
   std::cout << *it << std::endl;
 }
}

Saturday, February 14, 2009

Hamming numbers

The set of Hamming numbers H is defined recursively. 1 is in H. The numbers 2h, 3h, 5h are in H, if h is a number contained in H. The first few hamming numbers are 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27, 30, 32, 36, 40, 45, 48, 50, 54, 60, ... Dijkstra put the problem of computing efficiently the hamming numbers. What is your proposal?

Friday, February 13, 2009

Generic programming and out-of-the core computation

STXXL is a C++ library for extra large data sets. The library allows to write generic programs, with an idiom typical of STL, and to leverage the power of out-of-the-core computations for massive data-sets. I played with the library and I loved it.

You have the STXXL containers, which are similar to STL containers. Anyway, an STXXL container can be mapped to (multiple) external file(s) in a transparent way. Files may have different implementations that might be bases on various file systems or even remote storage interface. Examples of containers are:

vectors: example: stxxl::vector v(64 * int64(1024 * 1024));
map: example: stxxl::map
queue: example: stxxl:queue
stacks and priority queues

You can access the containers with STXXL Iterators and the library is taking care of all I/O operation, supporting large files up to dozens of terabytes even if they are mapped on multiple physical disks.

unsigned int big_size = 1024 * 1024 * 2 * 16 * 16;
typedef stxxl::vector vec_big;
vec_big my_vec(big_size);
vec_big::iterator big_it = my_vec.begin();

You can use a collection of STXXL Algorithms, similar to STL ones but for external memory:

stxxl::generate(v.begin(), v.end(), stxxl::random_number32(), 4);
stxxl::for_each_m(v.begin(), v.end(), square(), 4);
...

A particular mention goes to sorting algorithms for external memory:

typedef stxxl::vector vector_type;
const stxxl::int64 n_records =
stxxl::int64(384) * stxxl::int64(1024 * 1024) / sizeof(my_type);
vector_type v(n_records);
stxxl::random_number32 rnd;
for (vector_type::size_type i = 0; i < v.size(); i++)
v[i]._key = 1 + (rnd() % 0xfffffff);
stxxl::sort(v.begin(), v.end(), cmp(), memory_to_use);

This paper describes the library Stxxl : Standard Template Library for XXL Data Sets ; R. Dementiev, L. Kettner, P. Sanders "STXXL: standard template library for XXL data sets" Software: Practice and Experience Volume 38, Issue 6, Pages 589-637, May 2008 DOI: 10.1002/spe.844. See also the technical report.

Here you have a collection of example codes; performance are pretty good, and some stxxl users have reported that stxxl sort is much faster than UNIX sort. In addition, i found this paper which reports very good performance (Breadth first search on massive graphs). Do not forget to consult the STXXL forum if you need any help.

Economy does not look good.

Off-topic: one picture is enough to explain current situation with the economy.

Banks Merging and Acquisition (M&A)

Bank system is going crazy. One solution is to merge all of them together under a central authority. You have N banks at the beginning and they are combined into a single bank at the end. How many different configurations of merging are possible?

Thursday, February 12, 2009

Longest common substring, subsequences, increasing subsequences

String gurus are excited about long stuffs. It turns out that such theoretical researches have a lot of practical applications.

Longest Common Substring (LCS): it has applications in biology (gene analysis) and Web-IR (clustering, features' extraction) and many others. Given two strings S : | S | = n and T : | T | = m, finding the common substring has complexity O(nm) by using dynamic programming. In alternative, a generalized suffix tree (GST) can be built in linear time, if the size of the alphabet is constant. Once the GST is built, you can get the LCS by visiting GST bottom-up. You need to annotate each internal node with the strings that caused its creation. The visit is linear in the length of the strings.

Longest Common Subsequences (LCSseq): Substrings are made up of consecutive symbols, while subsequences are not. For instance with web clustering, you may want to "skip" some symbol. In this case, you have to use subsequences instead of substrings. If you have two documents, you may want to compare where they are different (have you ever used the diff program?). There are two interesting proprieties, which are useful for a dynamic programming solution with complexity O(nm)
- LCS(X_n, Y_m) = (LCS( X_n_-1, Y_m_-1), x_n), if the string X_n, Y_mboth ends with the symbol x_n
- LCS(X, Y) = the longest sequences of LCS(X_m _{– 1}, Y) and LCS(X, Y_n _{– 1}), this is useful when X_n, Y_mdo not end with the same symbol
- LCS({}, {}) = 0
Running time can be reduced by adopting memoization. Space can be reduced by using hashing.
Shortest Common Supersequence (SCS) is a common super-sequence of minimal length. Think about merging two documents. For two input sequences, an SCS can be formed from an LCS easily, by inserting the non-lcs symbols while preserving the symbol order

Longest Increasing Subsequence (LIS): Here symbols have an order (lexicographical, numerical, etc) and you want to find the longest common subsequence that respects this order. The longest increasing subsequence of a sequence S is the longest common subsequence of S and T, where T is the result of sorting of S. The algorithm is quadratic, but it is possible to give an O(nlogn) algorithm.

Wednesday, February 11, 2009

Magy! how to rank a web search result

Magy! engineers are trying to identify new methodologies for ranking web search results. After all, the academic literature has been evolved around three main ideas:

We saw my interesting developments, but no real break-trough in different directions. Do you have any additional idea to rank web search results?

Tuesday, February 10, 2009

Magy! search result clustering and the next search UI

Search UI is still the same since the beginning of search era. True: the 10 blue links model is a solid one, but is there any other innovation ?

Many people are thinking that blending is the next frontier, with champions such as Ask.com and Google. They mix and shake together results coming from different search domains, such as news, images, blogs, questions, products, and the like. The result is a sense of more variety, more freshness and, after all, a better user experience.

Personally, I do think that we need to try different UIs in response to different search queries. Back in 2000, I was a big fan of Northern Light folder's organization, which later on evolved into the Vivisimo's web hierarchies adopted by Clusty, and in a series or Academic reseach proposals about web snippet clustering.

I think that next UIs will be obtained (a) by leveraging cool graphical proposals and (b) by solving basic algorithm problems which apparently have no relationship with UI.

For instance, here it is one problem for our proud Magy! engineers.

Given an array of strings, find the longest common substring in optimal time

It turns out that this the solution of this problem can be very useful for web snippet clustering. Yet another example where theoretical researches may have interesting pratical applications.

PS: you may be interested to have a look to suffix array and suffix tree data structures for solving this particular problem.

Monday, February 9, 2009

Dropping Swarovski Crystals

A building of 100 floors, there is a threshold N s.t. if a Swarovski Crystal from Nth floor and above it will break. If it's dropped from any floor below, it will not break. Given 2 Crystals, find this N, and minimize the number of drops for the worse case.

Sunday, February 8, 2009

Maximum sub-array sum

Given an array of integers, find the sub-array of contiguous elements with maximal sum. In case of ties select the shortest sub-array.

Saturday, February 7, 2009

Finding the the more frequent symbol in a stream of numbers

You are given an infinite stream of numbers, find the symbol that appears more frequently. At any instant, you are interested to symbols that appears more than 50% of times.

Friday, February 6, 2009

Happy or not?

A community of K people can be happy or not happy. Can you enumerate and generate all the possible states?

Thursday, February 5, 2009

joining n segments

You are given n segments. In turn, you take each one of them and join it with one among those already selected, creating either a loop or a longer segment. How many loops there are in average, at every instant of time?

Wednesday, February 4, 2009

Covering a circle

You are given a circle and two players. When is his turn, each player will mark down a sign on cicle. Game stops when there is no more free space available on the circle. What is the winning strategy?

Tuesday, February 3, 2009

Find a number non in the sequence

You are given a sequence of m distinct numbers extracted by a set of 1...n numbers, with m < n. Can you give an efficient algorithm for identifying a number non in the sequence.

Monday, February 2, 2009

Boost::serialization for storing generic class

Boost serialization is a lovely collection for storing generic classes. Here you have example for serializing int, list of int, vectors of lists of floats (STL containers), and your own class.

Sunday, February 1, 2009

Maximum sub-array product

Given an array of integers, find the proper sub-array with maximum elements' product, in linear time.