Tuesday, March 31, 2009

Linear Algebra and Matrices in C++ and Boost

Quite frequently you have to deal with matrices, eigenvectors and linear solvers. Boost offers (at least) two different and complementary approaches:
  1. uBLAS, A template interface over the traditional set of Blas libraries. This is very useful for people who are used to Fortran Blas and who wants to use it with C++. uBLAS provides BLAS level 1, 2, 3 functionality for dense, packed and sparse matrice. Linear algebra is supported by interfacing directly with underlying blas libraries. Here an Overview of Matrix and Vector Operations. Other bindings to different libraries are also available.
  • Pros: vector and matrix types are defined by using templates; Blas is very efficient;
  • Cons: Personally, I find the Blas notation a little difficult to use.
  1. MTL4, another template interface with can use either Blas or its internal pure C++ numerical algorithms.
  • Pros: I found MTL quite intuitive: matrices have many useful type definitions, and the matrix operations are rather complete. Performance is quite good, altought the distribution site is not reporting an estensive testing;
  • Cons: MTL is not stable and is not part of standard boost distribution.
I would love to hear what is your favourite solution for Linear Algebra and Matrices in C++.

Monday, March 30, 2009

Decision Trees (part I)

Let's talk about our beloved Decision Trees (DTs). I have a very good friend who always suggests using a decision tree in every data mining task. After all, we use DTs every day in our real life.

What is a decision tree? nothing more that a long sequence of "if .. then else .. if then else .. if then .. else ... " ;-) This definition is not making justice to the power of DTs. The sequence of best choices is not pre-defined, but it is the result of a machine learning process on the training dataset. In DTs, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. All the binary structure of the DT is learnt automatically.

Training dataset is seen as a matrix, where each row is a record. For each row in the dataset, for each feature, the dataset is split in two parts A and B. An auxiliar score function (such as entropy or gini impurity) is used to compute the information gain for A, B. The feature obtaining the best information gain is used as split feature [ e.g. if (... feature_chosen ... ) then ... else ... ]. A node N is created containing the splitting feature. The left child and right child of N are built recursively by applying the same learning process on the subsets A and on B, respectively.
DTs can work with both numerical and categorical (e.g. strings) features.

Here you have the C++ code for building the DT (note that features are represented with boost::variant).

In a future posting, we will discuss about pruning DTs and machine learning classification using DTs.

Sunday, March 29, 2009

Linear Suffix Array

One article out of three thousands is beautiful. "Simple Linear Work Suffix Array Construction" (Juha Karkkainen and Peter Sanders, 2003) is a simple and elegant linear algorithm for building suffix array in linear time. Back in 2003, When I saw the article the very first time back, I thought: "This is a recursive diamond".

Suffix Arrays (SA) are a powerfull tool used in many different fields such as genetics, string matching, compression, and many others. A suffix array A of T is just a sorted array of suffixes of T. To avoid quadratic space we store only the indices of the suffixes, instead of full suffixes of T. Given SA(T), one can search for occurrences of P directly on the suffix array using binary search in O(|P| lg |T|) time. This can be improved to O(|P| + lg |T|).

SA were originally developed by Gene Myers (ex-Celera) and Udi Manber (now at Google). For more than a decade, the direct construction of suffix array was O(N^2). SA were built in linear time indirectly by building an intermediate suffix tree (which are expensive from the point of view of memory). SA is equivalent to an in-order traversal of the leaves of the suffix tree.

Here a succint description of the algorithm:
  1. Sort Σ, the alphabet of the T in linear time by using radix sort;
  2. Replace each letter in the text with its rank among the letters in the text;
  3. Divide the text T into 3 parts:
  • T0 = < (T[3i ], T[3i + 1], T[3i + 2]) , i>=0 . . . >
  • T1 = < (T[3i + 1], T[3i + 2], T[3i + 3]) , i=0>
  • T2 = < (T[3i + 2], T[3i + 3], T[3i + 4]) , i >= 0 >
  1. Recurse on the concatenation of T0 and T1. When this recursive call returns, we get T0 and T1 sorted in a suffix array. After that, we need is to sort the suffixes of T2, and merge them with the old suffixes to get suffixes of T. This is the point where the algorithm become a diamond: the construction of the suffixes of T2 and the linear merge.
  2. Note that by construction T2[i:] = T[3i+2:]= (T[3i+2], T[3i+3:]) = (T[3i+2], T0[i+1:]). Therefore we can have T2[i:] from the already built To[i+1]
  3. Merge the sorted suffixes of T0, T1, and T2. Remember that suffixes of T0 and T1 are already sorted, so comparing a suffix from T0 and a suffix from T1 takes constant time. To compare against a suffix from T2, we decompose it again to get a suffix from either T0 or T1;
Since this sorting and merging is done in linear time, we get a recursion formula T(n)=T(2/3n)+O(n), which gives linear time. You can prove it by substitution -- intuitively note that 2/3 C++ code for suffix array contained therein. It is neat, simple and instructive.

Saturday, March 28, 2009

N cooperative girls wearing an Hat

This is a classical and instructive puzzle. In a room there are N girls who wear either a Red or Black hat. The color is decided by tossing a coin.

Each girl cannot see her own hat, but she can only see hats worn by other girls. Girls cannot communicate each other. The team wins if at least one of them guesses correctly, and none of them guesses incorrectly. What is the team's best strategy? (In other words, who should talk and who should say in silence?)

Friday, March 27, 2009

A comment on this blog

manthrax@reddit "Man I like this dudes style. Short and sweet explanations/references of the algorithms and then an implementation in gods own C++. Respek! And thanks for posting!"

Genetic Programming

Computer Scientists like to borrow ideas from other disciplines! I already reviewed Simulated Annealing which has been inspired by Physic and atom movements. Now a quick discussion about Genetic Programming (and Evolutionary Computing), which has been inspired by Biological evolution and Genetics.

Genetic Programmig is another meta-heuristic for Optimization. The key idea is to find a (local) optimal solution by starting with a feasible population (a collection of non-optimal solutions). The initial population evolves in different iterations. Two key steps are used to evolve: a solution can mutated or crossovered. The best solution is selected among the evolutions of the initial population.

If you want to know more about Genetic programming and Evolutionary computing, there are thousands of books and papers. Personally, I like they unified approach presented in Evolutionary Computation by Kenneth A. De Jong, 2002. For those interested in Web search, let me point out that Genetic programming has been successfully used for Learning to rank (see this "optimizing evaluation measures in learning to rank" and the references within)

Here you have the code for Genetic Programming in C++

Thursday, March 26, 2009

Good guys and bad guys

You are given a set G of n good guys and a set B of m bad guys. Whenever the a good guy meet a good guy, happiness increase and nothing changes. With probability p the bad guy kills the good guy, if they get in touch (while with probability 1-p, the good guy will survive). If two bad guys meet each other, they will not survive.

Question is: who will survive, and with what probability?

Wednesday, March 25, 2009

Simulated Annealing

Simulated Annealing (SA) is a smart (meta)-heuristic for Optimization. Given a cost function in a large search space, SA replaces the current solution by a random "nearby" solution. The nearby solution is chosen with a probability that depends on the difference between the corresponding function values and on a global parameter T (a.k.a the temperature). T is gradually decreased during the process. The current solution changes almost randomly when T is large, but increasingly "downhill" as T goes to zero. The allowance for "uphill" moves saves the method from becoming stuck at local minima.

This approach has some similitude with Physic, where the heat causes the atoms to become unstuck from their initial positions and wander randomly through states of higher energy; the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one.

Simulated Annealing is also used for many different data mining tasks such as Classification, Clustering, and Web search Ranking.

Here you have the code for Simulated Annealing in C++

Tuesday, March 24, 2009

Book review: Introduction to Video Search Engines

Introduction to Video Search Engines by David C. Gibbon and Zhu Liu (2008) it's a nice book about video search engines by AT&T people.

I recommend the book if you want an introduction to the Video search world. The book discusses important topics such as video standards, internet video, video classification by sources and by topics, the anatomy of a video search engine, and many others. Particularly interesting are the four chapters about Media Processing, Video Processing, Audio Processing and Text Processing.

In Video Processing, I appreciated the discussions about shot boundary determination and representative image selection with evaluation based on TrecVid dataset. Other interesting topics are Face detection, which can be used for ranking. and Face recognition.

In Audio Processing, I appreciated the discussions about Speaker segmentation and Speaker recognition.

Text Processing, is a more standard chapter with discussions about Named Entities recognition and Part of Speech tagging.

I would have appreciated a more detailed discussion about Video Search Ranking which would have been appreciated by Seo Community and a more detailed discussion of Video Search anatomy.

Anyway, I definitively recommend reading this book, if you you are interested in Video Search

Leaders' competition

Leaders have hard times these days: expectations are high and hope is great. This is a little puzzle, with a lesson inside: a cooperative strategy is important.

You have a list of N Leaders, ordered from the most important to the less important. Each Leader comes with a proposal about how to partition the money among different projects proposed by other leaders. Here are the rules of the game:
  1. Leaders make proposals in order of their importance;
  2. The proposal of the leader i are voted by leaders i, i + 1 , .... N;
  3. A proposal is accepted if it gets the majority of the voting leaders
  4. If a proposal is rejected than the Leader i who proposed it is no longer a Leader, and the proposal of Leader i+1 is therefore voted;
Suppose you are the Leader 0 (e.g. the most important one), what is your cooperative strategy for getting the other Leaders voting your proposal? What if you are Leader i-th can you have your own plan approved?

Monday, March 23, 2009

Sparse Sets with O(1) insert, delete operations and (sub)linear union, intersection

The paper An efficient representation for sparse sets (1993) by Preston Briggs, Linda Torczon, introduces an efficient data structure for representing a sparse static set (with u elements inserted out of a maximum of n elements). Insert and delete operations have a constant time complexity, while union and intersection are (sub)-linear in n. The space complexity is linear in n. The table on the right side compares the complexity of this data structure with a traditional bit vector.

The key intuition is to maintain two vectors such that dense[sparse[k]] = k, and a variable which maintains the number of elements in the sparse set. sparse[k] contains the number of elements in the set, when element k is inserted. In this way two elements inserted consecutively are mapped into contiguous positions of the dense vector. As a consequence, insert and delete operations are constant, while intersection and union operations are sub-linear.

This is the code for Sparse Sets representation in C++.

Sunday, March 22, 2009

Introsort:: mixing two words quicksort and heapsort

Introsort is a simple and effective algorithm which guarantes a sort as fast as Quicksort, but with Θ(N lg N ) worst case. It's well known that Quicksort has O(N lg N ) average complexity and O(N ^ 2) in worst case. This is true even for randomized variant of quicksort, or for the median of 3 pivoting strategy. Anyway, Quicksort is 2-5 time faster than Heapsort, which has Θ(N lg N) proven worst case.

In 1997, proposed to start with Quicksort and switch to Heapsort when the recursion depth is O(logN). This simple variant makes the Introsort algorithm Θ(N lg N ) worst case. Introsort is currently adopted as standard unstable sort by the Standard Template Library in stl_algo.h
"Introspective Sorting and Selection Algorithms". Software: Practice and Experience (Wiley) 27 (8): 983–993.

Saturday, March 21, 2009

An array of integers sorted and shuffled

You are given a sorted array A of N integers. For each integer i, a random number j is extracted and A[i], A[j] are swapped. When this shuffling is complete:
  • What is the probability that exist k such that A[k] = k?
  • What is the probability the A is still sorted?

Friday, March 20, 2009

The beauty of linear k-order selection

One out three thousands papers is a beautiful paper. I believe that "Time bounds for selection" (M. Blum, R.W. Floyd, V. Pratt, R. Rivest and R. Tarjan, J. Comput. System Sci. 7 (1973) 448-461) is an old and beautiful pearl.

It's well know that K-selection is a problem that can be solved in O(N logN) average time by using an approach similar to the one of QuickSort . Anyway, the worst-case is quadratic even when you adopt a randomization step (here you have the C++ code for Quickselect).

The above paper introduces a beautiful variant of classical selection. The key idea is to partition the input data array A in groups of five elements. Then:
  1. Within each group G[i], find the median(G[i]) in constant time by direct comparison of the 5 elements; This step is at most linear;
  2. Take the median of each group and find M, the median of medians (median(median(G[0], .... G[floor(n/5)+1]) by a recursive call; This steps takes T(n/5);
  3. Use M to partition the input by calling the quickselect algorithm recursively; This step takes at most T(7n/10) -- see below;
At each step of the recursive call, QuickSelect drops either the elements greater than M (call this GTM) or the ones less than M (call this LTM). Let's assume we maintain the elements less than M. How many are they?
  • Among the n/5 medians, n/10 are larger then M, being M the median;
  • For each median, at least two other elements in A are larger than the median by construction;
  • Therefore, GTM is at least 3n/10;
  • For symmetry, LTM is at least 3n/10;
As a consequence the recursive call to QuickSelect takes at most T(7n/10);

The total cost is therefore T(n) <= Θ(N) + T(n/5) + T(7n/10)

This recursion equation is linear (it can be proven by substitution method). The intuition is that 1/5 + 7/10 = 9/10 and therefore the geometric series is converging.

Thursday, March 19, 2009

An array of n bits, again

I love simple problems on simplest data structure: the infamous array of N bits. The linear scan of the array is O(N) and everyone knows this. Here I am asking: what is the probability of finding in position K-th the first bit set to 1 and all the preceding bits set to 0?

Now, Suppose that two sources are sending a stream of packets on the same wire and that they are able to detect collisions. After K failures, they suspend the transmission for random number of milliseconds in [0, 2^K-1]. What is the probability of having another failure?

Wednesday, March 18, 2009

Adaboost : improve your weak performance

Adaboost is one of my favorite Machine Learning algorithm. The idea is quite intriguing: You start from a set of weak classifiers and learn how to linearly combine them so that the error is reduced. The result is a strong classifier built by boosting the weak classifiers.

Mathematically, given a set of labels y_i = { +1, -1 } and a training dataset x_i, Adaboost minimizes the exponential loss function sum_i (exp - (y_i * f(x_i)) }. The function f = sum_t (alpha_t h_t) is a linear combination of the h_t classifier with weight alpha_t. For those loving Optimization Theory , Adaboost is a classical application of Gradient Descend.
The algorithm is quite simple and has been included in the top 10 data mining algorithms in 2007 and the Gödel prize in 2003. After Adaboost, Boosting become quite popular in the data mining community with application in Ranking and Clustering.

Here you have the code for AdaBoosting in C++ and Boost.

Beautiful Algorithmic Books

A beautiful course on-line from MIT

Tuesday, March 17, 2009

Fast String Hashing

Gio introduced me to the magic of FNV Hash a special kind of hash function well suited for hashing strings. Operations used for FVN hashing are bit-shift and xor. So they are pretty fast and guarantee to have few collision. No code this time, since the above web page has it already.

A comprehensive list of hash function is available here.

Update: Gio introduce me to the ultra-fast HSIEH string hash functions, which outperforms FNV hash. Again the code is in the page.

Serializing a Generic Tree

Boost Serialization offers a flexible tool for generic serialization. What if you can serialize a whole tree T, with something as simple as
oa << root 
and library serializes single node in T regardless tree shape and the type of each generic. This code explain how to serialize a generic tree. (Thanks gio for fixing that nasty bug!)

Monday, March 16, 2009

Segment covering

Given several segments of line with coordinates [Li,Ri], choose the minimal amount of them, such they completely cover the segment [0,M].