The Viterbi algorithm was conceived by Andrew Viterbi in 1967 as an error-correction scheme for noisy digital communication links. It is a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a sequence of observed events.
In 2008, Viterbi won the Millennium Technology Prize "for the invention of the Viterbi algorithm, the key building element in modern wireless and digital communications systems, touching the lives of people everywhere." It's interesting to note that in the 1960s nobody could imagine a general application for the algorithm, so Viterbi followed his lawyer's advice and did not patent it. More information about Viterbi is given in this nice Dr. Dobb's article.
Here you have the Viterbi code in C++, adapted from the one by Božský Filip to use Boost and with a slightly cleaner interface.
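If you just want to see the dynamic program itself, here is a minimal sketch of mine (not the linked Boost version): it assumes row-stochastic tables indexed as trans[from][to] and emit[state][symbol], keeps the probability of the best state sequence ending in each hidden state, and then backtracks through the argmax choices.

```cpp
#include <cstddef>
#include <vector>

// Minimal Viterbi: returns the most likely hidden-state path for `obs`.
// Assumes at least one observation and probabilities in plain (non-log) form.
std::vector<int> viterbi(const std::vector<int>& obs,
                         const std::vector<double>& start,               // P(state)
                         const std::vector<std::vector<double>>& trans,  // P(next | state)
                         const std::vector<std::vector<double>>& emit) { // P(symbol | state)
    const std::size_t S = start.size(), T = obs.size();
    std::vector<std::vector<double>> v(T, std::vector<double>(S, 0.0));
    std::vector<std::vector<int>> back(T, std::vector<int>(S, 0));
    for (std::size_t s = 0; s < S; ++s)
        v[0][s] = start[s] * emit[s][obs[0]];
    for (std::size_t t = 1; t < T; ++t)
        for (std::size_t s = 0; s < S; ++s)
            for (std::size_t p = 0; p < S; ++p) {
                // best path ending in state p at t-1, extended to state s
                double cand = v[t-1][p] * trans[p][s] * emit[s][obs[t]];
                if (cand > v[t][s]) { v[t][s] = cand; back[t][s] = (int)p; }
            }
    // backtrack from the most likely final state
    std::vector<int> path(T);
    int best = 0;
    for (std::size_t s = 1; s < S; ++s)
        if (v[T-1][s] > v[T-1][best]) best = (int)s;
    path[T-1] = best;
    for (std::size_t t = T; t-- > 1; ) path[t-1] = back[t][path[t]];
    return path;
}
```

For real sequences you would work in log space to avoid underflow; the structure of the recursion is unchanged.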
Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
Saturday, February 28, 2009
Friday, February 27, 2009
A beautiful solution to the ordered generation of Hamming numbers
The Hamming problem has an interesting variant: generating the numbers in ascending order (e.g., quoting Wikipedia, "regular numbers are often called Hamming numbers, after Richard Hamming, who proposed the problem of finding computer algorithms for generating these numbers in order.").
It turns out that a beautiful and compact solution is obtained by using 3 queues (Q2, Q3, Q5). The solution produces the numbers in order and without duplicates. Time complexity is linear in the size of the output. Hmm ... intuitively space should be constant in the output size, but still I need to think a bit about it... Here you have the C++ code for Hamming numbers in order and without duplicates. No need to explain the algorithm here, since the code is self-explanatory.
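A sketch of the 3-queue idea (my own version, not the linked code): the head of each queue is the next candidate, and each popped value seeds successors only into queues of equal or larger prime factor, which is exactly what prevents duplicates.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// First n Hamming numbers, ascending, no duplicates, via queues Q2, Q3, Q5.
std::vector<uint64_t> hamming(std::size_t n) {
    std::vector<uint64_t> out{1};
    std::deque<uint64_t> q2{2}, q3{3}, q5{5};
    while (out.size() < n) {
        uint64_t x = std::min({q2.front(), q3.front(), q5.front()});
        out.push_back(x);
        // Successors go only "rightwards": a value popped from Q3 never
        // pushes into Q2, so each Hamming number is generated exactly once.
        if (x == q2.front()) { q2.pop_front(); q2.push_back(2*x); q3.push_back(3*x); q5.push_back(5*x); }
        if (x == q3.front()) { q3.pop_front(); q3.push_back(3*x); q5.push_back(5*x); }
        if (x == q5.front()) { q5.pop_front(); q5.push_back(5*x); }
    }
    return out;
}
```

Each queue stays sorted because values are pushed in increasing order of x, so a plain min over the three fronts suffices.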
Thursday, February 26, 2009
Myriad virtues of (Generic) Patricia Trees
Patricia trees are an interesting variant of tries with a large range of applications, ranging from IP lookup, to Web query suggestion, to word completion and spelling suggestion on your mobile phone. Anytime you have a string prefix and need to complete it with all the known variants available in your pre-built dictionary, well .. in these cases, you may want to consider Patricia trees.
Here I found a nice C++ generic implementation of patricia trees. This should be added to Boost template library!
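To make the completion query concrete, here is a tiny baseline of mine using std::set::lower_bound (not a Patricia tree: a trie or Patricia tree answers the same query in O(|prefix| + k) instead of O(log n + k)).

```cpp
#include <set>
#include <string>
#include <vector>

// All dictionary words starting with `prefix`. lower_bound jumps to the
// first candidate; we then scan forward while the prefix still matches.
std::vector<std::string> complete(const std::set<std::string>& dict,
                                  const std::string& prefix) {
    std::vector<std::string> out;
    for (auto it = dict.lower_bound(prefix);
         it != dict.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        out.push_back(*it);
    return out;
}
```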
Old but gold: Wavelet transform (and image clustering)
This series of three articles from DDJ is old, but still very interesting. The Fast Wavelet Transform (FWT) introduces this O(N) transformation, which improves on the "old" O(N log N) FFT. A Wavelet Analyzer introduces the recursive computation of the wavelet coefficient tree. Here one of the most impressive properties of wavelets is expressed in all its power: each level of the tree is a more detailed approximation of the original data. The Wavelet Packet Transform is an interesting evolution of the FWT. If you want to know more, Wikipedia has more information on wavelets.
In short, the wavelet transform enables analysis of data at multiple levels of resolution. This is useful in many contexts. For instance, it has been successfully applied to similarity detection and clustering of images, for fast ranking and clustering of similar images.
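A one-function sketch of the Haar FWT (my own minimal version, assuming the input length is a power of two): each pass splits the current approximation into half-length averages (the coarser level) and half-length differences (the detail coefficients), which is exactly the recursive coefficient tree mentioned above. Total work is n + n/2 + n/4 + ... = O(n).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place Haar fast wavelet transform of a length-2^k signal.
void haar_fwt(std::vector<double>& v) {
    const double s = std::sqrt(2.0);
    for (std::size_t len = v.size(); len > 1; len /= 2) {
        std::vector<double> tmp(len);
        for (std::size_t i = 0; i < len / 2; ++i) {
            tmp[i]         = (v[2*i] + v[2*i+1]) / s;  // approximation (averages)
            tmp[len/2 + i] = (v[2*i] - v[2*i+1]) / s;  // detail (differences)
        }
        std::copy(tmp.begin(), tmp.end(), v.begin());
    }
}
```

The 1/sqrt(2) normalization keeps the transform orthonormal, so energy is preserved across levels.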
Wednesday, February 25, 2009
Web search ranking and spamming as Nash equilibrium game
The paper Network Reputation Games by Hopcroft, J. and Sheldon, D., October 2008, gives a nice interpretation of the link creation process and the PageRank score as a game with a Nash equilibrium.
Generalizing the above, I wonder if the Web spam and search ranking processes can be jointly modeled as a game with a Nash equilibrium. In other words, let's suppose that search engines make public their search ranking algorithms (ehmm) together with their demotion strategies for web spammers.
The benefit for search engines can be measured in terms of increased market share. The benefit for content providers can be measured in terms of position among the ten blue links or in terms of referral traffic generated by search engines. There is a non-zero probability that a spammer is identified; in this case, its ranking is demoted and the referral traffic decreases dramatically.
Under what conditions is this a game with a Nash equilibrium?
Tuesday, February 24, 2009
Betting with advisors
There are 3 men/advisors: one of them always tells the truth, while the other two can lie. A man writes either 0 or 1 on a piece of paper. Then he shows the number to the advisors, who will tell you the number.
You are given a fixed amount of money M. If you bet x and you win, you increase your wealth by x units; otherwise you decrease your wealth by x units.
What is the maximum amount of money you can guarantee to win?
Web search ranking and NVIDIA CUDA many-core GPU computation
Many search ranking algorithms involve a multiplication between a [very large and sparse] matrix and a dense vector (e.g. PageRank with the power method) or can be recast into this type of problem (e.g. HITS, some of the new learning-to-rank methodologies, etc.).
GPU computation performs parallel block-based multiplication, where each block is assigned to a grid of threads running on a separate core. I have 18 cores on my Sony notebook (16 on the GPU and 2 on the CPU), and that is a lot of power!
- GPGPU: CUDPP provides a library for parallel multiplication based on the classical parallel prefix approach. I tested some code (adapted by myself to run on Windows Vista) and this approach looks quite promising;
- IBM proposes a closed-source solution for sparse matrix-vector multiplication. The declared performance is interesting, but I don't know the internals of their approach;
- This technical report from NVIDIA tackles the problem with an interesting mix of algorithmic and engineering solutions: N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, December 2008. [PDF] [Online].
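For reference, here is the serial CSR (compressed sparse row) kernel that all of these GPU approaches parallelize (a sketch of mine, not code from any of the cited libraries); in the simplest CUDA schemes, one thread or one warp is assigned per row.

```cpp
#include <cstddef>
#include <vector>

// CSR storage: row_ptr[r]..row_ptr[r+1] delimit the non-zeros of row r.
struct CSR {
    std::vector<int>    row_ptr;  // size rows+1
    std::vector<int>    col;      // column index per non-zero
    std::vector<double> val;      // value per non-zero
};

// y = A * x, one dot product of stored non-zeros per row.
std::vector<double> spmv(const CSR& A, const std::vector<double>& x) {
    std::vector<double> y(A.row_ptr.size() - 1, 0.0);
    for (std::size_t r = 0; r + 1 < A.row_ptr.size(); ++r)
        for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            y[r] += A.val[k] * x[A.col[k]];
    return y;
}
```

In a PageRank power iteration this kernel is simply applied repeatedly: x = spmv(A, x) with a normalization step in between.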
Monday, February 23, 2009
CUDA and Sony Vaio FZ21M configuration (part II)
Many people sent me mails asking how to run CUDA on a Sony Vaio FZ21M. Here are the requested steps:
- Install Visual C++ Express 2005. CUDA 2.0/2.1 needs Express 2005 in order to compile. If you have Express 2008, you need to install 2005 as well;
- Install CUDA drivers from http://www.laptopvideo2go.com/. I installed 185.20 Vista 32bit - 2008-12-26 version with Modded INF;
- Install CUDA toolkit and CUDA SDK. I run v2.0 because CUDPP does not run on 2.1;
Sunday, February 22, 2009
CUDA: massive parallelism on my Sony Vaio FZ21M notebook
CUDA is a parallel computing architecture developed by NVIDIA. NVIDIA GPUs have a parallel "many-core" architecture, capable of running thousands of threads simultaneously across the cores. I installed CUDA 2.1 on my Sony Vaio FZ21M notebook with a GeForce 8400M by using the modded drivers available at http://www.laptopvideo2go.com/
CUDA supports C, C++, Python, and Fortran. The paper Fast N-body Simulation with CUDA describes a CUDA parallel program which runs 50 times as fast as a highly tuned serial implementation of the N-body simulation. Wow! This is amazing. I still remember my parallel implementation on a Cray T3E with 128 Alpha processors, more than 10 years ago.
Well, now I have thousands of threads running on a many-core GPU under my fingers (plus the dual-core CPU). The CUDA documentation is pretty extensive. I am right now studying the CUBLAS libraries for linear algebra and starting to code.
Here you have an introduction to CUDA by way of Dr. Dobb's.
Saturday, February 21, 2009
Optimal prefix-suffix matches
You are given a finite set of strings S = { s1, s2, ..., sn } over a finite alphabet Σ. Two strings s1 and s2 have a prefix-suffix match if s2 has a prefix α which is a suffix of s1 (i.e. s1 = φα, s2 = αλ). In this case the strings can be joined into a superstring φαλ, and the value of this join is |α|; here s1 is tail-joined and s2 is head-joined. Once a string is tail(head)-joined, it cannot be tail(head)-joined again with other strings. Find the combination of prefix-suffix matches in S with maximum value.
Friday, February 20, 2009
Cheating wives
Every man in a village of 100 married couples has cheated on his wife. Every wife in the village instantly knows when a man other than her husband has cheated, but does not know when her own husband has. The village has a law that does not allow for adultery. Any wife who can prove that her husband is unfaithful must kill him that very day. The women of the village would never disobey this law. One day, the queen of the village visits and announces that at least one husband has been unfaithful. What happens?
Thursday, February 19, 2009
Coin changes and dynamic programming
Coin Change is the problem of counting the different ways of making change for a particular amount of cents, n, having an infinite supply of coins available. For instance, how can you change 10 dollars with an infinite supply of pennies (1 cent), nickels (5 cents), dimes (10 cents), quarters (25 cents), and one-dollar coins?
Well this is the typical problem that you can solve by using dynamic programming.
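A minimal sketch of the dynamic program: ways[a] counts the combinations summing to a cents, and iterating over coins in the outer loop counts each multiset of coins exactly once (order does not matter).

```cpp
#include <cstdint>
#include <vector>

// Number of ways to make `amount` cents from an unlimited supply of `coins`.
uint64_t count_change(int amount, const std::vector<int>& coins) {
    std::vector<uint64_t> ways(amount + 1, 0);
    ways[0] = 1;  // one way to make 0: use no coins
    for (int c : coins)
        for (int a = c; a <= amount; ++a)
            ways[a] += ways[a - c];  // extend every way of making a-c with coin c
    return ways[amount];
}
```

Swapping the loops would count ordered sequences of coins instead of combinations, a classic pitfall with this recurrence.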
Wednesday, February 18, 2009
Generating Web Graphs with RMAT and Eigen template c++ library
R-MAT is a recursive model for graph mining. It has some useful properties:
- On-line property: the number of nodes and edges can change over time;
- Power law degree distribution property;
- Small world property: diameter is much smaller than the order of the graph;
- Many dense bipartite subgraphs. The number of distinct bipartite cliques is large when compared to a random graph.
Here you have the RMAT C++ code implemented with the Eigen template library. As an example, I report the timings taken on my VMware virtual machine (guest: Linux Ubuntu with 1GB memory; host: Windows Vista Professional, Core Duo, 4GB):
- 2^16 nodes, 8*2^16 edges; real 0m6.453s
- 2^18 nodes, 8*2^18 edges; real 0m53.253s
- 2^20 nodes, 8*2^20 edges; real 2m17.383s
- 2^21 nodes, 8*2^21 edges; real 7m26.611s
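The core of the generator can be sketched as follows (my own minimal version, not the Eigen-based code; the probability values a, b, c, d below are commonly used defaults, an assumption on my part). Each edge is drawn by recursively descending into one of the four quadrants of the adjacency matrix; the skew toward one quadrant is what produces the power-law degree distribution.

```cpp
#include <cstdint>
#include <random>
#include <utility>

// One R-MAT edge in a 2^scale x 2^scale adjacency matrix. At each of the
// `scale` recursion levels we pick a quadrant with probabilities a, b, c,
// 1-a-b-c and set one bit of the source/destination ids accordingly.
std::pair<uint64_t, uint64_t> rmat_edge(int scale, std::mt19937_64& rng,
                                        double a = 0.57, double b = 0.19,
                                        double c = 0.19) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    uint64_t src = 0, dst = 0;
    for (int bit = 0; bit < scale; ++bit) {
        double r = U(rng);
        if      (r < a)         { /* top-left quadrant: no bits set */ }
        else if (r < a + b)     { dst |= 1ULL << bit; }                     // top-right
        else if (r < a + b + c) { src |= 1ULL << bit; }                     // bottom-left
        else                    { src |= 1ULL << bit; dst |= 1ULL << bit; } // bottom-right
    }
    return {src, dst};
}
```

Calling this 8 * 2^scale times reproduces the node/edge counts used in the timings above.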
Eigen -- A generic matrix library for linear algebra computation
Eigen is an interesting library for matrix computation in C++. It supports all the common operations for linear algebra with vectors and matrices, fixed-size (known at compile time) and dynamic (allocated at run time), full or sparse. Coding with Eigen is pretty fast and intuitive. Implementing an algorithm on top of Eigen feels like just copying pseudocode. I think that it would be great to have it included in Boost. Benchmarks show that the resulting code is pretty fast and optimized at compile time.
I am currently coding a model for Web graph generation (see R-MAT: A recursive model for graph mining). I think that I will post the code pretty soon.
Tuesday, February 17, 2009
Overlapping rectangles
A set of rectangles has one side parallel to the x-axis of a Cartesian plane. Find the set of overlapping rectangles in optimal time.
Monday, February 16, 2009
Hamming numbers III
It is possible to generate the Hamming numbers without duplicates. The key intuition is to generate the sets:
- M2 = { h : h=2^l , l \in N, l < max }
- M3 = { h : h=3^o , o \in N, o < max }
- M5 = { h : h=5^p , p \in N, p < max }
- M23 = { i : i = h*k, h \in M2, k \in M3}
- M25 = { i : i = h*k, h \in M2, k \in M5}
- M35 = { i : i = h*k, h \in M3, k \in M5}
- M235 = { i : i = h*k, h \in M23, k \in M5}
Sunday, February 15, 2009
Hamming numbers II
One problem with the previous solution is the amount of duplicates we generate. For instance, starting from 2 we can generate 2x3=6; starting from 3 we can generate 3x2=6. We need to remove these duplicates as soon as possible to avoid unneeded elements in H. A solution is to use set insertion with a hint, which is "logarithmic in general, but amortized constant if t is inserted right after p". So the complexity of this solution is O(n log n) in the worst case, and O(n) on average, since Hamming numbers are sparse among large integers.
#include <iostream>
#include <limits.h>
#include <set>

int main() {
    typedef unsigned long long num;
    typedef std::set<num> Nums;
    typedef Nums::const_iterator Nums_iterator;

    Nums nums;
    nums.insert(1);
    // Inserting while iterating is safe: std::set insertions never
    // invalidate existing iterators, and *it*2, *it*3, *it*5 always land
    // after the current position, so the scan visits them in order.
    for (Nums_iterator it = nums.begin(); it != nums.end(); ++it) {
        if (*it <= ULLONG_MAX / 5) {  // guard against overflow
            nums.insert(nums.end(), *it * 2);
            nums.insert(nums.end(), *it * 3);
            nums.insert(nums.end(), *it * 5);
        }
        std::cout << *it << std::endl;
    }
}
Saturday, February 14, 2009
Hamming numbers
The set of Hamming numbers H is defined recursively. 1 is in H. The numbers 2h, 3h, 5h are in H, if h is a number contained in H. The first few Hamming numbers are 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27, 30, 32, 36, 40, 45, 48, 50, 54, 60, ... Dijkstra posed the problem of computing the Hamming numbers efficiently. What is your proposal?
Friday, February 13, 2009
Generic programming and out-of-the core computation
STXXL is a C++ library for extra large data sets. The library lets you write generic programs, with an idiom typical of the STL, and leverage the power of out-of-core computation for massive data sets. I played with the library and I loved it.
- You have the STXXL containers, which are similar to STL containers. However, an STXXL container can be mapped to (multiple) external file(s) in a transparent way. Files may have different implementations, based on various file systems or even a remote storage interface. Examples of containers are:
- vector: example: stxxl::vector<int64> v(64 * int64(1024 * 1024));
- map: example: stxxl::map
- queue: example: stxxl::queue
- stacks and priority queues
- You can access the containers with STXXL iterators, and the library takes care of all I/O operations, supporting large files up to dozens of terabytes even if they are mapped onto multiple physical disks:
- unsigned int big_size = 1024 * 1024 * 2 * 16 * 16;
typedef stxxl::vector<int> vec_big;
vec_big my_vec(big_size);
vec_big::iterator big_it = my_vec.begin();
- You can use a collection of STXXL algorithms, similar to the STL ones but for external memory:
- stxxl::generate(v.begin(), v.end(), stxxl::random_number32(), 4);
- stxxl::for_each_m(v.begin(), v.end(), square<int>(), 4);
- ...
- A particular mention goes to the sorting algorithms for external memory:
- typedef stxxl::vector<my_type> vector_type;
const stxxl::int64 n_records =
    stxxl::int64(384) * stxxl::int64(1024 * 1024) / sizeof(my_type);
vector_type v(n_records);
stxxl::random_number32 rnd;
for (vector_type::size_type i = 0; i < v.size(); i++)
    v[i]._key = 1 + (rnd() % 0xfffffff);
stxxl::sort(v.begin(), v.end(), cmp(), memory_to_use);
- The library is described in R. Dementiev, L. Kettner, P. Sanders, "STXXL: standard template library for XXL data sets", Software: Practice and Experience, Volume 38, Issue 6, Pages 589-637, May 2008, DOI: 10.1002/spe.844. See also the technical report.
- Here you have a collection of example code; performance is pretty good, and some STXXL users have reported that STXXL sort is much faster than UNIX sort. In addition, I found this paper which reports very good performance (breadth-first search on massive graphs). Do not forget to consult the STXXL forum if you need any help.
Banks Merging and Acquisition (M&A)
The banking system is going crazy. One solution is to merge all banks together under a central authority. You have N banks at the beginning, and they are combined into a single bank at the end. How many different merging configurations are possible?
Thursday, February 12, 2009
Longest common substring, subsequences, increasing subsequences
String gurus are excited about long stuff. It turns out that this theoretical research has a lot of practical applications.
- Longest Common Substring (LCS): it has applications in biology (gene analysis), Web-IR (clustering, feature extraction) and many others. Given two strings S, |S| = n, and T, |T| = m, finding the longest common substring has complexity O(nm) using dynamic programming. Alternatively, a generalized suffix tree (GST) can be built in linear time, if the size of the alphabet is constant. Once the GST is built, you can get the LCS by visiting the GST bottom-up; you need to annotate each internal node with the strings that caused its creation. The visit is linear in the length of the strings.
- Longest Common Subsequence (LCSseq): substrings are made up of consecutive symbols, while subsequences are not. With web clustering, for instance, you may want to "skip" some symbols; in this case, you have to use subsequences instead of substrings. If you have two documents, you may want to see where they differ (have you ever used the diff program?). There are three useful properties, which yield a dynamic programming solution with complexity O(nm):
- LCSseq(Xn, Ym) = (LCSseq(Xn-1, Ym-1), xn), if the strings Xn, Ym both end with the symbol xn
- LCSseq(Xn, Ym) = the longer of LCSseq(Xn-1, Ym) and LCSseq(Xn, Ym-1), when Xn, Ym do not end with the same symbol
- LCSseq(X, {}) = LCSseq({}, Y) = {}
Running time can be reduced by adopting memoization. Space can be reduced by keeping only two rows of the dynamic programming table. - Shortest Common Supersequence (SCS) is a common supersequence of minimal length. Think about merging two documents. For two input sequences, an SCS can be formed from an LCSseq easily, by inserting the non-LCSseq symbols while preserving the symbol order.
- Longest Increasing Subsequence (LIS): here symbols have an order (lexicographical, numerical, etc.) and you want to find the longest subsequence that respects this order. The longest increasing subsequence of a sequence S is the longest common subsequence of S and T, where T is the result of sorting S. That algorithm is quadratic, but an O(n log n) algorithm is possible.
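The two LCSseq recurrences above translate directly into code; here is a minimal sketch that keeps only two rows of the table, so space is O(m) instead of O(nm).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common subsequence of X and Y in O(n*m) time.
std::size_t lcsseq_length(const std::string& X, const std::string& Y) {
    std::vector<std::size_t> prev(Y.size() + 1, 0), cur(Y.size() + 1, 0);
    for (std::size_t i = 1; i <= X.size(); ++i) {
        for (std::size_t j = 1; j <= Y.size(); ++j) {
            if (X[i-1] == Y[j-1])
                cur[j] = prev[j-1] + 1;                // both end with same symbol
            else
                cur[j] = std::max(prev[j], cur[j-1]);  // drop one trailing symbol
        }
        std::swap(prev, cur);
    }
    return prev[Y.size()];
}
```

Recovering the subsequence itself (e.g. for diff) needs the full table or Hirschberg's divide-and-conquer trick.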
Wednesday, February 11, 2009
Magy! how to rank a web search result
Magy! engineers are trying to identify new methodologies for ranking web search results. After all, the academic literature has evolved around three main ideas:
We saw many interesting developments, but no real breakthrough in new directions. Do you have any additional ideas for ranking web search results?
Tuesday, February 10, 2009
Magy! search result clustering and the next search UI
The search UI has stayed the same since the beginning of the search era. True: the 10 blue links model is a solid one, but is there any other innovation?
Many people are thinking that blending is the next frontier, with champions such as Ask.com and Google. They mix and shake together results coming from different search domains, such as news, images, blogs, questions, products, and the like. The result is a sense of more variety, more freshness and, after all, a better user experience.
Personally, I do think that we need to try different UIs in response to different search queries. Back in 2000, I was a big fan of Northern Light's folder organization, which later on evolved into the Vivisimo web hierarchies adopted by Clusty, and into a series of academic research proposals about web snippet clustering.
I think that next UIs will be obtained (a) by leveraging cool graphical proposals and (b) by solving basic algorithm problems which apparently have no relationship with UI.
For instance, here is one problem for our proud Magy! engineers:
- Given an array of strings, find the longest common substring in optimal time
PS: you may be interested in having a look at the suffix array and suffix tree data structures for solving this particular problem.
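As a baseline for the exercise, here is a sketch of mine of the O(n*m) dynamic program for two strings; the generalized suffix tree mentioned in the PS solves the k-string version in linear time.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Longest common substring of a and b: L[i][j] is the length of the
// common run ending at a[i-1] and b[j-1]; a mismatch resets the run to 0.
std::string longest_common_substr(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> L(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    int best = 0;
    std::size_t end = 0;  // end position of the best run in a
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            if (a[i-1] == b[j-1]) {
                L[i][j] = L[i-1][j-1] + 1;  // extend the common run
                if (L[i][j] > best) { best = L[i][j]; end = i; }
            }
    return a.substr(end - best, best);
}
```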
Monday, February 9, 2009
Dropping Swarovski Crystals
In a building of 100 floors, there is a threshold floor N such that a Swarovski crystal dropped from floor N or above will break, while one dropped from any floor below will not. Given 2 crystals, find N while minimizing the number of drops in the worst case.
Sunday, February 8, 2009
Maximum sub-array sum
Given an array of integers, find the sub-array of contiguous elements with maximal sum. In case of ties select the shortest sub-array.
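One hedged take on a linear-time solution (Kadane's algorithm): the best sub-array ending at position i either extends the best one ending at i-1 or starts fresh at i. The shortest-on-ties rule is a small extension, tracking the sub-array bounds as well.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Maximum sum over all contiguous sub-arrays (assumes a non-empty input).
int max_subarray_sum(const std::vector<int>& a) {
    int best = a[0], ending_here = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        ending_here = std::max(a[i], ending_here + a[i]);  // extend or restart
        best = std::max(best, ending_here);
    }
    return best;
}
```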
Saturday, February 7, 2009
Finding the most frequent symbol in a stream of numbers
You are given an infinite stream of numbers; find the symbol that appears most frequently. At any instant, you are interested in symbols that appear more than 50% of the time.
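One hedged sketch of a constant-space answer (the Boyer-Moore majority vote): a symbol occurring more than 50% of the time survives as the candidate, because every other symbol can cancel it at most once.

```cpp
#include <vector>

// Majority-vote scan: returns the candidate majority element.
// The result is guaranteed correct only if a true >50% majority exists.
int majority_candidate(const std::vector<int>& stream) {
    int candidate = 0, count = 0;
    for (int x : stream) {
        if (count == 0) { candidate = x; count = 1; }  // adopt a new candidate
        else if (x == candidate) ++count;              // reinforce it
        else --count;                                  // cancel one occurrence
    }
    return candidate;
}
```

On a truly unbounded stream the same update runs online, one symbol at a time, with O(1) state.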
Friday, February 6, 2009
Happy or not?
A community of K people can be happy or not happy. Can you enumerate and generate all the possible states?
Thursday, February 5, 2009
joining n segments
You are given n segments. In turn, you take each one of them and join it with one among those already selected, creating either a loop or a longer segment. On average, how many loops are there at each instant of time?
Wednesday, February 4, 2009
Covering a circle
You are given a circle and two players. On his turn, each player marks a sign on the circle. The game stops when there is no more free space available on the circle. What is the winning strategy?
Tuesday, February 3, 2009
Find a number not in the sequence
You are given a sequence of m distinct numbers extracted from a set of 1...n numbers, with m < n. Can you give an efficient algorithm for identifying a number not in the sequence?
Monday, February 2, 2009
Boost::serialization for storing generic class
Boost serialization is a lovely library for storing generic classes. Here you have examples for serializing an int, a list of ints, vectors of lists of floats (STL containers), and your own class.
Sunday, February 1, 2009
Maximum sub-array product
Given an array of integers, find the proper sub-array with maximum elements' product, in linear time.