Monday, August 30, 2010

Apache Mahout: a formidable collection of Data mining algos on the top of Hadoop (Map Reduce)

I am playing with Mahout these days and I am loving it. Mahout is a formidable collection of data mining and machine learning algorithms implemented on the top of Hadoop, the open source Map Reduce programming paradigm.

Here you find some hints to run Mahout on the top of Amazon EC2. Here a collection of algorithms implemented . They include:


Logistic Regression (SGD implementation in progress: MAHOUT-228)


Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)

Perceptron and Winnow (open: MAHOUT-85)

Neural Network (open, but MAHOUT-228 might help)

Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)

Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)


Canopy Clustering (MAHOUT-3 - integrated)

K-Means Clustering (MAHOUT-5 - integrated)

Fuzzy K-Means (MAHOUT-74 - integrated)

Expectation Maximization (EM) (MAHOUT-28)

Mean Shift Clustering (MAHOUT-15 - integrated)

Hierarchical Clustering (MAHOUT-19)

Dirichlet Process Clustering (MAHOUT-30 - integrated)

Latent Dirichlet Allocation (MAHOUT-123 - integrated)

Spectral Clustering (open, MAHOUT-363, GSoC 2010)

Pattern Mining

Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)


Locally Weighted Linear Regression (open)

Dimension reduction

Singular Value Decomposition and other Dimension Reduction Techniques (available since 0.3)

Principal Components Analysis (PCA) (open)

Independent Component Analysis (open)

Gaussian Discriminative Analysis (GDA) (open)

Evolutionary Algorithms

see also: MAHOUT-56 (integrated)

1 comment:

  1. That is pretty cool. Too bad Amazon doesn't support Mahout as part of Elastic MapReduce:

    On a related note, have you looked at all at Google's Prediction API or Bigquery? Looks like they are invite only at the moment, but could be quite interesting as well.