Random Decision Trees

Saturday, November 21, 2009

Random Decision Trees are an interesting variant of Decision Trees. Here are the key elements:

- Different training sets are generated from the N objects in the original training set by using a bootstrap procedure, which samples with replacement so that the same example may be drawn multiple times. Each sample generates a different tree, and all the trees together are seen as a forest (see the bootstrap sketch after the list);
- The random trees classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label that received the majority of “votes” (see the voting sketch below);
- Each node of each tree is trained on a random subset of the variables. The size of this subset is a training parameter (typically sqrt(#features)). The best split criterion is chosen by considering only the randomly sampled variables (see the per-node sampling sketch below);
- Due to the above random sampling, some training elements are left out of each tree's bootstrap sample and can be used for evaluation. In particular, for each left-out vector, find the class that received the majority of votes among the trees that did not train on it, and compare it to the ground-truth response (see the out-of-bag sketch below);
- The classification error estimate is computed as the ratio of the number of misclassified left-out vectors to the total number of vectors in the original data.
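A minimal sketch of the bootstrap step in plain Python (the function name is illustrative, not from any particular library): each tree receives its own training set of N indices drawn with replacement, and the indices that were never drawn form that tree's "out-of-bag" set.

```python
import random

def bootstrap_sample(n):
    """Return (in_bag, out_of_bag) index lists for one tree.

    Drawing n indices with replacement repeats some examples and
    leaves others out; the left-out ones are "out-of-bag".
    """
    in_bag = [random.randrange(n) for _ in range(n)]
    out_of_bag = [i for i in range(n) if i not in set(in_bag)]
    return in_bag, out_of_bag
```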
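Prediction by majority vote could look like the sketch below; it assumes each tree object exposes a predict(x) method returning a class label, which is an illustrative assumption rather than a specific library API.

```python
from collections import Counter

def forest_predict(forest, x):
    """Classify x with every tree and return the most-voted label."""
    votes = Counter(tree.predict(x) for tree in forest)  # assumed predict(x) API
    return votes.most_common(1)[0][0]
```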
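The per-node variable sampling can be sketched as follows; the subset size sqrt(#features) matches the default mentioned above, and the returned indices are the only variables the split search would examine at that node.

```python
import math
import random

def split_candidates(n_features):
    """Variables examined at one node: a fresh random subset of
    size sqrt(#features); the best split is searched among these only."""
    k = max(1, int(math.sqrt(n_features)))
    return random.sample(range(n_features), k)  # sampled without replacement
```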
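Finally, a sketch of the out-of-bag error estimate, reusing the assumptions above (trees expose a predict(x) method, and each tree is paired with the out-of-bag index set from its bootstrap sample):

```python
from collections import Counter

def oob_error(trees_with_oob, data, labels):
    """trees_with_oob: list of (tree, out_of_bag_index_set) pairs.

    For each vector, vote using only the trees that left it out of
    training, then report misclassified / total vectors.
    """
    misclassified = 0
    for i, (x, y) in enumerate(zip(data, labels)):
        votes = Counter(tree.predict(x)  # assumed predict(x) API
                        for tree, oob in trees_with_oob if i in oob)
        # A vector that was in-bag for every tree casts no votes
        # and is simply skipped here.
        if votes and votes.most_common(1)[0][0] != y:
            misclassified += 1
    return misclassified / len(data)
```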