Antonio Gulli's coding playground: Similarities: what is better?

Sunday, October 25, 2009

Similarities: what is better?

Similarity functions are at the core of many machine learning and data mining algorithms (hmm, not just clustering and recommendation systems). There are many similarity measures out there.

What is the best one?

It depends. Anyway, cosine similarity behaved very well in a large-scale experiment run by Google and described in the paper "Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network". Other measures were evaluated as well, such as L1-norm, Pointwise Mutual Information, Pointwise Mutual Information with negative feedback, TF*IDF, and LogOdds. The dataset for the experiment is Orkut, and 4,106,050 community pages with recommendations were considered. The cosine measure was the best one in terms of finding correct correlations between recommendations. I am pretty sure that different measures can perform differently on other datasets. Anyway, this is another example of why I love to KISS.

Keep it simple baby, KISS.
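
Just to show how simple the cosine measure really is, here is a minimal sketch in Python. It assumes each community is represented as a set of member ids, so the cosine between two binary membership vectors reduces to the size of the overlap divided by the square root of the product of the two sizes. The community names, the member ids, and the helper `most_similar` are made up for illustration; this is not the code or the data used in the paper.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sets seen as binary vectors.

    For binary vectors the dot product is the size of the intersection
    and each norm is the square root of the set size.
    """
    if not a or not b:
        # An empty set has a zero norm; treat the similarity as 0.
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def most_similar(name, communities):
    """Rank all other communities by cosine similarity to `name`."""
    base = communities[name]
    scores = [
        (other, cosine_similarity(base, members))
        for other, members in communities.items()
        if other != name
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: community name -> set of member ids.
communities = {
    "python": {1, 2, 3, 4},
    "java": {3, 4, 5, 6},
    "cooking": {7, 8},
}

if __name__ == "__main__":
    for other, score in most_similar("python", communities):
        print(other, round(score, 3))
```

In this toy example "python" and "java" share two of their four members each, so they rank above "cooking", which shares none. Swapping in another measure only means replacing `cosine_similarity` with a different function of the two sets.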
