Friday, March 19, 2010

Spam detection

Suppose you need to classify a set of web pages based solely on their textual content. What approach would you adopt?

1 comment:

  1. The approach that everyone always takes? Think up a bunch of features and train a classifier?

    That's just about every paper on this topic these days. All the papers look like this: here are a couple hundred features we tried, here are the dozen that mattered, and here are the types of classifiers we tried, though it didn't make much difference which kind we used.

    I think the more interesting work on this topic happens when we include other data, such as real-time user behavior, in the classification. That's when things start to get exciting.
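
For readers who have not seen the "think up features, train a classifier" recipe from the comment above, here is a minimal sketch of it in plain Python. Everything in it is an assumption made for illustration: the word list `SPAM_WORDS`, the two hand-picked features, the four toy pages in `TRAINING_PAGES`, and the choice of a perceptron as the classifier. Published systems try hundreds of features and several classifier types; this only shows the shape of the pipeline.

```python
import re

# Hypothetical hand-picked vocabulary; real feature sets are far larger.
SPAM_WORDS = {"free", "win", "cheap", "casino"}


def extract_features(text):
    """Turn page text into a small numeric feature vector."""
    words = re.findall(r"[a-z]+", text.lower())
    return [
        1.0,                                        # bias term
        float(sum(w in SPAM_WORDS for w in words)),  # count of spam-like words
        float(text.count("!")),                      # count of exclamation marks
    ]


def score(weights, features):
    """Dot product of the weight vector and the feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def classify(weights, text):
    """Return 1 for spam, 0 for not spam."""
    return 1 if score(weights, extract_features(text)) > 0 else 0


def train_perceptron(examples, epochs=20):
    """Classic perceptron rule: adjust weights only on misclassified examples."""
    weights = [0.0] * len(extract_features(""))
    for _ in range(epochs):
        for text, label in examples:
            features = extract_features(text)
            error = label - classify(weights, text)  # -1, 0, or +1
            weights = [w + error * x for w, x in zip(weights, features)]
    return weights


# Toy labeled pages (1 = spam, 0 = not spam), invented for this sketch.
TRAINING_PAGES = [
    ("Win free casino money now!!!", 1),
    ("Notes on training a text classifier", 0),
    ("Cheap pills, free shipping, buy now!", 1),
    ("Our free library opens at nine.", 0),
]

if __name__ == "__main__":
    weights = train_perceptron(TRAINING_PAGES)
    print(weights)
    print(classify(weights, "Free casino bonus!!!"))
    print(classify(weights, "Meeting notes from Friday"))
```

Swapping the perceptron for another classifier only means replacing `train_perceptron` and `classify`; the feature extraction step stays the same, which fits the comment's point that the choice of classifier tends to matter less than the features.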