Antonio Gulli's coding playground: Spam detection

Friday, March 19, 2010

Spam detection

Suppose you need to classify a set of web pages based solely on their textual content. What approach would you adopt?

1 comment:

Greg Linden, March 22, 2010 at 8:28 AM
The approach everyone always takes? Think up a bunch of features and train a classifier?

That's just about every paper on this topic these days. All the papers look like this: Here's a couple hundred features we tried, here are the dozen that mattered, here are the types of classifiers we tried but it didn't make much difference which kind we used.
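As a rough illustration of that recipe (not taken from any particular paper), here is a minimal Python sketch: a few hand-picked text features fed into a logistic regression trained with plain gradient descent. The feature choices, the "spammy" word list, the function names, and the four training pages are all made up for the example.

```python
import math
import re

# Illustrative word list; a real system would learn or curate this.
SPAMMY_WORDS = {"free", "cheap", "buy", "winner", "click"}


def extract_features(text):
    """Map raw page text to a small, hand-picked feature vector."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)
    return [
        1.0,                                             # bias term
        sum(w in SPAMMY_WORDS for w in words) / n_words,  # share of "spammy" words
        1.0 - len(set(words)) / n_words,                  # word repetition rate
        text.count("!") / max(len(text), 1),              # exclamation mark density
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def spam_probability(weights, text):
    """Score a page with the learned weights."""
    x = extract_features(text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(examples, epochs=500, lr=1.0):
    """Fit logistic regression with stochastic gradient descent."""
    weights = [0.0] * len(extract_features(""))
    for _ in range(epochs):
        for text, label in examples:
            x = extract_features(text)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            # Move each weight toward reducing the log-loss on this example.
            weights = [w + lr * (label - p) * xi for w, xi in zip(weights, x)]
    return weights


# Toy labeled pages: 1 = spam, 0 = not spam.
TRAINING = [
    ("Buy cheap pills now! Free free free! Click here, winner!", 1),
    ("Cheap cheap cheap! Buy now! Click click click!", 1),
    ("This article reviews indexing strategies for large text collections.", 0),
    ("We describe the design of a crawler and measure its throughput.", 0),
]

weights = train(TRAINING)
print(spam_probability(weights, "Free free cheap! Click to buy!"))
print(spam_probability(weights, "The paper compares several ranking functions."))
```

Swapping the logistic regression for another classifier would leave `extract_features` untouched, which is one way to see why the feature set tends to be where the effort goes.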

I think the more interesting work on this topic is when we include other data such as real-time user behavior data in the classification. That's when things start to get exciting.
