(excerpt from my new book, question #65)
News clustering is a hard problem.
News articles typically arrive at our clustering engine in a continuous
streaming fashion, so a plain vanilla batch approach is not feasible.
For instance, the simple idea of using k-means cannot work, for two reasons.
First, it is not possible to know the number of clusters a priori, because the
topics evolve dynamically. Second, the articles themselves are not
available a priori. Therefore, more sophisticated strategies are required.
One initial idea is to split the data into
mini-batches (perhaps processed with Spark Streaming) and to cluster the
content of each mini-batch independently. Then, clusters from different epochs (i.e.,
mini-batches) can be chained together, as in the sketch below.
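Here is a minimal sketch of this mini-batch idea in Python. It assumes each epoch arrives as a list of article strings; the hashing vectorizer (which keeps the feature space fixed across epochs), the similarity threshold, and the helper names are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = HashingVectorizer(n_features=2**16, stop_words="english")
SIMILARITY_THRESHOLD = 0.5  # assumed cut-off for linking clusters across epochs

def cluster_epoch(articles, k=10):
    """Cluster one mini-batch independently and return (labels, centroids)."""
    X = vectorizer.transform(articles)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_

def chain_epochs(prev_centroids, curr_centroids):
    """Link each cluster of the current epoch to its most similar predecessor, if any."""
    sims = cosine_similarity(curr_centroids, prev_centroids)
    links = {}
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= SIMILARITY_THRESHOLD:
            links[i] = j  # cluster i continues previous cluster j
    return links
```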
An additional intuition is to start with
k seeds and then extend those initial k clusters whenever a new article arrives
that is not similar enough to the existing groups. In this way, new clusters are
dynamically created when needed. In a further variant, we could
re-cluster all the articles after a number of epochs, under the
assumption that this will improve our target metric.
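A minimal sketch of this seed-and-grow strategy follows; the similarity threshold, the running-mean centroid update, and the class name are assumptions made only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = HashingVectorizer(n_features=2**16, stop_words="english")
NEW_CLUSTER_THRESHOLD = 0.3  # assumed; below this similarity we open a new cluster

class StreamingClusterer:
    def __init__(self, seed_articles):
        X = vectorizer.transform(seed_articles).toarray()
        self.centroids = [x for x in X]   # one seed article per initial cluster
        self.counts = [1] * len(self.centroids)

    def add(self, article):
        """Assign an incoming article to the closest cluster, or create a new one."""
        x = vectorizer.transform([article]).toarray()[0]
        sims = cosine_similarity([x], self.centroids)[0]
        best = int(np.argmax(sims))
        if sims[best] >= NEW_CLUSTER_THRESHOLD:
            # incremental (running mean) centroid update
            n = self.counts[best]
            self.centroids[best] = (self.centroids[best] * n + x) / (n + 1)
            self.counts[best] += 1
            return best
        self.centroids.append(x)          # dynamically create a new cluster
        self.counts.append(1)
        return len(self.centroids) - 1
```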
In addition to that, we can have a look at the
data and perhaps notice that many articles are near-duplicates. Hence, we could
aim at reducing the computational complexity by applying pseudo-linear techniques
such as MinHash shingling.
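Below is a small, self-contained sketch of MinHash over word shingles for spotting near-duplicates before clustering; the signature length, shingle size, and the 0.9 collapse threshold are assumed values chosen only to make the example concrete.

```python
import hashlib

NUM_HASHES = 64
SHINGLE_SIZE = 3

def shingles(text, k=SHINGLE_SIZE):
    """Set of overlapping k-word shingles of the article."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text):
    """One minimum per seeded hash function approximates the shingle set."""
    sh = shingles(text) or {text.lower()}   # guard against very short texts
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in sh
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

# Articles whose estimated Jaccard similarity exceeds, say, 0.9 can be
# collapsed into one representative, reducing the clustering workload.
```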
More sophisticated methods aim at ranking
the articles by importance. This is an even harder problem, again because of the
dynamic nature of the content and the absence of links, which would have allowed
PageRank-type computations. If links are not available, then a two-layer model
can be considered, where the importance of a news article depends on the importance
of the originating news source, which in turn depends on the importance of
the articles it emits. It can be proven that this recursive definition has a fixed-point
solution [1].
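The following is only a bare-bones sketch of that mutual reinforcement between articles and sources: repeated normalized updates converge to a fixed point. The full scheme in [1] is richer (it also handles time decay, which is omitted here), and the function and variable names are illustrative.

```python
import numpy as np

def rank_articles(article_source, n_sources, iterations=50):
    """article_source[i] = index of the source that emitted article i."""
    article_source = np.asarray(article_source)
    n_articles = len(article_source)
    article_rank = np.ones(n_articles) / n_articles
    source_rank = np.ones(n_sources) / n_sources
    for _ in range(iterations):
        # a source is as important as the articles it emits
        source_rank = np.zeros(n_sources)
        for i, s in enumerate(article_source):
            source_rank[s] += article_rank[i]
        source_rank /= source_rank.sum()
        # an article is as important as its originating source
        article_rank = source_rank[article_source]
        article_rank /= article_rank.sum()
    return article_rank, source_rank
```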
Even more sophisticated methods aim at extracting
entities from the clusters; this is typically achieved by running topic
model detection on the evolving mini-batches.
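One way to sketch this is with scikit-learn's online LDA, refined one mini-batch at a time via partial_fit. The number of topics, the hashing setup, and the function name are assumptions; the per-topic term weights it returns are a stand-in for the extracted entities.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# non-negative hashed counts, so they can feed LDA
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False,
                               stop_words="english", norm=None)
lda = LatentDirichletAllocation(n_components=20, learning_method="online",
                                random_state=0)

def update_topics(mini_batch_articles):
    """Refine the topic model with one more epoch (mini-batch) of articles."""
    X = vectorizer.transform(mini_batch_articles)
    lda.partial_fit(X)
    return lda.components_  # per-topic term weights over the hashed vocabulary
```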
[1] G. M. Del Corso, A. Gulli, F. Romani. Ranking a Stream of News. WWW '05: Proceedings of the 14th International Conference on World Wide Web.