How do you detect the original news article (or document) in a stream of articles (documents) that contains near-duplicates? This is a generalization of the classical shingling problem in which the order of the documents matters. For instance, in a news engine context you may want to identify the source that broke the story.
Detecting the Origin of Text Segments Efficiently is a Google paper that explores different methods of generating Rabin shingles. The shingles are then hashed into a fixed-size cache, and different cache eviction strategies are used to cope with online processing. At query time, an estimation step is used to guess the origin of each shingle. The best combination of generation, eviction and estimation is able to detect the origin document with an accuracy of 80-90% using just 1.4% of the shingled tokens! Pretty impressive!!
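To make the pipeline a bit more concrete, here is a minimal sketch of the general idea, not the paper's actual method. Everything in it is an assumption made for illustration: word-level shingles of length `k`, a truncated BLAKE2 hash standing in for Rabin fingerprints, plain LRU as the eviction strategy, and a simple majority vote as the estimation step. The names `shingles`, `OriginCache`, `add` and `guess_origin` are hypothetical.

```python
import hashlib
from collections import Counter, OrderedDict


def shingles(text, k=3):
    """Return hashed word k-shingles of text.

    Assumption: a truncated BLAKE2 digest stands in for Rabin fingerprints.
    """
    words = text.lower().split()
    hashes = []
    for i in range(len(words) - k + 1):
        gram = " ".join(words[i:i + k])
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes


class OriginCache:
    """Fixed-size map from shingle hash to the first document seen with it."""

    def __init__(self, capacity=1000, k=3):
        self.capacity = capacity
        self.k = k
        self.first_seen = OrderedDict()  # shingle hash -> id of earliest doc

    def add(self, doc_id, text):
        """Process one document from the stream, in arrival order."""
        for h in shingles(text, self.k):
            if h in self.first_seen:
                # Seen before: refresh recency but keep the original owner.
                self.first_seen.move_to_end(h)
            else:
                self.first_seen[h] = doc_id
                if len(self.first_seen) > self.capacity:
                    # Assumption: plain LRU eviction of the oldest entry.
                    self.first_seen.popitem(last=False)

    def guess_origin(self, text):
        """Estimate the origin by majority vote over cached shingles."""
        votes = Counter(
            self.first_seen[h]
            for h in shingles(text, self.k)
            if h in self.first_seen
        )
        return votes.most_common(1)[0][0] if votes else None
```

Feeding the documents with `add` in the order they arrive and then calling `guess_origin` on a later near-duplicate returns the id of the document that owns most of its cached shingles, or `None` when no shingle is in the cache. With a small `capacity`, old shingles get evicted and the guess becomes less reliable, which is the trade-off the paper's eviction and estimation strategies are meant to manage.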
I wonder how many times the origin document is just the first one produced in temporal order.