Caching is an important aspect of modern search engines. Some assumptions that I see in academic papers may differ from what happens in industrial search. In particular:
1) Inverted lists are nowadays stored in memory; they are no longer stored on disks;
2) There is a periodic effect over the 24 hours of a day: there are morning queries and late-night queries;
3) Geo-localization is very important. What are the benefits and the costs?
4) Verticals may invalidate the cache. For instance, you need to mix fresh news with some cached results. How do you deal with that? What are the benefits and the costs?
The paper "Improved Techniques for Result Caching in Web Search Engines" models web search caching as a weighted problem. Query results have the same size, but they may have different benefits. In this sense, the caching problem is not just the problem of maximizing the hit ratio. It becomes the problem of maximizing the benefits.
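To make the weighted view concrete, here is a minimal sketch of one classic cost-aware eviction policy, GreedyDual, for entries of equal size. It is only an illustration of weighted caching in general and is not necessarily the policy studied in the paper. The class name `GreedyDualCache`, the `cost` argument (for example, the server time saved by a hit), and the sample queries are all assumptions made for this example.

```python
import heapq
from typing import Any, Dict, Hashable, List, Optional, Tuple


class GreedyDualCache:
    """Cost-aware cache for equal-size entries (GreedyDual sketch).

    Each entry gets a credit H = L + cost, where L is a global "inflation"
    value. On eviction, the entry with the smallest H is removed and L is
    raised to that H, so entries that are not hit again age out over time.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.inflation = 0.0                                  # the global L
        self.values: Dict[Hashable, Any] = {}                 # key -> cached result
        self.costs: Dict[Hashable, float] = {}                # key -> benefit of a hit
        self.current: Dict[Hashable, Tuple[float, int]] = {}  # key -> live (H, seq)
        self.heap: List[Tuple[float, int, Hashable]] = []     # may hold stale entries
        self.seq = 0

    def _refresh(self, key: Hashable) -> None:
        # Reset the credit of `key` to L + cost and push a new heap entry.
        # Older heap entries for the same key become stale and are skipped
        # at eviction time (lazy deletion; the heap is not compacted here).
        self.seq += 1
        h = self.inflation + self.costs[key]
        self.current[key] = (h, self.seq)
        heapq.heappush(self.heap, (h, self.seq, key))

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self.values:
            return None
        self._refresh(key)  # a hit restores the entry's full credit
        return self.values[key]

    def put(self, key: Hashable, value: Any, cost: float) -> None:
        if key not in self.values and len(self.values) >= self.capacity:
            self._evict()
        self.values[key] = value
        self.costs[key] = cost
        self._refresh(key)

    def _evict(self) -> None:
        while self.heap:
            h, seq, key = heapq.heappop(self.heap)
            if self.current.get(key) == (h, seq):  # skip stale heap entries
                self.inflation = h
                del self.values[key], self.costs[key], self.current[key]
                return


# Hypothetical usage: costs are made-up processing times in milliseconds.
cache = GreedyDualCache(capacity=2)
cache.put("expensive query", "results A", cost=10.0)
cache.put("cheap query", "results B", cost=1.0)
cache.put("medium query", "results C", cost=5.0)  # evicts "cheap query"
```

With plain LRU, the third insertion would have evicted "expensive query", the oldest entry. Here the cheapest entry goes first, which is the sense in which the goal shifts from hit ratio to total benefit.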
I would like to see more papers where the above 4 considerations are taken into account.
I found that cost-aware caching policy is not new (P. Cao and S. Irani. Cost-aware WWW proxy caching algorithms. In Proc. USENIX Symposium on Internet Technologies and Systems, 1997.)
Thanks for pointing out this paper, Antonio.
On your point about maximizing the benefits, it might be interesting to try to include some measure of the overall impact on user satisfaction in the metric, not just server time used.
For example, I could imagine that it might be important to partially cache data so that results appear to display instantaneously, but that the amount of data that must be cached to keep things fast might vary from query to query. I also suspect that searchers might be more sensitive to delays on certain classes of queries. A broader measure of the benefits of caching a query might be able to capture those effects.