Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
Wednesday, September 30, 2009
eBay: online auctions are about search and vice versa
I believe that search and online auctions will be the same thing in the near future. I am just curious about what Hugh is doing there; Hugh and I have great expectations ;-)
Tuesday, September 29, 2009
Google as borg?
"Welcome to the cusp of a new decade and what I am officially calling The Age of the Goog. I am starting to see Google as the real life version of the Borg in that it is slowly moving along and without most people realizing it, simply assimilating us into their collective offerings of services."
searchnewz
Great feedback on a contribution I made to the open source coding community
"The best thing about code is, Its simple, easy to implement any where else and all patterns in one place. I have never seen the simplicity of having all patterns in one place. I had been through Modern C++ programming and design patterns from Alexanderscu. You made more easier than that."
Cheers!
DJ
http://codingplayground.blogspot.com/2009/01/design-patterns-c-full-collection-of.html
Monday, September 28, 2009
Findory: an old post, I like to propagate it.
Findory was my model when I started to work on news search back in 2003. I had the pleasure of meeting Greg, and I enjoyed our discussions. I like this post. It's worth reading.
Tuesday, September 22, 2009
Tricky puzzle
This puzzle is tricky. Mary writes two real numbers on the opposite sides (A and B) of a sheet of paper. The numbers must be different. Then she gives the paper to John, who can select one side of the paper. Say he selects the A-side. He sees the number on the A-side and, without looking at the B-side, he must bet on whether the number on the A-side is greater than the number on the B-side. They continue this game forever. Questions:
1. What is the probability to win for John?
2. Is there any strategy better than break-even?
Monday, September 21, 2009
c++ giving an order to the allocation of static objects
When a static object uses another static object defined in a different translation unit, there is no guarantee about the order of initialization.
- How can you guarantee that an object is allocated before another one?
- What if you need to give an order in the de-allocation process of static objects?
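One common answer to the first question is the construct-on-first-use idiom: wrap the static object in a function, so it is built the first time somebody asks for it. Here is a minimal sketch; the names `registry` and `Plugin` are made up for illustration.

```cpp
#include <string>
#include <vector>

// Construct-on-first-use: the registry is a function-local static, so it is
// created the first time registry() is called, not at some unspecified point
// during program start-up.
std::vector<std::string>& registry() {
    static std::vector<std::string> instance;
    return instance;
}

// A hypothetical type whose static instances use the registry while they
// are being constructed.
struct Plugin {
    explicit Plugin(const std::string& name) { registry().push_back(name); }
};

// Wherever these statics are defined, each constructor finds a registry
// that has already been built.
static Plugin pluginA("A");
static Plugin pluginB("B");
```

For the second question: static objects are destroyed in the reverse order of their construction, so in this sketch the registry outlives both plugins. If that is still not enough, another option is to allocate the object with `new` inside the function and never delete it, accepting that its destructor will not run.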
Sunday, September 20, 2009
c++ Insulation (IV)
So let's talk about good reasons to adopt insulation. Suppose you define a class; then you may want to modify it. For instance, you need to add a bool. Any client of that class is forced to recompile. Not a good thing when you need to manage a large project, distributed in many different locations.
Let's see other situations where you can have dependencies.
- default arguments. If you use default arguments in your method and then you decide to change them, your clients will be forced to recompile.
- enumerations. If you put your enumerations in your .cpp file, they are hidden from the client. If you put your enumerations in your .h file, you will introduce a dependency.
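A minimal single-file sketch of both points, with comments marking what would go in the .h and what would go in the .cpp. The names `openConnection`, `Mode` and `kDefaultTimeoutSeconds` are hypothetical, chosen only for this example.

```cpp
#include <string>

// --- what would live in connection.h ---
// No default argument and no enum here. Clients see only two declarations,
// so changing the default timeout never touches the header.
std::string openConnection(const std::string& host, int timeoutSeconds);
std::string openConnection(const std::string& host);  // uses a hidden default

// --- what would live in connection.cpp ---
namespace {
// Hidden from clients: both the enum and the default value can change
// without forcing anyone who includes connection.h to recompile.
enum Mode { Fast, Safe };
const int kDefaultTimeoutSeconds = 30;
}  // namespace

std::string openConnection(const std::string& host, int timeoutSeconds) {
    Mode mode = timeoutSeconds < 10 ? Fast : Safe;
    return host + (mode == Fast ? ":fast:" : ":safe:") +
           std::to_string(timeoutSeconds);
}

std::string openConnection(const std::string& host) {
    // The default is applied here, inside the implementation file.
    return openConnection(host, kDefaultTimeoutSeconds);
}
```

The price is one extra overload to maintain; in exchange the default value becomes an implementation detail.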
Saturday, September 19, 2009
A probabilistic model for retrospective news event detection
A probabilistic model for retrospective news event detection is a 2005 paper that I still consider very valid. The problem addressed is detecting when an event happened in the past. This can be very useful for building a chronicle service. The proposed model is a probabilistic model that incorporates both content and time information in a unified framework. Model parameters are estimated using Maximum Likelihood, where each article is characterized by four independent probabilities: time, person, location and keywords. The ML estimate is computed with the classical EM algorithm. Tests are run on the TDT dataset.
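Read literally, the four independent probabilities suggest a factorization like the sketch below. The notation is an assumption made here for illustration (x_i for an article, e_j for an event, theta for the parameters), and so is the mixture form of the likelihood; check the paper for the exact formulation.

```latex
% Assumed notation: x_i is the i-th article, e_j the j-th event,
% \theta the model parameters.
\[
p(x_i \mid e_j) = p(\mathrm{time}_i \mid e_j)\,
                  p(\mathrm{persons}_i \mid e_j)\,
                  p(\mathrm{locations}_i \mid e_j)\,
                  p(\mathrm{keywords}_i \mid e_j)
\]
\[
\log L(\theta) = \sum_{i} \log \sum_{j} p(e_j)\, p(x_i \mid e_j)
\]
```

Under this reading, EM alternates between computing p(e_j | x_i) for each article (E-step) and re-estimating the parameters from those soft assignments (M-step).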
Friday, September 18, 2009
c++ Insulation (III)
These are two of the most unexpected parts of C++.
- Inline methods. Inlining is an easy yet powerful way of improving C++ performance. The problem is that if you inline a method, then your client will replace its call to your method with the body of the method itself. This means the client takes on a very strong dependency on you, and it may see its code size increase a lot. There is a cost to pay for it. Just be sure of who is in your bed and don't accept people you just met.
- Private members. OK, this is very nasty at first glance. I'll tell you in one shot: if you declare a private member and someone inherits from you, that guy has a dependency on you. This is very counter-intuitive, I know. You may think: that is private, how come he knows something about my private life? My good friend Giovanni (a real C++ guru) once told me: private is about access, not about knowledge. You see it, but you are not allowed to access it. A stellar story. Augh Giovanni, you made my day. So you have a dependency on it. My question: how do you avoid it? What is the cost of your choice?
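One well-known way to hide private members is the pimpl ("pointer to implementation") idiom. A minimal single-file sketch follows, with comments marking the .h and .cpp parts; `Widget` and its members are hypothetical names used only for this example.

```cpp
#include <memory>
#include <string>

// --- what would live in widget.h ---
class Widget {
public:
    Widget();
    ~Widget();  // defined in the .cpp, where Impl is a complete type
    void setName(const std::string& name);
    std::string name() const;

private:
    struct Impl;                  // forward declaration only
    std::unique_ptr<Impl> impl_;  // clients see a pointer, not the members
};

// --- what would live in widget.cpp ---
// Adding a member here (say, another bool) changes neither the header nor
// the size of Widget, so clients do not need to recompile.
struct Widget::Impl {
    std::string name;
    bool dirty = false;
};

Widget::Widget() : impl_(new Impl()) {}
Widget::~Widget() = default;

void Widget::setName(const std::string& name) {
    impl_->name = name;
    impl_->dirty = true;
}

std::string Widget::name() const { return impl_->name; }
```

The cost of this choice is one heap allocation per object and one extra pointer indirection on every access, plus a class that is no longer copyable unless you write the copy operations yourself.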
Addendum: can you identify other reasons why insulation is good and dependencies are bad?
Thursday, September 17, 2009
c++ Insulation (II)
So did you get the answer?
- HasA/HoldsA: if you embed a pointer or a reference to an object, you don't need to know the physical layout of that object. There is no dependency apart from the name of the object itself. So you can make a forward declaration. No more dependency, no more cats to follow. I like cats; I don't like having them come to me just for milk ;-). In the code fragment below, you don't need to include any .h file for the declaration of class B. No dependency at all.
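class B;            // forward declaration: no #include of B's header is needed

class A {
    B* pBobj;       // pointer to B: only the name B is required here
    B& rBobj;       // reference to B: likewise, B's layout is not needed
};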
Wednesday, September 16, 2009
c++ Insulation (I)
Back to the old coding series. I wanted to discuss insulation a bit, which is essential for large C++ projects. Many people enjoy encapsulation as a fundamental good practice for OO programming. Insulation is a bit less known.
Basically, insulation is a way to reduce dependencies in code at compile time (my own practical definition). What's bad about dependencies? Well, when you have a dependency you introduce some additional time to compile the unit you depend on. So you can say: hey, wait a minute, if I depend on that piece of code, this means that I need that code. In C++ this is not always the case. There are some (implicit) dependencies that you may want to be aware of and avoid. If you apply these suggestions, then your coding style will improve a lot. Let's see some of them:
- Include files. Whenever you include a .h, you depend on that code. OK, OK, you say: if I include it, I need it. Sure; but what about including the other .h files in your .c file instead of your .h file? Why should what you included be included, in turn, by whoever includes your .h file? Stop unnecessary dependencies and be fair, stella;
- Inheritance. Books say any good OO system should leverage inheritance. I say don't believe the hype. Use inheritance with care. In a lot of situations generics and templates are more efficient. BTW, when you inherit from a class, you introduce a dependency on it. Sometimes you need it, sometimes you don't. Just be aware. Everything has a cost. Don't go in that direction just listening to your heart, without following your brain;
- Layering. When your class embeds (HasA) another user-defined type, you have a dependency on it. Again: hey, if I embed that object, this means that I need it. Correct. So what can you do to avoid it?
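To make the inheritance bullet concrete, here is a minimal sketch of the template alternative. `Circle`, `Square` and `totalArea` are hypothetical names; the point is only that the two shapes share no base class and no virtual call is involved.

```cpp
#include <vector>

// Two unrelated types: no common base class, no virtual functions.
struct Circle {
    double r;
    double area() const { return 3.14159265358979 * r * r; }
};

struct Square {
    double side;
    double area() const { return side * side; }
};

// Works with any type that has an area() member. The call is resolved at
// compile time, so the compiler is free to inline it.
template <typename Shape>
double totalArea(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const Shape& s : shapes) {
        sum += s.area();
    }
    return sum;
}
```

Keep in mind that templates usually live in header files, so they bring a compile-time dependency of their own. As with everything here, it is a trade-off.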
Tuesday, September 15, 2009
Monday, September 14, 2009
Microsoft Bing going Visual
Give it a try.
"Until now, I really hadn’t had much reason to switch to its Bing decision engine, which launched back in May, for my Web searching needs. Google was doing just fine. For a while, I was making an effort to use Yahoo but Google somehow always became the default.", ZDNET
Friday, September 11, 2009
Predicting query trends
Google just published a very interesting article "On the predictability of Search Trends", where past queries are used for predicting the trends of future queries. I like this paper.
"Specifically, we have used a simple forecasting model that learns basic seasonality and general trend. For each trends sequence of interest, we take a point in time, t, which is about a year back, compute a one year forecasting for t based on historical data available at time t, and compare it to the actual trends sequence that occurs since time t. The error between the forecasting trends and the actual trends characterizes the predictability level of a sequence, and when the error is smaller than a pre-defined threshold, we denote the trends query as predictable."
This is a more general approach that you can use in many contexts. For instance, I have seen it used to understand the coverage and the precision of firing a time-based vertical result (such as news, blogs, Twitter) into the SERP.
Another observation about the paper: you can better predict whether a query has a predictable trend by enriching your query log with other time-based data such as Twitter, news and blogs.
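The procedure in the quote (forecast from the history available at time t, compare with what actually happened, apply a threshold) can be sketched as follows. This is not the paper's model: the seasonal-naive forecast, which ignores the general trend, and the mean absolute percentage error are assumptions picked to keep the example short, and all function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Seasonal-naive forecast: predict each future point as the value observed
// at the same position in the last full season of the history.
std::vector<double> seasonalNaiveForecast(const std::vector<double>& history,
                                          std::size_t season,
                                          std::size_t horizon) {
    std::vector<double> forecast;
    if (season == 0 || history.size() < season) return forecast;
    forecast.reserve(horizon);
    const std::size_t start = history.size() - season;
    for (std::size_t h = 0; h < horizon; ++h) {
        forecast.push_back(history[start + (h % season)]);
    }
    return forecast;
}

// Mean absolute percentage error; points where actual == 0 are skipped.
double meanAbsPercentageError(const std::vector<double>& forecast,
                              const std::vector<double>& actual) {
    double sum = 0.0;
    std::size_t n = 0;
    for (std::size_t i = 0; i < forecast.size() && i < actual.size(); ++i) {
        if (actual[i] == 0.0) continue;
        sum += std::fabs(forecast[i] - actual[i]) / std::fabs(actual[i]);
        ++n;
    }
    if (n == 0) return std::numeric_limits<double>::infinity();
    return sum / static_cast<double>(n);
}

// A query is called predictable when the forecast error stays under the
// chosen threshold.
bool isPredictable(const std::vector<double>& history,
                   const std::vector<double>& actual,
                   std::size_t season,
                   double threshold) {
    const std::vector<double> forecast =
        seasonalNaiveForecast(history, season, actual.size());
    if (forecast.empty()) return false;
    return meanAbsPercentageError(forecast, actual) < threshold;
}
```

Here `history` is the series known at time t, `actual` is what was observed afterwards, and `season` is the number of points in one seasonal cycle.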
Tuesday, September 8, 2009
Data is the king, not algorithms
In many years working in search, there are only a few constants that I have observed when we talk about quality. One of them is: "Data is the king, not algorithms".
If you want to improve search quality, quite often you need more data to analyze, and you do not necessarily need a better algorithm. The best situation is when you are able to "contaminate" or "enrich" your data with other information coming from different domains.
So if you are working on Web search quality, maybe you can get a huge help from other domains such as news, blogs, DNS, images, videos, etc. You can use these additional data sources to extract signals used to improve Web search itself.
In many situations, a more sophisticated algorithm will not have the same impact as an additional data source.
Tuesday, September 1, 2009
Search stats, how the search market is growing.
An interesting article from MarketWatch about search growth trends.
"Google Inc. saw its share of the worldwide search market grow 58% in July compared to the same month last year, while Yahoo Inc. saw only slight growth and Microsoft Corp. saw its own share increase 41%, according to data published Monday by comScore Inc."