Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
Friday, December 31, 2010
Divide in two equal parts
You are given an array of positive integers, and you have to find a split point such that the two parts have equal sum
Thursday, December 30, 2010
Longest increasing sub-sequence
This is a well know algorithm, but it is useful to know for interviews
Wednesday, December 29, 2010
Subsequences
Given a sequence of integers, find a continuous subsequence which maximizes the sum of its elements
Tuesday, December 28, 2010
DP bracket problem
Given an increasing sequence of integers s1, ...., sk how many balanced parenthesis strings can you generate?
Monday, December 27, 2010
Count the number of inversions in an array
Given an array A unsorted, count the number of inversions.
Sunday, December 26, 2010
Reverse a string with O(1) additional space
Old interview question
Saturday, December 25, 2010
Friday, December 24, 2010
Nesting envelopes
Given k different types of envelopes. The dimensions of envelope type i are xi × yi. You can place envelope A inside envelope B if and only if the dimensions A are
strictly smaller than the dimensions of B. how many possible combination you can have?
strictly smaller than the dimensions of B. how many possible combination you can have?
Thursday, December 23, 2010
Modify a stack to implement min/max in O(1)
Modify a stack to implement min/max operations in O(1)
Wednesday, December 22, 2010
Maximize the number of items you collect
This is a classical DP problem. A table composed of N*M cells with certain items non zero. Start from (0, 0) and you can go down or right one cell. Design an algorithm to find the maximum number of non zero items you can collect, if you are moving from upper-left corner to bottom-right corner.
What is the complexity and the space needed?
What is the complexity and the space needed?
Tuesday, December 21, 2010
Launched News in UK, and FR
Another year, another news product. Optimized local rankers, classifiers and clustering. Oh yes, clustering my dear.
Monday, December 20, 2010
Are you the average of your friend actions?
A lot of people are enthusiastic about the benefits of social signals for selecting and ranking information. My question for you is: do you believe that you are well represented by the average actions of your most active social friends? or do you also have differences and a personal identity?
Sunday, December 19, 2010
Related Searches shipped in 8 countries
What an amazing ride. We shipped 8 countries in less than one year with localized filtering, aggregation, and ranking. Italy is part of the list (yeah!!!)
More pictures here
Saturday, December 18, 2010
Maximize and expression with no parenthesis
Given an expression E of n numbers with *, + operations and no precedence among the operators, identify the set of parenthesis which maximize the results for E.
Friday, December 17, 2010
Autosuggest shipped in 5 countries
What an amazing ride. We shipped 5 countries in less than one year with localized filtering, aggregation, and ranking.
More pictures here
Thursday, December 16, 2010
Encode and Decode a document
A 'Document' is defined as a collection of 'Fields'. Some fields can be optional. A 'Field' can be a first-order type (int, char, float), or an array of first order types.
Write the code to serialize and de-serialize a collection of 'Documents'
Write the code to serialize and de-serialize a collection of 'Documents'
Wednesday, December 15, 2010
Hive, SQL like, Facebook and the map join optimization
Hive is the SQL-like processor on the top of Hadoop, the open source Map-Reduce implementation. Facebook recently proposed an interesting optimization for the sort and merge phase adopted by the Reducer: "The motivation of map join is to save the shuffle and reduce stages and do the join work only in the map stage. By doing so, when one of the join tables is small enough to fit into the memory, all the mappers can hold the data in memory and do the join work there. So all the join operations can be finished in the map stage. However there are some scaling problems with this type of map join. When thousands of mappers read the small join table from the Hadoop Distributed File System (HDFS) into memory at the same time, the join table easily becomes the performance bottleneck, causing the mappers to time out during the read operation".
This is the optimization: "Hive-1641 solves this scaling problem. The basic idea of optimization is to create a new MapReduce local task just before the original join MapReduce task. This new task reads the small table data from HDFS to an in-memory hash table. After reading, it serializes the in-memory hash table into a hashtable file. In the next stage, when the MapReduce task is launching, it uploads this hashtable file to the Hadoop distributed cache, which populates these files to each mapper's local disk. So all the mappers can load this persistent hashtable file back into memory and do the join work as before."
So they are doing some pre-computation before the join operation itself for computing an hash table which is then pre-loaded in the memory of the local mapper. The performance gains are amazing since optimized map join is 12 to 26 times faster than the previous one.
I wonder how a fast compression scheme could help here. Paolo?
This is the optimization: "Hive-1641 solves this scaling problem. The basic idea of optimization is to create a new MapReduce local task just before the original join MapReduce task. This new task reads the small table data from HDFS to an in-memory hash table. After reading, it serializes the in-memory hash table into a hashtable file. In the next stage, when the MapReduce task is launching, it uploads this hashtable file to the Hadoop distributed cache, which populates these files to each mapper's local disk. So all the mappers can load this persistent hashtable file back into memory and do the join work as before."
So they are doing some pre-computation before the join operation itself for computing an hash table which is then pre-loaded in the memory of the local mapper. The performance gains are amazing since optimized map join is 12 to 26 times faster than the previous one.
I wonder how a fast compression scheme could help here. Paolo?
Tuesday, December 14, 2010
K-th sum for two sorted arrays
You are given 2 arrays A, B sorted in decreasing order of size m and n respectively and a number k, such that 1 <= k <= m*n. You need to fine the kth largest sum(a+b) possible where a \in A, and b \in B.
Hint: i like this problem, a possible solution is by induction and the complexity is O(n+m).
Hint: i like this problem, a possible solution is by induction and the complexity is O(n+m).
Monday, December 13, 2010
Interesting google talk
Available here, with more information about current G's index compression, spanner project, and many other features.
Sunday, December 12, 2010
Out of band: Red power led on dg834g
My router dg834g decided to die on Sunday. It died in a strage way. When turned on the power led remains red. Then, the wireless router starts the power on-cycle but it goes in an infinite reboot cycle. Anyway, sometime it simply starts correctly.
It turns out that the problem was with power supply unit. I just replaced it with an equivalent one.
It turns out that the problem was with power supply unit. I just replaced it with an equivalent one.
Saturday, December 11, 2010
Out of band: vnc connection to server behind a dg834g nat
I wanted to use my dg834g for accessing my servers behind a nat with dynamic ip.
The solution was to use dynamic dns (supported natively by the router), and ultravnc with security encryption. The dynamic ip allows to have a unique name, regardless of the dynamic ip assigned to the router. Then, I needed a way to reach the vnc severs. My solution was to a assign static ip to each server in my lan (instead of relaying on dhcp), and to have different range of vnc ports for each server (from server property page). Finally, i setup a static port forwarding set of rules for those ports (for each service configure a "service" in router's content filtering, and then open the "service" in the router's firewall and send to a specific lan server).
And voila, now my servers are accessible with 2048-bit RSA keys and 256-bit AES keys security encryption.
The solution was to use dynamic dns (supported natively by the router), and ultravnc with security encryption. The dynamic ip allows to have a unique name, regardless of the dynamic ip assigned to the router. Then, I needed a way to reach the vnc severs. My solution was to a assign static ip to each server in my lan (instead of relaying on dhcp), and to have different range of vnc ports for each server (from server property page). Finally, i setup a static port forwarding set of rules for those ports (for each service configure a "service" in router's content filtering, and then open the "service" in the router's firewall and send to a specific lan server).
And voila, now my servers are accessible with 2048-bit RSA keys and 256-bit AES keys security encryption.
Friday, December 10, 2010
Out of band: vpn with a dg834g router
I wanted to have a native vpn with my dg834g router. This is the configuration. Then you can adopt a client, such as the greenbow
Facebook data store
I used this in my interviews, now not anymore. Device an infrastructure for storing data in Facebook. There are 500M users and let's assume the average number of friends is 100. Each user can post a message, or can read all the fresh messages produced by her friends.
Thursday, December 9, 2010
Google's Dremel
Greg pointed out this interesting G paper "Dremel: Interactive Analysis of Web-Scale Datasets" (PDF)"
"By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing."
The key idea is very simple, but the implementation is complex: "our goal is to store all values of a given field consecutively to improve retrieval efficiency". This is the so called columnar vs. record-oriented storage.
Then, SQL relational operations are implemented by using the following intuition: "Think of a nested record as a labeled tree, where each label corresponds to a field name. The selection operator prunes away the branches of the tree that do not satisfy the specified conditions." .. "Dremel uses a multi-level serving tree to execute queries (see Figure 7). A root server receives incoming queries, reads metadata from the tables, and routes the queries to
the next level in the serving tree."
Dremel scans quadrillions of records per month.
Wednesday, December 8, 2010
Segments with points
You are given a blue segment S of length n. There are n points p_i (0 <= i < n) distributed uniformly at random. You mark in red all the segments connecting points p_i and p_j what would be the predominant colour in S after that?
Tuesday, December 7, 2010
RTTI
I was using this in my interviews. Now no longer: "What is RTTI?"
Monday, December 6, 2010
Subarray sorted
Given an unsorted array arr[0..n-1] of size n, find the minimum length subarray arr[s..e] such that sorting this subarray makes the whole array sorted.
Sunday, December 5, 2010
Saturday, December 4, 2010
Facebook friend suggestions
I used this in my past interviews: now not any longer. Devise an algorithm for suggesting friends in Facebook.
Friday, December 3, 2010
Queries and time
You are given a log of queries annotated with a time-stamp, where a single entry is
. Define a data-structure for efficiently identify the top k most frequent queries in the time interval [t0, t0 + \tau]
Thursday, December 2, 2010
Celebrity twitters
In a n twitter set, a celebrity is defined as someone who is followed by everyone, but who follows no one. You need to identify the celebrities by asking to twitters the question: "do you know person x?" The answer will be yes/no. Find the celebrity by asking only O(n) questions.
Hint: how many information can you infer with one question?
Hint: how many information can you infer with one question?
Wednesday, December 1, 2010
Ants in a row
You have a set of ants moving on an horizontal segment of length n. Each ant can randomically move either right or left. When an ant takes a direction, she stays in that direction until she either drops out of the segment or it bumps with another ant. In this case, the ant changes her direction. Can you formalize the system and describe what will happen after t iterations?
(thanks giuseppe, muthu)
(thanks giuseppe, muthu)
Subscribe to:
Posts (Atom)