Friday, September 19, 2014

Hands on big data - Crash Course on Spark - PageRank - lesson 8

Let's compute PageRank. Below you will find the definition.

Assume we have a pair RDD of (url, neighbors) and one of (url, rank), where each rank is initialized to 1 (or uniformly at random). Then, for a fixed number of iterations (or until the ranks stop changing significantly between two consecutive iterations):


we join links and ranks to form (url, (links, rank)), and flatMap each page's rank contribution, rank / |links|, to its destination links. Then we reduceByKey on the destination, using + as the reduction. The new rank is computed as 0.15 + 0.85 * contributed rank
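To see the logic without a cluster, the same iteration can be sketched on plain Scala collections, which share map, flatMap, and groupBy with RDDs. The toy graph below is made up; only the 0.15 + 0.85 damping formula comes from the algorithm above.

```scala
// Toy link graph: page -> outgoing neighbors (made-up urls).
val links: Map[String, Seq[String]] = Map(
  "a" -> Seq("b", "c"),
  "b" -> Seq("c"),
  "c" -> Seq("a")
)

// Ranks initialized to 1, as in the text.
var ranks: Map[String, Double] = links.map { case (url, _) => url -> 1.0 }

for (_ <- 1 to 10) {
  // "join" links and ranks into (url, (neighbors, rank)), then spread
  // rank / |neighbors| to each destination (the flatMap of the text).
  val contribs = links.toSeq.flatMap { case (url, neighbors) =>
    val rank = ranks(url)
    neighbors.map(dest => dest -> rank / neighbors.size)
  }
  // "reduceByKey" on the destination with + as the reduction,
  // then apply the damping formula 0.15 + 0.85 * contributed rank.
  ranks = contribs.groupBy(_._1).map { case (url, contributions) =>
    url -> (0.15 + 0.85 * contributions.map(_._2).sum)
  }
}

println(ranks)
```

On RDDs, the groupBy-plus-sum pair becomes a single reduceByKey(_ + _).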


The problem is that each join needs a full shuffle over the network. A way to reduce this overhead is to partition the links with a HashPartitioner, in this case on 8 partitions,

and avoid the repeated shuffle.
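What the partitioner buys us can be sketched locally: a key always hashes to the same bucket, so two datasets partitioned with the same partitioner can be joined bucket by bucket with no data movement. The bucketing rule below mirrors the key.hashCode modulo numPartitions computation of a HashPartitioner; the datasets are made up.

```scala
// What a HashPartitioner on 8 partitions does, sketched locally.
val numPartitions = 8

def partitionOf(key: String): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod  // hashCode may be negative
}

val links = Seq("a" -> Seq("b"), "b" -> Seq("a"), "c" -> Seq("a"))
val ranks = Seq("a" -> 1.0, "b" -> 1.0, "c" -> 1.0)

val linkBuckets = links.groupBy { case (url, _) => partitionOf(url) }
val rankBuckets = ranks.groupBy { case (url, _) => partitionOf(url) }

// Every url lands in the same bucket in both datasets, so a join can
// proceed bucket by bucket with no shuffle.
for ((url, _) <- ranks)
  assert(linkBuckets(partitionOf(url)).exists(_._1 == url))
```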

However, one can directly use the SparkPageRank code distributed with Spark


Here I take a toy dataset


And compute the PageRank


Other datasets are here

https://snap.stanford.edu/data/web-Google.html




Thursday, September 18, 2014

Wednesday, September 17, 2014

Hands on big data - Crash Course on Spark Cache & Master - lesson 6

Two very important aspects of Spark are in-memory caching, which avoids re-computations, and the ability to connect to a master from the spark-shell

Caching
Caching is pretty simple: just call .persist() at the end of the desired computation. There are also options to persist a computation on disk, or to replicate it among multiple nodes. There is also the option to persist objects in in-memory shared file systems such as Tachyon

Connect a master
There are two ways to connect to a master

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
This will connect to a local master with 8 cores

$ ./bin/spark-shell --master local[8]


Tuesday, September 16, 2014

Hands on big data - Crash Course on Spark - Word count - lesson 5

Let's do a simple exercise of word counting. Spark will automatically parallelize the code


1. Load the Bible
2. Create a flatMap where every line is mapped to its words
3. Map each word to a tuple (word, counter), where the counter simply starts at 1
4. Reduce by key, where the reduce operation is just +
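The four steps can be sketched on plain Scala collections, which share the flatMap / map vocabulary with RDDs; a couple of made-up verse fragments stand in for the Bible file.

```scala
// Made-up lines standing in for the Bible text loaded in the post.
val lines = Seq(
  "in the beginning",
  "the word was the word"
)

val counts = lines
  .flatMap(line => line.split(" "))  // 2. each line -> its words
  .map(word => (word, 1))            // 3. (word, 1)
  .groupBy(_._1)                     // 4. reduce by key ...
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // ... with +

println(counts)
```

On an RDD the last two steps are a single reduceByKey(_ + _).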



Pretty simple, isn't it?




Monday, September 15, 2014

Hands on big data - Crash Course on Spark - Optimizing Transformations - lesson 4


1. Take a list of names
2. Create buckets by initials
3. Group by keys
4. Map each initial into the set of names, and get the size
5. Collect the results into the master as an array

This code is not super-optimized. For instance, we might want to force the number of partitions - the default is 64MB chunks

Then we can avoid counting duplicates



or, better, count the unique names with a reduceByKey where the reduce function is the sum



These optimizations reduce the information sent over the wire.
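The difference between the two versions can be sketched locally with made-up names: the naive version groups the full names by initial before counting them, while the reduceByKey-style version pre-aggregates (initial, 1) counters, so that in Spark only small pairs travel over the wire.

```scala
// Made-up names. Naive version: group the full names by initial and take
// the size of each group -- in Spark this ships every name over the wire
// before counting.
val names = Seq("anna", "alice", "anna", "bob", "bill", "carol")

val naive: Map[Char, Int] =
  names.groupBy(_.head).map { case (initial, group) => initial -> group.size }

// reduceByKey-style version: map each name to (initial, 1) and sum the
// counters. In Spark the partial sums are combined per partition first,
// so only small (initial, count) pairs cross the network.
val optimized: Map[Char, Int] = names
  .map(name => (name.head, 1))
  .groupBy(_._1)
  .map { case (initial, ones) => initial -> ones.map(_._2).sum }

assert(naive == optimized)  // same answer, less traffic in the RDD case
```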

I found this example very instructive [Deep Dive : Spark internals]; however, you should add some string sanity checks


and make sure that you don't run into heap problems if you fix the partitions





You will find more transformations here

Sunday, September 14, 2014

Hands on big data - Crash Course on Spark Map & Reduce and other Transformations - lesson 3

Second lesson. 

  1. Let's start by loading some data into the Spark context sc (a predefined object that gives access to RDDs)
  2. Then do a map & reduce. It's pretty simple using Scala's functional notation: we map the pages s into their lengths, and then reduce with a lambda function that sums two elements. That's it: we have the total length of the pages. Spark will allocate the optimal number of mappers and reducers on your behalf. However, if you prefer, this can be tuned manually.
  3. If we use map(s => 1), then we can compute the total number of pages
  4. If we want to sort the pages alphabetically, we can use sortByKey() and collect() the results into the master as an Array
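The steps above can be sketched on plain Scala collections; a few made-up page strings stand in for the dataset loaded into sc.

```scala
// Made-up pages standing in for the dataset loaded into sc.
val pages = Seq("some page", "another page", "a third page")

// 2. Map each page to its length, then reduce with a lambda that sums.
val totalLength = pages.map(s => s.length).reduce((a, b) => a + b)

// 3. map(s => 1) to count the pages.
val totalPages = pages.map(s => 1).reduce((a, b) => a + b)

// 4. Sort alphabetically and collect into an Array.
val sorted: Array[String] = pages.sorted.toArray

println(s"$totalLength characters in $totalPages pages")
```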



As you can see, this is way easier than plain vanilla Hadoop. We will play with other transformations in the next lessons

Saturday, September 13, 2014

Hands on big data - Crash Course on Scala and Spark - lesson 2

Let's do some functional programming in Spark. First download a pre-compiled binary


Launch the scala shell



Ready to go


Let's warm up playing with types, functions, and a bit of lambda.

We also have the SparkContext. I downloaded a dataset (here, wiki pagecounts)

and loaded the data into a Spark RDD.

Now let's do some computation


Big data is nothing new

How many of those catchy words do you know?

PVM, MPI, NOW, GRID, SKELETON, WAMM

Thursday, September 11, 2014

Hands-on big data - fire a spark cluster on Amazon AWS in 7 minutes - lesson 1

After logging in to aws.amazon.com, launch the management console




And select EC2


Then, launch an instance


To begin with, a good option is to get a free machine


Select the first one


Select all the default options and launch the machine. Remember to create the key pair


Download the keys

Get PuTTY and PuTTYgen. Launch PuTTYgen, load the spark-test.pem, and save the private key (it's ok to have it with no password for now)




Check the launched instance

Get the public ip



Fire up PuTTY; the login is typically ec2-user@IP




And add the ppk file like this

Log in to the new machine




Untar
  


 Enter the directory










Get private and public keys


And download them


Modify your ~/.bashrc





You need to copy the spark-test.pem into the ~/.ssh directory - WinSCP is your friend



 

./spark-ec2 -k spark-test -i ~/.ssh/spark-test.pem --hadoop-major-version=2 --spark-version=1.1.0 launch -s 1 ag-spark







If it does not work, make sure that the region (-r) is the same as that of the key on your running machine

GO AND TAKE A COFFEE, the cluster is booting – it needs time.

Fire a browser and check the status

Log in to the cluster




Sunday, August 31, 2014

Search is dead, Awareness is the new king

Search is dead. Well, perhaps it is not, but I am feeling provocative this morning. The first search engine was launched in 1993. Since then we have made incredible progress in terms of content discovery, indexing, scalability, ranking, machine learning, variety, and UX. However, the paradigm is still the same: you need to have an idea of what you are searching for well before you start submitting keyword queries. Exactly like in 1993.

I think that Awareness is the new king. By awareness, I mean something that sends you the information you want by working on your behalf, with no need for explicit searches. To be honest, the topic is not new: Pattie Maes has worked on Intelligent Software Agents since 1987. However, I don't see a disruptive revolution. We still receive alerts via Google Alerts, which is based on explicit queries. Some initial steps - but not a revolution - are shown in Google Now, where the location is the implicit query. Some other changes are in Google Glass, where the captured image might be a surrogate for the query. It is still a world where 95% of our actions are described and learned via keywords.

What do you think?




Saturday, August 30, 2014

Internet of things: what is new?

We live in a world of catchy words. What was called Parallel Computation yesterday became NoW (Network of Workstations), then Grid, and then Cloud. Same thing, different words.
Another example is IoT, which is pretty similar to Home Automation, something that has been discussed for the past 40 years.

So what is cool with IoT?

Frankly, I don't know. However, I got an idea there and filed a patent. Let's see what happens.

Friday, August 22, 2014

Assignment matching problem

Assume that we have n workers and n tasks to be completed. For each pair (worker, task) we know the cost that must be paid for that worker to complete that task. The goal is to complete all the tasks while minimizing the total cost, under the condition that each worker can execute only one task and vice versa.

Friday, August 15, 2014

Antonio Gulli Adaboost

Adaboost is one of my favorite Machine Learning algorithms. The idea is quite intriguing: you start from a set of weak classifiers and learn how to linearly combine them so that the error is reduced. The result is a strong classifier built by boosting the weak classifiers.

Mathematically, given a set of labels y_i = { +1, -1 } and a training dataset x_i, Adaboost minimizes the exponential loss function sum_i exp(-y_i * f(x_i)). The function f = sum_t (alpha_t * h_t) is a linear combination of the weak classifiers h_t with weights alpha_t. For those loving Optimization Theory, Adaboost is a classical application of Gradient Descent.
The algorithm is quite simple; it was included in the top 10 data mining algorithms in 2007, and its authors received the Gödel Prize in 2003. After Adaboost, Boosting became quite popular in the data mining community, with applications in Ranking and Clustering.
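As an illustration of the update rules above - not the C++ code linked below - here is a compact sketch with one-dimensional decision stumps as the weak classifiers; the toy dataset, the stump thresholds, and the number of rounds are all made up.

```scala
// A weak classifier: threshold on the line, with a polarity.
case class Stump(threshold: Double, polarity: Int) {
  def predict(x: Double): Int = if (x < threshold) polarity else -polarity
}

// Made-up, separable toy data: points on a line with labels +1 / -1.
val xs = Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
val ys = Array(1, 1, 1, -1, -1, -1)
val n  = xs.length

// Candidate stumps: thresholds between points, both polarities.
val stumps = for {
  t <- Seq(0.5, 1.5, 2.5, 3.5, 4.5)
  p <- Seq(1, -1)
} yield Stump(t, p)

var weights  = Array.fill(n)(1.0 / n)          // uniform initial weights
var ensemble = List.empty[(Double, Stump)]     // (alpha_t, h_t)

for (_ <- 1 to 3) {
  // Pick the stump minimizing the weighted error on the current weights.
  val (best, err) = stumps.map { s =>
    val e = (0 until n).map(i => if (s.predict(xs(i)) != ys(i)) weights(i) else 0.0).sum
    (s, e)
  }.minBy(_._2)
  val eps   = math.max(err, 1e-10)             // clamp to avoid division by zero
  val alpha = 0.5 * math.log((1 - eps) / eps)  // weight of this weak classifier
  ensemble = (alpha, best) :: ensemble
  // Re-weight: misclassified points get exponentially more attention.
  weights = Array.tabulate(n)(i => weights(i) * math.exp(-alpha * ys(i) * best.predict(xs(i))))
  val z = weights.sum
  weights = weights.map(_ / z)
}

// Strong classifier: sign of f(x) = sum_t alpha_t * h_t(x).
def classify(x: Double): Int = {
  val f = ensemble.map { case (a, s) => a * s.predict(x) }.sum
  if (f >= 0) 1 else -1
}
```

On this clean, separable toy set the first stump already fits the data; the re-weighting step starts to matter as soon as some points are misclassified.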

Here you have the code for AdaBoosting in C++ and Boost.


Antonio Gulli KMeans

K-means is a classical clustering algorithm.
Here you have C++ code for K-means clustering.

(Edit: 12/05/2013)
See also my more recent posting A new lighter implementation of K-means


Sunday, August 10, 2014

Find a Hamiltonian cycle

The problem is NP-complete, so we provide an exponential backtracking solution.

Monday, August 4, 2014

Implement a DFS visit

Try to be independent of the underlying graph implementation

Sunday, August 3, 2014

Implement a BFS visit in C++

Try to be independent of the underlying graph implementation

Saturday, August 2, 2014

Implement a graph in C++

Use an adjacency matrix. If possible, use templates for storing node attributes and link attributes. Is this the best implementation?


Friday, August 1, 2014

Saturday, June 14, 2014

Here is why Amazon should have personal shoppers

I frequently buy objects in domains where I am not really an expert.

For instance, I recently bought some Bluetooth speakers but I was not sure about the brand and the quality of the deal I was making.

In those situations, I would certainly pay a few bucks to someone who is a well-known expert and is willing to give me advice on what to buy.

Personal shoppers are common for expensive objects, but I am thinking about something done on a massive scale. I think that Amazon would be the perfect place for making this work at scale

BTW, ranking personal shoppers is a clear machine learning problem

Friday, June 13, 2014

Here is why sending money should be as easy as chat

Today chat is king and its growth is incredible. No surprise: IRC predates the Web, and ICQ came before Google.

So why don't we use chat messages to pay our bills and send money? The protocol would be easy: you receive a payment request in a chat message, and you answer with another chat message that includes your public key.

Shouldn't this be the natural evolution for WhatsApp, Line, Viber, WeChat, and all the other instant messengers?

After all, you have your phone in your pocket more frequently than your credit card in your wallet.

Thursday, June 12, 2014

Here is why linkedin should be linked to search

Search is about leveraging users' interactions. Every time you submit a query and select a few links, you provide valuable information to the search engine, which is later used for ranking. Normally this information is provided implicitly. However, this is not the only possible interaction model.

For instance, imagine that you are a top IT executive with a secret passion for snowboarding. Why not allow you to answer search queries related to this topic?

Another example: imagine that you are a banker working in finance with a secret passion for interior design. You'd love to share your expertise with other people for free, but with the current search model this is not easy.

Today, everyone has a Linkedin profile. So why don't we extend the search model so that I can register myself for a specific search query and offer my expertise on that topic to future users who submit the same query?

I think that the problem of spam is largely avoided if we adopt Linkedin profiles for registering users.

Also, I think that users would love the serendipity of being in touch with other people who share the same interests.

BTW, did I say that I love aquariums, ham radio, and late-80s music?





Wednesday, June 11, 2014

Here is why Amazon should start to sell money

Amazon is well known for selling everything (including baby sitters now ;) So why don't they start selling money?
The idea could be a little crazy but here are the motivations:

  1. They have all users' credit history, so it is very easy for them to assess a credit rating
  2. They have all users' credit cards, so it is very easy for them to assess a credit rating
  3. They have a huge number of transactions, so they can easily move money and let users borrow
This would be a win-win situation. Amazon can increase the number of transactions by allowing users to borrow money. Users can find very interesting rates. The whole system can benefit because the credit rating will be based on dynamically evolving parameters, constantly updated

So when will Amazon become a bank?

PS: the same suggestion applies to Ebay and Alibaba

Tuesday, June 10, 2014

Jump array

Given an array of non-negative numbers, start at the first element and reach the end with the minimum number of jumps. A jump cannot exceed the length stored in the current position.
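A dynamic-programming sketch in Scala (the sample array is made up): jumps(i) holds the fewest jumps needed to reach cell i.

```scala
// jumps(i) = minimum number of jumps needed to reach position i.
def minJumps(a: Array[Int]): Int = {
  val INF = Int.MaxValue
  val jumps = Array.fill(a.length)(INF)
  jumps(0) = 0
  // From each reachable cell i, relax every cell within a(i) steps.
  for (i <- a.indices; j <- i + 1 to math.min(i + a(i), a.length - 1))
    if (jumps(i) != INF && jumps(i) + 1 < jumps(j))
      jumps(j) = jumps(i) + 1
  jumps(a.length - 1)
}

println(minJumps(Array(2, 3, 1, 1, 4)))  // 2: jump 0 -> 1 -> 4
```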


Monday, June 9, 2014

LIS

Given an array of integers, find the longest increasing subsequence
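A classic O(n^2) dynamic-programming sketch in Scala (the sample array is made up): lis(i) is the length of the longest increasing subsequence ending at position i.

```scala
// lis(i) = length of the longest increasing subsequence ending at i.
def lisLength(a: Array[Int]): Int = {
  val lis = Array.fill(a.length)(1)
  for (i <- a.indices; j <- 0 until i)
    if (a(j) < a(i)) lis(i) = math.max(lis(i), lis(j) + 1)
  lis.max
}

println(lisLength(Array(10, 9, 2, 5, 3, 7, 101, 18)))  // 4: e.g. 2, 3, 7, 18
```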

Saturday, June 7, 2014

LCS

Given two strings, find the longest common substring

Friday, June 6, 2014

Palindromes

Given a string, find the minimum number of characters to be inserted to convert the string into a palindrome

Wednesday, June 4, 2014

Palindrome String

Compute the longest palindrome string

Tuesday, June 3, 2014

Dynamic Programming

Given an array of integers, find the longest bitonic sequence

Monday, June 2, 2014

Coin change

Given n dollars, how many different ways are there to make change with a set of coins?
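A dynamic-programming sketch in Scala (the denominations and the amount are made up): ways(s) counts the ways to form sum s, and iterating over the coins in the outer loop counts each combination once regardless of order.

```scala
// ways(s) = number of ways to form sum s with the given coins.
def countChange(amount: Int, coins: Seq[Int]): Long = {
  val ways = Array.fill(amount + 1)(0L)
  ways(0) = 1
  // Coins in the outer loop: each combination is counted once.
  for (c <- coins; s <- c to amount)
    ways(s) += ways(s - c)
  ways(amount)
}

println(countChange(4, Seq(1, 2, 3)))  // 4: 1+1+1+1, 1+1+2, 2+2, 1+3
```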

Saturday, May 31, 2014

Friday, May 30, 2014

Thursday, May 29, 2014

Integer Knapsack

A knapsack has max capacity C, and there are n items, each with weight w_i and value v_i. Maximize the value in the knapsack without exceeding its max capacity
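A dynamic-programming sketch in Scala (the weights, values, and capacity are made up): best(c) is the maximum value achievable with capacity c, and the capacity loop runs backwards so each item is used at most once.

```scala
// best(c) = maximum value achievable with capacity c.
def knapsack(capacity: Int, w: Array[Int], v: Array[Int]): Int = {
  val best = Array.fill(capacity + 1)(0)
  // Reverse capacity loop: each item is taken at most once (0/1 knapsack).
  for (i <- w.indices; c <- capacity to w(i) by -1)
    best(c) = math.max(best(c), best(c - w(i)) + v(i))
  best(capacity)
}

println(knapsack(10, Array(5, 4, 6, 3), Array(10, 40, 30, 50)))  // 90
```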

Wednesday, May 28, 2014

Jump stairs

There is a staircase with n steps. You can climb either one step or two steps at a time. How many different ways are there to climb the staircase?
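The count follows the Fibonacci recurrence, since the last move is either a single step from step n-1 or a double step from step n-2; a sketch in Scala:

```scala
// ways(n) = ways(n-1) + ways(n-2), with ways(0) = ways(1) = 1.
def climbWays(n: Int): Long = {
  if (n <= 1) 1L
  else {
    var (prev, cur) = (1L, 1L)
    for (_ <- 2 to n) {
      val next = prev + cur
      prev = cur
      cur = next
    }
    cur
  }
}

println(climbWays(4))  // 5: 1+1+1+1, 1+1+2, 1+2+1, 2+1+1, 2+2
```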

Tuesday, May 27, 2014

Monday, May 26, 2014

Dice

Given a number of dice, count how many ways there are to get sum s

Sunday, May 25, 2014

Palindromes

Given a string find the minimum number of characters to be inserted for converting the string into a palindrome.

Saturday, May 24, 2014

Bridge matching

n cities are on the northern bank of a river and n cities are on the southern bank. You can build a bridge only between cities with the same number. Build as many bridges as possible without any two bridges crossing.

Friday, May 23, 2014

Sum subset

Given an array of integers of size n, partition it in such a way that the two subsets have equal sum s.

Thursday, May 22, 2014

Partition set

Partition a multiset S of positive integers into two subsets S1 and S2, such that the sum of the numbers in S1 equals the sum of the numbers in S2

Wednesday, May 21, 2014

Segment partition

Given a segment of integer length n, cut it into different integer parts in such a way as to maximize the product of the lengths of all parts

Tuesday, May 20, 2014

Cutting a rod

Given a rod of length n and a vector of prices for different lengths, cut the rod so as to maximize the gain

Monday, May 19, 2014

Max matrix 0/1

Given a matrix consisting only of 0s and 1s, find the maximum-size square submatrix of all 1s.

Sunday, May 18, 2014

Matrix parenthesization

Given a set of m matrices, find the most efficient way of multiplying them

Saturday, May 17, 2014

Max submatrix

Given a matrix of size n x n, find the maximum-sum rectangle

Friday, May 16, 2014

BST and dynamic programming

Given a binary search tree, find the size of the largest independent set of nodes

Thursday, May 15, 2014

Games of coins

A set of coins is aligned in a row, and two players take turns picking one coin from either end. Maximize the value of the coins picked by the first player

Wednesday, May 14, 2014

Maximum value contiguous subsequence

Given an array of real numbers, find a contiguous subsequence with max sum S
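This is Kadane's algorithm territory; a sketch in Scala (the sample array is made up): the best sum ending at the current element either extends the previous run or restarts.

```scala
// Kadane's algorithm: the best sum ending at x either extends the
// previous run or starts fresh at x.
def maxSubarray(a: Array[Double]): Double = {
  var best = a(0)
  var endingHere = a(0)
  for (x <- a.drop(1)) {
    endingHere = math.max(x, endingHere + x)
    best = math.max(best, endingHere)
  }
  best
}

println(maxSubarray(Array(-2.0, 1.0, -3.0, 4.0, -1.0, 2.0, 1.0, -5.0, 4.0)))  // 6.0
```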

Tuesday, May 13, 2014

Stock prices

Given a histogram array of unsigned integers encoding the price of a stock during the previous year, compute the area of the largest rectangle contained in the histogram

Monday, May 12, 2014

Ship battle

Given a matrix M and an array V, match the array in the matrix

Sunday, May 11, 2014

Scheduling

Given n jobs, each with a processing time t_i, a profit p_i, and a deadline d_i, maximize the profit

Saturday, May 10, 2014

Boolean parenthesization

Given a boolean expression of AND, OR, XOR, TRUE, and FALSE, find the number of ways to parenthesize it so that it evaluates to true

Friday, May 9, 2014

Balanced partition

Given an array of integers between 0 and M, divide the integers into two sets such that the difference of their sums is minimized