Friday, September 19, 2014

Hands on big data - Crash Course on Spark - PageRank - lesson 8

Let's compute PageRank. Below you will find the definition.

Assume we have a pair RDD of (url, neighbors) and one of (url, rank), where each rank is initialized to 1 (or uniformly at random). Then, for a fixed number of iterations (or until the ranks stop changing significantly between two consecutive iterations):


we join links and ranks to form (url, (links, rank)), and flatMap each page's rank contribution, rank / |links|, to its destination links. Then we reduceByKey on the destination, using + as the reduction. The new rank is computed as 0.15 + 0.85 * contributed rank
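To see the logic without a cluster, the same iteration can be sketched on plain Scala collections, which share map, flatMap, and groupBy with RDDs. The toy graph below is made up; only the 0.15 + 0.85 damping formula comes from the algorithm above.

```scala
// Toy link graph: page -> outgoing neighbors (made-up urls).
val links: Map[String, Seq[String]] = Map(
  "a" -> Seq("b", "c"),
  "b" -> Seq("c"),
  "c" -> Seq("a")
)

// Ranks initialized to 1, as in the text.
var ranks: Map[String, Double] = links.map { case (url, _) => url -> 1.0 }

for (_ <- 1 to 10) {
  // "join" links and ranks into (url, (neighbors, rank)), then spread
  // rank / |neighbors| to each destination (the flatMap of the text).
  val contribs = links.toSeq.flatMap { case (url, neighbors) =>
    val rank = ranks(url)
    neighbors.map(dest => dest -> rank / neighbors.size)
  }
  // "reduceByKey" on the destination with + as the reduction,
  // then apply the damping formula 0.15 + 0.85 * contributed rank.
  ranks = contribs.groupBy(_._1).map { case (url, contributions) =>
    url -> (0.15 + 0.85 * contributions.map(_._2).sum)
  }
}

println(ranks)
```

On RDDs, the groupBy-plus-sum pair becomes a single reduceByKey(_ + _).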


The problem is that each join needs a full shuffle over the network. A way to reduce this overhead is to partition the links with a HashPartitioner, in this case on 8 partitions,

and avoid the repeated shuffle.
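What the partitioner buys us can be sketched locally: a key always hashes to the same bucket, so two datasets partitioned with the same partitioner can be joined bucket by bucket with no data movement. The bucketing rule below mirrors the key.hashCode modulo numPartitions computation of a HashPartitioner; the datasets are made up.

```scala
// What a HashPartitioner on 8 partitions does, sketched locally.
val numPartitions = 8

def partitionOf(key: String): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod  // hashCode may be negative
}

val links = Seq("a" -> Seq("b"), "b" -> Seq("a"), "c" -> Seq("a"))
val ranks = Seq("a" -> 1.0, "b" -> 1.0, "c" -> 1.0)

val linkBuckets = links.groupBy { case (url, _) => partitionOf(url) }
val rankBuckets = ranks.groupBy { case (url, _) => partitionOf(url) }

// Every url lands in the same bucket in both datasets, so a join can
// proceed bucket by bucket with no shuffle.
for ((url, _) <- ranks)
  assert(linkBuckets(partitionOf(url)).exists(_._1 == url))
```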

However, one can directly use the SparkPageRank code distributed with Spark


Here I take a toy dataset


And compute the PageRank


Other datasets are here

https://snap.stanford.edu/data/web-Google.html




Thursday, September 18, 2014

Wednesday, September 17, 2014

Hands on big data - Crash Course on Spark Cache & Master - lesson 6

Two very important aspects of Spark are in-memory caching, which avoids re-computations, and the ability to connect to a master from the spark-shell

Caching
Caching is pretty simple: just call .persist() at the end of the desired computation. There are also options to persist a computation on disk, or to replicate it among multiple nodes. There is also the option to persist objects in in-memory shared file systems such as Tachyon

Connect a master
There are two ways to connect to a master

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
This will connect to a local master with 8 cores

$ ./bin/spark-shell --master local[8]


Tuesday, September 16, 2014

Hands on big data - Crash Course on Spark - Word count - lesson 5

Let's do a simple exercise of word counting. Spark will automatically parallelize the code


1. Load the Bible
2. Create a flatMap where every line is mapped to its words
3. Map each word to a tuple (word, counter), where the counter simply starts at 1
4. Reduce by key, where the reduce operation is just +
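The four steps can be sketched on plain Scala collections, which share the flatMap / map vocabulary with RDDs; a couple of made-up verse fragments stand in for the Bible file.

```scala
// Made-up lines standing in for the Bible text loaded in the post.
val lines = Seq(
  "in the beginning",
  "the word was the word"
)

val counts = lines
  .flatMap(line => line.split(" "))  // 2. each line -> its words
  .map(word => (word, 1))            // 3. (word, 1)
  .groupBy(_._1)                     // 4. reduce by key ...
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // ... with +

println(counts)
```

On an RDD the last two steps are a single reduceByKey(_ + _).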



Pretty simple, isn't it?




Monday, September 15, 2014

Hands on big data - Crash Course on Spark - Optimizing Transformations - lesson 4


1. Take a list of names
2. Create buckets by initials
3. Group by keys
4. Map each initial into the set of names, and get the size
5. Collect the results into the master as an array

This code is not super-optimized. For instance, we might want to force the number of partitions - the default is 64MB chunks

Then we can avoid counting duplicates



or, better, count the unique names with a reduceByKey where the reduce function is the sum



These optimizations reduce the information sent over the wire.
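The difference between the two versions can be sketched locally with made-up names: the naive version groups the full names by initial before counting them, while the reduceByKey-style version pre-aggregates (initial, 1) counters, so that in Spark only small pairs travel over the wire.

```scala
// Made-up names. Naive version: group the full names by initial and take
// the size of each group -- in Spark this ships every name over the wire
// before counting.
val names = Seq("anna", "alice", "anna", "bob", "bill", "carol")

val naive: Map[Char, Int] =
  names.groupBy(_.head).map { case (initial, group) => initial -> group.size }

// reduceByKey-style version: map each name to (initial, 1) and sum the
// counters. In Spark the partial sums are combined per partition first,
// so only small (initial, count) pairs cross the network.
val optimized: Map[Char, Int] = names
  .map(name => (name.head, 1))
  .groupBy(_._1)
  .map { case (initial, ones) => initial -> ones.map(_._2).sum }

assert(naive == optimized)  // same answer, less traffic in the RDD case
```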

I found this example very instructive [Deep Dive : Spark internals]; however, you should add some string sanity checks


and make sure that you don't run into heap problems if you fix the partitions





You will find more transformations here

Sunday, September 14, 2014

Hands on big data - Crash Course on Spark Map & Reduce and other Transformations - lesson 3

Second lesson. 

  1. Let's start by loading some data into the Spark context sc (a predefined object that gives access to RDDs)
  2. Then do a map & reduce. It's pretty simple using Scala's functional notation: we map the pages s into their lengths, and then reduce with a lambda function that sums two elements. That's it: we have the total length of the pages. Spark will allocate the optimal number of mappers and reducers on your behalf. However, if you prefer, this can be tuned manually.
  3. If we use map(s => 1), then we can compute the total number of pages
  4. If we want to sort the pages alphabetically, we can use sortByKey() and collect() the results into the master as an Array
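The steps above can be sketched on plain Scala collections; a few made-up page strings stand in for the dataset loaded into sc.

```scala
// Made-up pages standing in for the dataset loaded into sc.
val pages = Seq("some page", "another page", "a third page")

// 2. Map each page to its length, then reduce with a lambda that sums.
val totalLength = pages.map(s => s.length).reduce((a, b) => a + b)

// 3. map(s => 1) to count the pages.
val totalPages = pages.map(s => 1).reduce((a, b) => a + b)

// 4. Sort alphabetically and collect into an Array.
val sorted: Array[String] = pages.sorted.toArray

println(s"$totalLength characters in $totalPages pages")
```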



As you can see, this is way easier than plain vanilla Hadoop. We will play with other transformations in the next lessons

Saturday, September 13, 2014

Hands on big data - Crash Course on Scala and Spark - lesson 2

Let's do some functional programming in Spark. First download a pre-compiled binary


Launch the scala shell



Ready to go


Let's warm up playing with types, functions, and a bit of lambda.

We also have the SparkContext. I downloaded a dataset (here, wiki pagecounts)

and loaded the data into a Spark RDD.

Now let's do some computation


Big data is nothing new

How many of those catchy words do you know?

PVM, MPI, NOW, GRID, SKELETON, WAMM

Thursday, September 11, 2014

Hands-on big data - fire a spark cluster on Amazon AWS in 7 minutes - lesson 1

After logging in to aws.amazon.com, launch the management console




And select EC2


Then, launch an instance


To begin with, a good option is to get a free machine


Select the first one


Select all the default options and launch the machine. Remember to create the key pair


Download the keys

Get PuTTY and PuTTYgen. Launch PuTTYgen, load the spark-test.pem, and save the private key (it's ok to have it with no password for now)




Check the launched instance

Get the public ip



Fire up PuTTY; the login is typically ec2-user@IP




And add the ppk file like this

Log in to the new machine




Untar
  


 Enter the directory










Get private and public keys


And download them


Modify your ~/.bashrc





You need to copy the spark-test.pem into the ~/.ssh directory - WinSCP is your friend



 

./spark-ec2 -k spark-test -i ~/.ssh/spark-test.pem --hadoop-major-version=2 --spark-version=1.1.0 launch -s 1 ag-spark







If it does not work, make sure that the region (-r) is the same as that of the key on your running machine

GO AND TAKE A COFFEE, the cluster is booting – it needs time.

Fire a browser and check the status

Log in to the cluster




Sunday, August 31, 2014

Search is dead, Awareness is the new king

Search is dead. Well, perhaps it is not, but I am feeling provocative this morning. The first search engine was launched in 1993. Since then we have made incredible progress in terms of content discovery, indexing, scalability, ranking, machine learning, variety, and UX. However, the paradigm is still the same: you need to have an idea of what you are searching for well before you start submitting keyword queries. Exactly like in 1993.

I think that Awareness is the new king. By awareness, I mean something that sends you the information you want by working on your behalf, with no need for explicit searches. To be honest, the topic is not new: Pattie Maes has worked on Intelligent Software Agents since 1987. However, I don't see a disruptive revolution. We still receive alerts via Google Alerts, which is based on explicit queries. Some initial steps - but not a revolution - are shown in Google Now, where the location is the implicit query. Some other changes are in Google Glass, where the captured image might be a surrogate for the query. It is still a world where 95% of our actions are described and learned via keywords.

What do you think?




Saturday, August 30, 2014

Internet of things: what is new?

We live in a world of catchy words. What was called Parallel Computation yesterday became NoW (Network of Workstations), then Grid, and then Cloud. Same thing, different words.
Another example is IoT, which is pretty similar to Home Automation, something that has been discussed for the past 40 years.

So what is cool with IoT?

Frankly, I don't know. However, I got an idea there and filed a patent. Let's see what happens.

Friday, August 22, 2014

Assignment matching problem

Assume that we have n workers and n tasks to be completed. For each pair (worker, task) we know the cost that must be paid for that worker to complete that task. The goal is to complete all the tasks while minimizing the total cost, under the condition that each worker can execute only one task and vice versa.

Friday, August 15, 2014

Antonio Gulli Adaboost

Adaboost is one of my favorite Machine Learning algorithms. The idea is quite intriguing: you start from a set of weak classifiers and learn how to linearly combine them so that the error is reduced. The result is a strong classifier built by boosting the weak classifiers.

Mathematically, given a set of labels y_i = { +1, -1 } and a training dataset x_i, Adaboost minimizes the exponential loss function sum_i exp(-y_i * f(x_i)). The function f = sum_t (alpha_t * h_t) is a linear combination of the weak classifiers h_t with weights alpha_t. For those loving Optimization Theory, Adaboost is a classical application of Gradient Descent.
The algorithm is quite simple; it was included in the top 10 data mining algorithms in 2007, and its authors received the Gödel Prize in 2003. After Adaboost, Boosting became quite popular in the data mining community, with applications in Ranking and Clustering.
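As an illustration of the update rules above - not the C++ code linked below - here is a compact sketch with one-dimensional decision stumps as the weak classifiers; the toy dataset, the stump thresholds, and the number of rounds are all made up.

```scala
// A weak classifier: threshold on the line, with a polarity.
case class Stump(threshold: Double, polarity: Int) {
  def predict(x: Double): Int = if (x < threshold) polarity else -polarity
}

// Made-up, separable toy data: points on a line with labels +1 / -1.
val xs = Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
val ys = Array(1, 1, 1, -1, -1, -1)
val n  = xs.length

// Candidate stumps: thresholds between points, both polarities.
val stumps = for {
  t <- Seq(0.5, 1.5, 2.5, 3.5, 4.5)
  p <- Seq(1, -1)
} yield Stump(t, p)

var weights  = Array.fill(n)(1.0 / n)          // uniform initial weights
var ensemble = List.empty[(Double, Stump)]     // (alpha_t, h_t)

for (_ <- 1 to 3) {
  // Pick the stump minimizing the weighted error on the current weights.
  val (best, err) = stumps.map { s =>
    val e = (0 until n).map(i => if (s.predict(xs(i)) != ys(i)) weights(i) else 0.0).sum
    (s, e)
  }.minBy(_._2)
  val eps   = math.max(err, 1e-10)             // clamp to avoid division by zero
  val alpha = 0.5 * math.log((1 - eps) / eps)  // weight of this weak classifier
  ensemble = (alpha, best) :: ensemble
  // Re-weight: misclassified points get exponentially more attention.
  weights = Array.tabulate(n)(i => weights(i) * math.exp(-alpha * ys(i) * best.predict(xs(i))))
  val z = weights.sum
  weights = weights.map(_ / z)
}

// Strong classifier: sign of f(x) = sum_t alpha_t * h_t(x).
def classify(x: Double): Int = {
  val f = ensemble.map { case (a, s) => a * s.predict(x) }.sum
  if (f >= 0) 1 else -1
}
```

On this clean, separable toy set the first stump already fits the data; the re-weighting step starts to matter as soon as some points are misclassified.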

Here you have the code for AdaBoosting in C++ and Boost.


Antonio Gulli KMeans

K-means is a classical clustering algorithm.
Here you have C++ code for K-means clustering.

(Edit: 12/05/2013)
See also my more recent posting A new lighter implementation of K-means


Sunday, August 10, 2014

Find a Hamiltonian cycle

The problem is NP-complete, so we provide an exponential backtracking solution.

Monday, August 4, 2014

Implement a DFS visit

Try to be independent of the underlying graph implementation

Sunday, August 3, 2014

Implement a BFS visit in C++

Try to be independent of the underlying graph implementation

Saturday, August 2, 2014

Implement a graph in C++

Use an adjacency matrix. If possible, use templates for storing node attributes and link attributes. Is this the best implementation?


Friday, August 1, 2014

Saturday, June 14, 2014

Here is why Amazon should have personal shoppers

I frequently buy objects in domains where I am not really an expert.

For instance, I recently bought some Bluetooth speakers but I was not sure about the brand and the quality of the deal I was making.

In those situations, I would certainly pay a few bucks to someone who is a well-known expert and is willing to give me advice on what to buy.

Personal shoppers are common for expensive objects, but I am thinking about something done on a massive scale. I think that Amazon would be the perfect place for making this work at scale

BTW, ranking personal shoppers is a clear machine learning problem

Friday, June 13, 2014

Here is why sending money should be as easy as chat

Today chat is king and its growth is incredible. No surprise: IRC predates the Web, and ICQ came before Google.

So why don't we use chat messages to pay our bills and send money? The protocol would be easy: you receive a payment request in a chat message, and you answer with another chat message that includes your public key.

Shouldn't this be the natural evolution for WhatsApp, Line, Viber, WeChat, and all the other instant messengers?

After all, you have your phone in your pocket more frequently than your credit card in your wallet.

Thursday, June 12, 2014

Here is why linkedin should be linked to search

Search is about leveraging users' interactions. Every time you submit a query and select a few links, you provide valuable information to the search engine, which is later used for ranking. Normally this information is provided implicitly. However, this is not the only possible interaction model.

For instance, imagine that you are a top IT executive with a secret passion for snowboarding. Why not allow you to answer search queries related to this topic?

Another example: imagine that you are a banker working in finance with a secret passion for interior design. You'd love to share your expertise with other people for free, but with the current search model this is not easy.

Today, everyone has a Linkedin profile. So why don't we extend the search model so that I can register myself for a specific search query and offer my expertise on that topic to future users who submit the same query?

I think that the problem of spam is largely avoided if we adopt Linkedin profiles for registering users.

Also, I think that users would love the serendipity of being in touch with other people who share the same interests.

BTW, did I say that I love aquariums, ham radio, and late-80s music?





Wednesday, June 11, 2014

Here is why Amazon should start to sell money

Amazon is well known for selling everything (including baby sitters now ;) So why don't they start selling money?
The idea could be a little crazy but here are the motivations:

  1. They have all users' credit history, so it is very easy for them to assess a credit rating
  2. They have all users' credit cards, so it is very easy for them to assess a credit rating
  3. They have a huge number of transactions, so they can easily move money and let users borrow
This would be a win-win situation. Amazon can increase the number of transactions by allowing users to borrow money. Users can find very interesting rates. The whole system can benefit because the credit rating will be based on dynamically evolving parameters, constantly updated

So when will Amazon become a bank?

PS: the same suggestion applies to Ebay and Alibaba

Tuesday, June 10, 2014

Jump array

Given an array of non-negative numbers, start at the first element and reach the end with the minimum number of jumps. A jump cannot exceed the length stored in the current position.
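A dynamic-programming sketch in Scala (the sample array is made up): jumps(i) holds the fewest jumps needed to reach cell i.

```scala
// jumps(i) = minimum number of jumps needed to reach position i.
def minJumps(a: Array[Int]): Int = {
  val INF = Int.MaxValue
  val jumps = Array.fill(a.length)(INF)
  jumps(0) = 0
  // From each reachable cell i, relax every cell within a(i) steps.
  for (i <- a.indices; j <- i + 1 to math.min(i + a(i), a.length - 1))
    if (jumps(i) != INF && jumps(i) + 1 < jumps(j))
      jumps(j) = jumps(i) + 1
  jumps(a.length - 1)
}

println(minJumps(Array(2, 3, 1, 1, 4)))  // 2: jump 0 -> 1 -> 4
```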


Monday, June 9, 2014

LIS

Given an array of integers, find the longest increasing subsequence
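A classic O(n^2) dynamic-programming sketch in Scala (the sample array is made up): lis(i) is the length of the longest increasing subsequence ending at position i.

```scala
// lis(i) = length of the longest increasing subsequence ending at i.
def lisLength(a: Array[Int]): Int = {
  val lis = Array.fill(a.length)(1)
  for (i <- a.indices; j <- 0 until i)
    if (a(j) < a(i)) lis(i) = math.max(lis(i), lis(j) + 1)
  lis.max
}

println(lisLength(Array(10, 9, 2, 5, 3, 7, 101, 18)))  // 4: e.g. 2, 3, 7, 18
```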

Saturday, June 7, 2014

LCS

Given two strings, find the longest common substring

Friday, June 6, 2014

Palindromes

Given a string, find the minimum number of characters to be inserted to convert the string into a palindrome

Wednesday, June 4, 2014

Palindrome String

Compute the longest palindrome string

Tuesday, June 3, 2014

Dynamic Programming

Given an array of integers, find the longest bitonic sequence

Monday, June 2, 2014

Coin change

Given n dollars, how many different ways are there to make change with a set of coins?
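A dynamic-programming sketch in Scala (the denominations and the amount are made up): ways(s) counts the ways to form sum s, and iterating over the coins in the outer loop counts each combination once regardless of order.

```scala
// ways(s) = number of ways to form sum s with the given coins.
def countChange(amount: Int, coins: Seq[Int]): Long = {
  val ways = Array.fill(amount + 1)(0L)
  ways(0) = 1
  // Coins in the outer loop: each combination is counted once.
  for (c <- coins; s <- c to amount)
    ways(s) += ways(s - c)
  ways(amount)
}

println(countChange(4, Seq(1, 2, 3)))  // 4: 1+1+1+1, 1+1+2, 2+2, 1+3
```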

Saturday, May 31, 2014

Friday, May 30, 2014

Thursday, May 29, 2014

Integer Knapsack

A knapsack has max capacity C, and there are n items, each with weight w_i and value v_i. Maximize the value in the knapsack without exceeding its max capacity
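A dynamic-programming sketch in Scala (the weights, values, and capacity are made up): best(c) is the maximum value achievable with capacity c, and the capacity loop runs backwards so each item is used at most once.

```scala
// best(c) = maximum value achievable with capacity c.
def knapsack(capacity: Int, w: Array[Int], v: Array[Int]): Int = {
  val best = Array.fill(capacity + 1)(0)
  // Reverse capacity loop: each item is taken at most once (0/1 knapsack).
  for (i <- w.indices; c <- capacity to w(i) by -1)
    best(c) = math.max(best(c), best(c - w(i)) + v(i))
  best(capacity)
}

println(knapsack(10, Array(5, 4, 6, 3), Array(10, 40, 30, 50)))  // 90
```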

Wednesday, May 28, 2014

Jump stairs

There is a staircase with n steps. You can climb either one step or two steps at a time. How many different ways are there to climb the staircase?
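The count follows the Fibonacci recurrence, since the last move is either a single step from step n-1 or a double step from step n-2; a sketch in Scala:

```scala
// ways(n) = ways(n-1) + ways(n-2), with ways(0) = ways(1) = 1.
def climbWays(n: Int): Long = {
  if (n <= 1) 1L
  else {
    var (prev, cur) = (1L, 1L)
    for (_ <- 2 to n) {
      val next = prev + cur
      prev = cur
      cur = next
    }
    cur
  }
}

println(climbWays(4))  // 5: 1+1+1+1, 1+1+2, 1+2+1, 2+1+1, 2+2
```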

Tuesday, May 27, 2014

Monday, May 26, 2014

Dice

Given a number of dice, count how many ways there are to get sum s

Sunday, May 25, 2014

Palindromes

Given a string find the minimum number of characters to be inserted for converting the string into a palindrome.

Saturday, May 24, 2014

Bridge matching

n cities are on the northern bank of a river and n cities are on the southern bank. You can build a bridge only between cities with the same number. Build as many bridges as possible without any two bridges crossing.

Friday, May 23, 2014

Sum subset

Given an array of integers of size n, partition it in such a way that the two subsets have equal sum s.

Thursday, May 22, 2014

Partition set

Partition a multiset S of positive integers into two subsets S1 and S2, such that the sum of the numbers in S1 equals the sum of the numbers in S2

Wednesday, May 21, 2014

Segment partition

Given a segment of integer length n, cut it into different integer parts in such a way as to maximize the product of the lengths of all parts

Tuesday, May 20, 2014

Cutting a rod

Given a rod of length n and a vector of prices for different lengths, cut the rod so as to maximize the gain

Monday, May 19, 2014

Max matrix 0/1

Given a matrix consisting only of 0s and 1s, find the maximum-size square submatrix of all 1s.

Sunday, May 18, 2014

Matrix parenthesization

Given a set of m matrices, find the most efficient way of multiplying them

Saturday, May 17, 2014

Max submatrix

Given a matrix of size n x n, find the maximum-sum rectangle

Friday, May 16, 2014

BST and dynamic programming

Given a binary search tree, find the size of the largest independent set of nodes

Thursday, May 15, 2014

Games of coins

A set of coins is aligned in a row, and two players take turns picking one coin from either end. Maximize the value of the coins picked by the first player

Wednesday, May 14, 2014

Maximum value contiguous subsequence

Given an array of real numbers, find a contiguous subsequence with max sum S
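This is Kadane's algorithm territory; a sketch in Scala (the sample array is made up): the best sum ending at the current element either extends the previous run or restarts.

```scala
// Kadane's algorithm: the best sum ending at x either extends the
// previous run or starts fresh at x.
def maxSubarray(a: Array[Double]): Double = {
  var best = a(0)
  var endingHere = a(0)
  for (x <- a.drop(1)) {
    endingHere = math.max(x, endingHere + x)
    best = math.max(best, endingHere)
  }
  best
}

println(maxSubarray(Array(-2.0, 1.0, -3.0, 4.0, -1.0, 2.0, 1.0, -5.0, 4.0)))  // 6.0
```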

Tuesday, May 13, 2014

Stock prices

Given a histogram array of unsigned integers encoding the price of a stock during the previous year, compute the area of the largest rectangle contained in the histogram

Monday, May 12, 2014

Ship battle

Given a matrix M and an array V, match the array in the matrix

Sunday, May 11, 2014

Scheduling

Given n jobs, each with a processing time t_i, a profit p_i, and a deadline d_i, maximize the profit

Saturday, May 10, 2014

Boolean parenthesization

Given a boolean expression of AND, OR, XOR, TRUE, and FALSE, find the number of ways to parenthesize it so that it evaluates to true

Friday, May 9, 2014

Balanced partition

Given an array of integers between 0 and M, divide the integers into two sets such that the difference of their sums is minimized