Monday, September 15, 2014

Hands on big data - Crash Course on Spark - Optimizing Transformations - lesson 4

1. Take a list of names
2. Create buckets by initials
3. Group by keys
4. Map each initial to the set of its names and take the size
5. Collect the results into the master as an array
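
The five steps above can be sketched in Scala roughly as follows. This is only a sketch, not the original code of the post: the object `NamesByInitial`, the function `countByInitial`, and the sample names in the usage note are hypothetical, and `sc` stands for the SparkContext that the Spark shell predefines.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (required on Spark 1.x)

object NamesByInitial {
  // Hypothetical helper: number of distinct names per initial.
  // Assumes every name is non-empty (charAt(0) fails on an empty string).
  def countByInitial(sc: SparkContext, names: Seq[String]): Array[(Char, Int)] =
    sc.parallelize(names)                     // 1. take a list of names
      .map(name => (name.charAt(0), name))    // 2. bucket by initial
      .groupByKey()                           // 3. group by key
      .mapValues(group => group.toSet.size)   // 4. set of names per initial, then its size
      .collect()                              // 5. bring the results back as an Array
}
```

In the Spark shell this could be called as `NamesByInitial.countByInitial(sc, Seq("Anna", "Alan", "Anna", "Bob"))`. The order of the returned pairs is not guaranteed.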

This code is not super-optimized. For instance, we might want to force the number of partitions; by default, the input is split into 64 MB chunks.
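
One possible way to force the number of partitions is `repartition`. The sketch below is an illustration under assumptions: the object `ForcePartitions` and the function `spread` are hypothetical names, and the right partition count depends on your data and cluster.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ForcePartitions {
  // Hypothetical helper: redistribute the names over a fixed number of partitions.
  // repartition always triggers a shuffle; numPartitions is chosen by the caller.
  def spread(sc: SparkContext, names: Seq[String], numPartitions: Int): RDD[String] =
    sc.parallelize(names).repartition(numPartitions)
}
```

The resulting count can be checked with `ForcePartitions.spread(sc, names, 6).partitions.length`.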

Then we can avoid counting duplicates

Or, better, count the unique names with a reduceByKey where the reduce function is the sum
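
A sketch of these two ideas together, assuming the same kind of input as before: drop the duplicates first, then emit a 1 per unique name and sum per initial. The object `UniqueNamesByInitial` and its function are hypothetical names, not the original code.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (required on Spark 1.x)

object UniqueNamesByInitial {
  // Hypothetical helper: number of unique names per initial, without groupByKey.
  // Assumes every name is non-empty.
  def countByInitial(sc: SparkContext, names: Seq[String]): Array[(Char, Int)] =
    sc.parallelize(names)
      .distinct()                         // drop duplicate names up front
      .map(name => (name.charAt(0), 1))   // one counter per unique name
      .reduceByKey(_ + _)                 // sum the counters per initial
      .collect()
}
```

Because `reduceByKey` combines values on each partition before the shuffle, only partial sums travel between nodes instead of full lists of names.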

These optimizations reduce the information sent over the wire.

I found this example very instructive [Deep Dive: Spark internals]; however, you should add some string sanity checks

and make sure that you don't run into heap problems if you fix the number of partitions
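
As an illustration of such a sanity check, the sketch below trims each name and discards empty strings before taking the initial, since `charAt(0)` throws on an empty string. The object and function names are hypothetical, and the partition count is left to the caller because a value that is too small makes each partition larger and can put pressure on the heap.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (required on Spark 1.x)

object SafeNamesByInitial {
  // Hypothetical helper: like a plain count by initial, but tolerant of blank input.
  def countByInitial(sc: SparkContext,
                     names: Seq[String],
                     numPartitions: Int): Array[(Char, Int)] =
    sc.parallelize(names)
      .map(_.trim)                        // remove surrounding whitespace
      .filter(_.nonEmpty)                 // sanity check: skip empty names
      .distinct(numPartitions)            // deduplicate with a fixed partition count
      .map(name => (name.charAt(0), 1))
      .reduceByKey(_ + _)
      .collect()
}
```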

You will find more transformations here

Sunday, September 14, 2014

Hands on big data - Crash Course on Spark Map & Reduce and other Transformations - lesson 3

Third lesson.

  1. Let's start by loading some data through the SparkContext sc (this is a predefined object, from which RDDs are created)
  2. Then do a map & reduce. Pretty simple using Scala functional notation. We map the pages s to their length, and then reduce with a lambda function which sums two elements. That's it: we have the total length of the pages. Spark will choose the number of mappers and reducers on your behalf. However, if you prefer, this can be tuned manually.
  3. If we use map(s => 1) then we can compute the total number of pages
  4. If we want to sort the pages alphabetically then we can use sortByKey() and collect() the results into the master as an Array
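
The steps above can be sketched as three small functions over an RDD of pages. The object `PageStats` and its function names are hypothetical. Note that `sortByKey()` needs key/value pairs, so the sketch keys each page by itself, and the lengths are summed as `Long` to avoid overflow on large inputs.

```scala
import org.apache.spark.SparkContext._ // pair and ordered RDD implicits (required on Spark 1.x)
import org.apache.spark.rdd.RDD

object PageStats {
  // Total length of all pages: map each page to its length, reduce by sum.
  // reduce throws on an empty RDD, so this assumes at least one page.
  def totalLength(pages: RDD[String]): Long =
    pages.map(s => s.length.toLong).reduce((a, b) => a + b)

  // Number of pages: map every page to 1, then sum.
  def pageCount(pages: RDD[String]): Long =
    pages.map(s => 1L).reduce(_ + _)

  // Pages in alphabetical order, collected as an Array.
  def sortedPages(pages: RDD[String]): Array[String] =
    pages.map(s => (s, 1)).sortByKey().map(_._1).collect()
}
```

In the Spark shell, `PageStats.totalLength(sc.parallelize(Seq("bb", "a", "ccc")))` would sum the three lengths.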

As you see, this is way easier than plain vanilla Hadoop. We will play with other transformations in the next lessons.

Saturday, September 13, 2014

Hands on big data - Crash Course on Scala and Spark - lesson 2

Let's do some functional programming in Spark. First download a pre-compiled binary

Launch the scala shell

Ready to go

Let's warm up by playing with types, functions, and a bit of lambda.
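
A warm-up of this kind might look like the following plain Scala, which needs no Spark at all. All names here (`WarmUp`, `square`, `twice`, and so on) are made up for illustration and are not the original session.

```scala
object WarmUp {
  val answer: Int = 42                   // value with an explicit type
  val greeting = "hello"                 // type String is inferred

  def square(x: Int): Int = x * x        // a named function (method)

  val twice: Int => Int = x => x * 2     // a lambda stored in a value

  // Functions can be passed to higher-order methods such as map.
  val result: List[Int] = List(1, 2, 3).map(square).map(twice)
}
```

Here `WarmUp.result` evaluates to `List(2, 8, 18)`.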

We also have the SparkContext. Download a dataset (here, the wiki pagecounts)

Load the data into a Spark RDD

 And let's do some computation
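
As a sketch of what loading and a first computation could look like: the object `PageCounts`, its function names, and the file path are hypothetical, and the code assumes each line has the space-separated form `project page_title view_count bytes`, which may not match your dump.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PageCounts {
  // Load a text file as an RDD with one element per line.
  def load(sc: SparkContext, path: String): RDD[String] = sc.textFile(path)

  // Sum of the view counts over all well-formed lines.
  // Assumption: 4 space-separated fields, view count in the third one.
  def totalViews(lines: RDD[String]): Long =
    lines.map(_.split(" "))
      .filter(_.length == 4)               // skip lines that do not fit the assumed format
      .map(fields => fields(2).toLong)     // throws if the third field is not a number
      .fold(0L)(_ + _)                     // 0 for an empty input
}
```

In the Spark shell this could be used as `PageCounts.totalViews(PageCounts.load(sc, "/path/to/pagecounts"))`, where the path is a placeholder.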

Bigdata is nothing new

How many of those catchy words do you know?


Thursday, September 11, 2014

Hands-on big data - fire a spark cluster on Amazon AWS in 7 minutes - lesson 1

After logging in, launch the management console

And select EC2

Then, launch an instance

To begin with, a good option is to get a free machine

Select the first one

Select all the default options and launch the machine. Remember to create the key pair

Download the keys

Get PuTTY and PuTTYgen. Launch PuTTYgen, load spark-test.pem, and save the private key (it’s OK to leave it without a passphrase for now)

Check the launched instance

Get the public ip

Fire up PuTTY; the login is typically ec2-user@IP

And add the ppk file like this

Log in to the new machine


 Enter the directory

Get private and public keys


And download them

Modify your ~/.bashrc

You need to copy spark-test.pem into the ~/.ssh directory; WinSCP is your friend


./spark-ec2 -k spark-test -i ~/.ssh/spark-test.pem --hadoop-major-version=2 --spark-version=1.1.0 launch -s 1 ag-spark

If it does not work, make sure that the region (-r) is the same as that of the key you have on your running machine

GO AND TAKE A COFFEE, the cluster is booting – it needs time.

Fire up a browser and check the status

Log in to the cluster

Sunday, August 31, 2014

Search is dead, Awareness is the new king

Search is dead. Well, perhaps it is not, but I am feeling provocative this morning. The first search engine was launched in 1993. Since then we have made incredible progress in content discovery, indexing, scalability, ranking, machine learning, variety, and UX. However, the paradigm is still the same: you need to have an idea of what you are searching for well before you start submitting keyword-based queries. Exactly like in 1993.

I think that Awareness is the new king. By awareness, I mean something that sends you the information you would like to get, working on your behalf with no need for explicit searches. To be honest, the topic is not new: Pattie Maes has been working on Intelligent Software Agents since 1987. However, I don't see a disruptive revolution. We still receive alerts via Google Alerts, which is based on explicit queries. Some initial steps, but not a revolution, are shown in Google Now, where the location is the implicit query. Some other changes are in Google Glass, where the captured image might be a surrogate for the query. It is still a world where 95% of our actions are described and learned via keywords.

What do you think?

Saturday, August 30, 2014

Internet of things: what is new?

We live in a world of catchy words. What was called Parallel Computation yesterday became NoW (Network of Workstations), then Grid, and then Cloud. Same thing, different words.
Another example is IoT, which is pretty much the same as Home Automation, something that has been discussed for the past 40 years.

So what is cool with IoT?

Frankly, I don't know. However, I got an idea there and filed a patent. Let's see what happens.