Friday, June 26, 2015

Microsoft - Elsevier to organize next WSDM cup

Glad to announce that Microsoft and Elsevier will organize the next WSDM Cup.

With the recent explosive growth of online activities, data are often recorded as heterogeneous graphs, ranging from Facebook’s Open Graph, which records our social and communication activities, to the graphs gathered by major search engine companies, which represent a snapshot of our collective knowledge. As demonstrated in many web search and data mining applications, a critical element in making the best use of the data is the ability to assess the relative importance of the nodes.
In the 2016 WSDM Cup, the challenge will be to assess the query-independent importance of scholarly articles, using data from the Microsoft Academic Graph, a large heterogeneous graph composed of publications, authors, venues, organizations, and fields of study.
More details coming soon. Stay tuned.
Alex Wade, Microsoft Research
Kuansan Wang, Microsoft Research
Antonio Gulli, Elsevier B.V.

Saturday, March 21, 2015

Teaching computer science to more than 100 kids in recent years

During the last year, I spent time teaching computer science to kids between 8 and 13 years old. It is a volunteer activity, in part inspired by the initiative. I spend a few hours teaching how to program video games using Microsoft Research's Kodu environment, and the children are enthusiastic, in part because they can share their games with other kids worldwide thanks to the online community.

Then I explain the meaning of Fibonacci numbers and the different places where they occur in nature

Then I discuss the problem of sorting by playing this game

Monday, March 16, 2015

A Collection of Graph Programming Interview Questions Solved in C++

The new book is almost ready


Saturday, November 8, 2014

Antonio Gulli and Machine Learning

These are some of the visual results of my previous experience in Machine Learning

Use of Machine Learning to build Bing Autosuggest

Use of Machine Learning for integration with Facebook

Use of Machine Learning to build Bing News engine

Use of Machine Learning to build Universal Results

Use of Machine Learning to build TheDailyBeast

The use of Machine Learning to detect image similarities (AKA Andy Warhol's period)

Monday, September 22, 2014

Saturday, September 20, 2014

Hands on big data - Crash Course on Spark - Start a 6-node cluster - lesson 9

One easy way to start a cluster is to leverage those created by AMPLab

After creating the keys for AWS, I created the cluster but had to add -w 600 for the timeout

Deploy (about 30 minutes, including all the data copying)


Run the interactive Scala shell, which will connect to the master

Run commands

Instances on AWS

Friday, September 19, 2014

Hands on big data - Crash Course on Spark - PageRank - lesson 8

Let's compute PageRank. Below you will find the definition.
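
For reference, a common textbook way to write the definition is sketched below. The symbols are my own notation, not taken from the original post: B(u) is the set of pages linking to u, L(v) is the number of outgoing links of v, N is the number of pages, and d is the damping factor (0.85 in this lesson). The second line is the simplified, unnormalized form that matches the 0.15 + 0.85 update used in this lesson.

```latex
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B(u)} \frac{PR(v)}{L(v)}

PR(u) = 0.15 + 0.85 \sum_{v \in B(u)} \frac{PR(v)}{L(v)}
```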

Assume we have two pair RDDs, one of (url, neighbors) and one of (url, rank), where each rank is initialized to 1 or to a uniformly random value. Then, for a fixed number of iterations (or until the ranks no longer change significantly between two consecutive iterations),

we join links and ranks to form (url, (links, rank)) and use a flatMap to emit, for each destination in links, the share of the rank it receives from url. We then reduceByKey on the destination, using + as the reduction. The new rank is computed as 0.15 + 0.85 * the summed contributions.
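
The iteration described above can be sketched in plain Python, without Spark, so that it runs on a laptop. This is only an illustration under my own assumptions: the function name `pagerank`, the toy `links` dictionary, and the choice of 50 iterations are hypothetical, and the loops stand in for the join, flatMap, and reduceByKey steps rather than reproducing the Spark code.

```python
from collections import defaultdict


def pagerank(links, iterations=10, damping=0.85):
    """links maps each url to the list of urls it points to (hypothetical input format)."""
    # Collect every url, including those that only appear as destinations.
    urls = set(links)
    for neighbors in links.values():
        urls.update(neighbors)

    # Every rank starts at 1.0, as in the description above.
    ranks = {url: 1.0 for url in urls}

    for _ in range(iterations):
        totals = defaultdict(float)
        # "join" + "flatMap": each url sends rank / out-degree to every destination.
        for url, neighbors in links.items():
            if not neighbors:
                continue  # a page without outgoing links contributes nothing
            share = ranks[url] / len(neighbors)
            for dest in neighbors:
                # "reduceByKey" with +: sum the contributions per destination.
                totals[dest] += share
        # New rank: 0.15 + 0.85 * summed contributions (with damping = 0.85).
        ranks = {url: (1.0 - damping) + damping * totals.get(url, 0.0) for url in urls}
    return ranks


# Toy graph (made up for this sketch): a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links, iterations=50)
for url in sorted(ranks):
    print(url, round(ranks[url], 4))
```

On this toy graph, page c ends up with the highest rank because it receives all of b's rank and half of a's.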

The problem is that each join needs a full shuffle over the network. One way to reduce this overhead is to partition with a HashPartitioner, in this case into 8 partitions.

This avoids the shuffle
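
Why hash partitioning helps can be illustrated with a small plain-Python sketch. It is not Spark's HashPartitioner: `zlib.crc32` is a stand-in hash I chose for determinism, and the helper names `partition_of`, `partition_by_key`, and `local_join` are hypothetical. The point is only that when both datasets are partitioned by the same function, matching keys land in the same partition, so the join can be done partition by partition without moving data.

```python
import zlib

NUM_PARTITIONS = 8  # same number of partitions as in the text


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Deterministic hash: the same key always maps to the same partition index.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions=NUM_PARTITIONS):
    # Distribute (key, value) pairs into num_partitions dictionaries.
    parts = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)][key] = value
    return parts


def local_join(left_parts, right_parts):
    # Join partition i of the left side only with partition i of the right side.
    joined = {}
    for left, right in zip(left_parts, right_parts):
        for key, value in left.items():
            if key in right:
                joined[key] = (value, right[key])
    return joined


# Toy data (made up for this sketch).
links = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]
ranks = [("a", 1.0), ("b", 1.0), ("c", 1.0)]

link_parts = partition_by_key(links)
rank_parts = partition_by_key(ranks)
joined = local_join(link_parts, rank_parts)
print(joined)
```

Because no key ever has to be looked up in a different partition, nothing would need to cross the network for this join.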

However, one can directly use the SparkPageRank code distributed with Spark

Here I take a toy dataset

And compute the PageRank

Other datasets are here