Saturday, October 31, 2015

What is Natural Language Processing?

An excerpt from my new book: instant questions and answers, with code. This is #61, from the second volume (more than 100 in total).

Natural Language Processing (NLP) is a complex topic, and there are whole books devoted to it. In this book, an introductory survey is provided based on NLTK, the Python Natural Language Toolkit. Let us start.

Text is made up of sentences, and sentences are composed of words. So the first step in NLP is frequently to separate those basic units according to the rules of the chosen language. Very frequent words often carry little information and should be filtered out as stopwords. The first code fragment splits text into sentences and then sentences into words, from which the stopwords are removed.

In addition to that, it can be interesting to find out the meaning of words. Here WordNet[1] can help with its organization of terms into synsets, which are arranged into an inheritance tree where the more abstract terms are hypernyms and the more specific terms are hyponyms. WordNet can also help in finding synonyms and antonyms (opposite words) of a given term. The code fragment finds the synonyms of the word love in English.

Moreover, words can be stemmed, and the rules for stemming differ widely from language to language. NLTK provides the SnowballStemmer, which supports multiple languages. The code fragment finds the stem of the word volver in Spanish.

In certain situations, it is useful to know whether a word is a noun, an adjective, a verb and so on. This is the process of part-of-speech tagging, and NLTK provides convenient support for this type of analysis, as illustrated in the code fragment below.



text = "Poetry is the record of the best and happiest moments \
of the happiest and best minds. Poetry is a sword of lightning, \
ever unsheathed, which consumes the scabbard that would contain it."

# download the stopwords corpus
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
stop = stopwords.words('english')

# download the punkt package
nltk.download('punkt')
# load the sentences' tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text)
print sentences

# tokenize in words
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
for sentence in sentences:
    words = tokenizer.tokenize(sentence)
    words = [w for w in words if w not in stop]
    print words

# synonyms from WordNet (requires the 'wordnet' corpus)
from nltk.corpus import wordnet
for synset in wordnet.synsets('love'):
    print "Synonyms:", ", ".join(synset.lemma_names())

# SnowBallStemmer
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
print "Spanish stemmer"
print stemmer.stem('volver')

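# part-of-speech tagging (requires the 'treebank' corpus)
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
trainSentences = treebank.tagged_sents()[:5000]
tagger = UnigramTagger(trainSentences)
# 'words' still holds the tokens of the last sentence from the loop above
tagged = tagger.tag(words)
print tagged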


['Poetry is the record of the best and happiest moments of the happiest and best minds.', 'Poetry is a sword of lightning, ever unsheathed, which consumes the scabbard that would contain it.']
['Poetry', 'record', 'best', 'happiest', 'moments', 'happiest', 'best', 'minds', '.']
['Poetry', 'sword', 'lightning', ',', 'ever', 'unsheathed', ',', 'consumes', 'scabbard', 'would', 'contain', '.']
Synonyms: love
Synonyms: love, passion
Synonyms: beloved, dear, dearest, honey, love
Synonyms: love, sexual_love, erotic_love
Synonyms: love
Synonyms: sexual_love, lovemaking, making_love, love, love_life
Synonyms: love
Synonyms: love, enjoy
Synonyms: love
Synonyms: sleep_together, roll_in_the_hay, love, make_out, make_love, sleep_with, get_laid, have_sex, know, do_it, be_intimate, have_intercourse, have_it_away, have_it_off, screw, fuck, jazz, eff, hump, lie_with, bed, have_a_go_at_it, bang, get_it_on, bonk
Spanish stemmer
[('Poetry', None), ('sword', None), ('lightning', None), (',', u','), ('ever', u'RB'), ('unsheathed', None), (',', u','), ('consumes', None), ('scabbard', None), ('would', u'MD'), ('contain', u'VB'), ('.', u'.')]
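
The None tags in the output above come from how a unigram tagger works: it assigns each word the tag seen most often for that word in the training data, and has nothing to say about words it never saw. The following is a minimal sketch of that idea in plain Python 3, not NLTK's implementation; the function names and the tiny training set are hypothetical and only stand in for the treebank sentences.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words):
    """Tag each word with its most frequent training tag, or None if unseen."""
    return [(word, model.get(word)) for word in words]


# Hypothetical miniature training set (assumption, for illustration only).
training = [
    [("the", "DT"), ("sword", "NN"), ("would", "MD"), ("contain", "VB"), (".", ".")],
    [("the", "DT"), ("record", "NN"), ("would", "MD"), ("record", "VB"), ("record", "NN")],
]
model = train_unigram_tagger(training)
tagged = tag(model, ["the", "record", "would", "contain", "lightning"])
print(tagged)
```

In this sketch "record" is tagged NN because that tag wins two to one in the training data, while "lightning" gets None because it never appears there. In practice a backoff tagger is commonly used to give unseen words a default tag.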

Friday, October 30, 2015

What are autoencoders and stacked autoencoders?

An encoder is a function $f$ which transforms the input vector $x$ into the output $y = f(x) = s(Wx + b)$, where $W$ is a weight matrix, $b$ is an offset vector and $s$ is an activation function such as the sigmoid. A decoder is an inverse function $g$ which tries to reconstruct the original vector from $y$. An auto-encoder tries to reconstruct the original input by minimizing the error during the reconstruction process. There are two major variants of auto-encoding: sparse auto-encoders force sparsity by using L1 regularization, while de-noising auto-encoders stochastically corrupt the input with some form of randomization.
Mathematically, a stochastic mapping transforms the input vector $x$ into a noisy vector $\tilde{x}$, which is then transformed into a hidden representation $y = f(\tilde{x})$. The reconstruction phase is performed by a decoder $z = g(y)$, and the error between $x$ and $z$ is minimized using either squared error loss or cross-entropy loss. Autoencoders typically use a hidden layer which acts as a bottleneck that compresses the data, as in the figure.

In a deep learning context, multiple auto-encoders are stacked to produce the final denoised output. The “magic” outcome of this combination is that autoencoders learn how to extract meaningful features from noisy data with no need for hand-crafted feature selection. There are also additional applications. For instance, deep autoencoders are able to map images into compressed vectors of small dimensionality, which is useful for searching images by similarity. In addition, deep autoencoders can map words into low-dimensional vectors, a process useful in topic modelling across a collection of documents.
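
To make the pieces concrete, here is a minimal sketch of the forward pass of a single de-noising auto-encoder in plain Python 3: corrupt the input, encode it through a bottleneck, decode it, and measure the reconstruction error. Everything in it is an assumption made for illustration: the masking noise, the sigmoid activation, the layer sizes, the random untrained weights and all the names. Training (adjusting the weights to reduce the loss) is left out.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine_sigmoid(weights, bias, vector):
    """Compute s(W v + b) with an element-wise sigmoid s."""
    return [
        sigmoid(sum(w * x_i for w, x_i in zip(row, vector)) + b_i)
        for row, b_i in zip(weights, bias)
    ]


def corrupt(x, drop_probability, rng):
    """Masking noise: set each component to zero with the given probability."""
    return [0.0 if rng.random() < drop_probability else value for value in x]


def squared_error(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))


def cross_entropy(x, z):
    return -sum(a * math.log(b) + (1.0 - a) * math.log(1.0 - b) for a, b in zip(x, z))


rng = random.Random(0)
n_input, n_hidden = 4, 2  # the hidden layer is the bottleneck

# Random, untrained parameters (assumption: a real model would learn these).
W = [[rng.uniform(-0.5, 0.5) for _ in range(n_input)] for _ in range(n_hidden)]
b = [0.0] * n_hidden
W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_input)]
b_prime = [0.0] * n_input

x = [1.0, 0.0, 1.0, 1.0]
x_noisy = corrupt(x, 0.3, rng)           # stochastic corruption of the input
y = affine_sigmoid(W, b, x_noisy)        # hidden representation
z = affine_sigmoid(W_prime, b_prime, y)  # reconstruction of the clean input
print(squared_error(x, z))
print(cross_entropy(x, z))
```

Note that the loss compares the reconstruction with the clean input x, not with the corrupted one; that is what pushes a de-noising auto-encoder to learn features that are robust to noise. Stacking would feed the hidden representation y into a second auto-encoder of the same kind.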

Thursday, October 29, 2015

Academic Search and Relevance on ScienceDirect

During the past few months the team has worked on Academic Search, so it is time to post a number of side-by-side comparisons. Let's select a topic, say Modern Finance, and pick a few queries just to show where we are.

{quantitative easing}

All articles are from 2010/2011, while here instead we show fresh results, which are more relevant.

{ultrafast trading}

Here there is a proximity match and an alteration problem, since ultrafast trading actually means high-frequency trading, and a result from 1999 cannot be about trading that fast. ScienceDirect correctly nails it.


Fintech has a very specific meaning in Finance and this meaning is not nailed. ScienceDirect got it, and the result is also fresh. Kind of cool.

Please run your own queries and report SATs and DSATs. Search and Relevance requires continuous investments and the work is never really done. There is always a metric to move, and new learnings to apply - which is why the job is fun!!

Antonio Gulli

Wednesday, October 28, 2015

What is Deep Learning?

Deep Learning is a buzzword which entered the mainstream thanks to some recent results capturing the attention of a global audience. Google’s Brain project learned to find cats in videos, Facebook recognizes faces in images, Baidu recognizes visual shapes and objects, and both Baidu and Microsoft use deep learning for speech recognition. Buzzwords aside, some of the most prominent researchers worldwide work on deep learning, including Jeff Dean (Google), Yann LeCun (Facebook) and Andrew Ng (Baidu).

One very interesting advance made with Deep Learning is that it is now possible to learn how to extract discriminative features automatically. Traditional machine learning, instead, requires a lot of human effort for hand-crafting features, and the learning itself was essentially a way to learn weights for balancing those features. Automatic discovery of discriminative features is indeed a big step forward toward reasoning. Machines can now learn what is important and what is not, while before humans had to pick features which were potentially important and then let the machines weight them, at the risk of missing discriminative and fundamental information simply because it was not considered. In short, we can say that we now have Trainable Feature Extractors and Trainable Learning, while before we only had the latter. Auto-encoders are one tool used by Deep Learning for finding features useful for representing an input distribution.

Another interesting characteristic of Deep Learning is the ability to learn from mostly unlabelled data in a typical semi-supervised learning setting, where a very large number of training examples do not have complete and correct labels.
Yet another interesting trait of Deep Learning is the ability to learn how to approximate highly varying functions, that is, functions for which a piecewise approximation (with constant or linear pieces) requires a very large number of pieces.

Deep learning uses a cascade of many layers of non-linear processing units which perform feature extraction and transformation. What is still required is to compose the layers manually according to the specific problem to be solved. So, the next big step would be to learn how to self-organize layers. Typically, Deep Learning composes many (recurrent) layers of ANNs with even more sophisticated generative models such as Deep Belief Networks and Deep Boltzmann Machines.
One fundamental assumption is that each level learns more abstract concepts than the previous level. This concept is well explained in this image, where the first layer learns basic features, the second layer learns components of a human face, and the third layer learns different types of faces. Hence, the learning system is a high-dimensional entity able to discriminate many observed features that are related by unknown statistical relations. The learning is distributed in the sense that the knowledge itself is not associated with one single neuron but is the result of sharing information within the network and the consequent activation of multiple neurons.

As shown in this image, the features become more extended and complex deeper in the network. In addition to that, multiple networks can be specialized on different concepts and learn how faces, cars, elephants, and chairs are visualized.

Advances in hardware have also been an important enabling factor for Deep Learning. In particular, powerful graphics processing units (GPUs) are highly suited for the matrix and vector operations involved in machine learning, and GPUs can speed up training algorithms by orders of magnitude, bringing running times of weeks down to a few hours. This makes it possible to increase the number of layers in a deep network and therefore the level of sophistication of the models it can represent. This image gives another idea of how different levels progressively learn more and more complex visual features.

Deep Learning networks are typically trained via backpropagation, where weights are updated via Stochastic Gradient Descent using an equation such as

$$w_{ij}(t+1) = w_{ij}(t) - \eta \frac{\partial C}{\partial w_{ij}}$$

so that the weight $w_{ij}$ between the units $i$ and $j$ is updated at time $t+1$ based on the weight available at time $t$ minus a fraction of the partial derivative of a chosen cost function $C$. $\eta$ is the learning rate. Google built an Asynchronous Distributed Stochastic Gradient Descent server where more than 16000 CPUs independently update the gradient weights for learning the rather sophisticated recognition of the concept of “cats” from a generic YouTube video. Other types of training have been proposed, including forward propagation and forward-backward propagation for Restricted Boltzmann Machines and for Recurrent Networks.
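
As a small illustration of the update rule described above, the following sketch applies stochastic gradient descent to the simplest possible model, a line y = w * x + b fitted with squared error, in plain Python 3. It moves each weight against the gradient of the cost so that the cost decreases. The function name, the toy data, the learning rate and the number of epochs are all assumptions chosen for the example; backpropagation through several layers is not shown.

```python
import random


def sgd_linear_fit(samples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    samples = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in random order
        for x, y in samples:
            error = (w * x + b) - y
            # Partial derivatives of the per-sample cost C = 0.5 * error**2.
            grad_w = error * x
            grad_b = error
            # One update per sample: weight(t+1) = weight(t) - rate * dC/dweight.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noiseless data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = sgd_linear_fit(data)
print(round(w, 3))
print(round(b, 3))
```

The update is "stochastic" because each step uses the gradient of a single sample rather than of the whole data set. A learning rate that is too large makes the updates diverge, while one that is too small makes learning slow.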
Another rather sophisticated approach uses convolutional networks (networks where the same weights are used at all the spatial locations in a layer) with 24 layers for annotating images with concepts, showing an impressive 6.6% error rate on the top 5 results, which is competitive with the human brain.[1]
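
The weight sharing mentioned above can be sketched in one dimension: a single small kernel is slid over the input, so the same weights are applied at every position. This is a hedged illustration in plain Python 3, not the network from the cited result; the function name, the two-weight kernel and the signal are hypothetical. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (valid positions only)."""
    width = len(kernel)
    return [
        sum(k * s for k, s in zip(kernel, signal[i:i + width]))
        for i in range(len(signal) - width + 1)
    ]


edge_kernel = [-1.0, 1.0]  # hypothetical filter that responds to changes
signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
response = conv1d(signal, edge_kernel)
print(response)
```

The response is non-zero only where the signal changes (a rising edge, then a falling edge), wherever that happens in the input. The layer has just two parameters regardless of the input length, which is what makes convolutional layers far cheaper than fully connected ones on images.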

Embedding is another key concept introduced by Deep Learning. Embedding is used to avoid the problems encountered when learning with sparse data. For instance, we can extract words from documents and then create word embeddings where words are simply grouped together if they occur within a chosen text window. A word embedding $W: \mathrm{words} \to \mathbb{R}^n$ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions). Embedding vectors trained for a language modelling task have very interesting properties: they make it possible to express concepts and equivalences such as the relation between capitals and countries, the relation between the queen and the king, and the meaning of superlatives.[2]
This table describes a word embedding learned with a skip-gram model trained on 783M words with 300 dimensions.
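
The queen and king relation can be illustrated with vector arithmetic: the word closest to vec(king) - vec(man) + vec(woman) should be queen. The sketch below (plain Python 3) uses hand-made 3-dimensional vectors that are purely hypothetical; real embeddings are learned from text and have hundreds of dimensions. The function names are also assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(embedding, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(embedding[a], embedding[b], embedding[c])]
    candidates = (w for w in embedding if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embedding[w], target))


# Hypothetical toy vectors: (royalty, male, female). Not learned from data.
toy = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.2],
}
answer = analogy(toy, "man", "king", "woman")
print(answer)
```

Cosine similarity is used rather than Euclidean distance because it compares directions and ignores vector length, which is the usual choice for this kind of query.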

If you are interested in knowing more about Deep Learning, then it could be worth having a look at a very exciting keynote by Andrew Ng[3]. The author of this book strongly believes that the next step for Deep Learning is to integrate progress in HPC computation (where Spark is) with GPU computation (where packages like Theano and Lasagne are). This will open the route to deep learning cloud computation, also leveraging the power of GPU platforms like CUDA.[4]

Tuesday, October 27, 2015

Collection of DataScience - volume 2


1. Why is Cross Validation important?
Solution
Code
2. Why is Grid Search important?
Solution
Code
3. What are the new Spark DataFrame and the Spark Pipeline? And how we can use the new ML library for Grid Search
Solution
Code
4. How to deal with categorical features? And what is one-hot-encoding?
Solution
Code
5. What are generalized linear models and what is an R Formula?
Solution
Code
6. What is the Word2Vec distributed representation?
Solution
Code
7. What are the Decision Trees?
Solution
Code
8. What are the Ensembles?
Solution
9. What is a Gradient Boosted Tree?
Solution
10. What is a Gradient Boosted Trees Regressor?
Solution
Code
11. Gradient Boosted Trees Classification
Solution
Code
12. What is a Random Forest?
Solution
Code
13. What is an AdaBoost classification algorithm?
Solution
14. What is a recommender system?
Solution
15. What is a collaborative filtering ALS algorithm?
Solution
Code
16. What is the DBSCAN clustering algorithm?
Solution
Code
17. What is a Streaming K-Means?
Solution
Code
18. What is the PCA Dimensional reduction technique?
Solution
Code
19. What is the SVD Dimensional reduction technique?
Solution
Code
20. What is Parquet?
Solution
Code
21. What is the Isotonic Regression?
Solution
Code
22. What is SVM with soft margins?
Solution
23. What is the Expectation Maximization Clustering algorithm?
Solution
24. What is a Gaussian Mixture?
Solution
Code
25. What is the Latent Dirichlet Allocation topic model?
Solution
Code
26. What is the Associative Rule Learning?
Solution
27. What is FP-growth?
Solution
Code
28. How to use the GraphX Library?
Solution
29. What is PageRank? And how to compute it with GraphX
Solution
Code
30. What is Power Iteration Clustering?
Solution
Code
31. What is a Perceptron?
Solution
32. What is an ANN (Artificial Neural Network)?
Solution
33. What are the activation functions?
Solution
34. How many types of Neural Networks are known?
35. How can you train a Neural Network
Solution
36. What application have the ANNs?
Solution
37. Can you code a simple ANNs in python?
Solution
Code
38. What support has Spark for Neural Networks?
Solution
Code
39. What is Deep Learning?
Solution
40. What are autoencoders and stacked autoencoders?
Solution
41. What are convolutional neural networks?
Solution
42. What are Restricted Boltzmann Machines, Deep Belief Networks and Recurrent networks?
Solution
43. Neural Network – Deep Learning - Theano
Solution
Code
Complexity
44. Neural Network – Deep Learning - Theano
Solution
Code
Complexity
45. Neural Network – Deep Learning - Lasagne
Solution
Code
Complexity
46. Splines
Solution
Code
Complexity
47. Search – Hill Climbing, Simulated Annealing, Greedy
Solution
Code
Complexity
48. MonteCarlo
Solution
Code
Complexity
49. Sampling (Gibbs)
Solution
Code
Complexity
50. Hypothesis Testing
Solution
Code
Complexity
51. Text Mining
Solution
Code
Complexity
52. NLP tagging
Solution
Code
Complexity
53. Bloom Filters
Solution
Code
Complexity
54. minHash
Solution
Code
Complexity
55. LSH
Solution
Code
Complexity
56. Count Min Sketches
Solution
Code