Antonio Gulli's coding playground: What is Natural Language Processing?

Saturday, October 31, 2015

What is Natural Language Processing?

An excerpt from my new book - instant questions and answers, with code. This is #61, second volume (more than 100 in total)

Natural Language Processing (NLP) is a complex topic, and entire books are devoted to it. This book provides an introductory survey based on NLTK, the Python Natural Language Toolkit. Let us start.

Text is made up of sentences, and sentences are composed of words. So the first step in NLP is frequently to separate those basic units according to the rules of the chosen language. Very frequent words often carry little information and should be filtered out as stopwords. The first code fragment splits the text into sentences and then each sentence into words, from which the stopwords are removed.
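The split-then-filter pipeline described above can be sketched with the standard library alone, without any downloads. This is a simplified illustration and not NLTK's Punkt algorithm: the regular expressions, the helper names split_sentences and content_words, and the tiny STOPWORDS set are all assumptions made for the example.

```python
import re

# Toy stopword list (an assumption for this sketch; NLTK's English list is much longer).
STOPWORDS = {"is", "the", "of", "and", "a", "that", "it", "which"}

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def content_words(sentence):
    # Tokens are runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    # Compare in lower case so that "The" and "the" are both filtered.
    return [t for t in tokens if t.lower() not in STOPWORDS]

text = "Poetry is the record of the best minds. Poetry is a sword of lightning."
sentences = split_sentences(text)
filtered = [content_words(s) for s in sentences]
print(sentences)
print(filtered)
# [['Poetry', 'record', 'best', 'minds', '.'], ['Poetry', 'sword', 'lightning', '.']]
```

A naive splitter like this one breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained sentence tokenizer for real text.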

In addition, it can be interesting to find out the meaning of words. Here WordNet [1] can help: it organizes terms into synsets, which are arranged in an inheritance tree where the more abstract terms are hypernyms and the more specific terms are hyponyms. WordNet can also help in finding the synonyms and antonyms (opposite words) of a given term. The code fragment finds the synonyms of the word love in English.
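The hypernym and hyponym relations can be pictured with a toy taxonomy. The sketch below uses a small hand-written dictionary, not WordNet data, and the helper names hypernym_path and hyponyms are made up for the illustration. It only shows how walking up the tree yields increasingly abstract terms.

```python
# Hand-written child -> parent links (illustrative only, not taken from WordNet).
HYPERNYM_OF = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}

def hypernym_path(term):
    # Follow parent links until a term with no parent (the root) is reached.
    path = [term]
    while path[-1] in HYPERNYM_OF:
        path.append(HYPERNYM_OF[path[-1]])
    return path

def hyponyms(term):
    # Direct children of a term, sorted for a stable result.
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == term)

print(hypernym_path("poodle"))  # ['poodle', 'dog', 'mammal', 'animal']
print(hyponyms("dog"))          # ['beagle', 'poodle']
```

In NLTK the same navigation is done on real data: synsets expose hypernyms() and hyponyms(), and lemmas expose antonyms().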

Moreover, words can be stemmed, and the rules for stemming differ widely from language to language. NLTK provides the SnowballStemmer, which supports multiple languages. The code fragment finds the stem of the word volver in Spanish.
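To see why stemming rules depend on the language, here is a deliberately crude suffix-stripping sketch. It is not the Snowball algorithm: the suffix lists and the toy_stem helper are assumptions chosen for the example, and real stemmers apply many more rules and exceptions.

```python
# Per-language suffix lists, longest suffixes first (toy rules, not Snowball's).
SUFFIXES = {
    "english": ["ing", "ed", "s"],
    "spanish": ["iendo", "ando", "er", "ar", "ir"],
}

def toy_stem(word, language, min_stem=3):
    # Strip the first matching suffix, provided enough of the word remains.
    word = word.lower()
    for suffix in SUFFIXES[language]:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

print(toy_stem("volver", "spanish"))     # volv
print(toy_stem("volviendo", "spanish"))  # volv
print(toy_stem("walking", "english"))    # walk
print(toy_stem("volver", "english"))     # volver (no English rule applies)
```

The last call shows the point of the paragraph: applying the rules of the wrong language leaves the word untouched or, in worse cases, strips the wrong ending.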

In certain situations, it is convenient to know whether a word is a noun, an adjective, a verb and so on. This process is called part-of-speech tagging, and NLTK provides convenient support for this type of analysis, as illustrated in the code fragment below.
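A unigram tagger simply remembers the most frequent tag seen for each word in its training data, which is why words it has never seen receive no tag at all. The standard-library sketch below illustrates that idea on a tiny hand-tagged sample. The training sentences, the train_unigram and tag helpers, and the default fallback argument are assumptions for the example, not NLTK code.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample with Penn Treebank-style tags (made up for this sketch).
TRAIN = [
    [("the", "DT"), ("sword", "NN"), ("would", "MD"), ("contain", "VB"), ("it", "PRP")],
    [("the", "DT"), ("record", "NN"), ("would", "MD"), ("record", "VB"), ("it", "PRP")],
    [("a", "DT"), ("record", "NN")],
]

def train_unigram(tagged_sentences):
    # Count how often each tag occurs for each word.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, word_tag in sentence:
            counts[word][word_tag] += 1
    # Keep only the most frequent tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default=None):
    # Unknown words get the default tag (None unless a fallback is given).
    return [(w, model.get(w, default)) for w in words]

model = train_unigram(TRAIN)
words = ["the", "record", "would", "contain", "poetry"]
print(tag(words, model))
# [('the', 'DT'), ('record', 'NN'), ('would', 'MD'), ('contain', 'VB'), ('poetry', None)]
print(tag(words, model, default="NN"))
```

NLTK offers the same fallback idea through the backoff parameter of its taggers, for example UnigramTagger(train, backoff=DefaultTagger('NN')).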

Code

import nltk.data

text = "Poetry is the record of the best and happiest moments \
of the happiest and best minds. Poetry is a sword of lightning, \
ever unsheathed, which consumes the scabbard that would contain it."

# download stopwords
#nltk.download("stopwords")
from nltk.corpus import stopwords
stop = stopwords.words('english')

# download the punkt package
#nltk.download('punkt')
# load the sentence tokenizer and split the text into sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text)
print(sentences)

# tokenize each sentence into words and drop the stopwords
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
for sentence in sentences:
    words = tokenizer.tokenize(sentence)
    words = [w for w in words if w not in stop]
    print(words)

# wordnet: print the lemma names of every synset of 'love'
#nltk.download("wordnet")
from nltk.corpus import wordnet
for synset in wordnet.synsets('love'):
    print("Synonyms: " + ", ".join(synset.lemma_names()))

# SnowballStemmer
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
print("Spanish stemmer")
print(stemmer.stem('volver'))

# tagger: train a unigram tagger on the treebank sample
#nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:5000]
tagger = UnigramTagger(train_sentences)
# tag the filtered words of the last sentence
tagged = tagger.tag(words)
print(tagged)

Outcome

['Poetry is the record of the best and happiest moments of the happiest and best minds.', 'Poetry is a sword of lightning, ever unsheathed, which consumes the scabbard that would contain it.']
['Poetry', 'record', 'best', 'happiest', 'moments', 'happiest', 'best', 'minds', '.']
['Poetry', 'sword', 'lightning', ',', 'ever', 'unsheathed', ',', 'consumes', 'scabbard', 'would', 'contain', '.']
Synonyms: love
Synonyms: love, passion
Synonyms: beloved, dear, dearest, honey, love
Synonyms: love, sexual_love, erotic_love
Synonyms: love
Synonyms: sexual_love, lovemaking, making_love, love, love_life
Synonyms: love
Synonyms: love, enjoy
Synonyms: love
Synonyms: sleep_together, roll_in_the_hay, love, make_out, make_love, sleep_with, get_laid, have_sex, know, do_it, be_intimate, have_intercourse, have_it_away, have_it_off, screw, fuck, jazz, eff, hump, lie_with, bed, have_a_go_at_it, bang, get_it_on, bonk
Spanish stemmer
volv
[('Poetry', None), ('sword', None), ('lightning', None), (',', u','), ('ever', u'RB'), ('unsheathed', None), (',', u','), ('consumes', None), ('scabbard', None), ('would', u'MD'), ('contain', u'VB'), ('.', u'.')]

[1] https://wordnet.princeton.edu/
