Saturday, October 31, 2015

What is Natural Language Processing?

An excerpt from my new book - instant questions and answers, with code. This is #61, second volume (more than 100 in total).

Natural Language Processing (NLP) is a complex topic, and there are books devoted entirely to it. This book provides an introductory survey based on NLTK, the Python Natural Language Toolkit. Let us start.

Text is made up of sentences, and sentences are composed of words. So the first step in NLP is frequently to separate those basic units according to the rules of the chosen language. Very frequent words often carry little information and should be filtered out as stopwords. The first code fragment splits the text into sentences and then each sentence into words, from which the stopwords are removed.
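The full NLTK listing appears in the Code section below. As a library-free illustration of the same two steps, here is a minimal sketch that uses only the standard library. The sentence-splitting regex, the tiny STOPWORDS set, the helper names and the shortened second sentence are all assumptions made for this example; NLTK's punkt tokenizer and stopword corpus are far more complete. Words are compared in lowercase so that a capitalized stopword at the start of a sentence is filtered out as well.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer).
STOPWORDS = {"a", "and", "is", "it", "of", "that", "the", "which"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def content_words(sentence):
    """Split into runs of word characters or punctuation, then drop stopwords."""
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    return [t for t in tokens if t.lower() not in STOPWORDS]

text = ("Poetry is the record of the best and happiest moments "
        "of the happiest and best minds. The sword consumes the scabbard.")

sentences = split_sentences(text)
for sentence in sentences:
    print(content_words(sentence))
```

A regex splitter like this one breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained sentence tokenizer for real text.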

In addition, it can be interesting to find out the meaning of words. Here WordNet[1] can help: it organizes terms into synsets (sets of synonyms), which are arranged in an inheritance tree where the more abstract terms are hypernyms and the more specific terms are hyponyms. WordNet can also help in finding the synonyms and antonyms (opposite words) of a given term. The code fragment finds the synonyms of the word love in English.
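To make the hypernym and hyponym relation concrete, here is a toy sketch in plain Python. The HYPERNYM table is a small hand-built hierarchy invented for illustration. It is not WordNet data, and the function names are hypothetical. Real synsets can also have more than one hypernym, which this sketch ignores.

```python
# Hand-built toy hierarchy (an assumption for illustration, not WordNet data):
# each term maps to its single, more abstract parent (its hypernym).
HYPERNYM = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "animal",
    "cat": "animal",
    "animal": "entity",
}

def hypernym_path(term):
    """Walk from a term up to the root, collecting ever more abstract terms."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def hyponyms(term):
    """Return the direct, more specific children of a term."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == term)

print(hypernym_path("poodle"))  # ['poodle', 'dog', 'animal', 'entity']
print(hyponyms("dog"))          # ['beagle', 'poodle']
```

In NLTK the corresponding calls on a synset are hypernyms() and hyponyms(), while antonyms are reached through the lemmas of a synset with antonyms().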

Moreover, words can be stemmed, and the rules for stemming differ greatly from language to language. NLTK provides the SnowballStemmer, which supports multiple languages. The code fragment finds the stem of the word volver in Spanish.
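Because the rules are language specific, each stemmer is built for one language and applied only to words of that language. A short sketch, assuming NLTK is installed (as far as I know the Snowball stemmers need no extra corpus download). The English word running is my own example and does not come from the book.

```python
from nltk.stem import SnowballStemmer

# Languages supported by the Snowball stemmer in the installed NLTK version.
print(SnowballStemmer.languages)

spanish = SnowballStemmer("spanish")
english = SnowballStemmer("english")

stem_es = spanish.stem("volver")   # Spanish infinitive, "to return"
stem_en = english.stem("running")  # English example chosen for this sketch

print(stem_es)
print(stem_en)
```

Note that a stem such as volv is not necessarily a dictionary word; it is only a common prefix shared by the inflected forms.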

In certain situations, it is useful to know whether a word is a noun, an adjective, a verb and so on. This process is called part-of-speech tagging, and NLTK provides convenient support for it, as illustrated in the last code fragment below.
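A unigram tagger simply remembers, for every word seen in the training data, the tag it carried most often. A word that never appeared in training gets no tag at all, which is why None shows up in the output further down. The sketch below reimplements that idea with the standard library only. The three training sentences and the fallback tag NN are made up for the example, and the function names are hypothetical, not NLTK's API.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (made up for this sketch; Penn Treebank style tags).
train_sents = [
    [("the", "DT"), ("sword", "NN"), ("would", "MD"), ("contain", "VB"), ("it", "PRP")],
    [("poetry", "NN"), ("is", "VBZ"), ("the", "DT"), ("record", "NN")],
    [("they", "PRP"), ("record", "VBP"), ("the", "DT"), ("record", "NN")],
]

def train_unigram(tagged_sents):
    """Map every training word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default=None):
    """Tag known words from the model; unknown words get the default tag."""
    return [(w, model.get(w, default)) for w in words]

model = train_unigram(train_sents)
words = ["the", "record", "would", "contain", "lightning"]

print(tag(words, model))                # the unseen word 'lightning' is tagged None
print(tag(words, model, default="NN"))  # fall back to a noun guess instead
```

NLTK offers the same kind of fallback through the backoff argument, for example UnigramTagger(train, backoff=DefaultTagger('NN')).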

Code

import nltk.data

text = "Poetry is the record of the best and happiest moments \
of the happiest and best minds. Poetry is a sword of lightning, \
ever unsheathed, which consumes the scabbard that would contain it."

# download the stopword lists (needed only once)
#nltk.download("stopwords")
from nltk.corpus import stopwords
stop = stopwords.words('english')

# download the punkt package (needed only once)
#nltk.download('punkt')
# load the sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text)
print(sentences)

# tokenize each sentence into words and drop the stopwords
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
for sentence in sentences:
    words = tokenizer.tokenize(sentence)
    # compare in lowercase: the stopword list is all lowercase
    words = [w for w in words if w.lower() not in stop]
    print(words)

# WordNet: print the lemma names of every synset of 'love'
#nltk.download("wordnet")
from nltk.corpus import wordnet
for synset in wordnet.synsets('love'):
    print("Synonyms: " + ", ".join(synset.lemma_names()))

# SnowballStemmer
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
print("Spanish stemmer")
print(stemmer.stem('volver'))

# part-of-speech tagger trained on the treebank sample
#nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
trainSentences = treebank.tagged_sents()[:5000]
tagger = UnigramTagger(trainSentences)
# 'words' still holds the filtered words of the last sentence
tagged = tagger.tag(words)
print(tagged)

Outcome

['Poetry is the record of the best and happiest moments of the happiest and best minds.', 'Poetry is a sword of lightning, ever unsheathed, which consumes the scabbard that would contain it.']
['Poetry', 'record', 'best', 'happiest', 'moments', 'happiest', 'best', 'minds', '.']
['Poetry', 'sword', 'lightning', ',', 'ever', 'unsheathed', ',', 'consumes', 'scabbard', 'would', 'contain', '.']
Synonyms: love
Synonyms: love, passion
Synonyms: beloved, dear, dearest, honey, love
Synonyms: love, sexual_love, erotic_love
Synonyms: love
Synonyms: sexual_love, lovemaking, making_love, love, love_life
Synonyms: love
Synonyms: love, enjoy
Synonyms: love
Synonyms: sleep_together, roll_in_the_hay, love, make_out, make_love, sleep_with, get_laid, have_sex, know, do_it, be_intimate, have_intercourse, have_it_away, have_it_off, screw, fuck, jazz, eff, hump, lie_with, bed, have_a_go_at_it, bang, get_it_on, bonk
Spanish stemmer
volv
[('Poetry', None), ('sword', None), ('lightning', None), (',', u','), ('ever', u'RB'), ('unsheathed', None), (',', u','), ('consumes', None), ('scabbard', None), ('would', u'MD'), ('contain', u'VB'), ('.', u'.')]
