Natural Language Processing (NLP) is a complex
topic, and entire books are devoted to it. In this book, an introductory
survey is provided based on NLTK, the Natural Language Toolkit for Python. Let
us start.
Text is made up of sentences, and sentences are composed of words. So the first step in NLP is frequently to separate those
basic units according to the rules of the chosen language. Very frequent
words often carry little information and should be filtered out as stopwords. The first fragment of the listing under Code below splits
the text into sentences and then the sentences into words, from which the stopwords are
removed.
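One detail worth noticing is that the NLTK English stopword list is lowercase, so a capitalized word such as "The" at the start of a sentence slips through a plain membership test. What follows is a minimal sketch of a case-insensitive filter. The tiny stopword set and the helper name remove_stopwords are assumptions made for illustration, chosen so that the example runs without downloading any NLTK data.

```python
from nltk.tokenize import WordPunctTokenizer

# Hypothetical, hand-made stopword set used only for this sketch;
# real code would use stopwords.words('english') from nltk.corpus.
SAMPLE_STOPWORDS = {"the", "of", "and", "is", "a"}


def remove_stopwords(sentence, stopwords=SAMPLE_STOPWORDS):
    """Tokenize a sentence and drop stopwords, ignoring case."""
    tokens = WordPunctTokenizer().tokenize(sentence)
    # Compare the lowercased token, but keep the original spelling.
    return [t for t in tokens if t.lower() not in stopwords]


print(remove_stopwords("The record of the best minds."))
# prints ['record', 'best', 'minds', '.']
```

Punctuation tokens such as the final period survive this filter, exactly as they do in the main listing; a further test such as t.isalpha() could drop them if they are not wanted.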
In addition, it can be interesting to
find out the meaning of words, and here WordNet[1] can
help with its organization of terms into synsets (sets of synonyms),
which are arranged in an inheritance tree where the more abstract terms are
hypernyms and the more specific terms are hyponyms. WordNet can also help in
finding synonyms and antonyms (words with the opposite meaning) of a given
term. The second fragment of the listing under Code finds the synonyms of the word love in English.
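The hypernym and antonym relations mentioned above can be explored with the same wordnet corpus reader. The sketch below is only illustrative: the helper names antonyms_of and hypernym_chain are hypothetical, the chain follows just the first hypernym at each level (a synset may have several), and the WordNet data must have been downloaded once with nltk.download("wordnet").

```python
from nltk.corpus import wordnet  # needs nltk.download("wordnet") once


def antonyms_of(word):
    """Collect antonym lemma names over all synsets of a word."""
    found = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            # Antonymy is defined between lemmas, not between synsets.
            for antonym in lemma.antonyms():
                found.add(antonym.name())
    return found


def hypernym_chain(synset):
    """Follow the first hypernym of each synset towards the root."""
    chain = [synset]
    while True:
        parents = chain[-1].hypernyms()
        # Stop at the root, or if a synset would be visited twice.
        if not parents or parents[0] in chain:
            break
        chain.append(parents[0])
    return chain
```

For example, antonyms_of('good') returns a set of opposite terms, and hypernym_chain(wordnet.synset('dog.n.01')) climbs from the specific synset towards progressively more abstract ones. Calling hyponyms() on a synset walks the tree in the other direction.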
Moreover, words can be stemmed, that is, reduced to a root form, and the rules
for stemming differ considerably from language to language. NLTK provides the SnowballStemmer,
which supports multiple languages. The third fragment of the listing under Code finds the stem of the word volver in Spanish.
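To see which languages the Snowball stemmer covers, and to stem words in more than one of them, something like the following sketch can be used. The helper name stem_words is hypothetical; SnowballStemmer.languages is the attribute NLTK exposes with the supported language names, and no corpus download is needed for stemming.

```python
from nltk.stem import SnowballStemmer


def stem_words(words, language):
    """Stem each word with the Snowball stemmer for the given language."""
    if language not in SnowballStemmer.languages:
        raise ValueError("unsupported language: " + language)
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(w) for w in words]


# List the languages supported by the installed NLTK version.
print(sorted(SnowballStemmer.languages))

print(stem_words(["volver"], "spanish"))           # prints ['volv']
print(stem_words(["running", "runs"], "english"))  # prints ['run', 'run']
```

A stem is not necessarily a dictionary word: volv is simply the common prefix to which the Spanish rules reduce the different forms of the verb.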
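In certain situations, it is convenient
to know whether a word is a noun, an adjective, a verb, and so on. This is
the process of part-of-speech tagging, and NLTK provides convenient support for
this type of analysis, as illustrated by the last fragment of the listing under Code below. The tagger used there is a UnigramTagger trained on tagged sentences from the treebank corpus; words it has never seen during training are tagged None.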
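A unigram tagger can only tag words it met during training, which is why unknown words come back with the tag None. One common remedy is a backoff tagger. The sketch below illustrates the idea on a tiny, hand-tagged training set that is purely hypothetical (it stands in for the treebank sentences so that no download is needed), and the choice of tagging every unknown word as NN is a crude assumption, not a recommendation.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical, hand-tagged training data used only for this sketch;
# the main listing trains on treebank.tagged_sents() instead.
TRAIN = [[("Poetry", "NN"), ("is", "VBZ"), ("a", "DT"), ("sword", "NN")]]

# Without a backoff, words unseen in training get the tag None.
plain = UnigramTagger(TRAIN)

# With a backoff, unseen words are passed on to the DefaultTagger.
with_backoff = UnigramTagger(TRAIN, backoff=DefaultTagger("NN"))

tokens = ["Poetry", "is", "lightning"]
print(plain.tag(tokens))
# prints [('Poetry', 'NN'), ('is', 'VBZ'), ('lightning', None)]
print(with_backoff.tag(tokens))
# prints [('Poetry', 'NN'), ('is', 'VBZ'), ('lightning', 'NN')]
```

Backoff taggers can be chained, so a more specific tagger can defer to a more general one and finally to a default.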
Code
import nltk.data

text = "Poetry is the record of the best and happiest moments \
of the happiest and best minds. Poetry is a sword of lightning, \
ever unsheathed, which consumes the scabbard that would contain it."

# download the stopwords (needed only once)
#nltk.download("stopwords")
from nltk.corpus import stopwords
stop = stopwords.words('english')

# download the punkt package (needed only once)
#nltk.download('punkt')
# load the sentence tokenizer and split the text into sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text)
print(sentences)

# tokenize each sentence into words and remove the stopwords
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
for sentence in sentences:
    words = tokenizer.tokenize(sentence)
    words = [w for w in words if w not in stop]
    print(words)

# wordnet: print the lemma names (synonyms) of every synset of 'love'
#nltk.download("wordnet")
from nltk.corpus import wordnet
for synset in wordnet.synsets('love'):
    print("Synonyms: " + ", ".join(synset.lemma_names()))

# SnowballStemmer: stem a Spanish word
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
print("Spanish stemmer")
print(stemmer.stem('volver'))

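# tagger: train a unigram part-of-speech tagger on the treebank corpus
#nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:5000]
tagger = UnigramTagger(train_sentences)
# 'words' still holds the filtered tokens of the last sentence
tagged = tagger.tag(words)
print(tagged)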
Outcome
['Poetry is the record of the best and happiest moments of the happiest and best minds.', 'Poetry is a sword of lightning, ever unsheathed, which consumes the scabbard that would contain it.']
['Poetry', 'record', 'best', 'happiest', 'moments', 'happiest', 'best', 'minds', '.']
['Poetry', 'sword', 'lightning', ',', 'ever', 'unsheathed', ',', 'consumes', 'scabbard', 'would', 'contain', '.']
Synonyms: love
Synonyms: love, passion
Synonyms: beloved, dear, dearest, honey, love
Synonyms: love, sexual_love, erotic_love
Synonyms: love
Synonyms: sexual_love, lovemaking, making_love, love, love_life
Synonyms: love
Synonyms: love, enjoy
Synonyms: love
Synonyms: sleep_together, roll_in_the_hay, love, make_out, make_love, sleep_with, get_laid, have_sex, know, do_it, be_intimate, have_intercourse, have_it_away, have_it_off, screw, fuck, jazz, eff, hump, lie_with, bed, have_a_go_at_it, bang, get_it_on, bonk
Spanish stemmer
volv
[('Poetry', None), ('sword', None), ('lightning', None), (',', u','), ('ever', u'RB'), ('unsheathed', None), (',', u','), ('consumes', None), ('scabbard', None), ('would', u'MD'), ('contain', u'VB'), ('.', u'.')]