TextBlob: Simplified Text Processing
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.
Features
Noun phrase extraction
Part-of-speech tagging
Sentiment analysis
Classification (Naive Bayes, Decision Tree)
Language translation and detection powered by Google Translate
Tokenization (splitting text into words and sentences)
Word and phrase frequencies
Parsing
n-grams
Word inflection (pluralization and singularization) and lemmatization
Spelling correction
Add new models or languages through extensions
WordNet integration
Installation
Installing/Upgrading from the PyPI
pip install textblob
python -m textblob.download_corpora
This will install TextBlob and download the necessary NLTK corpora. If you need to change the default download directory, set the NLTK_DATA environment variable.
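For example, to download the corpora into a custom directory (the path here is only illustrative):

NLTK_DATA=/path/to/nltk_data python -m textblob.download_corpora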
Downloading the minimum corpora:
If you only intend to use TextBlob’s default models (no model overrides), you can pass the lite argument. This downloads only those corpora needed for basic functionality.
python -m textblob.download_corpora lite
Installing with Conda
Note: Conda builds are currently available for Mac OS X only.
TextBlob is also available as a conda package. To install with conda, run
conda install -c https://conda.anaconda.org/sloria textblob
python -m textblob.download_corpora
Python
TextBlob supports Python >=2.7 or >=3.4.
Dependencies
TextBlob depends on NLTK 3. NLTK will be installed automatically when you run
pip install textblob
TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat TextBlob objects as if they were Python strings that learned how to do Natural Language Processing.
Create a TextBlob
First, the import
from textblob import TextBlob
Let's create our first TextBlob
wiki = TextBlob("I love Natural Language Processing, not you!")
Part-of-speech (POS) Tagging
Part-of-speech tags can be accessed through the tags property.
wiki.tags  # returns a list of (word, POS tag) tuples
Noun Phrase Extraction
Similarly, noun phrases are accessed through the noun_phrases property.
wiki.noun_phrases  # returns a WordList of the noun phrases
Sentiment Analysis
The sentiment property returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
testimonial.sentiment.subjectivity
0.4357142857142857
Tokenization
zen = TextBlob("Data is a new fuel. "
"Explicit is better than implicit. " "Simple is better than complex. ")
zen.words
WordList(['Data', 'is', 'a', 'new', 'fuel', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])
zen.sentences
[Sentence("Data is a new fuel."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]
Sentence objects have the same properties and methods as TextBlobs.
for sentence in zen.sentences:
    print(sentence)
Data is a new fuel.
Explicit is better than implicit.
Simple is better than complex.
Word Inflection and Lemmatization
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.
sentence = TextBlob('Use 4 spaces per indentation level')
sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
sentence.words[2].singularize()
'space'
sentence.words[0].pluralize()
'Uses'
Words can be lemmatized just by calling the lemmatize method.
from textblob import Word
q = Word('lions')
q.lemmatize()
'lion'
q = Word("went")
q.lemmatize("v") #Pass in WordNet part of speech (verb)
'go'
WordNet Integration
You can access the synsets for a Word via the synsets property or the get_synsets method, optionally passing in a part of speech.
WordNet
WordNet is a lexical database for the English language, designed specifically for natural language processing.
Synset
Synset is a simple interface in NLTK for looking up words in WordNet. Synset instances are groupings of synonymous words that express the same concept. Some words have only one synset and some have several.
from textblob import Word
from textblob.wordnet import VERB

word = Word("goat")
word.synsets
[Synset('goat.n.01'),
 Synset('butt.n.03'),
 Synset('capricorn.n.01'),
 Synset('capricorn.n.03')]
Word("hack").get_synsets(pos=VERB)
[Synset('chop.v.05'),
Synset('hack.v.02'),
Synset('hack.v.03'),
Synset('hack.v.04'),
Synset('hack.v.05'),
Synset('hack.v.06'),
Synset('hack.v.07'),
Synset('hack.v.08')]
You can access the definitions for each synset via the definitions property or the define() method, which can also take an optional part-of-speech (pos) argument.
Word("length").definitions
['the linear extent in space from one end to the other; the longest dimension of something that is fixed in place',
 'continuance in time',
 'the property of being the extent of something from beginning to end',
 'size of the gap between two places',
 'a section of something that is long and narrow']
You can also create synsets directly.
from textblob.wordnet import Synset

octopus = Synset('octopus.n.02')
shrimp = Synset('shrimp.n.03')
octopus.path_similarity(shrimp)
0.1111111111111111
WordLists
A WordList is just a Python list with additional methods. TextBlob.words tokenizes the text into its words, leaving out whitespace and punctuation.
animals = TextBlob("cow sheep octopus")
animals.words
WordList(['cow', 'sheep', 'octopus'])
animals.words.pluralize() # It'll pluralize the words
WordList(['kine', 'sheep', 'octopodes'])
Spelling Correction
Use the correct() method to attempt spelling correction.
g = TextBlob('Can you pronounce czechuslovakia?')
print(g.correct())
An you pronounce czechoslovakia?
Word objects have a Word.spellcheck() method that returns a list of (word, confidence) tuples with spelling suggestions.
from textblob import Word

k = Word('longituode')
k.spellcheck()
[('longitude', 1.0)]
Spelling correction is based on Peter Norvig's "How to Write a Spelling Corrector", as implemented in the pattern library. It is about 70% accurate.
Get Word and Noun Phrase Frequencies
There are two ways to get the frequency of a word or noun phrase in a TextBlob.
The first one is through the word_counts dictionary.
sent = TextBlob('She sells sea shells at the sea shore.')
sent.word_counts['sea']
2
If you access the frequencies this way, the search will not be case sensitive, and words that are not found will have a frequency of 0.
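For instance, querying a word that never occurs in the text (the word 'ocean' here is just an illustration) returns 0:

sent.word_counts['ocean']  # 0, since 'ocean' does not appear in the text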
The second way is to use the count() method.
sent.words.count('sea')
2
You can specify whether or not the search should be case-sensitive (default is False).
sent.words.count('Sea', case_sensitive=True)
0
In the example above we searched for 'Sea'; the sentence contains 'sea' only in lowercase, so the case-sensitive search returns 0.
Each of these methods can also be used with noun phrases.
sent.noun_phrases.count('sea')
0
Translation and Language Detection
TextBlobs can be translated between languages.
blob = TextBlob(u'Something is better than nothing.')
blob.translate(to='hi')
TextBlob("कुछ नहीं से कुछ भला।")
If no source language is specified, TextBlob will attempt to detect the language; you can also specify the source language explicitly, as in the Chinese example further below. translate() raises TranslatorError if the TextBlob cannot be translated into the requested language, and NotTranslated if the translated result is the same as the input string.
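As a minimal sketch of handling these cases, assuming the exception classes are importable from textblob.exceptions:

from textblob.exceptions import NotTranslated, TranslatorError

try:
    TextBlob("Hello.").translate(to='en')  # source and target are both English
except NotTranslated:
    print("The translated result was the same as the input.")
except TranslatorError:
    print("The text could not be translated.")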
chinese_blob = TextBlob(u"有总比没有好")
chinese_blob.translate(from_lang="zh-CN", to='en')
TextBlob("Better than nothing")
You can also attempt to detect a TextBlob’s language using TextBlob.detect_language().
d = TextBlob(u"कुछ नहीं से कुछ भला")
d.detect_language()
Parsing
Use the parse() method to parse the text.
b = TextBlob("And now for something completely different.")
print(b.parse())
And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O
By default, TextBlob uses pattern's parser.
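You can also pass a parser explicitly when constructing a TextBlob; the sketch below assumes PatternParser (the same default parser, passed by hand) is importable from textblob.parsers:

from textblob import TextBlob
from textblob.parsers import PatternParser

# Passing the default parser explicitly; a custom parser would follow the same pattern
b = TextBlob("And now for something completely different.", parser=PatternParser())
print(b.parse())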
TextBlobs Are Like Python Strings!
You can use Python’s substring syntax.
zen[0:15]
TextBlob("Data is a new f")
You can use common string methods.
zen.upper()
TextBlob("DATA IS A NEW FUEL. EXPLICIT IS BETTER THAN IMPLICIT. SIMPL E IS BETTER THAN COMPLEX. ")
zen.find('than')  # 'than' first occurs at index 39
39
You can make comparisons between TextBlobs and strings.
a_blob = TextBlob('apple')
s_blob = TextBlob('samsung')
a_blob < s_blob
True
a_blob == 'apple'
True
You can concatenate and interpolate TextBlobs and strings.
a_blob + ' and ' + s_blob

TextBlob("apple and samsung")

"{0} and {1}".format(a_blob, s_blob)

'apple and samsung'
n-grams
The TextBlob.ngrams() method returns a list of tuples of n successive words.
blob = TextBlob("Now is better than never.") blob.ngrams(n= 3)
[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never'])]
Get Start and End Indices of Sentences
Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob.
for k in zen.sentences:
    print(k)
    print("---- Starts at index {}, Ends at index {}".format(k.start, k.end))
Data is a new fuel.
---- Starts at index 0, Ends at index 19
Explicit is better than implicit.
---- Starts at index 20, Ends at index 53
Simple is better than complex.
---- Starts at index 54, Ends at index 84
Let's start building the Text Classification system
The textblob.classifiers module makes it simple to create custom classifiers. As an example, let’s create a custom sentiment analyzer.
Loading Data and Creating a Classifier
First we’ll create some training and test data.
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')
]
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
Now we’ll create a Naive Bayes classifier, passing the training data into the constructor.
from textblob.classifiers import NaiveBayesClassifier

cl = NaiveBayesClassifier(train)
Loading Data from Files
You can also load data from common file formats including CSV, JSON, and TSV. CSV files should be formatted like so:
I love this sandwich.,pos
This is an amazing place!,pos
I do not like this restaurant,neg
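You can then load the CSV file and pass it into the classifier constructor (the filename train.csv is assumed):

with open('train.csv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")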
JSON files should be formatted like so:

[
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
    {"text": "I do not like this restaurant", "label": "neg"}
]

You can load a JSON file the same way:
with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")
Classifying Text
Call the classify(text) method to use the classifier.
cl.classify("This is an amazing library!")
'pos'
You can get the label probability distribution with the prob_classify(text) method.
prob_dist = cl.prob_classify("I am suffering from cough and cold.")
prob_dist.max()
'neg'
round(prob_dist.prob("neg"), 2)
0.71
round(prob_dist.prob("pos"), 2)
0.29
Classifying TextBlobs
Another way to classify text is to pass a classifier into the constructor of TextBlob and call its
classify() method.
from textblob import TextBlob
blob = TextBlob("Alcohol is good. But the hangover is horrible.", clas blob.classify()
'pos'
The advantage of this approach is that you can classify sentences within a TextBlob.
for b in blob.sentences:
    print(b)
    print(b.classify())
Alcohol is good.
pos
But the hangover is horrible.
pos
Evaluating Classifiers
To compute the accuracy on our test set, use the accuracy(test_data) method.
cl.accuracy(test)
1.0
Use the show_informative_features() method to display a listing of the most informative features.
cl.show_informative_features(5)

Most Informative Features
             contains(I) = False             pos : neg    =      1.9 : 1.0
          contains(this) = False             pos : neg    =      1.9 : 1.0
             contains(I) = True              neg : pos    =      1.7 : 1.0
          contains(this) = True              neg : pos    =      1.7 : 1.0
            contains(an) = False             neg : pos    =      1.5 : 1.0
Updating Classifiers with New Data
Use the update(new_data) method to update a classifier with new training data.
new_data = [('She is my best friend.', 'pos'),
            ("I'm happy to have a new friend.", 'pos'),
            ("Stay thirsty, my friend.", 'pos'),
            ("He ain't from around here.", 'neg')]
cl.update(new_data)
True
cl.accuracy(test)
1.0
Feature Extractors
By default, the NaiveBayesClassifier uses a simple feature extractor that indicates which words in the training set are contained in a document.
For example, the sentence “I love” might have the features contains(love): True or contains(hate): False.
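To see this in action, you can call the default extractor (basic_extractor in textblob.classifiers) by hand; the tiny training set here is just an illustration, and the exact keys returned depend on the training vocabulary:

from textblob.classifiers import basic_extractor

tiny_train = [('I love this sandwich.', 'pos'), ('my boss is horrible.', 'neg')]
# Returns a dict of boolean features, e.g. {'contains(love)': True, 'contains(horrible)': False, ...}
print(basic_extractor("I love", tiny_train))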
You can override this feature extractor by writing your own. A feature extractor is simply a function with document (the text to extract features from) as the first argument. The function may include a second argument, train_set (the training dataset), if necessary.
The function should return a dictionary of features for document.
For example, let’s create a feature extractor that just uses the first and last words of a document as its features.
def end_word_extractor(document):
    tokens = document.split()
    first_word, last_word = tokens[0], tokens[-1]
    feats = {}
    feats["first({0})".format(first_word)] = True
    feats["last({0})".format(last_word)] = False
    return feats
features = end_word_extractor("I love")
assert features == {'last(love)': False, 'first(I)': True}
We can then use the feature extractor in a classifier by passing it as the second argument of the constructor.
cl2 = NaiveBayesClassifier(test, feature_extractor=end_word_extractor)
blob = TextBlob("I'm excited to try my new classifier.", classifier=cl blob.classify()
'pos'
Credits: https://textblob.readthedocs.io/en/dev/classifiers.html