Text Analysis Part Two: Sentiment Analysis With the Natural Language Toolkit

Tutorial by Sarah McTavish (PhD Candidate, University of Waterloo)


Introduction

In Part 1, we examined how we can use the text analysis software AntConc to investigate concordances and collocates in a large text file. This type of analysis is incredibly useful for examining keyword use while maintaining the context within which the keyword appeared. In Part 2, we look at the Python Natural Language Toolkit and how to do more complex sentiment analysis on our large text source. With this type of analysis, we can calculate whether a word or phrase in our text is primarily positive, negative, or neutral. Sentiment analysis can shed light on the emotions expressed when discussing a given topic; when combined with other types of text analysis, such as the concordance and collocation analysis covered in Part 1, or with network analysis, sentiment analysis can be a powerful tool for bringing context to a large text source.


Getting Started

In this tutorial, I use the web-based Jupyter Notebooks in the Anaconda Navigator to write and run my code. Jupyter Notebooks keep the different parts and steps of a process in order, allow code to be run directly in the editor, and make it easy to share code with others. Using the Natural Language Toolkit (NLTK) requires some knowledge of coding in Python. The NLTK Textbook provides basic instructions in its preface and first chapter, and if you are a beginner to Python it is worth working through the introductory exercises to get a feel for the code.

This tutorial builds on this lesson on exploratory text analysis using sentiment analysis by Zoë Wilkinson Saldaña on The Programming Historian. That lesson explains how to load the NLTK libraries and perform sentiment analysis on sentences and paragraphs. You may find it useful to work through it first, as it provides a good basis for working with smaller chunks of text before moving on to larger text files.
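To get a sense of what sentence-level scoring looks like before working with a large file, here is a minimal sketch, assuming NLTK is installed and the VADER lexicon has been downloaded (the download step appears later in this tutorial); the example sentence is invented:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Assumes nltk.download('vader_lexicon') has already been run.
sid = SentimentIntensityAnalyzer()

# VADER returns neg, neu, and pos proportions, plus a compound score
# ranging from -1 (most negative) to +1 (most positive).
print(sid.polarity_scores("The evacuation went smoothly and everyone is safe."))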


Loading the Fort McMurray Wildfires Collection

In this tutorial, I will be working with the University of Alberta's Fort McMurray Wildfires Collection. These northern Alberta wildfires took place during early May 2016 and resulted in the evacuation of nearly 90,000 residents. The fires spread quickly over several days, leading to dramatic photographs and video footage of residents fleeing their homes as the fires engulfed parts of the city and the surrounding area.

The dramatic nature of the wildfires and evacuation sparked considerable media attention worldwide. This international interest compounded the expected online coverage from local and regional residents and officials, who used the internet to spread necessary information on evacuation efforts and the state of the city during the fires. The Fort McMurray Wildfires Collection contains sites crawled and collected during coverage of these wildfires; for the purposes of this tutorial, we will be using a raw text version of this collection, which is 2.6 gigabytes in size. By performing sentiment analysis on this text file, we can determine whether our chosen keywords appear more commonly in a positive, negative, or neutral context within this international media coverage.

This derivative file can be downloaded from the Archives Unleashed Cloud's collection pages; see the highlighted button below.

The derivative download buttons with the Full Text one highlighted

In general, this type of analysis can handle files up to about 2 gigabytes in size. Because our file is 2.6 gigabytes, we'll have to split it into two. As with the previous tutorial, this requires the use of the command line. You can find a tutorial about how to use it here. In the Terminal, I navigate to the directory where I downloaded the full-text file and enter the command:

split -l 400000 7368-fulltext.txt

This splits the file into two parts of up to 400,000 lines each, named xaa and xab. We're going to rename these files FMMWildfires-A.txt and FMMWildfires-B.txt, to keep things straight.
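The rename can be done in the same Terminal session; for example, on macOS or Linux:

mv xaa FMMWildfires-A.txt
mv xab FMMWildfires-B.txt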


Running NLTK in Jupyter Notebooks

After installing Anaconda Navigator, open the program and choose Jupyter Notebooks from the launch screen. You may need to install Jupyter Notebooks the very first time you run it.

Anaconda Navigator launch screen

Jupyter Notebooks will then display a list of folders on your computer. You might find it helpful to create a new folder containing just the text files you wish to work with. In this example, I have created a folder on my Desktop called "FMMWildfires" that contains two files: the two halves of the split full-text file.

Opening a new Notebook in Jupyter Notebooks

Once you have navigated to the folder with the text file that you wish to work with, click on the "New" button near the top right of the screen and choose "Python 3 Notebook."

Before we can run sentiment analysis on our file, we need to download two resources for NLTK: the VADER lexicon, which calculates negative, positive, and neutral values for our text, and the punkt tokenizer, which splits our large text file into individual sentences. In Jupyter, we will enter this code into our first cell and click "Run."

import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')

This cell downloads the needed lexicons for NLTK sentiment analysis

Note: If you have already worked through The Programming Historian's sentiment analysis lesson, you will get a message that these libraries are already up to date.

Our next section of code will import the relevant modules from the NLTK libraries.

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize

We then initialize VADER so we can use it within our Python script.

sid = SentimentIntensityAnalyzer()

We will also initialize our 'english.pickle' word tokenizer function and give it a short name, 'tokenizer.'

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
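If you want to check what the tokenizer does, you can feed it a short string. This quick test, with an invented example sentence, is optional and not part of the main pipeline:

# Optional check: the punkt tokenizer splits raw text into a list of sentences.
print(tokenizer.tokenize("The fire spread quickly. Thousands of residents fled."))
# Expected output: ['The fire spread quickly.', 'Thousands of residents fled.']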

We then open our first text file and designate it as 'fmwildfires1.'

f = open('FMMWildfires-A.txt')
fmwildfires1 = f.read()

We then tell our tokenizer to break up our text file, 'fmwildfires1,' into a list of sentences.

sentences = tokenizer.tokenize(fmwildfires1)

All together, our cell in the Jupyter Notebook looks like this:

This cell imports the needed modules, opens our file, and breaks it into individual sentences

This code will be run as one program which does a number of things to prepare our file for sentiment analysis. Click "Run" — if everything has worked, you should not see any output or error messages.

Our next step will find all sentences in our text file that include a specific keyword and gather them into a list. In this example, I have chosen "evacuation" as my keyword and named the list of all sentences that include this word "evacuationlist". The ".*" parts of the pattern are wildcards that match any text before and after the word "evacuation" itself. The last part of the code prints the first ten sentences in the list, so we can see that it worked.

import re

r = re.compile(".* evacuation .*")
evacuationlist = list(filter(r.match, sentences))
print(evacuationlist[:10])

This cell finds all sentences containing the keyword "evacuation" and prints the first ten
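Note that this pattern only matches "evacuation" in lowercase with a space on each side, so it will miss sentences that begin with "Evacuation." A sketch of a more forgiving alternative, my own variation rather than part of the original lesson:

# A variation on the pattern above: re.search looks anywhere in the sentence,
# \b matches word boundaries instead of literal spaces, and re.IGNORECASE
# also catches "Evacuation" at the start of a sentence.
r = re.compile(r"\bevacuation\b", re.IGNORECASE)
evacuationlist = list(filter(r.search, sentences))
print(evacuationlist[:10])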

And lastly, we will run the sentiment analysis on those sentences that include the word "evacuation."

for sentence in evacuationlist:
    print(sentence)
    scores = sid.polarity_scores(sentence)
    for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')
    print()

Once we hit "Run," we can see the calculated positive, negative, and neutral scores for each sentence that contains the word "evacuation" in our text source. Many of these sentences are scored as neutral, but we do see positive values for sentences where the evacuation was praised, and negative values when discussing the destruction of homes and oil rigs. The scores are highlighted with red boxes, which have been added to the screenshot below.

This last step calculates negative, positive, and neutral scores for each sentence with the keyword in our text file
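If you want a single overall reading rather than sentence-by-sentence scores, one option (my own addition, not part of the original lesson) is to average VADER's compound score, which runs from -1 (most negative) to +1 (most positive), across every matching sentence:

# A sketch: summarize the keyword's overall tone by averaging the compound
# score across all sentences that contain it.
compounds = [sid.polarity_scores(s)['compound'] for s in evacuationlist]
if compounds:
    print('matching sentences:', len(compounds))
    print('mean compound score:', sum(compounds) / len(compounds))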

Because we have split our very large text file into two parts, we have to repeat the above steps for the FMMWildfires-B.txt file.
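Rather than copying every cell, you can also wrap the whole pipeline in a loop over both halves. This sketch reuses the tokenizer, the compiled pattern r, and the analyzer sid from the cells above, and assumes both files sit in the same folder as the notebook:

# A sketch: run the same tokenize-filter-score pipeline over both halves
# of the split file.
for filename in ['FMMWildfires-A.txt', 'FMMWildfires-B.txt']:
    with open(filename) as f:
        text = f.read()
    sentences = tokenizer.tokenize(text)
    matches = list(filter(r.match, sentences))
    print(filename, '->', len(matches), 'sentences contain the keyword')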


What Can We Learn From Sentiment Analysis?

Though sentiment analysis can be a powerful tool for quickly determining the emotions expressed in a text, there are limits to what it can provide. Additionally, as with all text analysis, we need to be cautious in interpreting the results. For example, NLTK tends to interpret sentences that contain profanity as negative; this can be a problem when using texts from social media, where profanity is often used for emphasis.

Similarly, many negative terms or insults that have been reclaimed by minority groups are flagged as negative when using sentiment analysis. In my own work on queer communities on the early Internet, I see examples of slurs that have been reclaimed by the queer community (indeed, even the word "queer" itself in some cases) being interpreted as negative, despite clearly positive ideas being communicated in the sentence. This can also be a problem when using texts from other parts of the world, where even words in English can mean different things depending on who is writing them.

However, these limitations also apply to other types of computer-mediated text analysis, and even to traditional close reading by human scholars. Overall, the strength of sentiment analysis using NLTK lies in its ability to isolate a keyword and provide a quick reading of the positive and negative emotions expressed around that word. When combined with other text analysis methodologies, sentiment analysis can help scholars delve deeply into very large text sources.
