Text Analysis Part One: Beyond the Keyword Search: Using AntConc
Tutorial by Sarah McTavish (PhD Candidate, University of Waterloo)
In my own research using thousands of deleted webpages as a historical source, the Wayback Machine has been an invaluable tool for viewing these sites as they were originally presented. With thousands of personal webpages, photos, journals, and biographies found within web archives, the temptation to just keep reading forever is high. However, there are limits to what one human can read. The alternative to endless link-clicking is large scale text analysis. But how can we make any more sense of a massive text file than a large collection of websites on the Wayback Machine?
Web archives are very big. For example, my own research uses WestHollywood, a GeoCities "neighbourhood" which was intended to house gay, lesbian, and transgender personal webpages, and which contains approximately 25,000 individual user sites. Just the raw text from WestHollywood is 1.7 gigabytes, much more than I could possibly read in my lifetime, much as I sometimes wish that were possible. How can I make sense of this enormous collection of archived web pages? What types of large scale text analysis can be most useful in tackling this kind of source base? This tutorial walks you through an easy-to-use tool, AntConc, which can help you explore the text of archived webpages.
Beyond the Keyword Search: Using AntConc for Text Analysis
AntConc is free corpus analysis software that can be downloaded for Windows, Mac, and Linux. Using AntConc, it is possible to investigate concordances in your text without having any coding knowledge.
After installing AntConc, choose "Open File" or "Open Dir" (to open a directory containing multiple text files) from the File menu to begin analyzing your text files. Files must be plain text (.txt) with UTF-8 encoding. If your file is in a different format, the creator of AntConc also offers file converter and encoding software, which can be downloaded for free from this website.
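If you would rather check or fix the encoding yourself, the step can be sketched in a few lines of Python. This is only an illustration, not part of AntConc: the function names are made up for this example, and the default source encoding (cp1252) is an assumption that you would need to change to match your own file. It reads the whole file into memory, so it suits modest files rather than multi-gigabyte ones.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-save a text file as UTF-8.

    source_encoding is an assumption; set it to whatever
    encoding your file was actually saved in.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

For example, `convert_to_utf8("old.txt", "new.txt")` would write a UTF-8 copy that `is_valid_utf8("new.txt")` then accepts; both file names are placeholders.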
Loading Our Data
In this tutorial, I will be working with the University of Alberta's Fort McMurray Wildfires Collection. These northern Alberta wildfires took place during early May 2016, and resulted in the evacuation of nearly 90 000 residents. The fires spread quickly over several days, leading to dramatic photographs and video footage of residents fleeing their homes as the fires engulfed parts of the city and the surrounding area.
The dramatic nature of the wildfires and evacuation sparked considerable media attention worldwide. This international interest compounded the expected online coverage from local and regional residents and officials, who used the internet to spread necessary information on evacuation efforts and the state of the city during the fires. The Fort McMurray Wildfires Collection contains sites crawled and collected during coverage of these wildfires; for the purposes of this tutorial, we will be using a raw text version of this collection, which is 2.6 gigabytes in size. Using AntConc, we can examine the most frequently-used words in the collection, the context that they appear in, and the words most frequently found in relation to a given keyword.
AntConc handles many smaller files much better than one large text file. When running large files, AntConc has a tendency to freeze and crash; breaking the data into many smaller files and running those through instead eliminates this problem. Doing so requires the command line, which you can find a tutorial about here.
For the Fort McMurray all-text file, I split the 2.6 gigabyte file into 314 files of 2000 lines each. Using the Terminal, I navigated to the directory with the plain text file downloaded into it and entered the command:
split -l 2000 7368-fulltext.txt
This splits the file into multiple files, each 2000 lines long; by default, split names these files xaa, xab, xac, etc. I then moved all of these files (you can use your file explorer) to a new folder, named FMMWildfiresSplit.
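The split command is not available by default on every system (Windows, for instance), so here is a rough Python equivalent of "split -l 2000". It is a sketch under a few assumptions: the function name and the part_0001.txt naming scheme are my own choices for this example (the split command itself produces xaa, xab, and so on), and undecodable bytes are replaced rather than raising an error.

```python
from itertools import islice
from pathlib import Path


def split_by_lines(src, out_dir, lines_per_file=2000):
    """Write src out as numbered files of at most lines_per_file lines.

    Returns the number of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    # Stream the file so a multi-gigabyte input never sits in memory at once.
    with open(src, "r", encoding="utf-8", errors="replace") as f:
        while True:
            chunk = list(islice(f, lines_per_file))
            if not chunk:
                break
            count += 1
            part = out / f"part_{count:04d}.txt"
            part.write_text("".join(chunk), encoding="utf-8")
    return count
```

Calling `split_by_lines("7368-fulltext.txt", "FMMWildfiresSplit")` would fill the folder with 2000-line files that can then be loaded with "Open Dir".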
Using AntConc to Examine Concordances and Collocates
Using "Open Dir" from the "File" menu, we're going to choose the FMMWildfiresSplit directory. This loads all 314 text files into AntConc.
A search under the "Concordance" tab will find every instance of a word or phrase, and display it in context with several words to the right and left of it. In this example, I searched for "oil sands," a phrase that should be highly prevalent in the news coverage surrounding the wildfires, which took place in an area dominated by the oil extraction industry. After searching through my 314 text files, AntConc found 83 888 uses of the phrase "oil sands" in the collection, and displayed them with their contexts.
However, when examining concordances, a researcher still has to read through each instance of the word. Therefore, this method is generally more useful when dealing with a small number of word uses.
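To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in Python. It is not how AntConc is implemented; the function name is hypothetical, and the context window is measured in characters rather than words to keep the example short.

```python
import re


def concordance(text, phrase, width=40):
    """Return (left context, match, right context) for each hit of phrase.

    Matching is case-insensitive and anchored on word boundaries;
    width is the number of characters of context kept on each side.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


sample = "The oil sands near Fort McMurray. Oil Sands workers left."
for left, hit, right in concordance(sample, "oil sands", width=10):
    print(f"{left:>10} [{hit}] {right}")
```

Each row lines up the search phrase in the middle with its surroundings on either side, which is the same reading pattern the Concordance tab supports.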
Another way that AntConc can analyze a text source is through collocates. Collocation measures the association between words and entities in a body of text. AntConc will show you the words found most frequently within a set number of words before and after your search term. In this example, we see the words that most frequently appear within five words before or after the search term "parents". Using the drop-down menu near the bottom of the window, I have set the sorting to "Sort by Frequency," which lists the words from those appearing most frequently near my search term to those appearing least frequently. The most frequent words are stop words, the most common words in the English language, such as "and", "the", "to", and "of". However, if we scroll down, we see more interesting collocates.
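The frequency-based collocate count described above can be sketched as follows. This is an illustration only, with assumptions worth flagging: the tokenizer is a crude regular expression, the function names and the optional stop_words argument are inventions of this example, and only raw frequency is counted.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def collocates(text, term, window=5, stop_words=frozenset()):
    """Count words appearing within `window` tokens of each use of term."""
    tokens = tokenize(text)
    term = term.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        lo = max(0, i - window)
        # Tokens before the hit plus tokens after it, excluding the hit itself.
        neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        counts.update(t for t in neighbours if t not in stop_words)
    return counts
```

Calling `collocates(text, "parents").most_common(20)` lists the twenty most frequent neighbours; passing something like `stop_words={"and", "the", "to", "of"}` drops the stop words that otherwise crowd the top of the list.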
Perhaps even more usefully, AntConc allows you to click on any collocate, and will show all concordances of that word along with your search term. Using this method, it is possible to view all instances of two related words within your text source. If you already know the word that you are looking for in relation to your search term, you can also do this directly through the Concordance tab, using AntConc's advanced search tools. The other benefit is that you can create a context word list, in order to further tailor your search results according to multiple context words. However, this requires you to already know the specific context and search terms that you expect to see; for this, it can be helpful to first run Collocation to view the most frequently-used words.
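Filtering hits by context words can be sketched in the same spirit. The function below is a hypothetical helper, not AntConc's advanced search: it keeps only those uses of a search term that have at least one of the given context words within a set number of tokens on either side.

```python
import re


def hits_with_context(text, term, context_words, window=5):
    """Return the token window around each use of term that has
    at least one context word within `window` tokens of it."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    term = term.lower()
    wanted = {w.lower() for w in context_words}
    hits = []
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        lo = max(0, i - window)
        span = tokens[lo:i + 1 + window]
        # Neighbours are the window minus the search term itself.
        neighbours = span[:i - lo] + span[i - lo + 1:]
        if wanted & set(neighbours):
            hits.append(" ".join(span))
    return hits
```

For instance, `hits_with_context(text, "parents", ["school"])` would return only the passages where "parents" and "school" appear close together, which is how thousands of keyword results shrink to a readable handful.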
Overall, AntConc provides a fairly surface-level analysis of text sources, by looking at the instances of a named search term and the context within which that term is found. However, this type of analysis can be incredibly useful for dealing with situations where keyword searches yield too many results to be meaningfully read, or for determining the context in which a search term is found. By combining search terms, it is possible to whittle thousands of keyword results down to a few dozen relevant matches.