Filtering the Full-Text Derivative File
Tutorial by Ian Milligan (Archives Unleashed Team)
We have some filtering tools already built into the Archives Unleashed Cloud -- for example, the "Text by Domain" button will give you a zip file of the text of the top ten domains within a web archive. Yet you may want to filter the derivative files further, so that they for example:
- Only contain websites from a certain date (i.e. all sites collected on July 16th, 2009)
- Only contain websites from a certain domain (i.e. the plain text of all "liberal.ca" websites), especially if they're not in the top ten domains but you know they are there.
- Only contain websites that have a certain keyword (i.e. every page that contains the word "elephant")
Fortunately, you can easily do this with some simple tools.
Grep is a Unix utility that stands for "globally search a regular expression and print." In short, it lets you take a text file and find lines that match criteria that you specify. Grep is especially useful for working with the files that the Cloud generates because each line in the full-text file corresponds to one website in the collection.
If you are on Linux or MacOS, you have Grep already installed. If you're on Windows, you will need to install a program called "Git Bash." You can install this via their "full installer" here. Follow these instructions to install.
One catch is that Grep only runs on the command line. The command line is powerful, but not always straightforward. Fortunately, there is a good walkthrough at the Programming Historian: Introduction to Bash. For the purposes of this tutorial, you will want to be able to "find" files in the command line.
A Command Line Example: Finding the Derivatives
In addition to the resources above, an example might give you a sense of how relatively straightforward it can be to locate files. Imagine that in the Cloud interface, you download the "full text" file. On MacOS, this defaults to your "downloads" directory.
To open the command line on MacOS, we go to our "Applications" folder, "Utilities," and then select "Terminal." You will see this window (your colours might be different):
By default, you start in your home user directory. You can now change the directory to downloads by typing:
If you type:
You will then see a long list of all of the files in this directory. Amongst them should be the file you downloaded from the Cloud! You could begin working on it here, or you could move it somewhere else on your file system. For more information on this, see the Bash tutorial above.
In the following example, I will use the files from the "Fort McMurray" collection. You can follow along with your own files, simply changing the collection number to your own (i.e. instead of 7368-fulltext.txt it might be 1234-fulltext.txt).
Let's imagine that we have a collection, but we just want part of it: in this case, the sites that match a certain date. This won't necessarily be the date that sites were created but will instead be the date that sites were crawled.
To do so, let's take a look at the basic structure of the full text file. We can do so by either opening up the file in a text editor, or in the command line, we can type a command that shows us the first x lines of a file.
In the directory where the file was downloaded (i.e. your "downloads" directory above or somewhere else), type the following command. It will just show the first two lines of a file, or in our case, the first two websites.
head -n 2 7368-fulltext.txt
Here we see the results:
The first two results aren't helpful to a researcher – the first document is a 301 error which was crawled but was redirecting browsers elsewhere, and the second is an error message from another 302. But what they do show is the basic structure of the file.
It is arranged like so:
So for date, if we wanted only sites that were crawled on 22 May 2016, we would need to convert them into a standardized format. In our case, dates are arranged like so YYYYMMDD, or Year-Month-Day. So 22 May 2016 is
To find these sites, let's construct a grep command. I will provide it below and then explain what it does.
grep '^(20160522' 7368-fulltext.txt > 20160522-text.txt
The basic structure of this command is:
grep PATTERN FILE-TO-ANALYZE > FILEOUTPUT.TXT
For the example pattern above, the quotation marks enclose the pattern we are looking for.
^ represents the "start of the line," and then since there is an opening parenthesis we look for that too. We then provide an output path for the results. In this case, there will now be a new file – 20160522-text.txt – that just contains the crawl text from 22 May 2016.
Finding Specific Domains
If the domain is in the top ten, you can use our handy "text by domain" derivative to work with it out of the box. If you want to find a different domain, however, you may have to do a bit of work yourself.
The logic is similar to finding specific domains. We'll also use grep. In the example above, we saw two different domains. Let's use "www.rmwb.ca" as our example and get only websites from that domain.
We turn to grep again:
grep ',www.rmwb.ca,' 7368-fulltext.txt > RMWB-text.txt
We now have a text file that just contains websites from the "www.rmwb.ca" domain.
And of course, things are relatively similar with keywords. This can be the easiest one.
grep 'helicopter' 7368-fulltext.txt > helicopter-text.txt
We now just have the text of all websites that contain the word "helicopter."
These relatively straightforward commands need a bit of practice, but before you know it, you'll be filtering your web archive files like an expert. You can then move on to subsequent analysis steps, or if it's small enough, maybe even just start reading.