Making Sense of the Domains Count
Tutorial by Ian Milligan (Archives Unleashed Team)
We don't always know exactly what we've collected in our web archives; a researcher may simply want to know what is in a given body of WARC files. The web is a complex beast: a small web archive that aims to collect just one page might end up capturing embedded widgets and content from other websites just to make something play; alternatively, a crawl might have been corrupted and begun to collect material that's far out of scope.
This tutorial explores what you can learn from the derivative file marked as "Domains" in your collections page, as well as the list of domains that are displayed right on the page itself.
What do we mean by domain frequency?
The "domains" derivative provides a frequency count of the domains that have been captured by a web archive. Imagine a web archive that has five pages captured: three from liberal.ca and two from greenparty.ca.
In the domain frequency table, we count how often each domain appears, so the final result would be:
- liberal.ca, 3
- greenparty.ca, 2
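To make the counting concrete, here is a minimal sketch in Python of how such a frequency table could be produced from a list of captured URLs. The URLs are invented for illustration; they simply match the counts above.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical captured pages, matching the counts in the example above.
captured_urls = [
    "http://liberal.ca/",
    "http://liberal.ca/platform",
    "http://liberal.ca/contact",
    "http://greenparty.ca/",
    "http://greenparty.ca/news",
]

# Extract the domain (netloc) from each URL and tally how often it occurs.
domain_counts = Counter(urlparse(url).netloc for url in captured_urls)

# Print domains from most to least frequent.
for domain, count in domain_counts.most_common():
    print(f"{domain}, {count}")
```

Running this prints `liberal.ca, 3` and then `greenparty.ca, 2`, which is exactly the frequency table described above.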
What can you learn from this sort of information?
Imagine a few things that you might learn just by seeing the frequency of domains that have been collected:
- The crawl contains certain kinds of sites. At a glance you can begin to explore questions such as whether a web archive is primarily composed of social media (say, lots of occurrences of twitter.com and facebook.com), of media sites (NewYorkTimes.com and GlobeandMail.com making up a large share), or of something else. Given the unevenness of some metadata, sometimes just seeing what has been collected is enough to get a sense of a collection.
- Something has gone wrong with the crawl! Sometimes a web crawl goes awry: sites contain links to non-relevant material, from pornography to spam domains. This will understandably skew any other analysis, so it is better to learn this sooner rather than later.
- The web crawl has remained small and focused. Imagine a web archive that only has a few sites. They've been collected a few times. This means that the crawl has maintained focus on a small number of websites, and if you decide to subsequently work with the plain text or network graphs, you can interpret the results accordingly.
There are many other things you can learn from this information, of course; these are just a few examples.
The domain chart
On each of our collection landing pages, we provide a frequency table to identify the top ten domains represented within each collection. By default, these are presented in declining order: the most frequently-collected domain at the top and the tenth-most frequently-collected domain at the bottom.
For a lot of users, this is probably going to be enough! You get a sense of the major sites and nodes of the collection. For example, consider the output for the "Nova Scotia Municipal Governments" collection.
Here we learn that the sites being collected are municipal government sites themselves, as opposed to media coverage of municipal governments or university research projects studying them.
However, we only see the top ten. Sometimes we need more granularity and want to see beyond these.
The domain derivative download
By clicking on the "Domains" button, users are provided with a text file that will be named something like 7485-fullurls.txt. You can open this file in your text editor of choice. By default on macOS, it will open in TextEdit and look like this:
Here we see our suspicions confirmed! The sites are indeed all Nova Scotian municipal governments, identified by the .ns.ca in the URL. We are also already beginning to learn the relative sizes of the sites within the collection.
This will help support future analysis: if a collection contains a given website thousands of times, you need to keep that in mind when doing text analysis. If a site appears only a fraction as often as the others, even important material might be drowned out by the rest of the information within the archive.
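If you want to go beyond skimming the file in a text editor, you can load the domain counts into a script. The sketch below is a minimal, hedged example: it assumes each line of the downloaded file holds a domain and a count separated by a comma (tolerating parenthesized tuples such as `(example.ns.ca,512)`); check your own download's exact format before relying on it, as it may differ.

```python
from collections import Counter

def read_domain_counts(path):
    """Parse a domains derivative file into a Counter of domain -> count.

    Assumes lines like "example.ns.ca,512" or "(example.ns.ca,512)";
    blank or malformed lines are skipped.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip().strip("()")  # tolerate "(domain,count)" tuples
            if not line:
                continue
            domain, _, count = line.rpartition(",")
            if domain and count.isdigit():
                counts[domain] = int(count)
    return counts

# Example usage (filename from the tutorial above):
# counts = read_domain_counts("7485-fullurls.txt")
# for domain, n in counts.most_common(25):  # look past the top ten
#     print(domain, n)
```

Once the counts are in a `Counter`, it is easy to look past the top ten, compute what share of the archive a single domain represents, or spot domains that signal a crawl gone out of scope.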