Making Sense of the Domains Count

Tutorial by Ian Milligan (Archives Unleashed Team)


Introduction

We don't always know exactly what we've collected in our web archives, and a researcher may simply want to know what is in a given body of WARC files. The web is a complex beast: a small web archive that aims to collect just one page might end up capturing embedded widgets and content from other websites just to make embedded media play, while a crawl might have gone astray and begun collecting material that is far out of scope.

This tutorial explores what you can learn from the derivative file marked "Domains" on your collection page, as well as from the list of domains displayed right on the page itself.

The download icon for domain counts as well as the top ten on the collection page

What do we mean by domain frequency?

The "domains" derivative provides a frequency count of the domains that have been captured by a web archive. Imagine a web archive that has five pages captured:

  • liberal.ca
  • liberal.ca/about
  • liberal.ca/donate
  • greenparty.ca
  • greenparty.ca/about

In the domain frequency table, we count how often each domain appears, so the final result would be:

  • liberal.ca, 3
  • greenparty.ca, 2
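
To make the counting step concrete, here is a minimal sketch of this logic in Python, using the five hypothetical pages listed above. It is purely illustrative: the derivative itself is generated for you, and this is not necessarily how the Archives Unleashed Toolkit computes it.

    # Count how often each domain appears in a list of captured URLs.
    from collections import Counter
    from urllib.parse import urlparse

    # The five hypothetical pages from the example above.
    pages = [
        "http://liberal.ca",
        "http://liberal.ca/about",
        "http://liberal.ca/donate",
        "http://greenparty.ca",
        "http://greenparty.ca/about",
    ]

    # Pull the domain (network location) out of each URL and tally it.
    domain_counts = Counter(urlparse(page).netloc for page in pages)

    # Print the domains from most to least frequent.
    for domain, count in domain_counts.most_common():
        print(f"{domain}, {count}")
    # liberal.ca, 3
    # greenparty.ca, 2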

What can you learn from this sort of information?

Consider a few of the things you might learn just from seeing the frequency of the domains that have been collected:

  • The crawl contains certain kinds of sites. At a glance, you can begin to explore questions such as: is the web archive primarily composed of social media (say, lots of occurrences of twitter.com and facebook.com)? Is it a media archive (NewYorkTimes.com and GlobeandMail.com make up a large share of it)? Or is it something else entirely? Given the unevenness of some metadata, sometimes just seeing what has been collected is what you need to get a sense of a collection.
  • Something has gone wrong with the crawl! Sometimes a web crawl goes astray: sites contain links to non-relevant material, from pornography to spam domains. This will understandably skew any other analysis. Better to learn this sooner rather than later.
  • The web crawl has remained small and focused. Imagine a web archive that contains only a few sites, each collected a few times. This means that the crawl has maintained its focus on a small number of websites, and if you subsequently decide to work with the plain text or network graphs, you can interpret the results accordingly.

There are many other things you can learn from this information, of course; these are just a few.

The domain chart

On each of our collection landing pages, we provide a frequency table identifying the top ten domains represented within the collection. By default, these are presented in descending order: the most frequently collected domain at the top and the tenth-most frequently collected domain at the bottom.

For a lot of users, this is probably going to be enough! You get a sense of the major sites and nodes of a collection. For example, consider the output for the "Nova Scotia Municipal Governments" collection.

Top ten domains as presented on collection page

Here we learn that the sites being collected are the municipal government sites themselves, as opposed to media outlets covering municipal governments or university research projects about governments.

However, we only see the top ten. Sometimes we might need to be more granular and see beyond these.

The domain derivative download

Clicking on the "Domains" button provides users with a text file named something like 7485-fullurls.txt. You can open this file in your text editor of choice. By default on macOS, it will open in TextEdit and look like this:

Screenshot of the domain derivative file in macOS TextEdit

Here we see our suspicions confirmed! The sites are indeed all Nova Scotian municipal governments, identified by the .ns.ca in their URLs. We are also already beginning to learn the relative sizes of the sites within the collection.

This will help support future analysis: if a collection contains a given website thousands of times, you need to keep that in mind when doing text analysis. If a site appears only a fraction as often as some of the others, it might be drowned out by all of the other information within the archive, even if it is important.
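
If you would rather explore the file programmatically than in a text editor, a short Python sketch along the following lines can load the counts and report each domain's share of the collection. The file name and the line layout (a domain and a count separated by a comma, possibly wrapped in parentheses) are assumptions here; check your own download and adjust the parsing to match.

    # Read a downloaded domains derivative and report relative sizes.
    from collections import Counter

    domain_counts = Counter()
    with open("7485-fullurls.txt") as f:
        for line in f:
            # Strip whitespace and any surrounding parentheses, e.g. "(liberal.ca,3)".
            line = line.strip().strip("()")
            if not line:
                continue
            domain, count = line.rsplit(",", 1)
            domain_counts[domain.strip()] = int(count.strip())

    total = sum(domain_counts.values())

    # Print every domain (not just the top ten) with its share of the collection.
    for domain, count in domain_counts.most_common():
        print(f"{domain}: {count} captures ({count / total:.1%} of the collection)")

Keeping these proportions in view makes it easier to interpret any text or network analysis you run on the collection later.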
