Documentation

Introduction

The Archives Unleashed Cloud, referred to as AUK, is an open-source cloud-based analysis tool that helps researchers and scholars conduct web archive analysis.

AUK is a component of the Archives Unleashed Project, which empowers researchers, scholars, librarians and archivists to access and explore their web archival collections. As accessibility is a main priority of the project, AUK supports this goal by providing a web-based front end for users to access the Archives Unleashed Toolkit.

For more information on the AUK platform, as well as the basics of web archiving and the WARC file format, check out our about page.

Audience and Application

AUK is designed to provide a web-based user interface to help users access and analyze their web archive collections. This cloud-based option uses the WASAPI Data Transfer API, which means that individuals already set up with an Archive-It account will be able to ingest and explore WARC files. We are also exploring future interoperability with other web archiving platforms.

With AUK you can:

  • Ingest Archive-It collections via WASAPI;
  • Download collection derivatives: network files, domain lists, and full text (all and by domain); and
  • Visualize your collections with an in-browser network diagram to see major nodes and connections within the collection.

Note: You can use our canonical version at https://cloud.archivesunleashed.org, or you are welcome to run your own local version by visiting our GitHub repository.

Getting Started

This guide walks through how to set up an AUK account, sync your Archive-It collections, and work with the types of analysis outputs AUK produces.

Login

Begin at https://cloud.archivesunleashed.org.

You will not need to sign up for an AUK account. Instead, AUK will use either a Twitter or GitHub account to authenticate your login. For those who have multiple Twitter or GitHub accounts, feel free to use whichever suits your research purposes best. AUK only uses these accounts to authenticate who you are; we do not have any additional access to your Twitter or GitHub account!

The following two screenshots show you how to log in. Click on the logo of the service you want to use to authenticate (the bird for Twitter or the Octocat for GitHub). After you authenticate, you should be brought to the main page.

AUK Sign In Screenshot AUK Log In Screenshot

You are now logged into AUK. Welcome!

Account Set Up

The left-hand panel displays two sets of credentials: 1) AU Cloud account and 2) Archive-It account. You will also notice that the main collections space at right is empty. That's because there are no collections to show until you connect AUK to your Archive-It account. Let's walk through setting up your account.

Click Enter Credentials in the left-hand panel (highlighted in the screenshot below).

AUK Account Credentials

On the next screen, fill out the form to update both sets of credentials.

  • AU Account Information: AUK requires an email so we can let you know what stage of analysis your collection is in. You can use the email address of your choice, and have the ability to modify this at any time. If you have a Gravatar account, it will also use this email address to populate your user avatar.
  • Consent Checkbox: To comply with Canada's Anti-Spam Legislation (CASL), we've included this checkbox to make sure you understand that the Archives Unleashed team will contact you regarding your collections and feedback.
  • Archive-It Information: In order to work with web archive collections, you must enter your Archive-It account username and password. Your Archive-It credentials are encrypted and salted.
AUK Account Credentials Log In Screenshot AUK Account Credentials Log In Screenshot

Click Update

You will be redirected back into the main space. A note at the top of the page will indicate that AUK is syncing with your Archive-It account. The length of time to ingest your collections will depend on the size of your collections and the number of other users syncing their collections. This process could take a few minutes to a few hours. You will receive an email, to the address you provided for your AU Cloud account, when the collections are synced.

AUK Account Credentials Log In Screenshot

Note: You may update/change your AU Cloud or Archive-It credentials at any time by clicking Update. If you add a new collection in Archive-It or change your Archive-It credentials, you can click Update and redo the above steps to re-sync.

Collections

Once your collections have been synced, you can return to the main collections screen and start analysis. You can sort collections by clicking on the arrows beside each header. Let's take a look at the collection information available to us in the main space:

  • Title - The name of each collection is pulled directly from your Archive-It account; each title provides the access point for analysis.
  • Analyzed - This header identifies whether a collection has been analyzed in AUK. If yes, you will see at least one of the following once you enter the collection's page: a network diagram, downloadable derivative files, and/or a domains frequency table.
  • Public - This field notes whether or not the collection is public. If it is, the collection will be listed on an institution's Archive-It page.
  • Files - Indicates the number of ARC or WARC files in each collection.
  • Size - Indicates the collection's total file size. Note that this size will differ from the collection size in your Archive-It dashboard: the Archive-It WASAPI endpoint provides the size of each compressed WARC, whereas the Archive-It dashboard provides the uncompressed size of a collection.
Collection Main Space
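
To make the Files and Size columns more concrete, here is a minimal Python sketch of how a file count and a total compressed size could be derived from a WASAPI-style file listing. It is not AUK's own code. The sample response, its filenames, and the helper names summarize_listing and human_size are hypothetical; the sketch only assumes that each entry in the listing's "files" array carries a "size" in bytes.

```python
import json

# Hypothetical, abbreviated WASAPI-style listing. A real response carries
# more fields per file (for example checksums and download locations).
SAMPLE_RESPONSE = """
{
  "count": 3,
  "files": [
    {"filename": "ARCHIVEIT-0000-EXAMPLE-00000.warc.gz", "size": 1000000000},
    {"filename": "ARCHIVEIT-0000-EXAMPLE-00001.warc.gz", "size": 750000000},
    {"filename": "ARCHIVEIT-0000-EXAMPLE-00002.arc.gz", "size": 250000000}
  ]
}
"""


def summarize_listing(response_text):
    """Return (number of files, total compressed size in bytes)."""
    listing = json.loads(response_text)
    files = listing.get("files", [])
    total_bytes = sum(entry["size"] for entry in files)
    return len(files), total_bytes


def human_size(num_bytes):
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1000 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1000


file_count, total = summarize_listing(SAMPLE_RESPONSE)
print(file_count, human_size(total))
```

Because the sizes in such a listing describe compressed files, a total computed this way would be smaller than the uncompressed figure shown in the Archive-It dashboard.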

Analysis

Click on the name of the collection you would like to analyze. This will bring you into the collection page.

At the top of the screen, click Analyze Collection. You do not need to keep AUK or your browser open for it to analyze the collection. AUK will notify you when the process is complete and ready for you to start using the output files.

Note: We operate on a queuing system, so the time it takes to analyze the collection will depend on how large the collection is and how many other users are processing collections. This could take anywhere from half an hour to a few weeks. You need only click the button a single time.

Analyze the Collection

Output Files

You will receive an email once the analysis is complete and your output files are ready. When you navigate to the collection space by clicking on the collection title (in the screenshot below we've selected the Nova Scotia Municipal Governments collection), you will find:

  • An interactive hyperlink diagram: you can use this to explore the basic hyperlink structure of your collection.
  • Five output files that can be downloaded for further research and analysis. These include:
    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout. You can find more information on using this file in the "Network Graphing Archived Websites With Gephi" tutorial.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Domains file, which will be a CSV file containing the frequency of domains found within your web archive.
    • Full Text file, which will be a large text file containing the extracted plain text of all the HTML files found within a collection. You might want to try loading it into a visualization platform like Voyant Tools to see patterns and trends within your web archive.
    • Full Text by Domain, which will be a large ZIP file containing ten text files corresponding to the plain text of the top ten domains. For example, if the domain "liberal.ca" is the most popular domain in your web archive, there will be a text file called "liberal-ca.txt". As above, you might want to use Voyant Tools to see patterns and trends.
  • Finally, below the download options, you can see a frequency table of the top 10 domains found within the collection.
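
If you would like to work with the Domains file outside of AUK, the following Python sketch shows one way to rank domains by frequency and to guess the matching file name inside the Full Text by Domain ZIP. The two-column layout (domain, count) and the sample data are assumptions, so check them against your own download. The naming rule is generalized from the single "liberal.ca" to "liberal-ca.txt" example above and may not hold for every domain.

```python
import csv
import io

# Hypothetical sample data; the real file's column layout may differ.
SAMPLE_DOMAINS_CSV = """liberal.ca,1200
conservative.ca,950
ndp.ca,400
greenparty.ca,75
"""


def top_domains(csv_text, limit=10):
    """Return up to `limit` (domain, count) pairs, most frequent first."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        domain, count = row[0].strip(), row[1].strip()
        if not count.isdigit():
            continue  # skip a header row, if one is present
        rows.append((domain, int(count)))
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:limit]


def text_filename(domain):
    """Guess the per-domain text file name (assumed rule: dots become hyphens)."""
    return domain.replace(".", "-") + ".txt"


for domain, count in top_domains(SAMPLE_DOMAINS_CSV):
    print(domain, count, text_filename(domain))
```

To use it with a real download, read the file's contents into a string and pass that string to top_domains in place of the sample.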

Are you using Safari and having trouble with the network downloads? If so, note that Safari by default will add the .xml extension to your .graphml and .gexf downloads. To use the file, remove the .xml extension using Finder (rename the file to remove the trailing .xml).
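
If you have several affected downloads, renaming them by hand gets tedious. The Python sketch below is one possible way to strip the trailing .xml from every .graphml.xml and .gexf.xml file in a folder. The function name strip_safari_xml and the Downloads path in the usage note are illustrative only.

```python
from pathlib import Path

# File name endings that Safari may produce for the network downloads.
NETWORK_SUFFIXES = (".graphml.xml", ".gexf.xml")


def strip_safari_xml(directory):
    """Remove a trailing .xml from network downloads; return the new paths."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.endswith(NETWORK_SUFFIXES):
            target = path.with_name(path.name[: -len(".xml")])
            if target.exists():
                continue  # never overwrite an existing file
            path.rename(target)
            renamed.append(target)
    return renamed
```

For example, strip_safari_xml(Path.home() / "Downloads") would rename the matching files in your Downloads folder and leave other .xml files untouched.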

Output Files

Interactive Hyperlink Diagram

We use GraphPass to help create the interactive hyperlink diagram on the collection page, which is powered by Sigma.js. Sigma.js is a JavaScript library that assists in drawing and displaying graphs. GraphPass adds visualization-related data to the network files, such as color, position, and size, based on common social network algorithms. In the hyperlink diagram, each node (dot) represents a domain (that is, all of the URLs within a domain such as "yorku.ca" or "newyorktimes.com") and each edge (line) represents a link from one node to another.
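
The node-and-edge model described above can also be explored outside the browser. As a rough sketch, the Python code below reads a small GraphML document with the standard library and counts outgoing and incoming links per domain. The sample graph is hypothetical and far simpler than a real AUK network file: it uses domain names directly as node ids and omits the color, position, and size data that GraphPass adds.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML namespace, needed to find node and edge elements.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical, minimal graph; real network files carry more attributes.
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="townyarmouth.ca"/>
    <node id="atlantic.ctvnews.ca"/>
    <node id="yorku.ca"/>
    <edge source="townyarmouth.ca" target="atlantic.ctvnews.ca"/>
    <edge source="yorku.ca" target="atlantic.ctvnews.ca"/>
  </graph>
</graphml>"""


def link_counts(graphml_text):
    """Return (node ids, outgoing link counts, incoming link counts)."""
    root = ET.fromstring(graphml_text)
    nodes = [node.get("id") for node in root.findall(".//g:node", GRAPHML_NS)]
    out_links = Counter()
    in_links = Counter()
    for edge in root.findall(".//g:edge", GRAPHML_NS):
        out_links[edge.get("source")] += 1  # the domain that contains the link
        in_links[edge.get("target")] += 1   # the domain being linked to
    return nodes, out_links, in_links


nodes, out_links, in_links = link_counts(SAMPLE_GRAPHML)
for domain in nodes:
    print(domain, "links out:", out_links[domain], "links in:", in_links[domain])
```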

Users can explore and interact with the hyperlink diagram using the helper buttons in the top left corner of the network window.

  • Full Screen - opens up the network to full screen for easier interaction
  • Zoom in/out - zooms the camera into the network and away from it
  • Refresh - brings the network diagram back to its original state
  • Scale up/down - allows users to see more or fewer node labels in the network
AUK Interactive Hyperlink Diagram

A few cautionary notes on "scale up" and "scale down" are in order. In website networks, some sites have so many more links than the others that they obscure everything else. AUK's scale-up feature uses a logarithmic transformation to make the graph a bit easier to read. For example, six nodes with size values 1, 10, 100, 1,000, 10,000 and 1,000,000,000 can be transformed using a base 10 logarithm to produce new sizes 0, 1, 2, 3, 4 and 9, bringing the node sizes much closer together.
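
The arithmetic in this example can be reproduced with a few lines of Python. This is only an illustration of a base 10 logarithmic transformation, not AUK's actual scaling code, and the function name log10_scale is our own.

```python
import math


def log10_scale(sizes):
    """Apply a base 10 logarithm to each node size (sizes must be positive)."""
    if any(size <= 0 for size in sizes):
        raise ValueError("a logarithm is only defined for positive sizes")
    return [math.log10(size) for size in sizes]


node_sizes = [1, 10, 100, 1_000, 10_000, 1_000_000_000]
scaled = log10_scale(node_sizes)

# Round for display, since floating point results may not be exact integers.
print([round(value) for value in scaled])  # [0, 1, 2, 3, 4, 9]
```

Before the transformation, the largest node is a billion times the size of the smallest. Afterwards, the sizes span only 0 to 9, so smaller nodes remain visible next to the largest one.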

You can further interact with the hyperlink diagram by hovering over any node to highlight its immediate connections to other nodes. The arrow on each line indicates the direction of the connection. For instance, in the image below, an arrow connects two nodes and reveals that townyarmouth.ca has a link to the atlantic.ctvnews.ca domain.

AUK Interactive Hyperlink Diagram Explained AUK Interactive Hyperlink Diagram hover feature