The Archives Unleashed Cloud is an open-source cloud-based analysis tool that helps researchers and scholars conduct web archive analysis.
The Cloud is a component of the Archives Unleashed Project, which empowers researchers, scholars, librarians and archivists to access and explore their web archival collections. Accessibility is a main priority of the project, and the Archives Unleashed Cloud supports this goal by providing a web-based front end to the most recent version of the Archives Unleashed Toolkit.
For more information on the Cloud platform, as well as the basics of web archiving and the WARC file format, check out our about page.
If you have questions about the documentation or are running into issues while using the tool, please check out our frequently asked questions page. You can also join our Slack workspace and post in the #auk-support channel. We would love to connect with you there.
Audience and Application
The Archives Unleashed Cloud is designed to provide a web-based user interface to help users access and analyze their web archive collections. This cloud-based option uses the WASAPI Data Transfer API, which means that individuals already set up with an Archive-It account will be able to ingest and explore WARC files. We are exploring interoperability with other web archiving platforms for future releases.
With the Cloud you can:
- Ingest Archive-It collections via WASAPI;
- Download collection derivatives: network files, domain lists, and full text (all and by domain); and
- Visualize your collections with an in-browser network diagram to see major nodes and connections within the collection.
This guide will walk through how to set up a Cloud account and sync your Archive-It collections, and the types of analysis outputs you can curate.
Begin at https://cloud.archivesunleashed.org.
You will not need to sign up for a Cloud account. Instead, the Cloud will use either a Twitter or GitHub account to authenticate your login. If you have multiple Twitter or GitHub accounts, feel free to use whichever suits your research purposes best. The Cloud only uses these accounts to authenticate who you are; we do not have any additional access to your Twitter or GitHub account!
The following two screenshots show you how to log in. Click on the logo of the service you want to use to authenticate (the bird for Twitter or the Octocat for GitHub). After you authenticate, you should be brought to the main page.
You are now logged into the Cloud. Welcome!
Account Set Up
The left-hand panel displays two sets of credentials: 1) AU Cloud account and 2) Archive-It account. You will also notice that the main collections space at right is empty. That's because until you connect the Cloud to your Archive-It account, there are no collections to show. Let's walk through setting up your account.
Click Enter Credentials in the left-hand panel (highlighted in the screenshot below).
On the next screen, fill out the form to update both sets of credentials.
- AU Account Information: The Cloud requires an email address so we can let you know what stage of analysis your collection is in. You can use the email address of your choice and modify it at any time. If you have a Gravatar account, the Cloud will also use this email address to populate your user avatar.
- Consent Checkbox: To comply with Canadian Anti-Spam legislation, we've included this checkbox to make sure you understand that the Archives Unleashed team will contact you regarding your collections and feedback.
- Archive-It Information: In order to work with web archive collections, you must enter your Archive-It account username and password. Your Archive-It credentials are encrypted and salted.
You will be redirected back into the main space. A note at the top of the page will indicate that the Cloud is syncing with your Archive-It account. The length of time to ingest your collections will depend on the size of your collections and the number of other users syncing their collections. This process could take a few minutes to a few hours. You will receive an email, to the address you provided for your AU Cloud account, when the collections are synced.
Note: You may update/change your AU Cloud or Archive-It credentials at any time by clicking Update. If you add a new collection in Archive-It or change your Archive-It credentials, you can click Update and redo the above steps to re-sync.
Once your collections have been synced, you can return to the main collections screen and start analysis. You can sort collections by clicking on the arrows beside each header. Let's take a look at the collection information available to us in the main space:
- Title - The name of each collection is pulled directly from your Archive-It account; each title provides the access point for analysis.
- Analyzed - This header identifies whether a collection has been analyzed in the Cloud. If yes, you will see at least one of the following once you enter the collection's page: a network diagram, downloadable derivative files, and/or a domains frequency table.
- Public - This field notes whether or not the collection is public. If it is, the collection will be listed on the institution's Archive-It page.
- Files - Indicates the number of ARC or WARC files in each collection.
- Size - Indicates the collection's total file size. Note that this size will differ from the collection size in your Archive-It dashboard: the Archive-It WASAPI endpoint provides the size of each compressed WARC, whereas the Archive-It dashboard provides the uncompressed size of a collection.
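To illustrate how the Files and Size columns relate to the per-file sizes reported over WASAPI, here is a minimal sketch. It assumes a WASAPI-style JSON page that carries a `files` list whose entries have `filename` and `size` (compressed bytes) fields; the sample data and the helper names `summarize_files` and `human_size` are hypothetical and are not taken from the Cloud's own code.

```python
def summarize_files(wasapi_page):
    """Return (file_count, total_compressed_bytes) for one WASAPI-style page.

    Assumes the page is a dict with a "files" list and that each entry
    carries a "size" field in bytes (an assumption, not the Cloud's code).
    """
    files = wasapi_page.get("files", [])
    total_bytes = sum(int(entry.get("size", 0)) for entry in files)
    return len(files), total_bytes


def human_size(num_bytes):
    """Format a byte count with binary units (KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Hand-written sample data, for illustration only.
sample_page = {
    "files": [
        {"filename": "EXAMPLE-0001.warc.gz", "size": 1_073_741_824},
        {"filename": "EXAMPLE-0002.warc.gz", "size": 536_870_912},
    ]
}

count, total = summarize_files(sample_page)
print(f"Files: {count}, Size: {human_size(total)}")
```

Because the sizes summed here are compressed sizes, the total would be expected to come out smaller than the uncompressed figure shown in the Archive-It dashboard.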
Click on the name of the collection you would like to analyze. This will bring you into the collection page.
At the top of the screen, click Analyze Collection. You do not need to keep the Cloud or your browser open for it to analyze the collection. We will notify you when the process is complete and ready for you to start using the output files.
Note: We operate on a queuing system, so the time it takes to analyze the collection will depend on how large the collection is and how many other users are processing collections. This could take anywhere from half an hour to a few weeks. You need only click the button a single time.
You will receive an email once your collections are ready for analysis. When you navigate to the collection space, by clicking on the collection title (in the screenshot below we've selected the Nova Scotia Municipal Governments collection), you will find:
- An interactive hyperlink diagram: you can use this to explore the basic hyperlink structure of your collection.
- Five output files that can be downloaded for further research and analysis. These include:
  - Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout. You can find more information on using this file in the "Network Graphing Archived Websites With Gephi" tutorial. Note that this graph file contains both the archived websites and the domains that they link to.
  - Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself. As above, note that this graph file contains both the archived websites and the domains that they link to.
  - Domains file, which will be a CSV file containing the frequency of domains found within your web archive.
  - Full Text file, which will be a large text file containing the extracted plain text of all the HTML files found within a collection. You might want to try loading it into a visualization platform like Voyant Tools to see patterns and trends within your web archive.
  - Full Text by Domain, which will be a large ZIP file containing ten text files corresponding to the plain text of the top ten domains. For example, if the domain "liberal.ca" is the most popular domain in your web archive, there will be a text file called "liberal-ca.txt". As above, you might want to use Voyant Tools to see patterns and trends.
- Finally, below the download options, you can see a frequency table of the top 10 domains found within the collection.
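If you would like to reproduce the top-10 table from the Domains file yourself, the following is a minimal sketch. It assumes each CSV row has the form `domain,count`; the real file's exact layout may differ, so rows whose second column is not a number (such as a header) are skipped. The helper `domain_to_filename` simply generalizes the single "liberal.ca" to "liberal-ca.txt" example above and is likewise an assumption.

```python
import csv
import io


def top_domains(csv_text, n=10):
    """Return the n most frequent (domain, count) pairs from CSV text.

    Assumes rows look like "domain,count"; rows without a numeric
    second column (for example a header line) are ignored.
    """
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        domain, count = row[0].strip(), row[1].strip()
        if not count.isdigit():
            continue
        rows.append((domain, int(count)))
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


def domain_to_filename(domain):
    """Guess the Full Text by Domain file name for a domain (assumed rule)."""
    return domain.replace(".", "-") + ".txt"


# Hand-written sample data, for illustration only.
sample_csv = "liberal.ca,120\nndp.ca,45\nconservative.ca,80\n"

for domain, count in top_domains(sample_csv):
    print(f"{domain}\t{count}\t{domain_to_filename(domain)}")
```

To run it on a real download, read the file's contents into a string first, for example with `open(path, encoding="utf-8").read()`.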
Are you using Safari and having trouble with the network downloads? If so, note that Safari by default will add the .xml extension to your .gexf downloads. To use the file, rename it in Finder to remove the trailing .xml extension.
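If you have several downloads to fix, the rename can also be scripted. This is a small sketch using only the Python standard library; the function names and the folder you pass in are your own choice and are not part of the Cloud.

```python
from pathlib import Path


def strip_xml_suffix(path):
    """Return the path without a trailing ".xml" if it ends in ".gexf.xml"."""
    path = Path(path)
    if path.name.lower().endswith(".gexf.xml"):
        return path.with_suffix("")  # drops only the final ".xml"
    return path


def fix_downloads(folder):
    """Rename every "*.gexf.xml" file in folder to "*.gexf".

    Files whose target name already exists are left untouched.
    Returns the list of new paths.
    """
    renamed = []
    for path in Path(folder).glob("*.gexf.xml"):
        target = strip_xml_suffix(path)
        if not target.exists():
            path.rename(target)
            renamed.append(target)
    return renamed


print(strip_xml_suffix("my-collection.gexf.xml"))
```

Calling `fix_downloads(Path.home() / "Downloads")` would apply the rename to your Downloads folder, assuming that is where Safari saved the files.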
Interactive Hyperlink Diagram
This diagram visualizes the domains that are captured as well as any domains that they may link to. These linked-to domains may or may not be contained within the web archived collection.
Users can explore and interact with the hyperlink diagram using the helper buttons in the top left corner of the network window.
- Full Screen - opens up the network to full screen for easier interaction
- Zoom in/out - zooms the camera into the network and away from it
- Refresh - brings the network diagram back to its original state
- Scale up/down - allows users to see more or fewer node labels in the network
A few cautionary notes on "scale up" and "scale down" are in order. In website networks, some sites have so many more links than the others that they obscure everything else. The Cloud's scale-up feature uses a logarithmic transformation to make the graph a bit easier to read. For example, six nodes with size values 1, 10, 100, 1,000, 10,000 and 1,000,000,000 can be transformed using a base 10 logarithm to produce new sizes 0, 1, 2, 3, 4 and 9, bringing the node sizes much closer together.
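The arithmetic in this example can be checked with a few lines of Python. This is only an illustration of a base 10 logarithmic transformation, not the Cloud's actual scaling code; the function name `log_scale` is hypothetical.

```python
import math


def log_scale(sizes):
    """Apply a base 10 logarithm to each node size.

    Sizes must be positive, because the logarithm of zero or a
    negative number is undefined.
    """
    if any(size <= 0 for size in sizes):
        raise ValueError("node sizes must be positive")
    return [math.log10(size) for size in sizes]


sizes = [1, 10, 100, 1_000, 10_000, 1_000_000_000]
scaled = [round(value, 6) for value in log_scale(sizes)]
print(scaled)  # [0.0, 1.0, 2.0, 3.0, 4.0, 9.0]
```

The largest node goes from being a billion times the size of the smallest to only nine units larger, which is why the smaller nodes and their labels become visible.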
You can further interact with the hyperlink diagram by hovering over any node to highlight its immediate connections to other nodes. The arrow on each line indicates the direction of the connection. For instance, in the image below, an arrow connects two nodes and reveals that townyarmouth.ca links to the atlantic.ctvnews.ca domain.
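The same directed links can be read out of a downloaded graph file. The sketch below assumes the file follows the general GEXF layout, with `node` elements carrying `id` and `label` attributes and `edge` elements carrying `source` and `target` attributes. The sample document is hand-written for illustration and is not actual Cloud output.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix such as "{http://...}node"."""
    return tag.rsplit("}", 1)[-1]


def read_edges(gexf_text):
    """Return directed (source_label, target_label) pairs from GEXF text."""
    root = ET.fromstring(gexf_text)
    labels = {}
    edges = []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "node":
            node_id = element.get("id")
            labels[node_id] = element.get("label", node_id)
        elif name == "edge":
            edges.append((element.get("source"), element.get("target")))
    # Translate node ids to labels once every node has been seen.
    return [(labels.get(s, s), labels.get(t, t)) for s, t in edges]


# Hand-written sample in the assumed GEXF layout.
SAMPLE = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="townyarmouth.ca"/>
      <node id="1" label="atlantic.ctvnews.ca"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1"/>
    </edges>
  </graph>
</gexf>"""

for source, target in read_edges(SAMPLE):
    print(f"{source} links to {target}")
```

Each pair corresponds to one arrow in the diagram, pointing from the linking domain to the linked-to domain.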
Exploring Derivatives with the Archives Unleashed Cloud Jupyter Notebooks
We have prototype Jupyter Notebooks available for you to work with the derivatives generated by the Archives Unleashed Cloud. You can read more about this in our blog post "Exploring Web Archival Data through Archives Unleashed Cloud Jupyter Notebooks". They allow you to interactively explore and filter the domain count information, extracted full text, and network visualization data generated by the Archives Unleashed Cloud.
We are currently exploring greater integration between the notebooks and the Archives Unleashed Cloud. To use them now, please visit the GitHub repository here and follow the instructions.
This is still in the prototype stage. Any and all feedback and suggestions are greatly appreciated and can be sent to our project manager Samantha Fritz at firstname.lastname@example.org.
Need More Help?
Is your question or issue not discussed above or on our frequently asked questions page? We'd love to hear more. Please join our Slack workspace (the #auk-support channel is dedicated to these queries!) or check out the get involved section of our website.