Using the Archives Unleashed Cloud Derivative Files

One of the main features of the Archives Unleashed Cloud is the creation of derivative files. These are datasets that provide information about your web archives and can be used by researchers in place of the original WARC files.

Once analytics have been run on a collection, as discussed in the AUK documentation, you will be able to view and access a series of four derivative files.

  1. Gephi
  2. Network
  3. Domains
  4. Full Text

Learning Guides

The following guides walk through how to use AUK derivative files for further analysis.

When using the web as a historical source, the ability to see the way websites link to each other can be invaluable. However, using network analysis in historical research can also be a daunting prospect.

Tutorial: Network Graphing Archived Websites With Gephi
File Output: .gexf and .graphml
Files Generated:
  1. Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
  2. Raw Network file, which can also be loaded into Gephi. You will have to use Gephi to lay it out yourself.
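If you want a quick look at the raw network file before opening Gephi, the sketch below counts incoming links per node using only the Python standard library. It assumes the .graphml file follows the standard GraphML schema (`node` and `edge` elements in the GraphML namespace). The function name and the sample graph are illustrative and are not actual AUK output. In a real file the node ids may be opaque identifiers, with the domain name stored in a separate data attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML namespace, written in ElementTree's "{uri}tag" form.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def in_degree_from_graphml(graphml_text):
    """Return a Counter mapping each node id to its number of incoming edges."""
    root = ET.fromstring(graphml_text)
    counts = Counter()
    for edge in root.iter(GRAPHML_NS + "edge"):
        counts[edge.get("target")] += 1
    return counts


# Hand-written sample for illustration only (hypothetical domains).
SAMPLE = """
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="a.example"/>
    <node id="b.example"/>
    <node id="c.example"/>
    <edge source="a.example" target="b.example"/>
    <edge source="c.example" target="b.example"/>
    <edge source="b.example" target="c.example"/>
  </graph>
</graphml>
"""

if __name__ == "__main__":
    for node_id, degree in in_degree_from_graphml(SAMPLE).most_common():
        print(node_id, degree)
```

To try this on a downloaded file, read the file's contents into a string and pass it to `in_degree_from_graphml`. Nodes with the highest counts are the ones most often linked to within the collection.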

The full text of a web collection can be invaluable for a researcher! AUK users will have access to the full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information about when it was crawled, the name of the domain, and the full URL of the content.
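As a rough illustration of working with this one-record-per-line layout, the sketch below splits a line into its four pieces. The exact delimiter is not specified here, so the code assumes comma-separated fields in the order crawl date, domain, URL, text, optionally wrapped in a single pair of parentheses. Check a few lines of your own file before relying on it. `PageRecord` and `parse_full_text_line` are hypothetical names, and a URL that itself contains a comma would be split incorrectly under this assumption.

```python
from typing import NamedTuple, Optional


class PageRecord(NamedTuple):
    crawl_date: str
    domain: str
    url: str
    text: str


def parse_full_text_line(line: str) -> Optional[PageRecord]:
    """Split one line into (crawl_date, domain, url, text), or None if malformed."""
    line = line.rstrip("\n")
    # Assumption: a record may be wrapped in one pair of parentheses.
    if line.startswith("(") and line.endswith(")"):
        line = line[1:-1]
    # Split on the first three commas only, so commas in the page text survive.
    parts = line.split(",", 3)
    if len(parts) != 4:
        return None
    return PageRecord(*parts)


if __name__ == "__main__":
    # Made-up sample line, not real AUK output.
    sample = "(20180101,example.com,http://example.com/about,Hello, world)"
    print(parse_full_text_line(sample))
```

Because the file can be large, it is usually better to iterate over it line by line than to read it into memory all at once.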

Tutorial
File Output: .txt
Files Generated: A text file is generated which can be used in various other programs and software.
Additional Resources

Working with large amounts of text can be challenging. Fortunately, many resources on the web teach the basics of text analysis.

Sometimes you might just want to know what is inside a web archive! The web is a complex beast: a small web archive that aims to collect just one page might end up capturing embedded widgets and content from other websites just to make something play. Alternatively, a crawl might have been corrupted and begun to find material that's far out of scope.

Tutorial: Making Sense of the Domains Count
File Output: .txt
Files Generated: A text file containing the frequency count of domains captured within your web archive.
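To spot out-of-scope domains quickly, you can rank the counts from this file. The sketch below assumes each line holds a domain and its count separated by a comma, optionally wrapped in parentheses. That layout is an assumption, so inspect your own file first. The function names and sample lines are hypothetical.

```python
def parse_domain_counts(lines):
    """Build a {domain: count} dict from lines shaped like 'domain,count'."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: a record may be wrapped in one pair of parentheses.
        if line.startswith("(") and line.endswith(")"):
            line = line[1:-1]
        # Split on the last comma so the count is always the final field.
        domain, sep, count = line.rpartition(",")
        count = count.strip()
        if not sep or not count.isdigit():
            continue  # skip lines that do not match the assumed layout
        domain = domain.strip()
        counts[domain] = counts.get(domain, 0) + int(count)
    return counts


def top_domains(counts, n=10):
    """Return the n most frequent domains as (domain, count) pairs."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up sample lines, not real AUK output.
    sample = ["(example.com,120)", "(cdn.example.net,45)", "(widgets.example.org,7)"]
    for domain, count in top_domains(parse_domain_counts(sample), n=2):
        print(domain, count)
```

A long tail of unexpected domains with small counts often points to embedded widgets, while a large count for an unrelated domain may indicate a crawl that drifted out of scope.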

How to Cite Archives Unleashed

If you've found the Archives Unleashed Toolkit or Cloud helpful in your research, please consider using the citation below in your publications. Your citations help to further the recognition of open-source tools for scientific inquiry, assist in growing the web archiving community, and acknowledge the efforts of contributors to this project.

Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.16.0). Apache License, Version 2.0.