Using the Archives Unleashed Cloud Derivative Files
One of the main features of the Archives Unleashed Cloud is the creation of derivative files. These are datasets that provide information about your web archives and can be used by researchers in place of the original WARC files.
Once analytics have been run on a collection, as discussed in the documentation, you will be able to view and access a series of four derivative files.
The following guides walk through how to use Cloud derivative files and the Archives Unleashed Cloud Notebooks for further analysis.
We are excited to introduce our Archives Unleashed Jupyter Notebooks, a prototype method for working with the derivatives generated by the Archives Unleashed Cloud. They allow you to interactively explore and filter the domain count information, extracted full text, and network visualization data generated by the Cloud.
When using the web as a historical source, the ability to see the way websites link to each other can be invaluable. However, using network analysis in historical research can also be a daunting prospect.
| File Output | .gexf and .graphml |
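As a first look at how sites link to each other, the sketch below counts incoming links per node in a GraphML document, using only the Python standard library. It relies on the standard GraphML namespace and `edge` elements with `source` and `target` attributes. The sample graph and the function name `in_degree_counts` are made up for illustration. In a real Cloud derivative, node ids may be opaque identifiers with the domain name stored in a separate attribute, so treat this as a starting point rather than a description of the Cloud's exact output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML namespace, mapped to a short prefix for searching.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def in_degree_counts(graphml_text):
    """Count incoming edges per node id in a GraphML document (hypothetical helper)."""
    root = ET.fromstring(graphml_text)
    counts = Counter()
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        counts[edge.get("target")] += 1
    return counts


# Made-up sample graph: node ids are domain names here, which is an assumption.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="a.example"/>
    <node id="b.example"/>
    <node id="c.example"/>
    <edge source="a.example" target="b.example"/>
    <edge source="c.example" target="b.example"/>
    <edge source="b.example" target="a.example"/>
  </graph>
</graphml>
"""

if __name__ == "__main__":
    # List nodes from most to least linked-to.
    for node, count in in_degree_counts(SAMPLE).most_common():
        print(node, count)
```

Nodes with many incoming links are often the hubs of a collection, which makes them a reasonable place to begin a closer reading.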
The full text of a web collection can be invaluable for a researcher! Archives Unleashed Cloud users will have access to the full text file. In it, each web page within the web archive collection has its full text presented on one line, along with the date it was crawled, the name of the domain, and the full URL of the content.
| Files Generated | A text file is generated which can be used in various other programs and software. |
Working with lots of text can be challenging. Fortunately, there are lots of resources on the web to help teach you the basics of text analysis.
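A minimal sketch of a first text-analysis step: splitting each line of the full text file into its parts and counting words per domain. It assumes each line is laid out as `crawl_date,domain,url,text`, separated by commas. That layout, the sample lines, and the names `PageRecord`, `parse_line`, and `word_counts_by_domain` are assumptions for illustration, so check a few lines of your own file first. A URL that itself contains a comma would break this simple split.

```python
from collections import Counter
from typing import NamedTuple


class PageRecord(NamedTuple):
    crawl_date: str
    domain: str
    url: str
    text: str


def parse_line(line):
    """Split one line into its four assumed fields."""
    # Split on the first three commas only, so commas inside the text survive.
    crawl_date, domain, url, text = line.rstrip("\n").split(",", 3)
    return PageRecord(crawl_date, domain, url, text)


def word_counts_by_domain(lines):
    """Return a dict mapping each domain to a Counter of lowercased words."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_line(line)
        words = record.text.lower().split()
        counts.setdefault(record.domain, Counter()).update(words)
    return counts


# Made-up sample lines in the assumed layout.
SAMPLE_LINES = [
    "20190101,example.org,http://example.org/,Welcome to the example archive",
    "20190102,example.org,http://example.org/about,About the archive, and more",
]

if __name__ == "__main__":
    for domain, counter in word_counts_by_domain(SAMPLE_LINES).items():
        print(domain, counter.most_common(3))
```

Because `word_counts_by_domain` accepts any iterable of lines, an open file handle can be passed in directly, which avoids loading a large file into memory at once.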
Sometimes you might just want to know what is inside a web archive! The web is a complex beast: a small web archive that aims to collect just one page might end up capturing embedded widgets and content from other websites just to make something play. Alternatively, a crawl might have been corrupted and begun to find material that's far out of scope.
| Tutorial | Making Sense of the Domains Count |
| Files Generated | A text file containing the frequency count of domains captured within your web archive. |
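One way to check whether a crawl has drifted out of scope is to rank the domains and measure how much of the collection falls outside the ones you expected. The sketch below assumes each line of the domain count file has the form `domain,count`. That layout, the sample data, and the helper names are assumptions for illustration, so adjust the parsing to match your own file.

```python
def parse_domain_counts(lines):
    """Build a dict of domain -> count from lines assumed to look like 'domain,count'."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma so the count is always the final field.
        domain, count = line.rsplit(",", 1)
        counts[domain] = counts.get(domain, 0) + int(count)
    return counts


def top_domains(counts, n=10):
    """Return the n most frequent domains, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


def share_outside(counts, expected):
    """Fraction of all captures that come from domains not in `expected`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    outside = sum(c for d, c in counts.items() if d not in expected)
    return outside / total


# Made-up sample lines in the assumed layout.
SAMPLE_LINES = [
    "example.org,90",
    "cdn.example.net,8",
    "unrelated.example.com,2",
]

if __name__ == "__main__":
    counts = parse_domain_counts(SAMPLE_LINES)
    print(top_domains(counts, 2))
    print(share_outside(counts, {"example.org"}))
```

A high share of captures from unexpected domains does not prove that a crawl went wrong, since embedded widgets and media hosts are common, but it does point to where a closer look is worthwhile.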
How to Cite Archives Unleashed
If you've found the Archives Unleashed Toolkit or Cloud helpful in your research, please consider using the citation below in your publications. Your citations help to further the recognition of open-source tools for scientific inquiry, assist in growing the web archiving community, and acknowledge the efforts of contributors to this project.
Archives Unleashed Project. (2019). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.