Network Graphing Archived Websites With Gephi
Tutorial by Sarah McTavish (PhD Candidate, University of Waterloo)
When using the web as a historical source, the ability to see the way websites link to each other can be invaluable. However, using network analysis in historical research can also be a daunting prospect. The creation and interpretation of network graphs require new tools and methodologies which may be unfamiliar to historians and other humanities scholars. Increasingly, tools such as the Archives Unleashed Cloud are making large datasets easily available for historical research using web sources. But what do you do with this data once you have it?
This tutorial explores what you can learn from the derivative file marked as "Gephi" or "Raw Network" in your collections page.
There are several tools and methods for interpreting and visualizing large datasets. However, open-source network graphing software such as Gephi allows users to sort, filter, and graph data using a simple visual interface. Gephi can be downloaded for free for Windows, macOS, and Linux from the Gephi website.
Once installed, Gephi can generate graphs from a spreadsheet or from other graph file formats, which you can then manipulate. Using a GraphML file generated from the Archives Unleashed Cloud, we will demonstrate how to analyze a network graph using Gephi.
Example in this Tutorial
In this example, we will be graphing the University of Alberta's Fort McMurray Wildfires collection to explore where people were getting their information from during this significant Canadian news event. These northern Alberta wildfires took place during early May 2016, and resulted in the evacuation of nearly 90,000 residents. The fires spread quickly over several days, leading to dramatic photographs and video footage of residents fleeing their homes as the fires engulfed parts of the city and the surrounding area. The nature of the wildfires and evacuation sparked considerable media attention worldwide.
This international interest compounded the expected online coverage from local and regional residents and officials, who used the internet to spread necessary information on evacuation efforts and the state of the city during the fires. But which online news sources did people use to learn about the wildfires? Were official government websites linked to more frequently than the mainstream media? What role did blogs and social media play? By visualizing the linkages between sites crawled and archived during the media coverage of the wildfires, we can see which news sources were most relevant to people looking for information and which sites were linked to most frequently.
Loading Your Data into Gephi
The downloaded .graphml file can be opened directly in Gephi by choosing "Open Graph File" from the Gephi startup screen and then selecting the file to be opened.
Note that if you are using Safari, the file may come with an extra .xml extension added to it (i.e. it will read 9745-gephi.gexf.xml). You will need to rename the file to remove the trailing .xml, so that the name ends with its original graph-file extension.
Once opened, we see from the statistics on the upper right hand corner that the collection has 4106 nodes, or individual websites, and 6817 connections between those websites, or edges.
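To make those two numbers less mysterious, here is a minimal sketch of how nodes and edges could be counted directly from a GraphML file using only Python's standard library. The sample document and the domain names in it are made up for illustration; a real file from the Archives Unleashed Cloud will be far larger and carry extra attributes.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical GraphML document standing in for the downloaded file.
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="alberta.ca"/>
    <node id="twitter.com"/>
    <node id="cbc.ca"/>
    <edge source="alberta.ca" target="twitter.com"/>
    <edge source="twitter.com" target="alberta.ca"/>
    <edge source="cbc.ca" target="twitter.com"/>
  </graph>
</graphml>
"""

# GraphML elements live in this XML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def count_nodes_and_edges(graphml_text):
    """Return (number of nodes, number of edges) found in a GraphML string."""
    root = ET.fromstring(graphml_text)
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return len(nodes), len(edges)


node_count, edge_count = count_nodes_and_edges(SAMPLE_GRAPHML)
print(node_count, "nodes,", edge_count, "edges")
```

For a file on disk, the same function could be fed the result of reading the file as text; the counts should match what Gephi reports for that file.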
Gephi also provides a Data Laboratory view, which provides a list of the nodes, with all of their relevant statistics. It is in this view that the dataset can be manipulated; individual nodes can be deleted, the data can be sorted according to variables such as how frequently the node is linked to, and the labels for the data points can be changed. You can find that by clicking the "Data Laboratory" button at the upper left of the program.
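Sorting by how frequently a node is linked to amounts to counting incoming edges. The sketch below shows that idea on a made-up edge list; the function name and the single-letter node names are hypothetical, and this is not how Gephi itself is implemented.

```python
from collections import Counter


def most_linked_to(edges, top=3):
    """Rank nodes by in-degree: the number of edges pointing at each node.

    `edges` is a list of (source, target) pairs.
    """
    in_degree = Counter(target for _source, target in edges)
    return in_degree.most_common(top)


# Hypothetical links between four sites.
example_edges = [
    ("a", "t"), ("b", "t"), ("c", "t"),  # three sites link to "t"
    ("t", "a"), ("b", "a"),              # two sites link to "a"
    ("c", "b"),                          # one site links to "b"
]
print(most_linked_to(example_edges, top=2))
```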
Now let's begin exploring our data.
Sorting and Graphing Your Data
Going back to the "Overview" tab, we can then begin sorting and filtering our data. On the right side of the screen, Gephi's Filters and Statistics toolbar provides some simple but powerful tools to determine which variables will be used to graph the data, and which data points will be included.
Under statistics, there is a list of algorithms that can be used in order to sort the data.
- Average Clustering Coefficient - The measure of how "complete" the neighbourhood of a node is, averaged over all nodes. In a network where every node is connected to every other node, the clustering coefficient will be 1. If no nodes are connected to any other nodes, the clustering coefficient will be 0.
- Average Path Length - The average distance between all pairs of nodes.
- Betweenness Centrality - Measures how often a node appears on the shortest path between any given two nodes.
- Closeness Centrality - Measures the average distance from a given node to all other nodes in the network.
- Connected Components - Measures the number of connected components -- or discrete "chunks" -- in a network.
- Degree - Measures the number of edges that connect to a node.
- Degree Power Law - Measures degree distribution according to power-law scale.
- Diameter - Calculates the maximum distance between all pairs of nodes.
- Eigenvector Centrality - Calculates node importance based on connections to other nodes.
- HITS - Hyperlink-Induced Topic Search algorithm, which evaluates nodes for page authority and the value of its links to other pages.
- Modularity - Measures the division of the network into clusters.
- PageRank - Determines the probability of clicking through to each node, given a certain number of random clicks through links.
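Two of the simpler measures in this list, Degree and Average Clustering Coefficient, can be worked out by hand on a small graph. The following is a minimal sketch that assumes links are treated as undirected; Gephi's own handling of directed graphs may differ, and the function names here are illustrative only.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj, node):
    """Number of edges that connect to a node."""
    return len(adj[node])


def local_clustering(adj, node):
    """Fraction of possible links between a node's neighbours that exist."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to connect
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficient over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# A triangle (a, b, c) with one extra site "d" hanging off "a".
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])
print(degree(adj, "a"), average_clustering(adj))
```

In a plain triangle, where every node is connected to every other node, the same function returns 1, which matches the definition given in the list.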
Because we want to determine which sites were linked to most frequently, and so which sources were used the most to learn about the wildfires, we prioritize PageRank: the probability of a user reaching each page, given a certain number of clicks on random links (Page, et al, 1999).
By clicking on the "run" button next to PageRank, Gephi will calculate the PageRank for each of our nodes. In order to help visualize distinct clusters within the network, we will also run the Modularity statistic; this measures the division of the network into clusters of closely-related nodes (Blondel, et al, 2008).
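For readers curious about what the PageRank statistic computes, here is a minimal power-iteration sketch of the random-clicker idea. It is an illustration under stated assumptions, not Gephi's implementation: the damping value of 0.85 is a commonly used choice, the fixed iteration count stands in for a proper convergence check, and the links between the sites are invented.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by repeated redistribution of rank.

    `links` maps each node to the list of nodes it links to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes an equal share of its rank along every out-link.
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical links; the real collection has thousands of nodes.
example_links = {
    "alberta.ca": ["twitter.com"],
    "cbc.ca": ["twitter.com"],
    "macleans.ca": ["twitter.com", "cbc.ca"],
    "twitter.com": ["alberta.ca"],
}
ranks = pagerank(example_links)
top_site = max(ranks, key=ranks.get)
print(top_site)
```

The scores always sum to 1, so each can be read as the share of a random clicker's time spent on that site.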
It is also possible to adjust the appearance of the nodes, edges, and labels to help provide a useful visualization of the network. In this example, I have adjusted the node and edge colours according to modularity, which means that distinct clusters will appear in different colours from other clusters. I have also adjusted the node size and label text size according to PageRank, so that websites with a higher PageRank appear larger. The steps are:
- Click on the colour palette icon under the Appearance tab, on the top left of the Gephi window.
- Click on "Partition" and choose "Modularity Class" from the menu that appears. Click "Apply."
- Next, choose the node size icon, next to the colour palette.
- Click on "Ranking" and then choose "PageRank" from the menu. Node minimum and maximum size can be adjusted according to preferences. Click "Apply."
- Make labels visible using the text icon (the large "T") on the bottom toolbar. Label text will appear, but all in one size.
- Choose the text size icon to the far right of the appearance toolbar.
- Click on "Ranking" and then choose "PageRank" from the drop down menu. Text minimum and maximum size can be adjusted before clicking "Apply."
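Sizing nodes or labels by a ranking boils down to mapping each score onto a range between the chosen minimum and maximum. The sketch below assumes a simple linear mapping, which may not match the interpolation Gephi applies; the function name, default sizes, and scores are hypothetical.

```python
def scale_by_rank(values, min_size=10.0, max_size=50.0):
    """Map each node's score linearly onto [min_size, max_size]."""
    lowest, highest = min(values.values()), max(values.values())
    if highest == lowest:
        # All scores equal: give every node the minimum size.
        return {node: min_size for node in values}
    span = highest - lowest
    return {
        node: min_size + (score - lowest) / span * (max_size - min_size)
        for node, score in values.items()
    }


# Hypothetical PageRank scores for three sites.
sizes = scale_by_rank({"a": 0.1, "b": 0.3, "c": 0.5})
print(sizes)
```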
Once the network's appearance has been adjusted, we can begin to change the layout to provide a useful visualization of the network's connections. Under the Layout tab, Gephi provides several organizational algorithms, each with adjustable settings that change how the graph looks. Gephi provides a tutorial which describes how each algorithm works; however, it is often necessary to try several layouts, experimenting with variables to find a layout that works for each particular graph.
In general, the Yifan Hu layout is a good choice for large graphs that show distinct clustering; if something doesn't work, the Random Layout algorithm can be run to "reset" the graph before trying again.
In the Fort McMurray Wildfires graph, the Yifan Hu algorithm works well to separate the network out into distinct clusters, emphasizing the interconnections between the nodes. The Label Adjust layout is also useful for moving around overlapping node labels.
By looking at the network graph, it becomes apparent that social media, such as Twitter, Facebook, and Instagram, played a central role in directing how users found information about the wildfires. This is perhaps not a surprising result; however, we can also see the reciprocal linking between official government web pages, such as Alberta.ca, and social media, but not more traditional news sources. This provides a fascinating perspective on the ways that information was disseminated from official sources during the 2016 wildfires. Furthermore, we see distinct clusters of online and traditional news media, which show patterns of interaction and connection.
To help see these clusters even more clearly, you can hover your mouse cursor over one of the nodes. For example, here we have selected the "Macleans" news magazine's website; we can see connections to other traditional Canadian news media outlets such as the National Post and Globe and Mail newspapers.
Filtering Your Data
Gephi also provides a useful set of tools for filtering data according to a number of variables. When working with many large datasets, it may be necessary to filter the data down to a smaller number of nodes, to provide useful visualizations. In short, filtering removes a lot of the "clutter" in the network graphs, which allows places of significance to show through clearly and quickly. Depending on need, the graph could be filtered using the edge tools, showing only sites with reciprocal linking, or one way linking, for example. These filters are found on the right-hand toolbar, under the Filters tab.
For this graph, I've filtered according to Degree Range, removing all sites that don't link to at least two other sites. This removes a lot of the "cloud" of sites that appear at the edges of the network but are not highly interconnected within it. However, we also lose distinct clusters, such as the one that appears in purple in the upper right corner of the graph; these sites only link to one site within our network, but the existence of these clusters could be a place for further exploration. Filtering can be incredibly important and necessary for comprehending the network as a whole, but can lead to a loss of nuance within the network.
How to filter by degree range:
- Choose the filters tab from the toolbar on the right of the screen.
- Click on the "Topology" folder from the menu and then double-click on the "Degree Range" filter.
- The filter parameters will appear at the bottom of the Filters panel. The slider can be dragged to adjust the filter. In this example, I set the minimum degree to "2" in order to filter out all sites that don't link to at least two other sites.
- Click on the "Start" button to begin filtering. Clicking "Stop" will restore all of the filtered sites.
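The logic of a minimum-degree filter can be sketched in a few lines. This version makes a single pass: degrees are computed once on the full graph and are not recomputed after nodes are dropped, which is an assumption about how the filter behaves rather than a description of Gephi's code. The function name and node names are hypothetical.

```python
def filter_by_min_degree(edges, min_degree=2):
    """Keep nodes with at least `min_degree` connections, and the edges
    that run between two kept nodes.

    `edges` is a list of (source, target) pairs; direction is ignored
    when counting degree.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    kept_nodes = {node for node, d in degree.items() if d >= min_degree}
    kept_edges = [(a, b) for a, b in edges if a in kept_nodes and b in kept_nodes]
    return kept_nodes, kept_edges


# "d" has only one connection, so it falls out of the filtered graph.
nodes, remaining = filter_by_min_degree(
    [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")], min_degree=2
)
print(sorted(nodes), remaining)
```

As in the graph above, anything attached to the network by a single link disappears, which is exactly how small outlying clusters can be lost.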
Taking Your Graph Outside of Gephi
Lastly, Gephi provides preview and export tools which make it easy to include a network graph in a paper or presentation. Users can change the appearance of the graph to suit their needs, and then export it as an SVG, PDF, or PNG file. This tool is found under the "Preview" button, at the top left of the screen.
Label sizes and node appearance can all be changed on the preview toolbar to the left of the screen. It is necessary to click the "Refresh" button to view changes. The image of the graph can then be saved by clicking "Export SVG/PDF/PNG" near the bottom left of the screen.
Using Gephi, it is a simple task to generate network graphs using archived web collections. Gephi's powerful and diverse range of tools and algorithms allows data to be manipulated and visualized in order to highlight the interconnection between archived websites. These graphs allow for new perspectives on influence and interaction when doing historical research with archived web materials.