6 min read
There’s No Place Like Home
Network graphs give a high-level view of the relationships among entities in your data; sometimes a different approach can give fresh insights into how they interact with each other and identify a connection you didn’t know was there. Some pictures can be worth much more than 1,000 words.
To illustrate, consider the screenplay for The Wizard of Oz. The title may refer to the Wizard, but based on the graph below, it is clearly Dorothy’s story. As the main character, she interacts with her family in Kansas, the Munchkins, residents of Oz (including the Witch) on her way to the Emerald City, and then (spoiler!) Kansas again.
In the graph above, the size of each node and label is based on “degree.” Degree is calculated as the number of scenes in which a character shares dialogue with others. Color here is based on naturally occurring clusters (i.e., communities) that tend to interact mostly among themselves. Notice how Dorothy’s Kansas connections are clustered together, away from the center; a similar pattern segregates the residents of Munchkinland and others who she meets along the yellow brick road.
This approach works well over the rainbow. What does it look like elsewhere?
A note about creating the pictures in this blog: The open source package Gephi (version 0.8.1) was used to read the connections (i.e., edges) and produce the files shown here - there are other tools for creating similar results. To create the edges file, the screenplay was downloaded in HTML format. Then, a script identified scenes and pairs of characters who shared dialogue with some basic Natural Language Processing (NLP). The code to reproduce all of this is located here: https://github.com/joebluems/NetworkGraphs.git
The Godfather is the epic story of a crime syndicate and the passing of the torch from a mob boss father to the son he was determined to keep out of the “family” business. The network graph of the screenplay is below—the same algorithm was used to create all of the graphs (i.e., size, labels, color).
The differences between the network graphs of The Godfather and The Wizard of Oz can illustrate variations in how entities behave in a network. The former revolved around a primary character interacting with groups she met along her journey. The network of The Godfather emphasizes Michael Corleone’s role, but there is a strong secondary layer of characters (e.g. family, foes, etc.) who interact within their own clusters.
Could a network graph detect a hidden identity? In The Dark Knight, billionaire playboy Bruce Wayne prowls Gotham’s streets as “The Batman.” As the network graph shows, those dual identities take a back seat to the Joker and his antics.
The graph for The Dark Knight appears to be much different from the other two. Perhaps the disjointed graph is a result of the Joker’s desire to create anarchy. There are communities (the police, the bank-robbing clowns, and the mob bosses), but there are also conversations happening unconnected to the main characters. Bruce Wayne and Batman interact with many people but not much with each other? Could link analysis be a clue for finding suspicious activity in your network?
Suppose that instead of movie scripts, you had real data containing actual interactions among your users or customers. What if the network were so large that simply visualizing it wouldn’t work? You could analyze the structure of the network instead.
The table below shows metrics for the network structures of the screenplays used in this blog:
Table 1 - Network Graph Metrics
Degree quantifies the number of scenes a character shares with others - for screenplays all links are reciprocated but that needn’t be the case. Networks dominated by main characters will have higher results. Graph density measures how many connections exist, compared to a fully connected network. Modularity identifies the strength of a network’s communities. Path Length is another metric of completeness of a graph. It is the average number of steps needed for any character (i.e., a complete graph would have path length = 1) to reach any other, and it should mirror the density.
Rather than compare metrics individually, you could assess the similarity of networks based on all the metrics - it may provide an alternative to collaborative filtering recommendations. This is illustrated in the table below:
Table 2: Pairwise Similarities
The similarities are based on the basic Euclidian distance of the normalized metrics - there are more sophisticated approaches available. This suggests that The Dark Knight and The Godfather are the two most similar films (i.e., shortest distance); The Wizard of Oz and The Dark Knight are the most disparate (even though they apparently share the character “Scarecrow”). If you’ve seen all three, this will probably make sense. This approach may also help you assess the structure of multiple communities within a network.
As the size of your network grows, storage of “edges” and parallel calculation of metrics across the entire network becomes challenging. Apache Spark running in a Hadoop environment addresses both of these limitations.
Applications of network analysis include finding fraud (i.e., suspicious connections) or identifying isolated communities to which connectivity can be improved. Regardless of the type of network, it’s likely there are benefits to analyzing the structure and finding the hidden patterns. Consider it an offer you can’t refuse.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.