A series of graphs
This will be a series of posts about network graphs and how we can use them to get a glimpse into a large dataset. We'll focus specifically on the career and skills graphs in the IT industry.
We've all heard of the social graph - a collection of links between people and the things they like. That's just a type of network graph. We can construct networks from any group of connected things. Authors, friends, trade between countries or organisations, Wikipedia articles, even beer reviews. SNAP from Stanford University is one place where you can get data like this to play with.
In this series, we'll of course be using career, skill, and job advertisement data. We can look at relationships between careers, skills, and various combinations of them. Today, we'll be using skills. We have 254 skills (nodes) with over 2,500 connections (edges).
A pretty picture
At a basic level, where a skill is mentioned alongside another skill in a job advert, we connect them. The more ads that show this connection, the stronger the link between the two skills. Of course, it's a little more complex than that when we get into the distance between the words as they appear in an advert, and a touch of secret sauce we add, but the fundamental concept is simple.
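The basic idea above can be sketched in a few lines of Python. This is only an illustration of the simple co-occurrence count: the `adverts` list is made-up sample data, `build_edges` is a hypothetical helper name, and the word-distance weighting and "secret sauce" mentioned above are deliberately left out.

```python
from collections import Counter
from itertools import combinations

# Made-up sample data: each advert reduced to the set of skills it mentions.
adverts = [
    {"SQL", "Java", "Scala"},
    {"SQL", "Java"},
    {"SQL", "SharePoint"},
    {"Java", "Clojure"},
]

def build_edges(adverts):
    """Count, for every pair of skills, how many adverts mention both."""
    weights = Counter()
    for skills in adverts:
        # sorted() gives each pair a canonical order, so (A, B) and (B, A)
        # are counted as the same edge.
        for a, b in combinations(sorted(skills), 2):
            weights[(a, b)] += 1
    return weights

edges = build_edges(adverts)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

Each key in `edges` is a link between two skills and each value is the strength of that link, measured here simply as the number of adverts containing both.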
Drawing this graph makes quite a nice picture. Colours represent various groupings of skills based on their closeness to each other. Download a high res version.
We can see a lot of connections flowing everywhere. In fact, the IT field is a very tightly knit community in terms of skills. Getting from any skill to any other skill takes an average of only 1.9 stops (that's called the _average path length_). This simply means that finding a job that combines your current skillset with a skillset you want to adopt shouldn't be too difficult.
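To make the average path length concrete, here is a rough sketch that computes it on a tiny toy network using a breadth-first search. The `graph` below is invented for illustration and is not our real skills data, and the function names are our own.

```python
from collections import deque

# Invented toy network: skill -> set of directly linked skills.
graph = {
    "SQL": {"Java", "SharePoint", "Analysis"},
    "Java": {"SQL", "Scala"},
    "Scala": {"Java"},
    "SharePoint": {"SQL"},
    "Analysis": {"SQL"},
}

def distances_from(graph, start):
    """Breadth-first search: number of stops from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def average_path_length(graph):
    """Mean shortest-path length over all pairs of distinct, connected nodes."""
    total = 0
    pairs = 0
    for start in graph:
        for node, d in distances_from(graph, start).items():
            if node != start:
                total += d
                pairs += 1
    return total / pairs

print(average_path_length(graph))
```

The sketch treats every link as one stop regardless of its strength, which is the usual convention for this measure.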
Finding central station
So how do you determine the most important skill? Which skill is connected to most of the others? Which technology will give you the most options for the future?
Think of it like a train network with each skill being a station. The more often you need to pass through a given station to get anywhere else, the more central it is to the network. In this case, the more often a skill is mentioned alongside others, the more important it is. We call this importance centrality and there are numerous ways to measure it.
Degree centrality is the number of connections that a node has. Put another way, it's the number of stations directly linked to the station you're looking at. The higher the degree, the more likely it is that the skill will appear alongside others in an advert. In our network, SQL has a degree of 199 whilst Clojure has 11.
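Degree is the easiest of the measures to compute: count each node's neighbours. A minimal sketch, using an invented toy `graph` rather than our real network (so the numbers will not match the 199 and 11 quoted above):

```python
# Invented toy network: skill -> set of directly linked skills.
graph = {
    "SQL": {"Java", "SharePoint", "Analysis"},
    "Java": {"SQL", "Scala"},
    "Scala": {"Java"},
    "SharePoint": {"SQL"},
    "Analysis": {"SQL"},
}

# Degree of a node = number of nodes it is directly linked to.
degree = {skill: len(neighbours) for skill, neighbours in graph.items()}

# Rank skills from most to least connected.
ranked = sorted(degree.items(), key=lambda item: item[1], reverse=True)
for skill, d in ranked:
    print(f"{skill}: {d}")
```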
Closeness represents how easily a node can make connections with others. For example, Java is close to Clojure and Scala because they have direct links (they appear together in many adverts), but it is far from SharePoint and IT Management. The lower the overall closeness score, the fewer steps you need to take to get to any other node.
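Here is one way closeness could be computed, sketched on an invented toy network. We assume the score is the average number of steps from a node to every other node, so that lower means closer, which matches the description above. Many libraries report the reciprocal instead, so that higher means closer.

```python
from collections import deque

# Invented toy network: skill -> set of directly linked skills.
graph = {
    "SQL": {"Java", "SharePoint", "Analysis"},
    "Java": {"SQL", "Scala"},
    "Scala": {"Java"},
    "SharePoint": {"SQL"},
    "Analysis": {"SQL"},
}

def average_distance(graph, start):
    """Average number of steps from start to every other reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    others = [d for node, d in dist.items() if node != start]
    return sum(others) / len(others)

# Lower score = fewer steps to reach everything else.
closeness = {skill: average_distance(graph, skill) for skill in graph}
for skill, score in sorted(closeness.items(), key=lambda item: item[1]):
    print(f"{skill}: {score:.2f}")
```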
Betweenness tells us how important a skill is to getting between disparate parts of the network. It's like a bridge. The station may not have a lot of connections itself (low degree), but it may be the only way to get from one group of stations to another. It's like the person who introduced you to your girlfriend or boyfriend.
Betweenness centrality is a useful way to find the bridging technologies. Having these skills will help you find jobs in disparate, not strongly related areas that you might be interested in. SQL, Databases, and Analysis skills are the bridges that will get you to other technologies.
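The bridge idea can be sketched with Brandes' algorithm, a standard way to compute betweenness on an unweighted network. The toy `graph` below is invented: two tight groups of skills joined only through SQL. It is not our real data, and we are not claiming this is exactly how our own scores are produced.

```python
from collections import deque

# Invented toy network: two triangles of skills joined only through SQL.
graph = {
    "Java": {"Scala", "Clojure", "SQL"},
    "Scala": {"Java", "Clojure"},
    "Clojure": {"Java", "Scala"},
    "SQL": {"Java", "SharePoint"},
    "SharePoint": {"SQL", "IT Management", "Analysis"},
    "IT Management": {"SharePoint", "Analysis"},
    "Analysis": {"SharePoint", "IT Management"},
}

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Phase 1: breadth-first search, counting shortest paths to each node.
        order = []                              # nodes by increasing distance
        preds = {node: [] for node in graph}    # previous stops on shortest paths
        paths = {node: 0 for node in graph}     # number of shortest paths
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
                if dist[neighbour] == dist[node] + 1:
                    paths[neighbour] += paths[node]
                    preds[neighbour].append(node)
        # Phase 2: walk back from the farthest nodes, crediting each node
        # with its share of the shortest paths that pass through it.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for p in preds[node]:
                dependency[p] += paths[p] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Every pair was counted once from each end, so halve the totals.
    return {node: value / 2 for node, value in score.items()}

bridges = betweenness(graph)
for skill, value in sorted(bridges.items(), key=lambda item: item[1], reverse=True):
    print(f"{skill}: {value}")
```

In this toy network SQL has only two connections, fewer than Java or SharePoint, yet it scores highest because every trip between the two groups has to pass through it.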
Eigenvector centrality may sound scary, but really it's not. It is closely related to degree centrality and is really a variation on it. It's based on the idea of downward connections. Take John and Mary, two school children. They each have 5 close friends. John's friends each have no other friends; they're an isolated group. Mary's friends, however, are quite social and each has many friends of their own.
Intuitively, we feel that Mary is a more central and powerful force in the schoolyard. Her network extends further than John's. It's a type of "friends of friends" scenario.
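The schoolyard can be turned into a small sketch using power iteration, one common way to approximate eigenvector centrality. Several things here are our own assumptions for the sake of a runnable example: John and Mary are linked to each other so the schoolyard is one connected network, each of Mary's friends has four further friends, the friend names are generated, and each node keeps its own score on every round so the iteration settles down.

```python
import math

graph = {}

def link(a, b):
    """Add an undirected friendship between a and b."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Assumption: John and Mary know each other, so the network is connected.
link("John", "Mary")
for i in range(5):
    link("John", f"John's friend {i}")      # these friends know only John
    link("Mary", f"Mary's friend {i}")
    for j in range(4):                       # assumption: four further friends each
        link(f"Mary's friend {i}", f"Friend {j} of Mary's friend {i}")

def eigenvector_centrality(graph, iterations=500, tolerance=1e-9):
    """Power iteration: a node's score is fed by the scores of its neighbours."""
    score = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Own score plus neighbours' scores; keeping the own score avoids
        # the back-and-forth oscillation that tree-like networks can cause.
        new = {n: score[n] + sum(score[m] for m in graph[n]) for n in graph}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - score[n]) for n in graph) < tolerance:
            return new
        score = new
    return score

centrality = eigenvector_centrality(graph)
print("John:", round(centrality["John"], 3), "degree", len(graph["John"]))
print("Mary:", round(centrality["Mary"], 3), "degree", len(graph["Mary"]))
```

John and Mary have exactly the same degree in this sketch, but Mary ends up with the higher score because her friends are themselves well connected.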
Given that the skills network we're looking at is very well connected, we don't have any truly isolated communities, so an eigenvector centrality analysis would not be fruitful here. For more dispersed networks, however, it becomes very useful. For example, Google uses a type of eigenvector centrality analysis to rank search results (PageRank).
There are as many other ways to measure centrality as there are network graphs, but they are all built on the four basic concepts above: the number of connections (degree), the distance between connections (closeness), the bridges between groups of connections (betweenness), and the friends of friends connections (eigenvector).
What's all this good for?
Pretty pictures are one thing, but why can't we just count how many times a skill is mentioned and use that to determine its popularity and worth?
Counting the frequency of a skill is a valid way to measure popularity. But it doesn't tell us how important a skill is or how well it relates to other skills. Interestingly, network analysis can also be used in subtle ways to locate the more generic skills and downplay their importance in order to highlight the more interesting and useful data underneath.
Centrality measures are also a key means of determining the most important nodes of a network. "Important" can mean anything from the most influential person in a group to the person most at risk of catching a virulent disease in a community, the most relevant search result, or the most useful technical skills to acquire.
However, as you'll learn through this series, relationships between various items may not always turn out as you expect. Is F# closer to other functional programming languages, or does it sit somewhere else?
We'll also look at how these graphs are produced and, in particular, how to find groupings or communities in a dataset.