The data for the majority of our metrics comes from analyzing millions of job advertisements. This data is not perfect, but we believe that it is a good proxy for the analysis we undertake.
Our data pipeline has many steps, each of which has the potential to introduce errors. We therefore outline each step below, along with the types of errors we are aware of and how we mitigate them. The pipeline consists of the following steps:
- Source data collection
- Skill extraction
- Experience extraction
- Career classification
- Aggregation & metric calculation
Our source data
Our source data is primarily expired job advertisements, which we use as a proxy for demand for people in particular careers and with particular skills.
Fitness of data for analysis
As described in another article here, what we want to know is rarely directly observable. Since we can't collect every technical person's pay stubs and watch which skills and technologies they use day to day (which would provide the best data I can think of!), we rely on analyzing job ad data.
While job ad data is not perfect, it provides us with key information about the job market: which skills are in demand and what employers and recruiters are looking for at the moment. We believe this data is useful for employees looking to expand their skill base, as well as for determining how their pay compares to market rates.
Not all jobs are advertised
Many jobs are never advertised and are therefore invisible to us. In particular, many senior positions are filled by headhunting or through informal networks. Our analysis misses these data points.
Mitigation We are unable to mitigate this issue at present.
We don't capture all job ads
We collect ads from a number of sources which, while extensive, do not cover all job ads. Our assumption is that the large number of ads we collect is representative of the population of job advertisements, but it is difficult for us to assess this.
Mitigation We are constantly attempting to increase the number of job ads in the system both through partnerships and aggregating data ourselves.
Salary information is sparse
Job advertisements often have no salary information, or only limited salary information.
Mitigation While most job ads don't contain salary information in their text, advertisers often provide some salary information so that their ads appear in searches which quote a salary range. Where possible, we access this information.
We also attempt to extract and parse any salary information appearing in the text of a job ad.
Salary information is often quoted as a range
Where salary information is available, it is often a range (e.g. $100k-120k).
Mitigation We take the midpoint of the two numbers (their mean).
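To make the midpoint rule concrete, here is a minimal sketch of how a quoted salary string might be parsed and reduced to a single number. It is not our production parser: the regular expression, the function names `parse_salary_range` and `salary_midpoint`, and the handling of a trailing "k" are all assumptions made for illustration, and the pattern expects a short salary snippet rather than the full text of an ad.

```python
import re
from typing import Optional, Tuple

# One amount: optional "$", digits with optional commas/decimals, optional "k".
_AMOUNT = r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(k)?"
# Either a single amount, or two amounts joined by "-" or "to".
_SALARY_RE = re.compile(
    _AMOUNT + r"(?:\s*(?:-|to)\s*" + _AMOUNT + r")?", re.IGNORECASE
)


def _to_dollars(number: str, k_suffix: Optional[str]) -> float:
    """Convert a matched number (and optional "k" suffix) to dollars."""
    value = float(number.replace(",", ""))
    return value * 1000 if k_suffix else value


def parse_salary_range(text: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) in dollars, or None if no amount is found."""
    match = _SALARY_RE.search(text)
    if match is None:
        return None
    low_num, low_k, high_num, high_k = match.groups()
    low = _to_dollars(low_num, low_k)
    if high_num is None:
        return (low, low)  # a single figure is treated as a zero-width range
    high = _to_dollars(high_num, high_k)
    # Assumption: in "100-120k" the "k" is meant to apply to both ends.
    if high_k and not low_k and low < 1000:
        low *= 1000
    return (min(low, high), max(low, high))


def salary_midpoint(text: str) -> Optional[float]:
    """Midpoint (mean) of the quoted range, or None if nothing was parsed."""
    parsed = parse_salary_range(text)
    if parsed is None:
        return None
    low, high = parsed
    return (low + high) / 2
```

With this sketch, `salary_midpoint("$100k-120k")` returns `110000.0`, and a string with no figures returns `None`.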
Contract & Permanent jobs have wildly different salaries
It is not uncommon for short-term contracts to pay more than twice as much as permanent positions, to account for the lack of security and benefits.
Mitigation We are working on mechanisms to internally classify job ads as being either contract or permanent, but this has not yet been released.
For the time being, salary information reflects both permanent and contract positions. Careers that lend themselves to contract work (SharePoint Consultants and Business Intelligence Consultants in particular) will therefore tend to show higher salaries.
Skill extraction
Extracting skills can be challenging, as a technology like "SQL Server" might be referred to as "SQL Server", "MSSQL", "SQL 2000" or any number of misspellings.
Mitigation We maintain long lists of aliases for our skills, as well as technology that detects simple misspellings using the corpus of job ads we have collected. One day we might publish an article on the poor spelling of most recruiters!
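As an illustration of alias handling, the sketch below maps a raw term to a canonical skill name, first by exact alias lookup and then by fuzzy matching. The alias table and the `normalize_skill` function are hypothetical. Note that the fuzzy step uses Python's standard `difflib` as a stand-in, whereas our own misspelling detection is derived from the job ad corpus.

```python
import difflib
from typing import Dict, Optional

# Hypothetical alias table: lowercase alias -> canonical skill name.
SKILL_ALIASES: Dict[str, str] = {
    "sql server": "SQL Server",
    "mssql": "SQL Server",
    "sql 2000": "SQL Server",
}


def normalize_skill(
    term: str,
    aliases: Dict[str, str] = SKILL_ALIASES,
    cutoff: float = 0.85,
) -> Optional[str]:
    """Return the canonical skill for a raw term, or None if unknown."""
    # Lowercase and collapse whitespace so "SQL   Server" matches "sql server".
    key = " ".join(term.lower().split())
    if key in aliases:
        return aliases[key]
    # Fall back to the closest known alias, if it is similar enough.
    # The cutoff is an arbitrary illustrative threshold, not a tuned value.
    close = difflib.get_close_matches(key, list(aliases), n=1, cutoff=cutoff)
    return aliases[close[0]] if close else None
```

Here `normalize_skill("MSSQL")` and `normalize_skill("SQL Sever")` both return `"SQL Server"`, while an unrelated term returns `None`. A high cutoff keeps the fuzzy step conservative, at the cost of missing heavier misspellings.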
Experience extraction
Extracting the experience required for a job, and which skills it applies to, is a challenging task that we do not always get right. There is no standard format that recruiters and HR use for this.
Mitigation We are unable to go into much detail about how this works (as it's a core element of our intellectual property), but will say that it involves a rule-based statistical engine, which we have validated manually. The engine is designed to be cautious (low false positive rate), and won't consider experience greater than 10 years.
The system also attempts to translate subjective levels of experience ("strong", "solid", etc) into numerical amounts expressed in years. This dictionary may be made available at a later date.
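The general idea can be sketched as follows. This is emphatically not our engine: it is a toy rule set, the `extract_years` function is hypothetical, and the year values given for "strong" and "solid" are invented placeholders, since the real dictionary has not been published. Treating "won't consider experience greater than 10 years" as "discard the value" is also an assumption of this sketch.

```python
import re
from typing import Optional

MAX_YEARS = 10  # experience above this is not considered

# Placeholder values only; the real dictionary is not public.
SUBJECTIVE_YEARS = {"strong": 3.0, "solid": 2.0}

# Matches explicit amounts such as "5 years", "5+ years" or "3 yrs".
_YEARS_RE = re.compile(r"(\d+)\s*\+?\s*(?:years?|yrs?)", re.IGNORECASE)


def extract_years(phrase: str) -> Optional[float]:
    """Return the years of experience a phrase asks for, or None."""
    match = _YEARS_RE.search(phrase)
    if match:
        years = float(match.group(1))
        # Assumption: values above the limit are discarded, not capped.
        return years if years <= MAX_YEARS else None
    # No explicit number: fall back to subjective wording.
    for word, years in SUBJECTIVE_YEARS.items():
        if re.search(rf"\b{re.escape(word)}\b", phrase, re.IGNORECASE):
            return years
    return None
```

Returning `None` whenever no rule fires mirrors the cautious design described above: the sketch prefers missing a requirement to inventing one.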
Career type classification
There is no industry-standard career classification, so we have built our own that we can aggregate against. These classifications may miss particular types of careers, and jobs may be misclassified.
Mitigation Our list of careers was built by applying clustering algorithms to TF/IDF metrics for the skills we have extracted. We currently classify roles using a pipeline of TF/IDF metrics and dictionary-based rules (e.g. for an ad to qualify for a career, it must include particular words, as well as a combination of skills).
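A heavily simplified sketch of the classification step might look like this. The `CareerRule` structure, the "at least one required word" gate and the scoring are assumptions made for illustration. Because each ad's skills are treated as a set, term frequency is always 1, so the TF/IDF score reduces to a sum of inverse document frequencies.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class CareerRule:
    """Hypothetical dictionary rule for one career."""
    required_words: Set[str]  # at least one must appear in the ad text
    skills: Set[str]          # skills that characterize the career
    min_skills: int           # how many of those skills the ad must contain


def idf_weights(corpus: List[Set[str]]) -> Dict[str, float]:
    """Inverse document frequency of each skill across a corpus of ads."""
    n_ads = len(corpus)
    doc_freq = Counter(skill for skills in corpus for skill in skills)
    return {skill: math.log(n_ads / count) for skill, count in doc_freq.items()}


def classify(
    ad_text: str,
    ad_skills: Set[str],
    rules: Dict[str, CareerRule],
    idf: Dict[str, float],
) -> Optional[str]:
    """Return the best-scoring career whose rule the ad satisfies, or None."""
    words = set(ad_text.lower().split())  # naive whitespace tokenization
    best_career, best_score = None, float("-inf")
    for career, rule in rules.items():
        if not words & rule.required_words:
            continue  # dictionary gate: none of the required words present
        matched = ad_skills & rule.skills
        if len(matched) < rule.min_skills:
            continue  # not enough of the career's skills
        # Rare skills carry more weight than ubiquitous ones.
        score = sum(idf.get(skill, 0.0) for skill in matched)
        if score > best_score:
            best_career, best_score = career, score
    return best_career
```

An ad that passes no rule is left unclassified rather than forced into the nearest career, which trades coverage for fewer misclassifications.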
Aggregation & Metric calculation
Once extracted, salaries are calculated as the mean of jobs matching the criteria (e.g. career / location).
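As a minimal sketch of this aggregation, the function below groups ads by whichever criteria are requested and averages their salaries. The dictionary layout of an ad and the name `mean_salary_by` are assumptions made for illustration, and ads without a salary are simply skipped.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple


def mean_salary_by(
    ads: Iterable[dict], keys: Sequence[str]
) -> Dict[Tuple, float]:
    """Mean salary per group, where a group is a tuple of the given keys."""
    groups: Dict[Tuple, List[float]] = defaultdict(list)
    for ad in ads:
        if ad.get("salary") is None:
            continue  # ads with no usable salary do not contribute
        groups[tuple(ad[key] for key in keys)].append(ad["salary"])
    return {group: mean(salaries) for group, salaries in groups.items()}
```

Calling it with `keys=("career", "location")` gives one mean per career and location pair, while `keys=("career",)` gives one mean per career.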
Skill salary contribution
This metric seeks to put a dollar value on a particular skill, in order for our users to be able to compare what skills are most valuable.
Calculation We take all the skills associated with a job ad, and assign each skill an equal proportion of the salary. So if a job has 4 skills and pays $100,000, we ascribe $25,000 to each skill. We then calculate the mean of these values for each skill across all the ads that mention it.
Problems Skills which tend to be associated with specialized careers (whose ads therefore don't contain many skills) receive a larger share of each salary and so tend to score higher. However, this appears to be in line with our discussions with recruiters.
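The calculation can be sketched in a few lines. The input layout (a list of salary and skill-list pairs) and the function name `skill_contributions` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple


def skill_contributions(
    ads: Iterable[Tuple[float, Sequence[str]]]
) -> Dict[str, float]:
    """Mean equal-share salary contribution of each skill across all ads."""
    shares: Dict[str, List[float]] = defaultdict(list)
    for salary, skills in ads:
        if not skills:
            continue  # an ad with no extracted skills has nothing to split
        share = salary / len(skills)  # equal proportion per skill
        for skill in skills:
            shares[skill].append(share)
    return {skill: mean(values) for skill, values in shares.items()}
```

For example, given one ad paying $100,000 with skills A, B, C and D, and another paying $60,000 with skills A and B, skill A receives shares of $25,000 and $30,000, for a mean contribution of $27,500.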
Validation metric We test the quality of this metric by comparing the expected salary of a job (by taking the sum of its skill contributions) to its actual salary.
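A sketch of this check is shown below. It takes a mapping from each skill to its mean contribution (as described under Calculation) and compares the summed contributions with each ad's actual salary. Summarizing the differences as a mean absolute error is an assumption of this sketch, since the text above does not name a specific error measure, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, Sequence, Tuple


def expected_salary(
    skills: Sequence[str], contributions: Dict[str, float]
) -> float:
    """Expected salary of a job: the sum of its skills' contributions."""
    # Skills with no known contribution are ignored rather than guessed.
    return sum(contributions[s] for s in skills if s in contributions)


def mean_absolute_error(
    ads: Iterable[Tuple[float, Sequence[str]]],
    contributions: Dict[str, float],
) -> float:
    """Average gap, in dollars, between expected and actual salaries."""
    return mean(
        abs(expected_salary(skills, contributions) - salary)
        for salary, skills in ads
    )
```

A smaller error means the per-skill values reconstruct real salaries more faithfully. Evaluating on ads that were not used to compute the contributions would give a fairer picture.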
Please hit me up on twitter @siganakis and I would be happy to explain / discuss it with you (as long as my boss lets me!).