What Kind of Data Scientist are you?
So what variety of Data Scientist are you?
If you have worked in the data science community, you have probably settled on your own definition of the role and sorted the advanced analytics experts you know into some sort of class of ‘data scientist’.
It may therefore surprise you to know that, according to researchers at Microsoft and UCLA, there are nine distinct types of data scientist.
Microsoft recently commissioned an in-house survey of 793 data scientists, investigating how they spend their time, the tools and methodologies they employ, and the challenges and opportunities they face in the course of their jobs.
UCLA, in conjunction with Microsoft, then took this survey data for further testing and analysis. They ran the survey data through a clustering algorithm and published the results in a paper titled “Data Scientists in Software Teams: State of the Art and Challenges”. The full report can be downloaded here.
The first finding reported by UCLA was that not all people practising data science self-identified as “data scientists”. Around 40% of the advanced analytics professionals surveyed were happy to accept the label of data scientist, while around 24% were more inclined to call themselves software engineers. Interestingly, 15% of respondents were identified at the level of ‘software engineers’ rather than data scientists, while 20% held some alternative title.
In total, 532 respondents could be accurately categorised as ‘data scientists’.
In terms of education, 33% held a bachelor's degree, 41% held a master's degree, 22% had a PhD in Mathematics or Engineering, and 4% had a non-advanced level of academic education.
The average level of post-education experience was 13.6 years, with an average of 10 years of their working lives spent analysing data.
The clustering algorithm reveals patterns in how data science practitioners divide their time between tasks and problems. By analysing the dominant activity of each group, UCLA assigned a name to each of nine categories:
The Data Preparer
This category of data scientist spends on average over 25% of their time querying data, and 20% of their time cleaning, refining and preparing it for full analysis. Analysts who concentrate on data preparation are most likely to use SQL as their tool of choice, and are less inclined to work hands-on with machine learning algorithms to an advanced degree.
The Data Shaper
The ‘data shaper’ shares many of the skills of the Data Preparer, but adds further expertise, including Machine Learning. Data Shapers also make use of advanced tools such as MATLAB and Python. They are more likely to have a PhD, and place less emphasis on working with SQL or structured data.
The Data Analyser
Data scientists who spend more than 50% of their time analysing large data sets belong to this group. Additional traits of Data Analysers include more experience with classical statistics and maths, familiarity with further advanced methods of data manipulation, and a propensity for working with R.
The Platform Builder
Platform Builders typically spend 50% of their time building platforms and instrumenting code for the purpose of gathering data. They are most likely to work with distributed systems such as Hadoop and to have Engineer in their job title, but will not necessarily have a PhD.
This sort of Data Scientist spends the majority of their time working in concert with other business units involved in data management and insight gathering. Their daily work involves line-of-business decision makers and those in product development more than is the case for the group as a whole, and they are less likely to work with SQL or structured data.
This class of analyst usually spends 60% of their time working on insight problems, and 20% of their time gleaning insight from data.
It's often the case that advanced data analysts won't know that they are, in effect, doing ‘data science’. Software engineers and program managers who spend 50% of their time using data science oriented skills and 50% working on something completely separate fall into this category.
The ‘1/5th’ Moonlighter
Engineers and managers who occasionally dip into data science (i.e. work on data science problems 20% of the time) fall into this category.
The Polymath
This variety of analyst is the Swiss army knife of analytics: the type of data scientist who will work all day long on a multitude of very high-level data problems, ranging from constructing platforms to aggregating data for analysis, and who is motivated to act on what the data is telling them in the moment. The Polymath is more likely to hold a PhD in advanced maths or engineering, and to be fully au fait with Python programming. They are also more likely than the other members of the group of nine to have had at least some exposure to a Bayesian flavour of Monte Carlo statistics.
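To make the clustering step more concrete, here is a minimal sketch of how respondents might be grouped by how they split their time, with each group then labelled by its dominant activity. The article does not say which clustering algorithm the researchers used, so plain k-means is an assumption made purely for illustration. The activity names, the six sample respondents and the helper functions (sq_dist, farthest_first, kmeans, dominant_activity) are all hypothetical and are not taken from the survey.

```python
# Illustrative only: k-means is an assumed algorithm and the data below is invented.
# Each tuple is one hypothetical respondent's share of time per activity.
ACTIVITIES = ["querying", "preparing", "analysing", "platform_building"]

RESPONDENTS = [
    (0.50, 0.30, 0.15, 0.05),
    (0.45, 0.35, 0.10, 0.10),
    (0.10, 0.15, 0.70, 0.05),
    (0.05, 0.20, 0.65, 0.10),
    (0.10, 0.10, 0.10, 0.70),
    (0.05, 0.10, 0.20, 0.65),
]


def sq_dist(p, q):
    """Squared Euclidean distance between two time-allocation vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic start: take the first point, then repeatedly add the
    point that is farthest from every centroid chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iterations=10):
    """Plain k-means; returns (centroids, cluster index per point)."""
    centroids = farthest_first(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each respondent to the nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def dominant_activity(centroid):
    """Label a cluster by the activity with the largest average time share."""
    return ACTIVITIES[max(range(len(centroid)), key=lambda j: centroid[j])]


centroids, assignments = kmeans(RESPONDENTS, k=3)
labels = [dominant_activity(centroids[a]) for a in assignments]
for respondent, label in zip(RESPONDENTS, labels):
    print(respondent, "->", label)
```

On this toy data the six respondents fall into three groups dominated by querying, analysing and platform building respectively. A real analysis would have far more respondents and activities, and would need a principled way of choosing the number of clusters.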
What is interesting is that, while we tend to use data scientist as a ‘catch all’ term for advanced analytics professionals, the survey data produced by UCLA/Microsoft very clearly highlights distinct groupings of data scientists, each with its own kinds of work activities.
The biggest challenges reported by data scientists may ring a bell with those who have worked in data science. The challenges were grouped into three main categories: data, analysis, and people.
On the data front, poor data quality was one of the most commonly reported problems. “Some respondents mentioned that there is an expectation that it is a data scientist’s job to correct data quality issues, even though they are the main consumers of data,” the report states.
Data availability, including missing values and the inability to tap legacy systems for data collection, was also cited as a major challenge. Data integration, including the merging of different streams of data into a single data set for analysis, remains a bugaboo for data scientists around the world.
Scale was the biggest problem related to analysis (which is probably why some still refer to it as “big data”). Survey respondents reported that it can sometimes take too long to collect and analyze the data, whether it’s on Hadoop or Cosmos, Microsoft’s version of the big distributed storage and processing framework.
On the personnel side of the data science equation (a factor too often overlooked in many human endeavours), the UCLA researchers identified one major impediment to data science success: communicating the insights that the data science team has discovered. Staying up-to-date on changing tools and technologies is another concern.