The Data Science 'Journeyman'

The profession of data science needs to define itself in order to evolve and survive. Companies of all sizes find it very difficult to hire data scientists, and a great many people are responding to this demand by actively pursuing a career in data science. That said, there is no real consensus about the core skill set. Many companies are hiring software engineers with no real statistical training as data scientists, yet programming and statistical skills are equally important.

Organisations now collect so much data that workers need high-level programming skills just to manipulate it. Statisticians are clearly the right people to analyse the data, but they can't analyse or manipulate it unless they can access it. New, standardised methodologies need to evolve so that workers at all levels can get hands-on with broad varieties of structured and unstructured data.

Working with extremely large datasets is the ‘new norm’, not the exception, and many aspects of statistical work change at scale. As an example, when running a logistic regression, a statistician might produce some diagnostic plots to get a feel for the model's goodness of fit and to identify potential outliers. This approach does not scale when you are running 20,000 logistic regressions on a single data set.
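To make the point concrete, here is a minimal sketch of what replacing plot-by-plot inspection with an automated check might look like. It is an illustration under stated assumptions, not a recommended method: the data are simulated, the function names (`fit_logistic`, `diagnostics`, `triage`, `simulate`) are hypothetical, and McFadden's pseudo-R² plus a count of large Pearson residuals are just one possible numeric stand-in for the diagnostic plots described above. The thresholds are arbitrary placeholders.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, iters=25, ridge=1e-6):
    """Fit y ~ b0 + b1 * x (one predictor) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0                 # gradient of the log-likelihood
        h00 = h01 = h11 = ridge       # information matrix (small ridge for stability)
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        h01 -= ridge                  # the ridge belongs on the diagonal only
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def diagnostics(x, y, b0, b1, resid_cut=3.0):
    """Return (McFadden pseudo-R^2, number of large Pearson residuals)."""
    eps = 1e-12
    ll = 0.0
    n_large = 0
    for xi, yi in zip(x, y):
        p = min(max(sigmoid(b0 + b1 * xi), eps), 1.0 - eps)
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
        if abs((yi - p) / math.sqrt(p * (1.0 - p))) > resid_cut:
            n_large += 1
    n, k = len(y), sum(y)
    if k == 0 or k == n:
        return 0.0, n_large           # degenerate outcome: nothing to explain
    ll_null = k * math.log(k / n) + (n - k) * math.log(1.0 - k / n)
    return 1.0 - ll / ll_null, n_large


def triage(datasets, min_r2=0.05, max_large=5):
    """Fit every model and return only those that need a human look."""
    flagged = []
    for idx, (x, y) in enumerate(datasets):
        b0, b1 = fit_logistic(x, y)
        r2, n_large = diagnostics(x, y, b0, b1)
        if r2 < min_r2 or n_large > max_large:
            flagged.append((idx, r2, n_large))
    return flagged


def simulate(n, slope, rng):
    # Hypothetical data: one standard-normal predictor, binary outcome.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1 if rng.random() < sigmoid(slope * xi) else 0 for xi in x]
    return x, y


rng = random.Random(0)
# Half the simulated models carry real signal, half are pure noise.
datasets = [simulate(200, 2.0 if i % 2 == 0 else 0.0, rng) for i in range(100)]
flagged = triage(datasets)
print(len(flagged), "of", len(datasets), "fits flagged for manual review")
```

The design choice is the important part rather than the particular statistics: every model is reduced to a few numbers, and a person only looks at the fits that fall outside agreed thresholds. In practice a team would more likely use an established statistics library and distribute the fits across machines.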

When working with data at scale, visualisation, hypothesis testing, feature selection and outlier detection all need different approaches. There is a need for automated, large-scale methods to augment the work of statisticians. The problems of working with data at scale are not confined to commerce and industry. Academia faces the same monumental challenge of how to store, process and extract insights from vast data lakes.

In conclusion, people don’t want data, they want ‘answers’. They need access to highly engineered tools and methodologies to extract meaning from that data. Without statistics, machine learning and data mining skills, we have nothing.

Steve Thomas