Data Science Workflow

In this four-part series on how to learn data science, I have so far outlined the dimensions of data science and provided resources for learning it. As I mentioned before, though, the best way to learn data science is to work through a whole problem from beginning to end.

Because data science problems often get messy, I’ve outlined the workflow I typically follow when working on one.

Data Science Workflow.jpg

Although the process looks very linear, in reality it can often feel more like a game of Sorry!, where you may have to take a few steps back or start over now and again. I often find myself jumping back and forth among defining the problem, surveying the data available, asking questions, and forming my hypothesis for analysis. Once I feel fairly confident about my problem statement and hypothesis, I can go through an iterative cycle of exploring the data, cleaning or reformatting it, and then moving on to the analysis or model building, depending on the type of problem I am trying to solve.

Data Science Workflow Chaos.jpg
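
To make the explore, clean, and analyze cycle a little more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the tiny dataset, the field names, and the helper functions `explore`, `clean`, and `analyze` are assumptions, not part of any particular project or library. On a real problem, each step would be far messier and you would likely loop through it several times.

```python
from statistics import mean

# Hypothetical raw records; the cities and readings are invented for this example.
raw_rows = [
    {"city": "Austin", "temp_f": "91"},
    {"city": "austin ", "temp_f": "95"},   # inconsistent spelling and spacing
    {"city": "Boston", "temp_f": ""},      # missing reading
    {"city": "Boston", "temp_f": "72"},
]

def explore(rows):
    """Summarize obvious data problems before doing any analysis."""
    missing = sum(1 for r in rows if not r["temp_f"].strip())
    spellings = sorted({r["city"] for r in rows})
    return {"rows": len(rows), "missing_temp": missing, "city_spellings": spellings}

def clean(rows):
    """Normalize city names and drop rows that have no temperature reading."""
    cleaned = []
    for r in rows:
        temp = r["temp_f"].strip()
        if not temp:
            continue  # skip rows with a missing reading
        cleaned.append({"city": r["city"].strip().title(), "temp_f": float(temp)})
    return cleaned

def analyze(rows):
    """Compute the average temperature per city."""
    by_city = {}
    for r in rows:
        by_city.setdefault(r["city"], []).append(r["temp_f"])
    return {city: mean(temps) for city, temps in by_city.items()}

report = explore(raw_rows)   # what is wrong with the data?
tidy = clean(raw_rows)       # fix what the exploration surfaced
result = analyze(tidy)       # only then look for insight

print(report)
print(result)
```

The point of the sketch is the order of the steps, not the specific checks: what you find while exploring decides what you clean, and what you learn in the analysis often sends you back to explore again.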

With all this iteration, it’s helpful for me to chunk the process into two main parts: first the problem definition, and second the analysis, highlighted in the boxes below. I break the process down this way because no two data science projects are the same, so at the end of the day, if I’m defining a good problem and doing analysis to gain insight from the data, then I’m doing my job.

Data Science Workflow with boxes.jpg

If you are trying to learn data science and would like to work on a real problem, my recommendation is to use this workflow to guide you through your first project. Start by finding an interesting problem you would like to solve, survey the data you have available to solve it, and see what you come up with!

Now that you have a workflow for working through a problem, in the next post I will share some of my best practices for setting goals and tracking your progress on your learning journey.