One of the first things any good data scientist does before building a predictive model is explore the data. It is tempting to rush through this exploration stage to get to a working product faster. However, a little more time spent here can save hours of work, produce better models, and make troubleshooting much faster. To get you started, I’ve outlined a few principles for data exploration.
Be Genuinely Curious:
We all know curiosity is important, especially in data science. The next time you get a fresh dataset, take a few moments to think of some questions you have about the data. Thinking about what arouses your curiosity familiarizes you with the problem and the information you are working with.
Don’t Assume Anything:
Rather than assuming the range for variable X is between 5 and 20, verify it, and approach each variable with a naïve attitude. Double-checking your assumptions helps ensure there are no errors in the data and also helps you get familiar with each individual variable.
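As a minimal sketch of what verifying a range can look like, the snippet below compares an assumed range against what is actually in the data. The function name check_range and the sample values for X are hypothetical, made up for illustration, and plain Python is used so it runs without extra libraries.

```python
def check_range(values, low, high):
    """Compare the assumed range [low, high] with what the data contains."""
    # Treat None as a missing observation and set it aside
    present = [v for v in values if v is not None]
    # Collect every value that violates the assumed range
    out_of_range = [v for v in present if not (low <= v <= high)]
    return {
        "min": min(present),
        "max": max(present),
        "missing": len(values) - len(present),
        "out_of_range": out_of_range,
    }


# Hypothetical observations for variable X, assumed to lie between 5 and 20
x = [7, 12, 19, None, 5, 42, 11]
report = check_range(x, low=5, high=20)
print(report)
```

Here the report surfaces both a missing observation and a value well outside the assumed range, exactly the kind of surprise that is cheaper to find now than after a model has been trained.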
Whether you are trying to get to know a person you just met or your data, questions are a great way to get started. Rarely does someone we meet for the first time tell us everything about themselves, so to get to know them we ask questions. The same principle applies to data and works hand in hand with our curiosity.
Once you start asking questions and find something that interests or surprises you, you can begin to dive deeper into understanding that area. You can build this mastery of the data by drawing graphs, running correlations, and asking more questions!
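To make "running correlations" concrete, here is a small sketch that computes a Pearson correlation coefficient by hand. The helper pearson and the two sample variables (hours and score) are invented for illustration; in practice a statistics library would normally do this for you.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: hours studied and the resulting test score
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson(hours, score)
print(f"correlation between hours and score: {r:.3f}")
```

A value close to 1 or -1 is a prompt for more questions, not an answer in itself: it says the two variables move together, not why.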
For all the areas mentioned above, visualization will be your best friend in aiding your curiosity, checking assumptions, asking questions and diving deeper into your discovery. To get started exploring data through visualization, use the statistics and graphs that suit the type of variable you are looking at.
For continuous data, histograms, summary statistics and box plots work well, and the following areas should be explored and visualized:
· Standard deviation
· Minimum and maximum values
· Missing observations
· Shape of the distribution
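The checklist above for a continuous variable can be sketched in a few lines. This is only an illustration under stated assumptions: the function summarize_continuous, the choice of five equal-width bins, and the ages sample are all hypothetical, and the text histogram stands in for a real plot.

```python
import statistics
from collections import Counter


def summarize_continuous(values, bins=5):
    """Standard deviation, min/max, missing count and a coarse histogram."""
    # Treat None as a missing observation
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    # Equal-width bins; fall back to 1 when every value is identical
    width = (hi - lo) / bins or 1
    # The maximum value would land in bin number `bins`, so clamp it into the last bin
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in present)
    return {
        "std_dev": statistics.stdev(present),  # sample standard deviation
        "min": lo,
        "max": hi,
        "missing": len(values) - len(present),
        "bin_width": width,
        "histogram": [counts.get(i, 0) for i in range(bins)],
    }


# Hypothetical continuous variable with two missing observations
ages = [23, 25, 31, 35, 35, 38, 41, None, 47, 52, 63, None]
summary = summarize_continuous(ages)
print(summary)

# A rough text histogram gives a first look at the shape of the distribution
for i, count in enumerate(summary["histogram"]):
    start = summary["min"] + i * summary["bin_width"]
    print(f"{start:6.1f} | {'#' * count}")
```

Even a crude picture like this shows where values cluster and whether there is a long tail, which is usually enough to decide what to plot properly next.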
For nominal and ordinal data, bar charts and frequency distributions work well, and the following areas should be examined:
· Number of categories
· Missing observations
· Number of observations in each category
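The same idea applies to a categorical variable. The sketch below is a hypothetical example (the function summarize_categorical and the colors sample are made up) that builds a frequency distribution covering the three items listed above.

```python
from collections import Counter


def summarize_categorical(values):
    """Number of categories, missing count and observations per category."""
    # Treat None as a missing observation
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "n_categories": len(counts),
        "missing": len(values) - len(present),
        # most_common() orders categories from most to least frequent
        "counts": dict(counts.most_common()),
    }


# Hypothetical nominal variable with one missing observation
colors = ["red", "blue", "red", None, "green", "blue", "red"]
freq = summarize_categorical(colors)
print(freq)

# A text bar chart of the frequency distribution
for category, count in freq["counts"].items():
    print(f"{category:>6} | {'#' * count}")
```

Categories with very few observations, or unexpected spellings of the same category, tend to stand out immediately in a table like this.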
Hopefully this provides you with some inspiration to take the time to explore your data and offers you some ideas to get started.