Make Your Data Prep Life Easier
BY KAT MORGAN – DATA SCIENTIST
I recently finished a project for a client building a direct mail predictive model. Thinking back on all the steps I took to get to the end result, data preparation was THE most important. I would like to say everything worked right the first time and I didn’t have to go back to the modeling “drawing board” more than once. But then, what would I have to write about? Here are some tips, along with my lessons learned, to help make the data prep process run a little more smoothly.
TIP 1 – DEFINE THE BUSINESS PROBLEM FIRST
Once I tried making a chicken and rice casserole without a recipe. It was one of those clean-out-the-freezer/pantry kind of nights and it seemed easy enough to cook. It was a disaster. The chicken was dry and the rice was not cooked all the way through. My dog wouldn’t even eat it. Building a model without defining the problem first is much the same: you wander around hoping to stumble onto something really interesting. You hope that all the inputs you throw into your model algorithm will produce an awesome, or at the very least edible, result. But if you don’t know what question you are trying to answer, how will you know you’re collecting the right data for your model?
When defining the business problem, don’t settle for a one-liner like: “Last year’s holiday mailer didn’t work and this year needs to be more successful.” Have a discussion with your client to understand what they mean. “How do you define success? Is it response rate, average order value, increase in takeout orders…” The better you understand the business problem, the more helpful your model is likely to turn out.
TIP 2 – PICK THE BEST DATA
“No model is a miracle worker, it can’t take terrible data and magically know how to use that data.” (John Foreman, Data Smart: Using Data Science to Transform Information into Insight) If you haven’t read Data Smart by John Foreman, I recommend you take it for a spin. It’s a wonderful resource for exploring common data science concepts. In his book, Foreman suggests you get “more bang for your buck” when you spend your time selecting good data. Schedule some time with the database gurus in your organization. Oftentimes they have great suggestions on better tables to pull data from and will share underlying data assumptions (like the fact that a table named emailable_customers_only actually contains non-emailable customers. Who knew?).
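A quick check can surface surprises like that one before they reach the model. Here is a minimal sketch in Python, assuming you have pulled a few rows from a table such as emailable_customers_only. The column names (customer_id, email, email_opt_in) and the sample rows are hypothetical, made up purely for illustration, and your own definition of “emailable” may well differ.

```python
# Hypothetical rows pulled from a table like emailable_customers_only.
# The column names and values are assumptions for illustration only.
rows = [
    {"customer_id": 1, "email": "a@example.com", "email_opt_in": True},
    {"customer_id": 2, "email": "", "email_opt_in": True},
    {"customer_id": 3, "email": "c@example.com", "email_opt_in": False},
]


def find_non_emailable(rows):
    """Return rows that break the assumption implied by the table name.

    A row counts as non-emailable here if it has no email address
    or the customer has not opted in.
    """
    return [r for r in rows if not r["email"] or not r["email_opt_in"]]


violations = find_non_emailable(rows)
print(f"{len(violations)} of {len(rows)} rows are not actually emailable")
```

If the count is anything other than zero, that is a good question to bring to your database gurus before you build on the table.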
TIP 3 – VALIDATE YOUR CODE EARLY AND OFTEN
It’s a horrible feeling to finish a project and have someone find an error in your code that makes you start all over again. I know. I’ve been there and it’s not fun. Errors happen. They’re a part of any data analysis process and many times they’re a chance to learn. Better to catch them early on, and preferably BEFORE you’ve shared final results with your client. Have someone who’s not on your project jump in and take a look at your code. Oftentimes, they are able to spot something you’ve overlooked.
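Alongside a second pair of eyes, a few automated sanity checks that run every time you rebuild the dataset can catch problems early. The sketch below is one possible shape for such checks, not a complete validation suite. The field names customer_id and responded are hypothetical, and the rule that the target must be 0 or 1 is an assumption about a response model.

```python
from collections import Counter


def validate_model_dataset(rows, key="customer_id", target="responded"):
    """Return a list of human-readable problems found in the dataset.

    `key` and `target` are hypothetical field names; rename them to
    match your own data.
    """
    if not rows:
        return ["dataset is empty"]

    problems = []

    # Each customer should appear only once in a modeling dataset.
    key_counts = Counter(r[key] for r in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate {key} values: {duplicates}")

    # The target should never be missing.
    missing = [r[key] for r in rows if r.get(target) is None]
    if missing:
        problems.append(f"missing {target} for {key}: {missing}")

    # Assumed rule: the target is a 0/1 response flag.
    unexpected = [r[key] for r in rows if r.get(target) not in (0, 1, None)]
    if unexpected:
        problems.append(f"unexpected {target} for {key}: {unexpected}")

    return problems


# Made-up sample data with one of each problem.
sample = [
    {"customer_id": 1, "responded": 1},
    {"customer_id": 2, "responded": None},
    {"customer_id": 2, "responded": 0},
    {"customer_id": 3, "responded": 5},
]
for problem in validate_model_dataset(sample):
    print(problem)
```

An empty list means the checks passed. Returning every problem at once, rather than stopping at the first one, makes it easier to see how much cleanup is left.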
TIP 4 – SHARE WITH YOUR CLIENT EARLY
Don’t wait until your model is already complete to present it to your client. “Here’s this shiny new tool I built for you. Trust me, it’s awesome!” Sharing your data build methodology helps ensure you understand the business problem and that you have the data needed. As a bonus, when people are involved in the process, it helps them feel ownership. They may be more likely to use what you’ve built.
TIP 5 – SCHEDULE PLENTY OF TIME FOR DATA PREP
Three out of five data scientists surveyed said the majority of their time (60%) is spent cleaning and organizing data. Dealing with messy data is one of the most time-consuming parts of any project. Data prep seems like it should be straightforward (you just click a few buttons, right?), but there’s always a horror show lurking in one of the database tables just waiting to suck up more of your time. Don’t believe me? Take a look at this graph from the CrowdFlower 2016 Data Science Report (http://visit.crowdflower.com/data-science-report).
Do yourself a favor and don’t underestimate how important it is to get your data right. When I’ve had issues building a model, it’s typically been a data problem requiring me to go back and build the dataset again. Ain’t nobody got time for that.