Reproducible Data Science
BY ROB CAUSEY – DATA SCIENTIST
What is “reproducible” data science and why is it important?
These are good questions. “Reproducibility” means structuring an analysis as a clear sequence of steps that any user can rerun to regenerate the same results. By providing the procedural context to an analysis, reproducibility enables better communication both within the technical team that is building the analysis and with the business users who are reviewing it. A reproducible workflow makes it easier to validate an analysis, update the data that underlies the work, and bring others up to speed. As a result, data science projects often have greater success when reproducible methods are used.
A perfect example of the benefits of reproducibility lies within music. Long after music was invented, musicians could still not reproduce each other’s sound with the original player’s precision. Although music has been around for at least 42,000 years, it was not until roughly 1,000 years ago that the musical score existed in a form similar to how it is understood today (Paterson, Jim. “A Short History of Musical Notation.” Music Files Ltd, 9 June 2015, http://www.mfiles.co.uk/music-notation-history.htm). The musical score is a reproducible sequence of steps, which makes a symphony interpretable to an orchestra to create the same sound that was originally intended. Lacking the score to map out this process in a reproducible manner, we would not be able to listen to the sounds of long-dead classical composers today.
We have an example of good reproducibility in music, and we now have the opportunity to implement it within our data science practices. In music, a procedural context (the score) arranges notes (data) into a symphony (the results of an analysis) and carries that symphony between the different orchestras (data scientists and business users) who play it (interpret and use the results). Rather than developing musical notation, we create an analytic pipeline through which we transform raw data into processed data, perform analysis, and then generate meaningful insights. If the process is reproducible, the reader of the final report should be able to trace its results back to the data they were built upon.
Here’s a typical pipeline, estimated at roughly 30 hours to complete (a minimal code sketch follows the list):
- Communicate with client to understand business needs
- Design the analysis process so that its steps align with the rough time estimates for project completion
- Define necessary assumptions about the data and the process
- Load a sample set of web data for analysis
- Verify and validate the dataset and correct invalid formatting
- Join the data with other tables to pull in needed information
- Create a series of computer programs that analyze the data
- Build some tables and graphs based on interesting findings in the data
- Pull all of this together into two separate reports – one for business leads and one for the technical team
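To make this concrete, here is a minimal sketch of what such a pipeline might look like in Python. The file paths, column names, and helper functions are hypothetical placeholders, not part of any specific project; the point is that each step is an explicit, rerunnable function.

```python
import pandas as pd

def load_data(path):
    """Load the raw web data sample from a CSV file."""
    return pd.read_csv(path)

def validate(df):
    """Correct invalid formatting and drop rows that fail basic checks."""
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    return df.dropna(subset=["date", "user_id"])

def enrich(df, lookup_path):
    """Join the web data with another table to pull in needed information."""
    lookup = pd.read_csv(lookup_path)
    return df.merge(lookup, on="user_id", how="left")

def analyze(df):
    """Aggregate page views per region -- a stand-in for the real analysis."""
    return df.groupby("region")["page_views"].sum().reset_index()

def report(results, out_path):
    """Write the findings to a file that feeds the business and technical reports."""
    results.to_csv(out_path, index=False)

if __name__ == "__main__":
    raw = load_data("data/web_sample.csv")       # hypothetical input path
    clean = validate(raw)
    joined = enrich(clean, "data/user_regions.csv")
    results = analyze(joined)
    report(results, "output/findings.csv")
```

Because every step is named and ordered in one script, a colleague can reproduce the findings by running the file, with no folklore about which spreadsheet was filtered by hand.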
Now imagine one of the following things occurs:
- Change in web data provider
- Shift in business logic
- Need for validation of results
- Need for updating of results based on data refresh
If we lack reproducibility in our work, any one of these natural changes could require practically starting over. Recall that this study was estimated at 30 hours of work. Do you feel like wasting 30 hours every time your data changes? With a reproducible structure, the shift between environments is minimal, so later iterations and catching others up to speed become nearly seamless.
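If the sketch above lived in a module (say, pipeline.py, a hypothetical name), absorbing any of these changes reduces to rerunning one parameterized entry point. This is a sketch under that assumption, not a prescribed layout:

```python
import argparse

# Reuses the hypothetical pipeline functions sketched earlier; only the
# inputs change when a provider swap or data refresh arrives.
from pipeline import load_data, validate, enrich, analyze, report

parser = argparse.ArgumentParser(description="Rerun the analysis end to end")
parser.add_argument("--raw", default="data/web_sample.csv")
parser.add_argument("--lookup", default="data/user_regions.csv")
parser.add_argument("--out", default="output/findings.csv")
args = parser.parse_args()

# A new provider or refreshed extract is just a different --raw argument;
# the validate, enrich, analyze, and report steps stay identical.
report(analyze(enrich(validate(load_data(args.raw)), args.lookup)), args.out)
```

A data refresh then costs one command, e.g. python rerun.py --raw data/new_provider_sample.csv, rather than 30 hours of rework.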
Reproducibility brings about higher-quality products that are built with enhanced validation, greater collaboration, faster updates, and easier expansion on prior work. For a little extra time up front, you will save days in the long run – on each project. The benefits are felt not only by the analyst, but also by their teams and clients. Reproducibility spawns innovation and evolution for everyone involved.