The work of applied statistics and data science cannot begin until the difficult work of data cleaning has finished.
About this Episode: In the corporate world, data doesn't arrive on your desk ready to "slice and dice." It arrives trapped in overdecorated spreadsheets, buried in embedded calculations, and riddled with human error.
In this episode of The Diary of a Mad Data Scientist, author and retired lawyer Richard Careaga joins host Anne Turner to discuss the "least glamorous" but most critical role in the field: The Data Janitor.
Using a real-world case study from "MegaCo"—a multinational facing a 7-billion-euro spend visibility problem—Richard illustrates how algorithmic "munging" and "scrubbing" can turn a mountain of digital junk into a high-level tool for cost-targeting and negotiation.
In this video, you’ll learn:
- Why spreadsheets are the "razzle-dazzle" traps of the finance world.
- The 8 types of human error that compromise your datasets.
- How to use Pareto plots to find the 20% of suppliers that account for 80% of your spend.
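As a rough illustration of the Pareto idea in the last bullet, here is a minimal sketch of the table that sits behind a Pareto plot: spend is totaled per supplier, sorted from largest to smallest, and accumulated until a chosen share of total spend is reached. The supplier names, amounts, and function names are hypothetical and are not taken from the MegaCo case study. Python is used purely for illustration; the episode's author works in Julia. The 80/20 split is a rule of thumb, so real data may concentrate more or less than that.

```python
from collections import defaultdict


def pareto_table(spend_records):
    """Total spend per supplier, sorted descending, with cumulative share.

    spend_records: iterable of (supplier, amount) pairs (hypothetical layout).
    Returns a list of (supplier, total_spend, cumulative_share) tuples.
    """
    totals = defaultdict(float)
    for supplier, amount in spend_records:
        totals[supplier] += amount

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    rows = []
    running = 0.0
    # Largest suppliers first, as on the x-axis of a Pareto plot.
    for supplier, amount in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        running += amount
        rows.append((supplier, amount, running / grand_total))
    return rows


def suppliers_covering(rows, threshold=0.80):
    """Smallest leading group of suppliers whose cumulative share reaches threshold."""
    selected = []
    for supplier, _amount, cumulative_share in rows:
        selected.append(supplier)
        if cumulative_share >= threshold:
            break
    return selected


# Made-up invoice lines; supplier "A" appears twice to show the aggregation step.
records = [("A", 500), ("B", 300), ("C", 100), ("A", 50), ("D", 30), ("E", 20)]

rows = pareto_table(records)
for supplier, amount, cumulative_share in rows:
    print(f"{supplier}: {amount:.0f} ({cumulative_share:.0%} cumulative)")

print("Suppliers covering 80% of spend:", suppliers_covering(rows))
```

Plotting the sorted totals as bars with the cumulative share as a line on a second axis gives the usual Pareto chart. The table alone is often enough to decide which suppliers to target first.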
About the Author: Richard Careaga is a data scientist, retired lawyer, and author of Julia Mapping: A Practical Guide. He specializes in finding the signal in the noise using the Julia programming language and a healthy dose of skepticism toward "pretty" charts.