Without data janitors, algorithms are useless

The work of applied statistics and data science cannot begin until the difficult work of data cleaning has finished.

It may not be exciting, but it's what makes exciting possible.

About this Episode: In the corporate world, data doesn't arrive on your desk ready to "slice and dice." It arrives trapped in overdecorated spreadsheets, buried in embedded calculations, and riddled with human error.

In this episode of The Diary of a Mad Data Scientist, author and retired lawyer Richard Careaga joins host Anne Turner to discuss the "least glamorous" but most critical role in the field: The Data Janitor.

Using a real-world case study from "MegaCo," a multinational facing a 7-billion-euro spend visibility problem, Richard illustrates how algorithmic "munging" and "scrubbing" can turn a mountain of digital junk into a high-level tool for cost-targeting and negotiation.

In this video, you’ll learn:

  • Why spreadsheets are the "razzle-dazzle" traps of the finance world.
  • The 8 types of human error that compromise your datasets.
  • How to use Pareto plots to find the 20% of suppliers that account for 80% of your spend.
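
For readers who want to see the arithmetic behind a Pareto plot, here is a minimal sketch. It is written in Python purely for illustration (the episode's own work is in Julia), and the supplier names, spend figures, and the helper names `pareto_table` and `top_suppliers` are all hypothetical, not taken from the MegaCo case. The idea is simply to rank suppliers by spend, accumulate their share of the total, and stop at the chosen threshold.

```python
from itertools import accumulate


def pareto_table(spend_by_supplier):
    """Return (supplier, spend, cumulative_share) rows, largest spend first."""
    ranked = sorted(spend_by_supplier.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(spend for _, spend in ranked)
    running = accumulate(spend for _, spend in ranked)
    return [(name, spend, cum / total) for (name, spend), cum in zip(ranked, running)]


def top_suppliers(spend_by_supplier, threshold=0.80):
    """Smallest set of top-ranked suppliers whose cumulative share reaches the threshold."""
    selected = []
    for name, _, cum_share in pareto_table(spend_by_supplier):
        selected.append(name)
        if cum_share >= threshold:
            break
    return selected


# Hypothetical spend in millions of euros, invented for illustration only.
spend = {
    "Acme": 500,
    "Borealis": 320,
    "Cobalt": 80,
    "Dune": 50,
    "Ember": 30,
    "Flint": 20,
}

for name, amount, share in pareto_table(spend):
    print(f"{name:<10} {amount:>5} {share:6.1%}")

print(top_suppliers(spend))
```

The cumulative-share column is what a Pareto plot draws as its rising line over the ranked bars. Real spend data rarely splits at exactly 80/20, so the threshold is left as a parameter.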


About the Author: Richard Careaga is a data scientist, retired lawyer, and author of Julia Mapping: A Practical Guide. He specializes in finding the signal in the noise using the Julia programming language and a healthy dose of skepticism toward "pretty" charts.





Latest

Do not go softly into that good R²

While ordinary least squares regression can be an unreasonably effective tool, it can pay dividends to dig deeper. The original chart shows a simple linear fit between AI subscription rates and private payroll changes, yielding a modest R² of 0.2111. However, running the diagnostic plots revealed a couple of

Most bad maps are the same bad map

Wow, would you look at that. All that economic output from just these five countries that don't even occupy the most area! This is a classic. Economic activity, like deaths and a host of other social and political aspects, is roughly proportional to population. Showing the raw numbers


Excel to Julia: The Rosetta Stone

For when your spreadsheet starts to crawl, but you still need to get the job done. 1. The Basics: Data as a Thing In Excel, the data and the logic live in the same cell. In Julia, we keep them separate for speed and sanity. * Workbook/Sheet ≅ DataFrame * Column ≅:Symbol


Using the forum at Julialang.org

The Julia language organization maintains an open, free forum where you can post questions to the helpful users under the New to Julia category. Especially if your question involves advanced scientific capabilities of the language, such as tensors, it will be invaluable. On the other hand … If you are an
