While ordinary least squares regression can be an unreasonably effective tool, it can pay dividends to dig deeper. The original chart shows a simple linear fit between AI subscription rates and private payroll changes, yielding a modest R² of 0.2111. However, running the diagnostic plots revealed a couple of hidden issues.
```julia
using CairoMakie
using DataFrames
using GLM
using OLSPlots
using StatsModels

# AI subscription rate (x) vs. private payroll change (y)
x = [6.0, 7.5, 9.5, 11.0, 11.5, 12.0, 12.5, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.5, 17.5, 18.5, 19.0, 19.5, 20.5, 21.0, 22.5, 23.0, 24.0, 26.5, 27.5, 34.0, 40.5, 42.0, 42.0, 45.0, 45.5, 46.5, 47.0, 48.0, 48.5]
y = [0.24, 0.18, 0.02, 0.13, 0.09, 0.15, 0.11, 0.06, 0.08, 0.09, 0.07, 0.10, 0.08, 0.05, 0.06, 0.11, 0.04, 0.03, -0.01, -0.02, 0.11, -0.01, 0.08, 0.16, -0.06, 0.03, 0.05, 0.01, -0.04, 0.07, 0.01, 0.05, -0.02, 0.05, 0.13, 0.05]

# Simple linear fit
model = GLM.lm(@formula(y ~ x), DataFrame(x=x, y=y))
r2(model)                # ≈ 0.2111
diagnostic_plots(model)
```

First, a single outlier (a firm/sector with very high AI adoption and unusually high payroll changes) was acting as a drag on the model's strength. Second, even after removing it, the "Residuals vs Fitted" plot showed a distinct, sharp U-shaped pattern. This meant the straight-line model was systematically missing the curve of the data—under-predicting at the extremes and over-predicting in the middle.
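The outlier can be flagged numerically rather than by eye. Here is a minimal sketch using `cooksdistance`, which GLM.jl implements for linear models, on the same 36-point data; the names `d` and `model` are my own:

```julia
using DataFrames, GLM, StatsModels

x = [6.0, 7.5, 9.5, 11.0, 11.5, 12.0, 12.5, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.5, 17.5, 18.5, 19.0, 19.5, 20.5, 21.0, 22.5, 23.0, 24.0, 26.5, 27.5, 34.0, 40.5, 42.0, 42.0, 45.0, 45.5, 46.5, 47.0, 48.0, 48.5]
y = [0.24, 0.18, 0.02, 0.13, 0.09, 0.15, 0.11, 0.06, 0.08, 0.09, 0.07, 0.10, 0.08, 0.05, 0.06, 0.11, 0.04, 0.03, -0.01, -0.02, 0.11, -0.01, 0.08, 0.16, -0.06, 0.03, 0.05, 0.01, -0.04, 0.07, 0.01, 0.05, -0.02, 0.05, 0.13, 0.05]

model = lm(@formula(y ~ x), DataFrame(x=x, y=y))

# Cook's distance measures each point's influence on the fitted coefficients;
# unusually large values are candidate outliers worth inspecting
d = cooksdistance(model)
argmax(d)  # index of the most influential observation
```

Points with a Cook's distance far above the rest are worth re-fitting without, which is what the next step does.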
To address this non-linearity, I first tried a log-transformation on the x-axis. While this successfully bumped the R² to 0.3785, the stubborn U-shape in the residuals persisted. The real breakthrough came from fitting a quadratic polynomial model (adding an x² term) without the outlier.
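Before moving on to the quadratic fit, the log-transform attempt can be sketched as follows. This is an assumption-laden sketch: it uses the natural log and the full 36-point data (the text doesn't say whether the outlier was still included at this stage), and `lx` and `logmodel` are names I've introduced:

```julia
using DataFrames, GLM, StatsModels

x = [6.0, 7.5, 9.5, 11.0, 11.5, 12.0, 12.5, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.5, 17.5, 18.5, 19.0, 19.5, 20.5, 21.0, 22.5, 23.0, 24.0, 26.5, 27.5, 34.0, 40.5, 42.0, 42.0, 45.0, 45.5, 46.5, 47.0, 48.0, 48.5]
y = [0.24, 0.18, 0.02, 0.13, 0.09, 0.15, 0.11, 0.06, 0.08, 0.09, 0.07, 0.10, 0.08, 0.05, 0.06, 0.11, 0.04, 0.03, -0.01, -0.02, 0.11, -0.01, 0.08, 0.16, -0.06, 0.03, 0.05, 0.01, -0.04, 0.07, 0.01, 0.05, -0.02, 0.05, 0.13, 0.05]

# Regress y on log(x) instead of x (assumed natural log)
df = DataFrame(lx = log.(x), y = y)
logmodel = lm(@formula(y ~ lx), df)
r2(logmodel)
```

A log transform can straighten out a monotone curve, but it cannot produce the hump a U-shaped residual pattern implies, which is why the quadratic term below does better.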
```julia
# Quadratic model, without point 35 (outlier: 48.0, 0.13)
x = [6.0, 7.5, 9.5, 11.0, 11.5, 12.0, 12.5, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.5, 17.5, 18.5, 19.0, 19.5, 20.5, 21.0, 22.5, 23.0, 24.0, 26.5, 27.5, 34.0, 40.5, 42.0, 42.0, 45.0, 45.5, 46.5, 47.0, 48.5]
y = [0.24, 0.18, 0.02, 0.13, 0.09, 0.15, 0.11, 0.06, 0.08, 0.09, 0.07, 0.10, 0.08, 0.05, 0.06, 0.11, 0.04, 0.03, -0.01, -0.02, 0.11, -0.01, 0.08, 0.16, -0.06, 0.03, 0.05, 0.01, -0.04, 0.07, 0.01, 0.05, -0.02, 0.05, 0.05]
df = DataFrame(x=x, x2=x.^2, y=y)
model = GLM.lm(@formula(y ~ x + x2), df)
r2(model)                # ≈ 0.4306
diagnostic_plots(model)
```

This allowed the regression line to bend, capturing the actual "diminishing returns" dynamic between AI subscriptions and job market impacts. The results were striking:
- The problematic U-shaped residual pattern flattened out.
- Both the linear and quadratic terms proved highly statistically significant (p = 0.0009 and p = 0.0048, respectively).
- The predictive power of the model more than doubled, from R² = 0.2111 in the original chart to R² = 0.4306!
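The per-term p-values come from the model's coefficient table. A minimal sketch, repeating the quadratic fit above and inspecting it with GLM.jl's `coeftable`:

```julia
using DataFrames, GLM, StatsModels

x = [6.0, 7.5, 9.5, 11.0, 11.5, 12.0, 12.5, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.5, 17.5, 18.5, 19.0, 19.5, 20.5, 21.0, 22.5, 23.0, 24.0, 26.5, 27.5, 34.0, 40.5, 42.0, 42.0, 45.0, 45.5, 46.5, 47.0, 48.5]
y = [0.24, 0.18, 0.02, 0.13, 0.09, 0.15, 0.11, 0.06, 0.08, 0.09, 0.07, 0.10, 0.08, 0.05, 0.06, 0.11, 0.04, 0.03, -0.01, -0.02, 0.11, -0.01, 0.08, 0.16, -0.06, 0.03, 0.05, 0.01, -0.04, 0.07, 0.01, 0.05, -0.02, 0.05, 0.05]

df = DataFrame(x=x, x2=x.^2, y=y)
model = lm(@formula(y ~ x + x2), df)

# One row per term (intercept, x, x2): estimate, std. error, t-statistic, p-value
ct = coeftable(model)
```

Printing `ct` at the REPL shows the estimates and p-values for the intercept, the linear term, and the quadratic term side by side.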
The takeaway: A straight line is a great starting point, but always check your residual plots! The real story is often hiding in the curves.