What you'll learn
- How to establish baseline distributions for an ML system's inputs and predictions.
- How to create a reference window and use it for drift detection.
- Which tests and distances to use (KS, PSI), and why.
1. Baseline first: define normal
Monitoring and observability start with a clear definition of "normal." In production ML, that means:
- a reference window of data (e.g., last 14 days before launch)
- feature profiles (summary stats + histograms)
- an initial prediction score profile (if available)
This is the known state we compare later windows against. Monitoring tells us that something changed; observability helps us ask why.
1.1 Example schema (rides table)
We'll use a simple ride-sharing schema throughout the guide:
| column | type | notes |
|---|---|---|
| ride_id | string | unique id |
| timestamp | datetime | event time (UTC) |
| pickup_zone | string | city grid cell id |
| dropoff_zone | string | city grid cell id |
| trip_distance_km | float | continuous |
| surge_multiplier | float | continuous (>=1) |
| fare_amount | float | continuous |
| driver_eta_min | float | model output (optional in Ch1) |
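To make "feature profiles" concrete, here's a minimal sketch of how a reference profile could be computed for a continuous column. `profile_feature` is a hypothetical helper (not from any library), and the data below is synthetic:

```python
import numpy as np
import pandas as pd

def profile_feature(series: pd.Series, bins: int = 20) -> dict:
    """Summary stats plus frozen histogram bin edges for one continuous feature."""
    # Freezing the bin edges now lets later windows be binned identically,
    # which is exactly what two-sample comparisons and PSI need.
    counts, edges = np.histogram(series.dropna(), bins=bins)
    return {
        "mean": float(series.mean()),
        "std": float(series.std()),
        "p50": float(series.quantile(0.50)),
        "p95": float(series.quantile(0.95)),
        "bin_edges": edges,    # length bins + 1
        "bin_counts": counts,  # length bins
    }

# Illustrative reference window (synthetic trip distances)
rng = np.random.default_rng(0)
ref = pd.DataFrame({"trip_distance_km": rng.normal(6.5, 2.0, 1000)})
profile = {col: profile_feature(ref[col]) for col in ["trip_distance_km"]}
```

In a real pipeline you'd profile every monitored column over the 14-day reference window and persist the result alongside the model version.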
2. Visualizing the baseline
Below are baseline histograms and descriptive stats. These serve as your reference profiles for the input features P(X): trip distance, surge multiplier, and fare.
Why histograms? Two-sample tests (e.g., KS for continuous features; chi-squared for categorical) tell you whether today's window plausibly came from the same distribution as the baseline window. But pictures (plus summary stats) help engineers reason quickly about where the change is (center, spread, tails).
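As a sketch of how those two tests might be run with SciPy (the samples and zone counts below are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.normal(6.5, 2.0, 5000)  # e.g. baseline trip_distance_km
today = rng.normal(7.2, 2.3, 2000)     # today's window, with a shift

# KS: compares the two empirical CDFs; a small p-value suggests the
# windows were not drawn from the same distribution.
ks_stat, ks_p = stats.ks_2samp(baseline, today)

# Chi-squared for a categorical feature (e.g. pickup_zone): compare observed
# counts in today's window against counts expected under the baseline mix.
zones = [f"Z{i:03d}" for i in range(5)]
base_counts = np.array([1200, 1000, 900, 1100, 800])
today_counts = np.array([300, 260, 400, 250, 190])
expected = base_counts / base_counts.sum() * today_counts.sum()
chi2_stat, chi2_p = stats.chisquare(today_counts, f_exp=expected)
```

Both tests flag the shift here; with production-sized samples, even tiny shifts reach significance, which is one reason effect-size measures like PSI are used alongside p-values.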
3. Today vs. Baseline: measuring shift
When labels lag, compare inputs P(X) and model outputs P(ŷ) over time. That's standard in industry monitoring stacks.
We'll use:
- KS test (continuous): simple, non-parametric, compares empirical CDFs
- PSI (binned, symmetric): widely used for production drift dashboards; easy thresholds for alerting
Typical PSI thresholds:
- < 0.10: stable
- 0.10–0.25: moderate shift (watch)
- ≥ 0.25: major shift (investigate, retrain or fix)
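A minimal PSI implementation might look like this. The quantile bins come from the baseline window, and the `eps` clip is an assumption to avoid log(0) on empty bins:

```python
import numpy as np

def psi(baseline: np.ndarray, today: np.ndarray, bins: int = 10, eps: float = 1e-4) -> float:
    """Population Stability Index between a baseline window and today's window."""
    # Bin edges come from baseline quantiles; both windows are binned identically.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    b = np.histogram(baseline, bins=edges)[0] / len(baseline)
    # Clip today's values into the baseline range so nothing falls outside the bins.
    t = np.histogram(np.clip(today, edges[0], edges[-1]), bins=edges)[0] / len(today)
    b, t = np.clip(b, eps, None), np.clip(t, eps, None)
    return float(np.sum((t - b) * np.log(t / b)))

rng = np.random.default_rng(7)
stable = psi(rng.normal(6.5, 2.0, 20000), rng.normal(6.5, 2.0, 8000))   # near 0
shifted = psi(rng.normal(6.5, 2.0, 20000), rng.normal(7.2, 2.3, 8000))  # well above stable
```

Because PSI is symmetric in the two windows and bounded away from p-value pathologies, it works well as a dashboard metric with the fixed thresholds above.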
4. Run it yourself (data + code)
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
N0, N1 = 20000, 8000  # baseline window, today's window

# Baseline distributions
trip0 = np.clip(rng.normal(6.5, 2.0, N0), 0.5, None)
surge0 = np.clip(rng.lognormal(mean=0.05, sigma=0.15, size=N0), 1.0, None)
fare0 = np.clip(35 + trip0 * 3.2 + rng.normal(0, 5, N0), 5, None)
df0 = pd.DataFrame({
    "ride_id": [f"b_{i}" for i in range(N0)],
    "timestamp": pd.date_range("2025-09-01", periods=N0, freq="min"),
    "pickup_zone": rng.choice([f"Z{i:03d}" for i in range(40)], size=N0),
    "dropoff_zone": rng.choice([f"Z{i:03d}" for i in range(40)], size=N0),
    "trip_distance_km": trip0,
    "surge_multiplier": surge0,
    "fare_amount": fare0,
})

# Today's window with a subtle shift (slightly longer trips, heavier tail)
trip1 = np.clip(rng.normal(7.2, 2.3, N1), 0.5, None)
surge1 = np.clip(rng.lognormal(mean=0.08, sigma=0.18, size=N1), 1.0, None)
fare1 = np.clip(36 + trip1 * 3.4 + rng.normal(0, 6, N1), 5, None)
df1 = pd.DataFrame({
    "ride_id": [f"t_{i}" for i in range(N1)],
    "timestamp": pd.date_range("2025-10-01", periods=N1, freq="min"),
    "pickup_zone": rng.choice([f"Z{i:03d}" for i in range(40)], size=N1),
    "dropoff_zone": rng.choice([f"Z{i:03d}" for i in range(40)], size=N1),
    "trip_distance_km": trip1,
    "surge_multiplier": surge1,
    "fare_amount": fare1,
})

df0.to_csv("rides_baseline.csv", index=False)
df1.to_csv("rides_today.csv", index=False)
print("Wrote rides_baseline.csv and rides_today.csv")
```
5. What to alert on in Chapter 1
- PSI ≥ 0.25 on any high-importance feature → Alert
- 0.10 ≤ PSI < 0.25 → Warn, annotate and watch next window
- KS p-value < 0.01 for major features → annotate "drift suspected"
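The rules above can be folded into a single status function; this is a hypothetical policy sketch, with strings and thresholds mirroring the bullets:

```python
def drift_status(psi_value: float, ks_pvalue: float) -> str:
    """Map PSI and KS evidence for one feature to a Chapter 1 alert status."""
    if psi_value >= 0.25:
        return "alert"            # major shift: investigate, retrain or fix
    if psi_value >= 0.10:
        return "warn"             # moderate shift: annotate, watch next window
    if ks_pvalue < 0.01:
        return "drift suspected"  # distributional evidence without a large PSI
    return "ok"
```

In practice you'd run this per feature per window and route the non-"ok" statuses to your alerting channel.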
Why this mix? Labels can be delayed; monitoring P(X) and P(ŷ) is crucial in those situations. Summary stats plus tests help quickly narrow down where and how the distributions changed.
6. Where this connects (foreshadow)
This chapter ends with a subtle alert (PSI rising) that will carry into Chapter 2: Covariate Shift. We'll add spatial hexbins and a control-room view of evolving frequencies across zones, then step into concept drift later. Observability widens from "that it changed" to "why it changed".