The City Restored

Continuous monitoring, guardrails, and the observability loop

What you'll learn

  • How to connect monitoring, drift detection, and experimentation into a single MLOps loop
  • How to visualize model health in production dashboards
  • How to enforce automated guardrails and incident recovery
  • How real companies implement continuous observability

The MLOps Feedback Loop

Modern ML systems operate in closed feedback loops. Instead of deploying models and hoping for the best, production systems continuously monitor drift, performance, and fairness — automatically triggering retraining or rollbacks when issues arise.

Stage             Role                    Example Metric
Ingestion         Data Quality Checks     Missing feature %
Monitoring        Drift Detection         PSI, KS statistic
Evaluation        Model Performance       RMSE, MAE, accuracy
Experimentation   Controlled Tests        A/B test outcomes
Governance        Guardrails              SLA breach, fairness gaps
Retraining        Continuous Learning     Model refresh pipeline

The loop flows: Detect drift → Diagnose → Retrain → Revalidate → Redeploy.
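
A toy illustration of the routing decision behind that loop is sketched below; the thresholds and the rule for combining PSI with RMSE are assumptions for this sketch, not a prescribed policy:

# Illustrative routing step for the feedback loop. The 0.25 PSI and 3.0 RMSE
# cut-offs are assumed values; tune them per model and SLA.
def next_action(psi: float, rmse: float,
                psi_threshold: float = 0.25,
                rmse_threshold: float = 3.0) -> str:
    """Map the latest monitoring metrics to the next stage of the loop."""
    if psi <= psi_threshold and rmse <= rmse_threshold:
        return "keep serving"                       # healthy: stay in monitoring
    if psi > psi_threshold and rmse <= rmse_threshold:
        return "diagnose"                           # drift, but no visible damage yet
    return "retrain -> revalidate -> redeploy"      # drift plus degradation

print(next_action(psi=0.12, rmse=2.1))   # keep serving
print(next_action(psi=0.31, rmse=2.2))   # diagnose
print(next_action(psi=0.31, rmse=3.4))   # retrain -> revalidate -> redeploy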

Live Monitoring Dashboard

Track model health with unified dashboards that correlate drift and performance metrics over time:

[Interactive chart: monitoring dashboard of PSI and RMSE over time]

Interpretation:

  • Blue line (PSI): Measures input distribution drift. Values above the 0.25 threshold indicate a significant shift.
  • Orange line (RMSE): Tracks prediction error. Increases correlate with higher drift.
  • When PSI breaches threshold, performance typically degrades — triggering automated alerts.

The dashboard enables teams to spot degradation early and correlate drift with model errors.
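
For reference, here is a minimal sketch of the PSI calculation behind the blue line, assuming equal-width bins derived from a reference window; the synthetic samples at the end exist only to demonstrate the call:

import numpy as np

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a reference and a current sample of a single feature.

    Bin edges come from the reference window; current values outside that
    range are ignored by np.histogram. A small epsilon avoids log(0).
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)       # reference window
shifted = rng.normal(0.5, 1.2, 10_000)    # simulated covariate shift
print(f"PSI: {population_stability_index(baseline, shifted):.3f}")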

Drift vs Performance Relationship

Does drift actually cause performance degradation? Let's examine the correlation:

[Interactive chart: drift vs. performance correlation]

Observation: The strong positive correlation supports treating covariate shift (measured by PSI) as a leading indicator of performance degradation (rising RMSE). This validates monitoring drift as an early warning signal — allowing teams to retrain models before users notice quality drops.

This correlation helps prioritize retraining: not all drift matters equally, but drift that correlates with performance issues requires immediate action.
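
One way to act on that is to score each monitored feature by how strongly its daily drift tracks the error metric and retrain for the worst offenders first; the feature names and numbers below are hypothetical:

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
days = 30
trend = np.linspace(0.05, 0.30, days)

# Hypothetical daily PSI per feature alongside the model's daily RMSE
drift = pd.DataFrame({
    "trip_distance_psi": trend + rng.normal(0, 0.01, days),
    "pickup_hour_psi": rng.uniform(0.02, 0.08, days),   # low, stable drift
    "rmse": 1.8 + 4 * trend + rng.normal(0, 0.1, days),
})

# Rank features by how strongly their drift correlates with the error metric
priority = (drift.drop(columns="rmse")
                 .corrwith(drift["rmse"])
                 .sort_values(ascending=False))
print(priority)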

Guardrails & Auto-Recovery

Guardrails ensure systems fail safely. Rather than just alerting humans, modern ML systems take automated protective actions:

[Interactive chart: guardrail timeline]

Guardrail Actions:

  • 🔵 OK: Model performing within acceptable bounds
  • 🟡 Warning: Metric breach detected, team alerted
  • 🔴 Rollback: Automatic revert to previous stable version
  • 🟢 Recovered: Model retrained and redeployed successfully

Common Guardrail Thresholds (see the sketch after this list):

  • Latency ≤ 300ms (99th percentile)
  • MAE ≤ 2.5 minutes (for ETA prediction)
  • Fairness gap ≤ 5% (across demographic groups)
  • PSI ≤ 0.25 (input drift threshold)
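
Here is that sketch: a minimal check of current metrics against the thresholds above. The metric names, units, and example values are assumptions, not a specific monitoring API:

# Assumed thresholds mirroring the list above; tune per model and SLA.
GUARDRAILS = {
    "latency_p99_ms": 300,
    "mae_minutes": 2.5,
    "fairness_gap": 0.05,
    "psi": 0.25,
}

def check_guardrails(metrics: dict[str, float]) -> list[str]:
    """Return the names of any guardrails breached by the current metrics."""
    return [name for name, limit in GUARDRAILS.items()
            if metrics.get(name, 0.0) > limit]

current = {"latency_p99_ms": 287, "mae_minutes": 2.7, "fairness_gap": 0.03, "psi": 0.31}
print("Breached:", check_guardrails(current) or "none")   # ['mae_minutes', 'psi']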

Implementation Code

Here's how to generate monitoring data and implement basic guardrail logic:

import numpy as np
import pandas as pd

rng = np.random.default_rng(21)
days = pd.date_range("2025-09-01", periods=30)

# PSI gradually increases (drift emerging)
psi = np.clip(np.linspace(0.05, 0.3, 30) + rng.normal(0, 0.01, 30), 0, 1)

# RMSE correlates with PSI
rmse = 1.8 + 4 * psi + rng.normal(0, 0.1, 30)

# Create monitoring dataset
df = pd.DataFrame({
  "date": days,
  "psi": psi,
  "rmse": rmse,
  "bias": rng.normal(0, 0.2, 30),
  "volume": rng.integers(8000, 12000, 30)
})

df.to_csv("monitoring_dashboard.csv", index=False)
print(f"PSI–RMSE correlation: {df[['psi','rmse']].corr().iloc[0,1]:.2f}")

Real-World Implementations

Company   Monitoring Stack           Guardrail Logic
Uber      Michelangelo + MonStitch   Auto-drain traffic on drift or SLA breach
Airbnb    Experiment Guardrails      Blocks metric regressions in concurrent tests
Netflix   Atlas + XPGuard            Real-time anomaly detection on KPIs
Google    TFX + Vertex Pipelines     Data & model drift checks before auto-promotion

These systems share common patterns:

  1. Centralized monitoring across all production models
  2. Automated guardrails with configurable thresholds
  3. Incident response workflows (alert → rollback → retrain)
  4. Feedback loops that improve model performance over time

Key Takeaways

Continuous Observability Checklist

  • Centralize metrics across drift, performance, and fairness
  • Automate guardrail checks with alert thresholds
  • Correlate drift with performance degradation for prioritization
  • Trigger retraining or rollback automatically when thresholds are breached
  • Feed experiment results back into retraining → closed learning loop

Bringing It All Together

From Chapter 1 through Chapter 6, we've built a complete MLOps statistical foundation:

  • Chapter 1: Established baselines and learned drift detection with PSI
  • Chapter 2: Extended to covariate drift monitoring over time
  • Chapter 3: Detected concept drift and performance degradation
  • Chapter 4: Implemented rigorous A/B testing with SRM checks and power analysis
  • Chapter 5: Optimized experiments with CUPED and sequential testing
  • Chapter 6: Closed the loop with continuous monitoring and automated guardrails

These aren't isolated techniques — they form an integrated system where:

  • Monitoring detects issues early
  • Experiments validate improvements rigorously
  • Guardrails protect production automatically
  • Feedback loops drive continuous improvement

Continue Building

The statistics you've learned here apply to any production ML system. Whether shipping models at a startup or managing thousands at a tech giant, these principles keep systems healthy and your decisions grounded in data.