Improve Your Data-Driven Decision-Making With Anomaly Detection: A Practical Method

Anomaly detection is a crucial part of data-driven decision-making. It enables you to proactively identify problems or unusual patterns, and it helps prioritize investigations when working with large, complex, multidimensional datasets.

For example, consider the following table showing daily clicks by state. In which state did we observe the most significant drop-off on December 28?

I’ll reveal the answer at the end 🙂 .

There are many tools available for anomaly detection, but I want to share a quick and practical method that can be applied almost anywhere. It’s simple, highly effective, and an excellent starting point. I’ve used this approach to analyze time series data in Pandas, spreadsheets, SQL, and PromQL, as well as in Data Quality Assurance (DQA) checks.

To explain this approach, let me introduce the Z-score, a statistical tool that helps identify anomalies by quantifying how far a value deviates from the mean in terms of standard deviations.

Z-score = (X – μ) / σ

Explanation of terms:

  • X: The most recent data point in your series.
  • μ: The mean (average) of the series, excluding the latest data point.
  • σ: The standard deviation of the same historical values, indicating how much they vary around the mean.

How Z-scores work:

  • Z > 0: The value is above the mean (higher than average).
  • Z < 0: The value is below the mean (lower than average).

Statistical interpretation of Z-scores:

  • |Z| > 1: about 32% of values fall outside (below the 16th percentile or above the 84th percentile).
  • |Z| > 2: about 5% of values fall outside (below the 2.5th percentile or above the 97.5th percentile).
  • |Z| > 3: about 0.3% of values fall outside (below the 0.15th percentile or above the 99.85th percentile).
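
These coverage figures follow from the standard normal distribution and can be checked with Python’s standard library alone:

```python
import math

# P(|Z| > z) for a standard normal, via the complementary error function
for z in (1, 2, 3):
    outside = math.erfc(z / math.sqrt(2))
    print(f"|Z| > {z}: {outside:.1%} of values fall outside")
```

Running this prints roughly 31.7%, 4.6%, and 0.3%, matching the rounded figures above.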

By calculating Z-scores, you can pinpoint extreme deviations, prioritize the data points with the highest absolute values, and set alerts for anomalies that exceed a specific threshold (e.g., ±2.5 standard deviations).
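
Applied to a clicks-by-state table like the one above, that alerting rule fits in a few lines of pandas (the column names and numbers here are illustrative, not the article’s actual data):

```python
import pandas as pd

# Hypothetical wide table: one column of daily clicks per state, one row per day.
df = pd.DataFrame({
    "Illinois": [990, 1000, 980, 1010, 900],
    "Ohio":     [500, 510, 495, 505, 502],
})

history, latest = df.iloc[:-1], df.iloc[-1]
z = (latest - history.mean()) / history.std(ddof=0)  # per-state Z-score of the last day
alerts = z[z.abs() > 2.5]                            # states beyond the ±2.5 threshold
print(alerts)  # only the state with the sharp drop trips the alert
```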

Limitations of Z-scores: the method is best suited to normally distributed data, is sensitive to outliers, and becomes less reliable with small sample sizes.

And the winner is… Illinois, which experienced the most severe drop-off, with a Z-score of -2.726. On December 28, clicks were 927, while the average was 994.78 and the standard deviation was 24.86, giving a Z-score of (927 − 994.78) / 24.86 ≈ -2.726.
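
As a quick sanity check on the arithmetic, using the numbers from the paragraph above:

```python
# (X - μ) / σ for Illinois on December 28
z = (927 - 994.78) / 24.86
print(round(z, 3))  # -2.726
```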
