Why Anomaly Detection Matters for Time Series Data
Every data pipeline, monitoring dashboard, and ML model eventually runs into data points that just don't belong. A sudden CPU spike at 3 AM, a fraudulent transaction buried in millions of legitimate ones, a sensor reading that drifts way outside its expected range — these are all anomalies that, left unchecked, will corrupt your downstream analytics and tank model accuracy.
Here's the thing about time series data though: the order and timing of observations carry real meaning. A value of 95 might be perfectly normal during peak hours but deeply suspicious at midnight. Traditional tabular outlier detection ignores this temporal context entirely, and that's a problem. It's why purpose-built approaches — from statistical decomposition to tree-based machine learning — are essential for this kind of work.
In this guide, you'll build a complete anomaly detection toolkit in Python using scikit-learn 1.8, statsmodels, and pandas. We'll cover four distinct methods, talk about when each one shines, and you'll walk away with production-ready code you can adapt to your own datasets.
Types of Anomalies in Time Series
Before picking a detection method, it helps to know what you're actually looking for. Time series anomalies generally fall into three buckets:
- Point anomalies — A single observation that deviates sharply from its neighbors. Think a one-off CPU spike or a rogue sensor reading.
- Contextual anomalies — Values that are perfectly normal in one context but abnormal in another. High ice-cream sales in July? Expected. The same figure in January? That's suspicious.
- Collective anomalies — A sequence of observations that, individually, look fine but together form an abnormal pattern. A slow, steady decline in server response time over several hours might indicate a memory leak, even though no single measurement crosses a static threshold.
The methods below handle these categories differently. Statistical and STL-based approaches excel at point and contextual anomalies. Machine learning methods like Isolation Forest and Local Outlier Factor can also capture collective anomalies — provided you engineer the right features (more on that shortly).
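To make the three categories concrete, here's a small sketch that injects one of each type into a toy hourly signal. The positions and magnitudes are illustrative only — this is not the dataset used later in the guide:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2025-01-01", periods=7 * 24, freq="h")
s = pd.Series(
    10 * np.sin(2 * np.pi * np.arange(len(idx)) / 24) + rng.normal(0, 1, len(idx)),
    index=idx,
)

# Point anomaly: one extreme spike at a daily peak
s.iloc[30] += 25

# Contextual anomaly: a peak-hours level placed at the cycle's trough,
# where the expected value is roughly -10
s.iloc[66] = 10.0

# Collective anomaly: a flat run, no single point of which is extreme
s.iloc[100:110] = s.iloc[100]
```

A plain global threshold would catch the point anomaly, but the contextual and collective ones only stand out once you account for the daily cycle and recent history.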
Setting Up Your Environment
All examples here use Python 3.11+ with these packages:
pip install pandas numpy scikit-learn statsmodels matplotlib
If you're using scikit-learn 1.8 (released December 2025), you get corrected sample_weight handling in IsolationForest and expanded Array API support for GPU-backed arrays. Make sure you're on the latest version:
pip install --upgrade scikit-learn
python -c "import sklearn; print(sklearn.__version__)"
# 1.8.0
Preparing the Sample Dataset
To keep things reproducible, we'll generate a synthetic time series with a known trend, seasonality, noise, and some injected anomalies. I find this approach way more useful for learning than loading a mystery CSV where you can't verify whether your detector is actually working.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(42)
# Generate 365 days of hourly data
dates = pd.date_range("2025-01-01", periods=365 * 24, freq="h")
n = len(dates)
# Trend + daily seasonality + noise
trend = np.linspace(50, 70, n)
daily_season = 10 * np.sin(2 * np.pi * np.arange(n) / 24)
noise = np.random.normal(0, 2, n)
values = trend + daily_season + noise
# Inject 20 point anomalies
anomaly_indices = np.random.choice(n, size=20, replace=False)
values[anomaly_indices] += np.random.choice([-1, 1], size=20) * np.random.uniform(15, 30, size=20)
df = pd.DataFrame({"timestamp": dates, "value": values})
df.set_index("timestamp", inplace=True)
print(f"Dataset shape: {df.shape}")
print(f"Injected anomalies: {len(anomaly_indices)}")
print(df.head())
This gives you a year of hourly measurements with a gentle upward trend, a repeating 24-hour cycle, and 20 artificial spikes scattered throughout. Real-world equivalents include server metrics, IoT sensor feeds, and financial tick data.
Feature Engineering with Rolling Windows
Isolation Forest and LOF treat each row independently — they have zero awareness of temporal ordering. To give them that context, you need to engineer features that encode recent history into every observation.
Rolling window statistics are the most effective way to do this.
window = 24 # one-day window for hourly data
df["rolling_mean"] = df["value"].rolling(window=window, center=False).mean()
df["rolling_std"] = df["value"].rolling(window=window, center=False).std()
df["rolling_min"] = df["value"].rolling(window=window, center=False).min()
df["rolling_max"] = df["value"].rolling(window=window, center=False).max()
# How far the current value deviates from its local mean
df["z_score"] = (df["value"] - df["rolling_mean"]) / df["rolling_std"]
# Lag features capture short-term dependencies
df["lag_1h"] = df["value"].shift(1)
df["lag_24h"] = df["value"].shift(24)
df["diff_1h"] = df["value"].diff(1)
# Drop rows with NaN from rolling calculations
df.dropna(inplace=True)
print(f"Features created: {list(df.columns)}")
print(df.head())
These features paint a rich picture of each data point's local neighborhood: its recent average and volatility, how far it deviates from the norm, and what the value looked like one hour and one day ago. Honestly, this feature engineering step often matters more than which algorithm you pick.
Choosing the Right Window Size
The window size should match the dominant periodicity of your data. For hourly data with a daily cycle, a 24-hour window captures one full period. For daily data with weekly seasonality, go with a 7-day window. When in doubt, try multiple sizes and compare the resulting anomaly scores — you'll usually spot a clear winner pretty quickly.
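Here's a sketch of that comparison — the series and the candidate window sizes are illustrative. It counts how many points each window flags at a fixed Z-score threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.date_range("2025-01-01", periods=30 * 24, freq="h")
vals = 10 * np.sin(2 * np.pi * np.arange(len(idx)) / 24) + rng.normal(0, 2, len(idx))
vals[200] += 25  # one injected spike
series = pd.Series(vals, index=idx)

# Count flagged points for each candidate window size
results = {}
for window in (6, 24, 168):
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    z = ((series - mean) / std).abs()
    results[window] = int((z > 3).sum())

print(results)
```

Windows much shorter or longer than the true period tend to either over-flag ordinary seasonal swings or dilute the spike; the one matching the dominant cycle usually gives the cleanest result.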
Method 1: Statistical Z-Score Detection
The simplest approach uses the rolling Z-score you already computed. Any observation whose Z-score exceeds a threshold (commonly 3) gets flagged as anomalous:
threshold = 3.0
df["anomaly_zscore"] = (df["z_score"].abs() > threshold).astype(int)
n_detected = df["anomaly_zscore"].sum()
print(f"Z-score anomalies detected: {n_detected}")
# Visualize
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(df.index, df["value"], color="steelblue", linewidth=0.5, label="Value")
anomalies = df[df["anomaly_zscore"] == 1]
ax.scatter(anomalies.index, anomalies["value"], color="red", s=20, zorder=5, label="Anomaly")
ax.set_title("Z-Score Anomaly Detection (threshold=3)")
ax.legend()
plt.tight_layout()
plt.show()
The Z-score method is fast, interpretable, and requires no model training. Hard to beat for a first pass.
Its weakness? It assumes a roughly normal distribution within each window and can miss contextual anomalies in highly seasonal data. But as a baseline, it's surprisingly effective.
Method 2: STL Decomposition with Statsmodels
Seasonal-Trend decomposition using LOESS (STL) separates a time series into three components: trend, seasonal, and residual. The residual captures everything the trend and seasonality can't explain — and large residuals point directly to anomalies.
from statsmodels.tsa.seasonal import STL
# Resample to daily for cleaner STL decomposition
daily = df["value"].resample("D").mean()
# seasonal=7 sets the LOESS seasonal-smoother window (must be odd);
# the weekly period itself is inferred from the daily DatetimeIndex
stl = STL(daily, seasonal=7, robust=True)
result = stl.fit()
# Plot the decomposition
fig = result.plot()
fig.set_size_inches(14, 10)
plt.tight_layout()
plt.show()
Setting robust=True is important here — it applies a data-dependent weighting function that down-weights extreme values during LOESS fitting. This makes the decomposition itself resistant to the very anomalies you're trying to find. Without it, outliers can distort the trend and seasonal estimates, which kind of defeats the purpose.
Flagging Anomalies from Residuals
residuals = result.resid
resid_mean = residuals.mean()
resid_std = residuals.std()
# Flag residuals beyond 2.5 standard deviations
stl_threshold = 2.5
daily_df = pd.DataFrame({"value": daily, "residual": residuals})
daily_df["anomaly_stl"] = (residuals.abs() > resid_mean + stl_threshold * resid_std).astype(int)
n_stl = daily_df["anomaly_stl"].sum()
print(f"STL anomalies detected: {n_stl}")
# Visualize
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)
axes[0].plot(daily_df.index, daily_df["value"], color="steelblue", label="Daily Mean")
stl_anomalies = daily_df[daily_df["anomaly_stl"] == 1]
axes[0].scatter(stl_anomalies.index, stl_anomalies["value"], color="red", s=40, zorder=5, label="Anomaly")
axes[0].set_title("STL-Based Anomaly Detection")
axes[0].legend()
axes[1].plot(daily_df.index, daily_df["residual"], color="gray", label="Residual")
axes[1].axhline(resid_mean + stl_threshold * resid_std, color="red", linestyle="--", label=f"+{stl_threshold}σ")
axes[1].axhline(resid_mean - stl_threshold * resid_std, color="red", linestyle="--", label=f"-{stl_threshold}σ")
axes[1].set_title("Residuals with Threshold")
axes[1].legend()
plt.tight_layout()
plt.show()
STL's key advantage is that it explicitly accounts for seasonality. An observation that deviates from the expected seasonal pattern will produce a large residual even if its raw value looks totally normal — exactly the kind of contextual anomaly that plain Z-scores miss.
Method 3: Isolation Forest with Scikit-Learn 1.8
So, let's get into the ML-based methods. Isolation Forest works by building an ensemble of random decision trees that attempt to isolate each observation. The core insight is elegant: anomalies, being rare and different, require fewer random splits to isolate and therefore have shorter average path lengths through the trees.
Why Isolation Forest Works Well for Anomaly Detection
- No distribution assumptions — Unlike Z-scores, it works equally well on skewed, multimodal, or non-Gaussian data.
- Scales well — Training on a subsample of size ψ costs roughly O(t · ψ · log ψ) and scoring n points roughly O(n · t · log ψ), where t is the number of trees — practical even for large datasets.
- Multidimensional — It naturally handles multiple features, so all your rolling-window engineered columns contribute to the anomaly score.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
# Select features for the model
feature_cols = ["value", "rolling_mean", "rolling_std", "z_score",
"lag_1h", "lag_24h", "diff_1h"]
# RobustScaler is preferred when outliers are present
scaler = RobustScaler()
X_scaled = scaler.fit_transform(df[feature_cols])
# Train Isolation Forest
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.005,  # expect ~0.5% anomalies
    max_samples="auto",
    random_state=42,
    n_jobs=-1,
)
iso_forest.fit(X_scaled)
# Predict: -1 = anomaly, 1 = normal
df["anomaly_iforest"] = iso_forest.predict(X_scaled)
df["anomaly_score_iforest"] = iso_forest.decision_function(X_scaled)
n_iforest = (df["anomaly_iforest"] == -1).sum()
print(f"Isolation Forest anomalies detected: {n_iforest}")
# Visualize
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)
axes[0].plot(df.index, df["value"], color="steelblue", linewidth=0.5)
if_anomalies = df[df["anomaly_iforest"] == -1]
axes[0].scatter(if_anomalies.index, if_anomalies["value"],
                color="red", s=20, zorder=5, label="Anomaly")
axes[0].set_title("Isolation Forest Anomaly Detection")
axes[0].legend()
axes[1].plot(df.index, df["anomaly_score_iforest"], color="darkorange", linewidth=0.5)
axes[1].axhline(0, color="red", linestyle="--", label="Decision Boundary")
axes[1].set_title("Anomaly Score (lower = more anomalous)")
axes[1].legend()
plt.tight_layout()
plt.show()
Tuning the Contamination Parameter
The contamination parameter tells the model what fraction of observations are expected to be anomalous. Set it too high and you'll drown in false positives; too low and real anomalies slip through the cracks.
When you don't know the true anomaly rate (which, let's be honest, is most of the time), set contamination="auto" and use the decision_function() scores to set a custom threshold based on your tolerance for false positives.
# Using decision_function for custom thresholding
scores = iso_forest.decision_function(X_scaled)
# Visualize the score distribution to pick a threshold
fig, ax = plt.subplots(figsize=(10, 4))
ax.hist(scores, bins=100, color="steelblue", edgecolor="white")
ax.axvline(np.percentile(scores, 0.5), color="red", linestyle="--",
           label="0.5th percentile")
ax.set_title("Distribution of Anomaly Scores")
ax.set_xlabel("Anomaly Score")
ax.legend()
plt.tight_layout()
plt.show()
Method 4: Local Outlier Factor (LOF)
While Isolation Forest measures how easy it is to isolate a point, Local Outlier Factor takes a different angle — it measures how dense a point's neighborhood is compared to its neighbors' neighborhoods. A point surrounded by a tight cluster but sitting just outside of it will get a high LOF score, making LOF particularly good at detecting anomalies near cluster boundaries.
from sklearn.neighbors import LocalOutlierFactor
# LOF with default 20 neighbors
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.005,
    metric="euclidean",
    n_jobs=-1,
)
# LOF's fit_predict returns labels directly
df["anomaly_lof"] = lof.fit_predict(X_scaled)
df["anomaly_score_lof"] = lof.negative_outlier_factor_
n_lof = (df["anomaly_lof"] == -1).sum()
print(f"LOF anomalies detected: {n_lof}")
# Visualize
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(df.index, df["value"], color="steelblue", linewidth=0.5)
lof_anomalies = df[df["anomaly_lof"] == -1]
ax.scatter(lof_anomalies.index, lof_anomalies["value"],
           color="red", s=20, zorder=5, label="Anomaly")
ax.set_title("Local Outlier Factor Anomaly Detection")
ax.legend()
plt.tight_layout()
plt.show()
LOF vs. Isolation Forest: When to Choose Which
Here's the practical trade-off. A naive LOF implementation computes pairwise distances between all points — worst-case O(n²). Scikit-learn's tree-based neighbor search softens this in low dimensions, but on datasets with more than about 100k rows LOF still becomes noticeably slower than Isolation Forest. However, LOF excels when anomalies cluster near the boundary of normal data, a scenario where Isolation Forest's random splits may not create a clean separation.
My rule of thumb: use Isolation Forest as your default for large datasets. Reach for LOF when you suspect anomalies sit in locally sparse regions near dense clusters, or when your dataset is small enough that the quadratic cost doesn't matter.
Combining Multiple Methods for Robust Detection
No single method catches every anomaly type perfectly. In my experience, the most reliable strategy is combining multiple detectors through a voting ensemble:
# Create a combined score using majority voting
df["vote_zscore"] = df["anomaly_zscore"]
df["vote_iforest"] = (df["anomaly_iforest"] == -1).astype(int)
df["vote_lof"] = (df["anomaly_lof"] == -1).astype(int)
df["total_votes"] = df["vote_zscore"] + df["vote_iforest"] + df["vote_lof"]
# Require at least 2 out of 3 methods to agree
df["anomaly_ensemble"] = (df["total_votes"] >= 2).astype(int)
n_ensemble = df["anomaly_ensemble"].sum()
print(f"Ensemble anomalies (2/3 vote): {n_ensemble}")
# Visualize
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(df.index, df["value"], color="steelblue", linewidth=0.5)
ens_anomalies = df[df["anomaly_ensemble"] == 1]
ax.scatter(ens_anomalies.index, ens_anomalies["value"],
           color="red", s=25, zorder=5, label="Ensemble Anomaly")
ax.set_title("Ensemble Anomaly Detection (2/3 Majority Vote)")
ax.legend()
plt.tight_layout()
plt.show()
# Summary comparison
summary = pd.DataFrame({
    "Method": ["Z-Score", "STL Residuals", "Isolation Forest", "LOF", "Ensemble (2/3)"],
    "Anomalies Detected": [
        df["anomaly_zscore"].sum(),
        n_stl,
        n_iforest,
        n_lof,
        n_ensemble,
    ],
})
print(summary.to_string(index=False))
The ensemble approach reduces false positives by requiring agreement across methods that rely on fundamentally different assumptions — statistical distribution, seasonal decomposition, and density-based isolation. It's a bit more work to set up, but the reduction in noise is worth it.
Classifying Anomaly Severity
Not all anomalies deserve the same level of attention. You can use the anomaly scores to classify detections into severity tiers:
def classify_severity(row):
    z = abs(row["z_score"])
    if z > 5:
        return "CRITICAL"
    elif z > 4:
        return "WARNING"
    elif z > 3:
        return "MINOR"
    return "NORMAL"
df["severity"] = df.apply(classify_severity, axis=1)
severity_counts = df[df["severity"] != "NORMAL"]["severity"].value_counts()
print("Anomaly Severity Distribution:")
print(severity_counts)
In production systems, this severity classification drives your alerting rules. Critical anomalies trigger immediate pages, warnings go to dashboards, and minor detections feed into daily reports for someone to look at when they have time.
Production Best Practices
Moving anomaly detection from a notebook to a production pipeline introduces several practical headaches. Here's what I've seen trip people up most often.
1. Align Your Time Grid
Before computing rolling features, make sure your timestamps follow a consistent frequency. Use df.asfreq("h") or df.resample("h").mean() to fill gaps and align observations to a regular grid. Skipping this step is one of the most common sources of subtle bugs.
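Here's what that alignment looks like on a toy series with one missing hour — asfreq exposes the gap as NaN, and a time-based interpolation fills it before any rolling features are computed:

```python
import pandas as pd

# An hourly series with a missing observation at 02:00
idx = pd.to_datetime(["2025-01-01 00:00", "2025-01-01 01:00", "2025-01-01 03:00"])
s = pd.Series([10.0, 11.0, 13.0], index=idx)

# asfreq aligns to a regular hourly grid, exposing the gap as NaN
aligned = s.asfreq("h")
print(aligned)

# Interpolate across the gap before computing rolling features
filled = aligned.interpolate(method="time")
print(filled.loc["2025-01-01 02:00"])
```

Whether you interpolate, forward-fill, or leave gaps as NaN depends on the signal — but the grid itself must be regular before rolling windows mean anything.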
2. Fit on Clean Data
Train your Isolation Forest or LOF model on a period you know to be anomaly-free. If you train on contaminated data, the model's notion of "normal" shifts and it'll miss similar anomalies in the future. Use the contamination parameter conservatively to account for any residual noise in the training set.
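A minimal sketch of that split, on synthetic data: fit only on a window you trust, then score later observations against that model of normal:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# "Clean" training period: normal behaviour only
X_train = rng.normal(0, 1, size=(500, 3))

# Later window to score, with one obvious anomaly appended
X_new = np.vstack([rng.normal(0, 1, size=(99, 3)), [[8.0, 8.0, 8.0]]])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
model.fit(X_train)             # fit only on the known-clean window
labels = model.predict(X_new)  # -1 = anomaly, 1 = normal
print("flagged:", int((labels == -1).sum()))
```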
3. Use RobustScaler for Feature Scaling
StandardScaler uses the mean and standard deviation, both of which are sensitive to outliers. RobustScaler uses the median and interquartile range instead, making it far more suitable for datasets that contain the very anomalies you're trying to detect. It's a small change that makes a real difference.
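The difference is easy to see on a toy column with a single extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values plus one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std_scaled = StandardScaler().fit_transform(X).ravel()
rob_scaled = RobustScaler().fit_transform(X).ravel()

# The outlier inflates the mean and std, squashing the normal points
# together; median/IQR scaling keeps their spacing intact
print("StandardScaler:", std_scaled.round(2))
print("RobustScaler:  ", rob_scaled.round(2))
```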
4. Monitor and Retrain
Time series distributions shift over time — what was anomalous last quarter may be perfectly normal now. Schedule periodic retraining or implement a sliding-window approach where the model is always fitted on the most recent N days of data.
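One way to sketch the sliding-window variant — the chunk sizes and refit-per-chunk cadence here are illustrative; in practice you'd refit on a schedule rather than per batch:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def score_with_sliding_window(values: np.ndarray, train_size: int, step: int):
    """Refit on the most recent `train_size` rows before scoring each chunk."""
    scores = []
    for start in range(train_size, len(values), step):
        train = values[start - train_size:start]   # most recent history only
        chunk = values[start:start + step]
        model = IsolationForest(n_estimators=50, random_state=0).fit(train)
        scores.append(model.decision_function(chunk))
    return np.concatenate(scores)

rng = np.random.default_rng(1)
data = rng.normal(0, 1, size=(600, 2))
s = score_with_sliding_window(data, train_size=200, step=100)
print(s.shape)
```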
5. Log Anomaly Scores, Not Just Labels
Store the continuous anomaly score alongside the binary label. This lets you retroactively adjust thresholds without re-running the model and provides a much richer signal for downstream dashboards.
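A toy illustration of why: with scores logged, tightening the threshold later is a one-line filter rather than a re-scoring job. The score values below are made up:

```python
import pandas as pd

# Suppose you logged the continuous score next to the binary label
log = pd.DataFrame({
    "score": [-0.12, 0.08, 0.15, -0.30, 0.02],
    "label": [1, 0, 0, 1, 0],  # 1 = flagged at the original threshold of 0.0
})

# Months later, re-threshold without re-running the model
new_threshold = -0.2
log["label_strict"] = (log["score"] < new_threshold).astype(int)
print(log)
```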
Complete Pipeline: From Raw Data to Anomaly Report
Here's a self-contained function that ties everything together into a reusable pipeline:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
import pandas as pd
import numpy as np
def detect_anomalies(
    df: pd.DataFrame,
    value_col: str = "value",
    window: int = 24,
    contamination: float = 0.005,
    z_threshold: float = 3.0,
    n_estimators: int = 200,
) -> pd.DataFrame:
    """Detect anomalies using rolling Z-score and Isolation Forest.

    Parameters
    ----------
    df : DataFrame with a DatetimeIndex and a value column.
    value_col : Name of the column containing the time series values.
    window : Rolling window size for feature engineering.
    contamination : Expected fraction of anomalies.
    z_threshold : Z-score threshold for statistical detection.
    n_estimators : Number of trees in the Isolation Forest.

    Returns
    -------
    DataFrame with anomaly labels and scores appended.
    """
    result = df.copy()

    # Feature engineering
    result["rolling_mean"] = result[value_col].rolling(window).mean()
    result["rolling_std"] = result[value_col].rolling(window).std()
    result["z_score"] = (
        (result[value_col] - result["rolling_mean"]) / result["rolling_std"]
    )
    result["lag_1"] = result[value_col].shift(1)
    result["lag_period"] = result[value_col].shift(window)
    result["diff_1"] = result[value_col].diff(1)
    result.dropna(inplace=True)

    # Z-score detection
    result["anomaly_zscore"] = (result["z_score"].abs() > z_threshold).astype(int)

    # Isolation Forest detection
    feature_cols = [value_col, "rolling_mean", "rolling_std",
                    "z_score", "lag_1", "lag_period", "diff_1"]
    scaler = RobustScaler()
    X = scaler.fit_transform(result[feature_cols])
    iso = IsolationForest(
        n_estimators=n_estimators,
        contamination=contamination,
        random_state=42,
        n_jobs=-1,
    )
    iso.fit(X)
    result["anomaly_iforest"] = (iso.predict(X) == -1).astype(int)
    result["score_iforest"] = iso.decision_function(X)

    # Ensemble vote: flag only when both detectors agree
    result["anomaly_ensemble"] = (
        (result["anomaly_zscore"] + result["anomaly_iforest"]) >= 2
    ).astype(int)
    return result
# Usage
anomalies_df = detect_anomalies(df, value_col="value", window=24)
print(f"Ensemble anomalies: {anomalies_df['anomaly_ensemble'].sum()}")
Frequently Asked Questions
What's the best Python library for anomaly detection in time series?
Scikit-learn is the most widely used library for general-purpose anomaly detection, offering Isolation Forest, Local Outlier Factor, One-Class SVM, and Elliptic Envelope out of the box. For time series with strong seasonality, combine scikit-learn with statsmodels for STL decomposition. If you want a zero-configuration option, libraries like PyOD and PyCaret wrap multiple algorithms behind a unified API — though you'll give up some control over tuning.
How do I choose between Isolation Forest and Local Outlier Factor?
Start with Isolation Forest for most use cases — it trains faster (near-linear vs. worst-case quadratic complexity), handles high-dimensional feature sets well, and scales to millions of rows without breaking a sweat. Switch to LOF when your dataset is small (under 100k rows) and you suspect anomalies sit in locally sparse pockets near dense clusters, where LOF's density-based scoring picks up nuances that Isolation Forest's random splits might miss.
What contamination value should I use for Isolation Forest?
If you have domain knowledge about the expected anomaly rate, use that directly (for example, 0.01 for 1% anomalies). When you don't know — and again, that's most of the time — set contamination="auto" and use the raw decision_function() scores to set your own threshold. Plot the score distribution and choose a cutoff that balances your tolerance for false positives against the cost of missing real anomalies.
Can Isolation Forest handle seasonal time series data directly?
Not really. Isolation Forest has no built-in notion of seasonality or temporal ordering — it treats each row as an independent observation. To make it season-aware, you need to engineer features like rolling means, rolling standard deviations, lag values, and hour-of-day or day-of-week indicators before passing them to the model. Alternatively, apply STL decomposition first and run Isolation Forest on just the residual component.
How do I evaluate anomaly detection when I have no labeled data?
Without labels, you're stuck with qualitative evaluation: visualize detected anomalies on the original time series and have a domain expert review whether they align with known events like outages, deployments, or holidays. You can also inject synthetic anomalies into a clean dataset, run detection, and measure precision and recall against the known injection points. For ongoing monitoring, track the false-positive rate by logging how often operators dismiss alerts — that feedback loop is invaluable.
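The synthetic-injection approach is straightforward to sketch: inject spikes at known positions, run any detector (a plain global Z-score here, for brevity), and compare the flagged indices against the ground truth:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
values = rng.normal(50, 2, n)

# Inject anomalies at known positions, then see what a detector recovers
true_idx = rng.choice(n, size=10, replace=False)
values[true_idx] += 25

# Detector under test: a simple global 3-sigma rule
detected_idx = np.where(np.abs(values - values.mean()) > 3 * values.std())[0]

tp = len(set(true_idx) & set(detected_idx))
precision = tp / max(len(detected_idx), 1)
recall = tp / len(true_idx)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```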