Anomaly detection that the SOC actually trusts
High AUC doesn't matter if analysts dismiss the alerts. Here's what a detection model actually needs to survive contact with a real security operations workflow.
The metric that doesn't matter
Most anomaly detection papers lead with AUC-ROC. 0.97. 0.99. Sometimes 0.999 on a benchmark dataset. These numbers mean something in controlled evaluation. They don't tell you whether your model survives a week in a real SOC.
The gap between "high AUC" and "a model analysts actually use" is mostly explained by three things: alert fatigue from false positives, p99 latency that breaks the detection workflow, and the absence of explanations that let an analyst triage without opening a ticket.
I've been thinking through what a robust detection pipeline for authentication log streams actually looks like, and the conclusions are less glamorous than the benchmark numbers suggest.
The feature pipeline is where you live or die
Auth logs are sparse, noisy, and inconsistent across sources. Windows Security Event 4624 has a different field structure than Linux PAM logs, which differ from cloud provider access logs, which differ from SSO provider webhook payloads. Before you train anything, you spend most of your time on normalization.
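Everything downstream is easier if each source is parsed into a single normalized event type before feature extraction. As a sketch, something like the record below; the AuthEvent shape is an assumption for illustration, not a standard schema, and the later snippets in this post assume it:

from dataclasses import dataclass
from datetime import datetime

# Hypothetical normalized schema. Each source-specific parser
# (Windows 4624, PAM, cloud access logs, SSO webhooks) emits one of these.
@dataclass
class AuthEvent:
    user_id: str
    timestamp: datetime
    source_ip: str
    country: str | None      # from IP geolocation, if available
    asn: int | None
    user_agent: str
    os_family: str
    device_type: str
    success: bool
    source: str              # e.g. "windows_4624", "pam", "sso_webhook"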
A reasonable feature set for per-user authentication anomaly detection (two of the rolling-window features are sketched after the list):
- Temporal features: hour of day, day of week, delta from median login time for this user, days since last login from this source IP
- Geographic features: country of source IP, ASN, distance from previous login location, is-known-VPN-exit flag
- Device features: user-agent hash, OS family, browser version, is-new-device flag
- Behavioral features: number of distinct source IPs in rolling 24h window, failed-attempt ratio in rolling 1h window, number of distinct target accounts (for service accounts)
- Velocity features: logins per hour, per-day, deviation from rolling 30-day baseline
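A minimal sketch of two of the rolling-window features, computed over the normalized events above. In production you'd maintain these incrementally in a feature store rather than scanning a user's raw history on every login:

from datetime import datetime, timedelta

def distinct_ips_24h(events: list[AuthEvent], now: datetime) -> int:
    # Behavioral feature: distinct source IPs in a rolling 24h window
    cutoff = now - timedelta(hours=24)
    return len({e.source_ip for e in events if e.timestamp >= cutoff})

def failed_ratio_1h(events: list[AuthEvent], now: datetime) -> float:
    # Behavioral feature: share of failed attempts in the last hour
    cutoff = now - timedelta(hours=1)
    window = [e for e in events if e.timestamp >= cutoff]
    if not window:
        return 0.0
    return sum(1 for e in window if not e.success) / len(window)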
None of these features are novel. The value is in computing them correctly and at low latency. The is-new-device flag sounds simple until you're reconciling user-agent strings from mobile apps that update weekly, or corporate proxies that strip identifying headers.
import hashlib

def compute_device_fingerprint(event: AuthEvent) -> str:
    # Normalize before hashing: raw UA strings are too volatile.
    # normalize_user_agent is sketched below.
    ua = normalize_user_agent(event.user_agent)  # strip version micro-patches
    return hashlib.sha256(
        f"{ua}|{event.os_family}|{event.device_type}".encode()
    ).hexdigest()[:16]

def is_new_device(user_id: str, fingerprint: str, lookback_days: int = 30) -> bool:
    # device_cache is an application-level store of fingerprints
    # seen per user within the lookback window
    seen = device_cache.get(user_id, lookback_days)
    return fingerprint not in seen
The normalization step matters. A model trained on raw user-agent strings will treat every Chrome minor version bump as a new device. You'll get spikes in is-new-device alerts every time Google ships an update to 2 billion Chrome users.
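What normalize_user_agent might look like, as a rough sketch; the regex is illustrative, and a maintained parser such as ua-parser handles the long tail more robustly:

import re

def normalize_user_agent(ua: str) -> str:
    # Collapse full version numbers to their major component so that
    # "Chrome/120.0.6099.109" and "Chrome/120.0.6099.224" hash identically
    ua = re.sub(r"(\d+)\.[\d.]+", r"\1", ua)
    # Collapse whitespace variations that some proxies introduce
    return " ".join(ua.split()).lower()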
p99 latency dominates
If your model produces a score and that score arrives 4 seconds after the login event, it's useless for blocking decisions. Authentication anomaly detection has to fit inside the login flow if you want to act on it synchronously, or it has to route to an async alerting queue where the latency requirement is softer but the false positive cost is higher (because now you're paging someone).
A practical architecture (the synchronous path is sketched in code after the list):
- Synchronous path: feature lookup from a low-latency store (Redis, DynamoDB), score inference via a lightweight gradient boosting model (XGBoost, LightGBM), decision in <50ms. Used for step-up auth challenges.
- Async path: full feature computation, larger model, post-hoc analysis, human-review queue. Used for retrospective investigation and model retraining.
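A minimal sketch of the synchronous scoring path, assuming per-user features are pre-materialized into Redis as JSON; the key layout, the FEATURE_ORDER constant, and the model artifact name are all illustrative:

import json

import numpy as np
import redis
import xgboost as xgb

r = redis.Redis(host="localhost", port=6379)
booster = xgb.Booster()
booster.load_model("auth_anomaly.ubj")  # illustrative artifact path

def score_login(user_id: str, event_features: dict) -> float:
    # One round trip for pre-materialized per-user features; this lookup,
    # not inference, is usually where the latency budget goes
    cached = json.loads(r.get(f"feat:{user_id}") or "{}")
    row = {**cached, **event_features}
    vec = np.array([[row.get(f, 0.0) for f in FEATURE_ORDER]], dtype=np.float32)
    return float(booster.predict(xgb.DMatrix(vec))[0])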
The synchronous model doesn't have to be your best model. It has to be your most consistent model. p99 latency at 40ms with 92% AUC beats p99 latency at 800ms with 98% AUC. The 6-point AUC gap costs you some detection fidelity; the latency gap breaks the product.
# Quick latency profile for inference path
wrk -t4 -c100 -d30s --latency http://localhost:8080/score
# Target: p99 < 50ms under 100 concurrent requests
# If you're hitting p99 > 100ms, profile the feature lookup first —
# that's almost always the bottleneck, not inference
Explanations are a trust primitive
An analyst who gets an alert with score 0.94 and no explanation will, after the third false positive, start dismissing all alerts from that model. This is rational behavior. It is also the death of your detection capability.
The minimum viable explanation for an auth anomaly alert (the baseline comparison is sketched after the list):
- Which features drove the score (SHAP values or rule-based equivalents)
- What the baseline looks like for this user ("this user typically logs in from California; this event is from Romania")
- What similar events looked like historically ("we've seen 3 events with this profile in the last 90 days; 2 were confirmed compromises")
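The baseline comparison is the cheapest of the three to produce. A minimal sketch, assuming a per-user profile store with historical login countries (profile_store and its shape are hypothetical):

def baseline_note(user_id: str, event: AuthEvent) -> str:
    # Render a one-line comparison against the user's historical baseline
    profile = profile_store.get(user_id)  # hypothetical per-user profile
    usual = ", ".join(profile.top_countries[:2])
    if event.country and event.country not in profile.top_countries:
        return (f"user typically logs in from {usual}; "
                f"this event is from {event.country}")
    return f"country consistent with baseline ({usual})"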
SHAP is the standard here for tree-based models. The computational overhead is acceptable for the async path. For synchronous scoring, pre-compute feature importances at training time and use them as a proxy — it's not as precise, but it gives the analyst something to work with.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_event)

# Rank by absolute contribution, keeping each feature's original column
# index so the raw-value lookup below stays aligned
top_features = sorted(
    zip(range(len(feature_names)), feature_names, shap_values[0]),
    key=lambda x: abs(x[2]),
    reverse=True,
)[:5]

explanation = [
    {"feature": name, "contribution": float(val), "value": float(X_event[0][i])}
    for i, name, val in top_features
]
The false positive budget
Before you deploy, decide how many false positives the SOC can absorb per day without ignoring the model. This is a product decision, not a technical one. If the answer is 10 per day, your threshold needs to produce at most 10 false positives per day at production traffic volumes, even if that means your recall drops from 0.91 to 0.78.
Calibrate on your actual traffic. The threshold that works on a benchmark dataset almost never works on production traffic without adjustment. Set it on a holdout from production logs, not from the public dataset you used for initial development.
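A minimal sketch of budget-driven thresholding, assuming you have model scores on a production holdout believed to be benign and an estimate of daily event volume (both inputs below are assumptions):

import numpy as np

def threshold_for_budget(benign_scores: np.ndarray,
                         daily_volume: int,
                         fp_budget_per_day: int) -> float:
    # The false positive rate the budget allows
    target_fpr = fp_budget_per_day / daily_volume
    # Score quantile that flags at most that fraction of benign traffic
    return float(np.quantile(benign_scores, 1.0 - target_fpr))

# e.g. a budget of 10 FPs/day at ~2M auth events/day implies an FPR of 5e-6
threshold = threshold_for_budget(benign_scores, 2_000_000, 10)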
None of this is cutting-edge. The models aren't novel. The value is in building a pipeline that actually runs, produces timely output, explains itself clearly enough to earn analyst trust, and has honest precision-recall tradeoffs instead of benchmark-optimized ones. The SOC has a limited attention budget. Use it carefully.