Skip to content

Types of Outliers in Data Mining: A Comprehensive Guide

Outliers are anomalous data points that diverge from expected patterns in a dataset. This expansive guide offers a comprehensive walkthrough of techniques for detecting various types of outliers and properly analyzing them to derive impactful business insights…

The Pervasive Impact of Outliers

Outliers permeate real-world data and can have an outsized impact on analysis if left unmanaged. For example, researcher Paul Goodwin demonstrated that including just one or two outliers in analyses for determining new hospital staffing needs skewed results enough to recommend hiring 20-30% more nurses than necessary [1].

Table 1 shows outliers can significantly inflate averages:

Data Points Dataset 1 Dataset 2
Standard Values 2, 3, 6, 7, 9 2, 3, 6, 7, 100
Average 5.4 23.6

Table 1. Impact of an outlier on the average of sample datasets…

Common Causes of Outliers

Outliers commonly arise from:

  • Measurement and data entry errors
  • Data processing mistakes
  • Natural deviations in populations
  • Intentional anomalies like fraud

Their presence can lead to misguided conclusions – so identifying and properly handling outliers is crucial for sound analysis and decision making…

Techniques for Outlier Detection

Here we explore key statistical and machine learning techniques for detecting outliers based on their characteristics:

Standard Score

Also known as z-score, this technique labels data points as outliers if their standard scores fall above an absolute threshold like 3. It is calculated as:

z = (x - μ) / σ

Where x is the value, μ is the mean and σ is the standard deviation.

For example, Intel integrates a standard score based outlier detection method on their assembly lines to flag defective chip units…

Multivariate Outlier Detection

Multivariate methods detect outliers using multiple attributes. Common techniques include:

Mahalanobis Distance: Accounts for correlations between variables. Outliers have a larger Mahalanobis distance from the bulk of data points [2].

Angle-Based Outlier Detection (ABOD): Defines the variance between the angles an outlier forms with other data points as metric to score degree of outlier-ness [3]…

Time Series Decomposition

Time series components like trend and seasonal patterns must be removed to identify significant deviations. Example process:

  1. Decompose time series data into trend, seasonal and residual components
  2. Determine expected value range for residuals
  3. Label actual residual values outside threshold as outliers

Platforms like Tableau provide native time series outlier identification…

import yfinance as yf
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose 

data = yf.download(‘MSFT‘, ‘2015-01-01‘, ‘2023-01-01‘)
df = pd.DataFrame(data)[‘Close‘]

result = seasonal_decompose(df)  

outliers = detect_residual_outliers(result) 

Python snippet to demonstrate time series outlier detection…

Analyzing and Interpreting Outliers

Once detected, additional analysis helps determine origins and interpret impacts…

Assessing Underlying Causes

Compare flagged outliers against logs and metadata to uncover root causes like data collection gaps, demographic shift, instrumental errors etc. This enables fixing systemic issues.

For example, an IoT predictive maintenance solution flags a flurry of equipment failure alerts. Outlier analysis reveals the sensors were not properly calibrated…

Accounting for Masking and Swamping Effects

Some outliers hide other outliers (masking), while some regular points mimic outliers (swamping)…

Evaluating Algorithm Suitability

No one method works universally – analyzing detection rates on sample labeled data guides technique selection…

Applications of Outlier Analytics

Here we explore several commercial applications:

Fraud Detection – Visa screens 5,000 transaction/second using outlier analytics to catch fraudulent purchases [4].

Network Intrusion – Darktrace detects 95 unknown threats an hour (on average) by modeling normal network patterns to flag anomalies [5].

Predictive Maintenance – Thales employs IoT vibration sensors to detect early failure warnings by flagging deviations…

Manufacturing Optimization – Siemens Gamesa uses outlier analysis to identify defects in turbine blade fabrication process…

Customer Segmentation – Uber Eats clusters delivery locations to reveal outlier neighborhoods that order disproportionately, allowing targeted sales…

Algorithmic Trading – Quant hedge funds build statistical arbitrage strategies by modeling normal stock correlations to profit from outlier deviations …

Disease Outbreak Detection – Healthmap models disease incidence rates across locations to rapidly detect outlier regions with emerging spread…

Established Best Practices for Managing Outliers

Industry and academic authorities suggest guidelines like:

  • Retain genuinely unusual yet significant outliers [6]
  • Replace faulty outliers with missing values if deduced as errors [7]
  • Employ robust statistical methods resilient to outliers [8]
  • Use ensemble outlier detection for improved generalizability [9] …

FAQs

Answers to frequent questions on outlier analytics:

Q: How do I select outlier algorithms for my dataset?

A: Evaluate technique performance by manually labeling a subset rather than relying on standard assumptions…

Q: When should outliers be removed vs. retained?

A: Outliers should be investigated and understood rather than blindly removed before modelling…