ai_anomaly_detection
The AI Anomaly Detection system is a Python-based utility that identifies outliers in datasets using statistical principles such as standard deviation. It is designed to find anomalous data points that deviate significantly from the dataset's normal range.
{{youtube>0m1Ivadc2PM?large}}
The detect_anomalies() function analyzes numerical datasets to:
- Calculate statistical metrics such as mean, variance, and standard deviation.
- Identify anomalies that fall outside a defined threshold (e.g., 3 standard deviations from the mean).
- Log the detection process, providing insights into detected anomalies.
This system is highly valuable for:
- Monitoring data streams in real-time.
- Preprocessing data before model training.
- Identifying unusual behaviors or patterns in datasets.
The core anomaly detection mechanism is based on statistical outlier detection. It calculates the mean and standard deviation of the dataset to determine a range within which most data points fall. Any point outside this range is classified as an anomaly.
Threshold for Anomalies: Data points are considered anomalies if they fall outside the range:
[mean - (3 * standard deviation), mean + (3 * standard deviation)]
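For example, this normal range can be computed for a small dataset with Python's standard `statistics` module (a sketch illustrating the rule, not the utility's actual source):

```python
import statistics

data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]
mean = statistics.fmean(data)
std_dev = statistics.pstdev(data)  # population standard deviation
lower, upper = mean - 3 * std_dev, mean + 3 * std_dev
print(f"Normal range: [{lower:.1f}, {upper:.1f}]")
# → Normal range: [-98.3, 131.9]
```

Any value outside that interval would be flagged as an anomaly.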
The function integrates Python's logging module to track its operations, including:
- When anomaly detection begins.
- Any errors or edge cases (e.g., empty datasets).
- List of anomalies detected.
Example Log Messages:
INFO: Detecting anomalies in the data...
INFO: Anomalies detected: [120, -45]
This function takes a list of numeric data points as input and returns a list of values that qualify as anomalies.
Signature:
```python
def detect_anomalies(data: List[float]) -> List[float]:
    """
    Detect anomalies in the dataset by identifying outliers.

    :param data: List of numeric data points
    :return: List of anomalies detected
    """
```
Input Example:
```python
data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]
anomalies = detect_anomalies(data)
print(f"Anomalies: {anomalies}")
```
Output:
Anomalies: [120, -45]
Explanation:
- The dataset has a mean of 16.8 and a population standard deviation of roughly 38.4 (calculated internally).
- The values 120 and -45 deviate far more from the mean than the remaining points and are classified as anomalies. Note that outliers this large also inflate the standard deviation itself, which is why the detection threshold may need tuning on small samples.
The detect_anomalies() function includes safeguards to handle incomplete or invalid input data.
Example: Empty Dataset:
```python
data = []
anomalies = detect_anomalies(data)
print(f"Anomalies: {anomalies}")
```
Output:
Anomalies: []
Explanation: The function immediately returns an empty list if the dataset is empty.
Example: All Data Within Range:
```python
data = [100, 102, 98, 101, 99]
anomalies = detect_anomalies(data)
print(f"Anomalies: {anomalies}")
```
Output:
Anomalies: []
Explanation: No values fall beyond 3 standard deviations from the mean, so no anomalies are detected.
Input Data:
```python
data = [100, 150, 200, 1000, 105, 210, 980, 115, 195]
anomalies = detect_anomalies(data)
```
Output:
Anomalies: [1000, 980]
Explanation: The outliers 1000 and 980 deviate far from the rest of the data and are classified as anomalies. With only nine points, however, the two outliers substantially inflate the mean (≈339) and the standard deviation, so a threshold below 3 standard deviations may be required to flag them.
With some modifications, the detect_anomalies() function can be adapted for real-time data stream monitoring.
Framework for Live Data Streams:
```python
import random
import time

# Simulating real-time data collection
def stream_anomaly_detection():
    data_stream = []
    while True:
        new_data = random.randint(50, 150)   # Simulate normal range
        if random.random() > 0.95:           # Simulate anomaly
            new_data = random.randint(-500, 500)
        data_stream.append(new_data)
        # Check anomalies every 10 data points
        if len(data_stream) % 10 == 0:
            anomalies = detect_anomalies(data_stream)
            print(f"Latest Anomalies: {anomalies}")
        time.sleep(1)

stream_anomaly_detection()
```
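Because the list above grows without bound, a long-running stream would eventually exhaust memory and the statistics would be dominated by old data. One way to bound both is a sliding window; the `window_anomaly` helper below is a hypothetical sketch, not part of the original utility:

```python
import statistics
from collections import deque

def window_anomaly(window, value, threshold=3.0):
    """Return True if value lies outside `threshold` std devs of the window's mean."""
    if len(window) < 10:   # too little history to judge
        return False
    mean = statistics.fmean(window)
    std_dev = statistics.pstdev(window)
    return std_dev > 0 and abs(value - mean) > threshold * std_dev

window = deque(maxlen=50)  # keep only the 50 most recent readings
for reading in [100, 103, 97, 101, 99, 102, 98, 100, 104, 96, 950]:
    if window_anomaly(window, reading):
        print(f"Anomaly: {reading}")  # → Anomaly: 950
    window.append(reading)
```

Checking each new value against only the recent window keeps the cost per reading constant regardless of stream length.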
By default, the function uses 3 standard deviations as the threshold for anomaly detection. To customize this, modify the following part of the function:
```python
anomalies = [x for x in data if abs(x - mean) > THRESHOLD * std_dev]
```
Example Custom Threshold:
```python
THRESHOLD = 2  # Using 2 standard deviations instead of 3

data = [12, 15, 18, 10, 140]
anomalies = detect_anomalies(data)
print(f"Anomalies with Threshold={THRESHOLD}: {anomalies}")
```
Use the anomaly detection function to analyze multiple datasets in one script, automating the reporting process.
Example:
```python
datasets = [
    [10, 12, 14, 18, 200],
    [90, 92, 91, 89, 700],
    [101, 105, 110, 500],
]

for idx, data in enumerate(datasets):
    anomalies = detect_anomalies(data)
    print(f"Dataset {idx + 1}: {anomalies}")
```
Output:
Dataset 1: [200]
Dataset 2: [700]
Dataset 3: [500]
For deeper insights, combine the detection function with visualization tools to plot anomalies on a graph.
Example with Matplotlib:
```python
import matplotlib.pyplot as plt

data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]
anomalies = detect_anomalies(data)

# Plotting the dataset and anomalies
plt.plot(data, label="Data", marker="o")
plt.scatter(
    [i for i, val in enumerate(data) if val in anomalies],
    anomalies,
    color="red",
    label="Anomalies",
)
plt.title("Anomaly Detection in Dataset")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.show()
```
1. Sensor Data Monitoring: Detect unusual readings in sensor datasets, such as temperature fluctuations or pressure changes.
2. Finance and Fraud Detection: Identify fraudulent transactions or outliers in financial datasets. Detect anomalous patterns in trading or purchase history.
3. Preprocessing for AI Pipelines: Flag and handle anomalous data points before model training to improve model robustness and accuracy.
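For the preprocessing use case, detected anomalies can be filtered out before training. The `remove_anomalies` helper below is a hypothetical illustration, not part of the original utility:

```python
def remove_anomalies(data, anomalies):
    """Drop anomalous values from a dataset before model training."""
    flagged = set(anomalies)
    return [x for x in data if x not in flagged]

data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]
clean = remove_anomalies(data, [120, -45])
print(clean)  # → [10, 12, 15, 10, 11, 14, 12, 9]
```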
- Normalize Data: Ensure datasets are normalized to minimize the impact of scaling on anomaly detection.
- Adjust Thresholds: For datasets with high variance or noise, consider lowering the detection threshold to 2 standard deviations or less.
- Visualization: Combine detection results with visualizations for better interpretability.
The AI Anomaly Detection framework provides a robust, flexible, and extensible mechanism for outlier detection in numerical datasets. With applications ranging from real-time monitoring to preprocessing for AI pipelines, the system is a valuable tool for automated anomaly analysis. By leveraging advanced usage patterns like visualization and threshold adjustments, the functionality can be tailored to a wide range of industry applications.