This project demonstrates the application of anomaly detection techniques to identify fraudulent credit card transactions from a dataset containing over 280,000 rows and 31 columns. Anomaly detection is a crucial task in the realm of fraud prevention, where the goal is to identify unusual or unexpected patterns that deviate from the norm.
- Introduction
- Data
- Anomaly Detection Algorithms
- Results
- Observations and Analysis
- Usage
- Dependencies
- Acknowledgments
Credit card fraud is a significant concern for financial institutions and customers alike. Traditional methods of fraud detection might not always be effective in identifying sophisticated fraudulent transactions. This project explores the application of anomaly detection algorithms to detect such fraudulent activities.
The dataset used in this project contains over 280,000 rows and 31 columns, representing various features extracted from credit card transactions. Each row represents a transaction, and the columns contain information such as transaction amount, time, and other relevant features.
Isolation Forest is an ensemble-based anomaly detection algorithm that works by isolating instances into individual trees. Anomalies are expected to have shorter paths in the tree structure, making them easier to isolate. This algorithm excels in identifying isolated anomalies and is relatively computationally efficient.
Local Outlier Factor is a density-based anomaly detection algorithm that assesses the local density deviation of a data point with respect to its neighbors. Points with significantly lower density than their neighbors are considered outliers. LOF can identify anomalies in more complex data distributions.
Support Vector Machine is a powerful machine learning algorithm that can also be used for anomaly detection. However, due to the size of the dataset, obtaining results using SVM proved to be challenging and computationally intensive. Therefore, SVM results are not included in this analysis.
After applying Isolation Forest and Local Outlier Factor algorithms to the dataset, the following classification report was obtained:
Isolation Forest : 675
Accuracy Score :
0.9976299739823811
Classfication Report:
precision recall f1-score support
0 1.00 1.00 1.00 284315
1 0.31 0.32 0.31 492
accuracy 1.00 284807
macro avg 0.66 0.66 0.66 284807
weighted avg 1.00 1.00 1.00 284807
Local Outlier Factor : 4600
Accuracy Score :
0.9838487115836339
Classfication Report:
precision recall f1-score support
0 1.00 0.99 0.99 284315
1 0.02 0.20 0.04 492
accuracy 0.98 284807
macro avg 0.51 0.59 0.52 284807
weighted avg 1.00 0.98 0.99 284807
The performance of Isolation Forest and Local Outlier Factor on the credit card fraud dataset can be summarized as follows:
Isolation Forest:
- Identified 675 instances as anomalies.
- Achieved an accuracy of approximately 99.76%.
- Showed better precision, recall, and F1-score for detecting fraudulent transactions compared to LOF.
Local Outlier Factor (LOF):
- Identified 4600 instances as anomalies.
- Achieved an accuracy of approximately 98.38%.
- Demonstrated low precision, recall, and F1-score for detecting fraudulent transactions.
Both algorithms struggled with the class imbalance in the dataset, but Isolation Forest outperformed LOF in most evaluation metrics.
- Clone this repository to your local machine.
- Install the required dependencies (see Dependencies).
- Run the provided Jupyter Notebook to see the implementation of Isolation Forest and Local Outlier Factor on the credit card fraud dataset.
- (Optional) Attempt the SVM implementation by modifying the code and using a more powerful computing environment.
- Python 3.x
- Jupyter Notebook
- Pandas
- Scikit-learn
Install the required dependencies using the following command:
pip install pandas scikit-learn
The dataset used in this project is sourced from Kaggle. Special thanks to the open-source community for providing valuable resources and algorithms for anomaly detection.