Log-based Anomaly Detection with Deep Learning: How Far Are We?
This February 2022 paper conducts an in-depth analysis of five state-of-the-art deep learning models for log-based anomaly detection.
The main goal is to assess how far these models have come in solving the problem of detecting anomalies in system logs. The authors focus on several key aspects of model evaluation to provide a comprehensive assessment.
Introduction and Motivation
Log-based anomaly detection is crucial for ensuring the reliability and availability of software-intensive systems.
Many deep learning models have been proposed recently, claiming high detection accuracy (e.g., F-measure > 0.9 on the HDFS dataset).
The authors argue that several important aspects are overlooked in existing evaluations, such as training data selection, data grouping, class distribution, data noise, and early detection ability.
The goal is to re-evaluate the capabilities of deep learning models for log-based anomaly detection considering these aspects.
Study Design
Five representative deep learning models are evaluated: DeepLog, LogAnomaly, PLELog, LogRobust, and CNN.
Four public log datasets are used: HDFS, BGL, Thunderbird, and Spirit.
The authors design experiments to measure the impact of various factors on model performance.
Research questions focus on the effect of training data selection, data grouping, class distribution, data noise, and early detection ability.
Results and Findings
RQ1: Training data selection strategies (random vs. chronological) have a significant impact on semi-supervised models. Random selection can lead to data leakage and unreasonably high accuracy.
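The difference between the two selection strategies can be sketched as follows; this is an illustrative Python snippet (the variable names and toy data are not from the paper). Chronological selection trains only on sequences observed before the test period, while random selection shuffles first, so sequences from the "future" can leak into training.

```python
# Sketch: chronological vs. random training-data selection for
# log anomaly detection. Toy data; names are illustrative.
import random

def chronological_split(sequences, train_ratio=0.8):
    """Take the earliest sequences for training: no future leakage."""
    cut = int(len(sequences) * train_ratio)
    return sequences[:cut], sequences[cut:]

def random_split(sequences, train_ratio=0.8, seed=0):
    """Shuffle before splitting: later sequences can leak into training."""
    shuffled = sequences[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Sequences ordered by timestamp; each is (timestamp, event_ids).
logs = [(t, ["E%d" % (t % 5)]) for t in range(10)]
train_c, test_c = chronological_split(logs)
# Under chronological selection, every training timestamp
# precedes every test timestamp.
assert max(t for t, _ in train_c) < min(t for t, _ in test_c)
```

With `random_split`, the same assertion generally fails: test-period behaviour is partially seen during training, which is one way the unreasonably high accuracy can arise.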
RQ2: Data grouping methods (fixed window vs. session window) substantially influence model performance. Models tend to lose accuracy when dealing with shorter log sequences.
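The two grouping methods can be illustrated with a short sketch (assumed, simplified representations; a real HDFS session window would group by block ID): fixed windows slice the event stream into equal-sized chunks, while session windows group events by an identifier.

```python
# Sketch of the two grouping strategies in RQ2 (illustrative only).
from collections import defaultdict

def fixed_windows(events, size):
    """Slice the event stream into consecutive chunks of `size` events."""
    return [events[i:i + size] for i in range(0, len(events), size)]

def session_windows(entries):
    """Group (session_id, event) pairs by their session identifier."""
    sessions = defaultdict(list)
    for session_id, event in entries:
        sessions[session_id].append(event)
    return dict(sessions)

events = ["E1", "E2", "E3", "E4", "E5"]
print(fixed_windows(events, 2))  # [['E1', 'E2'], ['E3', 'E4'], ['E5']]
```

Smaller fixed windows produce shorter sequences with less context per sample, which is consistent with the finding that models lose accuracy on shorter log sequences.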
RQ3: Highly imbalanced class distribution significantly affects model effectiveness. Commonly used metrics (Precision, Recall, F-measure) are not comprehensive enough for evaluating models with highly imbalanced data.
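A small worked example (hypothetical numbers, not results from the paper) shows why no single metric tells the whole story under heavy imbalance: the same detector can look excellent on one metric and mediocre on another.

```python
# Sketch: confusion-matrix metrics on a highly imbalanced log dataset.
# Counts below are invented for illustration.
import math

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)  # true-negative rate on normal data
    # Matthews correlation coefficient: balances all four cells.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return precision, recall, f1, specificity, mcc

# 100 anomalies among 10,000 sequences; the detector catches 90 of them
# but also raises 90 false alarms on normal sequences.
p, r, f1, spec, mcc = metrics(tp=90, fp=90, fn=10, tn=9810)
print(round(spec, 3), round(p, 3))  # specificity 0.991 vs. precision 0.5
```

Specificity looks near-perfect even though half of all alarms are false, so reporting Precision/Recall/F-measure alongside imbalance-aware views is needed for a fair picture.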
RQ4: Even a small amount of data noise (mislabeled logs, log parsing errors) can degrade anomaly detection performance. Supervised models are more sensitive to mislabeled logs. Models capable of understanding log semantics can reduce the impact of parsing errors.
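One common way to study label-noise sensitivity is to flip a small fraction of the training labels; a minimal sketch of such an injection step (the function and data here are illustrative, not the paper's code):

```python
# Sketch: injecting a controlled ratio of mislabeled sequences
# into a binary-labeled log dataset (0 = normal, 1 = anomaly).
import random

def inject_label_noise(labels, noise_ratio, seed=0):
    """Return a copy of `labels` with `noise_ratio` of entries flipped."""
    rng = random.Random(seed)
    noisy = labels[:]
    n_flip = int(len(noisy) * noise_ratio)
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

labels = [0] * 95 + [1] * 5      # 5% anomalies, as in many log datasets
noisy = inject_label_noise(labels, noise_ratio=0.05)
# Exactly 5% of the labels now differ from the originals.
assert sum(a != b for a, b in zip(labels, noisy)) == 5
```

Re-training a supervised model on `noisy` and comparing against the clean-label baseline is the kind of experiment that reveals the sensitivity reported above.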
RQ5: Different models have varying abilities in early detection of anomalies. Forecasting-based models (DeepLog, LogAnomaly) can detect anomalies earlier than classification-based models (PLELog, LogRobust, CNN).
Discussion
The authors summarize the advantages and disadvantages of each model based on the findings.
They identify research challenges and suggest future work, such as using diverse datasets, handling limited labeled data, improving early detection, dealing with evolving systems, exploring relations among log events, and ensuring data quality.
Conclusion
The study shows that the problem of log-based anomaly detection has not been solved yet, despite the high accuracy claims of recent deep learning models.
The performance of models is often not as good as expected when considering various evaluation aspects.
The authors hope their findings can help practitioners and researchers in this area and provide a benchmark for future work.
In summary, this paper provides a comprehensive and critical evaluation of state-of-the-art deep learning models for log-based anomaly detection.
The authors challenge the existing claims of high accuracy by considering several important aspects often overlooked in evaluations.
The findings highlight the limitations of current models and the need for further research to effectively solve the problem of log-based anomaly detection in real-world scenarios.