Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection

This January 2022 paper focuses on the application of deep learning techniques for log-based anomaly detection in large-scale software systems.

The authors recognise the importance of logs in ensuring system reliability and service quality, as they faithfully record runtime information that can be used for monitoring, troubleshooting, and understanding system behaviour.

The paper highlights the challenges faced by traditional manual inspection methods and machine learning-based approaches for log anomaly detection in modern software systems.

These challenges include:

  1. Insufficient interpretability of results, making it difficult for admins and analysts to trust and act on automated analysis.

  2. Weak adaptability to unseen log events that emerge due to feature additions and system upgrades.

  3. The need for handcrafted features, which can be time-consuming and demand human domain knowledge.

To address these limitations, the authors explore the application of deep learning techniques, specifically neural networks, for log-based anomaly detection.

Deep learning has shown exceptional ability in modeling complex relationships and can automatically extract features from input data.

The paper provides a comprehensive review and evaluation of five popular neural network architectures used by six state-of-the-art log anomaly detection methods:

  1. Four unsupervised methods:

    • Two methods using Long Short-Term Memory (LSTM) networks

    • One method using Transformer architecture

    • One method using Autoencoder

  2. Two supervised methods:

    • One method using Convolutional Neural Networks (CNN)

    • One method using Attentional Bidirectional LSTM (BiLSTM)

The authors note that unsupervised methods are more favoured in the literature, as labels are often unavailable in real-world scenarios.

Gap between academic research and application

The authors also highlight the gap between academic research and industrial practices in adopting deep learning techniques for log-based anomaly detection.

They attribute this gap to the lack of awareness among site reliability engineers about state-of-the-art methods and the absence of open-source toolkits that apply deep learning techniques for this purpose.

To facilitate the adoption of deep learning-based log anomaly detection, the authors release an open-source toolkit containing the studied models.

This toolkit aims to help researchers and practitioners quickly understand the characteristics of popular deep learning-based anomaly detectors, save efforts on re-implementations, and focus on further customization or improvement.

What is the log anomaly detection process?

Log Collection

  • Software systems generate logs that record runtime status, including timestamps and detailed messages (e.g., error symptoms, target components, IP addresses).

  • In large-scale systems, such as distributed systems, logs are often collected centrally.

  • The large volume of collected logs can be overwhelming for existing troubleshooting systems, and the lack of labeled data poses challenges for log analysis.

Log Parsing

  • Raw logs are semi-structured and need to be parsed into a structured format for analysis, a process called log parsing.

  • Log parsing identifies the constant/static part (log event, log template, or log key) and the variable/dynamic part (parameter values) of a raw log line.

  • Example: "Received block blk_789 of size 67108864 from /10.251.42.84" is parsed into the log event "Received block <> of size <> from <>", where parameters are replaced with "<>".

Log Partition and Feature Extraction

  • Logs are textual messages and need to be converted into numerical features for machine learning algorithms.

  • Each log message is represented by its log template identified by the log parser.

  • Log timestamps and identifiers (e.g., task/job/session ID) are used to partition logs into different groups, each representing a log sequence.

  • Timestamp-based log partition strategies:

    • Fixed partitioning: Uses a pre-defined time interval (partition size) to split chronologically sorted logs without overlap between consecutive partitions (see the sketch after this list).

    • Sliding partitioning: Uses partition size and stride (forwarding distance) to generate overlapping log partitions, producing more log sequences than fixed partitioning.

  • Identifier-based partitioning: Sorts logs chronologically and divides them into sequences based on a unique and common identifier, indicating they originate from the same task execution.

  • Traditional ML-based methods often generate a vector of log event counts as input features, where each dimension represents a log event, and the value counts its occurrence in a log sequence.

  • DL-based methods directly consume the log event sequence, representing each element as an index or a more sophisticated feature like a log embedding vector to learn the semantics of logs.
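
To make the partitioning and feature-extraction steps above concrete, the Python sketch below groups timestamped event IDs into fixed windows and then builds the two feature styles (count vector versus raw sequence). The toy data and window size are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Tuple

# Each entry is (timestamp_in_seconds, event_id); event IDs come from the log parser.
Log = Tuple[float, int]

def fixed_partition(logs: List[Log], window: float) -> List[List[int]]:
    """Group chronologically sorted logs into non-overlapping windows of `window` seconds."""
    sequences, current, window_end = [], [], None
    for ts, event in logs:
        if window_end is None or ts >= window_end:
            if current:
                sequences.append(current)
            current, window_end = [], ts + window
        current.append(event)
    if current:
        sequences.append(current)
    return sequences

def count_vector(sequence: List[int], num_events: int) -> List[int]:
    """Traditional ML feature: occurrence count of each log event in the sequence."""
    counts = Counter(sequence)
    return [counts.get(e, 0) for e in range(num_events)]

logs = [(0.5, 1), (1.2, 2), (3.7, 1), (4.1, 3), (9.0, 2)]
sequences = fixed_partition(logs, window=5.0)            # -> [[1, 2, 1, 3], [2]]
print([count_vector(s, num_events=4) for s in sequences])  # -> [[0, 2, 1, 1], [0, 0, 1, 0]]
# DL-based methods would instead feed the raw sequences (or their embeddings) to the model.
```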

Anomaly Detection

  • Based on the log features constructed in the previous phase, anomaly detection identifies anomalous log instances (e.g., logs printed by interruption exceptions).

  • Traditional ML-based anomaly detectors often produce a prediction (anomaly or not) for the entire log sequence based on its log event count vector.

  • DL-based methods first learn normal log patterns and then determine the normality for each log event, enabling them to locate the exact log event(s) that contaminate the log event sequence, improving interpretability.

Overall, log anomaly detection is a multi-step process that involves collecting logs, parsing them into a structured format, extracting features through log partitioning and representation, and finally applying anomaly detection algorithms to identify anomalous log instances.

The choice of methods, such as traditional ML-based or DL-based approaches, depends on the specific requirements and characteristics of the system being analyzed. DL-based methods have the advantage of learning log semantics and providing more interpretable results by locating specific anomalous log events within a sequence.

Machine Learning Methods

Based on the log anomaly detection process described, the current machine learning techniques used for log analysis can be categorised into two main groups:

Traditional Machine Learning (ML) methods

  • Examples: Log Clustering, Principal Component Analysis (PCA), Invariant Mining, Logistic Regression, Decision Trees, Support Vector Machines (SVM)

  • These methods often rely on handcrafted features, such as log event count vectors, where each dimension represents a log event and the value counts its occurrence in a log sequence.

  • They typically produce a prediction (anomaly or not) for the entire log sequence based on the extracted features.
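
For intuition, here is a minimal sketch of such a traditional baseline: a logistic regression trained on handcrafted log event count vectors. The toy vectors and labels are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: each row is a log event count vector for one log sequence,
# each label marks whether that sequence was anomalous (1) or normal (0).
X = np.array([
    [5, 2, 0, 1],   # normal sequence
    [4, 3, 0, 0],   # normal sequence
    [0, 1, 7, 2],   # anomalous sequence: unusual mix of events
    [1, 0, 6, 3],   # anomalous sequence
])
y = np.array([0, 0, 1, 1])

# A classical baseline: logistic regression over handcrafted count features.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[5, 1, 0, 1], [0, 0, 8, 2]]))  # expected: [0, 1]
```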

Shortcomings of traditional ML methods

Handcrafted features: Traditional ML methods often require domain knowledge to manually design and extract relevant features from log data, which can be time-consuming and may not capture all the important information.

Limited ability to handle complex patterns: These methods may struggle to capture complex, non-linear relationships and long-term dependencies in log sequences, which can be crucial for accurate anomaly detection.

Lack of interpretability: Traditional ML methods typically produce predictions for entire log sequences, making it difficult to pinpoint the specific log events responsible for the anomalies.

Sensitivity to unseen log events: These methods often rely on a fixed set of log events and may not generalise well to unseen or evolving log patterns, requiring retraining when new log events emerge.

Existing log anomaly detection methods

Existing log anomaly detection methods can be categorised into unsupervised and supervised approaches.

The main idea behind unsupervised methods is that logs produced by a system's normal executions often exhibit stable patterns, and anomalies occur when these patterns are violated.

Supervised methods, on the other hand, require anomaly labels and learn features that distinguish abnormal samples from normal ones.

The paper introduces six state-of-the-art methods, four unsupervised and two supervised, which leverage neural networks for log anomaly detection. The choice of network architecture and loss function is crucial, as the loss guides how the model learns log patterns.

Unsupervised Methods

DeepLog

  • First work to employ LSTM for log anomaly detection

  • Learns log patterns from sequential relations of log events

  • Uses forecasting-based anomaly detection, predicting the next log event based on previous observations
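
A minimal PyTorch sketch of this forecasting idea follows: an LSTM predicts the next log event ID from a window of preceding IDs, and an event is flagged as anomalous if it is not among the model's top-k candidates. The dimensions, window size, and top-k value are assumptions rather than DeepLog's exact configuration, and the model would first be trained with cross-entropy on sequences from normal executions.

```python
import torch
import torch.nn as nn

class NextEventLSTM(nn.Module):
    """Minimal forecasting model in the spirit of DeepLog: predict the next log event ID
    from a window of preceding event IDs (details are assumptions, not the paper's setup)."""

    def __init__(self, num_events: int, embed_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(num_events, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_events)

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (batch, window_size) of event IDs
        out, _ = self.lstm(self.embed(windows))
        return self.head(out[:, -1])          # logits over the next event ID

def is_anomalous(model: NextEventLSTM, window: torch.Tensor, next_event: int, k: int = 3) -> bool:
    """Flag the next event as anomalous if it is not among the model's top-k predictions."""
    with torch.no_grad():
        logits = model(window.unsqueeze(0))
        topk = torch.topk(logits, k, dim=-1).indices[0]
    return next_event not in topk.tolist()

model = NextEventLSTM(num_events=50)          # untrained here; it would be trained on normal sequences
window = torch.tensor([3, 7, 7, 12, 3])       # hypothetical recent event IDs
print(is_anomalous(model, window, next_event=42))
```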

LogAnomaly

  • Considers semantic information of logs using template2Vec

  • Generates distributed representations of words in log templates by considering synonyms and antonyms

  • Adopts forecasting-based anomaly detection with an LSTM model
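
The sketch below conveys the semantic-embedding idea in a heavily simplified form: a log template is represented by averaging word vectors. The toy vectors are invented for illustration; template2Vec additionally trains the vectors so that synonyms end up close together and antonyms far apart, which is not shown here.

```python
import numpy as np

# Hypothetical pre-trained word vectors (2-dimensional only to keep the example readable).
word_vectors = {
    "received": np.array([0.2, 0.9]),
    "block":    np.array([0.7, 0.1]),
    "deleted":  np.array([-0.6, 0.3]),
}

def template_embedding(template: str) -> np.ndarray:
    """Average the word vectors of a log template to obtain a semantic representation."""
    vectors = [word_vectors[w] for w in template.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(2)

print(template_embedding("Received block <>"))   # mean of the 'received' and 'block' vectors
```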

Logsy (Transformer-based method)

  • First work to use the Transformer for log anomaly detection

  • Learns log representations to distinguish between normal and abnormal samples

  • Employs multi-head self-attention mechanism

  • Follows forecasting-based anomaly detection
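
As a rough sketch of a Transformer-based detector, the model below applies multi-head self-attention over an event window and outputs logits over the next event ID. The hyperparameters are assumptions, positional encoding is omitted for brevity, and the sketch does not reproduce Logsy's exact training objective.

```python
import torch
import torch.nn as nn

class TransformerForecaster(nn.Module):
    """Sketch: multi-head self-attention over the event window, then a logit over the next event ID."""

    def __init__(self, num_events: int, d_model: int = 32, nhead: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(num_events, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.head = nn.Linear(d_model, num_events)

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(windows))   # (batch, window, d_model)
        return self.head(hidden[:, -1])              # logits over the next event ID

logits = TransformerForecaster(num_events=50)(torch.tensor([[3, 7, 7, 12, 3]]))
print(logits.shape)   # torch.Size([1, 50])
```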

Autoencoder

  • Uses autoencoder combined with isolation forest

  • Autoencoder learns representations for normal log event sequences

  • Anomalies detected based on reconstruction loss
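
A minimal sketch of reconstruction-based detection: an autoencoder compresses and reconstructs a log-sequence feature vector, and a large reconstruction error signals a deviation from normal patterns. Dimensions and input features are assumptions; the studied method additionally combines the learned representations with an isolation forest.

```python
import torch
import torch.nn as nn

class LogAutoencoder(nn.Module):
    """Sketch: compress a log-sequence feature vector and reconstruct it; a large
    reconstruction error suggests the sequence deviates from normal patterns."""

    def __init__(self, input_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = LogAutoencoder(input_dim=50)
x = torch.rand(1, 50)                                   # hypothetical feature vector of one sequence
reconstruction_error = nn.functional.mse_loss(model(x), x).item()
print(reconstruction_error)                             # compared against a threshold to decide anomaly
```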

Supervised Methods

LogRobust

  • Addresses log instability issue (unseen log events) by extracting semantic information using word vectors

  • Incorporates attention mechanism into a Bi-LSTM model to assign different weights to log events

  • Generates classification results (anomaly or not) using a softmax layer
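
The following sketch captures the structure described above: a Bi-LSTM over semantic log-event vectors, an attention layer that weights the time steps, and a two-class output. All dimensions are assumptions rather than LogRobust's exact settings.

```python
import torch
import torch.nn as nn

class AttentionBiLSTM(nn.Module):
    """Sketch in the spirit of LogRobust: Bi-LSTM over semantic event vectors,
    attention over time steps, and a binary classification head."""

    def __init__(self, event_dim: int = 300, hidden_dim: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(event_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, 2)     # normal vs anomalous

    def forward(self, sequence: torch.Tensor) -> torch.Tensor:
        # sequence: (batch, seq_len, event_dim) of semantic event embeddings
        hidden, _ = self.bilstm(sequence)                        # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attention(hidden), dim=1)   # per-step attention weights
        summary = (weights * hidden).sum(dim=1)                  # weighted sum over time
        return self.classifier(summary)                          # logits for the softmax layer

logits = AttentionBiLSTM()(torch.rand(1, 20, 300))
print(logits.shape)   # torch.Size([1, 2])
```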

CNN

  • First work to explore the feasibility of CNN for log-based anomaly detection

  • Constructs log event sequences using identifier-based partitioning

  • Proposes logkey2vec embedding method to create a trainable matrix for convolution calculation

  • Applies different convolutional layers and concatenates their outputs for prediction
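
A compact sketch of this design: a trainable embedding matrix (standing in for logkey2vec), parallel convolutions with different filter heights, and their pooled outputs concatenated for the final prediction. Filter sizes and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LogCNN(nn.Module):
    """Sketch of a CNN-based detector: trainable event embeddings, parallel convolutions
    with different filter heights, concatenated for binary classification."""

    def __init__(self, num_events: int, embed_dim: int = 32, channels: int = 16):
        super().__init__()
        self.logkey2vec = nn.Embedding(num_events, embed_dim)    # trainable embedding matrix
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, channels, kernel_size=(h, embed_dim)) for h in (3, 4, 5)]
        )
        self.classifier = nn.Linear(3 * channels, 2)

    def forward(self, sequence: torch.Tensor) -> torch.Tensor:
        # sequence: (batch, seq_len) of log event IDs
        x = self.logkey2vec(sequence).unsqueeze(1)               # (batch, 1, seq_len, embed_dim)
        pooled = [conv(x).squeeze(3).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))         # logits: normal vs anomalous

logits = LogCNN(num_events=50)(torch.randint(0, 50, (1, 30)))
print(logits.shape)   # torch.Size([1, 2])
```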

The unsupervised methods, particularly those incorporating semantic information (e.g., LogAnomaly and Logsy), demonstrate the importance of understanding the meaning behind log events for accurate anomaly detection.

The supervised methods, LogRobust and CNN, introduce techniques such as attention mechanisms and convolutional layers to improve the model's ability to distinguish anomalies from normal events.

Experiment

The authors designed their experiment to evaluate the accuracy, robustness, and efficiency of six state-of-the-art deep learning-based log anomaly detection methods on two widely-used datasets, HDFS and BGL.

They aimed to provide a comprehensive comparison of these methods and address the lack of publicly available tools for industrial usage.

Experiment Design

  1. Dataset selection: The authors chose HDFS and BGL datasets from Loghub, a large collection of system log datasets. These datasets were selected due to their popularity and different characteristics (e.g., HDFS logs contain identifiers, while BGL logs do not).

  2. Evaluation metrics: The authors employed precision, recall, and F1 score to measure the accuracy of the anomaly detection methods, as log anomaly detection is a binary classification problem (the metric definitions are given after this list).

  3. Experiment setup: The experiments were conducted on a machine with specific hardware configurations. The authors sorted logs chronologically, applied log partitioning to generate log sequences, and shuffled them. They used 80% of the data for training and 20% for testing. For unsupervised methods, anomalies were removed from the training data to learn normal log patterns.
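
The three metrics in item 2 follow their standard definitions, where TP, FP, and FN are the numbers of correctly reported anomalies, normal sequences wrongly reported as anomalies, and missed anomalies, respectively:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```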

Challenges

Unseen logs: The presence of unprecedented logs in the testing data posed a significant challenge to the anomaly detection methods, especially unsupervised ones.

Anomalies in training data: Even a small portion of anomalies in the training data could quickly deteriorate the performance of forecasting-based methods.

Efficiency: Deep learning-based methods generally required more time for training and testing compared to traditional machine learning-based methods.

Main Interests

The authors were most interested in:

  1. Evaluating the accuracy of deep learning-based log anomaly detection methods and comparing them with traditional machine learning-based methods.

  2. Investigating the impact of log semantics on the accuracy and robustness of the anomaly detection methods.

  3. Assessing the robustness of the methods against unseen logs and anomalies in the training data.

  4. Comparing the efficiency of deep learning-based methods with traditional machine learning-based methods in terms of training and testing time.

By designing the experiment in this manner, the authors aimed to provide valuable insights into the strengths and weaknesses of different deep learning-based log anomaly detection methods, as well as their performance in comparison to traditional machine learning-based methods.

The findings can guide researchers and practitioners in selecting appropriate methods for their specific use cases and help bridge the gap between academic research and industrial application.

Industry Application

The authors present a case study of deploying an automated log-based anomaly detection system in production at Huawei Cloud.

They selected an optimised version of DeepLog, a highly-cited deep learning-based method, for its simplicity and superior performance.

The deployment was motivated by the impracticality of manual anomaly detection in the face of terabytes of daily log data generated by services serving hundreds of millions of users.

Deployment Architecture

The log anomaly detection pipeline consists of two stages: offline training and online serving.

Online stage

  • Kafka is used as a streaming channel for online log analytics.

  • Data producers are different services that generate raw log data at runtime, with each service corresponding to one Kafka topic for data streaming.

  • The anomaly detection model acts as the data consumer and performs anomaly detection for each service.

  • Apache Flink is used for distributed log preprocessing and anomaly detection, processing streaming data with high performance and low latency.

  • Detection results are visualised on a monitoring panel through Prometheus.

  • Engineers confirm true anomalies or flag false positives with simple clicks.
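
A heavily simplified Python sketch of the consumer side of such a pipeline is shown below, using the kafka-python client. The topic name, broker address, window size, and detector stub are all assumptions, and the actual deployment runs preprocessing and detection as distributed Apache Flink jobs rather than a single consumer script.

```python
from kafka import KafkaConsumer   # kafka-python client; the described deployment uses Apache Flink

def detect(window: list[str]) -> bool:
    """Placeholder for the trained anomaly detector run over a window of log lines."""
    return False

# Hypothetical topic name and broker address; in the described pipeline each service
# streams its raw logs to its own Kafka topic.
consumer = KafkaConsumer("service-a-logs", bootstrap_servers="kafka.internal:9092")

window: list[str] = []
for message in consumer:
    window.append(message.value.decode("utf-8"))
    if len(window) == 20:                      # assumed window size
        if detect(window):
            print("anomaly detected")          # the real pipeline pushes results to the monitoring panel
        window.clear()
```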

Offline stage

  • Raw logs are archived and maintained in Apache HDFS.

  • Logs are retrieved from HDFS for model (re)training and evaluation.

  • A threshold is set manually for alerting anomalies.

  • Engineers can trigger model retraining if they observe performance degradation on the monitoring panel.

Real-World Challenges

Despite the model's success in shedding light on automated log-based anomaly detection and reporting high-risk anomalies, the authors identified several challenges:

  1. High complexity of production logs compared to benchmark datasets.

  2. The need for periodic threshold re-determination.

  3. Concept drift due to feature upgrades and evolving log patterns.

  4. Large-volume and low-quality log data due to lack of rigorous logging guidelines.

  5. Unsatisfactory interpretability, requiring further improvement in learning logs' semantics.

  6. Incorrect model strategy, as most anomalies stem from specific error logs rather than incorrect log event orders.

  7. Labelling issues due to ambiguous cases and privacy concerns.

Future Improvements

Closer engineering collaboration

  • Establish a clear objective at the executive level and align infrastructure development, service architecture design, and engineers' mindsets.

  • Build a pipeline for log data generation, collection, labelling, and usage, with data/label sanity checks and continuous model-quality validation.

Better logging practices

  • Establish guidelines for writing logging statements, including timestamps, verbosity levels, context information, meaningful messages, template-based logging, and proper logging statement count.

Model improvement

  • Explore online learning, human-in-the-loop design, and multi-source learning (combining logs with metrics and incident tickets).

  • Address multiple aspects of logs (keywords, log event count/sequence) using ensemble learning.

  • Explore semantic relations between log events for accurate anomaly detection and automated fault localisation.

In summary, the authors' industrial case study highlights the potential and challenges of deploying deep learning-based log anomaly detection in production environments.

References

  1. "DeepLog: Anomaly detection and diagnosis from system logs through deep learning", Anomaly Detection in System Logs, Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar, 2017.

  2. "LogBERT: Log Anomaly Detection via BERT", Application of BERT for Log Anomaly Detection, Haixuan Guo, Shuhan Yuan, Xintao Wu, 2021.

  3. "Self-attentive classification-based anomaly detection in unstructured logs", Enhancing Anomaly Detection in Logs using Self-attention, Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, Jorge Cardoso, Odej Kao, 2020.

  4. "Log-based Anomaly Detection with Deep Learning: How Far Are We?", Evaluation of Deep Learning Techniques for Log-based Anomaly Detection, Van Hoang Le, Hongyu Zhang, 2022.

  5. "Robust and transferable anomaly detection in log data using pre-trained language models", Using Pre-trained Language Models for Anomaly Detection in Log Data, Harold Ott, Jasmin Bogatinovski, Alexander Acker, Sasho Nedelkoski, Odej Kao, 2021.

  6. "Log-based anomaly detection without log parsing", Innovating Anomaly Detection without Traditional Log Parsing, Van-Hoang Le, Hongyu Zhang, 2021.

  7. "A2Log: Attentive Augmented Log Anomaly Detection", Applying Attention Mechanisms for Log Anomaly Detection, Thorsten Wittkopp, Alexander Acker, Sasho Nedelkoski, et al., 2021.
