Deep Learning for Anomaly Detection in Log Data: A Survey
Copyright Continuum Labs - 2023
This May 2023 paper is a systematic literature review that investigates the use of deep learning techniques for anomaly detection in log data.
The authors aim to provide an overview of the state-of-the-art deep learning algorithms, data pre-processing mechanisms, anomaly detection techniques, and evaluation methods used in this field.
Log data is unstructured and involves intricate dependencies, making it challenging to prepare the data for ingestion by neural networks and extract relevant features for detection.
The variety of deep learning architectures makes it difficult to select an appropriate model for a specific use case and to understand each architecture's requirements regarding input data format and properties.
The paper surveys various deep learning architectures used for log-based anomaly detection, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs).
The authors aim to provide insights into the features and challenges of different deep learning algorithms to help researchers and practitioners avoid pitfalls when developing anomaly detection techniques and selecting existing detection systems.
The paper investigates pre-processing strategies used to transform raw and unstructured log data into a format suitable for ingestion by neural networks.
A detailed understanding of these strategies is essential to use all available information in the logs and comprehend the influence of data representations on the detection capabilities.
Types of anomalies and their identification
The authors examine the types of anomalies that can be detected using deep learning techniques and how they are identified as such.
This information helps in understanding the capabilities and limitations of different deep learning approaches in detecting various types of anomalies.
The paper pays attention to relevant aspects of experiment design, including data sets, metrics, and reproducibility, to point out deficiencies in prevalent evaluation strategies and suggest remedies.
This analysis aims to improve the quality and comparability of evaluations in future research.
The authors investigate the extent to which the surveyed approaches rely on labeled data and support incremental learning.
This information is crucial for understanding the practical applicability of the methods in real-world scenarios, where labeled data may be scarce, and the ability to adapt to evolving system behavior is essential.
The paper assesses the reproducibility of the presented results in terms of the availability of source code and used data.
This analysis highlights the importance of open-source implementations and publicly available datasets for facilitating further research and enabling quantitative comparisons of different approaches.
Artificial neural networks (ANN) are inspired by biological information processing systems and consist of interconnected processing nodes arranged in layers (input, hidden, and output).
Deep learning algorithms are neural networks with multiple hidden layers.
Different architectures of deep neural networks exist, such as recurrent neural networks (RNN) for sequential input data.
Deep learning enables supervised, semi-supervised, and unsupervised learning.
Log data is a chronological sequence of events generated by applications to capture system states.
Log events are usually in textual form and can be structured, semi-structured, or unstructured.
Log events contain static parts (hard-coded strings) and variable parts (dynamic parameters).
Log parsing techniques extract log keys (templates) and parameter values for subsequent analysis.
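To illustrate this parsing step, the following minimal sketch splits a raw log line into a log key (template) and its parameter values. The log format and regular expressions are illustrative assumptions; real systems typically rely on dedicated parsers such as Drain.

```python
import re

def parse_log_line(line: str):
    """Split a raw log line into a log key (template) and its parameter values.

    The assumed format (timestamp, level, free-text message) is hypothetical.
    """
    header = re.match(
        r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<msg>.*)$", line
    )
    if header is None:
        return None
    message = header.group("msg")
    # Treat block IDs and numbers as the variable parts; replace them to obtain the template.
    params = re.findall(r"blk_\w+|\d+", message)
    template = re.sub(r"blk_\w+|\d+", "<*>", message)
    return {"timestamp": header.group("ts"), "template": template, "parameters": params}

print(parse_log_line("2023-05-01 12:00:01 INFO Received block blk_123 of size 67108864"))
# -> {'timestamp': '2023-05-01 12:00:01', 'template': 'Received block <*> of size <*>',
#     'parameters': ['blk_123', '67108864']}
```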
Anomalies are rare or unexpected instances in a dataset that stand out from the rest of the data.
Three types of anomalies: point anomalies (independent instances), contextual anomalies (instances anomalous in a specific context), and collective anomalies (groups of instances anomalous due to their combined occurrence).
Anomaly detection can be unsupervised (no labeled data), semi-supervised (training data with only normal instances), or supervised (labeled data for both normal and anomalous instances).
Data representation: Feeding heterogeneous, unstructured log data into neural networks is non-trivial.
Data instability: As applications evolve and system behaviour patterns change, deep learning systems need to adapt and update their models incrementally.
Class imbalance: Anomaly detection assumes that normal events vastly outnumber anomalous ones; this imbalance can bias training and lead to suboptimal detection performance of neural networks.
Anomalous artifact diversity: Anomalies can affect log events and parameters in various ways, making it difficult to design generally applicable detection techniques.
Label availability: The lack of labeled anomaly instances restricts applications to semi- and unsupervised deep learning systems, which typically achieve lower detection performance than supervised approaches.
Stream processing: To enable real-time monitoring, deep learning systems need to be designed for single-pass data processing.
Data volume: The high volume of log data requires efficient algorithms to ensure real-time processing, especially on resource-constrained devices.
Interleaved logs: Recovering the original event sequences from interleaved logs is challenging when no session identifiers are available.
Data quality: Low data quality due to improper log collection or technical issues can negatively affect machine learning effectiveness.
Model explainability: Neural networks often suffer from lower explainability compared to conventional machine learning methods, making it difficult to justify decisions in response to critical system behavior or security incidents.
Based on the survey results, it is difficult to definitively state which deep learning technique stands out as the best for creating a commercial or enterprise-grade anomaly detection model, as the choice depends on various factors such as the specific use case, data characteristics, and performance requirements.
However, some insights can be drawn from the analysis of the reviewed publications:
Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), are the most commonly used architectures for anomaly detection in log data. Their ability to learn sequential event execution patterns and disclose unusual patterns as anomalies makes them well-suited for this task.
Bi-LSTM RNNs, which process sequences in both forward and backward directions, have been found to outperform regular LSTM RNNs in some experiments. This suggests that capturing bidirectional context can improve anomaly detection performance.
GRUs offer computational efficiency compared to LSTM RNNs, which can be advantageous for edge device use cases or scenarios with limited computational resources.
Autoencoders (AEs) and their variants, such as Variational Autoencoders (VAEs), Conditional Variational Autoencoders (CVAEs), and Convolutional Autoencoders (CAEs), are specifically designed for unsupervised learning. They can learn the main features from input data while neglecting noise, making them suitable for scenarios where labeled data is scarce or unavailable.
Attention mechanisms, such as those used in Transformers or as additional components in other neural networks (e.g., RNNs), have shown promise in improving classification and detection performance by weighting relevant inputs higher. This can be particularly beneficial when dealing with long sequences.
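To make the dominant RNN-based approach concrete, the following is a minimal PyTorch sketch of an LSTM that predicts the next log key from a window of preceding keys, in the spirit of the sequence models described above. The vocabulary size, layer sizes, and window length are illustrative assumptions, not values prescribed by the survey.

```python
import torch
import torch.nn as nn

class NextLogKeyLSTM(nn.Module):
    """Predict the next log key from a window of preceding log key IDs."""

    def __init__(self, num_log_keys: int, embed_dim: int = 32,
                 hidden_dim: int = 64, num_layers: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(num_log_keys, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_log_keys)

    def forward(self, key_window: torch.Tensor) -> torch.Tensor:
        # key_window: (batch, window_length) of integer log key IDs
        embedded = self.embedding(key_window)
        output, _ = self.lstm(embedded)
        # Score all candidate next keys from the last time step's hidden state.
        return self.fc(output[:, -1, :])

model = NextLogKeyLSTM(num_log_keys=50)
window = torch.randint(0, 50, (8, 10))        # batch of 8 windows, 10 events each
logits = model(window)                         # (8, 50) scores over possible next keys
probabilities = torch.softmax(logits, dim=-1)  # probability distribution per window
```

Swapping `nn.LSTM` for `nn.GRU`, or setting `bidirectional=True` to obtain a Bi-LSTM, requires only minor changes to this skeleton.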
Log Data Collection:
Collect log data from various sources, such as servers, applications, and network devices.
Ensure that log events contain relevant information, such as timestamps, event types, and event parameters.
Consider the volume and velocity of log data generation and establish appropriate mechanisms for centralized log collection and storage.
Pre-processing:
Apply parsing techniques to extract structured information from unstructured log data.
Use log parsers to derive log keys (templates) that identify unique event types, and extract the corresponding event parameter values.
Alternatively, employ token-based strategies that split log messages into lists of words, clean the data by removing special characters and stop words, and create token vectors.
Some approaches combine parsing and token-based pre-processing strategies to generate token vectors from parsed events.
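A minimal sketch of such a token-based strategy is shown below; the stop-word list and cleaning rules are illustrative assumptions.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "for", "on", "in"}  # illustrative list

def tokenize_log_message(message: str) -> list[str]:
    """Lower-case the message, strip special characters, and drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", message)
    tokens = cleaned.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(tokenize_log_message("Failed to connect to 10.0.0.5: connection refused (errno=111)"))
# -> ['failed', 'connect', '10', '0', '0', '5', 'connection', 'refused', 'errno', '111']
```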
Event Grouping:
Group log events into logical units for analysis, such as time windows or session windows.
Time-based grouping strategies include sliding time windows (overlapping) and fixed time windows (non-overlapping), which allocate log events based on their timestamps.
Session windows rely on event parameters that act as identifiers for specific tasks or processes, allowing the extraction of event sequences that depict underlying program workflows.
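Both grouping strategies reduce to a few lines of code. The sketch below assumes each event is a dictionary with a Unix timestamp and a session identifier; this schema is hypothetical.

```python
from collections import defaultdict

def fixed_time_windows(events, window_seconds=60):
    """Group events into non-overlapping fixed time windows by timestamp."""
    windows = defaultdict(list)
    for event in events:
        window_index = int(event["timestamp"] // window_seconds)
        windows[window_index].append(event)
    return windows

def session_windows(events, id_field="session_id"):
    """Group events into sessions using an identifier parameter (e.g. a block or request ID)."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event[id_field]].append(event)
    return sessions
```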
Feature Extraction:
Extract structured features from pre-processed log data to be used as input for deep learning models.
Common features include token sequences, token counts (e.g., TF-IDF), event sequences, event counts, event statistics (e.g., seasonality, message lengths, activity rates), event parameters, and event interval times.
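For example, event count vectors and TF-IDF weighted counts can be computed with scikit-learn by treating each window or session as a document of event IDs; the toy sequences below are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Each "document" is one window or session, written as a space-separated sequence of event IDs.
windows = ["E1 E2 E2 E5", "E1 E3 E5", "E2 E2 E2 E9"]  # toy data

count_features = CountVectorizer(token_pattern=r"\S+").fit_transform(windows)   # event count vectors
tfidf_features = TfidfVectorizer(token_pattern=r"\S+").fit_transform(windows)   # TF-IDF weighted counts

print(count_features.toarray())
```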
Feature Representation:
Transform extracted features into suitable vector representations for input to neural networks.
Represent event sequences as event ID sequence vectors and event counts as count vectors.
Use semantic vectors to encode context-based semantics or language statistics of log tokens or event sequences.
Apply positional embedding to capture the relative positions of elements in a sequence.
Employ one-hot encoding for categorical data, such as event types or token values.
Use embedding layers or matrices to reduce the dimensionality of sparse input data.
Consider parameter vectors to directly use the values extracted from parsed log messages, such as numeric parameters for multi-variate time-series analysis.
Explore alternative representations, such as graphs or transfer matrices, to encode dependencies between log messages.
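The sketch below illustrates two of these representations in PyTorch: one-hot encoding of event IDs and a learned embedding layer that maps the same IDs to dense, lower-dimensional vectors. The vocabulary size and embedding dimension are illustrative choices.

```python
import torch
import torch.nn.functional as F

num_event_types = 50                               # illustrative vocabulary size
event_sequence = torch.tensor([3, 17, 3, 42, 8])   # an event ID sequence vector

# One-hot encoding: sparse, high-dimensional representation of categorical event types.
one_hot = F.one_hot(event_sequence, num_classes=num_event_types).float()    # shape (5, 50)

# Embedding layer: learned dense representation that reduces dimensionality.
embedding = torch.nn.Embedding(num_event_types, embedding_dim=8)
dense = embedding(event_sequence)                                            # shape (5, 8)
```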
Ensure log data completeness and quality by capturing relevant information and implementing data validation mechanisms.
Establish a standardised log format across different sources to facilitate consistent parsing and feature extraction.
Consider the scalability and performance of log data collection and storage systems to handle large volumes of data.
Select appropriate pre-processing techniques based on the characteristics of the log data and the requirements of the anomaly detection task.
Choose event grouping strategies that align with the temporal or session-based nature of the log data and the desired granularity of analysis.
Extract meaningful features that capture the relevant patterns and dependencies in the log data for effective anomaly detection.
Experiment with different feature representation techniques to find the most suitable encoding for the specific deep learning architecture and anomaly detection approach.
Continuously monitor and update the log data collection and preparation pipeline to adapt to changes in the system and ensure the quality and relevance of the input data for the anomaly detection model.
By following these best practices and considering the key aspects of log data collection and preparation, organisations can establish a robust foundation for applying deep learning techniques to log-based anomaly detection.
It is important to tailor the process to the specific characteristics of the log data, the requirements of the anomaly detection task, and the chosen deep learning architecture to achieve optimal results.
Outliers (OUT): Single log events that do not fit the overall structure of the dataset. Outlier events are typically detected based on unusual parameter values, token sequences, or occurrence times.
Sequential (SEQ) anomalies: Detected when execution paths change, resulting in additional, missing, or differently ordered events within otherwise normal event sequences, or completely new sequences involving previously unseen event types.
Frequency (FREQ) anomalies: Consider the number of event occurrences, assuming that changes in system behavior affect the number of usual event occurrences, typically counted within time windows.
Statistical (STAT) anomalies: Based on quantitatively expressed properties of multiple log events, such as inter-arrival times or seasonal occurrence patterns, assuming that event occurrences follow specific stable distributions over time.
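As a simple illustration of frequency-based detection, the sketch below flags time windows whose event counts deviate strongly from the mean count across windows; the z-score rule and threshold are illustrative choices, not a method from the survey.

```python
import numpy as np

def frequency_anomalies(window_counts, threshold=3.0):
    """Flag windows whose event counts deviate more than `threshold` standard
    deviations from the mean count (a simple z-score rule)."""
    counts = np.asarray(window_counts, dtype=float)
    mean, std = counts.mean(), counts.std()
    z_scores = (counts - mean) / (std + 1e-9)
    return np.abs(z_scores) > threshold

# Example: the last window contains a burst of events and is flagged as anomalous.
print(frequency_anomalies([100, 98, 103, 101, 99, 400], threshold=2.0))
```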
Anomaly score: A scalar or vector of numeric values extracted from the final layer of the neural network, expressing the degree to which the input log events represent an anomaly.
Binary classification (BIN): Estimates whether the input is normal or anomalous, with the numeric output interpreted as probabilities for each class in supervised approaches.
Input vector transformations (TRA): Transform the input into a new vector space and generate clusters for normal data, detecting outliers by their large distances to cluster centres.
Reconstruction error (RE): Leverage the reconstruction error of Autoencoders, considering input samples as anomalous if they are difficult to reconstruct due to not corresponding to the normal data the network was trained on.
Multi-class classification (MC): Assigns distinct labels to specific types of anomalies, requiring supervised learning to capture class-specific patterns during training.
Probability distribution (PRD): Train models to predict the next log key following a sequence of observed log events, using a softmax function to output a probability distribution for each log key.
Numeric vectors (VEC): Consider events as numeric vectors (e.g., semantic or parameter vectors) and formulate the problem of predicting the next log event as a regression task, with the network outputting the expected event vector.
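For instance, a reconstruction-error output (RE) can be realised with a small autoencoder over event count vectors; the architecture below is a minimal sketch under assumed layer sizes rather than any specific approach from the survey.

```python
import torch
import torch.nn as nn

class CountVectorAutoencoder(nn.Module):
    """Minimal autoencoder over event count vectors; the reconstruction error
    of a sample serves as its anomaly score."""

    def __init__(self, input_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def anomaly_score(model: CountVectorAutoencoder, x: torch.Tensor) -> torch.Tensor:
    """Per-sample mean squared reconstruction error."""
    with torch.no_grad():
        reconstruction = model(x)
    return ((reconstruction - x) ** 2).mean(dim=1)
```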
Label (LAB): When the network output directly corresponds to a particular label (e.g., binary classification), an anomaly is reported for every sample that the network labels as anomalous.
Threshold (THR): For approaches that output an anomaly score, a threshold is used to differentiate between normal and anomalous samples, allowing for tuning of detection performance by finding an acceptable trade-off between true positive rate (TPR) and false positive rate (FPR).
Statistical distributions: Model the anomaly scores obtained from the network as statistical distributions (e.g., Gaussian distribution) to detect parameter vectors with errors outside specific confidence intervals as anomalous.
Top log keys (TOP): When the network output is a multi-class probability distribution for known log keys, consider the top n log keys with the highest probabilities as candidates for classification, detecting an anomaly if the actual log event type is not within the set of candidates.
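The two most common decision rules, thresholding an anomaly score (THR) and checking membership in the top n predicted log keys (TOP), reduce to a few lines; the threshold and n are tuning parameters in this sketch.

```python
import torch

def threshold_detection(anomaly_scores: torch.Tensor, threshold: float) -> torch.Tensor:
    """THR: flag samples whose anomaly score exceeds a tunable threshold."""
    return anomaly_scores > threshold

def top_k_detection(next_key_probabilities: torch.Tensor, actual_next_key: int, n: int = 9) -> bool:
    """TOP: flag an anomaly if the observed next log key is not among the
    n most probable keys predicted by the network."""
    candidates = torch.topk(next_key_probabilities, k=n).indices
    return actual_next_key not in candidates
```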
The relationship between network output and detection techniques is as follows:
BIN and MC rely on supervised learning and directly assign labels to new input samples.
RE, TRA, and VEC produce anomaly scores that are compared against thresholds.
PRD outputs are typically evaluated by checking whether the observed log key appears among the top log keys with the highest predicted probabilities.
It's important to note that there are some exceptions to these general patterns, such as approaches that support semi-supervised training through probabilistic labels or supervised approaches that rely on reconstruction errors.
By understanding the different anomaly types, network output formats, and detection methods, researchers and practitioners can better design and select appropriate deep learning-based anomaly detection techniques for their specific log data analysis tasks.
Datasets
The review reveals that the vast majority of evaluations rely on only four datasets: HDFS, BGL, Thunderbird, and OpenStack.
These datasets come from various use cases, such as high-performance computing, virtual machines, and operating systems. Some datasets include labeled anomalies (e.g., failures, intrusions), while others lack anomaly labels.
However, the limited number of widely-used datasets raises concerns about the generalisability and real-world applicability of the proposed anomaly detection approaches.
To address this issue, researchers could explore creating synthetic datasets using LLMs.
LLMs, such as GPT-3 or its variants, have shown capabilities in generating coherent and contextually relevant text.
By leveraging these models, researchers could potentially create synthetic log data that mimics the characteristics of real-world log events. Here's a possible approach:
Collect a diverse set of real-world log data from various systems and applications.
Preprocess the log data to extract relevant templates, parameters, and structures.
Fine-tune an LLM on the preprocessed log data, allowing it to learn the patterns, distributions, and relationships between log events.
Use the fine-tuned LLM to generate synthetic log events by providing it with appropriate prompts and constraints.
Inject anomalies into the generated log events based on predefined anomaly types and distributions.
Validate the generated log data by comparing its statistical properties and patterns with real-world log data.
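As an illustrative sketch of the generation step, the snippet below prompts a causal language model with a few log lines and lets it continue the sequence. The publicly available gpt2 checkpoint stands in for a model fine-tuned on preprocessed log data (steps 1-3); the prompt format and sampling settings are assumptions, and fine-tuning, anomaly injection, and validation are not shown.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" stands in here for a model that has been fine-tuned on preprocessed log data.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Seed the generation with a few real (or template-derived) log lines as a prompt.
prompt = (
    "2023-05-01 12:00:01 INFO dfs.DataNode: Receiving block blk_3587 src: /10.0.0.5\n"
    "2023-05-01 12:00:02 INFO dfs.DataNode: Received block blk_3587 of size 67108864\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True,        # sample to obtain varied synthetic log lines
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
synthetic_logs = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(synthetic_logs)
```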
By creating synthetic log data using LLMs, researchers can:
Overcome the limitation of relying on a few publicly available datasets.
Generate large-scale datasets with specific characteristics and anomaly distributions.
Evaluate the robustness and generalizability of anomaly detection approaches across diverse log data scenarios.
Protect sensitive or proprietary log data by generating synthetic datasets for public benchmarking.
However, it's essential to carefully validate the quality and realism of the LLM-generated log data to ensure that it effectively captures the complexities and nuances of real-world log events.
Collaboration between domain experts and machine learning researchers would be crucial in this process.