The sGMRFmix package provides an implementation of the algorithm presented by Ide et al. (2016), a novel anomaly detection method for multivariate noisy sensor data. It can automatically handle multiple operational modes and compute variable-wise anomaly scores.
First, we prepare multivariate training data that contains no anomalous observations. Here we generate data that consists of four parts and repeats two operational modes.
library(sGMRFmix)
set.seed(314)
train_data <- generate_train_data()
plot_multivariate_data(train_data)
Second, we prepare multivariate test data that contains some anomalous values. Here we generate data that consists of 500 normal observations followed by 500 anomalous observations. The normal part of the test data exhibits the same two operational modes seen in the training data. Note that the variable x5 shows no difference between the normal and anomalous parts.
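We generate and plot the test data with the package's helper functions (both appear again later in this document):
test_data <- generate_test_data()
plot_multivariate_data(test_data)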
We input the training data to the sGMRFmix() function to learn the two operational modes. Then we compute variable-wise anomaly scores for the test data using the result.
fit <- sGMRFmix(train_data, K = 7, rho = 0.8, verbose = FALSE)
anomaly_score <- compute_anomaly_score(fit, test_data)
plot_multivariate_data(anomaly_score, fix_scale = TRUE) + ylim(NA, 50)
You can see that high anomaly scores appear in the latter part of the test data, and that the variable x5 shows no high anomaly scores.
You can install the sGMRFmix package from CRAN.
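install.packages("sGMRFmix")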
You can also install the package from GitHub.
install.packages("devtools") # if you have not installed the devtools package
devtools::install_github("hoxo-m/sGMRFmix")
The source code for the sGMRFmix package is available on GitHub at https://github.com/hoxo-m/sGMRFmix.
The sGMRFmix package mainly provides two functions: sGMRFmix() to fit the model and compute_anomaly_score() to compute anomaly scores.
library(sGMRFmix)
fit <- sGMRFmix(train_data, K, rho)
anomaly_score <- compute_anomaly_score(fit, test_data)
There are two hyperparameters: K, the maximum number of mixture components, and rho, a constant that multiplies the penalty term in the model. You only need to set K to a sufficiently large number, because the algorithm identifies the major dependency patterns in the data via the sparse mixture model. On the other hand, you should tune rho to an optimal value to maximize anomaly detection performance.
To fit the model, you must prepare two kinds of data: training data that contains no anomalies, and labeled test data for tuning the hyperparameter rho. The package provides several functions to generate synthetic data.
set.seed(314)
train_data <- generate_train_data()
head(train_data)
#> x1 x2 x3 x4 x5
#> 1 -1.8269027 1.0745443 0.56970302 -1.2568921 0.08997801
#> 2 -1.6120444 0.2252822 0.74793059 -0.4788593 -1.25022520
#> 3 -0.6661666 1.2513592 -0.87397154 -0.3561145 -0.59305259
#> 4 -2.5561030 -1.1858031 0.08327536 0.6284850 0.12603220
#> 5 -1.3413068 0.7235066 1.28744741 -0.9630511 -1.10169444
#> 6 -0.2811875 1.4805043 0.33644955 -0.5148741 -0.20605220
test_data <- generate_test_data()
head(test_data)
#> x1 x2 x3 x4 x5
#> 1 -0.1534335 1.1129516 1.61478168 -1.15048817 -0.42104579
#> 2 -0.6737145 1.3716102 2.30689870 -1.41350416 -0.26243160
#> 3 -1.0491129 -0.2908006 3.04536973 -3.15425980 -0.46513038
#> 4 0.3412411 2.3378319 -0.56224407 0.51954178 0.05482343
#> 5 0.2378417 1.6324559 1.63852619 -1.11463852 1.87995270
#> 6 -0.3792480 0.3993461 -0.03158052 0.06132161 -1.27252108
test_labels <- generate_test_labels()
head(test_labels)
#> x1 x2 x3 x4 x5
#> 1 FALSE FALSE FALSE FALSE FALSE
#> 2 FALSE FALSE FALSE FALSE FALSE
#> 3 FALSE FALSE FALSE FALSE FALSE
#> 4 FALSE FALSE FALSE FALSE FALSE
#> 5 FALSE FALSE FALSE FALSE FALSE
#> 6 FALSE FALSE FALSE FALSE FALSE
Also, the package provides a function to visualize these data.
plot_multivariate_data(test_data, label = test_labels)
The package provides the function sGMRFmix() to fit the model, named sparse Gaussian Markov random field mixtures (Ide et al., 2016). It can automatically handle multiple operational modes and compute variable-wise anomaly scores.
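We fit the model and print the result; the call below matches the one echoed in the output:
fit <- sGMRFmix(train_data, K = 7, rho = 0.8, verbose = FALSE)
fit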
#>
#> Call:
#> sGMRFmix(x = train_data, K = 7, rho = 0.8, verbose = FALSE)
#>
#> Data: 1000 x 5
#> Parameters:
#> K: 7
#> rho: 0.8
#> Estimated:
#> Kest: 2
#> pi: 0.502 0.498
#> m, A, theta, H, mode
Although we set K to 7 when fitting the model, the number of mixture components Kest = 2 has been obtained by the sparse estimation. The estimated weights of the mixtures are displayed as pi; they are both near 0.5. The result also contains other estimated parameters such as m, A, and theta.
Tuning the hyperparameter rho
You can fit the model without labeled data, but you should prepare labeled test data to tell the model what the anomalies are. To do that, you have to tune the hyperparameter rho. To avoid overfitting, you can use cross-validation. We measure the anomaly detection performance by ROC-AUC.
# split the test data and its labels into 5 blocks for cross-validation
n_split <- 5
split_block <- sample(n_split, size = nrow(test_data), replace = TRUE)
split_test_data <- split(test_data, split_block)
split_test_labels <- split(test_labels, split_block)
rho_candidates <- 10^seq(-1, 1, length.out = 10)
library(ModelMetrics)
df <- data.frame()
for (rho in rho_candidates) {
  fit <- sGMRFmix(train_data, K = 7, rho = rho, verbose = FALSE)
  auc <- double(n_split)
  for (i in seq_len(n_split)) {
    anomaly_score <- compute_anomaly_score(fit, split_test_data[[i]])
    # ROC-AUC of the variable-wise scores against the labels for this block
    auc[i] <- auc(unlist(split_test_labels[[i]]), unlist(anomaly_score))
  }
  df <- rbind(df, data.frame(rho = rho, auc = auc))
}
library(ggplot2)
ggplot(df, aes(rho, auc)) + geom_point() +
  stat_summary(fun = mean, geom = "line", color = "red") + scale_x_log10()
library(dplyr)
df %>% group_by(rho) %>% summarise(mean_auc = mean(auc)) %>%
  mutate(max = ifelse(max(mean_auc) == mean_auc, "***", "."))
#> # A tibble: 10 × 3
#> rho mean_auc max
#> <dbl> <dbl> <chr>
#> 1 0.1 0.576 .
#> 2 0.167 0.580 .
#> 3 0.278 0.559 .
#> 4 0.464 0.569 .
#> 5 0.774 0.659 ***
#> 6 1.29 0.656 .
#> 7 2.15 0.652 .
#> 8 3.59 0.645 .
#> 9 5.99 0.635 .
#> 10 10 0.617 .
The optimal rho value, which yields the best anomaly detection performance, is 0.774.
We have obtained the optimal hyperparameter rho. Next, let us find the optimal threshold value for the anomaly scores. Here we measure the anomaly detection performance by F-measure.
optimal_rho <- 0.774
fit <- sGMRFmix(train_data, K = 7, rho = optimal_rho, verbose = FALSE)
threshold_candidates <- 10^seq(-1, 1, length.out = 100)
df <- data.frame()
for (i in seq_len(n_split)) {
  anomaly_score <- compute_anomaly_score(fit, split_test_data[[i]])
  f_measure <- double(length(threshold_candidates))
  for (j in seq_along(threshold_candidates)) {
    f1 <- f1Score(unlist(split_test_labels[[i]]), unlist(anomaly_score),
                  cutoff = threshold_candidates[j])
    f_measure[j] <- f1
  }
  df <- rbind(df, data.frame(cutoff = threshold_candidates, f_measure = f_measure))
}
ggplot(df, aes(cutoff, f_measure)) + geom_point() +
  stat_summary(fun = mean, geom = "line", color = "red") + scale_x_log10()
df %>% group_by(cutoff) %>%
  summarise(mean_f_measure = mean(f_measure)) %>%
  filter(mean_f_measure == max(mean_f_measure))
#> # A tibble: 1 × 2
#> cutoff mean_f_measure
#> <dbl> <dbl>
#> 1 1.71 0.578
We found that the optimal threshold is 1.71. We can use this value for anomaly detection.
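As a quick sketch, we can flag the observations whose variable-wise scores exceed this threshold, reusing the plotting helper shown above:
anomaly_score <- compute_anomaly_score(fit, test_data)
is_anomaly <- anomaly_score > 1.71
plot_multivariate_data(test_data, label = is_anomaly)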
In the above example, you might think the F-measure is low. There is another approach if you are interested in whether anomalies occur densely rather than individually: you can calculate a moving average (also called a rolling mean) of the anomaly scores by passing a window size.
window_size <- 20
df <- data.frame()
for (i in seq_len(n_split)) {
  anomaly_score <- compute_anomaly_score(fit, split_test_data[[i]], window_size)
  f_measure <- double(length(threshold_candidates))
  for (j in seq_along(threshold_candidates)) {
    f1 <- f1Score(unlist(split_test_labels[[i]]), unlist(anomaly_score),
                  cutoff = threshold_candidates[j])
    f_measure[j] <- f1
  }
  df <- rbind(df, data.frame(cutoff = threshold_candidates, f_measure = f_measure))
}
ggplot(df, aes(cutoff, f_measure)) + geom_point() +
  stat_summary(fun = mean, geom = "line", color = "red") + scale_x_log10()
df %>% group_by(cutoff) %>%
  summarise(mean_f_measure = mean(f_measure)) %>%
  filter(mean_f_measure == max(mean_f_measure))
#> # A tibble: 1 × 2
#> cutoff mean_f_measure
#> <dbl> <dbl>
#> 1 2.72 0.927
anomaly_scores <- compute_anomaly_score(fit, test_data, window_size)
is_anomaly <- anomaly_scores > 2.72
plot_multivariate_data(test_data, label = is_anomaly)
You can see that we obtained an anomaly detector with high performance.
Above, the model has identified that the synthetic data consists of two operational modes. We can see this as follows:
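The print output earlier reports the estimated number of modes as Kest. Assuming the fitted object exposes it as a list element, as the parameter list in the print output suggests, we can check it directly:
fit$Kest
#> [1] 2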
We can also see the weights indicating how often these modes appear.
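Again assuming the element name from the print output, the mixture weights are:
fit$pi
#> [1] 0.502 0.498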
Furthermore, we can also see the structure of each mode.
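As a minimal sketch, assuming fit$A holds the list of estimated sparse precision matrices for the modes (A appears in the parameter list of the print output), their zero patterns show each mode's dependency structure:
# inspect the (assumed) precision matrix of the first mode
round(fit$A[[1]], 2)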
fit$mode is the mode assigned to each observation in the training data.
head(fit$mode, 10)
#> [1] 1 1 1 1 1 1 1 1 1 1
plot_multivariate_data(train_data, label = fit$mode, guide_title = "Mode")
Using it, you can see the correlation structure for each mode.
inds_mode1 <- c(1:250, 501:750)
true_mode1_values <- train_data[inds_mode1, ]
estimated_mode1_values <- train_data[fit$mode == 1, ]
pairs(true_mode1_values, main="True Mode 1 Structure")
pairs(estimated_mode1_values, main="Estimated Mode 1 Structure")
inds_mode2 <- c(251:500, 751:1000)
true_mode2_values <- train_data[inds_mode2, ]
estimated_mode2_values <- train_data[fit$mode == 2, ]
pairs(true_mode2_values, main="True Mode 2 Structure")
pairs(estimated_mode2_values, main="Estimated Mode 2 Structure")
In reality, the true structure is unknown. You should check the estimated structure and consider whether it is reasonable.
You can also inspect the structure of the anomalous state. Comparing it with the normal operational modes is helpful for investigating what caused the anomalies.
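For example, since the latter 500 observations of the test data form the anomalous part, a minimal sketch of such a comparison is:
# the anomalous half of the test data (rows 501-1000)
anomaly_values <- test_data[501:1000, ]
pairs(anomaly_values, main = "Anomalous State Structure")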