Multivariate Unsupervised Machine Learning for Anomaly Detection in Enterprise Applications

고슴군 2019. 10. 7. 17:52

The paper proposes LOF and DBSCAN as outlier detection methods for time-series data.

Outlier detection algorithm using DBSCAN for time-series data

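Step 1: Fit DBSCAN to the data.

Step 2: Select the cluster with the most points assigned to it and define it as the normal system state. The center of that cluster serves as the reference point for computing the anomaly index.

Step 3: Compute the distance between each point and the reference point. This distance is the anomaly index.

Step 4: Compute how much each dimension contributes to the anomaly index, that is, which dimensions are anomalous enough to push the index up. (This step is supplementary.)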
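The four steps can be sketched roughly as follows with scikit-learn. This is only a minimal sketch, not the paper's implementation: the function name `dbscan_anomaly_index`, the `eps` and `min_samples` defaults, the use of the cluster mean as the "center", Euclidean distance, and the squared-difference share used as the per-dimension contribution are all my assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_anomaly_index(X, eps=0.5, min_samples=5):
    """Return (anomaly_index, contributions, reference_point) for the rows of X.

    Hypothetical helper: eps/min_samples are placeholder values, not the paper's.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: fit DBSCAN; label -1 marks noise points.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    clustered = labels[labels >= 0]
    if clustered.size == 0:
        raise ValueError("DBSCAN found no cluster; adjust eps or min_samples")

    # Step 2: the most populated cluster is treated as the normal system state.
    # Assumption: its "center" is the mean of its member points.
    normal_label = np.bincount(clustered).argmax()
    reference = X[labels == normal_label].mean(axis=0)

    # Step 3: Euclidean distance to the reference point is the anomaly index.
    diff = X - reference
    anomaly_index = np.linalg.norm(diff, axis=1)

    # Step 4 (assumption): each dimension's share of the squared distance.
    sq = diff ** 2
    total = sq.sum(axis=1, keepdims=True)
    contributions = np.divide(sq, total, out=np.zeros_like(sq), where=total > 0)
    return anomaly_index, contributions, reference


# Synthetic example: one dense blob plus a single point far away along dimension 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(200, 2)), [[0.0, 5.0]]])
index, contrib, ref = dbscan_anomaly_index(X, eps=0.5, min_samples=5)
print(index.argmax(), contrib[-1])
```

In this toy data the injected point gets the largest index, and almost all of its contribution falls on the second dimension. With real multivariate metrics the features would probably need to be scaled first, since both DBSCAN and the distance in step 3 depend on the units of each dimension.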

Outlier detection algorithm using LOF for time-series data

The score computed by LOF is used directly. In the paper's experiments, MinPts = 200 gave the best performance.
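A minimal sketch of LOF scoring, assuming scikit-learn's `LocalOutlierFactor`, where `n_neighbors` plays the role of MinPts. The helper name `lof_scores` and the synthetic data are mine; only the value 200 comes from the paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def lof_scores(X, min_pts=200):
    """Return one LOF score per row of X (around 1 = inlier, larger = more anomalous)."""
    X = np.asarray(X, dtype=float)
    # The neighborhood size must be smaller than the number of samples.
    k = min(min_pts, len(X) - 1)
    lof = LocalOutlierFactor(n_neighbors=k).fit(X)
    # scikit-learn stores the negated LOF, so flip the sign back.
    return -lof.negative_outlier_factor_


# Synthetic example: a Gaussian blob plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(500, 2)), [[8.0, 8.0]]])
scores = lof_scores(X, min_pts=200)
print(scores.argmax(), scores[-1])
```

Scores close to 1 mean a point is about as dense as its neighbors, and the distant point should stand out with a clearly larger value.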

My Opinion

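Both DBSCAN and LOF detected outliers to some extent, but the results were usable only as a rough guideline.

The main reason is that performance depended on the distribution of the data: on some distributions the methods worked well, and on others they did not. In fact, some types of distributions in the data I used were very unusual.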

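Reason 1: The DBSCAN-based method computes the outlier index from the distance to the center. However, clusters produced by DBSCAN can have arbitrary shapes, so computing the index from a simple distance to the center is not justified. The anomaly score should instead be computed in some other way, for example from the distance to the outermost points of the cluster.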
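To make the concern concrete, here is one possible shape-aware variant. It is neither the paper's method nor something I evaluated: it scores each point by its distance to the nearest core sample of the largest cluster instead of the distance to the center. The function name `nearest_core_distance`, the parameter values, and the toy data are all assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def nearest_core_distance(X, eps=0.5, min_samples=5):
    """Distance from each row of X to the closest core sample of the largest cluster."""
    X = np.asarray(X, dtype=float)
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    labels = db.labels_
    clustered = labels[labels >= 0]
    if clustered.size == 0:
        raise ValueError("DBSCAN found no cluster; adjust eps or min_samples")
    normal_label = np.bincount(clustered).argmax()

    # Core samples that belong to the largest ("normal") cluster.
    core_idx = db.core_sample_indices_
    core = X[core_idx[labels[core_idx] == normal_label]]

    # Pairwise distances (n_points x n_core); fine for a sketch, heavy for big data.
    d = np.linalg.norm(X[:, None, :] - core[None, :, :], axis=2)
    return d.min(axis=1)


# Elongated cluster along the x-axis plus one point sitting off the line.
line = np.column_stack([np.linspace(-5.0, 5.0, 201), np.zeros(201)])
X = np.vstack([line, [[0.0, 2.0]]])

center_dist = np.linalg.norm(X - line.mean(axis=0), axis=1)
core_dist = nearest_core_distance(X, eps=0.5, min_samples=5)
print(center_dist.argmax(), core_dist.argmax())
```

In this toy case the center-based index ranks an end of the line (a perfectly normal cluster member) as the most anomalous point, while the nearest-core distance picks out the point that actually lies off the cluster.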

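Reason 2: For LOF, the value of k is hard to choose, and the results are sensitive to it. Following the paper that introduced LOF, I applied the maximum score over a range of k, but choosing the range of k to search is difficult and the search takes a long time.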
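The maximum-over-k idea can be sketched like this, again assuming scikit-learn. The range bounds `k_min`, `k_max` and the `step` are hypothetical placeholders; picking them is exactly the hard part mentioned above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def max_lof_over_range(X, k_min=10, k_max=50, step=5):
    """Return, for each row of X, the largest LOF score over a range of k values."""
    X = np.asarray(X, dtype=float)
    # Keep only neighborhood sizes that are valid for this sample count.
    ks = [k for k in range(k_min, k_max + 1, step) if k < len(X)]
    if not ks:
        raise ValueError("no valid k in the requested range")

    # One full LOF fit per k: this loop is where the run time goes.
    scores = np.stack([
        -LocalOutlierFactor(n_neighbors=k).fit(X).negative_outlier_factor_
        for k in ks
    ])
    return scores.max(axis=0)


# Synthetic example: a Gaussian blob plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(300, 2)), [[8.0, 8.0]]])
max_scores = max_lof_over_range(X, k_min=10, k_max=50, step=10)
print(max_scores.argmax(), max_scores[-1])
```

Each k requires its own neighbor search and LOF computation, so the cost grows with the number of k values tried, which matches the long run time noted above.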

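Nevertheless, both methods reliably found the points that anyone would recognize as outliers by eye. They seem usable in practice, but the data distribution needs to be clean. (For distributions where outliers were not detected well, no algorithm I tried detected them well.)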

[Reference] Elsner, Daniel, et al. "Multivariate Unsupervised Machine Learning for Anomaly Detection in Enterprise Applications." Proceedings of the 52nd Hawaii International Conference on System Sciences. 2019.
