Merge pull request #32 from inexxt/master
[ticket_24] Overview of the paper describing similar project
bdfhjk committed Mar 18, 2016
2 parents 817cc0c + 427c70f commit ee3e737
Showing 2 changed files with 41 additions and 0 deletions.
Binary file added docs/AAlog_paper_analysis_ticket24.pdf
Binary file not shown.
41 changes: 41 additions & 0 deletions docs/AAlog_paper_analysis_ticket24.txt
@@ -0,0 +1,41 @@
##### Automatic Log Analysis using Machine Learning
Applies machine learning techniques to do automated log analysis. Compares several variants of clustering, artificial neural network algorithms and data preprocessing.
http://uu.diva-portal.org/smash/get/diva2:667650/FULLTEXT01.pdf

Overview of the paper:
Published Nov 2013, some time ago; it may be worth looking for something newer

Works on unstructured text logs, in both mixed and single configurations (mixed = logs from different sources), as in DPCS

Text preprocessing tips:
1) Replace timestamps. Use a special symbol to replace the whole timestamp before each message.
2) Replace digits. Use a special symbol to replace any digit in the log file.
3) Lowercase. Change all upper-case letters to lower case.
4) Remove special characters. Remove all special characters, including punctuation, and keep only letters and digits.
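
The four tips above can be sketched as a small pipeline. This is only an illustration, not the paper's code: the timestamp layout, the placeholder word `timestamp` and the use of `0` as the digit placeholder are assumptions made here so that the placeholders survive steps 3 and 4.

```python
import re

# Assumed timestamp layout ("2016-03-18 12:34:56" at the start of a line);
# real logs may need a different pattern.
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

# Hypothetical placeholders, chosen so that lowercasing and the
# special-character filter leave them intact.
TIMESTAMP_TOKEN = "timestamp"
DIGIT_TOKEN = "0"


def preprocess_line(line):
    line = TIMESTAMP_RE.sub(TIMESTAMP_TOKEN, line)  # 1) replace the timestamp
    line = re.sub(r"\d", DIGIT_TOKEN, line)         # 2) replace every digit
    line = line.lower()                             # 3) lower case
    line = re.sub(r"[^a-z0-9\s]", "", line)         # 4) keep ASCII letters, digits, whitespace
    return " ".join(line.split())                   # collapse runs of whitespace


print(preprocess_line("2016-03-18 12:34:56 ERROR: Disk /dev/sda1 is 95% full!"))
```

Note that step 4 as written here also drops non-ASCII letters, which may or may not be wanted for DPCS logs.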

Features:
Manually created; there is a shortage of expert knowledge
Char bigram, word bigram, word count, timestamp stats (different metrics on differences between subsequent timestamps in the log)
TF (term frequency, i.e. normalised word count) + IDF (inverse document frequency: importance of a word, based on how frequently it is used across logs)
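
A rough sketch of how these features could be computed with the standard library only. The function names are made up for this note, the IDF uses the plain log(N / df) form, and the timestamp stats are limited to min/max/mean of the gaps; the paper may use other variants.

```python
import math
from collections import Counter


def char_bigrams(text):
    # Counts of every pair of adjacent characters.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def word_bigrams(text):
    # Counts of every pair of adjacent words.
    words = text.split()
    return Counter(zip(words, words[1:]))


def term_frequency(text):
    # Word count normalised by the number of words in the log.
    words = text.split()
    return {w: c / len(words) for w, c in Counter(words).items()}


def tf_idf(logs):
    # IDF assumed here as log(N / df), where df = number of logs containing the word.
    df = Counter()
    for log in logs:
        df.update(set(log.split()))
    idf = {w: math.log(len(logs) / d) for w, d in df.items()}
    return [{w: tf * idf[w] for w, tf in term_frequency(log).items()} for log in logs]


def timestamp_gap_stats(timestamps):
    # Simple metrics on the differences between subsequent timestamps (in seconds).
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return {}
    return {"min": min(gaps), "max": max(gaps), "mean": sum(gaps) / len(gaps)}
```

With this IDF, a word that appears in every log gets weight 0, so only words that separate logs from each other contribute.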

Clustering:
Two classes (anomaly detection), not very useful in our case
DBSCOD - density based spatial clustering of outliers detection
core point - If the number of points in a point p’s neighbourhood is greater than the threshold MinPts, p is a core point.
border point - A border point is not a core point, but it lies in the neighbourhood of one or more core points.
outlier - (noise point) All points other than core points and border points are outliers.
They were interested in outliers, as they were probably anomalies.
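
The three definitions above can be illustrated with a brute-force labelling pass. This is a sketch under assumptions, not the paper's implementation: the neighbourhood is taken as all other points within Euclidean distance `eps` (the point itself is not counted), and `eps` and `min_pts` are hypothetical parameter names.

```python
import math


def neighbours(points, i, eps):
    # Indices of all other points within distance eps of points[i].
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= eps]


def label_points(points, eps, min_pts):
    hoods = [neighbours(points, i, eps) for i in range(len(points))]
    # Core point: more than min_pts points in its neighbourhood.
    core = {i for i, hood in enumerate(hoods) if len(hood) > min_pts}
    labels = []
    for i, hood in enumerate(hoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in hood):
            labels.append("border")   # not core, but near at least one core point
        else:
            labels.append("outlier")  # everything else is noise
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(label_points(points, eps=1.1, min_pts=1))
```

Here the four corner points are core, (2, 1) is a border point and (10, 10) is the outlier, i.e. the anomaly candidate.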

Self-organising feature maps - simple explanation from wiki
https://en.wikipedia.org/wiki/Self-organizing_map#Learning_algorithm
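
For a quick feel of the learning algorithm, a minimal sketch of a self-organising map with a 1-D row of units. Assumptions made here, not taken from the paper: Gaussian neighbourhood function, exponential decay of learning rate and radius, random initial weights in [0, 1); all parameter names are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    # Index of the unit whose weight vector is closest to the input x.
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], x))


def train_som(data, n_units, epochs=50, lr0=0.5, sigma0=None, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or n_units / 2
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        decay = math.exp(-epoch / epochs)       # assumed decay schedule
        lr, sigma = lr0 * decay, sigma0 * decay
        for x in data:
            bmu = best_matching_unit(weights, x)
            for k, w in enumerate(weights):
                # Units close to the BMU on the grid are pulled harder towards x.
                h = math.exp(-((k - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


data = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
som = train_som(data, n_units=4)
print([best_matching_unit(som, x) for x in data])
```

After training, each log's feature vector is mapped to its best matching unit, and similar vectors tend to land on the same or neighbouring units.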

Results:
K-means and other simple algorithms perform terribly; SOFM performs well
High score on mixed configuration (different types of logs using one classifier)
Better to do dimensionality reduction for the whole dataset (features * attributes) rather than separately for every feature

Important note about features:
"Secondly, among those feature candidates, the results show that simple features such as character bigram and timestamp statistics are effective enough to distinguish abnormal and normal logs. The advanced feature TF-IDF is more effective in certain test case, but gets fair results in some other test cases."
