Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/111897
Title: | 基於大數據資料的非監督分散式分群演算法 An Effective Distributed GHSOM Algorithm for Unsupervised Clustering on Big Data |
Authors: | 邱垂暉 Chiu, Chui Hui |
Contributors: | 郁方 Yu, Fang 邱垂暉 Chiu, Chui Hui |
Keywords: | Unsupervised clustering; GHSOM; Actor model; Malware detection; Parallel computation |
Date: | 2017 |
Issue Date: | 2017-08-10 11:13:04 (UTC+8) |
Abstract: | Clustering techniques that group samples by attribute similarity are widely used in fields such as pattern recognition, feature extraction, and malicious behavior characterization. Because of their importance, several clustering techniques have been reimplemented on distributed frameworks for scalable computation, for example K-means on Hadoop in Apache Mahout. K-means requires the number of clusters to be given in advance, and self-organizing maps (SOM) require the map size to be given in advance. The growing hierarchical self-organizing map (GHSOM) instead clusters samples dynamically until a tolerance on the variation between samples is satisfied, which makes it an attractive unsupervised learning method for data that carry too little information to fix the number of clusters beforehand. However, existing GHSOM is a sequential algorithm and does not scale, which limits its application to big data. In this thesis, we present a novel distributed GHSOM algorithm. We implement GHSOM construction with the Scala actor model, distribute the horizontal and vertical expansion tasks to actors, and show a significant performance improvement. To evaluate the approach, we collect and analyze the real-world execution behaviors of thousands of malware samples and derive malware detection rules by applying the presented unsupervised clustering to millions of samples, demonstrating its performance improvement, the effectiveness of the rules, and its potential use in practice. |
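The growth rule summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's Scala implementation. It assumes the two-threshold criteria commonly used for GHSOM (a map keeps growing horizontally while its mean unit error exceeds `tau1` times the parent unit's error, and a unit is expanded vertically when its error exceeds `tau2` times the root error), and it uses a thread pool as a stand-in for actors. All function and parameter names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def quantization_error(samples, weight):
    """Mean Euclidean distance between a unit's weight vector and its samples."""
    if not samples:
        return 0.0
    return mean(
        sum((s_i - w_i) ** 2 for s_i, w_i in zip(s, weight)) ** 0.5
        for s in samples
    )


def needs_horizontal_growth(unit_errors, parent_error, tau1):
    """True while the map's mean unit error exceeds tau1 * parent error."""
    return mean(unit_errors) > tau1 * parent_error


def units_to_expand(unit_errors, root_error, tau2):
    """Indices of units whose error exceeds tau2 * root error (vertical growth)."""
    return [i for i, qe in enumerate(unit_errors) if qe > tau2 * root_error]


def expand_in_parallel(unit_ids, train_child_map, max_workers=4):
    """Train one child map per unit concurrently.

    Assumption: a thread pool stands in for the actors of the thesis;
    train_child_map is a caller-supplied, hypothetical training function.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(unit_ids, pool.map(train_child_map, unit_ids)))
```

Because each child map depends only on the samples mapped to its parent unit, the vertical expansion tasks are independent of one another, which is what makes handing them to separate workers straightforward in this sketch.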
Description: | Master's thesis, National Chengchi University, Department of Management Information Systems, 104356019 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0104356019 |
Data Type: | thesis |
Appears in Collections: | [Department of Management Information Systems] Theses |
Files in This Item:
File | Description | Size | Format |
601901.pdf | | 1552Kb | Adobe PDF |
All items in 政大典藏 are protected by copyright, with all rights reserved.