Loading...
|
Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/61185
|
Title: | 適用於雲端分散儲存架構下的kNN平行演算法之研究 The study on paralleling the kNN algorithm suitable to cloud-based distributed architecture |
Authors: | 張伯辰 |
Contributors: | 楊建民 張伯辰 |
Keywords: | k-最鄰近演算法 平行運算 雲端運算 資料探勘 kNN parallel computing cloud computing data mining |
Date: | 2012 |
Issue Date: | 2013-10-01 11:59:45 (UTC+8) |
Abstract: | 近年來,隨著網際網路與相關設備的快速發展與普及,雲端運算逐漸興起,並產生了大量的雲端服務,讓網路上儲存了大量一般使用者的資料與文件。為了分析這些資料,可使用資料探勘中的資料分群來進行初步探討,而kNN正是廣泛被應用於資料分群的一種方式。不過隨著資料量的成長,使用kNN演算法的分群時間也會大幅度地成長。 為此,本研究試圖以平行化的方式運行kNN演算法,將大量資料分成少量多次處理,並於平行分群後進行群集合併。本研究將Google Spreadsheet作為雲端資料庫,透過PHP語言對資料庫下達指令,並進行多區域的平行kNN資料分群。待各區域完成分群後,再將分群結果傳送至整合端進行群集合併。 本研究以二種不同方式進行群集合併,並比較二種方式的結果。第一種合併方式為給定合併門檻值,直接將質心近似的群集合併;第二種方式為同時比較二群集之群內相似度與群間相似度之關係,來決定是否進行合併。綜合而言,第二種方式的運算時間較第一種方式為快,且合併結果也較第一種方式為佳。 本研究進行三種資料分群測試,分別為:一般性資料(隨機分配的固定維度數值化資料)、實際新聞文章的數值化資料、大量模擬實際文章的數值化資料(高維度且有部分集中群集)。根據結果顯示,三種測試皆顯示平行化後的速度有著極大幅度的提升,表示本研究設計的平行化演算法對kNN的加速有相當大程度的作用,分群品質的變化雖可接受、卻較難以掌控。 With the development and popularization of the Internet, many companies started offering cloud-services, allowed users to save their data on the Internet. To find information in the large amounts of data, kNN is a common method in data mining domain. But with the amount of data increased, the computing time also grew rapidly. In this study, I tried to parallel the kNN algorithm, expecting to promote the efficiency of kNN algorithm by separating the data into small amount and clustering them. In this study, Google Spreadsheet was used as a database, and accessing by PHP. Data stored in different spreadsheets separately, and clustered by different regional servers. After each regional server completed the data clustering, the regional clusters would be integrated into final result. I used two ways to merge clusters in the second step of the algorithm. The first way is given threshold; I merge the clusters if these centroids were similar. The second way is to compare the difference between the summary of the inter-cluster similarity of two clusters, and their intra-cluster similarity. Then decided whether to merge or not. The result showed that the second way is not only have better computing speed but also have better quality. In this study, I tested the parallel algorithm with three kinds of data. The first one is random generated data with fixed dimension. The second data was transformed by reality news. And third data was large amount of data generated by simulating reality data. All of the results showed the paralleled algorithm had a great effect on accelerated KNN method, but also got the unpredictable quality change. |
Reference: | 英文文獻 1. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. 2. Davis, N., Demetriou, G., Gaizauskas, R., Guo, Y., & Roberts, L. (2006). Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator. International Journal of Web Service Research 3(4): 95–112. 3. Dikaiakos, M.D., Katsaros, D., Mehra, P., Pallis, G., Vakali, A. (2009). Cloud Computing: Distributed Internet Computing for IT and Scientific Research. IEEE Internet Computing, 13(5):10–13. 4. Han, J., & Kamber, M. (2006). Data mining: Concepts and Techniques (2nd Edition), Morgan Kaufmann. 5. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264–323. 6. Lai, C. H., & Liu, D. R. (2009). Integrating knowledge flow mining and collaborative filtering to support document recommendation. Journal of Systems and Software, 82(12), 2023–2037. 7. Laudon, J. P., & Laudon, K. C. (2011). Management Information Systems (12th Edition), Pearson. 8. National Institute of Standards and Technology. (2011). The NIST Definition of Cloud Computing. NIST Special Publication 800-145. 9. Ovum. (2011). 2011 Trends to Watch: Cloud Computing Technology. 10. Salton, G., Wong, A., Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620. 11. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1–47. 12. Song, Y. Huang, J. Zhou, D. Zha, H. Giles, C. L. (2007). IKNN: Informative K-Nearest Neighbor Classification. 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, 248–264, Warsaw, Poland. 13. Wu, X, Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowl Inf Syst, 14:1–37. 14. Weinberger, K. Q., Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244.
中文文獻 15. 吳季桓(2010)。自動分類的實作:KNN與SVM。國立中正大學資訊工程研究所碩士論文,未出版,嘉義縣。 16. 周宣光(譯)(2010)。管理資訊系統(原作者Laudon, J. P., & Laudon, K. C. (2009))。台北:東華書局。 17. 施雅月、賴錦慧(譯)(2008)。資料探勘(原作者Tan, P. N., Steinbach, M., Kumar, V. (2006))。台北:歐亞書局。 18. 連偉志(2013)。雲端環境下應用Google電子試算表於分散式儲存架構平台之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。 19. 陳仕斌(2012)。雲端運算之編譯排程系統設計與實作。國立成功大學電機工程學系碩士論文,未出版,台南市。 20. 黃孝文(2010)。雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。 21. 黃冠中(2007)。應用kNN演算法之文件分類平台實作。第六屆離島資訊技術與應用研討會論文,雲林:國立虎尾科技大學資工系主辦。 22. 曾國傑(2012)。運用KNN文字探勘分析智慧型終端App群集之研究. 國立政治大學資訊管理學系碩士論文,未出版,台北市。 23. 薛弘業(2013)。應用文字探勘與文件分類分群技術於股價走勢預測之研究─以台灣股票市場為例。國立政治大學資訊管理學系碩士論文,未出版,台北市。 網路資料 24. Boss, G., Malladi, P., Quan, D., Legregni, L., & Hall, H. (2007). Cloud Computing. Retrieved April, 2013 from IBM Corporation Web site: http://download.boulder.ibm.com/ibmdl/pub/software/dw/wes/hipods/Cloud_computing_wp_final_8Oct.pdf 25. Google Drive. (2006). Overview of Google Sheets. Retrieved February, 2013 from https://support.google.com/drive/bin/answer.py?hl=en&answer=140784&topic=20322&ctx=topic 26. Google Drive. (2013). Start Google Drive. Retrieved February, 2013 from https://www.google.com/intl/en/drive/start/index.html 27. Law of cosines - Wikipedia, the free encyclopedia Retrieved August, 2013 from http://en.wikipedia.org/wiki/Law_of_cosines 28. Parallel Programming for Multicore. Berkeley lecture. Retrieved March, 2013 from http://www.cs.berkeley.edu/~yelick/cs194f07/ |
Description: | 碩士 國立政治大學 資訊管理研究所 100356032 101 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G1003560321 |
Data Type: | thesis |
Appears in Collections: | [資訊管理學系] 學位論文
|
Files in This Item:
File |
Size | Format | |
032101.pdf | 778Kb | Adobe PDF2 | 248 | View/Open |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|