政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/61185
English  |  正體中文  |  简体中文  |  Post-Print筆數 : 27 |  全文笔数/总笔数 : 113451/144438 (79%)
造访人次 : 51287714      在线人数 : 833
RC Version 6.0 © Powered By DSPACE, MIT. Enhanced by NTU Library IR team.
搜寻范围 查询小技巧:
  • 您可在西文检索词汇前后加上"双引号",以获取较精准的检索结果
  • 若欲以作者姓名搜寻,建议至进阶搜寻限定作者字段,可获得较完整数据
  • 进阶搜寻
    政大機構典藏 > 商學院 > 資訊管理學系 > 學位論文 >  Item 140.119/61185


    请使用永久网址来引用或连结此文件: https://nccur.lib.nccu.edu.tw/handle/140.119/61185


    题名: 適用於雲端分散儲存架構下的kNN平行演算法之研究
    The study on paralleling the kNN algorithm suitable to cloud-based distributed architecture
    作者: 張伯辰
    贡献者: 楊建民
    張伯辰
    关键词: k-最鄰近演算法
    平行運算
    雲端運算
    資料探勘
    kNN
    parallel computing
    cloud computing
    data mining
    日期: 2012
    上传时间: 2013-10-01 11:59:45 (UTC+8)
    摘要:   近年來,隨著網際網路與相關設備的快速發展與普及,雲端運算逐漸興起,並產生了大量的雲端服務,讓網路上儲存了大量一般使用者的資料與文件。為了分析這些資料,可使用資料探勘中的資料分群來進行初步探討,而kNN正是廣泛被應用於資料分群的一種方式。不過隨著資料量的成長,使用kNN演算法的分群時間也會大幅度地成長。
      為此,本研究試圖以平行化的方式運行kNN演算法,將大量資料分成少量多次處理,並於平行分群後進行群集合併。本研究將Google Spreadsheet作為雲端資料庫,透過PHP語言對資料庫下達指令,並進行多區域的平行kNN資料分群。待各區域完成分群後,再將分群結果傳送至整合端進行群集合併。
      本研究以二種不同方式進行群集合併,並比較二種方式的結果。第一種合併方式為給定合併門檻值,直接將質心近似的群集合併;第二種方式為同時比較二群集之群內相似度與群間相似度之關係,來決定是否進行合併。綜合而言,第二種方式的運算時間較第一種方式為快,且合併結果也較第一種方式為佳。
      本研究進行三種資料分群測試,分別為:一般性資料(隨機分配的固定維度數值化資料)、實際新聞文章的數值化資料、大量模擬實際文章的數值化資料(高維度且有部分集中群集)。根據結果顯示,三種測試皆顯示平行化後的速度有著極大幅度的提升,表示本研究設計的平行化演算法對kNN的加速有相當大程度的作用,分群品質的變化雖可接受、卻較難以掌控。
    With the development and popularization of the Internet, many companies started offering cloud-services, allowed users to save their data on the Internet. To find information in the large amounts of data, kNN is a common method in data mining domain. But with the amount of data increased, the computing time also grew rapidly.
    In this study, I tried to parallel the kNN algorithm, expecting to promote the efficiency of kNN algorithm by separating the data into small amount and clustering them. In this study, Google Spreadsheet was used as a database, and accessing by PHP. Data stored in different spreadsheets separately, and clustered by different regional servers. After each regional server completed the data clustering, the regional clusters would be integrated into final result.
    I used two ways to merge clusters in the second step of the algorithm. The first way is given threshold; I merge the clusters if these centroids were similar. The second way is to compare the difference between the summary of the inter-cluster similarity of two clusters, and their intra-cluster similarity. Then decided whether to merge or not. The result showed that the second way is not only have better computing speed but also have better quality.
    In this study, I tested the parallel algorithm with three kinds of data. The first one is random generated data with fixed dimension. The second data was transformed by reality news. And third data was large amount of data generated by simulating reality data. All of the results showed the paralleled algorithm had a great effect on accelerated KNN method, but also got the unpredictable quality change.
    參考文獻: 英文文獻
    1. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
    2. Davis, N., Demetriou, G., Gaizauskas, R., Guo, Y., & Roberts, L. (2006). Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator. International Journal of Web Service Research 3(4): 95–112.
    3. Dikaiakos, M.D., Katsaros, D., Mehra, P., Pallis, G., Vakali, A. (2009). Cloud Computing: Distributed Internet Computing for IT and Scientific Research. IEEE Internet Computing, 13(5):10–13.
    4. Han, J., & Kamber, M. (2006). Data mining: Concepts and Techniques (2nd Edition), Morgan Kaufmann.
    5. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264–323.
    6. Lai, C. H., & Liu, D. R. (2009). Integrating knowledge flow mining and collaborative filtering to support document recommendation. Journal of Systems and Software, 82(12), 2023–2037.
    7. Laudon, J. P., & Laudon, K. C. (2011). Management Information Systems (12th Edition), Pearson.
    8. National Institute of Standards and Technology. (2011). The NIST Definition of Cloud Computing. NIST Special Publication 800-145.
    9. Ovum. (2011). 2011 Trends to Watch: Cloud Computing Technology.
    10. Salton, G., Wong, A., Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
    11. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1–47.
    12. Song, Y. Huang, J. Zhou, D. Zha, H. Giles, C. L. (2007). IKNN: Informative K-Nearest Neighbor Classification. 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, 248–264, Warsaw, Poland.
    13. Wu, X, Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowl Inf Syst, 14:1–37.
    14. Weinberger, K. Q., Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244.


    中文文獻
    15. 吳季桓(2010)。自動分類的實作:KNN與SVM。國立中正大學資訊工程研究所碩士論文,未出版,嘉義縣。
    16. 周宣光(譯)(2010)。管理資訊系統(原作者Laudon, J. P., & Laudon, K. C. (2009))。台北:東華書局。
    17. 施雅月、賴錦慧(譯)(2008)。資料探勘(原作者Tan, P. N., Steinbach, M.,
    Kumar, V. (2006))。台北:歐亞書局。
    18. 連偉志(2013)。雲端環境下應用Google電子試算表於分散式儲存架構平台之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。
    19. 陳仕斌(2012)。雲端運算之編譯排程系統設計與實作。國立成功大學電機工程學系碩士論文,未出版,台南市。
    20. 黃孝文(2010)。雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。
    21. 黃冠中(2007)。應用kNN演算法之文件分類平台實作。第六屆離島資訊技術與應用研討會論文,雲林:國立虎尾科技大學資工系主辦。
    22. 曾國傑(2012)。運用KNN文字探勘分析智慧型終端App群集之研究. 國立政治大學資訊管理學系碩士論文,未出版,台北市。
    23. 薛弘業(2013)。應用文字探勘與文件分類分群技術於股價走勢預測之研究─以台灣股票市場為例。國立政治大學資訊管理學系碩士論文,未出版,台北市。

    網路資料
    24. Boss, G., Malladi, P., Quan, D., Legregni, L., & Hall, H. (2007). Cloud Computing. Retrieved April, 2013 from IBM Corporation Web site:
    http://download.boulder.ibm.com/ibmdl/pub/software/dw/wes/hipods/Cloud_computing_wp_final_8Oct.pdf
    25. Google Drive. (2006). Overview of Google Sheets. Retrieved February, 2013 from
    https://support.google.com/drive/bin/answer.py?hl=en&answer=140784&topic=20322&ctx=topic
    26. Google Drive. (2013). Start Google Drive. Retrieved February, 2013 from
    https://www.google.com/intl/en/drive/start/index.html
    27. Law of cosines - Wikipedia, the free encyclopedia Retrieved August, 2013 from
    http://en.wikipedia.org/wiki/Law_of_cosines
    28. Parallel Programming for Multicore. Berkeley lecture. Retrieved March, 2013 from
    http://www.cs.berkeley.edu/~yelick/cs194f07/
    描述: 碩士
    國立政治大學
    資訊管理研究所
    100356032
    101
    資料來源: http://thesis.lib.nccu.edu.tw/record/#G1003560321
    数据类型: thesis
    显示于类别:[資訊管理學系] 學位論文

    文件中的档案:

    档案 大小格式浏览次数
    032101.pdf778KbAdobe PDF2247检视/开启


    在政大典藏中所有的数据项都受到原著作权保护.


    社群 sharing

    著作權政策宣告 Copyright Announcement
    1.本網站之數位內容為國立政治大學所收錄之機構典藏,無償提供學術研究與公眾教育等公益性使用,惟仍請適度,合理使用本網站之內容,以尊重著作權人之權益。商業上之利用,則請先取得著作權人之授權。
    The digital content of this website is part of National Chengchi University Institutional Repository. It provides free access to academic research and public education for non-commercial use. Please utilize it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

    2.本網站之製作,已盡力防止侵害著作權人之權益,如仍發現本網站之數位內容有侵害著作權人權益情事者,請權利人通知本網站維護人員(nccur@nccu.edu.tw),維護人員將立即採取移除該數位著作等補救措施。
    NCCU Institutional Repository is made to protect the interests of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff(nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.
    DSpace Software Copyright © 2002-2004  MIT &  Hewlett-Packard  /   Enhanced by   NTU Library IR team Copyright ©   - 回馈