National Chengchi University Institutional Repository (NCCUR): Item 140.119/100634
    Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/100634


    Title: Accomplish Big Data Analytic and Modeling Comparison on Spark: Weibo as an Example
    Original title: 透過Spark平台實現大數據分析與建模的比較:以微博為例
    Author: Pan, Zong Jhe (潘宗哲)
    Contributors: Hu, Yuh Jong (胡毓忠)
    Pan, Zong Jhe (潘宗哲)
    Keywords: Big data analytics
    machine learning
    Weibo
    analytics pipeline
    Amazon EC2
    Date: 2016
    Upload time: 2016-08-22 17:23:53 (UTC+8)
    Abstract: Rapidly growing, fast-changing data and constantly evolving analytics tools make big data analytics harder to adopt. This study presents a complete machine learning pipeline as a reference blueprint for academic institutions and companies that are considering big data analytics. We use Apache Spark as the computing framework and build machine learning models with the two MLlib packages, Spark.ml and Spark.mllib, to address problems that arise in traditional data analysis. Along the pipeline, we compare the scenarios in which the different Spark modules are suitable. We first develop on a local cluster and then submit the jobs to an Amazon EC2 cluster to speed up modeling and analysis. We demonstrate the pipeline on the 2012 mainland China Weibo dataset provided by the Journalism and Media Studies Centre of the University of Hong Kong. We use RDD, Spark SQL, and GraphX to extract feature values from Weibo users' posts, and we build a Random Forest model for the binary classification task of predicting whether a user is officially verified.
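The feature-extraction step described in the abstract can be pictured with a small plain-Python sketch. It does not use Spark and is not the thesis's code; it only mirrors the kind of per-user aggregation (post counts) and graph degree computation (reposts received and made) that RDD/Spark SQL and GraphX would perform at scale. The record layout, user names, and feature names are hypothetical and are not the schema of the actual Weibo dataset.

```python
from collections import Counter

# Hypothetical toy records: (post_id, author_id, reposted_from_author_id or None).
# Illustrative only; this is not the real Weibo dataset schema.
posts = [
    ("p1", "alice", None),
    ("p2", "bob", "alice"),
    ("p3", "carol", "alice"),
    ("p4", "alice", "bob"),
    ("p5", "bob", None),
]


def extract_user_features(records):
    """Aggregate per-user features from post records.

    Returns {user: {"post_count", "reposts_received", "reposts_made"}}.
    """
    post_count = Counter()        # number of posts written by each user
    reposts_received = Counter()  # in-degree in the repost graph
    reposts_made = Counter()      # out-degree in the repost graph

    for _post_id, author, source in records:
        post_count[author] += 1
        if source is not None:
            reposts_made[author] += 1
            reposts_received[source] += 1

    # Include users who only appear as a repost source.
    users = set(post_count) | set(reposts_received)
    return {
        user: {
            "post_count": post_count[user],
            "reposts_received": reposts_received[user],
            "reposts_made": reposts_made[user],
        }
        for user in sorted(users)
    }


features = extract_user_features(posts)
for user, feats in features.items():
    print(user, feats)
```

In a pipeline like the one the abstract describes, each per-user feature dictionary would become one row of the training data, with the user's verification status as the binary label for the Random Forest classifier.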
    Description: Master's
    National Chengchi University
    Department of Computer Science
    103753040
    Source: http://thesis.lib.nccu.edu.tw/record/#G0103753040
    Type: thesis
    Appears in collections: [Department of Computer Science] Theses

    Files in this item:

    File          Size      Format      Views
    304001.pdf    4738Kb    Adobe PDF   2167

