|
Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/100571
|
Title: | 基於主題模型之社群媒體內容分析探索 Exploring Topic Models for Analyzing the Contents of Social Media |
Authors: | 廖舒婷 Liao, Shu Ting |
Contributors: | 陳恭 Chen, Kung 廖舒婷 Liao, Shu Ting |
Keywords: | 主題分析 文字探勘 社群媒體 Topic Models Text Mining Social Media |
Date: | 2016 |
Issue Date: | 2016-08-22 13:40:38 (UTC+8) |
Abstract: | 隨著網路文章訊息量的快速增長,傳統內容分析已無法在短時間內有效地處理和解析龐雜文本潛在意義,為此,本研究嘗試建置一套以非監督式學習主題模型技術為核心的工具,結合自然語言處理可協助研究學者快速處理與探索大量中文資料,挖掘蘊藏的知識。並透過整合自動化的評估機制,提供模型效果好壞之參考。另由於主題模型所產出的結果仍需要人工判讀,因此本研究再利用視覺化技術呈現,以輔助研究學者詮釋結果。
本研究以太陽花學運期間六個來源收集資料為實驗對象,包括Facebook、Twitter以及四大即時新聞報,實驗結果顯示本研究建置之工具可以有效地應用於大量中文文本內容探索,有助於減少人工處理和手動作業,並縮短整個資料分析時程。藉由主題模型技術,我們得以探討社群媒體和新聞媒體關注議題之異同,而研究過程也發現不只台灣民眾以及新聞媒體關心太陽花學運,來自香港、大陸等世界各地的網友亦藉由社群媒體平台主動關注或發表意見。另依據主題的分布情況,亦可作為話題熱門度的指標。
最後,本研究進行模型效度評估,觀察衡量主題模型應用於不同性質中文文本資料之可行性與限制。此外,本研究透過文本歸類計算取得資料集主題的組成便可作為初步篩選資料集之重要特徵,從而提出未來可延伸發展的方向。
The volume of text available on the internet has grown too large for traditional content analysis methods to process and mine for high-quality insights in a reasonable amount of time. To address this issue, we develop a data analysis system based on an unsupervised topic modeling method, with a particular focus on processing Chinese texts. By integrating the Chinese tokenization tool jieba, our system can explore and analyze Chinese documents rapidly and effectively. The system also automatically performs a quantitative evaluation of the generated model, which gives the user a quick indication of how well the model works. Finally, because the outputs of topic modeling rely on human interpretation, we present a method for visualizing topic modeling results to help end users understand and interpret the topics that have been discovered.
To evaluate our system, we run experiments on six Chinese text data sets drawn from different online media sources. The results show that the proposed system can be applied to analyze large volumes of unlabeled Chinese text, reducing manual work and shortening the time required for analysis. We then compare the topics found in social media with those found in online news. We observe that Taiwan’s Sunflower Movement not only received great attention from people in Taiwan; overseas users in Hong Kong and China also expressed their concerns and opinions through social media. Furthermore, the topic distribution makes it easy to identify hot topics.
Finally, we conduct experiments to evaluate the proposed system and understand its limiting factors. An interesting finding is that our system can act as a data filtering tool: the topic composition of each data set can be computed and used to define filters for quickly selecting relevant data sets from a larger collection. |
Reference: | [1] Sullivan, Dan. (2001). Document Warehousing and Text Mining Techniques for Improving Business Operations, Marketing, and Sales. New York: John Wiley & Sons.
[2] Tan, A. H. (1999). Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases (Vol. 8, pp. 65-70).
[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. 41, pp. 391-407.
[4] T. Hofmann. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, California, USA.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. (2003). Latent dirichlet allocation. J. Mach. Learn. Res., vol. 3, pp. 993-1022.
[6] M. Steyvers and T. Griffiths. (2006). Probabilistic topic models.
[7] Hall, David, Daniel Jurafsky and Christopher D. Manning. (2008). Studying the history of ideas using topic models. Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics.
[8] Phan, Xuan-Hieu, Le-Minh Nguyen, and Susumu Horiguchi. (2008). Learning to classify short and sparse text & web with hidden topics from large-scale data collections. Proceedings of the 17th international conference on World Wide Web. ACM.
[9] Xin Zhao, Jing Jiang, Jianshu Weng et al. (2011). Comparing Twitter and traditional media using topic models. In Proceedings of the European Conference on Information Retrieval.
[10] Brody, Samuel, and Noemie Elhadad. (2010). An unsupervised aspect-sentiment model for online reviews. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.
[11] 楚克明, and 李芳. "基于 LDA 模型的新聞話題的演化." 计算机应用与软件 28.4 (2011): 4-7.
[12] 冯时, 景珊, 杨卓, and 王大玲, "基于 LDA 模型的中文微博话题意见领袖挖掘," 东北大学学报: 自然科学版, vol. 34, pp. 490-494, 2013.
[13] 張日威,"應用LDA進行Plurk主題分類及使用者情緒分析",雲科大資訊管理學系碩士論文,2014.
[14] 李日斌, "探討臺灣網民對鄰國的情感",中山大學資訊管理學系研究所碩士論文,2014.
[15] Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems (pp. 288-296).
[16] Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100-108). Association for Computational Linguistics.
[17] Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.
[18] Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408). ACM.
[19] Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 17-35.
[20] Maiya, A. S., & Rolfe, R. M. (2014). Topic similarity networks: visual analytics for large document sets. In Big Data (Big Data), 2014 IEEE International Conference on (pp. 364-372). IEEE.
[21] Harris, Z. S. (1954). Distributional Structure. Word, 10(2/3), 146–162.
[22] Parnas, D. L. (1972). On the criteria to be used in decomposing systems into modules. Communications of the ACM, 15(12), 1053-1058.
[23] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37.
[24] Newman, D., Hagedorn, K., Chemudugunta, C., & Smyth, P. (2007). Subject metadata enrichment using statistical topic models. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (pp. 366-375). ACM.
[25] 謝宗震 (2014)。服貿事件 X 資料科學。檢自:http://readata.org/ecfa-and-data-science/ |
Description: | Master's thesis, National Chengchi University, In-service Master's Program, Department of Computer Science, 103971002 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0103971002 |
Data Type: | thesis |
Appears in Collections: | [In-service Master's Program, Department of Computer Science] Theses
|
Files in This Item:
File | Size | Format |
100201.pdf | 4057Kb | Adobe PDF |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|