政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/131634
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/131634


    Title: Using Latent Dirichlet Allocation Model for Topic Modeling with Articles of PTT Forum and Analyzing Relevance of Article Comments
    Original title: 以LDA機率模型進行PTT論壇文章主題分類並分析文章留言與文章主題之關聯
    Authors: Kuo, Haung-Chi (郭泓志)
    Contributors: 江玥慧
    劉昭麟
    Kuo, Haung-Chi (郭泓志)
    Keywords: Topic Modeling (文件主題模型)
    Social Network Analysis (社群網路分析)
    Latent Dirichlet Allocation
    PTT
    Date: 2020
    Issue Date: 2020-09-02 12:16:01 (UTC+8)
    Abstract: As technology advances, people post on social networking platforms and online forums more and more often, and people from different fields and countries increasingly gather in the same space to discuss and share opinions. Automatically classifying what each group of participants is discussing, however, remains difficult. This study uses PTT, a well-known online forum in Taiwan, as its data source and adopts the LDA (Latent Dirichlet Allocation) model to classify articles into topic groups. A Word2Vec model is then used to classify the discussion topics of the comments replying to the same article, and the results are used to investigate whether an article's comments are related to the article in terms of topic groups. Analyzing the association between comments and articles can serve as a basis for further understanding of communication in the PTT forum.
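The abstract's pipeline (group articles into topics with LDA, then check whether a comment falls into the same topic as its article) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes the text is already segmented into tokens, uses a small collapsed Gibbs sampler written from scratch, and replaces the Word2Vec comment step with a much simpler stand-in that scores a comment against the learned topic-word counts. The function names, hyperparameters, and toy English tokens are all hypothetical.

```python
import math
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns per-document topic proportions and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [Counter() for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic_counts[d][z] += 1
            topic_word_counts[z][w] += 1
            topic_totals[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic_counts[d][z] -= 1
                topic_word_counts[z][w] -= 1
                topic_totals[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][w] + beta)
                    / (topic_totals[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic_counts[d][z] += 1
                topic_word_counts[z][w] += 1
                topic_totals[z] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    return doc_topic, topic_word_counts


def most_likely_topic(tokens, topic_word_counts, vocab_size, beta=0.01):
    """Stand-in for the comment step: pick the topic with the highest
    smoothed log-likelihood for the comment's tokens (not Word2Vec)."""
    def log_score(k):
        total = sum(topic_word_counts[k].values())
        return sum(
            math.log((topic_word_counts[k][w] + beta) / (total + vocab_size * beta))
            for w in tokens
        )
    return max(range(len(topic_word_counts)), key=log_score)


# Toy, already-tokenized "articles" (hypothetical data).
articles = [
    ["game", "team", "score", "player", "team"],
    ["team", "player", "win", "game", "score"],
    ["stock", "market", "price", "fund", "market"],
    ["price", "fund", "stock", "invest", "market"],
]
vocab_size = len({w for doc in articles for w in doc})
doc_topic, topic_word = fit_lda(articles, num_topics=2)

# Dominant topic of each article, then of a comment on article 2.
article_topics = [max(range(2), key=row.__getitem__) for row in doc_topic]
comment_topic = most_likely_topic(["stock", "market"], topic_word, vocab_size)
print("article topics:", article_topics)
print("comment on-topic for article 2:", comment_topic == article_topics[2])
```

On real PTT data the tokens would come from a Chinese word segmenter, and the number of topics would need to be chosen by some evaluation criterion; both are outside this sketch.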
    Description: Master's
    National Chengchi University
    Department of Computer Science
    107753013
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0107753013
    Data Type: thesis
    DOI: 10.6814/NCCU202001523
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File        Size    Format
    301301.pdf  4166Kb  Adobe PDF


    All items in 政大典藏 are protected by copyright, with all rights reserved.

