Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/103991
Title: 適用於中文史料文本之標記式主題模型分析方法研究 (An Enhanced Topic Model Based on Labeled LDA for Chinese Historical Corpora)
Authors: 陳奕安
Contributors: 蔡銘峰; 陳奕安
Keywords: Topic Model; Labeled Topic Model (Labeled LDA); Latent Dirichlet Allocation
Date: 2016
Issue Date: 2016-11-14 16:15:00 (UTC+8)
Abstract: This thesis proposes a topic analysis method for Chinese historical corpora based on the Labeled Latent Dirichlet Allocation (Labeled LDA, LLDA) algorithm, which discovers words related to specific topics from manually labeled Chinese texts. In the proposed algorithm, topic seed-word information and Chinese word-length information are incorporated to strengthen the clustering results of traditional LDA, so that the clustered words become more closely associated with their topics and easier to read. In recent years, with the spread of the Internet, the rapid development of information retrieval, and the growth of digital archives, more and more printed books have been digitized and enriched with metadata; given these valuable historical text collections, applying text mining techniques to them has become an important research issue, and identifying document topics in large historical corpora is of particular interest to many scholars. The LDA topic model is a classic method in text mining, but in this study we find that traditional LDA suffers from problems in describing the clustered topics, including the high randomness of topic categories and the low readability of individual topics, which makes subsequent interpretation difficult. We therefore adopt Labeled LDA, a supervised variant of LDA, to restrict the set of topic categories that can be generated and thus reduce this randomness, and we further add improvements that consider Chinese word length and user-defined seed words, so that the clustered topic words are more relevant to their topics and easier to describe. In the experiments, topic words extracted by the improved algorithm are labeled manually by historical experts, and the labels are used as ground truth to compute Mean Average Precision (MAP) and related information retrieval measures. The results confirm that clustering with both word-length and seed-word information outperforms the traditional topic model; we also compare the final results with TF-IDF-weighted words, and the proposed method outperforms the TF-IDF baseline significantly.
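The abstract describes two enhancements to Labeled LDA (boosting topic seed words and favouring longer Chinese words) and an evaluation with Mean Average Precision against manually labeled topic words. The record does not give the exact formulation, so the following Python sketch is only one plausible reading rather than the thesis' actual implementation: seed words and longer words receive extra mass in an asymmetric per-topic Dirichlet word prior, and MAP is computed over ranked topic-word lists. Function names, parameters, and the example words are hypothetical.

```python
import numpy as np

def build_topic_word_prior(vocab, topic_seed_words, base_beta=0.01,
                           seed_boost=1.0, length_weight=0.5):
    """Build an asymmetric Dirichlet prior over words for each labeled topic.

    Seed words of a topic receive extra prior mass, and longer Chinese words
    (more characters) are weighted up, so a sampler using this prior would
    favour them when assigning words to that topic.
    """
    num_topics = len(topic_seed_words)
    vocab_index = {w: i for i, w in enumerate(vocab)}
    beta = np.full((num_topics, len(vocab)), base_beta)
    for k, seeds in enumerate(topic_seed_words):
        for w in seeds:
            if w in vocab_index:
                # Boost the prior of seed words, scaled by word length (assumption).
                beta[k, vocab_index[w]] += seed_boost + length_weight * (len(w) - 1)
    return beta

def mean_average_precision(rankings, relevant_sets):
    """Compute MAP over ranked topic-word lists against manually labeled relevant words."""
    ap_scores = []
    for ranked, relevant in zip(rankings, relevant_sets):
        hits, precisions = 0, []
        for rank, word in enumerate(ranked, start=1):
            if word in relevant:
                hits += 1
                precisions.append(hits / rank)
        ap_scores.append(sum(precisions) / max(len(relevant), 1))
    return sum(ap_scores) / len(ap_scores)

if __name__ == "__main__":
    # Hypothetical vocabulary and per-topic seed words, for illustration only.
    vocab = ["孔子", "論語", "諸侯", "戰爭", "禮"]
    seeds = [["孔子", "論語"], ["戰爭", "諸侯"]]
    print(build_topic_word_prior(vocab, seeds))
    print(mean_average_precision([["孔子", "禮", "論語"]], [{"孔子", "論語"}]))
```

In an actual Gibbs sampler this prior matrix would take the place of a symmetric beta hyperparameter when sampling topic assignments for words in labeled documents; the thesis itself may combine seed-word and length information differently.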
Reference:
[1] I. Bhattacharya. A latent Dirichlet model for unsupervised entity resolution. In Proceedings of the 6th SIAM International Conference on Data Mining, volume 124, page 47. SIAM, 2006.
[2] I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet allocation in web spam filtering. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, pages 29–32. ACM, 2008.
[3] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] K.-Y. Chen and B. Chen. 主題語言模型於大詞彙連續語音辨識之研究 (On the use of topic models for large-vocabulary continuous speech recognition) [in Chinese]. In Proceedings of the 2009 ROCLING, pages 179–194, 2009.
[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 524–531. IEEE, 2005.
[7] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[8] G. E. Hinton and T. J. Sejnowski. Unsupervised Learning: Foundations of Neural Computation. 1999.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM, 1999.
[10] R. A. Horn. The Hadamard product. In Proceedings of Symposia in Applied Mathematics, volume 40, pages 87–169, 1990.
[11] R. V. Lindsey, W. P. Headden III, and M. J. Stipicevic. A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 214–222. Association for Computational Linguistics, 2012.
[12] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 159–168. ACM, 1998.
[13] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 248–256. Association for Computational Linguistics, 2009.
[14] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their localization in images. In Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV'05), Volume 1, pages 370–377. IEEE Computer Society, 2005.
[15] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics, 2006.
[16] X. Wang, A. McCallum, and X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining, pages 697–702. IEEE Computer Society, 2007.
[17] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178–185. ACM, 2006.
[18] D. Xing and M. Girolami. Employing latent Dirichlet allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13):1727–1734, 2007.
[19] L. Yao, Y. Zhang, B. Wei, W. Wang, Y. Zhang, X. Ren, and Y. Bian. Discovering treatment pattern in traditional Chinese medicine clinical cases by exploiting supervised topic model and domain knowledge. Journal of Biomedical Informatics, 58(C):260–267, 2015.
[20] 孟海濤, 陳思, and 周睿. 基于 LDA 模型的 Web 文本分類 (Web text classification based on the LDA model) [in Chinese]. 鹽城工學院學報 (自然科學版), 22(4):56–59, 2009.
[21] 賈西平, 彭宏, 鄭啟倫, 石時需, and 江焯林. 基于主題的文檔檢索模型 (A topic-based document retrieval model) [in Chinese]. 華南理工大學學報 (自然科學版), 36(9):37–42, 2008.
Description: Master's thesis, Department of Computer Science, National Chengchi University, 102753031
Source URI: http://thesis.lib.nccu.edu.tw/record/#G0102753031
Data Type: thesis
Appears in Collections: [資訊科學系] 學位論文 (Theses, Department of Computer Science)
Files in This Item: 303101.pdf (11045 KB, Adobe PDF)
All items in 政大典藏 are protected by copyright, with all rights reserved.