Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/63217
|
Title: | 一個對單篇中文文章擷取關鍵字之演算法 A Keyword Extraction Algorithm for Single Chinese Document |
Authors: | 吳泰勳 Wu, Tai Hsun |
Contributors: | 徐國偉 Hsu, Kuo Wei 吳泰勳 Wu, Tai Hsun |
Keywords: | 關鍵字擷取 單篇中文文章 Keyword Extraction single Chinese document |
Date: | 2013 |
Issue Date: | 2014-01-02 14:07:20 (UTC+8) |
Abstract: | Over the past 14 years, the Taiwan e-Learning and Digital Archives Program has digitized national cultural assets under 15 topics, including organisms, archaeology, and geology. To let digital archive material interact with current events, this thesis uses keywords as the bridge between digital archives and news articles. Because news articles keep introducing new words and new uses of words, we propose an algorithm that extracts topic keywords from a single Chinese document without using a corpus or dictionary. Given a Chinese document, the algorithm first segments it with a bigram-based approach, so the smallest term unit is two characters, for example 「中文」. Next, it calculates the term frequencies of the bigrams, selects the frequent terms, and clusters them. Finally, it calculates a chi-square value for each term and produces the keywords most related to the topic of the document. The distribution of term co-occurrence within a document is an important indicator: if a term's co-occurrence distribution over the frequent terms is strongly biased toward a few of them, the term is very likely a keyword.
Unlike English, where words are separated by explicit delimiters, Chinese has no spaces between words, which makes Chinese word segmentation considerably more difficult. We compare three segmentation tools: the bigram-based approach, CKIP, and the Stanford Chinese Segmenter. For each tool, we compare the topic keywords obtained under different settings: whether or not infrequent terms are filtered out, whether or not frequent terms are clustered, and whether terms are ranked by chi-square value or by term frequency. The dataset consists of articles from the Academia Sinica Digital Resources website, and the ground truth is provided by Gainwisdom, which is developed by the Computer Systems and Communication Lab of the Institute of Information Science, Academia Sinica. According to the experimental results, some of the topic keywords obtained with bigram segmentation are the same as those obtained with CKIP or the Stanford Chinese Segmenter, while others have stronger connections to the topics of the documents. The main advantage of the bigram-based approach is that it does not require a corpus or dictionary. The algorithm was developed to help promote digital archive material: we hope it can extract topic keywords from articles on currently popular topics, link them to related digital archive material, and so spark a new wave of interest in digital archives. |
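The pipeline described in the abstract (bigram segmentation, frequent-term selection, chi-square scoring) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the punctuation-based sentence splitter, the `num_frequent` and `top_k` parameters, and the exact chi-square formula (observed sentence co-occurrence against an expectation derived from each frequent term's share of the document) are all assumptions chosen for illustration, and the clustering step is omitted.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: clauses are delimited by common punctuation or whitespace.
    parts = re.split(r"[，。！？；：、,.!?;:\s]+", text)
    return [p for p in parts if p]


def bigrams(sentence):
    # Overlapping two-character terms; no dictionary is needed.
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def chi_square_scores(text, num_frequent=10):
    sentences = [bigrams(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    freq = Counter(term for s in sentences for term in s)
    frequent = [term for term, _ in freq.most_common(num_frequent)]
    frequent_set = set(frequent)
    total = sum(len(s) for s in sentences)

    span = Counter()   # span[t]: total terms in sentences that contain t
    cooc = Counter()   # cooc[(w, g)]: sentences containing both w and g
    for s in sentences:
        present = set(s)
        for w in present:
            span[w] += len(s)
            for g in present:
                if g != w and g in frequent_set:
                    cooc[(w, g)] += 1

    scores = {}
    for w in freq:
        chi2 = 0.0
        for g in frequent:
            if g == w:
                continue
            # Expected co-occurrence if w were independent of g.
            expected = span[w] * span[g] / total
            chi2 += (cooc[(w, g)] - expected) ** 2 / expected
        scores[w] = chi2
    return scores


def extract_keywords(text, num_frequent=10, top_k=5):
    scores = chi_square_scores(text, num_frequent)
    # Highest chi-square first; ties broken by the term itself.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Calling `extract_keywords(document_text, num_frequent=10, top_k=5)` returns up to five bigrams whose co-occurrence with the frequent bigrams deviates most from the independence expectation. Ranking by `freq` instead of `scores` would give the term-frequency baseline that the abstract compares against.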
Description: | Master's; National Chengchi University; Department of Computer Science; 100971017; 102 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0100971017 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
|
Files in This Item:
File | Size | Format |
101701.pdf | 5746Kb | Adobe PDF |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|