政大機構典藏 - National Chengchi University Institutional Repository (NCCUR): Item 140.119/63217


Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/63217

Title:	一個對單篇中文文章擷取關鍵字之演算法 A Keyword Extraction Algorithm for Single Chinese Document
Authors:	吳泰勳 Wu, Tai Hsun
Contributors:	徐國偉 Hsu, Kuo Wei; 吳泰勳 Wu, Tai Hsun
Keywords:	關鍵字擷取; 單篇中文文章; Keyword extraction; Single Chinese document
Date:	2013
Issue Date:	2014-01-02 14:07:20 (UTC+8)
Abstract:	Over the past 14 years, the Taiwan e-Learning and Digital Archives Program (數位典藏與數位學習國家型科技計畫) has digitized national cultural assets across 15 topics, including biology, archaeology, and geology. To let these archives interact with current events, this thesis uses keywords as the bridge between digital archive material and news articles. Because news articles constantly introduce new words and new uses of words, we propose an algorithm that extracts topic keywords from a single Chinese document without using a corpus or dictionary. The algorithm first segments the document into bigrams, so the smallest term unit is two characters (for example, 「中文」). It then selects the frequent terms, groups them with a clustering step, and finally computes a chi-square value for each term to produce the topic keywords. The distribution of term co-occurrence within the document is the key signal: if a term's co-occurrence distribution over the frequent terms is strongly biased toward a few of them, the term is very likely a keyword.
Unlike English, where delimiters separate the words of a sentence, Chinese has no explicit word boundaries, which makes Chinese word segmentation considerably harder. We compare three segmentation tools: the bigram-based approach, CKIP, and the Stanford Chinese Segmenter. For each tool, we vary whether infrequent terms are filtered out and whether frequent terms are clustered, rank terms by either chi-square value or term frequency, and compare the resulting topic keywords. The experimental data are articles from the Academia Sinica Digital Resources site (中央研究院數位典藏資源網), and the ground truth is provided by Gainwisdom (撈智網), developed by the Computer Systems and Communication Lab at the Institute of Information Science, Academia Sinica. According to the experimental results, some of the topic keywords obtained with bigram segmentation are the same as those obtained with CKIP or the Stanford Chinese Segmenter, and some are more strongly related to the topic of the document. The main advantage of the bigram-based approach is that it does not require a corpus or dictionary. The algorithm was developed with the goal of promoting digital archive material: we hope it can extract topic keywords from articles on currently popular topics and use them to link to related digital archive material, prompting a new "digital archive wave" (「數典潮」).
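The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it treats a punctuation-delimited clause as the co-occurrence window, takes the `num_frequent` most common bigrams as the frequent terms, skips the clustering step entirely, and uses a simplified chi-square score in which the expected share of each frequent term is estimated from the pooled co-occurrence counts. The function names, the delimiter set, and the default parameters are all hypothetical.

```python
import re
from collections import Counter


def bigrams(sentence):
    """Return every overlapping two-character term in one sentence."""
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def split_sentences(text):
    """Split on common Chinese/ASCII punctuation and whitespace.

    The delimiter set is an assumption made for this sketch.
    """
    parts = re.split(r"[，。！？；：、,.!?;:\s]+", text)
    return [p for p in parts if len(p) >= 2]


def chi_square_keywords(text, num_frequent=5, top_k=5):
    """Rank bigrams by how unevenly they co-occur with the frequent bigrams."""
    sentences = [bigrams(s) for s in split_sentences(text)]
    freq = Counter(term for s in sentences for term in s)
    frequent = [term for term, _ in freq.most_common(num_frequent)]
    frequent_set = set(frequent)

    # cooc[w][g]: number of sentences containing both term w and frequent term g.
    cooc = {}
    for s in sentences:
        terms = set(s)
        for w in terms:
            for g in terms & frequent_set:
                if g != w:
                    cooc.setdefault(w, Counter())[g] += 1

    # total[g]: pooled co-occurrence count of frequent term g over all terms.
    total = Counter()
    for row in cooc.values():
        total.update(row)
    grand = sum(total.values())
    if grand == 0:
        return []

    scores = {}
    for w, row in cooc.items():
        n_w = sum(row.values())
        # Exclude w itself from the expected distribution when w is frequent.
        denom = grand - (total[w] if w in frequent_set else 0)
        chi2 = 0.0
        for g in frequent:
            if g == w or total[g] == 0:
                continue
            expected = n_w * total[g] / denom
            chi2 += (row[g] - expected) ** 2 / expected
        scores[w] = chi2

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    sample = "中文斷詞很難，中文關鍵字擷取，關鍵字擷取演算法。"
    for term, score in chi_square_keywords(sample, num_frequent=3, top_k=4):
        print(term, round(score, 3))
```

A term that co-occurs with the frequent terms in roughly the pooled proportions scores near zero, while a term concentrated on a few of them scores high, which is the bias the abstract describes. Adding the clustering step would mean replacing the individual frequent terms in `frequent` with groups of terms and summing their counts per group.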
Description:	Master's thesis, National Chengchi University (國立政治大學), Department of Computer Science (資訊科學學系), 100971017, 102
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0100971017
Data Type:	thesis
Appears in Collections:	[資訊科學系 (Department of Computer Science)] 學位論文 (Theses)

Files in This Item:

File	Size	Format
101701.pdf	5746Kb	Adobe PDF

All items in 政大典藏 are protected by copyright, with all rights reserved.

Copyright Announcement

1. The digital content of this website is held in the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial purposes such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. NCCU Institutional Repository has made every effort to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.

DSpace Software Copyright © 2002-2004 MIT & Hewlett-Packard / Enhanced by NTU Library IR team