National Chengchi University Institutional Repository (NCCUR): Item 140.119/32628


Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/32628

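Title:	A Study of Pattern-Based Chinese Information Extraction Techniques (以型態辨識為主的中文資訊擷取技術研究)
Author:	Chia-Wei Weng (翁嘉緯)
Contributors:	Jyi-Shane Liu (劉吉軒); Chia-Wei Weng (翁嘉緯)
Keywords:	Information Extraction (資訊擷取); Pattern-based (型態辨識); Finite State Automata (有限狀態自動機)
Date:	2003
Upload time:	2009-09-17 13:53:20 (UTC+8)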
Abstract:	With the rapid growth of the World Wide Web, information extraction has become an important technology. Its goal is to turn unstructured text into structured information about a specific topic, which involves analyzing document content and then filtering and extracting the relevant text together with its meaning. Most information extraction systems to date focus on English documents, and research on Chinese information extraction has received far less attention. Given that at least one-fifth of the world's population speaks Chinese, research on Chinese information extraction is increasingly important. Chinese differs from English in many respects. In English, words are separated by spaces, so a computer can easily identify each word in an input string. In Chinese, there are no explicit boundaries between words. The usual solution is to match the characters of the input string against the entries of a dictionary to find word boundaries. However, characters combine into words with considerable flexibility and ambiguity, so segmentation errors remain likely. In this thesis, we propose an approach to Chinese information extraction based on pattern matching and finite state automata that relies on neither word segmentation nor part-of-speech tagging.
The approach was evaluated on government personnel appointment and dismissal directives published in official gazettes, where it achieved 98% precision and 97% recall. It was also applied to other data domains, and the results show initial progress toward a cross-domain Chinese information extraction system and support the feasibility of the approach.
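The idea of extracting fields with a finite state automaton that scans raw characters, with no word segmentation step, can be illustrated with a minimal sketch. This is not the thesis's implementation: the record does not give the actual patterns, so the trigger "任命", the delimiter "為", the terminator "。", the function name `extract_appointments`, and the sample sentences are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical pattern (an assumption, not taken from the thesis):
#   "任命" NAME "為" POSITION "。"
TRIGGER = "任命"      # marks the start of an appointment statement
DELIMITER = "為"      # separates the person's name from the position
TERMINATOR = "。"     # ends the statement


def extract_appointments(text: str) -> List[Dict[str, str]]:
    """Scan the text character by character with a three-state automaton.

    States: SEEK (look for the trigger) -> NAME -> POSITION -> SEEK.
    No dictionary lookup or word segmentation is performed.
    """
    records: List[Dict[str, str]] = []
    state = "SEEK"
    name: List[str] = []
    position: List[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if state == "SEEK":
            if text.startswith(TRIGGER, i):
                # Trigger found: start collecting a new record.
                state = "NAME"
                name, position = [], []
                i += len(TRIGGER)
                continue
        elif state == "NAME":
            if ch == DELIMITER:
                state = "POSITION"
            elif ch == TERMINATOR:
                state = "SEEK"  # sentence ended early: discard the partial match
            else:
                name.append(ch)
        elif state == "POSITION":
            if ch == TERMINATOR:
                records.append({"name": "".join(name),
                                "position": "".join(position)})
                state = "SEEK"
            else:
                position.append(ch)
        i += 1
    return records


if __name__ == "__main__":
    sample = "總統令：任命王小明為教育部部長。任命陳美玲為外交部次長。"
    for record in extract_appointments(sample):
        print(record)
```

A pattern this simple fails whenever a name or position itself contains the delimiter character, which is one reason a practical system would need a richer pattern set than this sketch assumes.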
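Description:	Master's; National Chengchi University; Department of Computer Science; 90753018; 92
Source:	http://thesis.lib.nccu.edu.tw/record/#G0090753018
Type:	thesis
Appears in collections:	[Department of Computer Science] Theses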

Files in this item:

File	Description	Size	Format	View count
75301801.pdf		17Kb	Adobe PDF2	740
75301802.pdf		20Kb	Adobe PDF2	736
75301803.pdf		53Kb	Adobe PDF2	916
75301804.pdf		44Kb	Adobe PDF2	770
75301805.pdf		367Kb	Adobe PDF2	948
75301806.pdf		425Kb	Adobe PDF2	807
75301807.pdf		951Kb	Adobe PDF2	879
75301808.pdf		36Kb	Adobe PDF2	737
75301809.pdf		38Kb	Adobe PDF2	673

All items in NCCUR are protected by their original copyright.
