|
Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/32649
|
Title: | 中文資訊擷取結果之錯誤偵測 Error Detection on Chinese Information Extraction Results |
Authors: | 鄭雍瑋 Cheng, Yung-Wei |
Contributors: | 劉吉軒 Liu, Jyi-Shane 鄭雍瑋 Cheng, Yung-Wei |
Keywords: | 錯誤偵測 資訊擷取 文本資料描述 Error Detection Information Extraction Textual Data Profiling |
Date: | 2005 |
Issue Date: | 2009-09-17 13:56:10 (UTC+8) |
Abstract: | Information extraction (IE) identifies descriptions of a targeted subject or event in natural-language text, extracts the information corresponding to each element of that subject or event, and consolidates the results into a database, thereby converting natural-language documents into structured core information. Given a targeted subject and a text collection, IE techniques can populate a database in which each record is a subject instance documented in the collection. However, even state-of-the-art IE techniques produce results that contain errors, and manual error detection and correction are labor-intensive and time-consuming. This validation cost remains a major obstacle to deploying practical IE applications with high validity requirements.
In this thesis, we propose two error detection methods: a string graph structure method and a string feature-based method. The former uses a graph structure to compare the characters in each data value and the relations between those characters, then computes a matching score for each value with a formula; the score indicates whether the value is erroneous. The latter uses string features to describe the surface characteristics of each string, then applies the SVM and C4.5 machine learning classification methods, with C4.5 inducing a decision tree, to classify values as valid or invalid. The two methods differ in that the former works directly on the raw string content and incorporates the graph-theoretic notions of node position and neighboring nodes, whereas the latter first transforms the raw string into feature values and then classifies them.
In our experiments, we use the database of IE results from the Presidential Office gazette of personnel appointments and dismissals (總統府人事任免公報) as test data. The results show that the proposed methods effectively detect invalid values. This work reduces validation cost, helps ensure high-quality IE results, and provides analytical and empirical evidence supporting wider practical use of IE techniques. |
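The two detection ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the record does not give the scoring formula or the feature set, so the score below (the fraction of a string's position-tagged characters and adjacent-character pairs already seen in known-valid values) and the three features are assumptions chosen only for illustration. The function names and the sample strings are likewise hypothetical.

```python
def build_graph(valid_strings):
    """Collect nodes and edges from values assumed to be correct.

    A node is a (position, character) pair; an edge is a pair of
    adjacent characters. Both definitions are illustrative assumptions.
    """
    nodes, edges = set(), set()
    for s in valid_strings:
        nodes.update(enumerate(s))      # (position, character)
        edges.update(zip(s, s[1:]))     # (character, next character)
    return nodes, edges


def match_score(s, nodes, edges):
    """Assumed score: share of the string's nodes and edges found in the graph."""
    if not s:
        return 0.0
    node_hits = sum(1 for node in enumerate(s) if node in nodes)
    edge_hits = sum(1 for edge in zip(s, s[1:]) if edge in edges)
    total = len(s) + (len(s) - 1)       # number of nodes plus number of edges
    return (node_hits + edge_hits) / total


def string_features(s):
    """Assumed surface features that a classifier such as C4.5 or an SVM could use."""
    digits = sum(ch.isdigit() for ch in s)
    cjk = sum("\u4e00" <= ch <= "\u9fff" for ch in s)
    return {"length": len(s), "digits": digits, "cjk": cjk,
            "other": len(s) - digits - cjk}


# Hypothetical job-title values standing in for verified extraction results.
valid_titles = ["行政院參事", "行政院秘書", "內政部參事"]
nodes, edges = build_graph(valid_titles)

for candidate in ["行政院參事", "院99參"]:
    print(candidate, round(match_score(candidate, nodes, edges), 3),
          string_features(candidate))
```

A low score flags a value for manual review; where to set the threshold, and which features to feed a classifier, would have to be tuned on labeled data.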
Reference: | [1] Paulson, L. D., “Data Quality: A Rising e-Business Concern,” IT Professional, Vol. 2, No. 4, July-Aug. 2000, pp. 10-14.
[2] Rahm, E. and Do, H.-H., “Data Cleaning: Problems and Current Approaches,” IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4, December 2000.
[3] 翁家緯, “A Study of Pattern-Recognition-Based Chinese Information Extraction Techniques” (以型態辨識為主的中文資訊擷取技術研究), Master's thesis, Department of Computer Science, National Chengchi University, 2003.
[4] Message Understanding Conference, URL: http://www.muc.saic.com
[5] Text Retrieval Conference, URL: http://trec.nist.gov
[6] Jim Cowie, Wendy Lehnert. 1996. Information Extraction. Communications of the ACM (CACM), 39(1), pp. 80-91.
[7] Appelt, D. E. and Israel, D. J. 1999. Introduction to Information Extraction Technology. In Proceedings of the 16th International Joint Conference on Artificial Intelligence.
[8] Peng, F. 1999. Models Development in IE Tasks – A Survey. CS685 (Intelligent Computer Interface) course project, Computer Science Department, University of Waterloo.
[9] Ellen Riloff. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 811-816.
[10] Ellen Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1044-1049.
[11] Califf, M. E. and Mooney, R. J. 1999. Relational Learning of Pattern-match Rules for Information Extraction. In Proceedings of the 16th National Conference on AI, pp. 328-334.
[12] Kushmerick, N., Weld, D. and Doorenbos, R. 1997. Wrapper Induction for Information Extraction. In Proceedings of the 15th International Joint Conference on AI (IJCAI-97), pp. 729-737.
[13] Kushmerick, N. 1998. Wrapper Induction: Efficiency and Expressiveness. In Proceedings of the AAAI-98 Workshop on Artificial Intelligence and Information Integration, pp. 15-68, AAAI Press, Menlo Park, California.
[14] Chun-Nan Hsu and Ming-Tzung Dung. Aug 1998. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Journal of Information Systems, Special Issue on Semi-structured Data, Vol. 23, No. 8, pp. 521-538.
[15] Chun-Nan Hsu and Chien-Chi Chang. 1999. Finite-state Transducers for Semi-structured Text Mining. In Proceedings of the IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 38-49, Stockholm, Sweden.
[16] Jyi-Shane Liu, Mu-Hsi Tseng. November 2001. Extracting Government Personnel Information from Official Gazettes. In Proceedings of the Sixth Conference on Artificial Intelligence and Applications, pp. 593-598, Kaohsiung, Taiwan.
[17] Oman, R. C. and Ayers, T. B., “Improving Data Quality,” Journal of Systems Management, May 1988, pp. 31-35.
[18] Tayi, G. K. and Ballou, D. P., “Examining Data Quality,” Communications of the ACM (41:2), Feb. 1998, pp. 54-57.
[19] Ballou, D. P. and Pazer, H. L., “Implication of Data Quality for Spreadsheet Analysis,” Data Base, Spr. 1987, pp. 13-19.
[20] Redman, T. C., Data Quality for the Information Age, Artech House, Inc., 1996. Redman, T. C., “The Impact of Poor Data Quality on the Typical Enterprise,” Communications of the ACM (41:2), Feb. 1998, pp. 79-82.
[21] Brauer, B., “Data Quality – Spinning Straw Into Gold,” available online at: http://www2.sas.com/proceedings/sugi26/p117-26.pdf, 2000.
[22] Muller, H. and Freytag, J. C., Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt University Berlin, 2003.
[23] V. Raman and J. M. Hellerstein, An Interactive Framework for Data Cleaning, UC Berkeley Computer Science Division Report No. UCB/CSD00/1110, September 2000.
[24] H. Galhardas, D. Florescu and D. Shasha, An Extensible Framework for Data Cleaning, INRIA Technical Report, 1999.
[25] Kaufman, L. and Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons, 1990.
[26] 李念秋, “A Study of Data Quality Improvement: Development and Evaluation of Erroneous Data Detection Techniques” (資料品質改善之研究:錯誤資料偵測技術之發展與評估), Master's thesis, Department of Information Management, National Sun Yat-sen University, 2002.
[27] Quinlan, J. R., “Induction of Decision Trees,” Machine Learning, Vol. 1, 1986, pp. 81-106.
[28] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[29] Chan, P. K., Fan, W., Prodromidis, A. L., and Stolfo, S. J., “Distributed Data Mining in Credit Card Fraud Detection,” IEEE Intelligent Systems, Vol. 14, No. 6, 1999, pp. 67-74.
[30] N. Cristianini, J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[31] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[32] Elmasri, R. and Navathe, S., Fundamentals of Database Systems, 3rd edition, 2000.
[33] LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html, URL: http://www.csie.ntu.edu.tw/~r91034/svm/svm_tutorial.html
[34] Redman, T., Data Quality for the Information Age, Artech House, Boston, 1996.
[35] Presidential Office gazette of personnel appointments and dismissals (總統府人事任免公報), URL: http://www.president.gov.tw/2_report/layer2.html
[36] Maletic, J. I. and Marcus, A., Data Cleansing: Beyond Integrity Analysis. Proceedings of the Conference on Information Quality (IQ2000), Boston, October 2000.
[37] Legislative Yuan News Knowledge Management System (立法院新聞知識管理系統), URL: http://nplnews.ly.gov.tw/index.jsp |
Description: | Master's thesis, National Chengchi University, Department of Computer Science, 93753006, 94 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0093753006 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
|
All items in the NCCU Institutional Repository (政大典藏) are protected by copyright, with all rights reserved.
|