National Chengchi University Institutional Repository (NCCUR): Item 140.119/32649
    Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/32649


    Title: Error Detection on Chinese Information Extraction Results (中文資訊擷取結果之錯誤偵測)
    Author: Cheng, Yung-Wei (鄭雍瑋)
    Contributors: Liu, Jyi-Shane (劉吉軒); Cheng, Yung-Wei (鄭雍瑋)
    Keywords: Error Detection (錯誤偵測); Information Extraction (資訊擷取); Textual Data Profiling (文本資料描述)
    Date: 2005
    Upload time: 2009-09-17 13:56:10 (UTC+8)
    Abstract: Information extraction (IE) identifies descriptions of a targeted subject or event in natural-language text, extracts the information that fills each element of that subject or event, and stores the results in a database, turning natural-language documents into structured core information. Given a targeted subject and a text collection, IE thus populates a database in which each record is a subject instance documented in the collection. Even state-of-the-art IE techniques, however, produce results that contain errors, and detecting and correcting those errors by hand is labor-intensive and time-consuming. This validation cost remains a major obstacle to deploying practical IE applications that have high validity requirements.
    In this thesis, we propose two methods for detecting erroneous data: a string graph structure method and a string feature-based method. The former uses a graph structure to compare the characters in each data value and the relations between those characters, then computes a matching score for each value with a formula; the score indicates whether the value is erroneous. The latter uses string features to describe the surface characteristics of each string, applies the SVM and C4.5 machine learning algorithms to learn a classifier (a decision tree in the case of C4.5), and classifies each value as valid or invalid. The two methods differ in how they treat the data: the graph structure method compares the raw strings directly and draws on the graph-theoretic concepts of node position and neighboring nodes, whereas the feature-based method first transforms each raw string into feature values and then classifies them.
    In our experiments, we use as test data the database of IE results extracted from the Presidential Office's gazette of personnel appointments and dismissals (總統府人事任免公報), that is, government personnel directives. The results show that the proposed methods effectively detect invalid values. Our work reduces validation cost and improves the quality of IE results, and it provides both analytical and empirical evidence that the usability of IE results can be enhanced, which should encourage wider practical application of IE techniques.
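The two detection ideas in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract gives neither the scoring formula nor the feature set, so the position-tagged character nodes, the adjacent-character edges, the coverage score, and the four surface features are all hypothetical choices, and the sample values are made up for illustration.

```python
from collections import Counter


def build_graph(valid_values):
    """Collect position-tagged characters (nodes) and adjacent pairs (edges).

    Assumption: a node is (position, character) and an edge is
    (position, character, next character).
    """
    nodes, edges = Counter(), Counter()
    for value in valid_values:
        for i, ch in enumerate(value):
            nodes[(i, ch)] += 1
        for i in range(len(value) - 1):
            edges[(i, value[i], value[i + 1])] += 1
    return nodes, edges


def graph_score(value, nodes, edges):
    """Fraction of the value's nodes and edges found in the graph (0.0 to 1.0).

    Assumption: this coverage ratio stands in for the unspecified formula.
    """
    hits = [(i, ch) in nodes for i, ch in enumerate(value)]
    hits += [(i, value[i], value[i + 1]) in edges for i in range(len(value) - 1)]
    return sum(hits) / len(hits) if hits else 0.0


def surface_features(value):
    """Turn a string into numeric surface features (hypothetical feature set)."""
    return {
        "length": len(value),
        "digits": sum(ch.isdigit() for ch in value),
        "cjk": sum("\u4e00" <= ch <= "\u9fff" for ch in value),
        "punct": sum(not ch.isalnum() and not ch.isspace() for ch in value),
    }


# Hypothetical job-title values assumed to be known valid.
valid = ["教育部部長", "內政部部長", "外交部次長"]
nodes, edges = build_graph(valid)

well_formed = graph_score("教育部次長", nodes, edges)  # every node and edge is known
garbled = graph_score("部長,任命", nodes, edges)      # no node or edge is known
features = surface_features("部長,任命")
```

A low graph score flags a value for review. The feature dictionaries, labeled valid or invalid, would then be passed to a decision-tree or SVM learner to obtain the classifier.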
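    Description: Master's thesis
    National Chengchi University
    Department of Computer Science
    93753006
    94
    Source: http://thesis.lib.nccu.edu.tw/record/#G0093753006
    Type: thesis
    Appears in collections: [Department of Computer Science] Theses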

    Files in this item:

    File          Size
    75300601.pdf  47 KB
    75300602.pdf  66 KB
    75300603.pdf  92 KB
    75300604.pdf  256 KB
    75300605.pdf  108 KB
    75300606.pdf  182 KB
    75300607.pdf  248 KB
    75300608.pdf  524 KB
    75300609.pdf  129 KB
    75300610.pdf  81 KB
    75300611.pdf  337 KB


    All items in NCCUR are protected by their original copyright.

