Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/54795
|
Title: | 應用平行語料建構中文斷詞組件 Applications of Parallel Corpora for Chinese Segmentation |
Authors: | 王瑞平 Wang, Jui Ping |
Contributors: | 劉昭麟 Liu, Chao Lin 王瑞平 Wang, Jui Ping |
Keywords: | Chinese word segmentation; Chinese-English parallel corpora; unknown words; overlapping ambiguity |
Date: | 2011 |
Issue Date: | 2012-10-30 11:46:02 (UTC+8) |
Abstract: | In this thesis, we build a Chinese word segmentation system based on Chinese-English parallel corpora and use it to segment corpora from different domains. Given Chinese-English parallel corpora from a domain, the system automatically produces training corpora of good quality, which saves the time and labor that manual segmentation would require. A segmentation model can then be trained on the produced corpus. To produce the training corpus, we first look up every Chinese sentence of the parallel corpus in a Chinese dictionary to generate its candidate segmentations. We then use the Chinese translations of the words in the aligned English sentence to resolve overlapping ambiguity and discard incorrect candidates. In addition, we extract new Chinese-English translation pairs and unknown words from the parallel corpus and add them to the English-Chinese dictionary module and the Chinese dictionary module, respectively, to improve segmentation performance. We evaluate segmentation performance in two experiments, using experimental corpora from three different domains. In the first experiment, manually segmented Chinese sentences serve as test data. In the second experiment, segmentation performance is assessed indirectly through the quality of Chinese-to-English translation. The experimental results show that our system achieves acceptable segmentation performance. |
Description: | Master's thesis; National Chengchi University; Department of Computer Science; 99753016; 100 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0099753016 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
|
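The two steps named in the abstract, generating candidate segmentations by dictionary lookup and then using translations from the aligned English sentence to resolve overlapping ambiguity, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the toy lexicon, the toy English-Chinese dictionary, and the scoring rule (count the segmented words that appear as a listed translation of some English token) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def candidate_segmentations(sentence: str, lexicon: Set[str]) -> List[List[str]]:
    """Enumerate every split of `sentence` into words found in `lexicon`."""
    if not sentence:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if word in lexicon:
            for rest in candidate_segmentations(sentence[end:], lexicon):
                results.append([word] + rest)
    return results


def translation_support(words: List[str], english_tokens: List[str],
                        en_zh: Dict[str, Set[str]]) -> int:
    """Count segmented words that are a listed translation of an English token."""
    translations: Set[str] = set()
    for token in english_tokens:
        translations.update(en_zh.get(token, set()))
    return sum(1 for w in words if w in translations)


def pick_segmentation(sentence: str, english_tokens: List[str],
                      lexicon: Set[str], en_zh: Dict[str, Set[str]]) -> List[str]:
    """Keep the candidate with the most translation support (hypothetical rule)."""
    candidates = candidate_segmentations(sentence, lexicon)
    if not candidates:
        # Assumed fallback: split into single characters when no candidate exists.
        return list(sentence)
    return max(candidates,
               key=lambda ws: translation_support(ws, english_tokens, en_zh))


# Toy data (hypothetical): "研究生命起源" is ambiguous between
# 研究/生命/起源 and 研究生/命/起源.
lexicon = {"研究", "研究生", "生命", "命", "起源"}
en_zh = {"study": {"研究"}, "life": {"生命"}, "origin": {"起源"}}
english = ["study", "the", "origin", "of", "life"]

best = pick_segmentation("研究生命起源", english, lexicon, en_zh)
print(" / ".join(best))
```

In this toy case the aligned English sentence supports 研究, 生命, and 起源, so the candidate containing 研究生 is rejected. Exhaustive enumeration grows quickly with sentence length, so a practical system would likely prune candidates or use a lattice.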
Files in This Item:
File | Size | Format |
301601.pdf | 1029Kb | Adobe PDF |
All items in 政大典藏 are protected by copyright, with all rights reserved.
|