Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/37106
|
Title: | 中文繁簡等義詞自動辨識之研究 A Study on Automatic Recognition on Exact Synonyms between Traditional and Simplified Chinese |
Authors: | 黃群弼 |
Contributors: | 劉吉軒 Liu,Jyi Shane 黃群弼 |
Keywords: | Traditional-Simplified Chinese mapping; exact synonyms; automatic recognition |
Date: | 2008 |
Issue Date: | 2009-09-19 12:10:04 (UTC+8) |
Abstract: | Traditional and Simplified Chinese differ not only in character forms and computer encodings but also in the usage of some vocabulary. Words that differ between the two yet carry the same meaning are called exact synonyms between Traditional and Simplified Chinese. These synonyms can create obstacles when the two communities interact: in conversation, in the translation of documents and books, and in the conversion of software systems, they easily lead to misunderstandings of word meaning. At present the differing vocabulary is handled manually, which is time-consuming, labor-intensive, and prone to omissions. If a computer could recognize these exact synonyms automatically, the recognized pairs could be offered as hints to resolve the misunderstandings caused by differences in word usage.
Following the experimental design, we first build Traditional and Simplified Chinese corpora in two categories, computer-related and general, as the basis for recognition. We then establish the research framework, which consists of two stages and three methods. The first stage applies the first method, N-gram, to recognize exact synonyms and evaluates whether a single method can recognize them effectively. The second stage applies the second method, PMI-IR & LC-IR, and the third method, Context Vector, and evaluates whether they improve recognition.
In line with the goal of letting a computer automatically recognize exact synonyms between Traditional and Simplified Chinese in a corpus, this study proposes a new recognition framework: N-gram produces an initial set of exact synonyms, and PMI-IR & LC-IR and Context Vector then raise Precision by roughly 0 to 20%. We conclude that, using corpora from the different language varieties, N-gram can recognize exact synonyms, and that combining it with PMI-IR & LC-IR and Context Vector strengthens recognition and addresses the insufficient recognition ability of a single method. |
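The three kinds of measure named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the window size, and the toy tokens are all hypothetical, the PMI function is the textbook definition computed from raw counts rather than the search-engine hit counts that PMI-IR relies on, and LC-IR is not shown. Comparing context vectors across the two scripts also assumes the context words have already been mapped to a common script.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous character n-gram of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def pmi(joint, count_x, count_y, total):
    """Pointwise mutual information: log2(p(x, y) / (p(x) * p(y))).

    All arguments are raw counts; `total` is the number of observations.
    Returns -inf when any of the three counts is zero.
    """
    if joint == 0 or count_x == 0 or count_y == 0:
        return float("-inf")
    return math.log2(joint * total / (count_x * count_y))


def context_vector(tokens, target, window=2):
    """Count the tokens found within `window` positions of each `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vec[tokens[j]] += 1
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy, hypothetical token streams standing in for segmented corpora.
    trad = ["install", "軟體", "on", "the", "computer"]
    simp = ["install", "软件", "on", "the", "computer"]
    sim = cosine(context_vector(trad, "軟體"), context_vector(simp, "软件"))
    print(char_ngrams("abcd", 2), round(sim, 3))
```

In a two-stage arrangement like the one the abstract describes, n-grams would supply candidate terms and the association and context scores would then be used to filter or re-rank candidate pairs; the thresholds and the way the scores are combined are left open here.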
Description: | Master's thesis; National Chengchi University; Department of Computer Science; 94971010; 97 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0094971010 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|