National Chengchi University Institutional Repository (NCCUR): Item 140.119/37106


    Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/37106


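    Title: 中文繁簡等義詞自動辨識之研究
    A Study on Automatic Recognition on Exact Synonyms between Traditional and Simplified Chinese
    Author: 黃群弼
    Contributors: 劉吉軒
    Liu, Jyi Shane
    黃群弼
    Keywords: Traditional-Simplified Chinese correspondence (中文繁簡對照)
    exact synonyms (等義詞)
    automatic recognition (自動辨識)
    Date: 2008
    Upload time: 2009-09-19 12:10:04 (UTC+8)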
    Abstract: Traditional and Simplified Chinese differ not only in character forms and computer encodings but also in the usage of part of their vocabulary. Words that differ in usage between the two yet carry the same meaning are called exact synonyms (等義詞) between Traditional and Simplified Chinese. These synonyms can become obstacles to cultural exchange between the two sides: they easily cause misunderstandings of meaning in conversation, in the translation of documents and books, and in the conversion of software systems. At present such vocabulary differences are resolved manually, which is time-consuming, labor-intensive, and prone to omissions. If a computer could recognize these exact synonyms automatically, the recognized synonyms could be offered as hints to resolve misunderstandings caused by differences in word meaning.
    Following the experimental design, we first build Traditional and Simplified Chinese corpora in two categories, computing and general, as the basis for recognition. We then establish the research framework and methods, which comprise two stages and three methods. The first stage applies the first method, N-gram, to recognize exact synonyms and evaluates whether this single method can recognize them effectively. The second stage applies the second method, PMI-IR & LC-IR, and the third method, Context Vector, and evaluates whether they improve the recognition of exact synonyms.
    The purpose of this study is to let a computer automatically recognize exact synonyms between Traditional and Simplified Chinese in corpora. To that end we propose a new recognition framework: N-gram produces a preliminary set of exact synonyms, and the PMI-IR & LC-IR and Context Vector methods then raise Precision by roughly 0% to 20%, depending on the case. We conclude that, given corpora in the two language varieties, N-gram can recognize exact synonyms, and that combining it with PMI-IR & LC-IR and Context Vector strengthens recognition and addresses the insufficient recognition ability of a single method.
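    The second-stage scoring named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the hit counts, the window size, and the sample sentences are all hypothetical. It assumes that PMI is estimated from document hit counts (the usual reading of PMI-IR), that both corpora are already word-segmented and converted to a common script so that contexts are comparable, and it leaves LC-IR out entirely. The word pair 軟體 / 軟件 ("software") is only an illustrative candidate.

```python
import math
from collections import Counter


def pmi_score(hits_both, hits_a, hits_b, total_docs):
    """Pointwise mutual information (base 2) estimated from hit counts.

    Assumption: counts come from some search index; here they are plain integers.
    """
    if min(hits_both, hits_a, hits_b) <= 0:
        return float("-inf")  # no co-occurrence evidence at all
    p_both = hits_both / total_docs
    p_a = hits_a / total_docs
    p_b = hits_b / total_docs
    return math.log2(p_both / (p_a * p_b))


def context_vector(tokens, target, window=2):
    """Count words within `window` tokens of every occurrence of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Add neighbours on both sides, skipping the target position itself.
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical segmented sentences, already mapped to one script.
traditional_tokens = ["請", "安裝", "軟體", "後", "重新", "啟動"]
simplified_tokens = ["請", "安裝", "軟件", "後", "重新", "啟動"]

similarity = cosine_similarity(
    context_vector(traditional_tokens, "軟體"),
    context_vector(simplified_tokens, "軟件"),
)
print("context similarity:", similarity)
print("PMI from made-up hit counts:", pmi_score(50, 100, 100, 10000))
```

    In a pipeline of the kind the abstract describes, candidate pairs from the first stage would be kept only if such scores exceed a threshold; how the thesis combines or thresholds the scores is not stated in this record.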
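    Description: Master's thesis
    National Chengchi University
    Department of Computer Science
    94971010
    97
    Source: http://thesis.lib.nccu.edu.tw/record/#G0094971010
    Type: thesis
    Appears in collections: [Department of Computer Science] Theses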

    Files in this item:

    File          Size      Format
    101001.pdf    106Kb     Adobe PDF
    101002.pdf    131Kb     Adobe PDF
    101003.pdf    131Kb     Adobe PDF
    101004.pdf    151Kb     Adobe PDF
    101005.pdf    215Kb     Adobe PDF
    101006.pdf    371Kb     Adobe PDF
    101007.pdf    462Kb     Adobe PDF
    101008.pdf    865Kb     Adobe PDF
    101009.pdf    170Kb     Adobe PDF
    101010.pdf    154Kb     Adobe PDF
    101011.pdf    1040Kb    Adobe PDF


    All items in NCCUR are protected by their original copyright.

