National Chengchi University Institutional Repository (NCCUR): Item 140.119/112204

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/112204

Title:	適用於中文史料文本之作者語言模型分析方法研究 An enhanced writer language model for Chinese historical corpora
Authors:	梁韶中 Liang, Shao Zhong
Contributors:	蔡銘峰 Tsai, Ming Feng; 梁韶中 Liang, Shao Zhong
Keywords:	Language model; Chinese historical corpora; Long words; Recursive neural network language model; Smoothing method; Kneser-Ney
Date:	2017
Issue Date:	2017-08-28 11:41:07 (UTC+8)
Abstract:	As digital archiving has expanded in recent years, more and more valuable Chinese historical texts are being preserved digitally. During preservation, however, author information is often lost or missing, which compromises the completeness of the texts. This thesis proposes an author analysis method suited to Chinese historical corpora. The method trains a dedicated language model for each candidate author and applies smoothing, so that a word in a test text never receives zero probability and causes errors in the computation. The thesis mainly adopts interpolated modified Kneser-Ney smoothing, which takes into account the frequencies of both higher-order and lower-order n-grams and has therefore become a common choice for building language models.
Merging all of a candidate author's articles into a single language model ignores many characteristics of the texts. In addition to the historical texts themselves, the thesis therefore incorporates metadata, including statistics on manually labeled topic categories, so that the resulting language model fits the test text better and prediction accuracy improves. It also adds custom terms to match the proper-noun usage of the corpus, and adds a weight for long words on top of the standard language model in order to examine how word length relates to prediction accuracy. Finally, recursive neural network language models are also used to predict authors and are compared with the traditional language model analysis.
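The per-author language model idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several things in it are assumptions: it uses a bigram model with a single fixed discount (plain interpolated Kneser-Ney, not the modified variant with several discounts), it expects texts that are already segmented into words, the toy corpus is invented, and every name (`KneserNeyBigramLM`, `build_models`, `attribute`, `long_len`, `long_weight`) is hypothetical. The long-word weight shown here is only one possible reading of "adding a weight for long words"; with a weight other than 1.0 the result is a heuristic score rather than a log-probability.

```python
import math
from collections import Counter

BOS, EOS, UNK = "<s>", "</s>", "<unk>"

class KneserNeyBigramLM:
    """Interpolated Kneser-Ney bigram model with one fixed discount (a sketch)."""

    def __init__(self, sentences, vocab_size, discount=0.75):
        self.d = discount
        self.vocab_size = vocab_size   # shared by all authors, includes EOS and UNK
        self.bigrams = Counter()       # c(prev, word)
        self.context = Counter()       # c(prev)
        for tokens in sentences:
            padded = [BOS, *tokens, EOS]
            for prev, word in zip(padded, padded[1:]):
                self.bigrams[prev, word] += 1
                self.context[prev] += 1
        # Number of distinct words following prev, and distinct contexts preceding word.
        self.followers = Counter(prev for prev, _ in self.bigrams)
        self.histories = Counter(word for _, word in self.bigrams)
        self.n_types = len(self.bigrams)

    def continuation_prob(self, word):
        # Discounted continuation probability, interpolated with a uniform
        # distribution so that unseen words never get probability zero.
        seen = max(self.histories[word] - self.d, 0.0) / self.n_types
        reserved = self.d * len(self.histories) / self.n_types
        return seen + reserved / self.vocab_size

    def prob(self, prev, word):
        p_cont = self.continuation_prob(word)
        c_prev = self.context[prev]
        if c_prev == 0:                # unseen context: back off completely
            return p_cont
        seen = max(self.bigrams[prev, word] - self.d, 0.0) / c_prev
        backoff_weight = self.d * self.followers[prev] / c_prev
        return seen + backoff_weight * p_cont

    def log_prob(self, tokens, long_len=None, long_weight=1.0):
        # Hypothetical long-word weighting: words of at least long_len
        # characters contribute long_weight times their log-probability.
        padded = [BOS, *tokens, EOS]
        total = 0.0
        for prev, word in zip(padded, padded[1:]):
            is_long = long_len is not None and word != EOS and len(word) >= long_len
            weight = long_weight if is_long else 1.0
            total += weight * math.log(self.prob(prev, word))
        return total

def build_models(corpus):
    """Train one model per author over a vocabulary shared by all authors."""
    vocab = {EOS, UNK}
    for sentences in corpus.values():
        for tokens in sentences:
            vocab.update(tokens)
    models = {a: KneserNeyBigramLM(s, len(vocab)) for a, s in corpus.items()}
    return models, vocab

def attribute(models, tokens, **weighting):
    """Return the author whose model scores the text highest, plus all scores."""
    scores = {a: lm.log_prob(tokens, **weighting) for a, lm in models.items()}
    return max(scores, key=scores.get), scores

# Invented toy corpus of pre-segmented sentences for two hypothetical authors.
corpus = {
    "A": [["朝廷", "下詔", "減免", "賦稅"], ["朝廷", "下詔", "賑濟", "災民"]],
    "B": [["商人", "販運", "茶葉", "出口"], ["商人", "販運", "絲綢", "出口"]],
}
models, vocab = build_models(corpus)
best, scores = attribute(models, ["朝廷", "下詔", "賑濟", "賦稅"])
print(best, scores)
```

Sharing one vocabulary size across all authors keeps the scores comparable, since every model then spreads its reserved probability mass over the same number of word types. Passing, for example, `long_len=3, long_weight=1.5` to `attribute` switches on the illustrative long-word weighting.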
Description:	Master's thesis, National Chengchi University, Department of Computer Science, 103753014
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0103753014
Data Type:	thesis
Appears in Collections:	[Department of Computer Science] Theses

Files in This Item:

File	Size	Format
301401.pdf	1334Kb	Adobe PDF

All items in the NCCU Institutional Repository (政大典藏) are protected by copyright, with all rights reserved.
