National Chengchi University Institutional Repository (NCCUR): Item 140.119/119910


Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/119910

Title:	唐代墓誌銘與中國佛教寺廟志斷句研究 (Sentence Segmentation for Tomb Biographies of Tang Dynasty and Chinese Buddhist Temple Gazetteers)
Authors:	張逸 (Chang, Yi)
Contributors:	劉昭麟 (Liu, Chao-Lin); 張逸 (Chang, Yi)
Keywords:	Deep learning; Machine learning; Natural language processing
Date:	2018
Issue Date:	2018-09-03 15:52:15 (UTC+8)
Abstract:	Before the 20th century, Chinese texts were customarily written without punctuation, so readers had to segment passages into sentences using their own experience and feel for the language. Because that experience differs from reader to reader, the same passage can be read in different ways or even misread. Sentence segmentation is therefore both the most basic and a difficult first step in understanding a text. Earlier studies automated the segmentation of classical Chinese with regular expressions, machine learning, and deep learning in order to reduce the time that historians and literary scholars spend on it. Although many such studies exist, no system has yet integrated them to reach the best segmentation performance.
This study designs an experimental procedure that combines results from earlier research, tests the combinations, and compares their Precision, Recall, and F1 scores to find the best one, with the goal of further reducing the time needed for segmentation. The corpora are the “Tomb Biographies of Tang Dynasty” and the “Chinese Buddhist Temple Gazetteers”. Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM), two models that performed well in earlier work on automatic segmentation, are paired with context features to form the baselines for further feature and model experiments. The feature experiments add different features to the baselines to identify the useful ones. The model experiments examine different machine learning methods and model training methods to identify those that improve model performance.
In the results, the most effective features are context and word-segmentation statistics, and the most effective model is CRF+LSTM, which integrates CRF and LSTM and in which the CRF is strengthened by an algorithm that reinforces its weak points. The F1 scores on the Tomb Biographies of Tang Dynasty and the Chinese Buddhist Temple Gazetteers reach 0.873 and 0.675, respectively.
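The abstract mentions context features and the Precision, Recall, and F1 metrics without giving their exact form. The sketch below shows one common way to set up such a task, assumed here rather than taken from the thesis: each character is labeled 1 if a break follows it and 0 otherwise, context features are the neighboring characters in a fixed window, and the metrics are computed over break positions. The function names, the punctuation set, the window size, and the `<pad>` token are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed set of break-marking punctuation; the thesis may use a different set.
PUNCTUATION = set("，。、；：？！")


def to_labels(punctuated: str) -> Tuple[str, List[int]]:
    """Strip punctuation and label each character 1 if a break followed it, else 0."""
    chars: List[str] = []
    labels: List[int] = []
    for ch in punctuated:
        if ch in PUNCTUATION:
            if labels:  # ignore punctuation that precedes any character
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def context_features(text: str, i: int, window: int = 2) -> Dict[str, str]:
    """Return the characters within `window` positions of index i, padded at the edges."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"char[{offset}]"] = text[j] if 0 <= j < len(text) else "<pad>"
    return feats


def precision_recall_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Score predicted break positions against gold break positions."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

In this setup, `to_labels` turns a punctuated training passage into an unpunctuated character sequence plus gold labels, `context_features` produces the per-character input that a sequence labeler such as a CRF could consume, and `precision_recall_f1` compares a model's predicted labels with the gold labels. Whether the thesis counts the passage-final position as a break is not stated in the abstract, so scores from this sketch are not directly comparable with the reported 0.873 and 0.675.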
Description:	Master's thesis; National Chengchi University; Department of Computer Science; 104753032
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0104753032
Data Type:	thesis
DOI:	10.6814/THE.NCCU.CS.022.2018.B02
Appears in Collections:	[Department of Computer Science] Theses

Files in This Item:

File	Size	Format
張逸.pdf	1391Kb	Adobe PDF

All items in the NCCU Institutional Repository are protected by copyright, with all rights reserved.
