Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/124825
|
Title: | 基於主動式學習之古漢語斷句系統發展與應用研究 Development and Application of An Ancient Chinese Sentence Segmentation System Based on Active Learning |
Authors: | 徐志帆 Hsu, Chih-Fan |
Contributors: | 陳志銘 Chen, Chih-Ming 徐志帆 Hsu, Chih-Fan |
Keywords: | digital humanities; active learning; machine learning; automatic ancient Chinese sentence segmentation; human-computer interaction |
Date: | 2019 |
Issue Date: | 2019-08-07 16:26:24 (UTC+8) |
Abstract: | This study develops an "Ancient Chinese Sentence Segmentation System Based on Active Learning" to support digital humanities research. The system combines active learning with machine learning algorithms so that, through human-computer cooperation, less training corpus is needed to build an automatic ancient Chinese sentence segmentation model, and humanities scholars can segment previously uninterpreted documents more efficiently. To identify the most suitable algorithm and feature template for the system, the first experiment compares segmentation models built with different algorithms and feature templates under two text selection methods: sequential selection and active learning. The results show that conditional random fields with three-character feature templates learn effectively under active learning and are therefore suitable for developing an "active learning sentence segmentation mode". In the second experiment, humanities scholars were invited to use the system to segment ancient Chinese texts. The segmentation models built from each scholar's annotation data were compared and analyzed, and semi-structured interviews were conducted to understand in depth the scholars' experience of, and suggestions for, system-assisted segmentation. The results show that the system effectively learns the scholars' segmentation annotations and that the model's prediction ability improves continuously through human-computer cooperation. The analysis also finds that the model's segmentation prediction ability is significantly negatively correlated with the scholars' annotation type ratio and adjacent-character type ratio. According to the interviews, the scholars evaluated the system's operation process and interface positively, and most respondents considered that its segmentation prediction function provides effective assistance in ancient Chinese sentence segmentation. Future work could add a named entity model or feature templates designed around other ancient Chinese rules to further improve segmentation prediction. It is also hoped that the system can be applied in humanities education and developed into a digital humanities education platform for training ancient Chinese sentence segmentation. |
Description: | Master's thesis; National Chengchi University; Graduate Institute of Library, Information and Archival Studies; 106155007 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0106155007 |
Data Type: | thesis |
DOI: | 10.6814/NCCU201900543 |
Appears in Collections: | [Graduate Institute of Library, Information and Archival Studies] Theses
|
Files in This Item:
File | Size | Format |
500701.pdf | 1822Kb | Adobe PDF |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|