Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/99804
Title: | 應用序列標記技術於地方志的實體名詞辨識 (Named Entity Recognition in Difangzhi Using Sequential Labeling Techniques) |
Authors: | 黃致凱 Huang, Chih Kai |
Contributors: | 劉昭麟 Liu, Chao Lin; 黃致凱 Huang, Chih Kai |
Keywords: | 文字探勘 (Text Mining); 實體名詞辨識 (Named Entity Recognition); 機器學習 (Machine Learning); 數位人文 (Digital Humanities) |
Date: | 2016 |
Issue Date: | 2016-08-09 11:24:27 (UTC+8) |
Abstract: | Difangzhi are local gazetteers compiled by local governments in imperial China. Their contents are broad, covering biographies, geography, records of official appointments, and more, and they contain many people, events, and places that have not yet been catalogued. Because the vocabulary and grammar of Difangzhi differ substantially from modern Chinese, and most of the text carries no punctuation, the material arrives as unsegmented character sequences without word, sentence, or paragraph boundaries, so existing natural language processing tools cannot be applied directly. This study therefore builds a named entity recognition model tailored to Difangzhi, using sequential labeling to tag person and place names together with official titles, entry into officialdom, reign names, and dates, in order to extract more information about historical Chinese figures from the annotated text.

We build the sequence labeling model with supervised learning. To generate training data, we extract person and place names from previously curated biographical records of Difangzhi and, together with known noun lists, annotate the corresponding corpus. Even after manual curation these lists still contain errors, so the data are first cleaned in a preprocessing step. Annotation also raises ambiguity problems, for which we propose three disambiguation methods. Conditional random fields serve as the sequence labeling model, with noun lists and rules used for pre-labeling. In experiments on previously unprocessed Difangzhi text, the precision of person name recognition exceeds 80%, and the precision of place name recognition reaches about 86%; the main reason for this performance is that the curated corpus and the test corpus record and describe their content in very similar ways. Using the labeled output, we run a simple experiment that links person names to place names and manually verify a sample of the results; the sample shows that our method can correctly link person and place names under certain grammatical patterns. To support deeper analysis in the future, we also try to split biographical entries out of the text, using known characteristics of Difangzhi and a finite state machine to detect entry beginnings; the method finds some beginnings but misses many others.

In future work we plan to add more label types and refine the labeling design to further improve recognition. To extract more precise biographical information, beyond paragraph and sentence segmentation we will attempt grammatical analysis of Difangzhi, so that syntactic structure can be used to connect persons with other named entities, such as place names and official titles, and to compile more complete biographical information about the people appearing in the corpus automatically. |
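As a rough illustration of the character-level sequence labeling described in the abstract, the sketch below trains a linear-chain conditional random field on BIO tags for person (PER) and place (LOC) names. It uses the third-party sklearn-crfsuite package rather than the toolkit used in the thesis, and the feature templates, tag inventory, and example sentences are hypothetical; the actual system also labels official titles, reign names, and dates, and applies noun-list and rule-based pre-labeling with three disambiguation methods that are not reproduced here.

```python
# A minimal sketch, not the thesis implementation: character-level BIO
# tagging of Difangzhi text with a linear-chain CRF via sklearn-crfsuite
# (pip install sklearn-crfsuite). Features, tags, and sentences below
# are illustrative assumptions.
import sklearn_crfsuite

# Toy training sentence with person (PER) and place (LOC) spans that a
# noun list might have pre-annotated (hypothetical example).
train_chars = list("知縣王安石調杭州")
train_labels = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]

def char_features(chars, i):
    """Simple unigram/bigram window features around position i."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<BOS>",
        "next": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
        "bigram": (chars[i - 1] if i > 0 else "<BOS>") + chars[i],
    }

def sent_to_features(chars):
    return [char_features(chars, i) for i in range(len(chars))]

X_train = [sent_to_features(train_chars)]
y_train = [train_labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)

# Tag an unseen sentence and pair each character with its predicted label.
test_chars = list("知府李明知蘇州")
predicted = crf.predict([sent_to_features(test_chars)])[0]
print(list(zip(test_chars, predicted)))
```

Character-level tagging sidesteps the word-segmentation problem noted in the abstract, since Difangzhi text is unsegmented and largely unpunctuated; noun-list and rule-based pre-labeling cues could be added as further keys in char_features.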
Description: | 碩士 (Master's thesis), 國立政治大學 (National Chengchi University), 資訊科學學系 (Department of Computer Science), 102753029 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0102753029 |
Data Type: | thesis |
Appears in Collections: | [資訊科學系 Department of Computer Science] 學位論文 (Theses)
Files in This Item:
File | Size | Format
302901.pdf | 2266 KB | Adobe PDF