National Chengchi University Institutional Repository (NCCUR): Item 140.119/112205

Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/112205

Title:	以詞性組合為基礎之中文語言特徵研究 (A Study of Part-of-Speech Pair-based Language Features in Chinese Texts)
Author:	江易倫 Jiang, Yi Lun
Contributors:	劉吉軒 Liu, Jyi Shane; 江易倫 Jiang, Yi Lun
Keywords:	Authorship attribution; Language features; Random forest
Date:	2017
Uploaded:	2017-08-28 11:41:25 (UTC+8)
Abstract:	In authorship attribution research, the choice of language features has always been critical because it directly affects prediction performance. Most commonly used language features, such as high-frequency words, n-grams, and punctuation, classify well, but the terms within these features cannot explain the causal relationships or the differences between classes. To address this problem, this thesis proposes three linguistically meaningful language features as auxiliary verification: part-of-speech pairs, negation-degree pairs, and modal-word pairs. Using the texts of the author Lei Chen (雷震) as a baseline, it examines whether these features are applicable in two research directions: "same topic, different authors" and "same author, different topics". The thesis builds classification models with the random forest algorithm, evaluates their classification performance with the out-of-bag (OOB) error rate, and uses feature importance values to determine the weight of each term as a decision point. Finally, it aims to identify and explain, from the classification rules, the terms that are distinctive to different authors and to different genres.
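Description:	Master's thesis, National Chengchi University, Department of Computer Science, 104753018
Source:	http://thesis.lib.nccu.edu.tw/record/#G0104753018
Type:	thesis
Appears in Collections:	[Department of Computer Science] Theses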
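
The part-of-speech pair feature named in the abstract can be illustrated with a minimal sketch. It assumes that a "pair" means two adjacent POS tags in an already segmented and tagged text; the record does not give the thesis's exact definition, so this reading is an assumption. The function name `pos_pair_features`, the tag labels, and the sample sentence are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def pos_pair_features(tagged: List[TaggedToken]) -> Dict[str, float]:
    """Return the relative frequency of each adjacent POS-tag pair in one text.

    Assumption: a "POS pair" is two consecutive tags, written "A+B".
    """
    tags = [tag for _, tag in tagged]
    # Count every pair of neighbouring tags.
    pairs = Counter(f"{a}+{b}" for a, b in zip(tags, tags[1:]))
    total = sum(pairs.values())
    if total == 0:
        # Fewer than two tokens: no pairs can be formed.
        return {}
    # Normalise counts so texts of different lengths are comparable.
    return {pair: count / total for pair, count in pairs.items()}


# Hypothetical pre-tagged sentence; the tag names are illustrative only.
sample = [("我", "PRON"), ("不", "ADV"), ("能", "AUX"), ("同意", "VERB"), ("。", "PUNCT")]
features = pos_pair_features(sample)
```

Each text then becomes one row of pair frequencies, which is the kind of input a random forest classifier could be trained on; the forest itself, its OOB error rate, and its feature importance values are outside this sketch.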

Files in This Item:

File	Size	Format	Views
301801.pdf	3035Kb	Adobe PDF2	482

All items in NCCUR are protected by their original copyright.

Copyright Announcement

1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial uses such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.
