Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/131479
Title: | 維度縮減於文本風格之應用研究 (A Study of Data Reduction on Text Mining) |
Authors: | 林志軒 (Lin, Chih-Hsuan) |
Contributors: | 余清祥 (Yue, Ching-Syang); 鄭文惠 (Cheng, Wen-Huei); 林志軒 (Lin, Chih-Hsuan) |
Keywords: | Text Mining (文字探勘); Writing Style (寫作風格); Data Reduction (資料縮減); Chi-Square Test (卡方檢定); Cross-Validation (交叉驗證) |
Date: | 2020 |
Issue Date: | 2020-09-02 11:43:26 (UTC+8) |
Abstract: | Writing style is a common topic in text analysis. Personal writing, academic journals, newspapers, and magazines mostly each have a distinctive style, and the differences often show in word choice and layout. Quantitative studies of writing style usually rely on classification models that decide which author an article comes from. In practice, many variables (e.g., words or phrases) are plugged into the models, which makes computation slow, so some studies recommend data reduction methods such as principal component analysis to speed up the analysis. After that kind of reduction, however, it is difficult to interpret the variables in terms of concrete differences between texts. In this study, we take the classification of writing style as the goal and propose two methods for selecting variables, using techniques such as the chi-square test. We compare them with linear and nonlinear data reduction methods, aiming to balance classification accuracy with substantive interpretation.
The texts used in this study all belong to modern vernacular Chinese and come from newspapers and periodicals in Taiwan and China: headline news of Apple Daily, Liberty Times, and China Times (2012-2019), front-page news of People’s Daily (1971-1975 and 1989-1993), and Volumes 7 and 11 of New Youth Magazine (1919 and 1926). We first apply jieba to all texts for word segmentation, then select variables with methods such as a multiple (ratio) index and the chi-square test and, for comparison, with linear and nonlinear dimension reduction methods, and finally plug the chosen variables into statistical and machine learning models. The models are compared on classification performance (F1 measures) via cross-validation.
We found that the proposed chi-square selection method is comparatively stable and gives higher classification accuracy, and that ensemble methods such as XGBoost generally perform best among the models. As for the interpretation of the selected words, Apple Daily, Liberty Times, and China Times lean toward social issues, party politics, and cross-strait relations, respectively. People’s Daily leans toward revolution in the 1970s and toward economic reform in the 1990s. New Youth Magazine leans toward ideological reform in Volume 7 and toward capitalism in Volume 11. |
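The abstract names chi-square screening of words but does not give the exact formulation. The following is only a minimal sketch of one common variant, assuming documents are already segmented into tokens (for example by jieba) and that each word is scored on a 2x2 table of document-level presence versus class membership. The function names `chi_square_scores` and `select_top_words` and the toy tokens are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each word by its largest chi-square statistic over all classes.

    docs   : list of token lists (assumed to be segmented already)
    labels : list of class labels, one per document
    """
    n = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()  # word -> number of documents containing it
    joint = Counter()     # (word, label) -> documents of that class containing it
    for tokens, label in zip(docs, labels):
        for word in set(tokens):  # presence/absence, not raw counts
            doc_freq[word] += 1
            joint[(word, label)] += 1

    scores = {}
    for word, df in doc_freq.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = joint[(word, label)]  # in class, contains word
            b = df - a                # other classes, contains word
            c = size - a              # in class, lacks word
            d = n - size - b          # other classes, lacks word
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:  # skip degenerate tables with an empty margin
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[word] = best
    return scores


def select_top_words(docs, labels, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Made-up toy data: two documents per class, English placeholder tokens.
docs = [
    ["election", "party", "vote"],
    ["party", "policy"],
    ["market", "reform", "trade"],
    ["reform", "economy"],
]
labels = ["A", "A", "B", "B"]
top_words = select_top_words(docs, labels, 2)
```

Because the selected variables remain actual words, they can be read directly as indicators of each source's vocabulary, which is the interpretability advantage the abstract contrasts with principal-component-style reduction.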
Description: | Master's thesis, National Chengchi University, Department of Statistics, 107354025 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0107354025 |
Data Type: | thesis |
DOI: | 10.6814/NCCU202001336 |
Appears in Collections: | [Department of Statistics] Theses |
Files in This Item:
File | Description | Size | Format |
402501.pdf | | 5986Kb | Adobe PDF |
All items in 政大典藏 are protected by copyright, with all rights reserved.