National Chengchi University Institutional Repository (NCCUR): Item 140.119/131479


Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/131479

Title:	A Study of Data Reduction on Text Mining (維度縮減於文本風格之應用研究)
Author:	Lin, Chih-Hsuan (林志軒)
Contributors:	Yue, Ching-Syang (余清祥); Cheng, Wen-Huei (鄭文惠); Lin, Chih-Hsuan (林志軒)
Keywords:	Text Mining; Writing Style; Data Reduction; Chi-Square Test; Cross-Validation
Date:	2020
Upload time:	2020-09-02 11:43:26 (UTC+8)
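Abstract:	(Translated from the Chinese abstract) Writing style is a common topic in text analysis. Whether in personal writing, academic journals, or newspapers and magazines, most texts have a distinctive style of their own, and the differences often show in word choice and layout. Quantitative analyses of writing style usually rely on classification models to determine which author a text comes from. Because such models tend to take in too many variables, computation time becomes excessive. Some studies suggest data reduction methods such as principal component analysis, but these mostly cannot give a concrete interpretation of the differences between texts. This study takes the classification of writing style as its goal: it screens relevant variables with the chi-square test and related methods and compares the result with linear and nonlinear data reduction methods, hoping to balance classification accuracy with substantive interpretation. All texts used are written in vernacular Chinese and come from newspapers and periodicals in Taiwan and China: headline news from Apple Daily, Liberty Times, and China Times (2012-2019), front-page news from People's Daily (1971-1975 and 1989-1993), and Volumes 7 and 11 of New Youth (1919 and 1926). Each text is first segmented with jieba. Variables are then selected with a multiples (ratio) index, the chi-square test, and other methods, compared with the variables chosen by linear and nonlinear dimension reduction, and plugged into statistical learning and machine learning models whose classification accuracy is compared via cross-validation. The analysis finds that the proposed chi-square screening method is more stable and gives higher classification accuracy, and that ensemble methods such as XGBoost perform better. Judging text style from the selected words, Apple Daily, Liberty Times, and China Times lean toward social issues, party politics, and cross-strait relations, respectively; People's Daily leans toward revolution in the 1970s and economic reform in the 1990s; and Volumes 7 and 11 of New Youth lean toward ideological reform and capitalism, respectively.
(English abstract) Writing style is a popular research topic in text mining, and experts can often identify the author of an article by checking the use of certain words. In addition to choosing proper words, statistical and machine learning models are also important in the study of writing style. In practice, many variables (e.g., words or phrases) are usually plugged into the models, which costs a lot of computation time, so data reduction methods are recommended to speed up the analysis. However, it is difficult to give a reasonable interpretation of the variables after data reduction. In this study, we propose two methods for selecting variables that take into account both the accuracy and the interpretability of classification models. The texts used in this study are all modern Chinese writing, including the headlines of Apple Daily, Liberty Times, and China Times (2012-2019), articles from People's Daily (1971-1975 and 1989-1993), and Volumes 7 and 11 of New Youth Magazine (1919 and 1926). We first apply jieba to all articles for word segmentation, then perform variable selection (the proposed methods and linear/nonlinear dimension reduction methods), and finally plug the chosen variables into statistical and machine learning models. The model comparison is based on F1 measures obtained via cross-validation. We found that the proposed variable selection methods and the ensemble methods generally perform best in classification. As for the interpretation of the selected variables, Apple Daily, Liberty Times, and China Times focused on social affairs, politics, and cross-strait relations, respectively. People's Daily emphasized revolution in the 1970s and economic reform in the 1990s. New Youth Magazine focused on ideological reform in Volume 7 and capitalism in Volume 11.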
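The chi-square screening step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes documents are already segmented into tokens (the thesis uses jieba for that step, which is not called here), scores each term by a Pearson chi-square test of term presence against source label, and keeps the top-scoring terms. The function names `chi_square_scores` and `select_terms` and the toy data are hypothetical.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Pearson chi-square of term presence vs. class label, per term.

    docs   : list of token lists (assumed already segmented)
    labels : list of class labels, one per document
    """
    n = len(docs)
    class_size = Counter(labels)
    # present[c][term] = number of class-c documents containing term
    present = {c: Counter() for c in class_size}
    for tokens, y in zip(docs, labels):
        present[y].update(set(tokens))
    vocab = set().union(*present.values())

    scores = {}
    for term in vocab:
        df = sum(present[c][term] for c in class_size)  # docs containing term
        stat = 0.0
        for c, size in class_size.items():
            cells = (
                (present[c][term], df),             # term present
                (size - present[c][term], n - df),  # term absent
            )
            for observed, column_total in cells:
                expected = size * column_total / n
                if expected > 0:
                    stat += (observed - expected) ** 2 / expected
        scores[term] = stat
    return scores


def select_terms(docs, labels, k):
    """Return the k terms with the largest chi-square statistic."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy data (hypothetical): two sources, two documents each.
docs = [
    ["crime", "police", "case"],
    ["crime", "police", "today"],
    ["party", "election", "today"],
    ["party", "election", "legislator"],
]
labels = ["paper_a", "paper_a", "paper_b", "paper_b"]

scores = chi_square_scores(docs, labels)
selected = select_terms(docs, labels, 4)
print(selected)
```

Because the retained variables are the words themselves, they stay directly readable, which is the interpretability advantage over principal components that the abstract points to. The selected terms would then feed a classifier evaluated by cross-validation; that part is omitted here.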
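Description:	Master's; National Chengchi University; Department of Statistics; 107354025
Source:	http://thesis.lib.nccu.edu.tw/record/#G0107354025
Data type:	thesis
DOI:	10.6814/NCCU202001336
Appears in collections:	[Department of Statistics] Theses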

Files in this item:

File	Description	Size	Format
402501.pdf		5986Kb	Adobe PDF

All items in NCCUR are protected by the original copyright.


Copyright Announcement

1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial uses such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.
