Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/153362
Title: | 臺灣碩博士論文之文字分析—以商業及管理學門摘要為例 Text Analysis of Master’s and Doctoral Theses in Taiwan: A Study on Abstracts in the Field of Business and Administration |
Authors: | 劉貞莉 Liu, Chen-Li |
Contributors: | 陳怡如 (Chen, Yi-Ju); 余清祥 (Yue, Ching-Syang); 劉貞莉 (Liu, Chen-Li) |
Keywords: | Text Analysis (文字分析); Word Segmentation (中文斷詞); Exploratory Data Analysis (探索性資料分析); Classification (文本分類); Association Analysis (關聯性分析) |
Date: | 2024 |
Issue Date: | 2024-09-04 14:55:57 (UTC+8) |
Abstract: | Since its invention, writing has been a crucial tool for passing on knowledge, stories, and emotions. Text analysis makes it possible to explore the cultural and technological developments of different eras, to trace social characteristics and changes, and to identify the key factors behind them. An abstract is the epitome of an article or book and usually conveys the key points of the full text; from the abstract of an academic paper, for example, readers should be able to learn the research objectives, conclusions, and main insights. This study examines the abstracts of master's and doctoral theses in the field of business and administration (BA) in Taiwan from academic years 107 to 109 (2018 to 2020). Besides summarizing word usage and other aspects of writing style, it applies tools such as cluster analysis to dissect the textual styles of the three sections of the abstracts and to compare the characteristics of theses across BA disciplines, with the aim of helping readers write and read BA theses.
Modern Chinese is written mainly in the vernacular, whose basic unit is typically a word made up of two or more characters, so word segmentation is the first step in analyzing Chinese text and yields terms that are closer to the intended meaning. We first evaluate two word segmentation packages, Jieba and CKIP, and compare them in terms of execution time, number of words, word proportion, word types, and segmentation accuracy, to provide a reference for users analyzing Chinese text.
The study of textual style relies mainly on exploratory data analysis. The abstracts are manually labeled and divided into three sections: Motivations & Purposes, Methods & Materials, and Conclusions & Suggestions. Based on the word segmentation results, we examine common terms, word diversity, and co-occurrence terms to explore the writing styles of the ten BA disciplines. The results show that CKIP captures the terms commonly used in master's and doctoral BA thesis abstracts in Taiwan and aligns better with the expectations of this study. The three sections of the abstracts differ clearly in features and format, and the ten BA disciplines fall into three clusters: healthcare management, accounting, and the remaining disciplines. Additionally, using the common terms and co-occurrence terms of each cluster and each section as explanatory variables, classification models can effectively distinguish between the three clusters of BA disciplines and the three sections of the abstracts. |
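The abstract refers to word diversity and co-occurrence terms computed on segmented text. The following is a minimal sketch of how such measures might be computed, not the thesis's actual procedure. It assumes the abstracts are already segmented into token lists (for example by Jieba or CKIP), and it assumes the type-token ratio, Shannon entropy, and Simpson index as diversity measures, since the abstract does not say which ones were used. The function names and the two sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def diversity(tokens):
    """Return type-token ratio, Shannon entropy (in nats), and Simpson index."""
    counts = Counter(tokens)
    total = sum(counts.values())
    proportions = [c / total for c in counts.values()]
    return {
        "ttr": len(counts) / total,
        "shannon": -sum(p * log(p) for p in proportions),
        "simpson": sum(p * p for p in proportions),
    }


def cooccurrence(documents):
    """Count unordered pairs of distinct terms that appear in the same document."""
    pairs = Counter()
    for tokens in documents:
        # Each pair is counted at most once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical, already segmented abstracts (not taken from the thesis corpus).
docs = [
    ["本", "研究", "探討", "企業", "績效"],
    ["本", "研究", "分析", "企業", "風險"],
]
stats = diversity([token for doc in docs for token in doc])
pairs = cooccurrence(docs)
```

Here `stats` summarizes how varied the vocabulary is across the pooled tokens, and `pairs` maps each term pair to the number of documents in which both terms occur. Counts like these could then serve as explanatory variables in a classification model, as the abstract describes.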
Description: | Master's thesis; National Chengchi University; Department of Statistics; 111354014 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0111354014 |
Data Type: | thesis |
Appears in Collections: | [Department of Statistics] Theses |
Files in This Item:
File | Description | Size | Format |
401401.pdf | | 15837Kb | Adobe PDF |
All items in 政大典藏 are protected by copyright, with all rights reserved.