政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/81107
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/81107


    Title: 探索性資料分析方法在文本資料中的應用─以「新青年」雜誌為例
    A Study of Exploratory Data Analysis on Text Data ── A Case study based on New Youth Magazine
    Authors: 潘艷艷
    Pan, Yan Yan
    Contributors: 余清祥
    Yue, Jack
    潘艷艷
    Pan, Yan Yan
    Keywords: 非結構化數據
    文本分析
    探索性資料分析
    主成分分析
    羅吉斯迴歸
    Unstructured Data
    Text Analysis
    Exploratory Data Analysis
    Principal Component Analysis
    Logistic Regression
    Date: 2015
    Issue Date: 2016-02-03 11:16:27 (UTC+8)
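    Abstract: With economic prosperity and the rapid development of the Internet, enormous volumes of data are generated online and offline at every moment, and roughly 80% of them are unstructured data such as text and images. How to quantify such data and choose suitable analytical methods is the key to extracting valuable information effectively and putting it to use. For textual data, this thesis proposes an exploratory data analysis approach and, taking language change in New Youth (《新青年》) magazine as a case study, shows the process of selecting text features, quantifying them, and analyzing them.
    First, taking the volume as the unit of analysis, this thesis quantifies the textual structure of each volume of New Youth from several angles, including character usage, sentence usage, the use of classical and vernacular function words, and the sharing of common characters and words across volumes. Combining several kinds of charts and tables, it traces the course and characteristics of language change in the magazine. This covers both the mechanism of the shift from classical Chinese (文言文) to vernacular Chinese (白話文) and the evolution of the vernacular itself. Second, based on the preliminary volume-level findings, it looks for text-feature variables that can separate the two forms of language. Using volumes 1 and 7 of New Youth as training samples, it combines principal components with logistic regression to train a classifier for classical and vernacular articles, and then tests it on volume 4. The results confirm that the extracted text variables can effectively distinguish articles in the two forms. In addition, drawing on the preliminary findings and the experience of humanities scholars, the thesis explores the change in language form in the magazine's later period, namely from the vernacular of the May Fourth Movement era to a vernacular characterized by "Red Chinese" (the vernacular used in China after the Second World War). With volumes 7 and 11 as training samples, the results confirm a clear difference in language form between the two volumes. Taiwan's United Daily News (《聯合報》) and mainland China's People's Daily (《人民日報》) were then added for classification and prediction; the two newspapers show clearly different language tendencies, which merits further in-depth study.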
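The function-word quantification described in the abstract can be illustrated with a minimal sketch. The two particle lists, the sample sentences, and the helper names `is_cjk` and `particle_rates` are assumptions made for this example; the record does not list the thesis's actual feature set.

```python
# Hypothetical marker lists chosen for illustration only; the thesis's own
# function-word inventory is not given in this record.
CLASSICAL_PARTICLES = "之乎者也矣焉哉"
VERNACULAR_PARTICLES = "的了嗎呢吧著"


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def particle_rates(text):
    """Return (classical, vernacular) particle counts per 1,000 Chinese characters."""
    chars = [ch for ch in text if is_cjk(ch)]  # punctuation and spaces are skipped
    if not chars:
        return 0.0, 0.0
    classical = sum(ch in CLASSICAL_PARTICLES for ch in chars)
    vernacular = sum(ch in VERNACULAR_PARTICLES for ch in chars)
    scale = 1000.0 / len(chars)
    return classical * scale, vernacular * scale


# Two short sample sentences (assumed inputs, not taken from New Youth).
classical_sample = "學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？"
vernacular_sample = "我們今天要討論的問題，是大家都很關心的，你們準備好了嗎？"

for name, sample in [("classical", classical_sample), ("vernacular", vernacular_sample)]:
    c_rate, v_rate = particle_rates(sample)
    print(f"{name}: classical={c_rate:.1f}, vernacular={v_rate:.1f} per 1,000 characters")
```

Rates are normalized per 1,000 characters so that texts of different lengths, such as whole volumes, can be compared on the same scale.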
    Tremendous amounts of data are produced every day, driven by the rapid development of computer technology and the economy. Unstructured data, such as text, pictures, and videos, account for nearly 80 percent of all data created. Whether useful information can be extracted depends on choosing appropriate methods for quantifying and analyzing this kind of data. To that end, we propose a standard operating process of exploratory data analysis (EDA) and use a case study of language changes in New Youth magazine as a demonstration.
    First, we quantify the texts of New Youth magazine from different perspectives, including the use of words, sentences, and function words, and the share of common vocabulary. We aim to detect the evolution of the modern language itself as well as the change from traditional (classical) Chinese to modern (vernacular) Chinese. Then, according to the results of the exploratory data analysis, we treat the first and seventh volumes of New Youth magazine as training data to develop a classification model and apply the model to the fourth volume (i.e., the testing data). The results show that traditional Chinese and modern Chinese can be successfully classified. Next, we examine the change from the modern Chinese of the May 4th Movement to the language of texts advocating socialism. We treat the seventh and eleventh volumes of New Youth magazine as training data and again develop a classification model. We then apply this model to the United Daily News from Taiwan and the People's Daily from Mainland China. We find that these two newspapers are very different: the style of the United Daily News is closer to that of the seventh volume, while the style of the People's Daily is closer to that of the eleventh volume. This suggests that the People's Daily is likely to have been influenced by the Soviet Union.
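The keywords name principal component analysis and logistic regression as the classification tools. The sketch below shows one way the two can be chained, and it is not the thesis's implementation. The feature values are invented, only the first principal component is kept, and power iteration stands in for a library PCA routine so that the example needs only the standard library.

```python
import math

# Hypothetical feature rows, invented for illustration only:
# (classical-particle rate, vernacular-particle rate, mean sentence length).
# Label 0 = classical-style article, label 1 = vernacular-style article.
train_X = [
    [95.0, 5.0, 8.0], [88.0, 9.0, 9.5], [102.0, 3.0, 7.5], [91.0, 7.0, 8.5],
    [12.0, 70.0, 18.0], [8.0, 82.0, 20.0], [15.0, 65.0, 17.0], [10.0, 75.0, 19.0],
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_X = [[85.0, 12.0, 9.0], [14.0, 68.0, 16.5]]  # held-out rows, also invented


def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def first_component(rows, means, iters=200):
    """Leading eigenvector of the sample covariance matrix, by power iteration."""
    d = len(means)
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (len(rows) - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] + [0.0] * (d - 1)  # start from the first coordinate axis
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, means, v, scale):
    """Centered projection onto v, divided by scale."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) / scale for r in rows]


def fit_logistic(scores, labels, lr=0.5, epochs=500):
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(w * s + b)))
            grad_w += (p - y) * s
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


means = column_means(train_X)
pc1 = first_component(train_X, means)
raw = project(train_X, means, pc1, 1.0)
# Standard deviation of the training scores (their mean is zero after centering).
scale = math.sqrt(sum(s * s for s in raw) / (len(raw) - 1))
w, b = fit_logistic(project(train_X, means, pc1, scale), train_y)


def predict(rows):
    scores = project(rows, means, pc1, scale)
    return [1 if 1.0 / (1.0 + math.exp(-(w * s + b))) >= 0.5 else 0 for s in scores]


train_pred = predict(train_X)
test_pred = predict(test_X)
print("training predictions:", train_pred)
print("held-out predictions:", test_pred)
```

Projecting onto principal components before fitting reduces correlated text features to a few uncorrelated scores, which helps keep the logistic regression stable when many function-word rates move together.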
    Reference: I. Chinese-language sources
    1.丁守和、殷敘彝(1963),從五四啓蒙運動到馬克思主義的傳播,生活·讀書·新知三聯書店。
    2.王治敏(2010),基於時間跨度的漢語教學常用詞表統計研究,華文教學與研究,4,49-55。
    3.何立行、余清祥、鄭文惠(2014),從文言到白話:《新青年》雜誌語言變化統計研究,東亞觀念史集刊,7,427-454。
    4.朱華宇、孫正興、張福炎(2001),一個基於向量空間模型的中文文本自動分類系統,計算機工程,vol. 27(2),70-73。
    5.余清祥(1998),統計在紅樓夢的應用,政大學報,76,303-327。
    6.李新麗(2007),《新青年》研究綜述,新聞大學,vol. 4,18-22。
    7.李榮陸、王建會、陳曉雲、陶曉鵬、胡運發(2005),使用最大熵模型進行中文文本分類,計算機研究與發展,vol. 42(1),94-101。
    8.李美霞(2002),語言變遷研究綜述,北京師範大學學報,vol. 4,128-133。
    9.辛剛(1991),語言變異和語言系統,現代外語。
    10.莊森(2006),飛揚跋扈為誰雄——作為文學社團的新青年社研究,東方出版中心。
    11.張寶明、王中江(1998),回眸《新青年》,河南文藝出版社。
    12.陳平原(2002),思想史視野中的文學—《新青年》 研究(上),中國現代文學研究叢刊,vol. 3,1-31。
    13.陳斯華(2003),《新青年》雜誌登載文學作品數量分析表,東岳論叢,vol. 24(3),39-41。
    14.郭曙綸、馬玄思、李開拓(2014),基於《中國語言生活狀況報告》的字與詞的對比研究,北華大學學報,vol.15(3),10-13。
    15.趙岡、陳鍾毅(1980),紅樓夢研究新編,聯經出版社。
    16.鄭秋生、翟琳琳(2013),基於改進Rocchio算法的短文本自動分類研究,中原工學院學報,vol. 24(1),70-73。
    17.謝佳斌、金勇進(2009),探索性數據分析中的統計圖形應用,統計與信息論壇, vol. 24(7),13-17。

    II. English-language sources
    1. Agresti, A. (1990), Categorical Data Analysis, New York: Wiley.
    2. Karlgren, B. (1952), "New Excursions in Chinese Grammar", Bulletin of the Museum of Far Eastern Antiquities (Stockholm), 24: 51-80.
    3. Mosteller, F. and Wallace, D. (1964), Inference and Disputed Authorship: The Federalist, Addison-Wesley.
    4. Johnson, R.A. and Wichern, D.W. (2007), Applied Multivariate Statistical Analysis, 6th edition, Pearson.
    5. Shannon, C.E. (1948), "A mathematical theory of communication", The Bell System Technical Journal, 27, 379–423 and 623–656.
    6. Simpson, E.H. (1949), "Measurement of diversity", Nature, 163: 688.
    7. Thisted, R. and Efron, B. (1987), "Did Shakespeare Write a Newly-Discovered Poem?", Biometrika, 74(3): 445-455.
    8. Tukey, J.W. (1977), Exploratory Data Analysis, Addison-Wesley.
    9. Das, T.K. and Mohan Kumar, P. (2013), "Big Data Analytics: A Framework for Unstructured Data Analysis", International Journal of Engineering and Technology (IJET), Vol. 5(1).
    Description: Master's thesis
    National Chengchi University
    Department of Statistics
    102354031
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0102354031
    Data Type: thesis
    Appears in Collections:[Department of Statistics] Theses

    Files in This Item:

    File          Size      Format
    403101.pdf    1666Kb    Adobe PDF


    All items in 政大典藏 are protected by copyright, with all rights reserved.

