政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/59300
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/59300


    Title: 應用文字探勘技術於英文文章難易度分類
    The Classification of the Difficulty of English Articles with Text Mining
    Authors: 許珀豪
    Hsu, Po Hao
    Contributors: 楊建民
    許珀豪
    Hsu, Po Hao
    Keywords: text mining
    kNN
    the difficulty of English articles
    the characteristics of the linguistic difficulty of English articles
    the characteristics of the text
    Date: 2012
    Issue Date: 2013-09-02 16:01:55 (UTC+8)
    Abstract: With so much English text available online, learners need a way to choose articles whose difficulty matches their own reading ability. To improve the accuracy of difficulty classification, recent studies have drawn on many difficulty features. This study classifies articles using linguistic difficulty features, text features, and the two combined, compares each result with the articles' official levels, and tests whether combining linguistic and text features improves accuracy.
    The classification is based on articles from GEPT practice tests. The study has three parts: classification by linguistic difficulty features, classification by text features, and classification by the two combined. First, the linguistic difficulty features form the dimensions of a feature vector; once each feature value has been computed, kNN classifies the articles as elementary, intermediate, or high-intermediate, and this result serves as the baseline for comparing accuracy. Second, the GEPT articles are tokenized, selected feature terms form the vector dimensions, and TF-IDF weights serve as the feature values. Third, the two feature sets are combined for classification. The first two approaches reach F-measures of 0.61 and 0.47 respectively; the combined approach performs best, with an F-measure of 0.68.
    The study hopes its results will help learners find, among large numbers of English articles, ones suited to their level so that they can learn step by step. In the future, linguistic difficulty features and text features could be combined to classify English articles by both topic category and difficulty level, letting learners choose what to read and thereby improving learning outcomes.
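The three-part approach in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the two linguistic features (words per sentence, characters per word), the Euclidean distance, k = 3, the toy training articles, and all function names are hypothetical choices made for this example. The record does not say which features, distance measure, or k the author used.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a stand-in for real word segmentation)."""
    return re.findall(r"[a-z']+", text.lower())

def linguistic_features(text):
    """Two illustrative difficulty features: words per sentence and characters per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    return [len(words) / len(sentences), sum(len(w) for w in words) / len(words)]

def fit_idf(token_lists):
    """Inverse document frequency, log(N / df), for every term seen in training."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """TF-IDF values over the training vocabulary, in a fixed (sorted) term order."""
    counts = Counter(tokens)
    return [counts[t] / len(tokens) * idf[t] for t in sorted(idf)]

def combined_vector(text, idf):
    """Concatenate linguistic features and TF-IDF values (assumed combination rule)."""
    return linguistic_features(text) + tfidf_vector(tokenize(text), idf)

def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors closest to the query (Euclidean)."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def f_measure(true_labels, predicted_labels, label):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labelled articles, invented for this sketch (not GEPT data).
TRAIN = [
    ("I like cats. I like dogs.", "elementary"),
    ("The cat is big. The dog is small.", "elementary"),
    ("Economic globalization considerably influences international "
     "educational institutions.", "high-intermediate"),
    ("Environmental sustainability requires comprehensive governmental "
     "regulation nevertheless.", "high-intermediate"),
]

idf = fit_idf([tokenize(text) for text, _ in TRAIN])
train_vectors = [combined_vector(text, idf) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]

query = "My dog is big. My cat is small."
predicted = knn_predict(train_vectors, train_labels, combined_vector(query, idf))
print(predicted)
```

In a real setting the linguistic features and the TF-IDF values sit on very different scales, so some normalization before concatenation would likely be needed. The sketch skips that step to stay short.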
    Description: Master's thesis
    National Chengchi University
    Graduate Institute of Management Information Systems
    100356036
    101
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0100356036
    Data Type: thesis
    Appears in Collections:[Department of MIS] Theses

    Files in This Item:

    File          Size      Format
    603601.pdf    1387Kb    Adobe PDF    2493


    All items in 政大典藏 are protected by copyright, with all rights reserved.

