National Chengchi University Institutional Repository (NCCUR): Item 140.119/136322
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/136322


    Title: 名目型與次序型資料之分類模型比較及其在網路文本評論之應用
    A comparison of nominal and ordinal classification models with application to online reviews
    Authors: 柳瑞俞
    Liou, Ruei-Yu
    Contributors: 翁久幸
    Weng, Chiu-Hsing
    柳瑞俞
    Liou, Ruei-Yu
    Keywords: Ordered Logit Model
    Multinomial Logit Model
    Word2Vec
    TF-IDF
    FastText
    Date: 2021
    Issue Date: 2021-08-04 14:42:47 (UTC+8)
    Abstract: With the rapid development of information technology, machine learning techniques are increasingly widely used. Ordinal data, however, are still mostly handled with nominal classification models rather than with ordinal classification models that correctly account for the ordering of the categories. McCullagh (1980) proposed an extension of the logistic model to ordinal target variables, known as the ordered logit model. This study uses three ordered logit models as ordinal classifiers, and Naïve Bayes and the multinomial logit model as nominal classifiers, to predict 13 datasets whose target variables are ordinal. Accuracy, Macro-F1, and mean squared error (MSE) serve as the evaluation metrics. The ordinal classifiers perform better on only six of the 13 datasets. In these six datasets, more of the variables satisfy the proportional odds assumption of the ordered logit model. A simulation study then confirms that, when the data satisfy this assumption, ordinal classification models give better predictions than nominal classification models.
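The model and the three metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard cumulative-logit form of the proportional odds model, P(Y <= j | x) = sigmoid(theta_j - beta · x); the function names, coefficients, cutpoints, and labels below are hypothetical values chosen for illustration and are not taken from the thesis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(x, beta, cutpoints):
    """Class probabilities under a proportional-odds (cumulative logit) model.

    P(Y <= j | x) = sigmoid(cutpoints[j] - beta . x). The slope vector `beta`
    is shared by all categories; only the increasing cutpoints differ.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    cum = [sigmoid(c - eta) for c in cutpoints] + [1.0]
    # Differences of adjacent cumulative probabilities give per-class probabilities.
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (0.0 for a class with no hits)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Hypothetical coefficients: three cutpoints define four ordered categories.
probs = ordered_logit_probs(x=[1.0, 2.0], beta=[0.5, -0.25], cutpoints=[-1.0, 0.0, 1.5])

# Two hypothetical prediction sets that each make exactly one mistake on a 5-star item.
y_true = [1, 2, 3, 4, 5, 3]
pred_near = [1, 2, 3, 4, 4, 3]  # off by one category
pred_far = [1, 2, 3, 4, 1, 3]   # off by four categories
labels = [1, 2, 3, 4, 5]

for name, pred in [("near", pred_near), ("far", pred_far)]:
    print(name, accuracy(y_true, pred), macro_f1(y_true, pred, labels), mse(y_true, pred))
```

In this toy case both prediction sets have the same accuracy and the same Macro-F1, while MSE penalizes the distant mistake more heavily. This is one reason a distance-aware metric is useful alongside the other two when the labels are ordered.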
    Finally, we extend the ordinal-data problem to text classification. Movie reviews and Google reviews pair public comments with ratings, usually on a scale of 1 to 5. We use Word2Vec, TF-IDF, and FastText to convert the text into vectors that the models can take as input. The results show that ordinal classification models perform better on Chinese reviews, while nominal classification models perform better on English reviews. The choice of text representation also affects prediction: Word2Vec performs better the more surrounding words it considers, and TF-IDF performs worst. Word2Vec takes longer to train, however; if time is a concern, the pretrained Wiki word vectors trained with FastText and available online also give reasonably good results.
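As a rough illustration of turning review text into model inputs, the sketch below computes TF-IDF vectors by hand. It assumes the classic weighting tf × log(N / df) and whitespace tokenization; the abstract does not state the exact TF-IDF variant or tokenizer used, and the function name and sample reviews are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using weight = tf * log(N / df).

    `docs` is a list of token lists; tokenization is assumed to happen upstream.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[tok] / total) * math.log(n_docs / df[tok]) if total else 0.0
            for tok in vocab
        ])
    return vocab, vectors

# Hypothetical toy reviews, not data from the thesis.
reviews = [
    "great movie great cast".split(),
    "boring movie".split(),
    "great food".split(),
]
vocab, vectors = tfidf_vectors(reviews)
for row in vectors:
    print([round(w, 3) for w in row])
```

Each review becomes a fixed-length vector over the vocabulary, which a nominal or ordinal classifier can then take as features. Unlike Word2Vec or FastText vectors, these weights ignore the surrounding words entirely.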
    Reference: Agresti, A. (2003). Categorical Data Analysis (3rd ed.). John Wiley & Sons.
    Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation.
    Liu, B. (2020). Text sentiment analysis based on CBOW model and deep learning in big data environment. Journal of Ambient Intelligence and Humanized Computing, 11(2), 451-458.
    McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2), 109-127.
    Cardoso, J., & da Costa, J. P. (2007). Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8, 1393-1429.
    Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, 145-152.
    Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, 145-156.
    Jain, A. P., & Dandannavar, P. (2016). Application of machine learning techniques to sentiment analysis. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), 628-632.
    Koren, Y., & Sill, J. (2011). OrdRec: An ordinal model for predicting personalized item rating distributions. In Proceedings of the Fifth ACM Conference on Recommender Systems, 117-124.
    Opitz, J., & Burst, S. (2019). Macro F1 and Macro F1. arXiv preprint arXiv:1911.03347.
    Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, 1.
    Saad, S. E., & Yang, J. (2019). Twitter sentiment analysis based on ordinal regression. IEEE Access, 7, 163677-163685.
    Jing, L. P., Huang, H. K., & Shi, H. B. (2002). Improved feature selection approach TFIDF in text mining. In Proceedings. International Conference on Machine Learning and Cybernetics, 2, 944-946.
    Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
    Vargas, V. M., Gutiérrez, P. A., & Hervás-Martínez, C. (2020). Cumulative link models for deep ordinal classification. Neurocomputing, 401, 48-58.
    Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
    Liu, C., Li, Y., Li, P., & Fei, H. (2019). Deep skip-gram networks for text classification. In Proceedings of the 2019 SIAM International Conference on Data Mining, 145-153.
    Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.
    Description: Master's thesis
    National Chengchi University
    Department of Statistics
    108354021
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0108354021
    Data Type: thesis
    DOI: 10.6814/NCCU202100932
    Appears in Collections:[Department of Statistics] Theses

    Files in This Item:

    File          Description    Size       Format
    402101.pdf                   17929Kb    Adobe PDF


    All items in 政大典藏 are protected by copyright, with all rights reserved.



    Copyright Announcement
    1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial uses such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

    2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.