|
Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/136322
|
Title: | 名目型與次序型資料之分類模型比較及其在網路文本評論之應用 A comparison of nominal and ordinal classification models with application to online reviews |
Authors: | 柳瑞俞 Liou, Ruei-Yu |
Contributors: | 翁久幸 Weng, Chiu-Hsing 柳瑞俞 Liou, Ruei-Yu |
Keywords: | Ordered Logit Model; Multinomial Logit Model; Word2Vec; TF-IDF; FastText |
Date: | 2021 |
Issue Date: | 2021-08-04 14:42:47 (UTC+8) |
Abstract: | As information technology has advanced, machine learning techniques have come into widespread use. Even so, ordinal data are still most often handled with nominal classification models rather than with ordinal classification models, which account for the ordering of the categories. McCullagh (1980) proposed an extension of the logistic model to ordered target variables, known as the ordered logit model. This study uses three ordered logit models as ordinal classifiers and, as nominal classifiers, Naïve Bayes and the multinomial logit model, to predict 13 datasets whose target variables are ordinal. Accuracy, Macro-F1, and mean squared error (MSE) serve as evaluation metrics. The ordinal classifiers perform better on only six of the 13 datasets, and in those six datasets more of the variables satisfy the proportional odds assumption of the ordered logit model. A simulation study then confirms that, for data satisfying this assumption, ordinal classification models predict better than nominal ones.
Finally, we extend the ordinal-data problem to text classification. Movie reviews and Google reviews pair a written comment with a rating, usually from 1 to 5. We use Word2Vec, TF-IDF, and FastText to convert the text into vectors that the models can take as input. Ordinal classification models perform better on Chinese reviews, whereas nominal classification models perform better on English reviews. The choice of text representation also affects prediction: Word2Vec, which takes more of the surrounding words into account, performs best, and TF-IDF performs worst. Word2Vec takes longer to train, however, so when time is limited, publicly available FastText word vectors pretrained on Wikipedia also give reasonable results. |
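The abstract's contrast between nominal metrics (Accuracy, Macro-F1) and the order-aware MSE, and its reference to the proportional odds assumption, can be illustrated with a small sketch. This is not the thesis's code: the function names, the toy labels, and the sign convention logit P(Y <= j | x) = theta_j - beta·x are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math

def ordered_logit_probs(x, beta, thresholds):
    """Class probabilities under a cumulative-logit (proportional odds) model.

    Assumed convention: P(Y <= j | x) = sigmoid(theta_j - beta . x).
    A single beta is shared by every cut point theta_j; that shared
    slope is what the proportional odds assumption requires.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    cum = [1.0 / (1.0 + math.exp(-(t - eta))) for t in thresholds] + [1.0]
    # Differences of consecutive cumulative probabilities give P(Y = j).
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (0 when a class has no hits)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mse(y_true, y_pred):
    """Mean squared error on the rating scale; penalises distant mistakes more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical 1-to-5 ratings: both predictions miss only the last review,
# one by a single level and one by four levels.
y_true = [1, 2, 3, 4, 5]
pred_near = [1, 2, 3, 4, 4]
pred_far = [1, 2, 3, 4, 1]

acc_near, acc_far = accuracy(y_true, pred_near), accuracy(y_true, pred_far)
f1_near, f1_far = macro_f1(y_true, pred_near), macro_f1(y_true, pred_far)
mse_near, mse_far = mse(y_true, pred_near), mse(y_true, pred_far)

# Five-class probabilities from four hypothetical cut points.
probs = ordered_logit_probs(x=[0.5], beta=[1.0], thresholds=[-1.0, 0.0, 1.0, 2.0])

print(acc_near, acc_far)
print(f1_near, f1_far)
print(mse_near, mse_far)
print(probs)
```

In this toy case the two predictions have the same accuracy and the same Macro-F1, while their MSE differs (0.2 versus 3.2), which is why an order-aware metric is reported alongside the nominal ones.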
Reference: | Agresti, A. (2003). Categorical Data Analysis, 3rd Edition. John Wiley & Sons, Inc.
Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation.
Liu, B. (2020). Text sentiment analysis based on CBOW model and deep learning in big data environment. Journal of Ambient Intelligence and Humanized Computing, 11(2), 451-458.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2), 109-127.
Cardoso, J., & da Costa, J. P. (2007). Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8, 1393-1429.
Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, 145-152.
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, 145-156.
Jain, A. P., & Dandannavar, P. (2016). Application of machine learning techniques to sentiment analysis. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), 628-632.
Koren, Y., & Sill, J. (2011). OrdRec: An ordinal model for predicting personalized item rating distributions. In Proceedings of the Fifth ACM Conference on Recommender Systems, 117-124.
Opitz, J., & Burst, S. (2019). Macro F1 and Macro F1. arXiv preprint arXiv:1911.03347.
Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, 1.
Saad, S. E., & Yang, J. (2019). Twitter sentiment analysis based on ordinal regression. IEEE Access, 7, 163677-163685.
Jing, L. P., Huang, H. K., & Shi, H. B. (2002). Improved feature selection approach TFIDF in text mining. In Proceedings. International Conference on Machine Learning and Cybernetics, 2, 944-946.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Vargas, V. M., Gutiérrez, P. A., & Hervás-Martínez, C. (2020). Cumulative link models for deep ordinal classification. Neurocomputing, 401, 48-58.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
Liu, C., Li, Y., Li, P., & Fei, H. (2019). Deep skip-gram networks for text classification. In Proceedings of the 2019 SIAM International Conference on Data Mining, 145-153.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546. |
Description: | Master's thesis, National Chengchi University, Department of Statistics, 108354021 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0108354021 |
Data Type: | thesis |
DOI: | 10.6814/NCCU202100932 |
Appears in Collections: | [Department of Statistics] Theses
|
Files in This Item:
File | Description | Size | Format |
402101.pdf | | 17929Kb | Adobe PDF |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|