Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/140754
|
Title: | 基於文字探勘技術及模型組合比較結果之旅館推薦應用 Hotel recommendation application based on text mining technology and model combination comparison results |
Authors: | 陳麒仲 Chen, Chi-Chung |
Contributors: | 周珮婷 陳麒仲 Chen, Chi-Chung |
Keywords: | Travel reviews; Conditional entropy; Cosine similarity; TF-IDF; Word2Vec; SVM |
Date: | 2022 |
Issue Date: | 2022-07-01 16:58:15 (UTC+8) |
Abstract: | In an era of widespread Internet access, booking hotels through online booking websites has become commonplace, and a hotel's reviews on these sites directly affect travelers' booking choices. Every hotel operator aims to raise its rating and reduce the negative reviews left by guests, with reducing negative reviews the higher priority. Drawing up improvement plans that target the problems raised in negative reviews is therefore an effective way to address the root causes and improve a hotel's reputation. Travelers, for their part, want to stay in a hotel that satisfies them so that their trip is not spoiled, but booking requires checking the information for each hotel one by one, so a system that recommends suitable hotels saves both time and effort.
This study uses a web crawler to collect hotel reviews from the booking website Booking.com for one popular tourist country in southern Europe and one in northern Europe. TF-IDF text mining, combined with the information-theoretic measure of conditional entropy, is used to find the negative keywords for hotels in a specific country, which helps local hoteliers plan how to reduce negative reviews, and to define the label for travelers who write genuinely negative reviews. Word vector models and popular machine learning classification algorithms are then used to predict these travelers. Because the emphasis is on catching genuinely negative reviewers, Recall, F1 Score, and AUC Score are used as the model evaluation metrics. The results show that combining a word vector model trained with Word2Vec with an SVM classifier, which handles imbalanced data well, gives the best performance; in particular, the Skip-gram model, which predicts the surrounding words from the center word, outperforms CBOW. Finally, for each traveler predicted to be a genuinely negative reviewer, the cosine similarity between the traveler's past negative reviews and the negative keywords of each popular hotel is calculated, and the hotels with the lowest similarity scores are recommended. |
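The final recommendation step described in the abstract (scoring each hotel by the cosine similarity between a traveler's negative reviews and that hotel's negative keywords, then recommending the lowest-scoring hotels) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the smoothed IDF formula, the whitespace-tokenized toy data, and the hotel names are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, chosen only
    to keep weights positive; the thesis may weight terms differently.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_hotels(traveler_tokens, hotel_keywords):
    """Sort hotels by ascending similarity to the traveler's complaints."""
    names = list(hotel_keywords)
    vectors = tf_idf_vectors(
        [traveler_tokens] + [hotel_keywords[name] for name in names]
    )
    traveler_vec, hotel_vecs = vectors[0], vectors[1:]
    scores = [
        (name, cosine_similarity(traveler_vec, vec))
        for name, vec in zip(names, hotel_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical data: tokens from one traveler's negative reviews and
# the negative keywords extracted for two hotels.
traveler = ["noisy", "room", "dirty", "bathroom"]
hotels = {
    "hotel_a": ["noisy", "street", "dirty"],
    "hotel_b": ["slow", "wifi", "breakfast"],
}
# hotel_b shares no terms with the traveler's complaints, so it ranks first.
for name, score in rank_hotels(traveler, hotels):
    print(name, round(score, 3))
```

A low score means a hotel's known weaknesses overlap little with what this traveler has complained about before, which is why the ranking is ascending.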
Description: | Master's; National Chengchi University; Department of Statistics; 109354022 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0109354022 |
Data Type: | thesis |
DOI: | 10.6814/NCCU202200539 |
Appears in Collections: | [Department of Statistics] Theses
|
Files in This Item:
File | Description | Size | Format |
402201.pdf | | 3126Kb | Adobe PDF |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|