Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/141031
Title: | A Predictive Regression System with Multi-Level Input Features and Unbalanced Sample Structures |
Authors: | Fu, Jun-Yi |
Contributors: | Chuang, Hao-Chun; Chou, Yen-Chun; Fu, Jun-Yi |
Keywords: | CWPCA; Imbalanced dataset; Regularization method |
Date: | 2022 |
Issue Date: | 2022-08-01 17:21:12 (UTC+8) |
Abstract: | Multi-dimensional, unbalanced data sets are common in data analysis, for example when forecasting sales of new products in retail. Predicting such data with a single regression model or with multiple separate regression models each has its own drawbacks. Cohen, Jiao, and Zhang (2020) therefore proposed the DAC (Data Aggregation with Clustering) model, which sits between the two approaches: it jointly estimates a subset of the feature coefficients across different items, reducing the variance of the coefficient estimates and thereby improving model performance. However, the performance of the DAC model depends heavily on its hyperparameter settings, and the quality of its coefficient tests is affected by sample size. This thesis extends the idea of multi-level feature variables but, in contrast to the bottom-up design of the DAC model, adopts a top-down design: it combines a regression model with a regularization method to build a CWPCA (Centralized With Penalized Coefficient Adjustment) model, and compares the CWPCA and DAC models on data sets generated by statistical simulation under a variety of scenarios. The results show that the CWPCA model does not require hypothesis testing or k-means clustering, steps that can introduce model bias, and that on most data sets its performance is comparable to the best-performing DAC model and better than the worst-performing one. We hope the model can be further applied to real-world data sets in the future and produce greater benefits for actual business. |
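Note on the modeling idea described in the abstract: the following is a minimal, illustrative sketch of a "centralized coefficients plus penalized per-item adjustments" regression, which is one way to read the abstract's description of CWPCA. It is not the thesis's actual implementation (the reference list suggests the work used the CVXR package in R); the two-stage structure, the simulate_item helper, the item names, and the alpha=0.1 penalty are all assumptions chosen for illustration only.

# Illustrative sketch only: CWPCA as described in the abstract is approximated here
# as a two-stage fit -- (1) pool all items to estimate centralized coefficients,
# (2) per item, fit a lasso on the pooled model's residuals so item-specific
# adjustments are shrunk toward zero. Names and penalty values are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)

def simulate_item(n, beta_common, beta_item, noise=1.0):
    """Simulate one item's (X, y) with shared plus item-specific coefficients."""
    X = rng.normal(size=(n, beta_common.size))
    y = X @ (beta_common + beta_item) + rng.normal(scale=noise, size=n)
    return X, y

# Unbalanced sample sizes across items (e.g., established vs. new products).
p = 5
beta_common = rng.normal(size=p)
items = {
    "item_A": simulate_item(500, beta_common, rng.normal(scale=0.3, size=p)),
    "item_B": simulate_item(40,  beta_common, rng.normal(scale=0.3, size=p)),
}

# Stage 1: centralized (pooled) coefficients estimated over all items.
X_all = np.vstack([X for X, _ in items.values()])
y_all = np.concatenate([y for _, y in items.values()])
central = LinearRegression().fit(X_all, y_all)

# Stage 2: penalized per-item adjustment, fitted on the pooled model's residuals.
# The lasso penalty shrinks adjustments toward zero, so items with few samples
# stay close to the centralized estimate instead of overfitting.
coefs = {}
for name, (X, y) in items.items():
    resid = y - central.predict(X)
    adj = Lasso(alpha=0.1).fit(X, resid)
    coefs[name] = central.coef_ + adj.coef_

for name, b in coefs.items():
    print(name, np.round(b, 2))

Under these assumptions, the role of the stage-two penalty is that items with few observations (the unbalanced case) inherit most of their coefficients from the pooled fit, while well-sampled items can deviate where the data support it; no hypothesis testing or k-means clustering step is involved.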
Reference: |
Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813-852.
Chen, Y., Taeb, A., & Bühlmann, P. (2020). A look at robustness and stability of l1- versus l0-regularization: Discussion of papers by Bertsimas et al. and Hastie et al. Statistical Science, 35(4), 614-622.
Cohen, M. C., Jiao, K., & Zhang, R. (2020). Data aggregation and demand prediction. Available at SSRN 3411653.
Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797-829.
Fu, A., Narasimhan, B., & Boyd, S. (2020). CVXR: An R package for disciplined convex optimization. Journal of Statistical Software, 94(14), 1-34.
Hazimeh, H., & Mazumder, R. (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5), 1517-1537.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.
Li, Y., & Wu, H. (2012). A clustering method based on K-means algorithm. Physics Procedia, 25, 1104-1109.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Melkumova, L., & Shatskikh, S. Y. (2017). Comparing Ridge and LASSO estimators for data analysis. Procedia Engineering, 201, 746-755.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418-1429.
Description: | Master's thesis, Department of Management Information Systems, National Chengchi University, 109356009
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0109356009 |
Data Type: | thesis |
DOI: | 10.6814/NCCU202200639 |
Appears in Collections: | [Department of Management Information Systems] Theses
Files in This Item:
File | Description | Size | Format
600901.pdf | | 2611Kb | Adobe PDF
All items in 政大典藏 are protected by copyright, with all rights reserved.