National Chengchi University Institutional Repository (NCCUR): Item 140.119/141031
    Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/141031


    Title: A Predictive Regression System with Multi-Level Input Features and Unbalanced Sample Structures (多層級特徵與不平衡樣本下的預測性迴歸系統)
    Author: Fu, Jun-Yi (傅俊益)
    Contributors: Chuang, Hao-Chun (莊皓鈞)
    Chou, Yen-Chun (周彥君)
    Fu, Jun-Yi (傅俊益)
    Keywords: CWPCA
    Imbalanced dataset
    Regularization method
    Date: 2022
    Upload time: 2022-08-01 17:21:12 (UTC+8)
    Abstract: Multi-dimensional, unbalanced data sets are common in data analysis today; forecasting the sales of new products in the retail industry is one example. Fitting a single regression model to all items and fitting a separate regression model to each item both have drawbacks on such data. Cohen, Jiao, and Zhang (2020) proposed the DAC (Data Aggregation with Clustering) model as a middle ground between the two: it jointly estimates the coefficients of some types of features across different items, which reduces the variance of the coefficient estimates and thereby improves model performance.
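    The variance argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the DAC procedure itself: the data, the sample sizes, the single shared slope, and the function names `simulate_items`, `separate_slopes`, and `joint_slope` are all hypothetical, and the sketch assumes it is already known which coefficient is shared (DAC decides that from the data).

```python
import numpy as np

def simulate_items(rng, sample_sizes, shared_beta=2.0):
    """Simulate items that share one slope but have their own intercepts."""
    items = []
    for n in sample_sizes:
        x = rng.normal(size=n)
        intercept = rng.normal()
        y = intercept + shared_beta * x + rng.normal(size=n)
        items.append((x, y))
    return items

def separate_slopes(items):
    """Fit one OLS model per item and return each item's slope estimate."""
    slopes = []
    for x, y in items:
        design = np.column_stack([np.ones_like(x), x])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        slopes.append(coef[1])
    return np.array(slopes)

def joint_slope(items):
    """Fit item-specific intercepts and a single slope shared by all items."""
    n_items = len(items)
    blocks, targets = [], []
    for j, (x, y) in enumerate(items):
        dummies = np.zeros((len(x), n_items))
        dummies[:, j] = 1.0  # intercept column for item j only
        blocks.append(np.column_stack([dummies, x]))
        targets.append(y)
    coef, *_ = np.linalg.lstsq(np.vstack(blocks), np.concatenate(targets), rcond=None)
    return coef[-1]

rng = np.random.default_rng(0)
sample_sizes = [200, 50, 8, 5, 4]  # deliberately unbalanced
true_beta = 2.0
sep_errors, joint_errors = [], []
for _ in range(200):
    items = simulate_items(rng, sample_sizes, true_beta)
    sep_errors.append(separate_slopes(items) - true_beta)
    joint_errors.append(joint_slope(items) - true_beta)

# Root-mean-square error of the slope estimate over the 200 replications.
sep_rmse = np.sqrt(np.mean(np.square(sep_errors), axis=0))
joint_rmse = np.sqrt(np.mean(np.square(joint_errors)))
print("separate models, slope RMSE per item:", np.round(sep_rmse, 3))
print("joint estimate, slope RMSE:", round(float(joint_rmse), 3))
```

    In this setup the items with only a handful of observations are the ones that gain most from the joint estimate, because they borrow information from the larger items.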
    However, the performance of the DAC model depends heavily on its hyperparameter settings, and the quality of its hypothesis tests on the coefficients depends on the sample size. This thesis therefore extends the concept of multi-level features but, in contrast to the bottom-up design of the DAC model, takes a top-down design. It uses a regression model and a regularization method to build a CWPCA (Centralized With Penalized Coefficient Adjustment) model, and it compares the CWPCA and DAC models on data sets generated by statistical simulation under a variety of scenarios. The results show that the CWPCA model does not require hypothesis testing or k-means clustering, steps that may introduce model bias, and that on most data sets its performance is close to that of the best-performing DAC model and better than that of the poorer-performing DAC model. We hope that the model can be applied to real-world data sets in the future and bring greater benefit to actual business operations.
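    The abstract does not give the CWPCA formulation, so the following is only one possible reading of the name "Centralized With Penalized Coefficient Adjustment", sketched under explicit assumptions: a pooled OLS fit plays the role of the centralized model, and each item then receives a ridge-penalized deviation from it. The ridge penalty, the two-step fitting, the simulated data, and the names `fit_centralized` and `fit_adjustment` are all hypothetical choices made for illustration; the thesis may use a different penalty or a joint optimization.

```python
import numpy as np

def fit_centralized(X_list, y_list):
    """Pool all items and fit one OLS coefficient vector (the top-down starting point)."""
    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_adjustment(X, y, beta_central, lam):
    """Ridge-penalized item-level deviation from the centralized coefficients.

    Solves min_d ||y - X (beta_central + d)||^2 + lam * ||d||^2 in closed form.
    """
    residual = y - X @ beta_central
    gram = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ residual)

# Hypothetical simulated data: three items with very different sample sizes.
rng = np.random.default_rng(1)
base_beta = np.array([1.0, -2.0, 0.5])
sample_sizes = [300, 40, 6]
X_list, y_list = [], []
for n in sample_sizes:
    X = rng.normal(size=(n, 3))
    item_beta = base_beta.copy()
    item_beta[2] += rng.normal(scale=0.5)  # only the last coefficient varies by item
    X_list.append(X)
    y_list.append(X @ item_beta + rng.normal(size=n))

beta_central = fit_centralized(X_list, y_list)
for lam in (0.0, 10.0, 1000.0):
    deltas = [fit_adjustment(X, y, beta_central, lam) for X, y in zip(X_list, y_list)]
    norms = [float(np.linalg.norm(d)) for d in deltas]
    print(f"lambda={lam:g}: adjustment norm per item = {np.round(norms, 3)}")
```

    In this sketch a single penalty weight spans the two extremes named in the abstract: with `lam = 0` each item's coefficients equal its own separate OLS fit, and as `lam` grows the adjustments shrink toward zero, leaving the single pooled model. No hypothesis test or clustering step is involved.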
    References: Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813-852.
    Chen, Y., Taeb, A., & Bühlmann, P. (2020). A Look at Robustness and Stability of l1- versus l0-Regularization: Discussion of Papers by Bertsimas et al. and Hastie et al. Statistical Science, 35(4), 614-622.
    Cohen, M. C., Jiao, K., & Zhang, R. (2020). Data Aggregation and Demand Prediction. Available at SSRN 3411653.
    Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal L1‐norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(6), 797-829.
    Fu, A., Narasimhan, B., & Boyd, S. (2020). CVXR: An R Package for Disciplined Convex Optimization. Journal of Statistical Software, 94(14), 1-34.
    Hazimeh, H., & Mazumder, R. (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5), 1517-1537.
    Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.
    Li, Y., & Wu, H. (2012). A clustering method based on K-means algorithm. Physics Procedia, 25, 1104-1109.
    MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
    Melkumova, L., & Shatskikh, S. Y. (2017). Comparing Ridge and LASSO estimators for data analysis. Procedia Engineering, 201, 746-755.
    Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
    Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418-1429.
    Description: Master's thesis
    National Chengchi University
    Department of Management Information Systems
    109356009
    Source: http://thesis.lib.nccu.edu.tw/record/#G0109356009
    Type: thesis
    DOI: 10.6814/NCCU202200639
    Appears in collections: [Department of Management Information Systems] Theses

    Files in this item:

    File: 600901.pdf
    Size: 2611Kb
    Format: Adobe PDF
    Views: 20


    All items in NCCUR are protected by original copyright.



    Copyright Announcement
    1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial uses such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

    2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.