    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/141011


    Title: 利用決策樹插補遺失值之模擬研究
    Missing Data Imputation with Classification and Regression Trees: A Simulation Study
    Authors: 陳政揚
    Chen, Jheng-Yang
    Contributors: 張育瑋
    Chang, Yu-Wei
    陳政揚
    Chen, Jheng-Yang
    Keywords: CART
    Decision trees
    Iterative imputation
    Missing data imputation
    Date: 2022
    Issue Date: 2022-08-01 17:16:33 (UTC+8)
    Abstract: Handling missing values is a common preprocessing issue before data analysis. One popular approach is to impute the missing values so that a complete data set is available for subsequent analysis. This study continues the line of research on imputing missing data with decision trees: several existing methods from the literature, together with some small variants, are compared in terms of imputation performance. In addition to the classical CART algorithm, chi-square tests are also used to select split variables. Unlike the DMI method proposed by Rahman and Islam (2013), the training data are not restricted to observations with no missing values at all; any observation whose value of the variable to be imputed is observed can serve as training data, making more effective use of all observations. When a decision tree is built on data containing missing values, a split variable may itself be missing for some observations. Besides the approaches in the literature that send such observations down the tree using the mean or mode, this study considers two resampling schemes for passing observations whose split variable is missing. In addition, iterative imputation methods from the literature are adapted to decision trees: for a given data set, the missing values of each variable are imputed iteratively until convergence. This avoids the pass-through problem of tree-based imputation and exploits the relationships among variables more efficiently. Simulation studies are used to compare the strengths and weaknesses of the above methods, which are also applied to two real data sets: the Hepatitis Data Set and the Credit Approval Data Set.
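    The iterative tree-based imputation described in the abstract can be illustrated with a minimal sketch. This is not the thesis's exact algorithm (which includes custom split-variable handling and resampling schemes); it assumes scikit-learn, whose `IterativeImputer` regresses each variable with missing values on the others using a CART-style tree and refines the imputations round by round until they stabilize:

    ```python
    # Minimal sketch of iterative tree-based imputation (illustrative only;
    # not the thesis's exact algorithm). Assumes scikit-learn and NumPy.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    # Make the last column depend on the first two, so trees have signal to use.
    X[:, 3] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

    # Introduce roughly 10% missing values completely at random (MCAR).
    mask = rng.random(X.shape) < 0.10
    X_miss = X.copy()
    X_miss[mask] = np.nan

    # Each variable with missing values is modeled by a CART-style regression
    # tree on the other variables; imputed values are updated iteratively
    # until convergence (or max_iter rounds).
    imputer = IterativeImputer(
        estimator=DecisionTreeRegressor(max_depth=5, random_state=0),
        max_iter=10,
        random_state=0,
    )
    X_imp = imputer.fit_transform(X_miss)

    # Since the true values are known here, we can check imputation accuracy.
    rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
    print(f"imputation RMSE on masked entries: {rmse:.3f}")
    ```

    In a simulation study like the one above, the true values of the masked entries are known, so imputation quality can be measured directly; on real data sets such as Hepatitis and Credit Approval, accuracy is instead assessed indirectly, e.g. through downstream analysis.
    
    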
    Reference: Batista, G. E. A. P. A., and Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17, 519–533.
    Beaulac, C., and Rosenthal, J. S. (2020). BEST: a decision tree algorithm that handles missing values. Computational Statistics, 35, 1001–1026.
    Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
    Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.
    Fazakis, N., Kostopoulos, G., Kotsiantis, S., and Mporas, I. (2020). Iterative robust semi-supervised missing data imputation. IEEE Access, 8, 90555–90569.
    James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.
    Kim, H., and Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96, 589–604.
    Little, R. J. A., and Rubin, D. B. (2020). Statistical Analysis with Missing Data (3rd ed.). Hoboken, NJ: Wiley.
    Loh, W.-Y., and Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815–840.
    Luengo, J., García, S., and Herrera, F. (2012). On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowledge and Information Systems, 32, 77–108.
    Merz, C., and Murphy, P. (1996). UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine. (http://www.ics.uci.edu/mlearn/MLRepository.html).
    Nikfalazar, S., Yeh, C. H., Bedingfield, S., and Khorshidi, H. A. (2020). Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowledge and Information Systems, 62, 2419–2437.
    Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
    Rahman, M. G., and Islam, M. Z. (2013). Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques. Knowledge-Based Systems, 53, 51–65.
    Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
    Stekhoven, D. J., and Bühlmann, P. (2011). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 112–118.
    van Buuren, S., and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.
    Zhang, Z. (2016). Missing data imputation: focusing on single imputation. Annals of Translational Medicine, 4, 9.
    Description: Master's thesis
    National Chengchi University
    Department of Statistics
    109354019
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0109354019
    Data Type: thesis
    DOI: 10.6814/NCCU202200953
    Appears in Collections: [Department of Statistics] Theses

    Files in This Item:

    File: 401901.pdf (7,308 KB, Adobe PDF)

