Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/67317
|
Title: | 預測模型的遺失值處理─選值順序的研究 Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition |
Authors: | 黃秋芸 Huang, Chiu Yun |
Contributors: | 唐揆 Tang, Kwei 黃秋芸 Huang, Chiu Yun |
Keywords: | Predictive Model; Missing Value; Active Feature-value Acquisition; Decision Tree |
Date: | 2013 |
Issue Date: | 2014-07-07 11:10:36 (UTC+8) |
Abstract: | Business intelligence is developing rapidly, and predictive models play a key role in many business intelligence tasks. However, when extracting hidden, previously unknown, and potentially useful information from large volumes of data, analysts often run into data-quality problems that make analysis difficult. Missing values are an especially common obstacle in the data pre-processing phase, so handling them effectively when building a predictive model is an important issue.
Many earlier studies address the handling of missing values. Among them, research on Active Feature-Value Acquisition (AFA) examines the order in which missing feature values in the training data should be acquired. The goal of AFA is to reduce the cost of reaching a desired model accuracy by choosing, from training data with missing values, the values whose acquisition is most informative. Following the AFA concept, this study proposes a new acquisition-ordering method, I Sampling, which selects feature values for acquisition based on the attribute at the top node of the current decision tree. We train and test the method on real data and compare it with methods from the earlier literature under different situations and missing-data patterns, to examine whether the choice and order of acquisitions affect a predictive model's classification accuracy and to identify each method's strengths, weaknesses, and applicability. Experimental results show that, in some situations, I Sampling induces accurate models with substantially fewer feature-value acquisitions than alternative policies.
The proposed method and its validation results can inform future predictive work in both management practice and academic research. Practitioners can choose an acquisition method suited to the nature of their data, reducing acquisition cost while improving the classification ability of the predictive model. |
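The selection idea described in the abstract (acquire missing values for the attribute at the top node of the current decision tree first) can be illustrated with a minimal sketch. This is not the thesis's implementation: the use of information gain as the split criterion, the one-level view of the tree, the `budget` parameter, the toy data, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Gain from splitting on `attr`, using only rows where it is known."""
    known = [(r[attr], y) for r, y in zip(rows, labels) if r[attr] is not None]
    if not known:
        return 0.0
    groups = defaultdict(list)
    for value, y in known:
        groups[value].append(y)
    base = entropy([y for _, y in known])
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return base - remainder


def root_attribute(rows, labels, attrs):
    """Attribute a decision tree would place at its top node (assumed criterion)."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))


def select_acquisitions(rows, labels, attrs, budget):
    """Return up to `budget` (row_index, attribute) pairs to acquire next."""
    top = root_attribute(rows, labels, attrs)
    missing = [(i, top) for i, r in enumerate(rows) if r[top] is None]
    return missing[:budget]


# Hypothetical toy training data; None marks a missing feature value.
rows = [
    {"income": "high", "region": "north"},
    {"income": "high", "region": "south"},
    {"income": "low", "region": "north"},
    {"income": "low", "region": None},
    {"income": None, "region": "south"},
    {"income": None, "region": "north"},
]
labels = ["yes", "yes", "no", "no", "yes", "no"]

print(select_acquisitions(rows, labels, ["income", "region"], budget=1))
```

In an iterative setting, the tree would be rebuilt after each batch of acquired values, so the top-node attribute (and therefore the acquisition order) may change from round to round.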
Description: | Master's thesis; National Chengchi University; Graduate Institute of Business Administration; 101355006; 102 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0101355006 |
Data Type: | thesis |
Appears in Collections: | [Department of Business Administration] Theses
|
Files in This Item:
File | Size | Format | |
500601.pdf | 1125Kb | Adobe PDF2 | 1077 |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|