政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/140755
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/140755


    Title: 基於相關性的類別特徵選擇方法之評估
    An Evaluation of Correlation-Based Categorical Feature Selection Methods
    Authors: 張智鈞
    Chang, Chih-Chun
    Contributors: 周珮婷
    Chou, Pei-Ting
    張智鈞
    Chang, Chih-Chun
    Keywords: Feature selection (變數篩選)
    Dimension reduction (維度縮減)
    Variable association (變數相關性)
    Filter method (過濾法)
    Entropy
    Categorical datasets (類別型資料)
    Date: 2022
    Issue Date: 2022-07-01 16:58:28 (UTC+8)
    Abstract: With the rapid development of machine learning, the importance of feature selection is self-evident. Selecting variables appropriately can improve the predictive performance of statistical models, reduce computational cost, and help analysts better understand what the data convey. Feature selection methods fall into three main families: filter, wrapper, and embedded methods. This study applies filter-based feature selection using indices that measure the association between variables, such as the Pearson product-moment correlation coefficient, conditional entropy, cross entropy, relative entropy, Goodman and Kruskal's τ, and Cramér's V, and examines the predictive performance of each dataset under the different indices, comparing it against the performance on the original datasets. Ten datasets were used in the experiments, two simulated and eight real, most of them categorical.
    On the simulated data, this study found that when the variables are categorical, conditional entropy identifies the important variables better than the other indices. On the real data, some datasets retained good predictive performance after filter-based selection, while others performed poorly; the poor performance is likely related to explanatory variables with too many categories, too few observations, class imbalance, and improper discretization of continuous variables. This study suggests that the problems of too many categories, too few observations, and class imbalance can be addressed by merging categories appropriately, and that continuous variables can be discretized by cutting at points of the original data's distribution.
    Future research should focus on how to set the selection threshold for categorical data, and on whether filter methods can be combined with wrapper and embedded methods into new algorithms that identify important variables more precisely and improve the efficiency of data analysis.
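    The two remedies suggested above for the real-data pitfalls (merging sparse categories, and discretizing a continuous variable at cut points of its own empirical distribution) can be sketched as follows. This is a minimal illustration, not the thesis's actual procedure: the `min_count` threshold and the use of quartiles as cut points are assumptions chosen for the example.

    ```python
    # Sketch of the remedies described in the abstract (illustrative settings):
    # 1) collapse categories observed fewer than min_count times into one level;
    # 2) bin a continuous variable at the quartiles of its empirical distribution.
    from collections import Counter
    from statistics import quantiles

    def merge_rare(values, min_count=2, other="OTHER"):
        """Collapse categories rarer than min_count into a single 'OTHER' level."""
        counts = Counter(values)
        return [v if counts[v] >= min_count else other for v in values]

    def discretize_by_quartiles(values):
        """Cut a continuous variable at its empirical quartiles (distribution-based bins)."""
        q1, q2, q3 = quantiles(values, n=4)
        def bin_of(v):
            if v <= q1:
                return "Q1"
            if v <= q2:
                return "Q2"
            if v <= q3:
                return "Q3"
            return "Q4"
        return [bin_of(v) for v in values]

    print(merge_rare(["a", "a", "b", "b", "c"]))        # rare "c" becomes "OTHER"
    print(discretize_by_quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
    ```

    Both transformations feed back into the filter pipeline: after merging and binning, every column is categorical, so the same association indices apply uniformly.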
    With the rapid development of machine learning, the importance of feature selection is self-evident. Appropriate feature selection can improve the predictive accuracy of statistical models, reduce computational cost, and help analysts better comprehend the data. Feature selection methods are commonly divided into filter, wrapper, and embedded methods; this study focuses on the filter approach. We implemented filter-based selection using several indices that measure the association between variables, such as the Pearson correlation coefficient, entropy-based measures, Goodman and Kruskal's τ, and Cramér's V, and compared the predictive performance of the dimensionally reduced datasets under each index with that of the original datasets. Ten datasets were used in the experiments, two simulated and eight real, most of them categorical. On the simulated data, conditional entropy selected important variables better than the other indices when the variables were categorical. On the real data, some datasets retained good performance after selection while others did not; we attribute the poor performance to explanatory variables with too many categories, too few observations, class imbalance, and improper discretization of continuous variables. We believe these problems can be addressed by merging categories judiciously and by discretizing continuous variables according to the distribution of the original data. Future research should focus on how to set the selection threshold when filtering variables in categorical datasets, and on whether new algorithms can be created by integrating filter, wrapper, and embedded methods to improve both the quality of feature selection and the efficiency of categorical data analysis.
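    As a minimal sketch of the filter approach the abstract describes: score each categorical feature's association with the target using an index such as Cramér's V or conditional entropy, then keep the top-ranked features. The toy data, column names, and cutoff `k` below are illustrative assumptions, not the thesis's datasets or settings.

    ```python
    # Filter-method sketch: rank categorical features by association with the
    # target. Cramér's V and conditional entropy are two of the indices the
    # study compares; the toy data and cutoff k are assumptions for illustration.
    import math
    from collections import Counter

    def cramers_v(x, y):
        """Cramér's V, from the chi-squared statistic of the x-by-y contingency table."""
        n = len(x)
        cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
        chi2 = sum((cxy.get((a, b), 0) - cx[a] * cy[b] / n) ** 2 / (cx[a] * cy[b] / n)
                   for a in cx for b in cy)
        k = min(len(cx), len(cy)) - 1          # normalizer: min(rows, cols) - 1
        return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

    def conditional_entropy(y, x):
        """H(Y | X): uncertainty left in y after observing x (lower = stronger)."""
        n = len(y)
        cx, cxy = Counter(x), Counter(zip(x, y))
        return -sum((c / n) * math.log((c / n) / (cx[a] / n))
                    for (a, b), c in cxy.items())

    def filter_select(features, target, k=1):
        """Rank features by Cramér's V with the target and keep the top k."""
        scores = {name: cramers_v(col, target) for name, col in features.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    # Toy example: f1 determines the target exactly, f2 is unrelated noise.
    target = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
    features = {
        "f1": ["a", "a", "b", "b", "a", "b", "a", "b"],
        "f2": ["x", "y", "x", "y", "y", "x", "x", "y"],
    }
    print(filter_select(features, target))     # f1 ranks first
    ```

    Because the scoring is independent of any downstream model, this is a filter (rather than wrapper or embedded) method: the same ranked list can be reused with any classifier.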
    Reference: Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.
    Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175-185.
    Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on pattern analysis and machine intelligence, 19(7), 711-720.
    Beh, E. J., & Davy, P. J. (1998). Theory & Methods: Partitioning Pearson’s Chi‐Squared Statistic for a Completely Ordered Three‐Way Contingency Table. Australian & New Zealand Journal of Statistics, 40(4), 465-477.
    Boltz, S., Debreuve, E., & Barlaud, M. (2007). kNN-based high-dimensional Kullback-Leibler distance for tracking. In Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07) (p. 16). IEEE.
    Boltz, S., Debreuve, E., & Barlaud, M. (2009). High-dimensional statistical measure for region-of-interest tracking. IEEE Transactions on Image Processing, 18(6), 1266-1283.
    Bommert, A., Sun, X., Bischl, B., Rahnenführer, J., & Lang, M. (2020). Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis, 143, 106839.
    Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
    Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
    Cover, T. M., & Thomas, J. A. (1991). Entropy, relative entropy and mutual information. Elements of information theory, 2(1), 12-13.
    Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance. Proceedings of 5th Future Business Technology Conference (FUBUTEC 2008) pp. 5-12.
    Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.
    Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215-232.
    Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press, Princeton, NJ.
    D'Ambra, L., & Lauro, N. (1989). Non symmetrical analysis of three-way contingency tables. In Multiway data analysis (pp. 301-315).
    D’Ambra, L., Beh, E. J., & Lombardo, R. (2005). Decomposing Goodman-Kruskal tau for Ordinal Categorical Variables. International Statistical Institute, 55th.
    Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–769.
    Gruosso, T., Mieulet, V., Cardon, M., Bourachot, B., Kieffer, Y., Devun, F., ... & Mechta-Grigoriou, F. (2016). Chronic oxidative stress promotes H2AX protein degradation and enhances chemosensitivity in breast cancer patients. EMBO Molecular Medicine, 8(5), 527-549.
    Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157-1182.
    Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (Eds.). (2008). Feature extraction: foundations and applications (Vol. 207). Springer.
    Hull, J. J. (1994). A database for handwritten text recognition. IEEE Trans. Pattern Anal. Mach. Intelligence, 16(5), 550-554.
    Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), 79-86.
    Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial intelligence in medicine, 23(2), 149-169.
    Masoudi-Sobhanzadeh, Y., Motieghader, H., & Masoudi-Nejad, A. (2019). FeatureSelect: a software for feature selection based on machine learning approaches. BMC bioinformatics, 20(1), 1-17.
    National Development Council (2020). 2018 Mobile Phone Users' Digital Opportunity Survey (AE080006) [data file]. Available from Survey Research Data Archive, Academia Sinica. doi:10.6141/TW-SRDA-AE080006-1
    Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352), 240-242.
    Remeseiro, B., & Bolon-Canedo, V. (2019). A review of feature selection methods in medical applications. Computers in biology and medicine, 112, 103375.
    Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M., & Mendes, M. P. (2018). Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods. Science of the total environment, 624, 661-672.
    Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379-423.
    Sun, Y., Lu, C., & Li, X. (2018). The cross-entropy based multi-filter ensemble method for gene selection. Genes, 9(5), 258.
    Wah, Y. B., Ibrahim, N., Hamid, H. A., Abdul-Rahman, S., & Fong, S. (2018). Feature Selection Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy. Pertanika Journal of Science & Technology, 26(1).
    Wang, J., Xu, J., Zhao, C., Peng, Y., & Wang, H. (2019). An ensemble feature selection method for high-dimensional data based on sort aggregation. Systems Science & Control Engineering, 7(2), 32-39.
    Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Icml (Vol. 97, No. 412-420, p. 35).
    Yöntem, M. K., Adem, K., İlhan, T., & Kılıçarslan, S. (2019). Divorce prediction using correlation-based feature selection and artificial neural networks. Nevşehir Hacı Bektaş Veli Üniversitesi SBE Dergisi, 9(1), 259-273.
    Description: Master's thesis
    National Chengchi University
    Department of Statistics
    109354026
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0109354026
    Data Type: thesis
    DOI: 10.6814/NCCU202200500
    Appears in Collections: [Department of Statistics] Theses

    Files in This Item:

    File: 402601.pdf (6,042 Kb, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.

