Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/146305
Title: | The supervised approach for converting tabular data into images for CNN-based deep learning prediction |
Authors: | Tu, Yu-Shan |
Contributors: | Wu, Han-Ming; Tu, Yu-Shan |
Keywords: | Within and between analysis; Supervised distance matrix; Convolutional neural network |
Date: | 2023 |
Issue Date: | 2023-08-02 13:03:47 (UTC+8) |
Abstract: | When tackling classification problems on tabular data with traditional machine learning methods such as decision trees, random forests, and support vector machines, we usually need feature extraction and preprocessing. Recent research has instead proposed converting tabular data into images and training convolutional neural network (CNN) models on the converted images; this not only removes those preprocessing steps but can also yield better predictions. Among these methods, the Image Generator for Tabular Data (IGTD) assigns each feature (variable) of the tabular data to a unique pixel position by minimizing the difference between the feature distance matrix and the pixel-position distance matrix of the target image, generating one image per sample in which pixel intensities reflect the values of the corresponding features. IGTD requires no domain knowledge of the data and provides a good feature neighborhood structure. Building on IGTD, this study introduces supervised distance calculation, incorporating class-label information into the image generation process to improve classification accuracy. First, using the class labels, we apply Within and Between Analysis (WABA) to compute several correlation coefficients between features and their corresponding distances. Then we use the images generated from these different correlation coefficients for data augmentation, increasing the sample count and mitigating the common situation in which the number of samples is far smaller than the number of features. We also examine different formulas for converting correlation coefficients into distances, to understand their effect on the generated images and on the CNN results. Applied to several real gene expression datasets, the proposed method outperforms IGTD: it significantly improves the CNN's prediction accuracy and broadens the range of tabular data applications for CNNs. |
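The IGTD mechanism the abstract describes, placing each feature at a unique pixel so that the feature distance matrix matches the pixel-position distance matrix, can be illustrated with a small Python sketch. This is a simplified illustration, not the published algorithm: IGTD (Zhu et al., 2021) works on ranked distance matrices with an error-guided swap schedule, whereas igtd_assign below uses a plain absolute-difference error and random pairwise swaps, and its names and parameters are hypothetical.

import numpy as np

def igtd_assign(D_feat, rows, cols, n_iter=2000, seed=0):
    """Place n = rows*cols features on an image grid so that the feature
    distance matrix resembles the pixel-coordinate distance matrix.
    Simplified random-swap variant of the IGTD idea, not the exact algorithm."""
    rng = np.random.default_rng(seed)
    n = rows * cols
    assert D_feat.shape == (n, n), "one feature per pixel is assumed"
    # Euclidean distances between pixel coordinates on the grid.
    coords = np.array([(i // cols, i % cols) for i in range(n)], dtype=float)
    D_pix = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

    perm = rng.permutation(n)  # perm[k] = index of the feature placed at pixel k
    err = lambda p: np.abs(D_feat[np.ix_(p, p)] - D_pix).sum()
    best = err(perm)
    for _ in range(n_iter):
        i, j = rng.integers(0, n, size=2)
        perm[i], perm[j] = perm[j], perm[i]      # propose swapping two pixels
        e = err(perm)
        if e < best:
            best = e                              # keep an improving swap
        else:
            perm[i], perm[j] = perm[j], perm[i]   # otherwise revert it
    return perm

# One image per sample: pixel intensities are the reordered feature values.
# For a sample vector x of length rows*cols:  img = x[perm].reshape(rows, cols)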
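The supervised ingredient, as the abstract describes it, is a WABA-style decomposition of feature correlations into within-class and between-class parts, each convertible to a distance matrix that can drive the assignment above. The sketch below assumes the standard decomposition into class-mean components and deviations from class means; the two formulas in corr_to_dist are common textbook conversions standing in for, not reproducing, the thesis's exact variants.

import numpy as np

def waba_correlations(X, y):
    """Within- and between-class feature correlation matrices (WABA-style).
    X: (n_samples, n_features) tabular data; y: (n_samples,) class labels."""
    classes, inv = np.unique(y, return_inverse=True)
    # "Between" part: each sample replaced by its class mean.
    class_means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    X_between = class_means[inv]
    # "Within" part: deviations from the class means.
    X_within = X - X_between
    # Features that are constant within (or between) classes yield NaN rows.
    r_within = np.corrcoef(X_within, rowvar=False)
    r_between = np.corrcoef(X_between, rowvar=False)
    return r_within, r_between

def corr_to_dist(r, formula="one_minus"):
    """Convert a correlation matrix to a distance matrix (illustrative formulas)."""
    if formula == "one_minus":
        return 1.0 - r                   # d = 1 - r, in [0, 2]
    if formula == "sqrt":
        return np.sqrt(1.0 - r ** 2)     # d = sqrt(1 - r^2), ignores the sign of r
    raise ValueError(f"unknown formula: {formula}")

Each correlation variant (within, between, or the plain unsupervised correlation) combined with each conversion formula yields a different distance matrix, hence a different pixel layout and a different image of the same sample, which is the multiplicity the abstract exploits for data augmentation, e.g. perm = igtd_assign(corr_to_dist(r_within), rows, cols) for one such layout.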
Reference: | Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503–511.
Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R., & Korsmeyer, S. J.(2002). Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature genetics,0(1), 41–47.
Bazgir, O., Zhang, R., Dhruba, S. R., Rahman, R., Ghosh, S., & Pal, R. (2020). Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nature communications, 11(1), 4391.
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade: Second Edition, (pp. 437–478).
Bertucci, F., Salas, S., Eysteries, S., Nasser, V., Finetti, P., Ginestier, C., CharafeJauffret, E., Loriod, B., Bachelart, L., Montfort, J., et al. (2004). Gene expression profiling of colon cancer by dna microarrays and correlation with histoclinical parameters. Oncogene, 23(7), 1377–1391.
Chollet, F. (2021). Deep learning with Python. Simon and Schuster
Ciregan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3642–3649).: IEEE.
Dansereau, F., Alutto, J. A., & Yammarino, F. J. (1984). Theory testing in organizational behavior: The varient approach. Prentice Hall.
Díaz-Uriarte, R. (2005). Supervised methods with genomic data: a review and cautionary view. Data Analysis and Visualization in Genomics and Proteomics (pp. 193–214).
Gu, Q., Li, Z., & Han, J. (2012). Generalized fisher score for feature selection. arXiv preprint arXiv:1202.3725.
Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., & Cuadros, J. (2016). Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22), 2402–2410.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. Wiley, Hoboken.
Hua, J., Xiong, Z., Lowey, J., Suh, E., & Dougherty, E. R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8), 1509–1515.
Jirapech-Umpai, T. & Aitken, S. (2005). Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinformatics, 6(1), 1–11.
Kamnitsas, K., Ledig, C., Newcombe, V. F., Simpson, J. P., Kane, A. D., Menon, D. K., Rueckert, D., & Glocker, B. (2017). Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis, 36, 61–78.
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673–679.
Kim, K., Zhang, S., Jiang, K., Cai, L., Lee, I.-B., Feldman, L. J., & Huang, H. (2007). Measuring similarities between gene expression profiles through new data transformations. BMC Bioinformatics, 8, 1–14.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4), 869–885.
Li, Y., Campbell, C., & Tipping, M. (2002). Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18(10), 1332–1339.
Ma, S. & Zhang, Z. (2018). OmicsMapNet: Transforming omics data to take advantage of deep convolutional neural network for discovery. arXiv preprint arXiv:1804.05283.
Odena, A., Olah, C., & Shlens, J. (2017). Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (pp. 2642–2651). PMLR.
Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
Sharma, A., Vans, E., Shigemizu, D., Boroevich, K. A., & Tsunoda, T. (2019). DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Scientific Reports, 9(1), 11399.
Simonyan, K. & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203–209.
Wainberg, M., Merico, D., Delong, A., & Frey, B. J. (2018). Deep learning in biomedicine. Nature Biotechnology, 36(9), 829–838.
Wu, H.-M., Tien, Y.-J., Ho, M.-R., Hwu, H.-G., Lin, W.-C., Tao, M.-H., & Chen, C.-H. (2018). Covariate-adjusted heatmaps for visualizing biological data via correlation decomposition. Bioinformatics, 34(20), 3529–3538.
Yeung, K. Y., Bumgarner, R. E., & Raftery, A. E. (2005). Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21(10), 2394–2402.
Zhu, Y., Brettin, T., Xia, F., Partin, A., Shukla, M., Yoo, H., Evrard, Y. A., Doroshow, J. H., & Stevens, R. L. (2021). Converting tabular data into images for deep learning with convolutional neural networks. Scientific Reports, 11(1), 11325. |
Description: | Master's thesis, Department of Statistics, National Chengchi University. Student ID: 110354011 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0110354011 |
Data Type: | thesis |
Appears in Collections: | [Department of Statistics] Theses
Files in This Item:
File | Size | Format
401101.pdf | 14087Kb | Adobe PDF