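Abstract: | This paper uses a prostate cancer protein database consisting of serum protein intensity data obtained by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry, and uses these data to determine whether a subject has cancer. Subjects in the database fall into four classes: normal, benign tumor, early-stage cancer, and late-stage cancer. The database contains two data sets: a raw data set with about 48,000 interval measurements (variables), and a manually processed data set in which manual variable selection leaves only 779 interval measurements (variables). Both are high-dimensional, and each has about 650 observations. Because high-dimensional data have so many variables, they are hard to analyze and take longer to compute. The aim of this study is therefore to find, under effective dimension reduction, the approach that minimizes the misclassification rate.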
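Besides examining how well the dimension reduction methods considered classify this case database, this study also combines a linear dimension reduction method, principal component analysis, with a nonlinear one, principal component analysis networks, in the hope that this overlap method will further lower the screening misclassification rate obtained with a single dimension reduction method. According to the analysis results, the overlap method yields no clear improvement on the raw data but a clear improvement on the manually processed data.
In this paper, we study a serum protein data set for prostate cancer, which was acquired by the Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) technique. The data set covers four subject populations (normal, benign tumor, early-stage cancer, and late-stage cancer) and includes both raw data and preprocessed data. There are around 48,000 variables in the raw data and 779 variables in the preprocessed data, and each has a sample size of around 650. Because of this high dimensionality, the data set is difficult to analyze and costly to compute on. The goal of this study is therefore to search for efficient dimension reduction methods.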
We first compare three classification methods: support vector machine, artificial neural network, and classification and regression tree. We also use discrete wavelet transform, principal component analysis, and principal component analysis networks to reduce the data dimension.
We then discuss the dimension reduction methods and propose an overlap method that combines a linear dimension reduction method, principal component analysis, with a nonlinear one, principal component analysis networks, to improve the classification results. We find that the improvement from the overlap method is significant for the preprocessed data but not for the raw data. |
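
As an illustration of how a discrete wavelet transform can shrink a long intensity vector, the following is a minimal sketch and not the procedure used in the study. The abstract does not say which wavelet or how many decomposition levels were used, so the Haar wavelet, the function names `haar_step` and `haar_reduce`, and the toy spectrum are all assumptions made for this example. Each level keeps only the approximation coefficients, which halves the number of variables.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    values = list(signal)
    if len(values) % 2:
        # Assumption: pad odd-length input by repeating the last value.
        values.append(values[-1])
    s = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return approx, detail


def haar_reduce(signal, levels):
    """Keep only the approximation coefficients after `levels` Haar steps."""
    coeffs = list(signal)
    for _ in range(levels):
        if len(coeffs) < 2:
            break  # nothing left to halve
        coeffs, _detail = haar_step(coeffs)  # detail coefficients are discarded
    return coeffs


if __name__ == "__main__":
    # Hypothetical 16-point "spectrum"; real data would have far more points.
    spectrum = [0.1, 0.2, 0.4, 0.9, 1.5, 0.8, 0.3, 0.2,
                0.1, 0.1, 0.6, 1.2, 0.7, 0.2, 0.1, 0.1]
    reduced = haar_reduce(spectrum, levels=2)
    print(len(reduced))  # 4
```

Under these assumptions, six levels would turn a vector of exactly 48,000 points into 750 coefficients. The reduced vectors could then be passed to a classifier or to a further reduction step such as principal component analysis.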
