Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/142122
|
Title: | 單細胞數據重新排序後降維的穩定性 Stability of Single-cell Dimension Reduction after Data Shuffling |
Authors: | 黃渝庭 Huang, Yu-Ting |
Contributors: | 張家銘 Chang, Jia-Ming 黃渝庭 Huang, Yu-Ting |
Keywords: | single-cell transcriptomics; scRNA-seq; PCA; PHATE; Monocle 3; t-SNE; UMAP |
Date: | 2022 |
Issue Date: | 2022-10-05 09:14:37 (UTC+8) |
Abstract: | Unlike bulk RNA-seq, single-cell RNA-seq (scRNA-seq) quantifies the transcriptome accurately at the level of individual cells. However, scRNA-seq data are high-dimensional, which makes it difficult to visualize them and to investigate the biological information hidden inside. Many dimensionality reduction algorithms have therefore been proposed for scRNA-seq so that biologists can find relationships between cell types and discover new ones. We used Monocle 3, an R toolkit for downstream scRNA-seq analysis, to study the stability of the low-dimensional visualizations produced by three dimensionality reduction methods it provides: PCA, t-SNE, and UMAP. We also included PHATE, a more recently proposed algorithm implemented in Python. We compared the four methods (PHATE, PCA, t-SNE, and UMAP) and verified that their results vary, both visually and quantitatively, when the input changes. Here a changed input means the same sparse matrix with its rows (genes) or columns (cells) shuffled.
We used seven datasets: one, the embryoid body dataset, to test PHATE, and the other six to test PCA, t-SNE, and UMAP. Of these six, two are the C. elegans data from the Monocle 3 tutorial, two come from the Single Cell Portal website (the PBMC dataset with ID 345 and the pancreatic islet dataset with ID 1526), and the last two are mouse retina and brain samples. In the visualizations, PHATE and PCA produce stable clusters. However, because PCA is a linear method, it does not give good visual results and does not separate cells well. t-SNE and UMAP sometimes change the relationships between clusters in 2D after the cells or genes are shuffled. Beyond the visual comparison, we applied several metrics for quantification: knn-preservation, which measures how well neighborhood relations are preserved at the level of individual cells; three internal evaluation indexes (the Calinski-Harabasz Index, the Davies-Bouldin Index, and the Xie-Beni Index), which assess cluster quality; the Jaccard Index (JI), which measures whether two clustering results assign the same cells to the same group; and Robinson–Foulds (RF) hierarchical, which evaluates whether global structure is preserved, that is, whether the captured relationships are preserved between the original and shuffled inputs.
Across all datasets, knn-preservation changes little between the original and shuffled inputs. The three internal evaluation indexes change somewhat after shuffling. For example, the Calinski-Harabasz Index is 4.3788 for the original input, whereas its mean over 100 reorderings is 4.5402 with a standard deviation of 0.0617. The JI of some datasets is below 0.60, which means the clustering is very unstable. The RF-hierarchical values for our datasets are mostly non-zero, and some even reach the worst case, which means that reordering the input data caused large changes in the cluster-to-cluster relationships. In conclusion, the clustering results of these dimensionality reduction methods are not stable under shuffled input. It is therefore somewhat risky to draw biological inferences from a single dimensionality reduction result on the original data without considering the variation caused by changing the input order. |
Description: | Master's thesis, National Chengchi University, Department of Computer Science, 109753102 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0109753102 |
Data Type: | thesis |
DOI: | 10.6814/NCCU202201606 |
Appears in Collections: | [Department of Computer Science] Theses
|
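The shuffling experiment and two of the metrics named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the thesis's code: the function names (`shuffle_order`, `restore_order`, `knn_preservation`, `cluster_jaccard`) are hypothetical, knn-preservation is assumed here to be the mean fraction of shared k nearest neighbours between two embeddings, and the Jaccard Index is assumed to be the best per-cluster overlap between two clusterings. The exact definitions used in the thesis may differ.

```python
import math
import random


def shuffle_order(items, seed):
    """Shuffle a list of cells (or genes); shuffled[i] == items[perm[i]]."""
    rng = random.Random(seed)
    perm = list(range(len(items)))
    rng.shuffle(perm)
    return [items[i] for i in perm], perm


def restore_order(rows, perm):
    """Undo shuffle_order so per-cell results line up with the original order."""
    restored = [None] * len(rows)
    for new_pos, old_pos in enumerate(perm):
        restored[old_pos] = rows[new_pos]
    return restored


def knn_sets(points, k):
    """Brute-force k nearest neighbours (Euclidean distance) of every point."""
    result = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append({j for _, j in ranked[:k]})
    return result


def knn_preservation(emb_a, emb_b, k):
    """Assumed definition: mean fraction of each cell's k neighbours shared by both embeddings."""
    a, b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return sum(len(x & y) / k for x, y in zip(a, b)) / len(a)


def cluster_jaccard(labels_a, labels_b):
    """Assumed definition: for each cluster in A, its best Jaccard overlap with any cluster in B."""
    def members(labels):
        groups = {}
        for cell, label in enumerate(labels):
            groups.setdefault(label, set()).add(cell)
        return groups

    groups_a, groups_b = members(labels_a), members(labels_b)
    return {
        label: max(len(s & t) / len(s | t) for t in groups_b.values())
        for label, s in groups_a.items()
    }
```

In a stability check along these lines, one would embed the original input, embed the shuffled input, pass the shuffled embedding and cluster labels through `restore_order`, and then compare the two results with `knn_preservation` and `cluster_jaccard`. Restoring the order first matters because both metrics compare cells by position.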
Files in This Item:
File | Description | Size | Format |
310201.pdf | | 38082Kb | Adobe PDF |
|