Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/141555
|
Title: | scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM |
Authors: | 温上蓉 Wen, Shang-Jung |
Contributors: | Yu, Fang (郁方); Chang, Jia-Ming (張家銘); Wen, Shang-Jung (温上蓉) |
Keywords: | Unsupervised clustering; single-cell sequencing (scRNA-seq); GHSOM; CyTOF; CRISPR |
Date: | 2021 |
Issue Date: | 2022-09-02 14:47:42 (UTC+8) |
Abstract: | Data science has become indispensable in biomedical research. Analyzing heterogeneous data on genes and cells reveals correlations and relationships that can help predict effective treatments for the corresponding diseases.
We apply an unsupervised hierarchical clustering method, the Growing Hierarchical Self-Organizing Map (GHSOM), to biological data such as CyTOF, single-cell sequencing, and CRISPR genomic data. To identify the significant attributes of each cluster, we propose a Significant Attributes Identification Algorithm. The algorithm selects attributes that vary little within the target cluster and vary strongly between clusters; these are the attributes that most influence which data fall into the target cluster.
Because the hierarchical structure of GHSOM clustering results is hard to present and analyze, we also propose two visualizations, the Cluster Feature Map and the Cluster Distribution Map. The Cluster Feature Map is a tree-structured diagram that colors each cluster by a chosen feature, such as the value of a particular attribute (the color mapping can be freely defined), so users can easily see how that feature behaves across the clustering result. The Cluster Distribution Map draws each leaf cluster as a circle at its corresponding position in the GHSOM result. The size of a circle represents the number of data points in the cluster, and its color can likewise be mapped to a feature of interest, such as cell type or a particular attribute value. Together, the two maps present and analyze the hierarchical clustering result without dimension reduction techniques such as UMAP and t-SNE.
The hierarchical structure of GHSOM reveals more detail in high-dimensional data than non-hierarchical clustering methods. We compared GHSOM with seven other clustering methods, including ACCENSE, K-means, and flowMeans. External evaluation measures how well the clustering agrees with the known data labels; using the adjusted Rand index (ARI), GHSOM scored 0.88 and ranked third. Internal evaluation ignores the labels and measures how clean the clustering is, that is, small distances within clusters and large distances between clusters; using the CH score, GHSOM scored 4.2, the best of all the methods compared. Both evaluations indicate that GHSOM is competitive among clustering methods.
We also propose combining the two maps to present the results of clustering gene-cell dependency data with GHSOM, so that users can grasp the behavior and distribution of the hierarchical clusters at a glance. |
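The abstract's rule for significant attributes (small variation inside the target cluster, large variation between clusters) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `rank_attributes`, the ratio score, and the toy data are all assumptions chosen only to show the idea.

```python
from statistics import mean, pvariance

def rank_attributes(rows, labels, target, eps=1e-9):
    """Rank attribute indices for the `target` cluster.

    Hypothetical score: variance of the per-cluster means (between clusters)
    divided by the variance inside the target cluster (within cluster).
    """
    # Group the data rows by their cluster label.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    scores = []
    for j in range(len(rows[0])):
        within = pvariance([r[j] for r in clusters[target]])
        centers = [mean(r[j] for r in members) for members in clusters.values()]
        between = pvariance(centers)
        # eps avoids division by zero when the target cluster is constant.
        scores.append((between / (within + eps), j))

    # Highest score first: tight inside the cluster, well separated outside.
    return [j for _, j in sorted(scores, reverse=True)]

# Toy data (invented): attribute 0 separates the clusters, attribute 1 is noise.
rows = [
    [1.0, 5.0], [1.1, 9.0], [0.9, 1.0],   # cluster "a"
    [8.0, 5.2], [8.2, 0.8], [7.9, 9.1],   # cluster "b"
]
labels = ["a", "a", "a", "b", "b", "b"]
ranking = rank_attributes(rows, labels, target="a")
print(ranking)
```

In this toy case attribute 0 ranks first for cluster "a" because its values barely move inside the cluster while the two cluster means are far apart. A real implementation would need a threshold or a statistical test to decide how many top-ranked attributes count as significant.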
Description: | Master's thesis, National Chengchi University, Department of Management Information Systems, 108356002 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0108356002 |
Data Type: | thesis |
DOI: | 10.6814/NCCU202201336 |
Appears in Collections: | [Department of Management Information Systems] Theses
|
Files in This Item:
File | Description | Size | Format |
600201.pdf | | 9001Kb | Adobe PDF |
|
All items in the NCCU Institutional Repository (政大典藏) are protected by copyright, with all rights reserved.
|