Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/153385
|
Title: | Decoding the Power of PC1: A Fast and Accurate Covariance-Based Method for A/B Compartment Identification in Hi-C Data |
Authors: | 程至榮 Cheng, Zhi-Rong |
Contributors: | 張家銘 Chang, Jia-Ming; 程至榮 Cheng, Zhi-Rong |
Keywords: | High-throughput chromosome conformation capture (Hi-C); Chromatin compartment analysis; Principal Component Analysis (PCA) |
Date: | 2024 |
Issue Date: | 2024-09-04 15:00:57 (UTC+8) |
Abstract: | Principal component analysis (PCA) is the standard method for identifying A and B compartments in the Hi-C Pearson correlation matrix, yet the reason it works is rarely discussed. For the Hi-C Pearson matrix, we argue that the explained variance ratio of the first principal component (PC1) is usually high, and that this ratio reflects how well PC1 matches the compartment pattern on the Pearson matrix. We also propose a heuristic algorithm that estimates the pattern of PC1 from the covariance matrix of the Hi-C Pearson matrix without explicitly performing PCA. The heuristic can be implemented efficiently with random sampling to speed up computation. To address the memory bottleneck at finer matrix resolutions, we adapt the algorithm using principles from POSSUMM, a recently published compartment identification tool that takes the sparse Hi-C O/E matrix as input. In our experiments, our algorithm outperforms power iteration methods, such as those implemented in Scikit-learn and POSSUMM, in time or memory usage, while remaining highly similar to the ground-truth PC1. The code is freely available at https://github.com/ZhiRongDev/HiCPEP. |
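The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation (see the HiCPEP repository for that). It rests on one observation: when PC1 explains most of the variance, the covariance matrix is close to rank one, so every row of it is approximately a scaled copy of PC1. The row-selection rule used here (take the covariance row with the largest sum of absolute values) is an assumption made for illustration, as are the function names and the toy two-compartment O/E matrix.

```python
import numpy as np


def pc1_and_explained_ratio(pearson):
    """Reference PC1 and its explained variance ratio via eigendecomposition."""
    cov = np.cov(pearson, rowvar=False)      # covariance between columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()


def estimate_pc1_pattern(pearson):
    """Estimate the sign pattern of PC1 without an eigendecomposition.

    Assumption (illustrative, not taken from the abstract): pick the
    covariance row whose absolute values sum to the largest total.
    """
    cov = np.cov(pearson, rowvar=False)
    idx = np.argmax(np.abs(cov).sum(axis=1))
    return cov[idx]


def sign_similarity(a, b):
    """Fraction of bins with matching sign; the sign of a PC is arbitrary."""
    same = np.mean(np.sign(a) == np.sign(b))
    return max(same, 1.0 - same)


# Hypothetical toy data: 40 bins in four alternating compartment blocks.
rng = np.random.default_rng(0)
labels = np.repeat([1, -1, 1, -1], 10)
noise = rng.normal(0.0, 0.1, size=(40, 40))
oe = np.where(labels[:, None] == labels[None, :], 2.0, 0.5) + (noise + noise.T) / 2

pearson = np.corrcoef(oe)                    # Hi-C Pearson matrix from O/E
pc1, ratio = pc1_and_explained_ratio(pearson)
est = estimate_pc1_pattern(pearson)

print("explained variance ratio of PC1:", ratio)
print("sign agreement, estimate vs PC1:", sign_similarity(est, pc1))
```

In this sketch the estimate needs only the covariance matrix and one row selection, so no eigendecomposition or power iteration is run. The random sampling and sparse O/E adaptations mentioned in the abstract are not shown.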
Reference: | [1] Erez Lieberman-Aiden*, Nynke L. van Berkum*, et al. “Comprehensive mapping of long-range interactions reveals folding principles of the human genome.” Science 326 (2009).
[2] Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002 Feb 15;295(5558):1306-11. doi: 10.1126/science.1067799. PMID: 11847345.
[3] Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J.S., and Ren, B. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380.
[4] Rao, S. S. P., Huang, S.-C., Glenn St Hilaire, B., Engreitz, J. M., Perez, E. M., et al. (2017). Cohesin loss eliminates all loop domains. Cell 171, 305–320.e24. doi:10.1016/j.cell.2017.09.026
[5] Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014 Dec 18;159(7):1665-80. doi: 10.1016/j.cell.2014.11.021. Epub 2014 Dec 11. Erratum in: Cell. 2015 Jul 30;162(3):687-8. PMID: 25497547; PMCID: PMC5635824.
[6] Harris, H.L., Gu, H., Olshansky, M. et al. Chromatin alternates between A and B compartments at kilobase scale for subgenic organization. Nat Commun 14, 3303 (2023). https://doi.org/10.1038/s41467-023-38429-1
[7] Yaffe, E., and Tanay, A. (2011). Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43 (11), 1059–1065. doi:10.1038/ng.947
[8] Servant, N., Varoquaux, N., Lajoie, B. R., Viara, E., Chen, C. J., Vert, J. P., et al. (2015). HiC-pro: An optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259. doi:10.1186/s13059-015-0831-x
[9] Imakaev, M., Fudenberg, G., McCord, R. P., Naumova, N., Goloborodko, A., Lajoie, B.R., et al. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9 (10), 999–1003. doi:10.1038/nmeth.2148
[10] Knight, P. A., and Ruiz, D. (2013). A fast algorithm for matrix balancing. IMA J. Numer. Analysis 33 (3), 1029–1047. doi:10.1093/imanum/drs019
[11] Kalluchi A, Harris HL, Reznicek TE, Rowley MJ. Considerations and caveats for analyzing chromatin compartments. Front Mol Biosci. 2023 Apr 5;10:1168562. doi: 10.3389/fmolb.2023.1168562. PMID: 37091873; PMCID: PMC10113542.
[12] Jolliffe, I. T., and Cadima, J. (2016). Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374, 20150202. http://doi.org/10.1098/rsta.2015.0202
[13] Kruse, K., Hug, C.B. & Vaquerizas, J.M. FAN-C: a feature-rich framework for the analysis and visualization of chromosome conformation capture data. Genome Biol 21, 303 (2020). https://doi.org/10.1186/s13059-020-02215-9
[14] Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
[15] Abdennur, N., and Mirny, L.A. (2020). Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. doi: 10.1093/bioinformatics/btz540.
[16] Neva C. Durand, Muhammad S. Shamim, Ido Machol, Suhas S. P. Rao, Miriam H. Huntley, Eric S. Lander, and Erez Lieberman Aiden. “Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments.” Cell Systems 3(1), 2016.
[17] Zheng X, Zheng Y. CscoreTool: fast Hi-C compartment analysis at high resolution. Bioinformatics. 2018 May 1;34(9):1568-1570. doi: 10.1093/bioinformatics/btx802. PMID: 29244056; PMCID: PMC5925784.
[18] Xiong, K., and Ma, J. (2019). Revealing Hi-C subcompartments by imputing interchromosomal chromatin interactions. Nat. Commun. 10 (1), 5069. doi:10.1038/s41467-019-12954-4
[19] Wen, Z., Zhang, W., Zhong, Q., Xu, J., Hou, C., Qin, Z. S., et al. (2022). Extensive chromatin structure-function associations revealed by accurate 3D compartmentalization characterization. Front. Cell Dev. Biol. 10, 845118. doi:10.3389/fcell.2022.845118
[20] van Berkum NL, Lieberman-Aiden E, Williams L, Imakaev M et al. Hi-C: a method to study the three-dimensional architecture of genomes. J Vis Exp 2010 May 6;(39). PMID: 20461051
[21] Sanborn AL, Rao SS, Huang SC, Durand NC et al. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc Natl Acad Sci U S A 2015 Nov 24;112(47):E6456-65. PMID: 26499245
[22] Jonathon Shlens. A Tutorial on Principal Component Analysis. 2014. arXiv:1404.1100
[23] Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011. arXiv:1201.0490
[24] Baglama, J. & Reichel, L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27, 19–42 (2005). https://doi.org/10.1137/04060593X
[25] Free Software Foundation, Inc. (2014). GNU Datamash. Retrieved from https://www.gnu.org/software/datamash/ |
Description: | Master's thesis, National Chengchi University, Department of Computer Science, 111753151 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0111753151 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
|
Files in This Item:
File | Description | Size | Format |
315101.pdf | | 5331Kb | Adobe PDF |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|