Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/154569
|
Title: | MCIENet: Multi-scale CNN-based Information Extraction from DNA Sequences for 3D chromatin interactions Prediction |
Authors: | 何彥南 Ho, Yen-Nan |
Contributors: | 張家銘 Chang, Jia-Ming 何彥南 Ho, Yen-Nan |
Keywords: | Chromatin loop prediction; Deep learning; DNA sequence; Inception architecture; 3D genomics |
Date: | 2024 |
Issue Date: | 2024-12-02 11:21:52 (UTC+8) |
Abstract: | The three-dimensional structure of chromatin plays a crucial role in gene regulation. Chromatin loops, the fundamental units of that structure, differ in structure and function across cell types, and studying three-dimensional chromatin structure helps scientists better understand how cells function. However, obtaining three-dimensional structural information experimentally is costly in equipment, time, and sample acquisition. Consequently, numerous computational methods have been proposed to predict whether a CTCF loop exists from DNA sequence, protein, or open chromatin information. Among these, prediction from DNA sequence alone is the most challenging task. In this study, we propose a novel deep learning model, MCIENet (Multi-scale CNN-based Information Extraction Net), which uses an Inception architecture to extract multi-scale features from DNA sequences. We validated MCIENet on normal cells (GM12878) and cancer cells (HeLa-S3). The results show that MCIENet achieves strong predictive performance on both cell types, especially when longer DNA sequences are used as input. Our findings also reveal that the suitable model architecture design differs between cell types. Additionally, we fine-tuned DNABERT2-512, a model pre-trained on a large amount of genomic data, and found that it performed poorly on cancer cells (HeLa-S3). This confirms that such pre-trained models cannot be applied to structure prediction for every cell type. Moreover, DeepLIFT interpretability analysis shows that MCIENet captures fine-grained detail better when given long input sequences. The same analysis confirms a weakness of anchor-based methods: when the anchor center is shifted, their predictions become unstable, which limits their use in subsequent applications. |
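The abstract describes multi-scale feature extraction from DNA sequences with an Inception-style architecture. As a rough illustration of that general idea only, and not the thesis's actual implementation, the sketch below one-hot encodes a sequence and runs parallel 1D convolutions of different widths over it, then concatenates the branch outputs per position. The function names, the hand-set motif kernels (`C`, `CTC`, `CCCTC`), and the toy sequence are all assumptions made for this example; a real model would learn its kernels and use a deep learning framework.

```python
from typing import Dict, List

Matrix = List[List[float]]

# One-hot channels in A, C, G, T order (an assumed convention).
ONE_HOT: Dict[str, List[float]] = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}


def one_hot_encode(seq: str) -> Matrix:
    """Encode a DNA string as an L x 4 matrix; unknown bases become zeros."""
    return [ONE_HOT.get(base, [0.0] * 4) for base in seq.upper()]


def conv1d_same(x: Matrix, kernel: Matrix) -> List[float]:
    """Zero-padded 1D convolution (output length L) followed by ReLU."""
    width = len(kernel)
    left = (width - 1) // 2  # padding so the window is centred on position i
    out = []
    for i in range(len(x)):
        total = 0.0
        for j in range(width):
            pos = i + j - left
            if 0 <= pos < len(x):
                total += sum(w * v for w, v in zip(kernel[j], x[pos]))
        out.append(max(0.0, total))
    return out


def multi_scale_block(x: Matrix, kernels: List[Matrix]) -> Matrix:
    """Run every kernel as a parallel branch and concatenate per position."""
    branches = [conv1d_same(x, kernel) for kernel in kernels]
    return [list(column) for column in zip(*branches)]


# Hypothetical toy input: hand-set motif detectors of width 1, 3 and 5.
sequence = "TTCCCTCTT"
kernels = [one_hot_encode(motif) for motif in ("C", "CTC", "CCCTC")]
features = multi_scale_block(one_hot_encode(sequence), kernels)

for base, row in zip(sequence, features):
    print(base, row)
```

Each output row holds one value per branch, so short and long patterns around the same position are visible side by side, which is the point of using several kernel widths in parallel.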
Description: | Master's thesis, National Chengchi University, Department of Computer Science, 110753202 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0110753202 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses and Dissertations
|
Files in This Item:
File | Description | Size | Format |
320201.pdf | | 24538Kb | Adobe PDF |
|
All items in the NCCU Institutional Repository (政大典藏) are protected by copyright, with all rights reserved.
|