Please use this permanent URL to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/159412
|
Title: | Improvement of viral gene assembly and host analysis methods based on Hi-C data |
Author: | 李佳芬 Li, Jia-Fen |
Contributors: | 張家銘 Chang, Jia-Ming; 李佳芬 Li, Jia-Fen |
Keywords: | Hi-C; Binning; Microbial genomes; Clustering algorithms; Graph neural networks; Virus–host interactions |
Date: | 2025 |
Upload time: | 2025-09-01 16:56:58 (UTC+8) |
Abstract: | Deciphering virus–host linkages is pivotal in microbiome research, and Hi-C proximity ligation makes it possible to infer these associations from physical contacts between DNA fragments. ViralCC and MetaCC are representative recent Hi-C binning tools, targeting viruses and bacteria (or other prokaryotes) respectively; they can assemble genomes, predict virus–host pairs, and reconstruct microbial genomes from Hi-C interaction matrices. On environmental samples, however, they still face challenges, including incomplete assemblies, high misbinning rates, elevated genome contamination, and limited computational efficiency and scalability. This study presents an improved Hi-C microbial binning framework that builds on ViralCC and MetaCC. It centers on the Hi-C interaction matrix and integrates genomic features such as GC content, contig length, and Hi-C contact strength. The method combines parameter-tuned community detection (Leiden and Louvain) with a dynamic clustering algorithm that couples graph-structural analysis with a graph neural network (GNN); the GNN learns structure and features from the graph to produce embeddings for embedding-based clustering, with the aim of improving the completeness of reconstructed genomes and reducing contamination. The optimized method is validated by comparison with existing methods, including ViralCC and MetaCC.
In benchmarking, ViralCC generated 525 pure viral bins but could not handle host contigs, whereas MetaCC produced 211 bins, of which 158 (about 74.9%) were mixed virus–host bins, indicating that its clustering strategy conflates viral and host contigs. The proposed approach separated viral from host contigs, binning 88,792 contigs with no mixed bins, which improves bin purity and reliability and strengthens downstream virus–host pairing. Keywords: Hi-C, binning, microbial genomes, clustering algorithms, graph neural networks, virus–host interactions |
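The mixed-bin figures quoted in the abstract can be reproduced in principle with a simple purity count. The following is a minimal sketch, not the thesis's code: the function name `summarize_bins`, the contig and bin identifiers, and the toy data are all hypothetical, and it assumes each contig already carries a "viral" or "host" label from some upstream classifier.

```python
from collections import defaultdict


def summarize_bins(assignments, labels):
    """Count pure-viral, pure-host, and mixed bins.

    assignments: dict mapping contig id -> bin id (hypothetical input format)
    labels:      dict mapping contig id -> "viral" or "host" (assumed known)
    """
    # Collect the set of contig types that appear in each bin.
    kinds_per_bin = defaultdict(set)
    for contig, bin_id in assignments.items():
        kinds_per_bin[bin_id].add(labels[contig])

    summary = {"viral": 0, "host": 0, "mixed": 0}
    for kinds in kinds_per_bin.values():
        if len(kinds) > 1:
            summary["mixed"] += 1  # bin holds both viral and host contigs
        else:
            summary[next(iter(kinds))] += 1  # bin holds a single contig type

    summary["total"] = len(kinds_per_bin)
    summary["mixed_fraction"] = (
        summary["mixed"] / summary["total"] if summary["total"] else 0.0
    )
    return summary


# Toy data (illustrative only, not from the thesis).
labels = {"c1": "viral", "c2": "viral", "c3": "host", "c4": "host", "c5": "viral"}
assignments = {"c1": "bin_a", "c2": "bin_a", "c3": "bin_b", "c4": "bin_c", "c5": "bin_c"}
result = summarize_bins(assignments, labels)
print(result)

# Arithmetic check of the figure reported for MetaCC: 158 mixed bins out of 211.
reported_mixed_fraction = 158 / 211
print(f"{reported_mixed_fraction:.1%}")
```

In the toy data, `bin_c` contains one host and one viral contig and is therefore counted as mixed, while `bin_a` and `bin_b` are pure. A result with no mixed bins, as reported for the proposed method, corresponds to `mixed_fraction == 0.0`.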
References: |
[1] Du, Y., Fuhrman, J.A. & Sun, F. (2023). ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C data. Nat Commun 14, 502.
[2] Du, Y., Sun, F. (2023). MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data. Nat Commun 14, 6231.
[3] Integra Biosciences, "Short Read vs. Long Read Sequencing," Integra Biosciences. [Online]. Available: https://www.integra-biosciences.com/global/en/blog/article/short-read-vs-long-read-sequencing.
[4] dyxstat, "MetaCC: Scalable and Integrative Analyses of MetaHi-C Data," GitHub repository, 2023. [Online]. Available: https://github.com/dyxstat/MetaCC
[5] Yoon SH, Ha SM, Lim J, Kwon S, Chun J. A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie Van Leeuwenhoek. 2017 Oct;110(10):1281-1286. doi: 10.1007/s10482-017-0844-4. Epub 2017 Feb 15. PMID: 28204908.
[6] Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 5233. https://doi.org/10.1038/s41598-019-41695-z
[7] Wikipedia contributors. (2024). Contig. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Contig
[8] Wikipedia contributors. (2024). Adapter. Wikipedia. Retrieved from https://zh.wikipedia.org/zh-tw/%E8%A1%94%E6%8E%A5%E5%AD%90
[9] EMBnet. (2014). The contig: a concept in genome assembly. EMBnet.journal, 20(1), 20. https://journal.embnet.org/index.php/embnetjournal/article/view/200
[10] Chklovski, A. (2023). CheckM2: An enhanced framework for assessing genome quality using machine learning. GitHub Repository. Retrieved from: https://github.com/chklovski/CheckM2
[11] Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. https://doi.org/10.1101/gr.186072.114
[12] ScienceDirect. (2024). Adjusted Rand Index. Retrieved from: https://www.sciencedirect.com/topics/computer-science/adjusted-rand-index
[13] Wikipedia contributors. (2024). Rand Index. Wikipedia. Retrieved from: https://en.wikipedia.org/wiki/Rand_index
[14] National Center for Biotechnology Information, "Sequence Read Archive," NCBI. [Online]. Available: https://www.ncbi.nlm.nih.gov/sra.
[15] Simroux, "VirSorter: mining viral signal from microbial genomic data," GitHub repository, https://github.com/simroux/VirSorter (accessed Jul. 13, 2025).
|
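Description: | Master's thesis; National Chengchi University; Department of Computer Science; 112753103 |
Source: | http://thesis.lib.nccu.edu.tw/record/#G0112753103 |
Type: | thesis |
Appears in collections: | [Department of Computer Science] Theses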
|
Files in this item:
File | Size | Format | Views |
310301.pdf | 1503Kb | Adobe PDF | 0 |
|
All items in the NCCU Institutional Repository are protected by their original copyright.
|