National Chengchi University Institutional Repository (NCCUR): Item 140.119/159412

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/159412

Title:	Improvement of viral gene assembly and host analysis methods based on Hi-C data (基於Hi-C數據的病毒基因組組裝與宿主關聯分析方法改進)
Authors:	李佳芬 Li, Jia-Fen
Contributors:	張家銘 Chang, Jia-Ming; 李佳芬 Li, Jia-Fen
Keywords:	Hi-C; Binning; Microbial genomes; Clustering algorithms; Graph neural networks; Virus–host interactions
Date:	2025
Issue Date:	2025-09-01 16:56:58 (UTC+8)
Abstract:	Deciphering virus–host linkages is pivotal in microbiome research, and Hi-C proximity ligation allows these associations to be inferred from physical contacts between DNA fragments. ViralCC and MetaCC are representative recent Hi-C binning tools, targeting viruses and bacteria (or other prokaryotes) respectively; they can assemble genomes, predict virus–host pairs, and reconstruct microbial genomes from Hi-C interaction matrices. On environmental samples, however, these tools still face incomplete assemblies, high misbinning rates, elevated genome contamination, and limited computational efficiency and scalability.
This study optimizes ViralCC and MetaCC and proposes an improved Hi-C microbial binning framework. The framework centers on the Hi-C interaction matrix and integrates several genomic features (such as GC content, contig length, and Hi-C contact strength). It combines parameter-tuned community detection (Leiden and Louvain) with a new dynamic clustering algorithm that couples graph-structural analysis with a graph neural network (GNN). The GNN learns the structure and features of the graph and produces embeddings for embedding-based clustering, with the aim of increasing the completeness of reconstructed microbial genomes and reducing contamination. The improved method is evaluated against ViralCC and MetaCC.
In benchmarking, ViralCC generated 525 pure viral bins but could not handle host contigs, whereas MetaCC produced 211 bins, of which 158 (74.9%) were mixed virus–host bins, indicating that its clustering strategy confounds viral and host contigs. The proposed approach separated viral from host contigs, binning 88,792 contigs with no mixed bins, thereby improving bin purity and reliability and strengthening downstream virus–host pairing.
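The mixed-bin comparison reported in the abstract can be illustrated with a minimal sketch. It assumes each contig already carries a viral or host label (the reference list mentions VirSorter, but how labels were actually assigned is not stated here) and a bin assignment from some binning tool. The function names and the toy data are hypothetical and are not taken from the thesis, ViralCC, or MetaCC.

```python
from collections import Counter


def classify_bins(bin_of, kind_of):
    """Label each bin 'viral', 'host', or 'mixed' from the labels of its contigs.

    bin_of:  dict mapping contig id -> bin id (hypothetical input format)
    kind_of: dict mapping contig id -> 'viral' or 'host' (hypothetical labels)
    """
    kinds_per_bin = {}
    for contig, bin_id in bin_of.items():
        kinds_per_bin.setdefault(bin_id, set()).add(kind_of[contig])
    return {
        bin_id: (next(iter(kinds)) if len(kinds) == 1 else "mixed")
        for bin_id, kinds in kinds_per_bin.items()
    }


def mixed_fraction(bin_labels):
    """Fraction of bins that contain both viral and host contigs."""
    counts = Counter(bin_labels.values())
    return counts["mixed"] / len(bin_labels)


# Toy example (made-up contigs and bins, for illustration only).
bin_of = {"c1": "b1", "c2": "b1", "c3": "b2", "c4": "b3", "c5": "b3"}
kind_of = {"c1": "viral", "c2": "host", "c3": "viral", "c4": "host", "c5": "host"}
labels = classify_bins(bin_of, kind_of)

# Arithmetic check of the figure quoted in the abstract: 158 mixed bins out of 211.
reported_mixed_percent = round(100 * 158 / 211, 1)
```

With this definition, a method that produces no mixed bins has a mixed fraction of 0, which is the property the abstract reports for the proposed approach.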
Reference:	[1] Du, Y., Fuhrman, J.A. & Sun, F. (2023). ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C data. Nat Commun 14, 502.
[2] Du, Y., Sun, F. (2023). MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data. Nat Commun 14, 6231.
[3] Integra Biosciences, "Short Read vs. Long Read Sequencing," Integra Biosciences. [Online]. Available: https://www.integra-biosciences.com/global/en/blog/article/short-read-vs-long-read-sequencing.
[4] dyxstat, "MetaCC: Scalable and Integrative Analyses of MetaHi-C Data," GitHub repository, 2023. [Online]. Available: https://github.com/dyxstat/MetaCC
[5] Yoon SH, Ha SM, Lim J, Kwon S, Chun J. A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie Van Leeuwenhoek. 2017 Oct;110(10):1281-1286. doi: 10.1007/s10482-017-0844-4. Epub 2017 Feb 15. PMID: 28204908.
[6] Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 5233. https://doi.org/10.1038/s41598-019-41695-z
[7] Wikipedia contributors. (2024). Contig. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Contig
[8] Wikipedia contributors. (2024). Adapter. Wikipedia. Retrieved from https://zh.wikipedia.org/zh-tw/%E8%A1%94%E6%8E%A5%E5%AD%90
[9] EMBnet. (2014). The contig: a concept in genome assembly. EMBnet.journal, 20(1), 20. https://journal.embnet.org/index.php/embnetjournal/article/view/200
[10] Chklovski, A. (2023). CheckM2: An enhanced framework for assessing genome quality using machine learning. GitHub Repository. Retrieved from: https://github.com/chklovski/CheckM2
[11] Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. https://doi.org/10.1101/gr.186072.114
[12] ScienceDirect. (2024). Adjusted Rand Index. Retrieved from: https://www.sciencedirect.com/topics/computer-science/adjusted-rand-index
[13] Wikipedia contributors. (2024). Rand Index. Wikipedia. Retrieved from: https://en.wikipedia.org/wiki/Rand_index
[14] National Center for Biotechnology Information, "Sequence Read Archive," NCBI. [Online]. Available: https://www.ncbi.nlm.nih.gov/sra.
[15] Simroux, "VirSorter: mining viral signal from microbial genomic data," GitHub repository, https://github.com/simroux/VirSorter (accessed Jul. 13, 2025).
Description:	Master's thesis; National Chengchi University; Department of Computer Science; 112753103
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0112753103
Data Type:	thesis
Appears in Collections:	[Department of Computer Science] Theses

Files in This Item:

File	Size	Format
310301.pdf	1503Kb	Adobe PDF

All items in the NCCU Institutional Repository are protected by copyright, with all rights reserved.

Copyright Announcement

1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial uses such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.