政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/131632

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/131632

Title:	BigBigTree: a divide and concatenate strategy for the phylogenetic reconstruction of large orthologous datasets using Nextflow framework (original title: BigBigTree: 基於Nextflow框架利用分群串接法建立巨量同源基因演化樹)
Authors:	蔡漢龍 (Tsai, Han-Lung)
Contributors:	張家銘 (Chang, Jia-Ming); 蔡漢龍 (Tsai, Han-Lung)
Keywords:	gene tree; phylogenetic tree; Nextflow; divide and concatenate
Date:	2020
Issue Date:	2020-09-02 12:15:36 (UTC+8)
Abstract:	A phylogenetic tree is a branching diagram that classifies organisms according to similarities in morphology, structure, physiology, ecology, genetics, and gene sequence; it shows the evolutionary history inferred among sequences. With next-generation and third-generation sequencing, ever more sequences have become available, and this volume of data is challenging even for the fastest methods. For some important multigene families, such as the olfactory receptors, it has become impossible to build a phylogenetic tree with the most accurate methods, such as Maximum Likelihood (ML). Here we show how a simple divide and concatenate strategy, BigBigTree, can be applied to this problem by breaking it down into smaller problems that are solved independently. Our approach relies on the ability to identify clusters of orthologous genes within a large dataset. A phylogenetic tree is built for each cluster of orthologous genes using a standard method. The upper level of the tree (the super-tree) is resolved in a second stage: one protein per species is chosen from each subtree, and all chosen proteins from the same species are aligned together. The alignment used to build the super-tree results from concatenating all these alignments, so that within-species paralogues appear in the same columns and orthologues appear in the same row. The advantage is that the number of sequences to classify is reduced without losing information, because all sequences are represented in the final alignment. This approach can deal efficiently with lineage-specific duplications, but not with lateral transfers. It is better suited to the analysis of large eukaryotic families such as the kinases or the olfactory receptors. We evaluated BigBigTree on simulated and real data sets against RAxML v8.2.12, RAxML-ng, and IQ-TREE2. BigBigTree is faster than RAxML and RAxML-ng in most cases. In terms of topological accuracy, BigBigTree performs better than the other methods on simulated data and achieves comparable accuracy on real data. The source code and Docker container of the method are available at https://github.com/jmchanglabtw/bigbigtree and https://hub.docker.com/r/changlabtw/bigbigtree, where the latter allows users one-click installation.
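The concatenation step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the BigBigTree source code: the function name `concatenate_species_alignments`, the dictionary layout, and the gap padding used when a cluster has no representative in a species are all assumptions made for illustration. The sketch takes one alignment per species (rows are the proteins chosen from each orthologous cluster) and joins the species blocks side by side, giving one row per cluster.

```python
from typing import Dict

# Hypothetical layout (an assumption, not the thesis's data model):
# species name -> {cluster id -> aligned sequence of the protein chosen
# from that cluster for this species}.
SpeciesAlignments = Dict[str, Dict[str, str]]


def concatenate_species_alignments(per_species: SpeciesAlignments,
                                   gap: str = "-") -> Dict[str, str]:
    """Join per-species alignments side by side, one output row per cluster.

    Paralogues of one species share the same column block, and the
    orthologues belonging to one cluster end up in the same row.
    """
    cluster_ids = sorted({cid for aln in per_species.values() for cid in aln})
    rows = {cid: [] for cid in cluster_ids}

    # Fixed species order so that every row uses the same column layout.
    for species in sorted(per_species):
        alignment = per_species[species]
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"alignment for {species!r} is empty or ragged")
        width = widths.pop()
        for cid in cluster_ids:
            # Assumption: a cluster missing from this species is padded with gaps.
            rows[cid].append(alignment.get(cid, gap * width))

    return {cid: "".join(blocks) for cid, blocks in rows.items()}


if __name__ == "__main__":
    toy = {
        "human": {"c1": "MK-L", "c2": "MKAL"},
        "mouse": {"c1": "MR", "c2": "-R"},
    }
    for cluster_id, row in concatenate_species_alignments(toy).items():
        print(cluster_id, row)
```

Each returned row can then serve as one taxon of the super-tree alignment, so the tree builder sees one sequence per cluster instead of every protein in the family.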
Description:	Master's thesis; National Chengchi University; Department of Computer Science; 107753006
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0107753006
Data Type:	thesis
DOI:	10.6814/NCCU202001688
Appears in Collections:	[Department of Computer Science] Theses

Files in This Item:

File	Description	Size	Format
300601.pdf		3407Kb	Adobe PDF

All items in 政大典藏 are protected by copyright, with all rights reserved.
