    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/113294


    Title: Applying Data Science to High-throughput Protein Function Prediction (基於資料科學方法之巨量蛋白質功能預測)
    Authors: 劉義瑋
    Liu, Yi-Wei
    Contributors: 廖文宏
    Liao, Wen-Hung
    劉義瑋
    Liu, Yi-Wei
    Keywords: Protein function prediction
    Machine learning
    Date: 2017
    Issue Date: 2017-10-02 10:16:27 (UTC+8)
    Abstract: Biological data have grown explosively since the completion of the Human Genome Project and the advent of next-generation sequencing, and protein sequences are among the gene products being discovered in large numbers. Determining and annotating protein function through wet-lab experiments is extremely time-consuming, so many proteins have known sequences but unknown functions. Predicting likely functions computationally before running experiments helps biologists formulate hypotheses and prioritize experiments, which speeds up protein function annotation. Gene Ontology (GO) is a widely used framework for describing the functions and properties of gene products. It has three domains: Biological Process, Cellular Component, and Molecular Function. Each domain is a hierarchical tree composed of labels known as GO terms. Protein function prediction takes a protein sequence and predicts its GO terms, so it can be treated as a multi-label classification problem. We propose a machine learning framework that predicts protein function from sequence homology and can also incorporate protein family information, and we design several voting methods to resolve the multi-label prediction problem.
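The abstract does not spell out the voting methods, so the following is only a minimal sketch of one plausible scheme: similarity-weighted voting over homologous proteins. The function name `vote_go_terms`, the protein IDs, the similarity scores, and the 0.5 threshold are all assumptions made for illustration, not the thesis's actual method or data.

```python
from collections import defaultdict


def vote_go_terms(hits, annotations, threshold=0.5):
    """Score GO terms by similarity-weighted voting over homologous hits.

    hits: list of (protein_id, similarity) pairs, e.g. from a homology search.
    annotations: dict mapping protein_id to a set of GO term IDs.
    threshold: minimum normalized vote share for a term to be predicted
               (hypothetical default, not taken from the thesis).
    """
    total = sum(similarity for _, similarity in hits)
    if total == 0:
        return {}

    scores = defaultdict(float)
    for protein_id, similarity in hits:
        # Each hit votes for all of its GO terms, weighted by its similarity.
        for term in annotations.get(protein_id, ()):
            scores[term] += similarity / total

    # Multi-label output: keep every term whose vote share passes the threshold.
    return {term: score for term, score in scores.items() if score >= threshold}


# Hypothetical hits and annotations for one query protein.
example_hits = [("P1", 0.9), ("P2", 0.6), ("P3", 0.5)]
example_annotations = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0003677"},
    "P3": {"GO:0005737"},
}
predicted = vote_go_terms(example_hits, example_annotations)
print(sorted(predicted))
```

Because each term is scored independently, a query protein can receive zero, one, or many GO terms, which is what makes the task multi-label rather than single-label classification.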
    Description: Master's
    National Chengchi University
    Department of Computer Science
    104753013
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0104753013
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses and Dissertations

    Files in This Item:

    File: 301301.pdf
    Size: 4293Kb
    Format: Adobe PDF



