    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/153377


    Title: GBactPro:基於機器學習方法對細菌啟動子進行跨物種預測
    GBactPro: General Bacterial Promoter prediction across species using machine learning
    Authors: 高語謙
    Kao, Yu-Chien
    Contributors: 張家銘
    Chang, Jia-Ming
    高語謙
    Kao, Yu-Chien
    Keywords: Bacterial promoters
    Machine learning
    Random forest
    Long short-term memory (LSTM)
    Date: 2024
    Issue Date: 2024-09-04 14:59:32 (UTC+8)
    Abstract: Promoters are specific DNA segments upstream of the transcription start site (TSS) that play an essential role in regulating transcription. Although many promoter prediction tools exist, most focus on a small number of species, especially E. coli. We developed GBactPro by combining Promotech's cross-species prediction concept with the promoter scanning model developed by Professor Hsin-Hung David Chou of National Taiwan University. GBactPro uses the scanning model to generate promoter data, then trains random forest and deep learning models to identify the sequence features of each promoter region. By using information from adjacent regions, the random forest model learns more sequence features and is more accurate than the scanning model, which only calculates sequence binding energy. The deep learning models use a 1D-CNN and an LSTM; because the LSTM can learn long-distance features, they can predict whether a long sequence contains a promoter without preprocessing by the scanning model. Our model achieves better cross-species prediction results than Promotech. In addition, region-specific cross-species predictions with GBactPro give results in the -10 and -35 regions that are consistent with the high sequence conservation known from biology. Finally, the model performs worse on species with distinctive sequence characteristics, such as Alphaproteobacteria, in which T occurs less frequently at position -7; this is consistent with the known biological sequence features.
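    The abstract describes the scanning model only as computing sequence binding energy. As a rough illustration of that general idea, and not the thesis's actual model, the sketch below slides a fixed-width window along a sequence and sums per-position energies from a matrix. The matrix values, the TATAAT consensus used to build it, and the names `ENERGY`, `window_energy`, and `scan` are all assumptions made for this example.

```python
# Minimal sketch of an energy-matrix scan (illustrative assumption only;
# this is NOT the scanning model referenced in the abstract).

# Hypothetical 6-bp element modelled on the classic -10 consensus TATAAT.
CONSENSUS = "TATAAT"
MISMATCH_PENALTY = 1.0  # arbitrary illustrative energy unit

# One dict per position: 0.0 for the consensus base, a penalty otherwise.
ENERGY = [
    {base: (0.0 if base == consensus_base else MISMATCH_PENALTY) for base in "ACGT"}
    for consensus_base in CONSENSUS
]


def window_energy(window):
    """Sum the per-position energies of one window (lower means a better match)."""
    # Bases outside A/C/G/T (for example N) are charged the mismatch penalty.
    return sum(
        column.get(base, MISMATCH_PENALTY)
        for column, base in zip(ENERGY, window.upper())
    )


def scan(sequence):
    """Return (start, energy) of the lowest-energy window; ties go to the leftmost."""
    width = len(ENERGY)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the energy matrix")
    scored = [
        (window_energy(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    energy, start = min(scored)
    return start, energy


if __name__ == "__main__":
    print(scan("GGCCTATAATGGCC"))
```

    A scan like this scores each window independently, which is the limitation the abstract points to when it says the random forest model gains accuracy from adjacent-region information.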
    Reference: 1. Crick, F. H. (1958). On protein synthesis. Symposia of the Society for Experimental Biology, 12, 138-163.
    2. Pribnow, D. (1975). Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proceedings of the National Academy of Sciences, 72(3), 784-788.
    3. Myers, K. S., Noguera, D. R., & Donohue, T. J. (2021). Promoter architecture differences among alphaproteobacteria and other bacterial taxa. MSystems, 6(4), 10-1128.
    4. Bhandari, N., Khare, S., Walambe, R., & Kotecha, K. (2021). Comparison of machine learning and deep learning techniques in promoter prediction across diverse species. PeerJ Computer Science, 7, e365.
    5. Oubounyt, M., Louadi, Z., Tayara, H., & Chong, K. T. (2019). DeePromoter: robust promoter predictor using deep learning. Frontiers in genetics, 10, 286.
    6. Zhang, M., Jia, C., Li, F., Li, C., Zhu, Y., Akutsu, T., ... & Song, J. (2022). Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Briefings in Bioinformatics, 23(2), bbab551.
    7. Chevez-Guardado, R., & Peña-Castillo, L. (2021). Promotech: a general tool for bacterial promoter recognition. Genome Biology, 22, 1-16.
    8. Ho, T. K. (1995, August). Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition (Vol. 1, pp. 278-282). IEEE.
    9. Dey, R., & Salem, F. M. (2017, August). Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS) (pp. 1597-1600). IEEE.
    10. Medsker, L. R., & Jain, L. (2001). Recurrent neural networks. Design and Applications, 5(64-67), 2.
    11. Zhang, M., Li, F., Marquez-Lago, T. T., Leier, A., Fan, C., Kwoh, C. K., ... & Jia, C. (2019). MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics, 35(17), 2957-2965.
    12. Rahman, M. S., Aktar, U., Jani, M. R., & Shatabda, S. (2019). iPro70-FMWin: identifying Sigma70 promoters using multiple windowing and minimal features. Molecular Genetics and Genomics, 294(1), 69-84.
    13. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
    14. Kari, H., Bandi, S. M. S., Kumar, A., & Yella, V. R. (2022). Deepromclass: Delineator for eukaryotic core promoters employing deep neural networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(1), 802-807.
    15. Martinez, G. S., Perez-Rueda, E., Kumar, A., Dutt, M., Maya, C. R., Ledesma-Dominguez, L., ... & Kelvin, D. J. (2024). CDBProm: the Comprehensive Directory of Bacterial Promoters. NAR Genomics and Bioinformatics, 6(1), lqae018.
    16. Kuo, Syue-Ting (2023) High-Throughput Approaches Quantitatively Elucidate the Design Principles of Bacterial Regulatory Elements, National Taiwan University, Department of Life Science, Doctoral Dissertation
    17. scanning model, May 2024, https://github.com/vickykao17/GBactPro/tree/main/scanning_model
    18. Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841-842.
    19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. the Journal of machine Learning research, 12, 2825-2830.
    20. Coleman, G. A., Davín, A. A., Mahendrarajah, T. A., Szánthó, L. L., Spang, A., Hugenholtz, P., ... & Williams, T. A. (2021). A rooted phylogeny resolves early bacterial evolution. Science, 372(6542), eabe0511.
    Description: Master's thesis
    National Chengchi University
    Department of Computer Science
    111753130
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0111753130
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File: 313001.pdf | Size: 2445Kb | Format: Adobe PDF

