Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/153377
|
Title: | GBactPro:基於機器學習方法對細菌啟動子進行跨物種預測 GBactPro: General Bacterial Promoter prediction across species using machine learning |
Authors: | 高語謙 Kao, Yu-Chien |
Contributors: | 張家銘 Chang, Jia-Ming 高語謙 Kao, Yu-Chien |
Keywords: | Bacterial promoters; Machine learning; Random forest; Long short-term memory (LSTM) |
Date: | 2024 |
Issue Date: | 2024-09-04 14:59:32 (UTC+8) |
Abstract: | Promoters are specific DNA segments upstream of the transcription start site (TSS) and play an essential role in regulating transcription. Although many promoter prediction tools exist, most focus on a small number of species, especially E. coli. We developed GBactPro by combining Promotech's cross-species prediction concept with the promoter scanning model developed by Professor Hsin-Hung David Chou of National Taiwan University. GBactPro uses the scanning model to generate promoter data, then trains a random forest model and deep learning models to identify the sequence features of each promoter region. Because the random forest model can draw on information from adjacent regions, it learns more sequence features and is more accurate than the scanning model, which only calculates sequence binding energy. The deep learning models use a 1D-CNN and an LSTM. Because the LSTM can learn long-distance features, these models can predict whether a long sequence contains a promoter without preprocessing the input with the scanning model. Our models achieve better cross-species prediction results than Promotech. In addition, when GBactPro is used for region-specific cross-species prediction, the results for the -10 and -35 regions are consistent with the high sequence conservation known for these regions. Finally, the models perform worse on species with distinctive sequence characteristics, such as Alphaproteobacteria, in which T occurs less frequently at position -7; this is consistent with the known biological sequence features. |
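To make the abstract's contrast concrete, the following is a minimal sketch of what "only calculating sequence binding energy" can look like: a fixed additive energy matrix is slid along a sequence and the lowest-energy window is reported. This is not the scanning model referenced in the thesis. The consensus `TATAAT`, the mismatch penalty, the example sequence, and all function names are assumptions chosen for illustration.

```python
# Illustrative only: a generic additive energy-matrix scan, NOT the thesis's scanning model.
# The consensus, the penalty value, and the example sequence are invented for this sketch.
CONSENSUS = "TATAAT"      # assumed -10-like hexamer
MISMATCH_PENALTY = 1.0    # assumed, arbitrary energy units


def build_energy_matrix(consensus, penalty):
    """One dict per position: 0.0 for the consensus base, `penalty` for any other base."""
    return [{base: (0.0 if base == c else penalty) for base in "ACGT"} for c in consensus]


def scan(sequence, matrix):
    """Slide the matrix along the sequence and return (start, energy) for every window."""
    width = len(matrix)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        energy = sum(matrix[i][base] for i, base in enumerate(window))
        results.append((start, energy))
    return results


def best_site(sequence, matrix):
    """Return the (start, energy) pair with the lowest energy (the first one on ties)."""
    return min(scan(sequence, matrix), key=lambda pair: pair[1])


MATRIX = build_energy_matrix(CONSENSUS, MISMATCH_PENALTY)
EXAMPLE = "GCGCTATAATGCGC"
print(best_site(EXAMPLE, MATRIX))  # (4, 0.0): the exact consensus starts at index 4
```

A scan of this kind scores each window independently from a fixed table. The learned models described in the abstract differ in that they can also use context from adjacent regions or from longer stretches of sequence.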
Reference: | 1. Crick, F. H. (1958, January). On protein synthesis. In Symp Soc Exp Biol (Vol. 12, pp. 138-163).
2. Pribnow, D. (1975). Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proceedings of the National Academy of Sciences, 72(3), 784-788.
3. Myers, K. S., Noguera, D. R., & Donohue, T. J. (2021). Promoter architecture differences among alphaproteobacteria and other bacterial taxa. mSystems, 6(4), 10-1128.
4. Bhandari, N., Khare, S., Walambe, R., & Kotecha, K. (2021). Comparison of machine learning and deep learning techniques in promoter prediction across diverse species. PeerJ Computer Science, 7, e365.
5. Oubounyt, M., Louadi, Z., Tayara, H., & Chong, K. T. (2019). DeePromoter: robust promoter predictor using deep learning. Frontiers in Genetics, 10, 286.
6. Zhang, M., Jia, C., Li, F., Li, C., Zhu, Y., Akutsu, T., ... & Song, J. (2022). Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Briefings in Bioinformatics, 23(2), bbab551.
7. Chevez-Guardado, R., & Peña-Castillo, L. (2021). Promotech: a general tool for bacterial promoter recognition. Genome Biology, 22, 1-16.
8. Ho, T. K. (1995, August). Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition (Vol. 1, pp. 278-282). IEEE.
9. Dey, R., & Salem, F. M. (2017, August). Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS) (pp. 1597-1600). IEEE.
10. Medsker, L. R., & Jain, L. (2001). Recurrent neural networks. Design and Applications, 5(64-67), 2.
11. Zhang, M., Li, F., Marquez-Lago, T. T., Leier, A., Fan, C., Kwoh, C. K., ... & Jia, C. (2019). MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics, 35(17), 2957-2965.
12. Rahman, M. S., Aktar, U., Jani, M. R., & Shatabda, S. (2019). iPro70-FMWin: identifying Sigma70 promoters using multiple windowing and minimal features. Molecular Genetics and Genomics, 294(1), 69-84.
13. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
14. Kari, H., Bandi, S. M. S., Kumar, A., & Yella, V. R. (2022). Deepromclass: Delineator for eukaryotic core promoters employing deep neural networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(1), 802-807.
15. Martinez, G. S., Perez-Rueda, E., Kumar, A., Dutt, M., Maya, C. R., Ledesma-Dominguez, L., ... & Kelvin, D. J. (2024). CDBProm: the Comprehensive Directory of Bacterial Promoters. NAR Genomics and Bioinformatics, 6(1), lqae018.
16. Kuo, Syue-Ting (2023). High-Throughput Approaches Quantitatively Elucidate the Design Principles of Bacterial Regulatory Elements. Doctoral dissertation, Department of Life Science, National Taiwan University.
17. scanning model, May 2024, https://github.com/vickykao17/GBactPro/tree/main/scanning_model
18. Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841-842.
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
20. Coleman, G. A., Davín, A. A., Mahendrarajah, T. A., Szánthó, L. L., Spang, A., Hugenholtz, P., ... & Williams, T. A. (2021). A rooted phylogeny resolves early bacterial evolution. Science, 372(6542), eabe0511. |
Description: | Master's thesis; National Chengchi University; Department of Computer Science; 111753130 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0111753130 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
|
Files in This Item:
File | Description | Size | Format |
313001.pdf | | 2445Kb | Adobe PDF |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|