政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/155990
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/155990


    Title: 結合大型語言模型之代理用於 Android App 錯誤重現任務
    Combining Large Language Models for Agent Tasks in Android App Bug Reproduction
    Authors: 黃毓學
    Contributors: 蔡子傑
    黃毓學
    Keywords: 自動化錯誤重現
    軟體測試除錯
    大型語言模型
    提示工程
    Android App
    Automated Bug Reproduction
    Software testing and debugging
    Large Language Models
    Prompt Engineering
    Date: 2025
    Issue Date: 2025-03-03 14:28:52 (UTC+8)
    Abstract: 代理(Agent)任務與大型語言模型(Large Language Models, LLM)兩者的研究領域持續互相影響著:代理任務為 LLM 擴展了更多數據的類別,而 LLM 為代理研究解決了以往透過強化學習、監督學習做不到的問題,兩者結合蔚為趨勢。本文即是探討使用 LLM 作為代理,試圖解決 Android App 錯誤描述重現任務中,當遺失的錯誤步驟過多、無法以強化學習方式順利重現錯誤的問題。透過任務的轉換,將強化學習獎勵設計的困難,轉為如何輸入適當的提示詞給 LLM,包括使用日誌解析工具來降低長上下文對 LLM 生成文字準確性的影響。

    借鏡強化學習訓練的思維並高度結合 LLM,為降低代理在龐大狀態空間中低效率探索的問題,本文使用子目標區域(subgoal regions)的概念,透過 LLM 找出只與目標句有高度關聯的區域進行搜索,進而降低要搜尋比對的數量。同時將問題拆解成可以用 LLM 作為代理運行的子任務,規劃的流程為:子目標區域、制定靜態計畫、動態調整、動態探索,並應用 LLM 的規劃(planning)、推理(reasoning)與提取代換文字的能力。本文的貢獻在於:當錯誤描述大量遺漏時,如何以提示工程將 LLM 結合至錯誤重現任務。

    從流程的各項子任務評估驗證 LLM 的規劃及推理能力。評估結果:在子目標區域(subgoal regions)子任務中,本文使用 GPT-4 達到 Top-1 Accuracy 57%、Top-2 Accuracy 100%,可映射到正確的目標區域;在靜態計畫子任務中,LLM 的表現為 Top-1 Accuracy 42%、Top-2 Accuracy 71%、Top-3 Accuracy 100%。同時,為了減少長上下文可能導致 LLM 不正確生成的影響,使用事件日誌參數提取工具 Spell 演算法,使 LLM 在提取特定字串的子任務中達到 90% 的準確率。

    但在將提取後的相關文字進行代換,以及動態生成建議行動這兩項子任務中,LLM 都呈現偽陽性(false positive)偏高的狀況。錯誤重現任務並不容許這種情況發生,因為這可能導致後續重現錯誤的基礎與使用者描述不一致;此結果顯示 LLM 代理用於錯誤重現任務的自動化仍有提升的空間。

    未來研究方向為使用具推理能力的語言模型,或是使用 OpenAI 近期提出的強化微調(Reinforcement Fine-Tuning)方式,透過訓練調整 LLM 輸出的順序,使 LLM 代理能在特定任務中發揮更準確的表現,使錯誤重現任務達到自動化的目標。
    The research fields of agent tasks and large language models (LLMs) continue to influence each other: agent tasks expand the kinds of data available to LLMs, while LLMs solve problems that reinforcement learning and supervised learning could not address in the past, and combining the two has become a trend. This paper explores using LLMs as agents to solve the problem of reproducing bug descriptions in Android apps when too many reproduction steps are missing, which makes the bug difficult to reproduce with reinforcement learning. By transforming the task, the difficulty of reward design in reinforcement learning is shifted to that of supplying appropriate prompts to the LLM, including using log parsing tools to reduce the impact of long contexts on the accuracy of the LLM's generated text.

    Drawing on reinforcement learning training concepts and closely integrating LLMs, the paper aims to reduce inefficient agent exploration of a large state space. The concept of subgoal regions is employed: the LLM identifies the regions highly related to the target sentence, thereby reducing the number of regions to search and compare. The task is broken down into subtasks that can be executed by LLMs acting as agents, with a workflow of subgoal regions, static planning, dynamic adjustment, and dynamic exploration, applying the LLM's capabilities for planning, reasoning, and extracting and substituting text. The contribution of this paper is showing how LLMs can be integrated into bug reproduction through prompt engineering, particularly when a significant portion of the description is missing.
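The subgoal-region filtering step described above can be illustrated with a minimal sketch. In the thesis an LLM judges which regions relate to the target sentence; here a deterministic keyword-overlap score stands in for that judgment so the sketch runs offline, and the region names and descriptions are hypothetical examples.

```python
def rank_subgoal_regions(target: str, regions: dict[str, str], k: int = 2) -> list[str]:
    """Return the k candidate UI regions most related to the target sentence.

    Stand-in scoring: count shared lowercase tokens between the target
    sentence and each region's textual summary (a real system would ask
    the LLM for this relevance judgment).
    """
    target_tokens = set(target.lower().split())
    scored = sorted(
        regions.items(),
        key=lambda item: len(target_tokens & set(item[1].lower().split())),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical regions summarized from a UI model of the app under test.
regions = {
    "settings": "theme language notification backup sync",
    "deck_list": "deck add study cards browse",
    "note_editor": "note field tags save preview",
}
top = rank_subgoal_regions("tap the add deck button", regions)
print(top)  # "deck_list" ranks first; search is then restricted to it
```

Restricting the subsequent search to the top-ranked regions is what reduces the number of screens to search and compare.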

    The paper evaluates LLM planning and reasoning capabilities across various subtasks in the workflow. The evaluation results show that for the subgoal regions subtask, using GPT-4 achieved Top-1 Accuracy: 57% and Top-2 Accuracy: 100%, mapping to the correct target regions. In the static planning subtask, LLM performance achieved Top-1 Accuracy: 42%, Top-2 Accuracy: 71%, and Top-3 Accuracy: 100%. To reduce the impact of long contexts, which can lead to inaccurate generation, the Spell algorithm, an event log parameter extraction tool, was used in specific string extraction subtasks, leading to 90% accuracy for the LLM.
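The role of Spell-style log parsing can be sketched as follows. Spell itself streams log entries through an LCS-based template store; this simplified, non-streaming sketch shows only the core idea of splitting two log messages into a shared template and per-message parameters, which is what lets a short extracted string be handed to the LLM instead of the full long context. The log lines are invented examples.

```python
def lcs_tokens(a: list[str], b: list[str]) -> list[str]:
    """Longest common subsequence of two token lists (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    i, j, common = m, n, []
    while i and j:
        if a[i - 1] == b[j - 1]:
            common.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return common[::-1]

def to_template(msg: list[str], common: list[str]) -> list[str]:
    """Keep tokens in the LCS; mark everything else as a parameter slot."""
    out, k = [], 0
    for tok in msg:
        if k < len(common) and tok == common[k]:
            out.append(tok); k += 1
        else:
            out.append("<*>")
    return out

def extract_parameters(msg: list[str], template: list[str]) -> list[str]:
    """The tokens aligned with <*> slots are the message's parameters."""
    return [tok for tok, slot in zip(msg, template) if slot == "<*>"]

a = "Deleting block blk_123 from host 10.0.0.1".split()
b = "Deleting block blk_456 from host 10.0.0.2".split()
tmpl = to_template(a, lcs_tokens(a, b))
print(" ".join(tmpl))                # Deleting block <*> from host <*>
print(extract_parameters(a, tmpl))   # ['blk_123', '10.0.0.1']
```

Feeding the LLM the extracted parameters rather than the raw event log is what mitigates the long-context accuracy problem noted above.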

    However, in two subtasks, substituting the extracted text and dynamically generating suggested actions, the LLM showed high false positive rates. This is unacceptable in the bug reproduction task, as it may cause the basis for subsequent reproduction to diverge from the user's description. This outcome indicates that there is still room for improvement in using LLMs as agents for automating bug reproduction.
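The false-positive failure mode above can be made concrete with a small guard: before executing an LLM-suggested action, check it against the set of actions actually available in the current UI state, and measure how many suggestions fail that check. The action strings here are hypothetical; matching real UI events would require the app's UI model.

```python
def filter_suggestions(suggested: list[str], available: set[str]) -> tuple[list[str], float]:
    """Split LLM-suggested actions into executable ones and measure the
    false-positive rate: the fraction of suggestions with no matching
    action in the current UI state."""
    valid = [s for s in suggested if s in available]
    fp_rate = 1 - len(valid) / len(suggested) if suggested else 0.0
    return valid, fp_rate

# Hypothetical suggestions from the dynamic-generation subtask.
suggested = ["tap Save", "tap Export", "scroll down"]
available = {"tap Save", "scroll down", "tap Cancel"}
valid, fp_rate = filter_suggestions(suggested, available)
# "tap Export" does not exist on the current screen: one of three
# suggestions is a false positive.
```

Discarding non-executable suggestions keeps the reproduction trace consistent with what the app can actually do, though it does not by itself fix suggestions that are executable but diverge from the user's description.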

    Future research directions include using models with a reasoning approach or OpenAI’s recently proposed reinforcement learning fine-tuning method to adjust the order of LLM outputs through training. This will enable LLM agents to perform more accurately in specific tasks, ultimately achieving the goal of automating error reproduction tasks.
    Description: 碩士 (Master's thesis)
    國立政治大學 (National Chengchi University)
    資訊科學系碩士在職專班 (Executive Master Program of Computer Science)
    110971022
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0110971022
    Data Type: thesis
    Appears in Collections:[Executive Master Program of Computer Science of NCCU] Theses

    Files in This Item:

    File: 102201.pdf (3215 KB, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.

