政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/158582
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/158582


Title: Continuous Optimization and Multi-Metric Evaluation of Honesty Alignment in Large Language Models
Authors: Tsai, Pin-Yang
Contributors: Chen, Kung
Tsai, Pin-Yang
Keywords: Hallucination
Honesty
Iterative Fine-tuning
Refusal Strategy
Knowledge Boundary
    Date: 2025
    Issue Date: 2025-08-04 14:28:20 (UTC+8)
Abstract: This study addresses the critical problem of "hallucination" in large language models (LLMs): the tendency to generate plausible-sounding but factually incorrect information. Through a systematic iterative fine-tuning strategy, it enhances model "honesty," training the model to accurately recognize its own knowledge boundaries and to proactively refuse to answer questions about which it is uncertain. Building on the "Alignment for Honesty" framework, the study uses the GPT-4o-mini model in two sets of experiments: the first performs four rounds of fine-tuning on the original paper's dataset, while the second extends to three heterogeneous datasets and runs up to ten rounds of iterative fine-tuning, in order to verify the gains from data diversity and sustained training.
The results show that iterative fine-tuning significantly improves the model's refusal capability, and that long-term training in a multi-dataset setting exhibits a more robust and sustained learning trajectory. Although the model's Accuracy drops as it learns to refuse, the proposed "Correctness" metric (which counts both correct answers and correct refusals) remains consistently high, demonstrating that the model is not degrading but is strategically converting potentially wrong answers into reasonable refusals. The study also confirms that fine-tuning, which directly adjusts model parameters, is far more effective for behavioral alignment than prompt-based guidance alone.
Regarding the choice of training template, verbal confidence expressions (CONFIDENCE-VERB) yield more stable and controllable training outcomes than numerical ones. Overall, this study validates long-term, multi-dataset iterative fine-tuning as a viable path toward cultivating knowledge-boundary awareness in large language models and building more reliable LLMs. It provides a comprehensive evaluation framework and an empirical foundation for subsequent honesty-alignment research, as well as concrete practical guidance for developing and deploying more trustworthy AI systems.
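The Accuracy-versus-Correctness distinction in the abstract can be sketched in a few lines of code: Accuracy penalizes every refusal, while Correctness credits a refusal when the question lies beyond the model's knowledge boundary. This is only a minimal illustration of that definition, not the thesis's actual evaluation pipeline, and the record fields (`answer`, `gold`, `refused`, `known`) are hypothetical names.

```python
# Sketch of the Accuracy vs. Correctness metrics described in the abstract.
# Field names are illustrative assumptions, not taken from the thesis.

def accuracy(records):
    """Fraction answered correctly; every refusal counts as a failure."""
    hits = sum(1 for r in records
               if not r["refused"] and r["answer"] == r["gold"])
    return hits / len(records)

def correctness(records):
    """Correct answers plus correct refusals, i.e. refusing a question
    flagged as beyond the model's knowledge boundary."""
    hits = sum(
        1 for r in records
        if (not r["refused"] and r["answer"] == r["gold"])  # correct answer
        or (r["refused"] and not r["known"])                # correct refusal
    )
    return hits / len(records)

records = [
    {"answer": "Paris", "gold": "Paris", "refused": False, "known": True},   # correct answer
    {"answer": None,    "gold": "1887",  "refused": True,  "known": False},  # correct refusal
    {"answer": "1905",  "gold": "1887",  "refused": False, "known": True},   # wrong answer
    {"answer": None,    "gold": "Oslo",  "refused": True,  "known": True},   # over-refusal
]

print(accuracy(records))     # 0.25: refusals drag Accuracy down
print(correctness(records))  # 0.5: the correct refusal is credited
```

Under this definition, a model that converts wrong answers into refusals on unknown questions can lose Accuracy while holding Correctness steady, which is exactly the behavior the abstract reports.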
    Reference: Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., ... & Kaplan, J. (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... & Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) (pp. 4171-4186).
    Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1-55.
    Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421.
Joshi, M., Choi, E., Weld, D. S., & Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., ... & Petrov, S. (2019). Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453-466.
    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, 27730-27744.
    Rajpurkar, P., Jia, R., & Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
    Yang, Y., Chern, E., Qiu, X., Neubig, G., & Liu, P. (2024). Alignment for honesty. Advances in Neural Information Processing Systems, 37, 63565-63598.
Zhang, H., Diao, S., Lin, Y., Fung, Y. R., Lian, Q., Wang, X., ... & Zhang, T. (2023). R-Tuning: Teaching large language models to refuse unknown questions. arXiv preprint arXiv:2311.09677.
Description: Master's thesis
National Chengchi University
Department of Management Information Systems
112356044
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0112356044
    Data Type: thesis
Appears in Collections: [Department of MIS] Theses

    Files in This Item:

604401.pdf (1319 KB, Adobe PDF)


All items in the NCCU Institutional Repository are protected by copyright, with all rights reserved.

