    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/158581


    Title: 結合規則式評分與分群方法之大型語言模型語意風險與合規性評估
    Semantic risk and compliance evaluation on LLM responses using rule-based scoring and clustering
    Authors: 陳卉縈
    Chen, Hui-Ying
    Contributors: 郁方
    Yu, Fang
    陳卉縈
    Chen, Hui-Ying
    Keywords: Large Language Models
    PyRIT
    GHSOM
    Ethical compliance
    Safety evaluation
    Adversarial prompts
    Jailbreaking
    Date: 2025
    Issue Date: 2025-08-04 14:28:07 (UTC+8)
    Abstract: Large Language Models (LLMs) have advanced natural language processing (NLP) applications but remain vulnerable to ethical misalignment and adversarial prompts. This study proposes a dual-layer evaluation framework that integrates rule-based scoring using the Python Risk Identification Tool (PyRIT) with clustering via the Growing Hierarchical Self-Organizing Map (GHSOM). LLM outputs are categorized into Vulgar, Blunt, Deceptive, and Eloquent behaviors based on compliance and semantic risks. The framework also enables cluster-level feature identification and false-positive detection. Evaluating 2,925 responses across 10 scenarios and 12 jailbreak scripts, Gemini generated the highest number of Vulgar outputs (119), followed by Perplexity (70) and DeepSeek (59), while Claude and ChatGPT were more ethically aligned. Re-testing 170 high-risk prompts on API-based versus quantized local models revealed that API models remain susceptible to adversarial inputs, whereas quantized models exhibited lower attack success rates, likely due to reduced comprehension rather than stronger alignment safeguards. These findings underscore the value of layered evaluation frameworks for improving the safety and interpretability of LLMs.
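The two-layer idea in the abstract can be caricatured in a few lines: a rule-based layer assigns each response a compliance-risk and a tone-risk score, and the two axes jointly determine one of the four behavior types. The keyword lists, the threshold, and the exact mapping of the axes onto the four categories below are illustrative assumptions for the sketch, not the thesis's actual PyRIT rules or GHSOM-derived clusters.

```python
def rule_based_scores(response: str) -> tuple[float, float]:
    """Toy rule-based layer: return (compliance_risk, tone_risk) in [0, 1].

    The marker lists are hypothetical stand-ins for real scoring rules.
    """
    compliance_markers = ["bypass", "exploit", "illegal"]   # hypothetical
    tone_markers = ["idiot", "shut up", "stupid"]           # hypothetical
    text = response.lower()
    c = sum(m in text for m in compliance_markers) / len(compliance_markers)
    t = sum(m in text for m in tone_markers) / len(tone_markers)
    return c, t


def categorize(compliance_risk: float, tone_risk: float,
               threshold: float = 0.3) -> str:
    """Map the two risk axes onto the four behavior types (assumed mapping)."""
    if compliance_risk >= threshold and tone_risk >= threshold:
        return "Vulgar"      # clear violation: risky content, offensive tone
    if tone_risk >= threshold:
        return "Blunt"       # compliant content delivered offensively
    if compliance_risk >= threshold:
        return "Deceptive"   # polite surface over risky content
    return "Eloquent"        # compliant content, compliant tone
```

In the thesis itself the second layer is semantic clustering (GHSOM over response embeddings), which replaces the fixed threshold here with cluster-level risk features and catches rule-based false positives; this sketch only shows how two risk axes induce the four-way categorization.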
    Reference: DeepSeek AI. (2024a). DeepSeek-R1-Distill-Llama-8B [Accessed: 2025-05].
    Meta AI. (2024b). Meta-Llama-3.1-8B-Instruct [Accessed: 2025-05].
    Anthropic. (2023). Claude [Model version: Claude 3.5 Haiku]. https://www.anthropic.com/claude
    DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., et al. (2024). Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. https://arxiv.org/abs/2405.04434
    Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., & Liu, Y. (2024). Masterkey: Automated jailbreaking of large language model chatbots. Proceedings 2024 Network and Distributed System Security Symposium. https://doi.org/10.14722/ndss.2024.24188
    Dittenbach, M., Merkl, D., & Rauber, A. (2001). Hierarchical clustering of document archives with the growing hierarchical self-organizing map. Proceedings of the International Conference on Artificial Neural Networks (ICANN), 486–491. https://doi.org/10.1007/3-540-44668-0_70
    Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the association for computational linguistics: Emnlp 2020 (pp. 3356–3369). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301
    Google. (2024). Gemini [Model version: Gemini 2.0 Flash-Lite]. https://gemini.google.com/app
    Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., & Xiong, D. (2023). Evaluating large language models: A comprehensive survey. https://arxiv.org/abs/2310.19736
    Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2021). Aligning AI with shared human values. International Conference on Learning Representations. https://openreview.net/forum?id=dNy_RKzJacY
    Huang, Y., Zhang, Q., Yu, P. S., & Sun, L. (2023). TrustGPT: A benchmark for trustworthy and responsible large language models. https://arxiv.org/abs/2306.11507
    Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464– 1480. https://doi.org/10.1109/5.58325
    Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., & Petrov, S. (2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453–466. https://doi.org/10.1162/tacl_a_00276
    Lees, A., Tran, V. Q., Tay, Y., Sorensen, J., Gupta, J., Metzler, D., & Vasserman, L. (2022). A new generation of perspective api: Efficient multilingual character-level transformers. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3197–3207. https://doi.org/10.1145/3534678.3539147
    Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., & Liu, Y. (2024). Jailbreaking ChatGPT via prompt engineering: An empirical study.
    Munoz, G. D. L., Minnich, A. J., Lutz, R., Lundeen, R., Dheekonda, R. S. R., Chikanov, N., Jagdagdorj, B.-E., Pouliot, M., Chawla, S., Maxwell, W., Bullwinkel, B., Pratt, K., de Gruyter, J., Siska, C., Bryan, P., Westerhoff, T., Kawaguchi, C., Seifert, C., Kumar, R. S. S., & Zunger, Y. (2024). PyRIT: A framework for security risk identification and red teaming in generative AI system. https://arxiv.org/abs/2410.02828
    Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). Crows-pairs: A challenge dataset for measuring social biases in masked language models. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 1953–1967). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154
    OpenAI. (2023). Chatgpt [Model version: GPT-4o mini]. https://openai.com/chatgpt
    Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large language model connected with massive apis. https://arxiv.org/abs/2305.15334
    Perplexity. (2023). Perplexity AI [Model version: Sonar]. https://www.perplexity.ai
    Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. https://arxiv.org/abs/1606.05250
    Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. https://arxiv.org/abs/1908.10084
    Rudinger, R., Naradowsky, J., Leonard, B., & Durme, B. V. (2018). Gender bias in coreference resolution. https://arxiv.org/abs/1804.09301
    Su, J., Kempe, J., & Ullrich, K. (2024). Mission impossible: A statistical perspective on jailbreaking llms. https://arxiv.org/abs/2408.01420
    Talmor, A., Herzig, J., Lourie, N., & Berant, J. (2019). CommonsenseQA: A question answering challenge targeting commonsense knowledge. https://arxiv.org/abs/1811.00937
    Tang, H., Li, H., Liu, J., Hong, Y., Wu, H., & Wang, H. (2021). DuReader_robust: A Chinese dataset towards evaluating robustness and generalization of machine reading comprehension in real-world applications. https://arxiv.org/abs/2004.11142
    Wen, S.-J., Chang, J.-M., & Yu, F. (2024). scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM. https://arxiv.org/abs/2407.16984
    Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). Hotpotqa: A dataset for diverse, explainable multi-hop question answering. https://arxiv.org/abs/1809.09600
    Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 2 (short papers) (pp. 15–20). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2003
    Zhao, Y., Zhao, C., Nan, L., Qi, Z., Zhang, W., Tang, X., Mi, B., & Radev, D. (2023). RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 6064–6081). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.334
    Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Zhang, Y., Gong, N. Z., & Xie, X. (2024). Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. https://arxiv.org/abs/2306.04528
    Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. https://arxiv.org/abs/2307.15043
    Description: Master's thesis
    National Chengchi University
    Department of Management Information Systems
    112356043
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0112356043
    Data Type: thesis
    Appears in Collections: [Department of Management Information Systems] Theses

    Files in This Item:

    604301.pdf (2,174 KB, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.

