    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/158581


    Title: 結合規則式評分與分群方法之大型語言模型語意風險與合規性評估
    Semantic risk and compliance evaluation on LLM responses using rule-based scoring and clustering
    Authors: 陳卉縈
    Chen, Hui-Ying
    Contributors: 郁方
    Yu, Fang
    陳卉縈
    Chen, Hui-Ying
    Keywords: Large Language Models
    PyRIT
    GHSOM
    Ethical compliance
    Safety evaluation
    Adversarial prompts
    Jailbreaking
    Date: 2025
    Issue Date: 2025-08-04 14:28:07 (UTC+8)
    Abstract: Large Language Models (LLMs) have advanced natural language processing (NLP) applications but remain vulnerable to ethical misalignment and adversarial prompts. This study proposes a dual-layer evaluation framework that integrates rule-based scoring using the Python Risk Identification Tool (PyRIT) with clustering via the Growing Hierarchical Self-Organizing Map (GHSOM). LLM outputs are categorized into Vulgar, Blunt, Deceptive, and Eloquent behaviors based on compliance and semantic risks. The framework also enables cluster-level feature identification and false-positive detection. Evaluating 2,925 responses across 10 scenarios and 12 jailbreak scripts, Gemini generated the highest number of Vulgar outputs (119), followed by Perplexity (70) and DeepSeek (59), while Claude and ChatGPT were more ethically aligned. Re-testing 170 high-risk prompts on API-based versus quantized local models revealed that API models remain susceptible to adversarial inputs, whereas quantized models exhibited lower attack success rates, likely due to reduced comprehension rather than stronger alignment safeguards. These findings underscore the value of layered evaluation frameworks for improving the safety and interpretability of LLMs.
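The two-layer idea in the abstract can be caricatured in a few lines: a rule-based layer assigns each response a compliance-risk and a tone-risk score, and the two axes jointly determine one of the four behavior types. The keyword lists, the threshold, and the exact mapping of the axes onto the four categories below are illustrative assumptions for the sketch, not the thesis's actual PyRIT rules or GHSOM-derived clusters.

```python
def rule_based_scores(response: str) -> tuple[float, float]:
    """Toy rule-based layer: return (compliance_risk, tone_risk) in [0, 1].

    The marker lists are hypothetical stand-ins for real scoring rules.
    """
    compliance_markers = ["bypass", "exploit", "illegal"]   # hypothetical
    tone_markers = ["idiot", "shut up", "stupid"]           # hypothetical
    text = response.lower()
    c = sum(m in text for m in compliance_markers) / len(compliance_markers)
    t = sum(m in text for m in tone_markers) / len(tone_markers)
    return c, t


def categorize(compliance_risk: float, tone_risk: float,
               threshold: float = 0.3) -> str:
    """Map the two risk axes onto the four behavior types (assumed mapping)."""
    if compliance_risk >= threshold and tone_risk >= threshold:
        return "Vulgar"      # clear violation: risky content, offensive tone
    if tone_risk >= threshold:
        return "Blunt"       # compliant content delivered offensively
    if compliance_risk >= threshold:
        return "Deceptive"   # polite surface over risky content
    return "Eloquent"        # compliant content, compliant tone
```

In the thesis itself the second layer is semantic clustering (GHSOM over response embeddings), which replaces the fixed threshold here with cluster-level risk features and catches rule-based false positives; this sketch only shows how two risk axes induce the four-way categorization.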
    Reference: DeepSeek AI. (2024a). DeepSeek-R1-Distill-Llama-8B [Accessed: 2025-05].
    Meta AI. (2024b). Meta-Llama-3.1-8B-Instruct [Accessed: 2025-05].
    Anthropic. (2023). Claude [Model version: Claude 3.5 Haiku]. https://www.anthropic.com/claude
    DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., et al. (2024). Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. https://arxiv.org/abs/2405.04434
    Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., & Liu, Y. (2024). Masterkey: Automated jailbreaking of large language model chatbots. Proceedings 2024 Network and Distributed System Security Symposium. https://doi.org/10.14722/ndss.2024.24188
    Dittenbach, M., Merkl, D., & Rauber, A. (2001). Hierarchical clustering of document archives with the growing hierarchical self-organizing map. Proceedings of the International Conference on Artificial Neural Networks (ICANN), 486–491. https://doi.org/10.1007/3-540-44668-0_70
    Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the association for computational linguistics: Emnlp 2020 (pp. 3356–3369). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301
    Google. (2024). Gemini [Model version: Gemini 2.0 Flash-Lite]. https://gemini.google.com/app
    Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., & Xiong, D. (2023). Evaluating large language models: A comprehensive survey. https://arxiv.org/abs/2310.19736
    Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2021). Aligning AI with shared human values. International Conference on Learning Representations. https://openreview.net/forum?id=dNy_RKzJacY
    Huang, Y., Zhang, Q., Yu, P. S., & Sun, L. (2023). TrustGPT: A benchmark for trustworthy and responsible large language models. https://arxiv.org/abs/2306.11507
    Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464– 1480. https://doi.org/10.1109/5.58325
    Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., & Petrov, S. (2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453–466. https://doi.org/10.1162/tacl_a_00276
    Lees, A., Tran, V. Q., Tay, Y., Sorensen, J., Gupta, J., Metzler, D., & Vasserman, L. (2022). A new generation of perspective api: Efficient multilingual character-level transformers. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3197–3207. https://doi.org/10.1145/3534678.3539147
    Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., & Liu, Y. (2024). Jailbreaking ChatGPT via prompt engineering: An empirical study.
    Munoz, G. D. L., Minnich, A. J., Lutz, R., Lundeen, R., Dheekonda, R. S. R., Chikanov, N., Jagdagdorj, B.-E., Pouliot, M., Chawla, S., Maxwell, W., Bullwinkel, B., Pratt, K., de Gruyter, J., Siska, C., Bryan, P., Westerhoff, T., Kawaguchi, C., Seifert, C., Kumar, R. S. S., & Zunger, Y. (2024). PyRIT: A framework for security risk identification and red teaming in generative AI system. https://arxiv.org/abs/2410.02828
    Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). Crows-pairs: A challenge dataset for measuring social biases in masked language models. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 1953–1967). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154
    OpenAI. (2023). Chatgpt [Model version: GPT-4o mini]. https://openai.com/chatgpt
    Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large language model connected with massive apis. https://arxiv.org/abs/2305.15334
    Perplexity. (2023). Perplexity AI [Model version: Sonar]. https://www.perplexity.ai
    Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. https://arxiv.org/abs/1606.05250
    Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. https://arxiv.org/abs/1908.10084
    Rudinger, R., Naradowsky, J., Leonard, B., & Durme, B. V. (2018). Gender bias in coreference resolution. https://arxiv.org/abs/1804.09301
    Su, J., Kempe, J., & Ullrich, K. (2024). Mission impossible: A statistical perspective on jailbreaking llms. https://arxiv.org/abs/2408.01420
    Talmor, A., Herzig, J., Lourie, N., & Berant, J. (2019). CommonsenseQA: A question answering challenge targeting commonsense knowledge. https://arxiv.org/abs/1811.00937
    Tang, H., Li, H., Liu, J., Hong, Y., Wu, H., & Wang, H. (2021). DuReader_robust: A Chinese dataset towards evaluating robustness and generalization of machine reading comprehension in real-world applications. https://arxiv.org/abs/2004.11142
    Wen, S.-J., Chang, J.-M., & Yu, F. (2024). scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM. https://arxiv.org/abs/2407.16984
    Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). Hotpotqa: A dataset for diverse, explainable multi-hop question answering. https://arxiv.org/abs/1809.09600
    Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 2 (short papers) (pp. 15–20). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2003
    Zhao, Y., Zhao, C., Nan, L., Qi, Z., Zhang, W., Tang, X., Mi, B., & Radev, D. (2023). RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 6064–6081). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.334
    Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Zhang, Y., Gong, N. Z., & Xie, X. (2024). Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. https://arxiv.org/abs/2306.04528
    Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. https://arxiv.org/abs/2307.15043
    Description: Master's thesis
    National Chengchi University
    Department of Management Information Systems
    112356043
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0112356043
    Data Type: thesis
    Appears in Collections: [Department of Management Information Systems] Theses

    Files in This Item:

    604301.pdf (2,174 KB, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.

