National Chengchi University Institutional Repository (NCCUR): Item 140.119/157243

Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/157243

Title:	Rationality Evaluation and Semantic Analysis of Saliency Maps in Diffusion Models (擴散模型之顯著圖合理性評估及語義分析)
Author:	Lin, Da-Wei (林大維)
Contributors:	Chi, Ming-Te (紀明德); Lin, Da-Wei (林大維)
Keywords:	Diffusion Models; Saliency Maps; Text-to-Image Generation Models; Semantic Analysis
Date:	2025
Upload time:	2025-06-02 14:57:39 (UTC+8)
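Abstract (translated from the Chinese):	In recent years, diffusion models have made major advances in image generation; Stable Diffusion in particular has brought text-to-image generation to a new level. However, when the model resolves the relationship between natural language and the generated image, feature entanglement can occur and undermine the plausibility of the result. This study adopts the DAAM (Diffusion Attentive Attribution Map) method and analyzes the saliency maps derived from the cross-attention layers to examine which regions the model attends to for each prompt word and how this affects the generated image. We propose an automated rationality evaluation method that uses Segment Anything (SAM) semantic segmentation to quantify the accuracy of the saliency maps, and we compare generalization across different pretrained Stable Diffusion models (such as v1.5, v2.1, and SDXL). In addition, through dependency parsing and feature entanglement analysis, we examine how linguistic prompts affect image generation and verify the range of influence that adjectives and scene descriptions have on the generated results. Experimental results show that DAAM outperforms traditional gradient-based methods (such as Grad-CAM and Grad-CAM++) in evaluating semantic relevance and reflects the correspondence between text and image more accurately. We also find that some adjectives affect the entire scene rather than only the object they describe, which shows that Stable Diffusion still faces challenges with complex prompts. Future work will further refine the DAAM technique and explore more precise semantic interpretation methods to improve the interpretability and generation quality of diffusion models.
Abstract (author's English):	Diffusion models have improved image generation, with Stable Diffusion advancing text-to-image synthesis. However, feature entanglement affects coherence. This study employs the Diffusion Attentive Attribution Map (DAAM) to analyze saliency maps from cross-attention layers, examining prompt processing and its impact on generation. We propose an automated evaluation method using the Segment Anything Model (SAM) for semantic segmentation to assess saliency accuracy. DAAM's generalization is compared across Stable Diffusion versions (v1.5, v2.1, SDXL), with linguistic prompt influence analyzed through dependency parsing and feature entanglement studies. Results show that DAAM outperforms gradient-based methods like Grad-CAM in semantic relevance, revealing how certain adjectives influence entire scenes. Future research will refine DAAM and improve semantic interpretation for better model explainability and generation quality.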
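The abstract does not say which metric the thesis uses to score a saliency map against a SAM segmentation. As a minimal sketch, assuming intersection-over-union (IoU) on a thresholded map, the comparison could look like the following. The function names, the 0.5 threshold ratio, and the toy arrays are hypothetical, and the mask is written by hand rather than produced by SAM.

```python
from typing import List, Sequence


def binarize(saliency: Sequence[Sequence[float]],
             threshold_ratio: float = 0.5) -> List[List[bool]]:
    """Mark pixels whose saliency is at least threshold_ratio * the map's peak."""
    peak = max(max(row) for row in saliency)
    if peak <= 0:
        # An all-zero map highlights nothing.
        return [[False for _ in row] for row in saliency]
    cutoff = threshold_ratio * peak
    return [[value >= cutoff for value in row] for row in saliency]


def iou(pred: Sequence[Sequence[bool]],
        target: Sequence[Sequence[int]]) -> float:
    """Intersection-over-union of two binary masks with the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    return intersection / union if union else 0.0


# Toy 3x3 saliency map for one prompt word (hypothetical values).
saliency = [
    [0.1, 0.2, 0.0],
    [0.8, 0.9, 0.1],
    [0.7, 1.0, 0.2],
]
# Hand-written stand-in for a segmentation mask of the matching object.
mask = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

score = iou(binarize(saliency, threshold_ratio=0.5), mask)
print(score)  # 0.8 for this toy example
```

A higher score means the thresholded saliency region overlaps more closely with the segmented object. The threshold ratio controls how much of the heat map counts as "attended", so any comparison across models would need to hold it fixed.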
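Description:	Master's thesis; National Chengchi University, Department of Computer Science; 111753161
Source:	http://thesis.lib.nccu.edu.tw/record/#G0111753161
Type:	thesis
Appears in collections:	[Department of Computer Science] Theses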

Files in this item:

File	Description	Size	Format	Views
316101.pdf		7689Kb	Adobe PDF	0

All items in NCCUR are protected by their original copyright.
