    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/155417


    Title: 仿聲機器:語音克隆代理中的身份展演
    Disembodied Voice: A Discourse on Performative Identities within Voice Cloning AI Agents
    Authors: 金天尹 (Chin, Tien-Yin)
    Contributors: 陶亞倫 (Tao, Ya-Lun)
    紀明德 (Chi, Ming-Te)
    金天尹 (Chin, Tien-Yin)
    Keywords: 語音克隆 (Voice Cloning)
    身份展演 (Identity Performance)
    人工智慧代理人 (AI Agents)
    新媒體藝術 (New Media Art)
    Date: 2024
    Issue Date: 2025-02-04 15:28:40 (UTC+8)
    Abstract: 本研究旨在探索語音克隆(Voice Cloning)技術與大型語言模型(LLM)在藝術創作及身份展演領域的應用潛力,並提出一個即時對話 AI 代理框架。該框架實現多說話人音色混合、情緒音色調節及客製化語言模型的動態角色切換,支援複雜互動場景及多樣化身份設定,並將其靈活應用於三件藝術作品——《不存在的電話簿》、《萬神殿,Ai - men》與《AI,是我,Ai, it's me》。透過技術開發、藝術實踐與觀眾回饋,本研究揭示語音克隆技術在多元身份展演、互動設計及藝術表達方面的潛力與創新意涵。

    在技術設計層面,本研究提出模組化即時對話框架,將 AI 代理分為聲音模組與心智模組,賦予其語音合成、提示工程與情感表達等核心功能。框架具備情境適應性,能即時調整角色身份、情感語調及語言特徵,並支援自定義回覆設定,以滿足在互動藝術及戲劇表演等場景的應用需求。此外,研究透過音色融合實驗,深入探索情緒、性別、語言及口音等多維語音特徵的組合,展現身份的流動與靈活性。

    在藝術創作實踐中,語音克隆技術被視為「身份嬉戲(Identity Play)」的媒介,呈現動態且多層次的身份展演,挑戰傳統對聲音之於身份的固著性與真實性的認知,超越性別、語言與情緒的既定框架,在虛擬網路空間以外的真實世界重新建構自我表達的可能性。研究進一步透過創作反思延伸討論 AI 代理的人機關係、偏見再現等議題,並將這些洞察轉化為可共享的藝術經驗與知識體系。

    本研究不僅在技術層面驗證了語音克隆即時互動框架的可行性,亦透過藝術實踐探討技術在人機互動及身份展演中的應用潛力與倫理挑戰。未來研究可持續擴展至多情感建模、跨文化場景及多語音適應性等領域,為語音克隆在藝術、人文及社會實踐中的發展提供理論支持與實踐依據。
    This study explores the potential of voice cloning technology and large language models (LLMs) in artistic creation and identity performance, proposing a real-time conversational AI agent framework. The framework integrates multi-speaker voice blending, emotional tone modulation, and dynamic role-switching with customized language models, enabling complex interactive scenarios and diverse identity representations. It is applied to three artistic works: The Non-Existent Phonebook; Pantheon, Ai - men; and Ai, it's me. Through technological development, artistic practice, and audience feedback, the study highlights the innovative potential of voice cloning in identity performance, interaction design, and artistic expression.
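
    As a reading aid, the sketch below illustrates the kind of voice/cognitive split and dynamic role-switching the abstract describes. It is a minimal sketch, not the thesis's implementation: the ChatLLM and CloningTTS interfaces are hypothetical stand-ins for components of the sort named in the reference list (e.g., GPT-SoVITS for synthesis, faster-whisper for recognition), and a real-time system would stream audio rather than return it whole.

```python
from dataclasses import dataclass
from typing import Protocol

class ChatLLM(Protocol):
    """Cognitive module: any chat-completion backend (hypothetical interface)."""
    def chat(self, system: str, user: str) -> str: ...

class CloningTTS(Protocol):
    """Voice module: any voice-cloning synthesis backend (hypothetical interface)."""
    def speak(self, text: str, voice: dict[str, float], emotion: str) -> bytes: ...

@dataclass
class Role:
    """One switchable identity: a persona prompt plus voice settings."""
    persona: str                     # system prompt for the cognitive module
    voice_weights: dict[str, float]  # speaker blend, e.g. {"a": 0.7, "b": 0.3}
    emotion: str = "neutral"         # target emotional tone for synthesis

@dataclass
class Agent:
    llm: ChatLLM      # cognitive module
    tts: CloningTTS   # voice module
    role: Role

    def switch_role(self, role: Role) -> None:
        # Switching identity swaps only the persona prompt and voice
        # settings; neither model is reloaded, so it can happen mid-dialogue.
        self.role = role

    def respond(self, user_text: str) -> bytes:
        # Cognitive module produces the reply text in the current persona;
        # the voice module renders it with the current blend and emotion.
        reply = self.llm.chat(system=self.role.persona, user=user_text)
        return self.tts.speak(reply,
                              voice=self.role.voice_weights,
                              emotion=self.role.emotion)
```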

    Technically, the study introduces a modular real-time conversational framework that divides the AI agent into a voice module and a cognitive module, equipping it with speech synthesis, prompt engineering, and emotional expression as core functions. The system adapts to context in real time, adjusting identity roles, emotional tones, and linguistic characteristics, and supports customized responses for interactive art and theatrical applications. Voice blending experiments further demonstrate the fluidity and flexibility of identity across emotions, genders, languages, and accents.
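
    The voice blending mentioned above is commonly realized as weighted interpolation of speaker embeddings before synthesis (cf. the multi-speaker TTS work of Arik et al., 2017, and Jia et al., 2018, in the reference list). The following is a minimal sketch of that general idea only, under that assumption; the embedding size, speaker names, and random example vectors are hypothetical and not taken from the thesis.

```python
import numpy as np

def blend_speakers(embeddings: dict[str, np.ndarray],
                   weights: dict[str, float]) -> np.ndarray:
    """Convex combination of reference speaker embeddings.

    Sliding the weights morphs timbre continuously between voices,
    which is what makes "blended" identities possible across gender,
    language, or accent lines.
    """
    total = sum(weights.values())
    mix = sum((w / total) * embeddings[name] for name, w in weights.items())
    # Many speaker encoders emit unit-norm vectors; renormalizing keeps
    # the blend on the same hypersphere as the reference voices.
    return mix / np.linalg.norm(mix)

# Hypothetical usage: a voice that is 70% speaker A and 30% speaker B.
rng = np.random.default_rng(0)
emb = {"a": rng.normal(size=256), "b": rng.normal(size=256)}
blended = blend_speakers(emb, {"a": 0.7, "b": 0.3})
```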

    Artistically, voice cloning serves as a medium for "Identity Play", presenting dynamic, multi-layered identity performances that challenge conventional assumptions about the fixity and authenticity of the link between voice and identity. It transcends established boundaries of gender, language, and emotion, reconstructing the possibilities of self-expression in the real world beyond virtual online spaces. The study further reflects on human-AI relationships and the reproduction of bias in AI agents, transforming these insights into shareable artistic experience and knowledge.

    This research not only validates the technical feasibility of a real-time interactive voice cloning framework but also uses artistic practice to probe its potential and ethical challenges in human-computer interaction and identity performance. Future work may extend to multi-emotion modeling, cross-cultural scenarios, and multi-voice adaptability, providing theoretical and practical grounding for voice cloning in artistic, humanistic, and social contexts.
    Reference: Amazon Web Services. (2024). What is a RESTful API? [Accessed: 2024-06-30]. (cit. p. 18).
    Arik, S., Chen, J., Peng, K., Ping, W., & Zhou, Y. (2018). Neural voice cloning with a few samples. Advances in Neural Information Processing Systems, 31 (cit. p. 7).
    Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017). Deep Voice 2: Multi-speaker neural text-to-speech. arXiv preprint arXiv:1705.08947 (cit. p. 1).
    Austin, J. (1962). How to do things with words. Oxford University Press. (cit. p. 4).
    Baker, A. M., Sonn, C. C., & Meyer, K. (2020). Voices of displacement: A methodology of sound portraits exploring identity and belonging. Qualitative Research, 20(6), 892–909 (cit. p. 15).
    Burden, D., & Savin-Baden, M. (2019). Virtual humans: Today and tomorrow. Chapman & Hall/CRC. (cit. p. 17).
    Butler, J. (1990). Gender trouble: Feminism and the subversion of identity. Routledge. (cit. p. 4).
    Butler, J. (2011). Bodies that matter: On the discursive limits of sex. Routledge. (cit. p. 9).
    Carli, L. L., LaFleur, S. J., & Loeber, C. C. (1995). Nonverbal behavior, gender, and influence. Journal of Personality and Social Psychology, 68(6), 1030 (cit. p. 10).
    Chen, W., & Jiang, X. (2023). Voice-cloning artificial-intelligence speakers can also mimic human-specific vocal expression (cit. p. 7).
    Cremen, P. (2018). Personal development in the higher education and training of social care workers in Ireland [Doctoral dissertation, University of Sheffield]. (cit. p. 32).
    Deshpande, A., Rajpurohit, T., Narasimhan, K., & Kalyan, A. (2023). Anthropomorphization of AI: Opportunities and risks. CoRR, abs/2305.14784, 7 pages. (cit. p. 10).
    Etzrodt, K., & Engesser, S. (2021). Voice-based agents as personified things: Assimilation and accommodation as equilibration of doubt. Human-Machine Communication, 2, 57–76 (cit. p. 2).
    Freeman, G., Zamanifard, S., Maloney, D., & Adkins, A. (2020). My body, my avatar: How people perceive their avatars in social virtual reality. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–8 (cit. p. 11).
    Frühholz, S., Trost, W., & Grandjean, D. (2014). The role of the medial temporal limbic system in processing emotions in voice and music. Progress in Neurobiology, 123, 1–17 (cit. p. 8).
    Gallagher, M. (2015). Field recording and the sounding of spaces. Environment and Planning D: Society and Space, 33(3), 560–576 (cit. p. 46).
    Goffman, E. (1981). Forms of talk. University of Pennsylvania Press (cit. p. 9).
    Goffman, E. (2023). The presentation of self in everyday life. In Social theory re-wired (pp. 450–459). Routledge. (cit. p. 9).
    Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. (cit. p. 8).
    Gudmalwar, A., Shah, N., Akarsh, S., Wasnik, P., & Shah, R. R. (2024). VECL-TTS: Voice identity and emotional style controllable cross-lingual text-to-speech. arXiv preprint arXiv:2406.08076 (cit. p. 21).
    Hazan, V., & Baker, R. (2010). Does reading clearly produce the same acoustic-phonetic modifications as spontaneous speech in a clear speaking style? DiSS-LPSS Joint Workshop 2010 (cit. p. 10).
    Hill, J. (1995). The voices of Don Gabriel: Responsibility and self in a modern Mexicano narrative. The dialogic emergence of culture, 97–147 (cit. p. 10).
    Huang, M. (2022). X.com [Accessed: 2024-08-10]. (cit. p. 11).
    Hughes, S. M., Mogilski, J. K., & Harrison, M. A. (2014). The perception and parameters of intentional voice manipulation. Journal of Nonverbal Behavior, 38, 107–127 (cit. p. 10).
    Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, 1, 373–376 (cit. p. 7).
    Hwang, A. H.-C., Siy, J. O., Shelby, R., & Lentz, A. (2024). In whose voice?: Examining AI agent representation of people in social interaction through generative speech. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 224–245 (cit. pp. 11, 42, 43).
    Jemine, C., et al. (2019). Real-time voice cloning (cit. p. 8).
    Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., Nguyen, P., Pang, R., Lopez Moreno, I., Wu, Y., et al. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems, 31 (cit. p. 7).
    Jiang, G., Xu, M., Zhu, S.-C., Han, W., Zhang, C., & Zhu, Y. (2024). Evaluating and inducing personality in pre-trained language models. Advances in Neural Information Processing Systems, 36 (cit. p. 10).
    Kim, J., & Seifert, U. (2007). Embodiment and agency: Towards an aesthetics of interactive performativity. Proceedings of the 4th Sound and Music Computing Conference (cit. p. 4).
    Kirk, N. W., & Cunningham, S. J. (2024). Listen to yourself! Prioritization of self-associated and own voice cues. British Journal of Psychology (cit. p. 8).
    Lavan, N., Burton, A. M., Scott, S. K., & McGettigan, C. (2019). Flexible voices: Identity perception from variable vocal signals. Psychonomic Bulletin & Review, 26, 90–102 (cit. p. 10).
    Lawy, J. (2017). Theorizing voice: Performativity, politics and listening. Anthropological Theory, 17, 192–215 (cit. p. 10).
    Lee, P. Y. K., Ma, N. F., Kim, I.-J., & Yoon, D. (2023). Speculating on risks of AI clones to selfhood and relationships: Doppelganger-phobia, identity fragmentation, and living memories. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1), 1–28 (cit. pp. 10, 11, 42).
    Liu, R., Hu, Y., Yi, R., Xiang, Y., & Li, H. (2024). Generative expressive conversational speech synthesis. arXiv preprint arXiv:2407.21491 (cit. p. 7).
    Llanes-Jurado, J., Gómez-Zaragozá, L., Minissi, M. E., Alcañiz, M., & Marín-Morales, J. (2024). Developing conversational virtual humans for social emotion elicitation based on large language models. Expert Systems with Applications, 246, 123261 (cit. p. 17).
    Lyu, H., Jiang, S., Zeng, H., Xia, Y., Wang, Q., Zhang, S., Chen, R., Leung, C., Tang, J., & Luo, J. (2023). LLM-Rec: Personalized recommendation via prompting large language models. arXiv preprint arXiv:2307.15780 (cit. p. 10).
    McKinlay, A. (2010). Performativity: From J. L. Austin to Judith Butler. May Fly, 119 (cit. p. 9).
    MIT Media Lab. (2024). Future You: An interactive digital twin system for self-reflection and personal growth [Accessed: 2024-10-21]. (cit. p. 11).
    MIT Technology Review. (2024). OpenAI released its advanced voice mode to more people: Here's how to get it [Accessed: 2024-12-27]. https://www.technologyreview.com/2024/09/24/1104422/openai-released-its-advanced-voice-mode-to-more-people-heres-how-to-get-it/ (cit. p. 2).
    Moore, M. (2013). Coaching the multiplicity of mind: A strengths-based model. Global Advances in Health and Medicine, 2(4), 78–84 (cit. p. 32).
    Napolitano, D. (2020). The cultural origins of voice cloning. Proceedings of the International Conference on Voice Technology, 123–130 (cit. pp. 2, 7, 8, 46).
    Napolitano, D. (2023). The shaping of a standard voice: Sonic and sociotechnical imaginaries in smart speakers. Im@go. A Journal of the Social Imaginary, (21), 177–196 (cit. pp. 1, 42).
    OpenAI. (2023). ChatGPT can now see, hear, and speak [Accessed: 2024-10-21]. (cit. p. 2).
    OpenAI Community. (2024). Whisper hallucination: How to recognize and solve [Accessed: 2024-10-21]. (cit. p. 19).
    Qin, Z., Zhao, W., Yu, X., & Sun, X. (2023). OpenVoice: Versatile instant voice cloning. arXiv preprint arXiv:2312.01479 (cit. p. 7).
    Raphael, B. N., & Scherer, R. C. (1987). Voice modifications of stage actors: Acoustic analyses. Journal of Voice, 1(1), 83–87 (cit. p. 10).
    Rohrer, J. (2020). Project December: Simulate the dead [Accessed: 2024-08-10]. (cit. p. 11).
    RVC-Boss. (2024). GPT-SoVITS [Accessed: 2024-08-10]. (cit. pp. 20, 21).
    Saunders, C., & Fernyhough, C. (2017). Reading Margery Kempe's inner voices. postmedieval: a journal of medieval cultural studies, 8, 209–217 (cit. p. 32).
    Seconds, S. (2022). Plutchik's wheel of emotions [Accessed: 2024-10-21]. (cit. p. 36).
    Semeraro, A., Vilella, S., & Ruffo, G. (2021). PyPlutchik: Visualising and comparing emotion-annotated corpora. PLoS ONE, 16(9), e0256503 (cit. p. 37).
    Serapio-García, G., Safdari, M., Crepy, C., Sun, L., Fitz, S., Romero, P., Abdulhai, M., Faust, A., & Matarić, M. (2023). Personality traits in large language models. arXiv preprint arXiv:2307.00184 (cit. p. 43).
    Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783 (cit. p. 7).
    Soonpipatskul, N., Pal, D., Watanapa, B., & Charoenkitkarn, N. (2023). Personality perceptions of conversational agents: A task-based analysis using Thai as the conversational language. IEEE Access (cit. pp. 43, 47).
    Sterne, J. (2008). Enemy voice. Social Text, 26(3), 79–100 (cit. p. 8).
    SYSTRAN. (2023). Faster-whisper: Optimized Whisper speech recognition model [Accessed: 2024-06-30]. (cit. p. 19).
    Tadimalla, S. Y., & Maher, M. L. (2024). Implications of identity in AI: Creators, creations, and consequences. Proceedings of the AAAI Symposium Series, 3(1), 528–535 (cit. pp. 43, 47).
    Tanaka, Y. L., & Kudo, Y. (2012). Effects of familiar voices on brain activity. International Journal of Nursing Practice, 18, 38–44 (cit. p. 8).
    Turkle, S. (1999). Cyberspace and identity. Contemporary Sociology, 28(6), 643–648 (cit. p. 11).
    Turkle, S. (2005). The second self: Computers and the human spirit. MIT Press. (cit. pp. 2, 3).
    Weidman, A. (2014). Anthropology and voice. Annual Review of Anthropology, 43(1), 37–51 (cit. pp. 2, 8).
    Weinberger, S., & Kunath, S. (2011). The Speech Accent Archive: Towards a typology of English accents. Language and Computers, 73, 45–60 (cit. pp. 19, 42).
    Wikipedia Contributors. (2024). Intelligent agent [Accessed: 2024-06-29]. (cit. p. 3).
    Yahoo News. (2024). 詐團想用AI變聲騙錢 新北警拆穿守住男子38萬元 [Fraud ring tried to use AI voice conversion to scam money; New Taipei police exposed the scheme and protected a man's NT$380,000] [Accessed: 2024-10-21]. https://tw.news.yahoo.com/%E8%A9%90%E5%9C%98%E6%83%B3%E7%94%A8ai%E8%AE%8A%E8%81%B2%E9%A8%99%E9%8C%A2%E6%96%B0%E5%8C%97%E8%AD%A6%E6%8B%86%E7%A9%BF%E5%AE%88%E4%BD%8F%E7%94%B7%E5%AD%9038%E8%90%AC%E5%85%83-094601396.html (cit. p. 1).
    Zimman, L. (2018). Transgender voices: Insights on identity, embodiment, and the gender of the voice. Language and Linguistics Compass, 12(8), e12284 (cit. p. 10).
    毛榮富 et al. (2017a). 社交媒體時代社會性的未來:按 Sherry Turkle 的自我概念進行的考察 [The future of sociality in the age of social media: An examination based on Sherry Turkle's concept of the self]. 資訊社會研究, (32), 51–82 (cit. p. 2).
    毛榮富 et al. (2017b). 社交媒體時代社會性的未來:按 Sherry Turkle 的自我概念進行的考察 [The future of sociality in the age of social media: An examination based on Sherry Turkle's concept of the self]. 資訊社會研究, (32), 51–82 (cit. p. 11).
    王右君. (2009). 重訪網路上的身份展演:以同志論壇 MOTSS 為分析對象 [Revisiting online identity performance: The gay forum MOTSS as the object of analysis]. 新聞學研究, (99), 47–77 (cit. p. 11).
    王萌, 曹., 姜丹. (2024). 一种节奏与内容解纠缠的语音克隆模型 [A voice cloning model disentangling rhythm and content]. Artificial Intelligence and Robotics Research, 13, 166 (cit. p. 14).
    華山文創園區. (2024). Ai,是我 [Ai, it's me; stage play in the Huashan Technology Art Festival production《創世紀》(Genesis)]. (cit. p. 31).
    Description: 碩士 (Master's thesis)
    國立政治大學 (National Chengchi University)
    數位內容碩士學位學程 (Master's Program in Digital Content)
    111462001
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0111462001
    Data Type: thesis
    Appears in Collections: [數位內容碩士學位學程] 學位論文 (Theses)

    Files in This Item:

    File: 200101.pdf (15092 KB, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.



    著作權政策宣告 Copyright Announcement
    1.本網站之數位內容為國立政治大學所收錄之機構典藏,無償提供學術研究與公眾教育等公益性使用,惟仍請適度,合理使用本網站之內容,以尊重著作權人之權益。商業上之利用,則請先取得著作權人之授權。
    The digital content of this website is part of National Chengchi University Institutional Repository. It provides free access to academic research and public education for non-commercial use. Please utilize it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

    2.本網站之製作,已盡力防止侵害著作權人之權益,如仍發現本網站之數位內容有侵害著作權人權益情事者,請權利人通知本網站維護人員(nccur@nccu.edu.tw),維護人員將立即採取移除該數位著作等補救措施。
    NCCU Institutional Repository is made to protect the interests of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.