    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/152577


    Title: 基於深度學習框架的變聲音訊還原機制
    Restoration of Altered Sound Based on Deep Learning Framework
    Authors: 黃大維
    Huang, Ta-Wei
    Contributors: 廖文宏
    Liao, Wen-Hung
    黃大維
    Huang, Ta-Wei
    Keywords: 深度學習
    語者辨識
    語音辨識
    變聲音訊還原
    Deep Learning
    Speaker Recognition
    Speech Recognition
    Altered Sound Restoration
    Date: 2024
    Issue Date: 2024-08-05 12:46:50 (UTC+8)
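    Abstract: This study aims to restore voice-altered audio using deep learning. Conventional speech recognition systems perform well on standard audio but are often of limited effectiveness on audio that has been processed by a voice changer. Our research therefore focuses on using deep learning methods to restore such altered audio files to a state as close as possible to the original. We then evaluate the effectiveness of the restoration with speech recognition and speaker recognition systems.
    We also examine the potential risks that speaker recognition systems pose to speaker privacy and security. Although voice alteration can mask a speaker's identity characteristics to some extent, we found that once the restoration model's conditions are met and the altered audio has been restored, speaker recognition models can still identify the altered speaker in most cases. This result indicates that, even after voice alteration, model performance is strong enough to pose a potential risk to speaker privacy.
    We selected a variety of altered-voice sample datasets and combined them with different deep learning models to extract audio features and perform restoration. The study explores the use of deep learning models that combine generative adversarial networks (GANs) and VAEs for audio restoration, with the goal of preserving each speaker's distinctive characteristics.
    The experiments focus on comparing how different models perform after restoring altered audio. We use OpenAI's Whisper system for speech recognition and the character error rate (CER) to evaluate recognition accuracy. We also use wav2vec 2.0 and ResNet models for speaker recognition to examine how well the restoration preserves speaker characteristics. For a more comprehensive evaluation, we use two further metrics: Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). PESQ lets us compare the quality of the original, altered, and restored audio to assess the degree of restoration. STOI is used to assess the clarity of the altered audio, which helps us understand how voice alteration affects speech intelligibility. These metrics let us evaluate and compare the models from more angles.
    For speech recognition, although HiFi-GAN can in some cases produce clear restorations close to the original speech, all of the GAN models interfered substantially with speech recognition when the alteration effect was pronounced, which indicates that the models need further optimization and tuning in future work. For speaker recognition, we found that using restored audio in training and testing is feasible, but challenges remain when the training data contain neither the original audio nor the restored audio. Combining VITS and HiFi-GAN not only improved speech recognition and speaker recognition accuracy but also overcame the problem of very low accuracy when training on original audio and testing on restored audio.
    In the tests where the speaker recognition model was trained on altered audio, we found that with RVC's Top-1 retrieval the model, drawing on its training data, restores more of the original speaker's characteristics, which in this experimental setting leads to lower accuracy than the other VITS models.
    Gender-conversion voice alteration differs from the other alteration methods: it is less destructive to the audio and, because it leaves the human voice relatively unaffected, it achieved better results on every metric. This highlights that the choice of alteration method is related to the quality of the restored audio.
    In further tests, we examined the Top-5 speaker recognition accuracy on restored audio, that is, whether the correct speaker is among the model's top five predictions. The results show that, even after alteration, the model's Top-5 predictions include the correct speaker with high probability, which means that speaker recognition systems pose a potential risk to speaker privacy and security.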
    This study aims to restore altered audio using deep learning techniques. Traditional speech recognition systems perform well with standard audio but struggle with altered audio. Thus, our research focuses on using deep learning methods to restore altered audio files to their original state as closely as possible. We then evaluate the effectiveness of these restoration techniques using speech recognition and speaker recognition systems.

    This study also explores the potential risks that speaker recognition systems pose to speaker privacy and security. Although voice alteration can partly obscure a speaker's identity, we found that once the altered audio has been restored, speaker recognition models can still identify the speaker in most cases. Voice alteration alone may therefore not be enough to protect speaker privacy.

    We selected various altered audio datasets and combined different deep learning models to extract audio features and perform restoration. This study explores the application of models combining Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) to preserve the speaker's unique characteristics. We used OpenAI's Whisper system for speech recognition and measured Character Error Rate (CER) to evaluate accuracy. Additionally, we used wav2vec 2.0 and ResNet for speaker recognition to assess the effectiveness of the restoration techniques.
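
The CER measurement mentioned above can be illustrated with a minimal sketch, assuming the standard definition: the character-level edit distance between the reference transcript and the recognized text, divided by the reference length. The function names are illustrative and this is not the thesis's evaluation code; the reference list includes the FastWER package, which may have been used instead.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition a lower CER means the recognizer's transcript of the restored audio is closer to the reference, and the value can exceed 1.0 when the hypothesis is much longer than the reference.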

    To provide a more comprehensive evaluation, we also used Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). PESQ compares the quality of the original, altered, and restored audio to assess the degree of restoration, while STOI evaluates the clarity of the altered audio and so shows how alteration affects intelligibility. These metrics allow us to assess and compare model performance from multiple angles.

    Our results show that while HiFi-GAN improves speech quality (PESQ) and recognition accuracy (CER), all GAN models still interfered with speech recognition when the alteration effect was pronounced. In speaker recognition, combining VITS and HiFi-GAN improved accuracy and addressed the low accuracy observed when training on original audio and testing on restored audio. When the speaker recognition model was trained on altered audio, RVC with Top-1 retrieval restored more of the original speaker's characteristics, which in this setting led to lower accuracy than the other VITS models. Gender transformation caused less distortion and achieved better scores, highlighting the impact of the alteration method on restoration quality.

    Further tests evaluated the Top-5 speaker recognition accuracy on restored audio, that is, whether the correct speaker is among the model's five highest-ranked predictions. Results show that even after alteration, the model often includes the correct speaker in its Top-5 predictions, indicating potential risks to speaker privacy and security.
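
The Top-5 criterion can be sketched as follows. This is a minimal illustration, assuming the speaker recognition model outputs one score per enrolled speaker for each utterance; the function name and input layout are hypothetical and not taken from the thesis.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 5) -> float:
    """Fraction of utterances whose true speaker is among the k best-scored speakers.

    scores[n][s] is the (assumed) model score for speaker s on utterance n;
    labels[n] is the index of the true speaker of utterance n.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Rank speaker indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda s: row[s], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

With k=1 this reduces to ordinary classification accuracy, so comparing k=1 and k=5 shows how often the correct speaker is ranked near the top without being ranked first.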
    Reference: 1. MyEdit online voice changer. Available from: https://myedit.online/tw/audio-editor/voice-changer.
    2. Kim, T., et al. Learning to discover cross-domain relations with generative adversarial networks. in International conference on machine learning. 2017. PMLR.
    3. Kaneko, T. and H. Kameoka. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. in 2018 26th European Signal Processing Conference (EUSIPCO). 2018. IEEE.
    4. Almahairi, A., et al. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. in International conference on machine learning. 2018. PMLR.
    5. Kaneko, T., et al. MaskCycleGAN-VC: Learning non-parallel voice conversion with filling in frames. in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021. IEEE.
    6. Kong, J., J. Kim, and J. Bae, HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 2020. 33: p. 17022-17033.
    7. Kim, J., J. Kong, and J. Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. in International Conference on Machine Learning. 2021. PMLR.
    8. Kingma, D.P. and M. Welling, Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
    9. Radford, A., et al. Robust speech recognition via large-scale weak supervision. in International Conference on Machine Learning. 2023. PMLR.
    10. Baevski, A., et al., wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 2020. 33: p. 12449-12460.
    11. He, K., et al. Deep residual learning for image recognition. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
    12. Rix, A.W., et al. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. in 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221). 2001. IEEE.
    13. Taal, C.H., et al., An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on audio, speech, and language processing, 2011. 19(7): p. 2125-2136.
    14. FastWER. Available from: https://github.com/kahne/fastwer.
    15. Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS. Available from: https://github.com/PlayVoice/so-vits-svc-5.0.
    16. Hsu, W.-N., et al., HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021. 29: p. 3451-3460.
    17. Kim, J.W., et al. CREPE: A convolutional representation for pitch estimation. in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018. IEEE.
    18. Retrieval-based-Voice-Conversion. Available from: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.
    19. Jiang, P.-T., et al., LayerCAM: Exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing, 2021. 30: p. 5875-5888.
    Description: Master's
    National Chengchi University
    Department of Computer Science
    111753218
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0111753218
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File          Description    Size      Format
    321801.pdf                   2755Kb    Adobe PDF


    All items in 政大典藏 are protected by copyright, with all rights reserved.


