    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/152575


    Title: 自監督式聲音特徵在跨語言語者辨識的表現評估
    Evaluation of Cross-Lingual Speaker Recognition using SSL-Based Acoustic Features
    Authors: 陳柏翰
    Chen, Po-Han
    Contributors: 廖文宏
    Liao, Wen-Hung
    陳柏翰
    Chen, Po-Han
    Keywords: Deep learning
    Cross-lingual speaker recognition
    Self-supervised learning
    Acoustic feature
    Date: 2024
    Issue Date: 2024-08-05 12:46:26 (UTC+8)
    Abstract: Speaker recognition is a biometric identification technology that is now widely used in daily life, for example in security systems and voice assistants. Most prior speaker recognition research has focused on scenarios in which the speaker uses a single language. However, a growing number of people use two or more languages in daily life, and recognition errors may occur when a speaker is tested in a language different from the one used at enrollment. This calls for cross-lingual speaker recognition models. In recent years, self-supervised learning (SSL) models have shown that they can learn general-purpose features from large amounts of unlabeled data. How these pretrained features perform in cross-lingual speaker recognition, compared with spectrograms and Mel-frequency cepstral coefficients (MFCCs), has yet to be evaluated.
    In this thesis, we propose using pretrained SSL deep learning models to convert audio data into acoustic features and evaluate these features for cross-lingual speaker recognition. We also analyze the effect of data augmentation on the features. Specifically, we feed raw audio directly into pretrained SSL models to generate embedding vectors that serve as acoustic features, and then analyze cross-lingual performance with a speaker recognition model based on the ResNet architecture.
    We tested this approach on MET-120, a speech dataset collected in our laboratory from 120 speakers, using the SSL models Wav2Vec 2.0 and BEATs to obtain acoustic features. The fine-tuned Wav2Vec 2.0 model averaged over 90% accuracy on MET-120, with strong and stable results. Without fine-tuning, BEATs achieved the best performance on MET-120. We also observed that whether the language is the speaker's native language, as well as the speaker's gender, may affect recognition performance.
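The cross-lingual test setting described above (enroll in one language, test in another) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's method: the thesis uses a ResNet-based recognition model, whereas this sketch swaps in a nearest-centroid classifier with cosine similarity. The function names, the language codes `"en"` and `"zh"`, and the two-dimensional toy embeddings are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroids(utterances, language):
    """Average each speaker's embeddings over utterances in one language."""
    grouped = defaultdict(list)
    for speaker, lang, emb in utterances:
        if lang == language:
            grouped[speaker].append(emb)
    return {
        spk: [sum(col) / len(embs) for col in zip(*embs)]
        for spk, embs in grouped.items()
    }


def accuracy(enroll, test, enroll_lang, test_lang):
    """Enroll speakers in enroll_lang, then identify test_lang utterances."""
    models = centroids(enroll, enroll_lang)
    trials = [(spk, emb) for spk, lang, emb in test if lang == test_lang]
    if not models or not trials:
        raise ValueError("no enrollment or test data for this language pair")
    correct = sum(
        1
        for spk, emb in trials
        if max(models, key=lambda m: cosine(models[m], emb)) == spk
    )
    return correct / len(trials)


# Toy data (hypothetical): (speaker, language, embedding) triples.
enroll = [
    ("A", "en", [1.0, 0.0]),
    ("A", "en", [0.9, 0.1]),
    ("B", "en", [0.0, 1.0]),
]
test = [
    ("A", "en", [1.0, 0.1]),
    ("B", "en", [0.1, 1.0]),
    ("A", "zh", [0.7, 0.4]),
    ("B", "zh", [0.3, 0.8]),
]

same_language = accuracy(enroll, test, "en", "en")
cross_lingual = accuracy(enroll, test, "en", "zh")
```

Comparing `same_language` with `cross_lingual` for every enrollment and test language pair gives the kind of breakdown needed to see how much performance drops when the test language differs from the enrollment language.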
    In the data augmentation experiments, we ran cross-lingual tests with SpecAugment and ShuffleAugment, two methods recently applied to audio data. The results show that ShuffleAugment improved cross-lingual recognition more effectively. In a final experiment, combining feature dimensionality reduction with ShuffleAugment yielded the best augmentation results, improving performance in both same-language and cross-lingual tests. Finally, in cross-lingual attack tests with synthetic speech, we found that such advanced synthetic speech cannot easily use feature transfer to confuse a recognition model that relies on embedding features, which indicates that these features are robust against this kind of attack.
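The two families of augmentation named above can be sketched on a frame-by-dimension feature matrix. This is a minimal illustration, assuming the features are a list of frames, each a list of floats. The masking functions follow the time-masking and frequency-masking idea of SpecAugment (its time-warping step is omitted), and `shuffle_segments` is a time-shuffling augmentation in the spirit of ShuffleAugment. The function names, mask widths, and segment count are hypothetical and are not taken from the thesis.

```python
import random


def time_mask(features, max_width, rng):
    """Zero out one run of up to max_width consecutive frames."""
    n_frames = len(features)
    width = rng.randint(0, min(max_width, n_frames))
    start = rng.randint(0, n_frames - width)
    return [
        [0.0] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(features)
    ]


def freq_mask(features, max_width, rng):
    """Zero out one band of up to max_width consecutive feature dimensions."""
    n_dims = len(features[0])
    width = rng.randint(0, min(max_width, n_dims))
    start = rng.randint(0, n_dims - width)
    return [
        [0.0 if start <= d < start + width else v for d, v in enumerate(frame)]
        for frame in features
    ]


def shuffle_segments(features, n_segments, rng):
    """Split the frames into contiguous segments and permute the segments."""
    n_frames = len(features)
    bounds = [round(i * n_frames / n_segments) for i in range(n_segments + 1)]
    segments = [features[bounds[i]:bounds[i + 1]] for i in range(n_segments)]
    rng.shuffle(segments)
    return [list(frame) for segment in segments for frame in segment]


# Toy feature matrix (hypothetical): 8 frames, 3 dimensions, no zero values.
rng = random.Random(0)
feats = [[t + 1.0, t + 2.0, t + 3.0] for t in range(8)]

masked_time = time_mask(feats, 3, rng)
masked_freq = freq_mask(feats, 2, rng)
shuffled = shuffle_segments(feats, 4, rng)
```

All three functions return new lists and leave the input untouched, so the same utterance can be augmented several times with different random draws.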
    Description: Master's
    National Chengchi University
    Department of Computer Science
    111753208
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0111753208
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File        Description  Size    Format
    320801.pdf               4112Kb  Adobe PDF

