    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/139218


    Title: 基於轉換器之跨語言語者辨識強健性分析
    On the Robustness of Cross-Lingual Speaker Recognition Using Transformer-Based Approaches
    Authors: 陳威妤
    Chen, Wei-Yu
    Contributors: 廖文宏
    Liao, Wen-Hung
    陳威妤
    Chen, Wei-Yu
    Keywords: 語者辨識
跨語言語料庫
    深度神經網路
    轉換器
    對抗例攻擊
    Speaker Recognition
    Cross-lingual Dataset
    Deep Neural Networks
    Transformer
    Adversarial Attack
    Date: 2022
    Issue Date: 2022-03-01 17:21:09 (UTC+8)
Abstract: Speaker recognition is widely used in everyday life, from voice assistants to criminal forensics. As deep learning techniques advance, speaker recognition accuracy has steadily improved, yet most studies focus on a single language; cross-lingual speaker recognition is rarely addressed, and cross-lingual datasets remain scarce. This study records a cross-lingual dataset, MET-40, in which every participant is recorded in three languages (Mandarin, English, and Taiwanese). The dataset comprises 40 participants, half male and half female, for a total of 740 minutes of audio. The Mandarin, English, and Taiwanese texts are mainly taken from elementary school textbooks, with some English texts drawn from the TIMIT corpus; after recording, each participant's fluency in each language is assessed.
  This thesis adopts transformer- and convolution-based network architectures to investigate single-language, mixed-language, and cross-lingual speaker recognition. The trained models include ResNet, Vision Transformer (ViT), and Convolutional Vision Transformer (CvT), combined with three common acoustic features (spectrogram, Mel spectrogram, and Mel-frequency cepstral coefficients). The difference between the mixed-language and cross-lingual settings lies in whether the language to be recognized is included in the training data: in the mixed-language setting the test language is present in the training set, whereas in the cross-lingual setting it is not. On MET-40, single-language models reach up to 97.16% accuracy, while models trained on two or more languages reach up to 99.17%. In cross-lingual recognition, training on more languages improves a model's generalization: in our experiments, single-language models reach at most 79.64% cross-lingual accuracy, whereas models trained on two or more languages reach up to 90.92%. The results show that CvT is the least sensitive to the choice of extracted features and generalizes best, achieving the highest overall accuracy in the single-language, mixed-language, and cross-lingual settings.
  A model's robustness is critical to the security of deployed applications, so this thesis also analyzes how adversarial attacks affect the different speaker recognition models. The adversarial attack experiments show that when the same language dataset is used for training and testing, FGSM and PGD both produce effective attacks. We further examine the transferability of cross-lingual attacks: attacks on models using spectrogram or Mel-spectrogram features do not transfer, whereas MFCC features, despite excelling at speaker recognition, are vulnerable to adversarial perturbations that degrade accuracy. Attacks can be mounted even without access to the training data; cross-lingual FGSM attacks reduce recognition accuracy by 31.57% on average, so models built on MFCC features require extra protection.
Speaker recognition is widely used in daily life, ranging from voice assistants to criminal forensics. With the rapid progress of deep learning, the accuracy of speaker recognition has increased accordingly. However, most studies focus on a single language; cross-lingual speaker recognition is rarely investigated, and cross-lingual datasets are scarce. This study collects a trilingual (Mandarin, English, and Taiwanese) cross-lingual dataset named MET-40, in which every participant is recorded in all three languages. A total of 40 participants (20 male, 20 female) contributed to the dataset, which contains 740 minutes of audio. The Mandarin, English, and Taiwanese texts are mainly taken from elementary school textbooks, and some English texts are drawn from the TIMIT corpus. The fluency of each participant in each language is also evaluated.

We employ ResNet, Vision Transformer (ViT), and Convolutional Vision Transformer (CvT) in combination with three acoustic features, namely spectrogram, Mel spectrogram, and Mel-frequency cepstral coefficients (MFCC), for single-language, mixed-language, and cross-lingual speaker recognition tasks. In the mixed-language setting, the language to be tested is included in the training set, while in the cross-lingual scenario it is not. Experimental results show that single-language models reach up to 97.16% accuracy, and mixing two or more languages improves the best accuracy to 99.17%. In the cross-lingual setting, accuracy drops significantly, to at most 79.64% for single-language models, because the spoken language is absent from the training data. When two or more languages are employed for training, cross-lingual accuracy increases to 90.92%. Overall, CvT-based models demonstrate the best generalization ability in all cases.
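
As background for the three acoustic features named above, the sketch below shows one common way to compute them with librosa; the sampling rate, frame, and filter-bank parameters are illustrative assumptions rather than the configuration actually used in the thesis.

    # Minimal sketch: extract spectrogram, Mel spectrogram, and MFCC from a wav file.
    # Parameter values (sr, n_fft, hop_length, n_mels, n_mfcc) are assumptions for illustration.
    import numpy as np
    import librosa

    def extract_features(wav_path, sr=16000, n_fft=512, hop_length=160,
                         n_mels=80, n_mfcc=40):
        y, sr = librosa.load(wav_path, sr=sr)

        # Linear-frequency magnitude spectrogram, converted to dB
        stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        spectrogram = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

        # Mel spectrogram in dB
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        mel_db = librosa.power_to_db(mel, ref=np.max)

        # Mel-frequency cepstral coefficients (MFCC)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    hop_length=hop_length, n_mfcc=n_mfcc)

        return spectrogram, mel_db, mfcc

Each of the three 2-D feature maps can then be fed to the image-style classifiers (ResNet, ViT, CvT) as input.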

The robustness of a model is critical to security in practical applications. Therefore, we analyze how adversarial attacks affect the different speaker recognition models. Experimental results reveal that when the same language dataset is used for training and testing, effective FGSM and PGD attacks can be generated. For cross-lingual models, however, adversarial attacks crafted on spectrogram or Mel-spectrogram features are not transferable. Finally, when MFCC is chosen as the acoustic feature, extra caution is needed: attacks can still succeed without access to the training data, and cross-lingual FGSM attacks reduce the recognition rate by 31.57% on average.
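
FGSM and PGD both perturb the input feature map in the direction of the loss gradient, the latter iteratively. The following is a minimal PyTorch sketch of the two attacks, assuming a trained classifier `model` that maps a batched feature tensor `x` to speaker logits; the epsilon and step-size values are illustrative assumptions, not the settings evaluated in the thesis.

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, label, epsilon=0.01):
        # Single-step attack: move the input in the sign of the loss gradient.
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        loss.backward()
        return (x_adv + epsilon * x_adv.grad.sign()).detach()

    def pgd_attack(model, x, label, epsilon=0.01, alpha=0.003, steps=10):
        # Iterative attack: repeat small gradient-sign steps and project the
        # result back into the epsilon-ball around the original input.
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), label)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = torch.clamp(x_adv, x - epsilon, x + epsilon)
        return x_adv.detach()

In a transferability test, the adversarial examples are crafted against one model (or one language's model) and then evaluated on another.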
Description: Master's thesis
National Chengchi University
Department of Computer Science
108753131
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0108753131
    Data Type: thesis
    DOI: 10.6814/NCCU202200288
Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

File          Size      Format
313101.pdf    4788 KB   Adobe PDF

