政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/151504
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/151504


    Title: 結合肢體動作識別及擴散模型的文字生成舞蹈機制
    Text-to-dance mechanism using human pose estimation and stable diffusion
    Authors: 洪健庭
    Hung, Chien-Ting
    Contributors: 廖文宏
    Liao, Wen-Hung
    洪健庭
    Hung, Chien-Ting
    Keywords: 深度學習
    肢體辨識
    生成式人工智慧
    文字生成舞蹈
    Deep Learning
    Human Pose Recognition
    Generative AI
    Text-to-Dance
    Date: 2024
    Issue Date: 2024-06-03 11:42:54 (UTC+8)
    Abstract: 肢體辨識在機器視覺領域是一個很重要的問題,如何在影像以及圖像中抓取人體骨骼的節點(如肩膀、手肘、手腕等)座標,不僅可以知道人物在圖像中的位置,還可藉由辨識結果去預測該人物在做什麼動作。
    擴散模型(Diffusion Model)在近年得到廣大的關注,最令人驚豔的是其在AIGC(AI Generated Content)領域的表現,許多文字生成圖片的應用都是基於擴散模型,包含DALL·E、Imagen、Midjourney和Stable Diffusion等。除了在圖片生成任務上表現出色之外,其在其他生成任務上的效果也相當卓越。
    本論文探討使用Stable Diffusion和OpenPose來生成流暢的舞蹈動作:前者利用自定義文字產生人物外觀以及單位舞蹈動作的排序,並使用線性轉換的方式串接整體舞蹈動作;後者在連續舞蹈動作任務中進行肢體辨識,以利自由設定角色外觀以及排序舞蹈動作。
    結合上述方式,本論文提出的文字產生舞蹈動作方法,不僅為影像製作領域引入一種新的模式,更讓製作過程中可以更方便地設定角色、場景以及角色動作。過往需要逐幀繪製,或由真人根據設定動作演出;若再加上需要更換角色的情況,本方法相比傳統做法可節省許多步驟及時間。這個方法不僅擴展了影像生成的研究範疇,同時結合AIGC的方法,為實際應用提供了一種可行的解決方案。
    Pose estimation is a significant problem in the field of computer vision. It involves capturing the coordinates of skeletal joints (such as shoulders, elbows, wrists, etc.) of a human body in images and videos. This not only provides information about the person's position in the image but also enables predicting their actions based on the recognized joints.
    In recent years, diffusion models have gained significant attention, particularly for their impressive performance in the field of AI Generated Content (AIGC). Many text-to-image generation applications, including DALL·E, Imagen, Midjourney, and Stable Diffusion, are based on diffusion models. These models have shown outstanding performance not only in image generation tasks but also in various other generative tasks.
    This thesis explores the use of Stable Diffusion and OpenPose to generate fluid dance movements. The former, within the framework of this thesis, generates custom character appearances and ordered unit-level dance movements from custom text inputs; these movements are then concatenated using linear transformations to create coherent overall dance sequences. The latter, OpenPose, performs pose estimation on continuous dance movements. This framework enables flexible configuration of character appearances and the sequencing of dance movements.
    Combining the above approaches, the proposed method of generating dance movements from text not only introduces a new mode of production into the field of image and video creation, but also makes it easier to configure characters, scenes, and character actions during production. Previously, each frame had to be drawn by hand, or performed by real actors following prescribed movements; when characters also need to be swapped, our method saves considerable effort and time compared to traditional approaches. In conjunction with AIGC methods, the proposed mechanism provides a viable solution for practical applications.
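    The linear-transformation step described above — blending the end of one unit dance clip into the start of the next — can be sketched with plain linear interpolation over pose keypoints. This is an illustrative sketch only, not the thesis's actual implementation: the 18-joint (x, y) layout follows the common OpenPose convention, and the pose arrays below are placeholders.

    ```python
    import numpy as np

    def blend_poses(pose_a, pose_b, num_frames):
        """Linearly interpolate between two (18, 2) keypoint arrays,
        returning a (num_frames, 18, 2) transition sequence whose first
        frame equals pose_a and last frame equals pose_b."""
        ts = np.linspace(0.0, 1.0, num_frames)
        return np.stack([(1.0 - t) * pose_a + t * pose_b for t in ts])

    # Placeholder poses: last frame of clip 1, first frame of clip 2.
    pose_end = np.zeros((18, 2))
    pose_start = np.ones((18, 2))

    transition = blend_poses(pose_end, pose_start, num_frames=5)
    print(transition.shape)  # (5, 18, 2)
    ```

    A transition sequence like this can then serve as the per-frame pose conditioning for the image-generation stage, so consecutive clips join without an abrupt jump.
    
    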
    Reference: [1] Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P. S., & Sun, L. (2023). A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT. arXiv preprint arXiv:2303.04226.
    [2] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
    [3] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7291-7299).
    [4] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.
    [5] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
    [6] Zhang, L., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543.
    [7] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
    [8] Li, R., Yang, S., Ross, D. A., & Kanazawa, A. (2021). AI Choreographer: Music-conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13401-13412).
    [9] Tseng, J., Castellon, R., & Liu, K. (2023). Edge: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 448-458).
    [10] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., & Bermano, A. H. (2022). Human motion diffusion model. arXiv preprint arXiv:2209.14916.
    [11] Wang, T., Li, L., Lin, K., Lin, C. C., Yang, Z., Zhang, H., ... & Wang, L. (2023). DisCo: Disentangled Control for Referring Human Dance Generation in Real World. arXiv preprint arXiv:2307.00040.
    [12] Zhang, M., Guo, X., Pan, L., Cai, Z., Hong, F., Li, H., ... & Liu, Z. (2023). ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model. arXiv preprint arXiv:2304.01116.
    [13] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
    [14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
    [15] Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
    [16] Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12873-12883).
    [17] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 (pp. 234-241). Springer International Publishing.
    [18] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
    [19] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3), 37-52.
    [20] Hung-yi Lee, Machine Learning 2023 (Generative AI) lecture series
    https://youtube.com/playlist?list=PLJV_el3uVTsOePyfmkfivYZ7Rqr2nMk3W&si=bLQJWEJsVmMG1HL3
    [21] Hugging Face – The AI community building the future.
    https://huggingface.co/
    [22] Civitai | Stable Diffusion models, embeddings, LoRAs and more
    https://civitai.com/
    [23] Wikipedia:Linear interpolation
    https://en.wikipedia.org/wiki/Linear_interpolation
    Description: Master's thesis
    National Chengchi University
    Executive Master Program of Computer Science
    110971024
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0110971024
    Data Type: thesis
    Appears in Collections:[Executive Master Program of Computer Science of NCCU] Theses

    Files in This Item:

    102401.pdf (4608 KB, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.


