Emotion-Controlled Personalized and Complete 3D Avatar Expression Animation Generation
Abstract: Speech-driven emotional expression animation for 3D avatars aims to synthesize 3D facial animations whose lip movements are synchronized with the input speech and which convey a range of emotional expressions. However, constrained by existing 3D face priors, current methods struggle to synthesize 3D facial animations that include the internal oral structure, leaving the final results lacking in realism. Moreover, most existing work focuses on synchronizing the avatar's lip movements with the speech and pays little attention to how emotional variation in the speech shapes facial expressions; as a result, the generated expression animations appear unnatural, the realism of the animation is limited, and the user experience suffers. To address these problems, this paper proposes an emotion-controllable method for generating personalized, complete 3D avatar expression animations, producing facial animations with a complete oral structure and rich emotional expressions and thereby improving the realism of 3D avatars. The method consists of three core modules: neutral expression animation generation with a complete oral structure, expression retrieval, and expression fusion. The neutral expression animation generation module first performs cross-modal mapping from speech to a 3D facial animation sequence with a Transformer-based autoregressive model and outputs a neutral facial animation sequence; a text-driven consistency loss, introduced through a cross-supervised training scheme, ensures synchronization between the input speech and the lip region. Within this module, the paper further proposes a facial-landmark-based 3D deformation algorithm for the oral structure, which deforms the oral model frame by frame and fuses it with the corresponding neutral facial animation sequence, yielding a neutral expression model sequence that contains a detailed and accurate oral structure. The expression retrieval module performs emotion recognition on the input speech sequence and face image and retrieves a 3D face model carrying the corresponding emotion. The expression fusion module uses a deep neural network to merge the neutral expression animation containing the oral structure with the emotional 3D face model, generating a 3D facial expression animation that both contains the oral structure and conveys emotional expressions.
In addition, this paper proposes an expression transition algorithm based on linear interpolation to achieve smooth transitions of the expression animation between multiple emotions. Experimental results show that the generated 3D facial animations, which contain the oral structure and emotional expressions, keep lip movements synchronized with the speech while effectively improving the realism of 3D avatars.
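The cross-modal mapping stage can be summarized schematically. The PyTorch sketch below shows an autoregressive Transformer decoder that attends to speech features and predicts one frame of vertex displacements at a time; the layer sizes, the speech feature dimension (768, as produced by a typical wav2vec-style encoder), and the vertex count (5023, as in FLAME-topology meshes) are illustrative assumptions rather than the paper's actual configuration, and the text-driven consistency loss used during training is omitted.

```python
import torch
import torch.nn as nn

class SpeechToFace(nn.Module):
    """Schematic autoregressive Transformer decoder mapping a speech feature
    sequence to per-frame 3D vertex displacements.  All sizes below are
    illustrative assumptions, not the paper's configuration."""

    def __init__(self, audio_dim=768, n_vertices=5023, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)        # speech features -> decoder memory
        self.motion_proj = nn.Linear(n_vertices * 3, d_model)  # past frames -> decoder input
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_vertices * 3)         # hidden state -> vertex offsets

    @torch.no_grad()
    def generate(self, audio_feats):
        """audio_feats: (1, T, audio_dim).  Returns (1, T, n_vertices * 3);
        each frame is conditioned on all previously generated frames."""
        T = audio_feats.shape[1]
        memory = self.audio_proj(audio_feats)
        out = torch.zeros(1, 1, self.head.out_features)        # neutral "start" frame
        for _ in range(T):
            tgt = self.motion_proj(out)
            L = tgt.shape[1]
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            h = self.decoder(tgt, memory, tgt_mask=causal)
            out = torch.cat([out, self.head(h[:, -1:])], dim=1)
        return out[:, 1:]                                      # drop the start frame
```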
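The landmark-driven placement of the oral model can likewise be illustrated. The paper's deformation algorithm is not reproduced here; the sketch below substitutes a standard least-squares similarity alignment (Umeyama/Procrustes) that maps oral-model landmarks onto the mouth landmarks of each animation frame, capturing the per-frame fitting step under that simplifying assumption.

```python
import numpy as np

def procrustes_align(src_pts, dst_pts):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping src_pts onto dst_pts; both are (K, 3) arrays of corresponding
    landmarks (Umeyama's method)."""
    mu_s, mu_d = src_pts.mean(axis=0), dst_pts.mean(axis=0)
    S, D = src_pts - mu_s, dst_pts - mu_d
    U, sig, Vt = np.linalg.svd(S.T @ D)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    s = (sig * np.array([1.0, 1.0, d])).sum() / (S ** 2).sum()
    t = mu_d - s * (R @ mu_s)
    return s, R, t

def place_oral_model(oral_vertices, oral_landmarks, mouth_landmarks):
    """Rigidly place (with uniform scale) the oral model so that its landmark
    set matches the mouth landmarks of the current animation frame."""
    s, R, t = procrustes_align(oral_landmarks, mouth_landmarks)
    return s * (oral_vertices @ R.T) + t
```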
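The paper performs expression fusion with a learned deep neural network; as a non-learned point of reference, the additive intuition behind such fusion can be written as a per-vertex offset transfer. This baseline is an illustration under the assumption of a shared mesh topology, not the proposed network.

```python
import numpy as np

def fuse_expression(neutral_frame, emotional_face, neutral_template, intensity=1.0):
    """Offset-transfer baseline: add the emotional displacement (emotional face
    minus the shared neutral template, identical topology) onto a neutral
    animation frame.  `intensity` scales the strength of the expression."""
    offset = emotional_face - neutral_template
    return neutral_frame + intensity * offset
```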
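Finally, the linear-interpolation transition admits a direct sketch. Assuming two emotion variants of the same animation with identical mesh topology (each frame an (N, 3) vertex array, a representation assumed here for illustration), ramping a blend weight from 0 to 1 over a transition window produces a smooth change of emotion.

```python
import numpy as np

def transition(frames_a, frames_b, n_transition):
    """Blend two emotion variants of the same animation over a window of
    n_transition frames; identical topology makes per-vertex linear
    interpolation well defined."""
    blended = []
    for t in range(n_transition):
        w = t / max(n_transition - 1, 1)   # ramps 0 -> 1 across the window
        blended.append((1.0 - w) * frames_a[t] + w * frames_b[t])
    return blended
```

Because the weight is monotone in the frame index, lip articulation from the ongoing animation is preserved while the emotional component changes gradually, which matches the smooth-transition behavior the abstract describes.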