Emotion-Controlled Personalized and Complete 3D Avatar Expression Animation Generation
-
Graphical Abstract
-
Abstract
Speech-driven emotional expression animation of 3D avatars aims to generate 3D facial animations whose lip movements are synchronized with the input speech and that also convey a range of emotional expressions. However, owing to the limitations of 3D face priors, existing methods struggle to synthesize 3D facial animations with internal oral structures, which reduces the realism of the final result. In addition, despite advances in this field, most existing research focuses predominantly on synchronizing the lip movements of 3D avatars with the spoken words, while insufficient attention is given to the significant role that emotional fluctuations play in shaping facial expressions. As a result, the generated expression animations are not sufficiently natural and the realism of the 3D facial animation is limited, which degrades the user experience. To address these problems, this paper proposes an emotion-controlled, personalized, and complete 3D avatar expression animation generation method that produces facial animation containing a detailed representation of the inner oral structure as well as a wide array of emotional expressions, thereby improving the realism of 3D facial animations. The method consists of three core modules: neutral expression animation generation with a complete oral structure, expression retrieval, and expression fusion. The first module outputs a neutral expression animation sequence: it performs cross-modal mapping from speech to a 3D facial animation sequence with a Transformer-based auto-regressive model and introduces a text-driven consistency loss through cross-supervised training to ensure synchronization between the input speech and the lip region. Within this module, this paper also proposes an oral structure 3D model deformation algorithm based on facial landmarks.
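The abstract does not detail the landmark-based deformation algorithm. As a rough, hypothetical illustration of the general idea only (not the authors' method), an oral-structure mesh can be driven per frame by estimating a similarity transform from a set of reference jaw/lip landmarks to the landmarks of the current animation frame, here via standard orthogonal-Procrustes (Umeyama) alignment:

```python
import numpy as np

def align_oral_model(oral_verts, src_landmarks, dst_landmarks):
    """Illustrative sketch (assumed, not the paper's algorithm):
    fit a similarity transform (scale * rotation + translation) that
    maps reference landmarks onto the current frame's landmarks, then
    apply it to the oral-structure mesh vertices (N x 3 array)."""
    # Center both landmark sets
    mu_s = src_landmarks.mean(axis=0)
    mu_d = dst_landmarks.mean(axis=0)
    s_c = src_landmarks - mu_s
    d_c = dst_landmarks - mu_d
    # Least-squares rotation via SVD of the cross-covariance
    H = d_c.T @ s_c
    U, _, Vt = np.linalg.svd(H)
    R = U @ Vt
    if np.linalg.det(R) < 0:      # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    # Isotropic scale and translation (Umeyama-style)
    scale = np.trace(H @ R.T) / (s_c ** 2).sum()
    t = mu_d - scale * (R @ mu_s)
    return scale * (oral_verts @ R.T) + t
```

Applied per frame, this keeps the teeth/tongue geometry attached to the moving lower face; the paper's actual deformation may well be non-rigid and more elaborate.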
This algorithm enables dynamic deformation of the oral structure model, which is then seamlessly fused with the corresponding neutral expression animation sequence, yielding a neutral expression animation sequence that includes a detailed and accurate representation of the oral structure. The expression retrieval module obtains a 3D face model with an emotional expression by recognizing and retrieving the emotion from the input speech sequence and face image. The expression fusion module merges the neutral expression animation, including its oral structure, with the emotionally expressive 3D face model through a deep neural network. The fused 3D facial expression animation not only keeps lip movements synchronized with the speech but also conveys a range of emotions. In addition, this paper proposes an expression transition algorithm based on linear interpolation to achieve smooth transitions between different emotions in the 3D facial animation. Experimental results demonstrate the effectiveness of the proposed method: the generated 3D facial animation, with both the oral structure and emotional representation, maintains lip movements synchronized with the speech and effectively improves the realism of 3D avatars.
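The expression transition algorithm is described only at the level of linear interpolation. A minimal sketch of such a transition, assuming the two expressions are meshes with shared topology stored as vertex arrays, is:

```python
import numpy as np

def expression_transition(verts_a, verts_b, num_frames):
    """Linearly interpolate between two expression meshes of the same
    topology (N x 3 vertex arrays) to produce a smooth transition
    sequence of shape (num_frames, N, 3)."""
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        frames.append((1.0 - t) * verts_a + t * verts_b)
    return np.stack(frames)
```

The first frame reproduces the source expression exactly and the last frame the target expression, so consecutive emotion segments can be concatenated without popping artifacts.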
-