Abstract:
DeepFake technology, which has emerged and developed rapidly in recent years, has profoundly changed how multimedia content is forged, posing severe new challenges to content security in cyberspace. This paper focuses on video face swapping, the most harmful form of DeepFake forgery, and proposes a two-stream detection method that exploits eye and mouth artifacts based on the I3D (Inception3D) network. First, since most existing forgery detection methods ignore the important temporal information in video, and the commonly used 2D convolutions perceive only the spatial domain, we extend them to I3D convolutions, enabling the network to learn spatial and temporal information simultaneously. Meanwhile, by adjusting the I3D network structure, we convert its original multi-class classification design into an efficient network better suited to the binary classification task of DeepFake forensics. Furthermore, since the eye and mouth regions are harder to forge and more likely to retain tampering artifacts after video face swapping, we propose a two-stream network structure based on these two regions, and the outputs of the two streams are fused for a collaborative decision. Extensive experiments on widely used databases, including Celeb-DF, DFDC, DeepFakeDetection, and FaceForensics++, verify that the proposed method substantially improves detection accuracy and computational efficiency over the state-of-the-art Xception and standard I3D networks.
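The collaborative decision between the eye stream and the mouth stream can be sketched as below. The abstract does not specify the fusion rule, so averaging the two streams' softmax probabilities is an illustrative assumption, and the function names (`softmax`, `fuse_two_streams`) are hypothetical, not the paper's implementation.

```python
import math

def softmax(logits):
    """Convert a pair of binary-classification logits to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_two_streams(eye_logits, mouth_logits):
    """Collaborative decision over the two streams.

    Averages the per-stream softmax probabilities (an assumed fusion rule)
    and returns (predicted_class, fused_probabilities), where class 0 = real
    and class 1 = fake.
    """
    p_eye = softmax(eye_logits)
    p_mouth = softmax(mouth_logits)
    fused = [(a + b) / 2.0 for a, b in zip(p_eye, p_mouth)]
    label = 0 if fused[0] >= fused[1] else 1
    return label, fused

# Example: the eye stream leans "fake" and the mouth stream is strongly
# "fake", so the fused decision is class 1 (fake).
label, probs = fuse_two_streams([0.2, 1.1], [-0.5, 2.0])
```

In practice each stream would be an I3D network over cropped eye or mouth clips producing the logits; the fusion step itself is cheap, so the collaborative decision adds negligible cost over running the two streams.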