TY - JOUR
T1 - Deepfake Detection Using Spatiotemporal Transformer
AU - Kaddar, Bachir
AU - Fezza, Sid Ahmed
AU - Akhtar, Zahid
AU - Hamidouche, Wassim
AU - Hadid, Abdenour
AU - Serra-Sagristà, Joan
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s)
PY - 2024/9/12
Y1 - 2024/9/12
N2 - Recent advances in generative models and the availability of large-scale benchmarks have made deepfake video generation and manipulation easier. The number of new hyper-realistic deepfake videos created for malicious purposes is increasing dramatically, creating a pressing need for effective deepfake detection methods. Although many existing detection approaches, particularly CNN-based methods, show promising results, they suffer from several drawbacks. In general, they generalize poorly to unseen/new deepfake generation methods. A key reason for this shortcoming is that CNN-based methods focus on local spatial artifacts, which are unique to each manipulation method; it is therefore hard to learn forgery traces common to different manipulation methods without considering dependencies that extend beyond the local receptive field. To address this problem, this article proposes a framework that combines a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) to improve detection accuracy and enhance generalizability. Our method, named HCiT, exploits the ability of CNNs to extract meaningful local features together with the ViT’s self-attention mechanism, which explicitly learns discriminative global contextual dependencies at the frame level. In this hybrid architecture, the high-level feature maps extracted by the CNN are fed into the ViT model, which determines whether a given video is fake or real. Experiments were performed on the FaceForensics++, DeepFake Detection Challenge preview, and Celeb-DF datasets, and the results show that the proposed method significantly outperforms state-of-the-art methods. In addition, HCiT shows a strong capacity for generalization across datasets covering various deepfake generation techniques.
KW - convolutional neural network
KW - deepfake video
KW - detection
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85190245677&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/99f65787-4c47-394b-ac15-ea4870e4ac67/
U2 - 10.1145/3643030
DO - 10.1145/3643030
M3 - Article
AN - SCOPUS:85190245677
SN - 1551-6857
VL - 20
SP - 1
EP - 21
JO - ACM Transactions on Multimedia Computing, Communications, and Applications
JF - ACM Transactions on Multimedia Computing, Communications, and Applications
IS - 11
M1 - 345
ER -
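
Note: the abstract above describes a hybrid architecture in which high-level CNN feature maps are fed to a ViT-style encoder for frame-level real/fake classification. The PyTorch sketch below illustrates that general idea only; the backbone choice (ResNet-50), the token and embedding sizes, the encoder depth, and all module names are assumptions made for illustration, not the authors' HCiT implementation.

# Minimal sketch of a CNN + ViT hybrid of the kind the abstract describes.
# All sizes and the ResNet-50 backbone are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridCnnVit(nn.Module):
    def __init__(self, embed_dim=768, depth=6, num_heads=8):
        super().__init__()
        # CNN backbone: keep everything up to the last conv stage,
        # dropping the average pool and the classification head.
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 2048, 7, 7)
        # Project each spatial position of the feature map to a token.
        self.proj = nn.Linear(2048, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 7 * 7 + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 2)  # real vs. fake

    def forward(self, x):                      # x: (B, 3, 224, 224) face crops
        f = self.cnn(x)                        # high-level maps: (B, 2048, 7, 7)
        tokens = f.flatten(2).transpose(1, 2)  # (B, 49, 2048), one token per location
        tokens = self.proj(tokens)             # (B, 49, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed
        z = self.encoder(z)                    # self-attention over all locations
        return self.head(z[:, 0])              # classify from the [CLS] token

# Usage: frame-level logits; a video-level decision could average per-frame scores.
model = HybridCnnVit()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 2])

Feeding the CNN's 7x7 feature grid to the encoder as 49 tokens (rather than raw image patches) is one common way to realize the "CNN features into ViT" design the abstract mentions; the paper itself should be consulted for the actual architecture and hyperparameters.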