An Implicit Motion Guided Latent Completion Framework for Audio-Driven Talking Head Synthesis

Xiang Li, Xiaoli Huang, Xiaonan Luo
Article
2026 / Volume 9 / Pages 2607-2628
Published 25 April 2026

Abstract

Audio-driven talking head synthesis must simultaneously achieve photorealistic visual quality, accurate lip synchronization, robust identity preservation, and computational efficiency. With the rapid proliferation of digital humans, these capabilities are becoming increasingly critical for widespread applications in human-computer interaction. Methods relying on raw audio conditioning suffer from identity-averaged dynamics, while those using explicit geometric priors (2D landmarks or 3DMM coefficients) face inherent information bottlenecks that cause over-smoothing and loss of fine-grained lip details. We present IMGTalk, an Implicit Motion Guided latent completion framework that addresses these limitations through a principled two-stage decoupled design. In Stage 1, a Transformer-based Audio-to-Implicit Motion Encoder learns a mapping from HuBERT audio features to implicit expression deformation, conditioned on each subject's neutral silent state to enable cross-identity adaptation without per-subject fine-tuning. In Stage 2, an Implicit Motion-Conditioned UNet (IMC-UNet) completes the masked mouth region entirely within the VAE latent space, driven by the predicted expression deformation via positional-encoded cross-attention. We further introduce Reference Spatial Cross-Attention, a reference injection mechanism that replaces naive channel-wise concatenation by projecting the reference latent at every U-Net layer, enabling spatially selective retrieval of lip and teeth appearance information. Extensive experiments on the HDTF and TFHP benchmarks demonstrate that IMGTalk achieves state-of-the-art performance across all evaluation metrics.
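
To make the reference injection idea concrete, the following is a minimal NumPy sketch of spatial cross-attention between a U-Net feature map and a reference latent. It is an illustration under stated assumptions, not the IMGTalk implementation: single-head scaled dot-product attention, a residual connection, and the function name, projection matrices (`w_q`, `w_k`, `w_v`), and tensor shapes are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_spatial_cross_attention(feat, ref, w_q, w_k, w_v):
    """Hypothetical sketch: each spatial position of `feat` attends over
    all spatial positions of the reference latent `ref`.

    feat: (C, H, W) feature map from one U-Net layer (assumed layout)
    ref:  (C_ref, H_r, W_r) reference latent (assumed layout)
    w_q:  (C, d), w_k: (C_ref, d), w_v: (C_ref, C) projection matrices
    """
    c, h, w = feat.shape
    # Flatten spatial grids into token sequences.
    q = feat.reshape(c, h * w).T @ w_q            # (H*W, d)
    ref_tokens = ref.reshape(ref.shape[0], -1).T  # (H_r*W_r, C_ref)
    k = ref_tokens @ w_k                          # (H_r*W_r, d)
    v = ref_tokens @ w_v                          # (H_r*W_r, C)
    # Each query position gets its own distribution over reference positions.
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    out = attn @ v                                # (H*W, C)
    # Residual connection (an assumption of this sketch).
    return feat + out.T.reshape(c, h, w), attn


rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))
ref = rng.normal(size=(4, 4, 4))
w_q = rng.normal(size=(8, 16))
w_k = rng.normal(size=(4, 16))
w_v = rng.normal(size=(4, 8))
out, attn = reference_spatial_cross_attention(feat, ref, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Unlike channel-wise concatenation, which pairs each output position only with the reference content at the same spatial location, the attention weights here let every position draw from any location in the reference, which is the kind of spatially selective retrieval the abstract describes.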

Keywords

audio-driven talking head, lip synchronization, digital human, human-computer interaction, implicit motion representation