StyleGeoSync: High-Fidelity Lip Sync with Style-Aware Geometric Guidance of 3D Facial Segmentation

Anonymous Code Repository

Abstract

Generalized audio-driven lip synchronization is of great significance for talking head generation and its downstream applications. However, existing methods still struggle to eliminate artifacts in the generated frames and to preserve the personalized speaking style of the template video. In this paper, we propose StyleGeoSync, a generalized high-fidelity lip-sync framework that consists of style-aware audio-driven spatial geometry construction and 3D facial segmentation-guided texture generation. To achieve personalized and style-consistent lip synchronization, we introduce a style-conditioning mechanism that disentangles lip motion into audio-related absolute dynamics and audio-independent speaking style. Moreover, to fundamentally eliminate visual artifacts, we design texture generation as a standalone module, which prevents interference from the reference image during lip motion synthesis. By decoupling lip motion and image texture into two stages, the framework generates lip-synced videos of higher image quality from the given audio and transfers a specific speaking style from the reference video. Experimental results on the CelebV-HQ and HDTF datasets demonstrate that our method outperforms existing methods in image quality, lip synchronization, and identity preservation.
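The following is a minimal sketch of the two-stage decoupling described above: stage 1 maps audio features and an audio-independent style code to facial geometry (to be rendered as a 3D facial segmentation map), and stage 2 synthesizes texture conditioned only on that segmentation plus the template frame. This is not the released implementation; all module names, feature dimensions, and tensor shapes are illustrative assumptions.

```python
# Hypothetical sketch of the two-stage StyleGeoSync pipeline; dimensions and
# layer choices are assumptions, not the authors' actual architecture.
import torch
import torch.nn as nn


class StyleAwareGeometryGenerator(nn.Module):
    """Stage 1: audio-driven geometry conditioned on an audio-independent style code."""

    def __init__(self, audio_dim=80, style_dim=64, geom_dim=128):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, 256, batch_first=True)
        self.style_encoder = nn.Sequential(nn.Linear(geom_dim, 128), nn.ReLU(),
                                           nn.Linear(128, style_dim))
        self.decoder = nn.Sequential(nn.Linear(256 + style_dim, 256), nn.ReLU(),
                                     nn.Linear(256, geom_dim))

    def forward(self, audio_feats, reference_geometry):
        # audio_feats: (B, T, audio_dim) mel features from the driving audio
        # reference_geometry: (B, geom_dim) geometry summary of the style reference video
        audio_h, _ = self.audio_encoder(audio_feats)              # (B, T, 256)
        style = self.style_encoder(reference_geometry)            # (B, style_dim)
        style = style.unsqueeze(1).expand(-1, audio_h.size(1), -1)
        return self.decoder(torch.cat([audio_h, style], dim=-1))  # (B, T, geom_dim)


class SegmentationGuidedTextureGenerator(nn.Module):
    """Stage 2: texture synthesis guided by a 3D facial segmentation map and a template frame."""

    def __init__(self, seg_channels=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(seg_channels + 3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, seg_map, template_frame):
        # seg_map: (B, seg_channels, H, W); template_frame: (B, 3, H, W)
        return self.net(torch.cat([seg_map, template_frame], dim=1))
```

Keeping the texture generator blind to the audio branch mirrors the decoupling claim: lip motion is decided entirely in stage 1, so stage 2 only has to render plausible texture for a given segmentation.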

Single Motion + Multiple Portraits


Given a template video, driving audio, and a reference video that provides the speaking style, our method generates high-fidelity lip-synced results in the specified speaking style under the guidance of 3D facial segmentation.
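Below is a hedged usage sketch of these three inputs, reusing the illustrative modules from the snippet above; the preprocessing (mel extraction, geometry fitting, segmentation rendering) is replaced by placeholder tensors and is not the repository's actual API.

```python
# Assumes StyleAwareGeometryGenerator and SegmentationGuidedTextureGenerator
# from the previous sketch are defined in the same file.
import torch

geom_gen = StyleAwareGeometryGenerator()
tex_gen = SegmentationGuidedTextureGenerator()

audio_feats = torch.randn(1, 100, 80)          # driving audio: 100 frames of mel features (placeholder)
ref_geometry = torch.randn(1, 128)             # geometry summary of the style reference video (placeholder)
template_frame = torch.randn(1, 3, 256, 256)   # one frame of the template video (placeholder)

geometry = geom_gen(audio_feats, ref_geometry)  # (1, 100, 128) audio-driven, style-conditioned geometry
seg_map = torch.randn(1, 19, 256, 256)          # 3D facial segmentation rendered from the geometry (rendering omitted)
frame = tex_gen(seg_map, template_frame)        # (1, 3, 256, 256) lip-synced output frame
```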


Comparison with other methods in the cross-driving setting

