Earlier this week, Microsoft Research Asia announced VASA-1, an AI model that can create a synchronized animated video of a person talking or singing from a single photo and an existing audio track. Ars Technica: In the future, it could power virtual avatars that render locally without requiring a video feed, or it could let anyone with a similar tool take a photo of a person they find online and make that person appear to say whatever they want. “It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors,” reads the abstract of the accompanying research paper, titled “VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time.” It is the work of Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo.
The VASA framework (short for “Visual Affective Skills Animator”) uses machine learning to analyze a still image along with a speech audio clip, then generates a realistic video with accurate facial expressions, head movements, and lip-syncing to the audio. It does not clone or simulate voices (as some other Microsoft research does); instead, it relies on an existing audio input that could be specially recorded or spoken for a particular purpose.
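VASA-1 itself has not been released and has no public API, so purely as an illustration of the contract described above (one still photo and one pre-existing audio clip in, one lip-synced video out), a hypothetical interface might look like the following sketch. The TalkingFaceRequest type and generate_talking_face function are assumptions for this example, not part of any real VASA-1 release.

```python
# Hypothetical sketch of the input/output contract described in the article:
# a single still photo plus an existing audio track go in, and a short video
# with matching facial expressions, head motion, and lip-sync comes out.
# None of these names correspond to a real, public VASA-1 API.

from dataclasses import dataclass
from pathlib import Path


@dataclass
class TalkingFaceRequest:
    portrait: Path                          # single still photo of a face
    audio: Path                             # pre-recorded speech or singing; used as-is, no voice cloning
    output: Path = Path("talking_face.mp4") # where the generated video would be written


def generate_talking_face(request: TalkingFaceRequest) -> Path:
    """Placeholder for the model call the article describes: animate the
    portrait so its expressions, head movements, and lips follow the audio."""
    raise NotImplementedError("VASA-1 is research-only and has no public API")


if __name__ == "__main__":
    req = TalkingFaceRequest(portrait=Path("person.jpg"), audio=Path("speech.wav"))
    try:
        print(generate_talking_face(req))
    except NotImplementedError as err:
        print(f"Not runnable end to end: {err}")
```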