ByteDance has developed technology that enables humans to have a conversation with a "socially intelligent agent," and that lets two such agents converse with each other.
The technology is called INFP, after the introverted, intuitive, feeling, and perceiving personality type, a nod to human traits. It is an audio-driven head generation framework for dyadic interaction that can animate a single portrait image to talk with a human or with another agent.
INFP prioritizes empathy, creativity, deep understanding, and a focus on the human experience, whether it is responding to a human voice or to another lifelike AI agent on screen.
This allows the conversation to flow smoothly and naturally, with human-like characteristics. Previous head generation work focused on one-sided communication or required the human interacting with the agent to manually switch roles.
It's unclear whether the company plans to make the technology available to users of any of its social platforms, including TikTok.
This latest AI agent model alternates between speaking and listening, guided by the input dyadic audio: the two-channel conversational audio from both participants, including how each signal's volume varies over time.
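The paper does not spell out its exact switching logic, but the idea of driving a speak/listen state from the dyadic audio can be illustrated with a toy sketch. Everything below (the function name, the margin threshold, the envelope comparison) is an invented stand-in, not ByteDance's actual algorithm:

```python
import numpy as np

# Rough illustration only (not ByteDance's actual method): choose a
# per-frame "speaking" vs. "listening" state for the agent by comparing
# the volume envelopes of the two channels in the dyadic audio input.
def speaking_mask(agent_env, partner_env, margin=0.1):
    """agent_env, partner_env: (T,) arrays of volume level over time."""
    return agent_env > partner_env + margin

# Fake envelopes: the agent talks for two frames, then the partner does.
agent = np.array([0.8, 0.7, 0.1, 0.05])
partner = np.array([0.1, 0.1, 0.9, 0.8])
mask = speaking_mask(agent, partner)
print(mask)  # [ True  True False False]
```

A real system would smooth this decision over time to avoid flickering between states on brief pauses.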
INFP combines what is known as motion-based head imitation and audio-guided motion-generation technology.
ByteDance developers describe the technology in a research paper released earlier this month. In the first stage, the framework learns facial communicative behaviors from real-life conversation videos and projects them into a low-dimensional motion latent space; motion latent codes, a compressed representation of the video, are then used to animate a static image.
The second stage learns the mapping from the input dyadic audio to those motion latent codes through denoising, enabling audio-driven head generation in interactive scenarios.
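The two-stage split described above can be sketched in miniature. Everything here is a toy stand-in (the random projections, the latent size, and the denoising loop are invented for illustration), not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 4  # toy size; the paper's latent dimensionality differs

# Stage 1 stand-in: compress per-frame facial-motion features into
# low-dimensional motion latent codes (a random linear projection here;
# the real model learns this from conversation videos).
def encode_motion(frames):
    proj = rng.standard_normal((frames.shape[1], LATENT_DIM))
    return frames @ proj  # (T, LATENT_DIM) motion latent codes

# Stage 2 stand-in: map dyadic audio to motion latents by iteratively
# denoising random latents toward an audio-conditioned target, echoing
# the paper's denoising-based audio-to-latent mapping.
def denoise_to_latents(dyadic_audio, steps=50):
    target = np.tanh(dyadic_audio @ rng.standard_normal((2, LATENT_DIM)))
    x = rng.standard_normal(target.shape)      # start from pure noise
    for _ in range(steps):
        x += 0.2 * (target - x)                # toy denoising step
    return x

T = 16
frames = rng.standard_normal((T, 8))           # fake facial-motion features
codes = encode_motion(frames)                  # stage 1: video -> latents
dyadic_audio = np.abs(rng.standard_normal((T, 2)))  # fake two-channel volume
latents = denoise_to_latents(dyadic_audio)     # stage 2: audio -> latents
print(codes.shape, latents.shape)  # (16, 4) (16, 4)
```

In the real framework, the stage-2 latents would then be decoded to warp the static portrait frame by frame.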
Developers also introduced DyConv, a large-scale dataset of rich dyadic conversations collected from the internet, meant to demonstrate the superior performance and effectiveness of their method.
But some of the images and voices used may seem familiar. The developers have used the voice of Celine Dion and images of Elon Musk, Mark Cuban and others. One example in the paper demonstrates the technology on an image resembling Cuban, and a previous version uses an image of Musk.
China does have a privacy law, the Personal Information Protection Law (PIPL), enacted in 2021. It aims to protect the personal information of individuals located within China, but it's not clear whether it would offer the same protection to people living outside the country, such as in the United States.
ByteDance wrote in the post that the purpose of this work is research. The images and audio used in the demos are from public sources, though some of the voices and images belong to well-known public figures.