In a very short time, generative AI has become extraordinarily adept at creating cogent text and speech and producing captivating imagery, including motion video. As impressive as the results are, there is still room for significant improvement. Adeia is working on solutions across several fronts, but today we'd like to focus on one particular challenge: realistic AI-generated images of people who speak.
Gen AI is already being used to create commentary – and commentators – for all kinds of video, leveraging sophisticated architectures like generative adversarial networks (GANs), latent diffusion models, and hybrid approaches that combine multiple synthesis methodologies. The types of video include explainers, how-to’s, unboxings, and more. This commentary, originally audio-only, is increasingly delivered by AI-generated graphical characters. Examples were on display in abundance at CES 2026.
Eventually, however, it is going to be common to have narration delivered by photorealistic, AI-generated humans. At the same time, agentic AI is coming, and it will be big. Agents are artificial intelligences that can perform complex tasks autonomously. These agents are going to become ubiquitous, and they are going to get human-looking avatars, generated by the same AI-based technologies that will generate talking heads for video content.
AI tools already exist that require nothing more than a prompt to generate a script and a talking-head video to deliver it, complete with user-specified gender, skin tone, facial features, age, and vocal range. The results are impressive, but they don’t seem entirely natural yet. Before AI can generate a truly natural-looking video of a human speaker, a fundamental question must be answered: how do you define “natural”?
The real challenge comes with the next question: how do you even measure naturalness? How do you measure the generated performance quantitatively? How can you computationally evaluate the algorithms generating talking-head videos? The number of variables is enormous, and many are inextricably linked to others.
Every spoken language is built from a set of elemental sounds called phonemes. Each phoneme corresponds to a specific mouth shape called a viseme. Phonemes and visemes have to match. For example, when your generated image of a person says the letter O, the image’s mouth should form a rounded shape. Phonemes and visemes should of course have precise temporal alignment, but synchronization is only the first challenge that must be met to create truly believable characters.
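The phoneme-to-viseme relationship described above can be sketched in a few lines of code. This is a minimal illustration, not a production lip-sync pipeline: the viseme class names, the mapping table, and the timings are all simplified assumptions (real systems use standardized viseme sets and many more phonemes). Note that the mapping is many-to-one: several phonemes can share a single mouth shape.

```python
# A minimal sketch of phoneme-to-viseme mapping with temporal alignment.
# The viseme classes and the mapping table below are illustrative
# assumptions, not a standard inventory.
PHONEME_TO_VISEME = {
    # /p/, /b/, /m/ all close the lips, so they share one viseme.
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "ow": "rounded", "uw": "rounded",   # the rounded "O" shape
    "aa": "open", "ae": "open",
}

def visemes_for(phoneme_track):
    """Convert timed phonemes [(phoneme, start_s, end_s), ...] into a
    viseme track, merging consecutive identical mouth shapes."""
    track = []
    for ph, start, end in phoneme_track:
        vis = PHONEME_TO_VISEME.get(ph, "neutral")
        if track and track[-1][0] == vis and track[-1][2] == start:
            # Same shape continues: extend the previous segment.
            track[-1] = (vis, track[-1][1], end)
        else:
            track.append((vis, start, end))
    return track

# "bomb" -> /b/ /aa/ /m/: lips close, mouth opens wide, lips close again.
print(visemes_for([("b", 0.00, 0.08), ("aa", 0.08, 0.25), ("m", 0.25, 0.33)]))
# -> [('bilabial', 0.0, 0.08), ('open', 0.08, 0.25), ('bilabial', 0.25, 0.33)]
```

Even this toy version shows why synchronization is only the first hurdle: the viseme track gives the generator the right shapes at the right times, but says nothing about expression, intensity, or emotion.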
Developing believable visual representations of seemingly real people must also take emotion into account. Vocal tone and inflection change with emotion, but what changes should be made in each viseme when the talking head is alternately expressing excitement, seriousness, or urgency? Does the entire facial expression match the tone of voice? This goes far beyond just making sure that phonemes and visemes are in sync. Does everything about the image – voice, tone, expression, posture – correlate the way a human viewer expects?
The world is multilingual. This leads to an associated challenge: figuring out how to use the same AI-generated avatar (or a similar one with culturally relevant physical signifiers such as dress and hair style) to deliver the same message in any language a viewer prefers, with lip movements, vocal intonations, gestures, etc., adapted so the result looks native rather than like a poorly dubbed movie. Different languages have different phonemes and visemes, and different numbers of them, all of which requires sophisticated cross-lingual adaptation.
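A small sketch makes the cross-lingual problem concrete. The language inventories and the substitution table below are deliberately simplified toy examples (real phonologies are far richer): the point is that a mouth shape used in the source language may simply not exist in the target language's inventory, so the adaptation must fall back to the nearest available shape.

```python
# Illustrative sketch of cross-lingual viseme adaptation. The per-language
# viseme inventories and substitutes below are simplified assumptions,
# not real phonologies.
LANG_VISEMES = {
    "en": {"bilabial", "labiodental", "dental", "rounded", "open", "spread"},
    "ja": {"bilabial", "rounded", "open", "spread"},  # no /f/- or /th/-type shapes
}

def adapt_viseme(viseme, target_lang, fallback="neutral"):
    """Keep the viseme if the target language uses it; otherwise fall back
    to a nearest-shape substitute from a fixed, illustrative table."""
    substitutes = {"labiodental": "bilabial", "dental": "spread"}
    if viseme in LANG_VISEMES[target_lang]:
        return viseme
    return substitutes.get(viseme, fallback)

print(adapt_viseme("labiodental", "ja"))  # -> bilabial
print(adapt_viseme("rounded", "ja"))      # -> rounded
```

A real system would of course learn these correspondences rather than hard-code them, and would adapt timing and intensity as well as shape, but the mismatch the sketch exposes is exactly what makes naive dubbing look wrong.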
Believable AI-generated content extends well beyond human likenesses. Realistic physics, material textures, and environmental dynamics all present their own measurement and optimization challenges.
Beyond the uncanny valley
The task is to avoid the uncanny valley, the term coined by roboticist Masahiro Mori to describe human likenesses that look almost natural but remain unsettlingly off.
We quantify naturalness through perceptual quality metrics that evaluate facial movement fluidity, eye gaze patterns, micro-expressions, and temporal consistency. We are developing adaptive metric fusion architectures based on these evaluations, so that we can go far beyond simple lip-sync accuracy. We assess affective coherence — the synchronization between facial expressions, vocal prosody, and gestural movements.
Our architectures dynamically combine no-reference and reference-based video quality metrics based on contextual analysis of the input video characteristics, speaker identity, and assessment objectives. This allows us to detect subtle inconsistencies that traditional lip-sync metrics would miss. This multi-dimensional approach captures what makes generated content truly believable to human observers.
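The fusion idea described above can be sketched as a context-weighted combination of per-dimension scores. This is a hedged illustration of the general technique, not Adeia's actual architecture: the metric names, weight profiles, and context labels are all invented for the example, and a real system would learn the weights from contextual analysis rather than look them up in a table.

```python
# A sketch of adaptive metric fusion: per-dimension quality scores are
# combined with weights chosen from the assessment context. All metric
# names, contexts, and weights here are illustrative assumptions.
def fuse_naturalness(scores, context="default"):
    """scores: dict of per-dimension scores in [0, 1], e.g.
    {"lip_sync": ..., "gaze": ..., "micro_expression": ..., "temporal": ...}.
    Returns a single weighted naturalness score in [0, 1]."""
    weight_profiles = {
        # Close-up talking head: lip sync and micro-expressions dominate.
        "close_up": {"lip_sync": 0.4, "gaze": 0.2,
                     "micro_expression": 0.3, "temporal": 0.1},
        # Wide shot: overall temporal consistency matters more.
        "wide_shot": {"lip_sync": 0.2, "gaze": 0.2,
                      "micro_expression": 0.1, "temporal": 0.5},
        "default": {"lip_sync": 0.25, "gaze": 0.25,
                    "micro_expression": 0.25, "temporal": 0.25},
    }
    weights = weight_profiles.get(context, weight_profiles["default"])
    total = sum(weights.values())
    return sum(weights[k] * scores.get(k, 0.0) for k in weights) / total

scores = {"lip_sync": 0.9, "gaze": 0.7, "micro_expression": 0.6, "temporal": 0.8}
print(round(fuse_naturalness(scores, "close_up"), 3))  # -> 0.76
```

The key design point is that no single metric decides the outcome: a clip with perfect lip sync but frozen eyes scores poorly in a close-up context, which is precisely the kind of inconsistency single-metric evaluation misses.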
Furthermore, we’re developing training-free systems that can extract narrative structures and apply cinematic patterns across linguistic boundaries. Using advanced retrieval-augmented generation frameworks with graph-based narrative alignment, we can preserve emotional coherence and timing synchronization even as we adapt visual speech patterns to match different phonological systems — all without requiring massive parallel training datasets for every language pair.
Our goal is to be able to provide a score for the naturalness of an AI-generated character.
Gen AI is already impressive, but we're just beginning to unlock its potential. At Adeia, we're not just improving individual components; we're envisioning comprehensive systems that assess, verify, and optimize AI-generated content across multiple dimensions simultaneously. From measuring affective coherence to enabling authentic multilingual adaptation to ensuring content authenticity through advanced fingerprinting, our research is focused on making generative AI not just more natural-looking, but more trustworthy, more versatile, and more useful across the full spectrum of real-world applications.