
April 22, 2026

Skipping the Trip to the Uncanny Valley with Generative AI

Subhabrata Bhattacharya

In a very short time, generative AI has become extraordinarily adept at creating cogent text and speech and producing captivating imagery, including motion video. As impressive as the results are, there is still room for significant improvement. Adeia is working on solutions across several fronts, but today we'd like to focus on one particular challenge: realistic AI-generated images of people who speak.

Gen AI is already being used to create commentary – and commentators – for all kinds of video, leveraging sophisticated architectures like generative adversarial networks (GANs), latent diffusion models, and hybrid approaches that combine multiple synthesis methodologies. The types of video include explainers, how-to’s, unboxings, and more. This commentary, originally audio-only, is increasingly delivered by AI-generated graphical characters. Examples could be found in abundance at CES 2026.

Eventually, however, it is going to be common to have narration delivered by photorealistic, AI-generated humans. At the same time, agentic AI is coming, and it will be big. Agents are artificial intelligences that can perform complex tasks autonomously. These agents are going to become ubiquitous, and they are going to get human-looking avatars, generated by the same AI-based technologies that will generate talking heads for video content.

AI tools already exist that require nothing more than a prompt to generate a script and a talking-head video to deliver it, complete with user-specified gender, skin tone, facial features, age, and vocal range. The results are impressive, but they don’t seem entirely natural yet. Before AI can generate a truly natural-looking video of a human speaker, a fundamental question must be answered: how do you define “natural”?

And then the real challenge comes from the next question: how do you even measure naturalness? How do you measure the generated performance quantitatively? How can you computationally evaluate the algorithms generating talking head videos? The number of variables is enormous, and many are inextricably linked to others.

Every spoken language is built from a set of elemental sounds called phonemes. Each phoneme corresponds to a characteristic mouth shape called a viseme, and the two have to match: when a generated person says the letter O, for example, the mouth should form a rounded shape. Phonemes and visemes must also be in precise temporal alignment, but synchronization is only the first challenge that must be met to create truly believable characters.
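As a rough illustration of the alignment requirement, the sketch below maps each phoneme in the audio track to its expected viseme and counts the video frames where the rendered viseme agrees. The phoneme-to-viseme table and the function are invented for the example; they are not Adeia's implementation or any standard inventory.

```python
# Minimal sketch of a phoneme-viseme alignment check (toy mapping; real
# systems use full, language-specific tables, and several phonemes can
# share a single viseme).
PHONEME_TO_VISEME = {
    "OW": "round",                      # the "O" sound -> rounded lips
    "P": "closed", "B": "closed", "M": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
}

def alignment_score(phoneme_track, viseme_track, fps=30):
    """Fraction of video frames whose rendered viseme matches the viseme
    expected from the phoneme being spoken at that instant.

    phoneme_track: list of (phoneme, start_sec, end_sec) from the audio
    viseme_track:  one viseme label per video frame
    """
    matches = total = 0
    for phoneme, start, end in phoneme_track:
        expected = PHONEME_TO_VISEME.get(phoneme)
        if expected is None:            # phoneme outside the toy table
            continue
        for frame in range(int(start * fps), int(end * fps)):
            if frame >= len(viseme_track):
                break
            total += 1
            matches += viseme_track[frame] == expected
    return matches / total if total else 0.0

# The speaker says "O" for 0.2 s and then closes the lips for "M"; the
# rendered frames show the matching shapes, so the score is 1.0.
phonemes = [("OW", 0.0, 0.2), ("M", 0.2, 0.3)]
visemes = ["round"] * 6 + ["closed"] * 3    # 9 frames at 30 fps
print(alignment_score(phonemes, visemes))
```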

Developing believable visual representations of seemingly real people must also take emotion into account. Vocal tone and inflection change with emotion, but what changes should be made in each viseme when the talking head is alternately expressing excitement, seriousness, or urgency? Does the entire facial expression match the tone of voice? This goes far beyond just making sure that phonemes and visemes are in sync. Does everything about the image – voice, tone, expression, posture – correlate the way a human viewer expects?
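One way to make that correlation question measurable is to compare emotion estimates extracted independently from the voice and from the face over time. In the sketch below, the per-frame arousal curves stand in for the outputs of upstream prosody and facial-expression models, which are assumed rather than shown; the scoring itself is simply a Pearson correlation.

```python
import numpy as np

def affective_coherence(audio_arousal, face_arousal):
    """Pearson correlation between per-frame arousal estimated from the
    voice and from the face. Values near 1.0 mean the face lights up
    when the voice does; values near zero or below flag a mismatch that
    a viewer would register as 'off'. Both inputs are assumed to be
    time-aligned, per-frame float sequences from upstream models."""
    a = np.asarray(audio_arousal, dtype=float)
    f = np.asarray(face_arousal, dtype=float)
    if a.std() == 0 or f.std() == 0:
        return 0.0                       # a flat signal carries no cue
    return float(np.corrcoef(a, f)[0, 1])

# Toy example: excitement rises in the voice, but the face reacts late,
# so the score lands well below a perfect 1.0.
voice = [0.1, 0.2, 0.6, 0.9, 0.9, 0.8]
face  = [0.1, 0.1, 0.1, 0.2, 0.6, 0.9]
print(round(affective_coherence(voice, face), 2))
```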

The world is multilingual. This leads to an associated challenge: figuring out how to use the same AI-generated avatar (or a similar one with culturally relevant physical signifiers such as dress and hair style) to deliver the same message in any language a viewer prefers, with lip movements, vocal intonations, gestures, etc., adapted so the result looks native rather than like a poorly dubbed movie. Different languages have different phonemes and visemes, and different numbers of them, all of which requires sophisticated cross-lingual adaptation.
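As a simplified picture of why this is more than a dubbing problem, the sketch below remaps a target-language phoneme sequence onto the viseme inventory an avatar can actually render, with an explicit fallback for sounds that have no direct equivalent. Every inventory and mapping here is invented for the example.

```python
# Viseme shapes this particular avatar's renderer can produce (invented).
AVATAR_VISEMES = {"round", "closed", "spread", "open", "teeth_on_lip"}

# Invented per-language phoneme -> viseme tables. Real inventories differ
# both in which phonemes exist and in how many visemes are needed.
PHONEME_TO_VISEME = {
    "en": {"OW": "round", "M": "closed", "IY": "spread", "AA": "open"},
    "fr": {"ON": "round_nasal", "M": "closed", "I": "spread", "A": "open"},
}

# Closest supported substitute for shapes the avatar cannot render.
VISEME_FALLBACK = {"round_nasal": "round"}

def visemes_for(phonemes, lang):
    """Map a phoneme sequence in `lang` onto visemes the avatar can show,
    substituting the nearest supported shape when necessary."""
    out = []
    for p in phonemes:
        v = PHONEME_TO_VISEME[lang].get(p, "open")   # neutral default
        if v not in AVATAR_VISEMES:
            v = VISEME_FALLBACK.get(v, "open")
        out.append(v)
    return out

# A French nasal vowel has no direct match in this toy inventory, so it
# falls back to the plain rounded-lip shape rather than to no movement.
print(visemes_for(["M", "ON", "I"], "fr"))   # ['closed', 'round', 'spread']
```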

Believable AI-generated content extends well beyond human likenesses. Realistic physics, material textures, and environmental dynamics all present their own measurement and optimization challenges.

Beyond the uncanny valley

The task is to avoid falling into the uncanny valley, the term coined to describe computer-rendered characters that look almost natural but remain unsettlingly off.

We quantify naturalness through perceptual quality metrics that evaluate facial movement fluidity, eye gaze patterns, micro-expressions, and temporal consistency. We are developing adaptive metric fusion architectures based on these evaluations, so that we can go far beyond simple lip-sync accuracy. We assess affective coherence — the synchronization between facial expressions, vocal prosody, and gestural movements.

Our architectures dynamically combine no-reference and reference-based video quality metrics based on contextual analysis of the input video characteristics, speaker identity, and assessment objectives. This allows us to detect subtle inconsistencies that traditional lip-sync metrics would miss. This multi-dimensional approach captures what makes generated content truly believable to human observers.
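Read together, the two paragraphs above suggest a weighted fusion: individual perceptual scores are combined into one naturalness score, with the weights shifted by context, e.g. leaning entirely on no-reference metrics when no real footage of the speaker exists. The sketch below is a hypothetical illustration of that idea; the metric names and the weighting rule are placeholders, not Adeia's architecture.

```python
def fuse_naturalness_scores(scores, has_reference_video):
    """Combine per-dimension scores (each in [0, 1]) into one naturalness
    score. `scores` might hold entries such as 'lip_sync', 'gaze',
    'micro_expression', 'temporal_consistency' and 'affective_coherence',
    plus 'reference_similarity' when real footage of the speaker exists.
    The weighting rule is the 'adaptive' part of the sketch: without a
    reference video, the no-reference perceptual metrics carry all the
    weight."""
    weights = {
        "lip_sync": 0.25,
        "gaze": 0.15,
        "micro_expression": 0.15,
        "temporal_consistency": 0.20,
        "affective_coherence": 0.25,
    }
    if has_reference_video:
        # Shift part of the weight onto the reference-based comparison.
        weights = {k: w * 0.7 for k, w in weights.items()}
        weights["reference_similarity"] = 0.30
    total = sum(weights.values())
    return sum(w * scores.get(k, 0.0) for k, w in weights.items()) / total

example = {
    "lip_sync": 0.92, "gaze": 0.80, "micro_expression": 0.55,
    "temporal_consistency": 0.88, "affective_coherence": 0.60,
}
print(round(fuse_naturalness_scores(example, has_reference_video=False), 3))
```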

Furthermore, we’re developing training-free systems that can extract narrative structures and apply cinematic patterns across linguistic boundaries. Using advanced retrieval-augmented generation frameworks with graph-based narrative alignment, we can preserve emotional coherence and timing synchronization even as we adapt visual speech patterns to match different phonological systems — all without requiring massive parallel training datasets for every language pair.

Our goal is to be able to provide a score for the naturalness of an AI-generated character.

Gen AI is already impressive, but we're just beginning to unlock its potential. At Adeia, we're not just improving individual components; we're envisioning comprehensive systems that assess, verify, and optimize AI-generated content across multiple dimensions simultaneously. From measuring affective coherence to enabling authentic multilingual adaptation to ensuring content authenticity through advanced fingerprinting, our research is focused on making generative AI not just more natural-looking, but more trustworthy, more versatile, and more useful across the full spectrum of real-world applications.

Subhabrata Bhattacharya

Advanced R&D Director

Subhabrata Bhattacharya is an Advanced R&D Director at Adeia, where he spearheads CV/ML intellectual property development at the intersection of cutting-edge research and real-world deployment, drawing on over 20 years of experience in computer vision, machine learning, and computational imaging. His inventive work has yielded a growing portfolio of patent applications in emerging areas such as multimodal LLM fingerprinting, event-triggered sensor fusion, and federated learning for edge devices, reflecting a career-long commitment to translating deep technical insight into protectable, high-value innovation. He has held key technical roles at some of the world's most influential technology organizations, including Netflix, Ericsson, Adobe, Siemens, Microsoft Research, IBM, and Infosys, consistently delivering impactful systems at scale across industries. An accomplished researcher, he has published at premier venues including IEEE CVPR, ACM Multimedia, and MICCAI, and holds granted patents spanning domains from endoscopic image classification and visual effects processing to spatial anomaly detection. He earned his Ph.D. in Computer Engineering from the University of Central Florida, and his contributions have been recognized with a Best Paper Award at ICMR 2014, Grand Challenge prizes at ACM Multimedia 2013 and 2014, and two Netflix Hackday wins in the Studio category. He is an active member of the research community, having served as an invited speaker at industry and academic venues and as a program committee member for IEEE CVPR, ACM MM, and ECCV.