a16z · April 8, 23:28
AI Avatars Escape the Uncanny Valley

This article takes a deep look at the latest advances in AI avatars: AI is no longer just generating content, it is beginning to embody it. It examines the technical challenges of AI avatars, including synchronizing facial expressions and body language with speech. Drawing on hands-on testing of more than 20 products, the author assesses the current state of the technology and outlines where AI avatars are headed in content creation, advertising, and corporate communication. The article also covers real-world applications among consumers, SMBs, and large enterprises.

🗣️ The core of AI avatar technology is precise synchronization of speech and facial expressions, which requires complex modeling of the relationship between audio and facial motion. Early models worked from a single image with limited results; today's models can generate half-body or even full-body movement along with dynamic backgrounds.

🎬 AI avatar technology shows broad potential in content creation, advertising, and internal corporate communication. Consumers can use it to create animated characters, SMBs can use it for advertising, and large enterprises can apply it to training, localization, and executive presence.

⚙️ Building an AI avatar involves the face, voice, body movement, and rendering, each with its own technical challenges. For example, facial expressions must stay consistent, the voice must sound natural, body movement must be synchronized with speech, and rendering must support real-time interaction.

What happens when AI doesn’t just generate content, but embodies it? AI has already mastered the ability to produce realistic photos, videos, and voices, passing the visual and auditory Turing Test. The next big leap is in AI avatars: combining a face with a voice to create a talking character.

Can’t you just generate an image of a face, animate it, and add a voiceover? Not quite. The challenge isn’t just nailing the lip sync — it’s making facial expressions and body language move in tandem. It would be weird if your mouth opened in surprise, but your cheeks and chin didn’t budge! And if a voice sounds excited but the corresponding face doesn’t react, the human-like illusion falls apart.

We’re starting to see real progress here. AI avatars are already being used in content creation, advertising, and corporate communication. Today’s versions are still mostly talking heads: functional, but limited. That said, we’ve seen some exciting developments in the last few months, and there’s clearly meaningful progress on the horizon.

In this post, we’ll break down what’s working now, what’s next, and the most impressive AI avatar products today, drawn from my hands-on testing of over 20 of them.

How has the research evolved?

AI avatars are a uniquely challenging research problem. To make a talking face, a model needs to learn realistic phoneme-to-viseme mapping: the relationship between speech sounds (phonemes) and their corresponding mouth movements (visemes). If this is “off,” the mouth and voice will look out of sync or even completely disconnected.
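To make the mapping concrete, here’s a toy lookup in Python. It’s purely illustrative: the phoneme and viseme labels are simplified, and real systems learn this mapping (and its timing and coarticulation) from video data rather than using a fixed table.

```python
# Illustrative only: a toy phoneme-to-viseme lookup. Real models learn this
# mapping from data; the labels here are simplified for the sake of example.
PHONEME_TO_VISEME = {
    # bilabial closures: lips pressed together
    "P": "closed_lips", "B": "closed_lips", "M": "closed_lips",
    # labiodental: lower lip touches upper teeth
    "F": "teeth_on_lip", "V": "teeth_on_lip",
    # rounded vowels
    "UW": "rounded", "OW": "rounded",
    # open vowels
    "AA": "wide_open", "AE": "wide_open",
}

def phonemes_to_visemes(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to viseme labels, defaulting to 'neutral'."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# "bob" -> B AA B -> ['closed_lips', 'wide_open', 'closed_lips']
print(phonemes_to_visemes(["B", "AA", "B"]))
```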

To make the issue even more complex, your mouth isn’t the only thing that moves when you talk. The rest of your face moves in conjunction, along with your upper body and sometimes your hands. And everyone has their own distinct style of speaking. Think about how you speak, compared to your favorite celebrity: even if you’re saying the same sentence, your mouths will move differently. If you tried to apply your lip sync to their face, it would look weird.

Over the last few years, this space has evolved significantly from a research perspective. I reviewed over 70 papers on AI talking heads since 2017 and saw a clear progression in model architecture — from CNNs and GANs, to 3D-based approaches like NeRFs and 3D Morphable Models, then to transformers and diffusion models, and most recently, to DiT (diffusion models based on the transformer architecture). The timeline below highlights the most cited papers from each year.

Both the quality of generations and the capabilities of models have improved dramatically. Early approaches were limited. Imagine starting with a single photo of a person, masking the bottom half of their face, and generating new mouth movements based on target facial landmarks from audio input. These models were trained on a limited corpus of quality lip sync data, most of which was closely cropped at the face. More realistic results, like “lip-syncing Obama,” required many hours of video of the target person and were very limited in outputs.
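Here is a shape-level sketch of that early recipe, with dummy stand-ins for the trained networks. The function names, crop size, and audio window are assumptions for illustration, not taken from any specific paper.

```python
import numpy as np

# A shape-level sketch of the early "masked lower face" approach described
# above. The encoder/decoder here are dummies; in practice they are trained
# networks, and the crop/window sizes vary by model.

def dummy_audio_encoder(mel_window: np.ndarray) -> np.ndarray:
    # (n_mels, frames) -> a flat audio embedding
    return mel_window.reshape(-1)[:256]

def dummy_face_decoder(face_feat: np.ndarray, audio_feat: np.ndarray) -> np.ndarray:
    # In a real model, a generator outputs the new lower face conditioned on
    # both the masked face features and the audio embedding.
    return np.zeros((48, 96, 3), dtype=np.uint8)

def lipsync_frame(face_crop: np.ndarray, mel_window: np.ndarray) -> np.ndarray:
    """Regenerate only the bottom half of a 96x96 face crop from audio."""
    masked = face_crop.copy()
    masked[48:, :, :] = 0                      # mask the mouth region
    audio_feat = dummy_audio_encoder(mel_window)
    face_feat = masked.reshape(-1)             # stand-in for a CNN encoding
    new_lower_half = dummy_face_decoder(face_feat, audio_feat)
    out = face_crop.copy()
    out[48:, :, :] = new_lower_half            # paste the generated mouth back
    return out

frame = np.zeros((96, 96, 3), dtype=np.uint8)   # one cropped face frame
mel = np.zeros((80, 16))                        # a short mel-spectrogram window
print(lipsync_frame(frame, mel).shape)          # (96, 96, 3)
```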

Today’s models are much more flexible and powerful. They can generate half-body or even full-body movement, realistic talking faces, and dynamic background motion — all in the same video! These newer models are trained more like traditional text-to-video models on much larger datasets, using a variety of techniques to maintain lip sync accuracy amid all the motion.

The first preview of this came with ByteDance’s OmniHuman-1 model, which was introduced in February (and was recently made available in Dreamina). The space is moving quickly — Hedra released Character-3 in March, which in our head-to-head testing is now best-in-class for most use cases. Hedra also works for non-human characters, like this talking Waymo, and enables users to prompt emotions and movement via text.

New use cases are also emerging around AI animation, spurred by trends like the Studio Ghibli movement. The below video came from a starting image frame and the audio track. Hedra generated the character’s lip sync and face + upper body movement. And check out the moving characters in the background!

Real-world jobs for AI avatars

There are countless use cases for AI avatars — just imagine all the different places where you interact with a character or watch a video where someone is speaking. We’ve already seen usage across consumers, SMBs, and even enterprises.

This is an early market map. The space is evolving quickly, and the product distinctions are relatively rough. Many products theoretically could make avatars for most or all of these use cases, but we’ve found, in practice, that it’s hard to build the workflow and tune the model to excel at everything. Below, we’ve outlined examples for how each segment of the market is leveraging AI avatars.

Consumers: Character creation

Anyone can now create animated characters from a single image, which is a massive unlock for creativity. It’s hard to overstate how meaningful this is for everyday people who want to use AI to tell a story. One of the reasons early AI videos were criticized as “slides of images” is that there were no talking characters (or speech only came in the form of voiceovers).

When you can make something talk, your content becomes much more interesting. And beyond traditional narrative video, you can create things like AI streamers, podcasters, and music videos. The videos linked here were all made on Hedra, which enables users to create dynamic, speaking characters from a single starting image and either an audio clip or a script.

If you’re starting with a video instead of an image, Sync can apply lip sync to make the character’s face fit your audio. And if you want to use real human performance to drive the movement of your character, tools like Runway Act-One and Viggle make it possible.

One of my favorite creators using AI to animate characters is Neural Viz, whose series, “The Monoverse,” imagines a post-human universe populated by Glurons. It’s only a matter of time before we see an explosion of AI-generated shows — or even just standalone influencers — now that the barrier to entry is so much lower.

Unanswered Oddities – Episode 1: Humans (youtube.com/@NeuralViz)

As avatars become easier to stream in real-time, we also expect to see consumer-facing companies implement them as a core part of their UI. Imagine learning a language with a live AI “coach” that is not just a disembodied voice, but a full character with a face and personality. Companies like Praktika are already doing this, and it will only get more natural over time. 

SMBs: Lead generation

Ads have become one of the first killer use cases of AI avatars. Instead of hiring actors and a production crew, businesses can now have hyper-realistic AI characters promote their products. Companies like Creatify and Arcads make this seamless — just provide a product link and they generate an ad: writing the script, pulling B-roll and images, and “casting” an AI actor.

This has unlocked advertising for businesses that could never afford traditional ad production. It’s particularly popular among ecommerce companies, games, and consumer apps. Chances are, you’ve already seen AI-generated ads on YouTube or TikTok. Now B2B companies are exploring the tech as well, using AI avatars for content marketing or personalized outreach with tools like Yuzu Labs and Vidyard.

Many of these products combine an AI actor — whether a clone of a real person or a unique character — with other assets like product photos, video clips, and music. Users can control where these assets appear, or switch to “autopilot” and let the product pull together a video for them. You can either write the script yourself or use an AI-generated one.

Enterprises: Scaling content

Beyond marketing, enterprises are finding a range of applications for AI avatars. A few examples:

Learning and development. Most large companies produce training and educational videos for employees, covering everything from onboarding to compliance, product tutorials, and skill development. AI tools like Synthesia can automate this process, making content creation faster and more scalable. Some roles also require ongoing, video-based training — imagine a salesperson practicing their negotiation skills with an AI avatar from a product like Anam.

Localization. If a company has customers or employees in different countries, it may want to localize content into different languages or switch out cultural references. AI actors make it fast and easy to personalize your videos for different geographies. Thanks to AI voice translation from companies like ElevenLabs, businesses can generate the same video in dozens of languages, with natural-sounding voices.

Executive presence. AI avatars let executives scale their presence by cloning their persona to create personalized content for employees or customers. Instead of filming every product announcement or a “thank you” message, companies can generate a realistic AI twin of their CEO or product lead. We’re also seeing companies like Delphi and Cicero make it easy for thought leaders to interact with and answer questions from people they’d never normally be able to meet 1:1.

What are the ingredients of an AI avatar? 

Creating a believable AI avatar is a challenge, with each element of realism presenting its own technical hurdles. It’s not just about avoiding the uncanny valley; it’s about solving fundamental problems in animation, speech synthesis, and real-time rendering. Here’s a breakdown of what’s required, why it’s so hard to get right, and where we’re seeing progress (a rough sketch of how these pieces compose follows the list):

  • Face – Whether you’re cloning a person or creating a new character, you need a face that stays consistent between frames and moves realistically while talking. Context-aware expressiveness remains a challenge (e.g. an avatar yawning while saying “I’m tired”).
  • Voice – The voice needs to sound real and match the character; a teenage girl’s face shouldn’t have an older woman’s voice. Most of the AI avatar companies we’ve met use ElevenLabs, which has an extensive voice library and allows you to clone your own.
  • Lip sync – Getting quality lip sync is tricky. Entire companies, like Sync, are dedicated to solving this problem. Other models, like MoCha (from Meta) and OmniHuman, train on larger datasets and use various techniques to strongly condition face generation on the accompanying audio.
  • Body – Your avatar can’t just be a floating head! Newer models enable avatars with full bodies that can move, but we’re still in early days in terms of both scaling them and delivering them to users.
  • Background – Avatars don’t exist in a vacuum. The lighting, depth, and interactions in their surrounding environment need to match the scene. Ideally, avatars will even be able to touch and engage with things in their environment, like picking up a product.
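Put together, these ingredients compose into something like the hypothetical pipeline below. The stage names and signatures are assumptions for illustration only; no particular product’s API is implied.

```python
from dataclasses import dataclass
from typing import Callable

# A hypothetical composition of the ingredients above into one pipeline.
# Each stage is a pluggable function; in practice each is a separate model.

@dataclass
class AvatarSpec:
    reference_image: str   # face identity (clone or original character)
    voice_id: str          # voice to synthesize the script with
    script: str            # what the avatar should say
    background: str = "studio"

def render_avatar_video(
    spec: AvatarSpec,
    tts: Callable[[str, str], bytes],         # script, voice_id -> audio
    animate: Callable[[str, bytes], bytes],   # image, audio -> talking video
    composite: Callable[[bytes, str], bytes], # video, background -> final video
) -> bytes:
    audio = tts(spec.script, spec.voice_id)          # Voice
    video = animate(spec.reference_image, audio)     # Face + Lip sync + Body
    return composite(video, spec.background)         # Background

# Usage sketch: swap the lambdas for real model calls.
result = render_avatar_video(
    AvatarSpec("face.png", "narrator", "Hello!"),
    tts=lambda script, voice: b"<audio>",
    animate=lambda image, audio: b"<video>",
    composite=lambda video, background: video,
)
print(result)
```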

If you want your avatar to engage in real-time conversations — like joining a Zoom meeting — there are a few other things you need to add:

  • Brain – Your avatar needs to be able to “think.” Products that enable conversation today typically let you upload or connect to a knowledge base. In the future, more complex versions of this will hopefully include more memory and personality. Avatars should be able to remember past conversations with you and have their own “flair.”
  • Streaming – It’s not easy to stream all of this with minimal latency. Products like LiveKit and Agora are making progress here, but orchestrating so many models in real time without lag is still hard. We’ve seen a few products do this well — like Tolan, an AI alien companion with a voice and face — but there’s still work to be done. (A minimal sketch of the real-time loop follows this list.)
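For the conversational case, the loop looks roughly like the sketch below: listen, transcribe, “think,” then stream synthesized speech and matching frames back. Every function here is a stub and no vendor SDK is used; the hard part in practice is keeping the whole round trip fast enough to feel conversational.

```python
import asyncio

# A minimal, generic sketch of a real-time avatar loop:
# user speech -> transcription -> "brain" (LLM + memory) -> TTS + frames.
# All stages are stubs; in production each is a streaming model call.

async def transcribe(audio_chunk: bytes) -> str:
    return "hello avatar"                      # stub speech-to-text

async def think(text: str, memory: list[str]) -> str:
    memory.append(text)                        # naive "memory": keep the transcript
    return f"You said: {text}"                 # stub LLM reply

async def speak_and_animate(reply: str):
    # stub: yield (audio_chunk, video_frame) pairs as they are generated
    for word in reply.split():
        yield (word.encode(), b"<frame>")

async def conversation_loop(incoming_audio):
    memory: list[str] = []
    async for chunk in incoming_audio:
        text = await transcribe(chunk)
        reply = await think(text, memory)
        async for audio, frame in speak_and_animate(reply):
            ...  # push audio + frame to the client (e.g. over WebRTC)

async def fake_mic():
    yield b"<pcm audio>"                       # stand-in for a live mic stream

asyncio.run(conversation_loop(fake_mic()))
```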

What would we like to see?

There’s still so much to build and improve in this space. A few areas that are top-of-mind:

Character consistency and transformation

Historically, each AI avatar had one fixed “look.” Their outfit, pose, and environment were static. Some products are starting to offer more options. For example, this character from HeyGen, Raul, has 20 looks! But it would be great to more easily transform a character however you want.

Better facial movement and expressiveness

Faces have long been the weak link of AI avatars, often looking robotic. That’s starting to change with products like Captions’ new Mirage, which delivers a more natural look and broader range of expressions. We’d love to see AI avatars that understand the emotional context of a script and react appropriately, like looking scared if the character is fleeing from a monster.

Body movement

Today, the vast majority of avatars have little movement below the face — even basic things like hand gestures. Gesture control has been fairly programmatic: for example, Argil allows you to select different types of body language for each segment of your video. We’re excited to see more natural, inferred motion in the future. 

Interacting with the “real world”

Right now, AI avatars can’t interact with their surroundings. An attainable near-term goal may be enabling them to hold products in ads. Topview has already made progress (see the below video for their process and outcome), and we’re excited to see what’s to come as models improve.

More real-time applications

To name a few potential use cases: doing a video call with an AI doctor, browsing curated products with an AI sales assistant, or FaceTiming with a character from your favorite TV show. The latency and reliability aren’t quite human-level, but they’re getting close. Check out a demo of me chatting with Tavus’ latest model.

Where are we headed?

One of our main learnings from investing in both foundation model companies and AI applications over the past few years? It’s nearly impossible to predict with any degree of certainty where a given space is headed. However, it feels safe to say that the application layer is poised for rapid growth now that the underlying model quality finally feels good enough to generate AI talking heads that aren’t painful to watch.

We expect this space will give rise to multiple billion-dollar companies, with products segmented by use case and target customer. For example, an executive looking for an AI clone to film videos for customers will need (and be willing to pay for) a higher level of quality and realism than a fan making a quick clip of their favorite anime character to send to friends.

Workflow is also important. If you’re generating ads with AI influencers, you’ll want to use a platform that can automatically pull in product details, write scripts, add B-roll and product photos, push the videos to your social channels, and measure results. On the other hand, if you’re trying to tell a story using AI characters, you’ll prioritize tools that enable you to save and re-use characters and scenes, and easily splice together different types of clips.

We can't wait to see what emerges here. If you're building in this space, I'd love to chat. Reach out to jmoore@a16z.com or venturetwins on X.
