Every Day Update
Gen AI Top Papers and Research

Contribute to the AI community by sharing your insights and expertise

Diffusion Self-Guidance for Controllable Image Generation

Diffusion/Controllable Image Generation

TL;DR: Self-guidance is a method for controllable image generation that guides sampling using only the attention and activations of a pretrained diffusion model. Without any extra models or training, you can move or resize objects, or even replace them with items from real images, without changing the rest of the scene. You can also borrow the appearance of another image or rearrange scenes into a desired layout.
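As a toy illustration of guiding with attention, the sketch below measures an object's position as the centroid of a single attention map and scores how far it is from a desired location. It assumes the guided property is position; the function names and the tiny 2x2 map are hypothetical. In an actual sampler, the gradient of such a score with respect to the latent would steer each denoising step, which is omitted here.

```python
def attention_centroid(attn):
    """Return the (x, y) centroid of a 2D attention map, normalized to [0, 1].

    `attn` is a list of rows of non-negative weights (hypothetical input;
    in practice this would come from a cross-attention layer).
    """
    height, width = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    # Weight each pixel-center coordinate by its attention value.
    cx = sum(v * (j + 0.5) / width for row in attn for j, v in enumerate(row)) / total
    cy = sum(v * (i + 0.5) / height for i, row in enumerate(attn) for v in row) / total
    return cx, cy


def position_energy(attn, target):
    """L1 distance between the attention centroid and a target (x, y)."""
    cx, cy = attention_centroid(attn)
    return abs(cx - target[0]) + abs(cy - target[1])


# All attention sits in the bottom-right cell of a 2x2 map.
toy_attention = [[0.0, 0.0],
                 [0.0, 1.0]]
centroid = attention_centroid(toy_attention)
energy = position_energy(toy_attention, target=(0.25, 0.75))
```

A lower energy means the object already sits near the requested position; moving an object amounts to lowering this value during sampling.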

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Synthetic Humans/Embodied Avatar

VLOGGER is a novel framework to synthesize humans from audio. Given a single input image and a sample audio input, the method generates photorealistic and temporally coherent videos of the person talking and vividly moving.

Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling

Gaussian/3D Human Avatar

Modeling animatable human avatars from RGB videos is a long-standing and challenging problem. Recent works usually adopt MLP-based neural radiance fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to regress pose-dependent garment details.

StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting

Gaussian/3D Style Transfer

StyleGaussian is a novel 3D style transfer technique that allows instant transfer of any image's style to a 3D scene at 10 frames per second (fps). Leveraging 3D Gaussian Splatting (3DGS), StyleGaussian achieves style transfer without compromising its real-time rendering ability and multi-view consistency. Project: https://kunhao-liu.github.io/StyleGaussian/

Speech Research

Speech/Audio/Voice Clone

This page lists some speech-related research at Microsoft Research Asia, conducted by the team led by Xu Tan. The research topics cover text to speech, singing voice synthesis, music generation, automatic speech recognition, etc. Some of the research is open-sourced via NeuralSpeech and Muzic.

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Zero-Shot Voice Clone / Audio

NaturalSpeech 3 is a text-to-speech (TTS) model. Its FACodec converts a complex speech waveform into disentangled subspaces representing the speech attributes of content, prosody, timbre, and acoustic details, and reconstructs a high-quality speech waveform from these attributes. HF demo: https://huggingface.co/spaces/amphion/naturalspeech3_facodec

ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models

LoRA/ResAdapter/Resolution

Overview of ResAdapter. Left: Pipeline of ResAdapter. Built on a frozen model (e.g., SD or SDXL), it learns resolution priors from mixed-resolution datasets and can be integrated into any personalized model to generate multi-resolution images. Right: Architecture comparison between ResAdapter and the vanilla LoRA. ResAdapter is inserted only into the downsamplers and upsamplers, and it unfreezes the group normalization of the ResNet blocks. Page: https://res-adapter.github.io/
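The layer selection described above can be sketched as a simple filter over parameter names. This is a minimal sketch under stated assumptions: the parameter names below are hypothetical, illustrative strings (not verified against any library), and `adapter_role` is a made-up helper that only labels each parameter rather than building an adapter.

```python
def adapter_role(param_name):
    """Label a parameter by how a ResAdapter-style setup would treat it.

    "lora"     -> gets a low-rank adapter (downsampler / upsampler layers)
    "unfrozen" -> trained directly (group norm inside ResNet blocks)
    "frozen"   -> left untouched (everything else in the base model)
    """
    if "downsamplers" in param_name or "upsamplers" in param_name:
        return "lora"
    if "resnets" in param_name and "norm" in param_name:
        return "unfrozen"
    return "frozen"


# Hypothetical parameter names, for illustration only.
example_names = [
    "down_blocks.0.downsamplers.0.conv.weight",
    "up_blocks.1.upsamplers.0.conv.weight",
    "down_blocks.0.resnets.0.norm1.weight",
    "down_blocks.0.resnets.0.conv1.weight",
    "down_blocks.0.attentions.0.to_q.weight",
]
roles = {name: adapter_role(name) for name in example_names}
```

By contrast, a vanilla LoRA setup would typically also attach adapters to the attention projections, which is the difference the architecture comparison highlights.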

PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics (CVPR 2024)

NeRF/Physics/3D Reconstruction

PhysGaussian is a pioneering unified simulation-rendering pipeline that generates physics-based dynamics and photo-realistic renderings simultaneously and seamlessly. Page: https://xpandora.github.io/PhysGaussian/

GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

Gaussian/Head Avatar

GaussianAvatars combines dynamic 3D Gaussian splats with a parametric morphable face model for photorealistic avatars. The method excels in animation control, showing superior performance in reenactments from driving videos and surpassing existing techniques.

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Gaussian/Head Avatar

The Gaussian Head Avatar method combines controllable 3D Gaussians and MLP-based deformation fields to achieve high-fidelity head avatar modeling, outperforming existing sparse-view methods. It preserves fine-grained dynamic details and expression accuracy, achieving ultra high-fidelity rendering quality at 2K resolution.

Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation

Noise Control/Image-to-Video Generation

Noise Rectification is a simple but effective method for image-to-video generation in open domains. It is tuning-free and plug-and-play.

ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation

Image2Video Generation/Consistency

Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence. A grand challenge in I2V generation is to maintain visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame, and to ensure a fluid and logical progression within the video narrative.

GEA: Reconstructing Expressive 3D Gaussian Avatar from Monocular Video

Gaussian/NeRF/3D Reconstruction

A novel method utilizing 3D Gaussians for creating expressive 3D avatars achieves state-of-the-art performance in photorealistic novel view synthesis. It features accurate pose estimation, attention-aware networks, and an iterative re-initialization strategy for high-fidelity reconstructions and fine-grained control over body and hand poses. Project page: https://3d-aigc.github.io/GEA/

PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator

Finetune LoRAs / Diffusion Models / PeRFlow

PeRFlow trains piecewise-linear rectified flow models for fast sampling. These models can be initialized from pretrained diffusion models, such as Stable Diffusion (SD). The obtained weights of PeRFlow serve as a general accelerator module that is compatible with various fine-tuned stylized SD models as well as SD-based generation/editing pipelines. Specifically, the delta weights are computed as the PeRFlow weights minus the pretrained SD weights. One can fuse these delta weights into various SD pipelines for (conditional) image generation/editing to enable high-quality few-step inference.
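The weight arithmetic described above can be sketched with plain dictionaries standing in for model state. This is only a sketch under stated assumptions: `compute_delta` and `fuse_delta` are hypothetical helper names, the weights are toy lists of floats rather than real tensors, and all three models are assumed to share the same parameter keys and shapes.

```python
def compute_delta(accelerated, base):
    """Delta weights: accelerated-model weights minus pretrained base weights."""
    return {
        key: [a - b for a, b in zip(accelerated[key], base[key])]
        for key in base
    }


def fuse_delta(target, delta):
    """Add the delta weights to another model that shares the same base."""
    return {
        key: [t + d for t, d in zip(target[key], delta[key])]
        for key in target
    }


# Toy weights (hypothetical values, one parameter each).
pretrained_sd = {"unet.w": [1.0, 2.0]}
accelerated_sd = {"unet.w": [1.5, 1.0]}
stylized_sd = {"unet.w": [2.0, 2.0]}  # a fine-tuned model derived from pretrained_sd

delta = compute_delta(accelerated_sd, pretrained_sd)
fused = fuse_delta(stylized_sd, delta)
```

Because the delta is stored separately from any one checkpoint, the same delta can in principle be added to several fine-tuned models that started from the same pretrained weights.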

Deformable One-shot Face Stylization via DINO Semantic Guidance

GANs/StyleGAN/Deformable Stylization

This paper presents a novel approach to one-shot face stylization that focuses on both appearance and structure. It uses a self-supervised vision transformer, DINO-ViT, and integrates spatial transformers into StyleGAN for deformation-aware stylization. Additional constraints and style-mixing improve deformability and efficiency, and extensive comparisons show superiority over existing methods. Code is available at https://github.com/zichongc/DoesFS.

Pix2Gif: Motion-Guided Diffusion for GIF Generation

Text2Video/Animation/Diffusion

Pix2Gif introduces a novel approach to image-to-GIF generation using text and motion prompts. Their model utilizes motion-guided warping and perceptual loss to ensure content consistency. Pretrained on curated data, it effectively translates prompts into coherent GIFs, demonstrated through extensive experiments. Page: https://hiteshk03.github.io/Pix2Gif/

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Diffusion Transformer/Text-to-Image Generation

PixArt-Σ is a cutting-edge Diffusion Transformer model that generates 4K images with superior fidelity and alignment to text prompts. It achieves this through high-quality training data and efficient token compression, resulting in smaller model size and superior image quality compared to existing models. Project-Page: https://pixart-alpha.github.io/PixArt-sigma-project/

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Speech/Talking Face Generation

EmoSpeaker is a method that enhances emotional expression in generated facial animations. It employs a visual attribute-guided audio decoupler, fine-grained emotion coefficient prediction, and intensity control to improve emotional quality and lip synchronization. Experimental results show superiority over existing methods. Project-Page: https://peterfanfan.github.io/EmoSpeaker/

AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation

Speech/LLMs/Talking Head

AVI-Talking is a system for creating lifelike talking faces that match speech with expressive facial movements. Using advanced language models, it generates instructions for facial details based on speech, resulting in realistic and emotionally consistent animations.

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Talking Head/Face Generation/Lipsync/NeRF

Real3D-Portrait addresses limitations in one-shot 3D talking portrait generation by improving reconstruction accuracy, animation stability, and realism. It employs a large image-to-plane model, an efficient motion adapter, and a head-torso-background super-resolution model for realistic videos, alongside a generalizable audio-to-motion model for audio-driven animation.

Resolution-Agnostic Neural Compression for High-Fidelity Portrait Video Conferencing via Implicit Radiance Fields

Talking Head/Face Generation/Lipsync/NeRF

This paper proposes a novel low-bandwidth neural compression approach for high-fidelity portrait video conferencing. Dynamic neural radiance fields reconstruct talking heads from expression features, enabling ultra-low bandwidth transmission and high-fidelity portrait rendering via volume rendering.

Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis

Talking Head/Face Generation/Lipsync

The paper introduces DynTet, a novel hybrid representation combining neural networks and dynamic meshes for accurate facial avatar generation. It addresses artifacts and jitters in implicit methods like NeRF, achieving fidelity, lip synchronization, and real-time performance. Code is available at https://github.com/zhangzc21/DynTet

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Talking Head/Face Generation/Lipsync

EMO is a framework for generating lifelike talking head videos by directly synthesizing video from audio inputs. Unlike traditional methods, EMO bypasses 3D models, ensuring seamless transitions and maintaining identity. Experimental results show superior expressiveness and realism, even in singing videos.

Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes

Lipsync

DeepFakes can be divided into entertainment applications, such as face swapping, and illicit uses, such as lip-syncing fraud.

FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

Lipsync

This paper proposes a method for generating diverse and synchronized talking faces from a single audio input. It tackles challenges by decoupling identity, content, and emotion from audio and maintaining diversity and consistency. The method involves Progressive Audio Disentanglement and Controllable Coherent Frame generation.

G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment

Lipsync

This paper addresses the challenge of generating high-fidelity talking faces with synchronized lip movements for arbitrary audio. It proposes G4G, a framework that enhances audio-image alignment using diagonal matrices and multi-scale supervision, achieving competitive results.

Context-aware Talking Face Video Generation

Talking Head/Face Generation

This paper introduces a method for generating multi-person talking face videos that takes contextual interactions into account. It uses facial landmarks to control the video generation stages, achieving synchronized and coherent results that surpass the baselines.