TL;DR: Self-guidance is a method for controllable image generation that guides sampling using only the attention and activations of a pretrained diffusion model. Without any extra models or training, you can move or resize objects, or even replace them with items from real images, without changing the rest of the scene. You can also borrow the appearance of another image or rearrange scenes into a desired layout.
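A minimal sketch of the self-guidance idea, under stated assumptions: a property of the model's own cross-attention (here, the centroid of an object's attention map) is turned into an energy, and its gradient with respect to the noisy latent is added to the predicted noise. The `unet` and `attn_map_for_token` callables, the weight `w_sg`, and the "move the object to `target_yx`" objective are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch, not the authors' code: `unet` and `attn_map_for_token`
# are placeholder callables standing in for a latent-diffusion denoiser and
# for a hook that caches its cross-attention maps during the forward pass.
import torch

def centroid(attn):
    # attn: (H, W) non-negative attention map for one text token
    attn = attn / attn.sum()
    ys = torch.arange(attn.shape[0], device=attn.device, dtype=attn.dtype)
    xs = torch.arange(attn.shape[1], device=attn.device, dtype=attn.dtype)
    y = (attn.sum(dim=1) * ys).sum()   # row (vertical) centroid
    x = (attn.sum(dim=0) * xs).sum()   # column (horizontal) centroid
    return torch.stack([y, x])

def self_guided_noise(z_t, t, unet, attn_map_for_token, target_yx, w_sg=1.0):
    # Differentiate a property of the model's own attention w.r.t. the noisy
    # latent and add that gradient to the predicted noise ("move object" edit).
    z_t = z_t.detach().requires_grad_(True)
    eps = unet(z_t, t)                      # forward pass also fills the attention cache
    attn = attn_map_for_token()             # cached cross-attention map of the edited object
    energy = (centroid(attn) - target_yx).pow(2).sum()
    grad, = torch.autograd.grad(energy, z_t)
    return eps.detach() + w_sg * grad       # guided noise prediction for this step
```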
VLOGGER is a novel framework to synthesize humans from audio. Given a single input image of a person and a sample audio input, the method generates photorealistic and temporally coherent videos of the person talking and moving vividly.
Modeling animatable human avatars from RGB videos is a long-standing and challenging problem. Recent works usually adopt MLP-based neural radiance fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to regress pose-dependent garment details.
StyleGaussian is a novel 3D style transfer technique that allows instant transfer of any image's style to a 3D scene at 10 frames per second (fps). Leveraging 3D Gaussian Splatting (3DGS), StyleGaussian achieves style transfer without compromising real-time rendering or multi-view consistency. Project: https://kunhao-liu.github.io/StyleGaussian/
This page lists some speech-related research at Microsoft Research Asia, conducted by the team led by Xu Tan. The research topics cover text-to-speech, singing voice synthesis, music generation, automatic speech recognition, etc. Some of this research is open-sourced via NeuralSpeech and Muzic.
FACodec is a core component of the text-to-speech (TTS) model NaturalSpeech 3. It converts complex speech waveforms into disentangled subspaces representing the speech attributes of content, prosody, timbre, and acoustic details, and reconstructs high-quality speech waveforms from these attributes. HF demo: https://huggingface.co/spaces/amphion/naturalspeech3_facodec
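To make the disentanglement concrete, here is a conceptual interface sketch (the class and method names are hypothetical stand-ins, not the released FACodec API): because content, prosody, timbre, and acoustic details live in separate code subspaces, swapping one attribute while keeping the others, e.g. the timbre, amounts to zero-shot voice conversion.

```python
# Conceptual interface sketch only -- the class and method names below are
# hypothetical stand-ins, not the released FACodec API.
import torch

class DisentangledSpeechCodec(torch.nn.Module):
    """Encodes a waveform into separate attribute codes and decodes them back."""

    def encode(self, wav: torch.Tensor) -> dict:
        # Would return {"content": ..., "prosody": ..., "timbre": ..., "details": ...}
        raise NotImplementedError

    def decode(self, codes: dict) -> torch.Tensor:
        # Would reconstruct a waveform from the attribute codes.
        raise NotImplementedError

def convert_voice(codec: DisentangledSpeechCodec,
                  source_wav: torch.Tensor,
                  reference_wav: torch.Tensor) -> torch.Tensor:
    # Because the attributes live in separate subspaces, swapping only the
    # timbre code yields zero-shot voice conversion.
    src = codec.encode(source_wav)
    ref = codec.encode(reference_wav)
    src["timbre"] = ref["timbre"]
    return codec.decode(src)
```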
Overview of ResAdapter. Left: pipeline of ResAdapter. Built on a frozen base model (e.g., SD or SDXL), it learns resolution priors from mixed-resolution datasets and can be integrated into any personalized model to generate multi-resolution images. Right: architecture comparison between ResAdapter and vanilla LoRA. ResAdapter is inserted only into the downsamplers and upsamplers, and unfreezes the group normalization of the ResNet blocks. Page: https://res-adapter.github.io/
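As a rough illustration of that recipe, the sketch below marks which parameters would be trainable; it assumes a diffusers-style UNet whose module names contain "downsamplers", "upsamplers", and "resnets", and note that a real ResAdapter additionally inserts its own adapter weights at these locations rather than just tuning the original ones.

```python
# Sketch under stated assumptions: freeze the pretrained UNet, then unfreeze
# only the down/upsampler modules and the GroupNorm layers in the ResNet blocks.
import torch.nn as nn

def mark_resadapter_style_trainable(unet: nn.Module):
    for p in unet.parameters():
        p.requires_grad_(False)              # keep the pretrained base model frozen
    trainable = []
    for name, module in unet.named_modules():
        in_sampler = "downsamplers" in name or "upsamplers" in name
        resnet_groupnorm = isinstance(module, nn.GroupNorm) and "resnets" in name
        if in_sampler or resnet_groupnorm:
            for p in module.parameters(recurse=False):
                p.requires_grad_(True)       # only these parameters are trained
                trainable.append(p)
    return trainable
```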
PhysGaussian is a pioneering unified simulation-rendering pipeline that generates physics-based dynamics and photo-realistic renderings simultaneously and seamlessly. Page: https://xpandora.github.io/PhysGaussian/
GaussianAvatars combines dynamic 3D Gaussian splats with a parametric morphable face model for photorealistic avatars. The method excels in animation control, showing superior performance in reenactment from driving videos and surpassing existing techniques.
The Gaussian Head Avatar method combines controllable 3D Gaussians and MLP-based deformation fields to achieve high-fidelity head avatar modeling, outperforming existing sparse-view methods. It ensures fine-grained dynamic details and expression accuracy, achieving ultra-high-fidelity rendering quality at 2K resolution.
Noise Rectification is a simple but effective, tuning-free, plug-and-play method for image-to-video generation in open domains.
Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence. A grand challenge in I2V generation is to maintain visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame, as well as to ensure a fluid and logical progression within the video narrative.
A novel method utilizing 3D Gaussians for creating expressive 3D avatars achieves state-of-the-art performance in photorealistic novel view synthesis. It features accurate pose estimation, attention-aware networks, and an iterative re-initialization strategy for high-fidelity reconstructions and fine-grained control over body and hand poses. Project page: https://3d-aigc.github.io/GEA/
PeRFlow trains piecewise-linear rectified flow models for fast sampling. These models can be initialized from pretrained diffusion models, such as Stable Diffusion (SD). The obtained PeRFlow weights serve as a general accelerator module that is compatible with various fine-tuned stylized SD models as well as SD-based generation/editing pipelines. Specifically, the delta weights are computed as PeRFlow's weights minus the pretrained SD weights. One can fuse these delta weights into various SD pipelines for (conditional) image generation/editing to enable high-quality few-step inference.
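A minimal sketch of that fusion arithmetic over state dicts, assuming the three checkpoints share the same key layout; the function and variable names are placeholders, not part of the PeRFlow release.

```python
# Hedged sketch: delta = W_PeRFlow - W_SD, then W_custom + delta gives an
# accelerated version of a fine-tuned SD model.
import torch

def fuse_perflow(custom_sd: dict, perflow_sd: dict, base_sd: dict) -> dict:
    fused = {}
    for key, w_custom in custom_sd.items():
        if key in perflow_sd and key in base_sd:
            delta = perflow_sd[key] - base_sd[key]   # PeRFlow acceleration delta
            fused[key] = w_custom + delta            # accelerated fine-tuned weight
        else:
            fused[key] = w_custom                    # leave unmatched weights untouched
    return fused
```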
This paper presents a novel approach to one-shot face stylization, focusing on appearance and structure. They use a self-supervised vision transformer, DINO-ViT, and integrate spatial transformers into StyleGAN for deformation-aware stylization. Innovative constraints and style-mixing enhance deformability and efficiency, demonstrating superiority over existing methods through extensive comparisons. Code is available at https://github.com/zichongc/DoesFS.
Pix2Gif introduces a novel approach to image-to-GIF generation using text and motion prompts. The model utilizes motion-guided warping and perceptual loss to ensure content consistency. Pretrained on curated data, it effectively translates prompts into coherent GIFs, demonstrated through extensive experiments. Page: https://hiteshk03.github.io/Pix2Gif/
PixArt-Σ is a cutting-edge Diffusion Transformer model that generates 4K images with superior fidelity and alignment to text prompts. It achieves this through high-quality training data and efficient token compression, resulting in smaller model size and superior image quality compared to existing models. Project-Page: https://pixart-alpha.github.io/PixArt-sigma-project/
The proposal introduces EmoSpeaker, a method enhancing emotional expression in generated facial animations. It employs a visual attribute-guided audio decoupler, fine-grained emotion coefficient prediction, and intensity control to improve emotional quality and lip synchronization. Experimental results show superiority over existing methods. Project-Page: https://peterfanfan.github.io/EmoSpeaker/
AVI-Talking is a system for creating lifelike talking faces that match speech with expressive facial movements. Using advanced language models, it generates instructions for facial details based on the speech, resulting in realistic and emotionally consistent animations.
Real3D-Portrait addresses limitations in one-shot 3D talking portrait generation by enhancing reconstruction accuracy, stable animation, and realism. It employs a large image-to-plane model, efficient motion adapter, and head-torso-background super-resolution model for realistic videos, alongside a generalizable audio-to-motion model for audio-driven animation.
A novel low bandwidth neural compression approach for high-fidelity portrait video conferencing is proposed. Dynamic neural radiance fields reconstruct talking heads with expression features, enabling ultra-low bandwidth transmission and high fidelity portrait rendering via volume rendering.
The paper introduces DynTet, a novel hybrid representation combining neural networks and dynamic meshes for accurate facial avatar generation. It addresses artifacts and jitters in implicit methods like NeRF, achieving fidelity, lip synchronization, and real-time performance. Code is available at https://github.com/zhangzc21/DynTet.
EMO is a pioneering framework for generating lifelike talking-head videos by directly synthesizing video from audio inputs. Unlike traditional methods, EMO bypasses 3D models, ensuring seamless transitions and maintaining identity. Experimental results show superior expressiveness and realism, even in singing videos.
This paper proposes a method for generating diverse and synchronized talking faces from a single audio input. It tackles challenges by decoupling identity, content, and emotion from audio and maintaining diversity and consistency. The method involves Progressive Audio Disentanglement and Controllable Coherent Frame generation.
This paper addresses the challenge of generating high-fidelity talking faces with synchronized lip movements for arbitrary audio. They propose G4G, a framework enhancing audio-image alignment using diagonal matrices and multi-scale supervision, achieving competitive results.
This paper introduces a method for generating multi-person talking face videos considering contextual interactions. It utilizes facial landmarks to control video generation stages, achieving synchronized and coherent results surpassing baselines.