Multi-Modal Study

Paper Review

StoryImager - A Unified and Efficient Framework for Coherent Story Visualization and Completion

Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, Changsheng Xu

paper: StoryImager

Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, Weidi Xie

paper: Intelligent Grimm

Generating Realistic Images from In-the-wild Sounds

Taegyeong Lee, Jeonghun Kang, Hyeonyu Kim, Taehwan Kim

paper: Image from in-the-wild Sounds

To avoid falling into a local minimum, the stage (a) initialization that uses audio attention and sentence attention is critical.
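
The note above concerns the starting point of the vector that stage (a) optimizes: weighting with both audio attention and sentence attention gives an initialization that already reflects the sound, instead of optimizing from scratch. Below is a minimal sketch of that weighting idea, assuming hypothetical tensor shapes; `token_embs`, `audio_attn`, and `sentence_attn` are illustrative names, not the paper's API:

```python
# Minimal sketch of the stage (a) initialization idea; NOT the authors' code.
import torch

def init_vector(token_embs: torch.Tensor,    # (T, D) caption token embeddings
                audio_attn: torch.Tensor,    # (T,)  per-token weight from the audio
                sentence_attn: torch.Tensor  # (T,)  per-token weight from the sentence
                ) -> torch.Tensor:
    # Combine the two attention maps so tokens that matter both acoustically
    # and semantically dominate the starting point.
    w = audio_attn * sentence_attn
    w = w / w.sum()                                      # normalize to a distribution
    init = (w.unsqueeze(-1) * token_embs).sum(dim=0)     # (D,) weighted pooling
    # Optimizing from this initialization, rather than from noise, is what the
    # note credits with avoiding poor local minima.
    return init.detach().clone().requires_grad_(True)
```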

ViViT - A Video Vision Transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Schmid

paper: ViViT

The ViViT notes are not finished yet. TBD…

LLaMA-VID - An Image is Worth 2 Tokens in Large Language Models

PEEKABOO - Interactive Video Generation via Masked-Diffusion

Video Diffusion Models

Style Aligned Image Generation via Shared Attention

ControlNet - Adding Conditional Control to Text-to-Image Diffusion Models

InstructPix2Pix - Learning to Follow Image Editing Instructions

The Platonic Representation Hypothesis

Action2Sound - Ambient-Aware Generation of Action Sounds from Egocentric Videos

MasaCtrl - Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

Overview
(Mask-Guided) Mutual Self-Attention

DreamMatcher - Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization