How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models

Pascal Chang, Jingwei Tang, Markus Gross, Vinicius C. Azevedo
ICLR 2024 (Oral)
ETH Zürich

Video comparison: Fixed Noise | Bilinear Warping | PYoCo [1] | \(\smallint\)-noise (Ours)

We propose a distribution-preserving warping method tailored for Gaussian noise. The resulting noise samples serve as good noise priors for improving temporal coherency in diffusion models.

Abstract

Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This produces either high-frequency flickering or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed \(\smallint\)-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses \(\smallint\)-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed \(\smallint\)-noise can be used for a variety of tasks, such as video restoration and editing, surrogate rendering, and conditional video generation.

Method Overview

The integral noise representation
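
As a rough illustration of the representation (a minimal sketch with an assumed sub-pixel factor `k`, not the paper's implementation), each noise pixel can be viewed as the normalized integral of a finer white-noise field over the pixel area; aggregating the sub-pixel samples and dividing by \(k\) keeps every pixel distributed as \(\mathcal{N}(0,1)\):

```python
import numpy as np

def sample_integral_noise(h, w, k=8, rng=None):
    """Sample an h x w Gaussian noise image as the integral of a finer noise field.

    Each pixel aggregates a k x k block of i.i.d. N(0,1) sub-pixel samples;
    dividing the block sum by k keeps every pixel value distributed as N(0,1).
    """
    rng = np.random.default_rng() if rng is None else rng
    fine = rng.standard_normal((h * k, w * k))              # finite stand-in for the underlying field
    coarse = fine.reshape(h, k, w, k).sum(axis=(1, 3)) / k  # integrate over each pixel area
    return coarse, fine

coarse, fine = sample_integral_noise(64, 64)
print(coarse.std())  # ~1.0: the aggregated pixels remain unit-variance Gaussian
```

Roughly speaking, the transport method then advects this underlying fine field with the video's deformation and re-aggregates it per pixel; the sketch above covers only the static representation.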

Noise Warping Comparison

Our method allows warping Gaussian noise under extreme deformations while still preserving its Gaussian properties. This is not achievable with standard warping methods, as the comparison below shows. In the first row, we compare against standard interpolation methods applied between consecutive frames \((F_{n-1}, F_{n})\). These accumulate numerical dissipation, which destroys high-frequency details and produces blurring. In the second row, we compute a flow map between the initial frame and the current one, and apply the interpolation methods to the pair \((F_0, F_{n})\). Due to numerical error in the mapping, flickering and incoherence appear in the result. Our \(\smallint\)-noise method outperforms the existing warping methods by transporting the noise exactly while keeping its Gaussian properties.
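
To make the dissipation issue concrete, here is a small self-contained experiment (the flow field and iteration count are made up for illustration) showing how repeated bilinear warping of a Gaussian noise image, as in the frame-to-frame \((F_{n-1}, F_{n})\) setting, steadily shrinks its variance and blurs it:

```python
import numpy as np
from scipy.ndimage import map_coordinates

rng = np.random.default_rng(0)
noise = rng.standard_normal((128, 128))

# A smooth, made-up deformation with fractional (sub-pixel) offsets.
ys, xs = np.meshgrid(np.arange(128), np.arange(128), indexing="ij")
coords = np.stack([ys + 0.5 * np.sin(xs / 10.0), xs + 0.3], axis=0)

warped = noise
for _ in range(20):  # warp frame-to-frame 20 times, as in the (F_{n-1}, F_n) setting
    warped = map_coordinates(warped, coords, order=1, mode="wrap")  # order=1 -> bilinear

print(noise.std(), warped.std())  # the warped noise ends up with a much smaller variance (blurred out)
```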


Comparison columns: Bilinear | Bicubic | Nearest Neighbor | Root-bilinear* | \(\smallint\)-noise (Ours)


*Root-bilinear interpolation is a simple modification of bilinear interpolation in which the interpolation coefficients are replaced by their square roots. This has the property that, when applied to a set of independent Gaussian noise samples, it preserves their unit variance.
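
As a quick numerical sanity check (illustrative code, not part of the paper), interpolating four independent Gaussian neighbors with bilinear weights shrinks the variance to \(\sum_i w_i^2 < 1\), whereas the square-rooted weights keep it at \(\sum_i w_i = 1\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1_000_000))  # four independent N(0,1) neighbor pixels

tx, ty = 0.3, 0.7                        # fractional sample position inside the pixel cell
w = np.array([(1 - tx) * (1 - ty), tx * (1 - ty), (1 - tx) * ty, tx * ty])  # sums to 1

bilinear = w @ x                         # Var = sum(w_i^2) < 1
root_bilinear = np.sqrt(w) @ x           # Var = sum(w_i)   = 1

print(bilinear.std(), root_bilinear.std())  # ~0.58 vs ~1.00
```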


Realistic Appearance Transfer with SDEdit

Video comparison: Random Noise | Fixed Noise | PYoCo (progressive) [1] | Control-A-Video [2] | Bilinear Warping | Bicubic Warping | Nearest Warping | \(\smallint\)-noise (Ours)


Video Super-resolution with I²SB

Video comparisons (two examples): Input (low-res) | Random Noise | Fixed Noise | PYoCo (progressive) [1] | Control-A-Video [2] | Bilinear Warping | Bicubic Warping | Nearest Warping | \(\smallint\)-noise (Ours)


Video JPEG Restoration with I²SB

Video comparisons (two examples): Input (JPEG compressed) | Random Noise | Fixed Noise | PYoCo (progressive) [1] | Control-A-Video [2] | Bilinear Warping | Bicubic Warping | Nearest Warping | \(\smallint\)-noise (Ours)


Pose-to-Person Video Generation with PIDM

Video comparisons (two examples): Fixed Noise | Random Noise | Bilinear Warping | \(\smallint\)-noise (Ours)


Fluid Simulation Super-resolution

Video comparisons (three examples): Condition | Fixed Noise | Random Noise | Control-A-Video [2] | \(\smallint\)-noise (Ours)


(Supplementary) Integration with DeepFloyd IF

DeepFloyd IF is a state-of-the-art text-to-image diffusion model by Stability AI. It consists of a frozen text encoder and three cascaded pixel diffusion models, respectively generating 64x64 px, 256x256 px and 1024x1024 px images. We show that our \(\smallint\)-noise prior can be integrated with DeepFloyd IF. We illustrate this on two tasks: video super-resolution and video stylization.


Video super-resolution: we give a 64x64 video sample to the Stage II model and use the following prompts to guide super-resolution: "A blackswan on water, photography, 4k", "A car on a road in the mountains". We additionally apply a simple cross-frame attention mechanism to help improve temporal coherency. The results below comparing different noise priors show that the simple combination of cross-frame attention with our \(\smallint\)-noise prior significantly reduces visual artifacts when lifting DeepFloyd IF to the temporal domain.
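
For reference, the cross-frame attention we refer to is the common trick of letting each frame's self-attention layers read keys and values from an anchor frame; a minimal sketch (the function name `cross_frame_attention` and the anchor-selection logic are illustrative, not DeepFloyd IF's API) could look like this:

```python
import torch

def cross_frame_attention(q, k, v, anchor_every=10):
    """Attention where every frame attends to the keys/values of an anchor frame.

    q, k, v: (num_frames, num_tokens, dim) projections from a self-attention
    layer. Tying all frames to a shared anchor encourages consistent textures
    and reduces flickering.
    """
    n, t, d = q.shape
    anchor = (torch.arange(n) // anchor_every) * anchor_every   # anchor index per frame
    k_a, v_a = k[anchor], v[anchor]                              # (n, t, d)
    attn = torch.softmax(q @ k_a.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ v_a

# Toy usage: 20 frames, 64 tokens, 32 channels.
q, k, v = (torch.randn(20, 64, 32) for _ in range(3))
print(cross_frame_attention(q, k, v).shape)  # torch.Size([20, 64, 32])
```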


Video comparison: Input (low-res) | DeepFloyd IF + Random noise | DeepFloyd IF + Fixed noise | DeepFloyd IF + \(\smallint\)-noise (Ours)


Video stylization: we experiment with DeepFloyd IF's style transfer ability on the mountain car example. Below, we show an oil-painting style applied to the original video. Our \(\smallint\)-noise prior is compared to fixed and random noise, and cross-frame attention is applied (anchor frame every 10 frames). As one can see, the choice of noise prior has a clear impact on the temporal coherence of the final results. Note that better tuning of the style transfer parameters in DeepFloyd IF would likely further improve the stylization quality.


Video comparison: Input | DeepFloyd IF + Random noise | DeepFloyd IF + Fixed noise | DeepFloyd IF + \(\smallint\)-noise (Ours)


(Supplementary) Warped Noise in Latent Diffusion

No Cross-Frame Attention: Fixed Noise | \(\smallint\)-noise (Ours)

With Cross-Frame Attention: Fixed Noise | \(\smallint\)-noise (Ours)

With Cross-Frame Attention + Feature Injection: Fixed Noise | \(\smallint\)-noise (Ours)

(Supplementary) Comparison with DDIM inversion

DDIM inversion is a popular inversion method that has also been used to obtain more informative noise priors for video editing tasks. However, it suffers from two main problems. First, it only produces one noise map per image. This may not always be compatible with other diffusion-based methods like SDEdit or I²SB, which use DDPM sampling. Second, as it remains primarily an inversion method, the spatial and temporal information of the image is entangled inside the noise.
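
For context, DDIM inversion deterministically maps an image to a noise map by running the DDIM update in the noising direction; because every step is deterministic, each image yields exactly one noise map. A schematic sketch (the `eps_model` and schedule below are placeholders, not the actual setup used here):

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alpha_bar):
    """Deterministic DDIM inversion: clean image -> single noise map.

    eps_model(x, t) is a placeholder epsilon-prediction network; alpha_bar is
    the cumulative schedule (close to 1 at t=0, close to 0 at the last step).
    """
    x = x0
    for t in range(len(alpha_bar) - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean image
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps   # one deterministic step toward noise
    return x

# Toy usage with a dummy epsilon-predictor standing in for the diffusion model.
alpha_bar = torch.linspace(0.9999, 1e-4, 50)
noise_map = ddim_invert(torch.randn(1, 3, 64, 64), lambda x, t: torch.zeros_like(x), alpha_bar)
print(noise_map.shape)  # torch.Size([1, 3, 64, 64])
```

Since the procedure is an inversion of the image itself, the resulting "noise" still encodes the image content, which is why spatial and temporal information end up entangled.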


We compare our method with DDIM inversion in the appearance transfer example with SDEdit. We experiment with two settings:

  • DDIM inversion to intermediate step (first row): similar to how we applied SDEdit above, we use DDIM to invert the synthetic video frames back to an intermediate timestep (60% of total steps), and then denoise them with deterministic DDIM sampling. As expected, since there is no prompt to be changed or other settings to be modified, this mostly reconstructs the original synthetic video without adding any realistic appearance details.
  • DDIM inversion as initial noise (second row): we run a full DDIM inversion for each frame to obtain a noise map, which we treat like the other noise priors shown above: we add it to the input frames and denoise with deterministic DDIM sampling. Because the input video is far from the data distribution of the model (trained on realistic images of bedrooms), the DDIM-inverted noise is far from Gaussian, which makes it a poor candidate as a noise prior.
In comparison, our noise prior contains only temporal information, so the model can generate realistic details on top of the synthetic scene without being constrained to reconstruct the input sequence. Furthermore, by applying the same warping to different noise samples (which DDIM inversion cannot do), we can obtain different variations in the final result (third row). Note that, for fairness, we also use deterministic DDIM for denoising in our method, i.e., our noise warping is only used once, for the initial noise.


Video rows (top to bottom): DDIM inversion to intermediate step | DDIM inversion as initial noise | \(\smallint\)-noise (Ours)


Here is a visual comparison between our noise prior and the one obtained from DDIM inversion. While DDIM inversion produces temporally correlated noise, its distribution heavily depends on how far the input video is from the training distribution of the diffusion model. In contrast, our warping method retains the Gaussian properties of the noise.


Noise visualization: DDIM inversion noise | \(\smallint\)-noise (Ours)

References

[1] Ge, Songwei, et al. "Preserve your own correlation: A noise prior for video diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.


[2] Chen, Weifeng, et al. "Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models." arXiv preprint arXiv:2305.13840 (2023).