Our method allows warping Gaussian noise with extreme deformations while still preserving its Gaussian properties. This is not achievable with standard warping methods, as we show in the comparison below. In the first row, we compare with standard interpolation methods applied to consecutive frames \((F_{n-1}, F_{n})\). These tend to create numerical dissipation, which destroys high-frequency details and produces blurring. In the second row, we compute a flow map between the initial frame and the current one, and apply the interpolation methods to the pair \((F_0, F_{n})\). Due to numerical error in the mapping, flickering and incoherence appear in the result. Our \(\smallint\)-noise method outperforms all existing warping methods by transporting the noise perfectly while keeping its Gaussian properties.
Bilinear
Bicubic
Nearest Neighbor
Root-bilinear*
\(\smallint\)-noise (Ours)
*The root-bilinear interpolation is a simple modification of bilinear interpolation in which we replace the interpolation coefficients with their square roots. When applied to a set of independent Gaussian noise samples, it preserves the unit variance, because the squared coefficients then sum to one.
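The variance behavior described in the footnote can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: `shift_warp` is a hypothetical helper that warps a periodic noise image by a uniform sub-pixel translation, a much simpler setting than the deformations in the comparison above. It assumes independent unit-variance Gaussian input samples.

```python
import numpy as np

def shift_warp(noise, dx, dy, root=False):
    """Warp a periodic 2D noise image by a sub-pixel shift (dx, dy) in [0, 1).

    Hypothetical helper for illustration only. With root=False this is plain
    bilinear interpolation; with root=True the four bilinear coefficients are
    replaced by their square roots (root-bilinear).
    """
    # Bilinear coefficients of the four neighboring pixels; they sum to 1.
    w = np.array([(1 - dx) * (1 - dy), dx * (1 - dy), (1 - dx) * dy, dx * dy])
    if root:
        w = np.sqrt(w)  # squared coefficients now sum to 1
    n00 = noise
    n01 = np.roll(noise, -1, axis=1)              # right neighbor
    n10 = np.roll(noise, -1, axis=0)              # lower neighbor
    n11 = np.roll(noise, (-1, -1), axis=(0, 1))   # diagonal neighbor
    return w[0] * n00 + w[1] * n01 + w[2] * n10 + w[3] * n11

rng = np.random.default_rng(0)
noise = rng.standard_normal((256, 256))  # independent unit-variance samples

bilinear = shift_warp(noise, 0.5, 0.5)
root_bilinear = shift_warp(noise, 0.5, 0.5, root=True)
print("bilinear variance:     ", bilinear.var())       # well below 1
print("root-bilinear variance:", root_bilinear.var())  # close to 1

# Repeated bilinear warping of consecutive frames keeps losing variance.
frame = noise
for _ in range(10):
    frame = shift_warp(frame, 0.5, 0.5)
print("bilinear after 10 warps:", frame.var())
```

For independent inputs, the output variance equals the sum of the squared coefficients, which is below one for bilinear weights and exactly one for their square roots. Unit variance alone does not make the warped samples independent, since neighboring output pixels share input samples.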
Here are some results of noise warping in latent diffusion models. Please refer to the paper appendix for more details.
No Cross-Frame Attention
Fixed Noise
\(\smallint\)-noise (Ours)
With Cross-Frame Attention
Fixed Noise
\(\smallint\)-noise (Ours)
With Cross-Frame Attention + Feature Injection
Fixed Noise
\(\smallint\)-noise (Ours)
DDIM inversion is a popular inversion method that has also been used to obtain more informative noise priors for video editing tasks. However, it suffers from two main problems. First, it only produces one noise map per image. This may not always be compatible with other diffusion-based methods like SDEdit or I²SB, which use DDPM. Second, as it remains primarily an inversion method, the spatial and temporal information of the image is entangled inside the noise.
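To illustrate why DDIM inversion yields a single noise map per image, here is a minimal sketch of the standard deterministic DDIM update run toward higher noise levels. It is not the code used for the comparisons on this page: `eps_model` stands in for a trained noise predictor and `alpha_bars` for a cumulative noise schedule, and both are assumptions made for illustration.

```python
import numpy as np

def ddim_invert(x0, alpha_bars, eps_model):
    """Deterministic DDIM inversion sketch (no stochastic term).

    x0:         clean image or latent as a NumPy array
    alpha_bars: decreasing cumulative schedule, alpha_bars[0] close to 1
    eps_model:  hypothetical callable (x, t) -> predicted noise, same shape as x
    """
    x = np.asarray(x0, dtype=np.float64)
    for t in range(len(alpha_bars) - 1):
        a_t, a_next = alpha_bars[t], alpha_bars[t + 1]
        eps = eps_model(x, t)
        # Predicted clean sample implied by the current noise estimate.
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Re-noise to the next (noisier) level, reusing the same estimate.
        x = np.sqrt(a_next) * x0_hat + np.sqrt(1.0 - a_next) * eps
    return x

# Placeholder predictor for demonstration; a real model would be a network.
toy_eps = lambda x, t: np.zeros_like(x)
schedule = np.linspace(1.0, 0.01, 50)
image = np.random.default_rng(0).standard_normal((8, 8))

first = ddim_invert(image, schedule, toy_eps)
second = ddim_invert(image, schedule, toy_eps)
print(np.array_equal(first, second))  # same input always gives the same map
```

Because no random sample is drawn at any step, the output is a deterministic function of the input image, whereas DDPM-based samplers draw fresh noise at every step.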
We compare our method with DDIM inversion in the appearance transfer example with SDEdit. We experiment with two settings:
DDIM inversion to intermediate step
DDIM inversion as initial noise
\(\smallint\)-noise (Ours)
Here is a visual comparison between our noise prior and the one obtained from DDIM inversion. While DDIM inversion produces temporally correlated noise, its distribution heavily depends on how far the input video is from the training distribution of the diffusion model. In contrast, our warping method retains the Gaussian properties of the noise.
DDIM inversion noise
\(\smallint\)-noise (Ours)