Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed \(\smallint\)-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses \(\smallint\)-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed \(\smallint\)-noise can be used for a variety of tasks, such as video restoration and editing, surrogate rendering, and conditional video generation.
Our method allows warping Gaussian noise with extreme deformations while still preserving its Gaussian properties. This is not achievable with standard warping methods, as we show in the comparison below. In the first row, we compare with standard interpolation methods applied to consecutive frames \((F_{n-1}, F_{n})\). These tend to create numerical dissipation, which destroys high-frequency details and produces blurring. In the second row, we compute a flow map between the initial frame and the current one, and apply the interpolation methods to the pair \((F_0, F_{n})\). Due to numerical error in the mapping, flickering and incoherence appear in the result. Our \(\smallint\)-noise method outperforms all existing warping methods by transporting the noise perfectly while keeping its Gaussian properties.
[Noise warping comparison: Bilinear, Bicubic, Nearest Neighbor, Root-bilinear*, \(\smallint\)-noise (Ours)]
*Root-bilinear interpolation is a simple modification of bilinear interpolation in which the interpolation coefficients are replaced by their square roots. When applied to a set of independent Gaussian noise samples, this preserves unit variance.
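To make the failure modes concrete, the following sketch (ours, not the paper's implementation) warps a Gaussian noise image by a half-pixel shift in three ways: bilinear interpolation, root-bilinear interpolation, and a simplified, finite-resolution version of the integral-noise idea, in which subpixel noise is sampled conditionally on each pixel value and then re-aggregated over the warped footprint. The array sizes, the subpixel factor k, and the rigid translation are illustrative choices; the actual \(\smallint\)-noise transport handles arbitrary deformations by integrating over exact warped pixel footprints.

```python
# Sketch: warping a Gaussian noise field with (a) bilinear interpolation,
# (b) root-bilinear interpolation, and (c) a simplified integral-noise-style
# transport, then checking which variants keep unit variance.
import numpy as np

rng = np.random.default_rng(0)
H = W = 64
k = 8                                   # subpixels per pixel side
noise = rng.standard_normal((H, W))     # one unit-variance noise sample per pixel

# (a) Bilinear resampling at a half-pixel shift averages two independent
# samples, shrinking the standard deviation to sqrt(0.5) ~= 0.71.
shifted = np.roll(noise, 1, axis=1)
bilinear = 0.5 * noise + 0.5 * shifted

# (b) Root-bilinear keeps unit variance but correlates neighbouring pixels.
root_bilinear = np.sqrt(0.5) * noise + np.sqrt(0.5) * shifted

# (c) Integral-noise-style transport (simplified): sample subpixel integrals
# conditioned on each pixel's value, shift the subpixel field, then
# re-aggregate over each output pixel's footprint.
eps = rng.standard_normal((H, W, k, k)) / k                      # iid N(0, 1/k^2)
eps = eps - eps.mean(axis=(2, 3), keepdims=True) + noise[..., None, None] / k**2
fine = eps.transpose(0, 2, 1, 3).reshape(H * k, W * k)           # subpixel noise field

fine_warped = np.roll(fine, k // 2, axis=1)                      # same half-pixel shift

# Each output pixel integrates k*k warped subpixels; dividing by the square
# root of the footprint's total variance (here k^2 * (1/k^2) = 1) restores
# unit variance.
blocks = fine_warped.reshape(H, k, W, k).sum(axis=(1, 3))
integral_warp = blocks / np.sqrt(k * k * (1.0 / k**2))

for name, x in [("bilinear", bilinear), ("root-bilinear", root_bilinear),
                ("integral", integral_warp)]:
    print(f"{name:14s} std = {x.std():.3f}")
```

Running this prints a standard deviation of roughly 0.71 for bilinear warping and roughly 1.0 for the other two; unlike root-bilinear, the integral variant also keeps distinct output pixels uncorrelated, because their footprints aggregate disjoint sets of independent subpixel samples.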
[Video super-resolution comparisons (low-resolution inputs): Random Noise, Fixed Noise, PYoCo (progressive) [1], Control-A-Video [2], Bilinear Warping, Bicubic Warping, Nearest Warping, \(\smallint\)-noise (Ours)]
[Video restoration comparisons (JPEG-compressed inputs): same noise priors and warping methods]
[Further comparison with the same noise priors and warping methods]
[Video comparisons: Fixed Noise, Random Noise, Bilinear Warping, \(\smallint\)-noise (Ours)]
[Conditional video generation comparisons (condition → output): Fixed Noise, Random Noise, Control-A-Video [2], \(\smallint\)-noise (Ours)]
DeepFloyd IF is a state-of-the-art text-to-image diffusion model by Stability AI. It consists of a frozen text encoder and three cascaded pixel diffusion models that generate 64x64 px, 256x256 px, and 1024x1024 px images, respectively. We show that our \(\smallint\)-noise prior can be integrated with DeepFloyd IF, and illustrate this on two tasks: video super-resolution and video stylization.
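For reference, the sketch below chains the first two DeepFloyd IF stages with the Hugging Face diffusers library, following its publicly documented usage. It is plain per-image text-to-image sampling: the third-stage upscaler is omitted, and the video results on this page additionally use the \(\smallint\)-noise prior and the cross-frame attention described below rather than the default noise sampling.

```python
# Minimal text-to-image sketch of the DeepFloyd IF cascade (stages I and II).
# Requires accepting the DeepFloyd license on the Hugging Face Hub.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# Stage II upscales 64x64 outputs to 256x256; the text encoder from stage I is reused.
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16)
stage_2.enable_model_cpu_offload()

prompt = "A car on a road in the mountains"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
generator = torch.manual_seed(0)

image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                generator=generator, output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                generator=generator, output_type="pt").images
```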
Video super-resolution: we give a 64x64 video sample to the Stage II model and use the following prompts to guide super-resolution: "A blackswan on water, photography, 4k", "A car on a road in the mountains". We additionally apply a simple cross-frame attention mechanism to help improve temporal coherency. The results below comparing different noise priors show that the simple combination of cross-frame attention with our \(\smallint\)-noise prior significantly reduces visual artifacts when lifting DeepFloyd IF to the temporal domain.
[Video super-resolution comparison: Input (low-res), DeepFloyd IF + Random noise, DeepFloyd IF + Fixed noise, DeepFloyd IF + \(\smallint\)-noise (Ours)]
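The cross-frame attention used above replaces each frame's self-attention keys and values with those of an anchor frame, so that all frames attend to a shared reference. The sketch below is our simplified reading of that mechanism on generic attention features; the tensor shapes and the single-anchor scheme are illustrative and do not reproduce DeepFloyd IF's internal attention layout.

```python
# Sketch of cross-frame attention: every frame's queries attend to the
# anchor frame's keys and values instead of its own.
import torch
import torch.nn.functional as F

def cross_frame_attention(q, k, v, anchor: int = 0):
    """q, k, v: (frames, tokens, dim). All frames attend to the anchor frame's k/v."""
    frames = q.shape[0]
    k_anchor = k[anchor].expand(frames, -1, -1)   # reuse anchor keys for all frames
    v_anchor = v[anchor].expand(frames, -1, -1)   # reuse anchor values for all frames
    return F.scaled_dot_product_attention(q, k_anchor, v_anchor)

# Example: 16 frames, a 32x32 feature map flattened to 1024 tokens, 320-dim features.
q = torch.randn(16, 1024, 320)
k = torch.randn(16, 1024, 320)
v = torch.randn(16, 1024, 320)
out = cross_frame_attention(q, k, v, anchor=0)
print(out.shape)  # torch.Size([16, 1024, 320])
```

A common variant concatenates the anchor's keys and values with the current frame's instead of replacing them, trading some temporal consistency for per-frame fidelity.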
Video stylization: we experiment with DeepFloyd IF's style transfer ability on the mountain car example. Below, we show an oil-painting style applied to the original video. Our \(\smallint\)-noise prior is compared to fixed and random noise, and cross-frame attention is applied (anchor frame every 10 frames). As one can see, the choice of the noise prior has a clear impact on the temporal coherence of the final results. Note that better tuning of the style transfer parameters in DeepFloyd IF is likely to further improve the stylization quality.
[Video stylization comparison: Input, DeepFloyd IF + Random noise, DeepFloyd IF + Fixed noise, DeepFloyd IF + \(\smallint\)-noise (Ours)]
[Ablation comparisons of Fixed Noise vs. \(\smallint\)-noise (Ours): No Cross-Frame Attention; With Cross-Frame Attention; With Cross-Frame Attention + Feature Injection]
DDIM inversion is a popular inversion method that has also been used to obtain more informative noise priors for video editing tasks. However, it suffers from two main problems. First, it only produces one noise map per image, which may not be compatible with other diffusion-based methods such as SDEdit or I²SB that use DDPM sampling. Second, as it remains primarily an inversion method, the spatial and temporal information of the image is entangled inside the noise.
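For reference, DDIM inversion runs the deterministic DDIM update in reverse, mapping a clean image to a single noise map. A minimal sketch is shown below, assuming a generic \(\epsilon\)-prediction network eps_model(x, t) and a cumulative \(\alpha\) schedule alphas_cumprod; both are placeholders, not tied to a specific library.

```python
# Sketch of DDIM inversion: the deterministic DDIM step applied from low to
# high noise levels, producing one noise map per image.
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    """Map a clean image x0 to a noise map by reversing the DDIM update."""
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    x = x0
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        # Predicted clean image under the current noise estimate.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Deterministic DDIM step taken towards the higher noise level.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # a single noise map that entangles image content and noise
```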
We compare our method with DDIM inversion on the appearance transfer example with SDEdit, experimenting with two settings: (1) DDIM inversion to an intermediate step, and (2) DDIM inversion as the initial noise. A sketch of how a noise prior enters SDEdit is given after the comparison.
[Video comparison: DDIM inversion to intermediate step, DDIM inversion as initial noise, \(\smallint\)-noise (Ours)]
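To make the setting concrete, the sketch below shows how a per-frame noise prior (whether obtained by DDIM inversion or by our warping) plugs into SDEdit-style editing: each frame is diffused to an intermediate step with the supplied noise and then denoised back. The helpers denoise_from and warped_noise, and the step t0, are placeholders for illustration (t0=400 assumes a 1000-step schedule).

```python
# Sketch: SDEdit with an externally supplied (e.g. temporally correlated) noise prior.
import torch

def sdedit_with_noise_prior(frames, warped_noise, alphas_cumprod, denoise_from, t0=400):
    """frames, warped_noise: (T, C, H, W). Edit each frame starting from step t0."""
    a = alphas_cumprod[t0]
    # Forward diffusion using the supplied noise instead of freshly sampled noise.
    x_t0 = a.sqrt() * frames + (1 - a).sqrt() * warped_noise
    # Denoise from the intermediate step back to clean, edited frames.
    return denoise_from(x_t0, t0)
```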
Here is a visual comparison between our noise prior and the one obtained from DDIM inversion. While DDIM inversion produces temporally correlated noise, its distribution heavily depends on how far the input video is from the training distribution of the diffusion model. In contrast, our warping method retains the Gaussian properties of the noise.
[Noise visualization: DDIM inversion noise vs. \(\smallint\)-noise (Ours)]
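One simple way to quantify this difference (our suggestion, not an evaluation from the paper) is to measure how far a noise map is from i.i.d. standard Gaussian, e.g. with a Kolmogorov-Smirnov test and the lag-1 spatial autocorrelation:

```python
# Sketch: basic Gaussianity diagnostics for a 2D noise map.
import numpy as np
from scipy import stats

def noise_stats(noise):
    flat = noise.ravel()
    ks = stats.kstest(flat, "norm")  # distance from N(0, 1)
    lag1 = np.corrcoef(noise[:, :-1].ravel(), noise[:, 1:].ravel())[0, 1]
    return {"mean": flat.mean(), "std": flat.std(),
            "ks_stat": ks.statistic, "lag1_corr": lag1}

print(noise_stats(np.random.default_rng(0).standard_normal((64, 64))))
```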
[1] Ge, Songwei, et al. "Preserve your own correlation: A noise prior for video diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Chen, Weifeng, et al. "Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models." arXiv preprint arXiv:2305.13840 (2023).