In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile, and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method reaches state-of-the-art results on various image-to-image tasks using only a single inference step. Beyond its efficiency, we demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework for LBM and demonstrate its effectiveness on controllable image relighting and shadow generation. We provide an open-source implementation of the method at https://github.com/gojasper/LBM.
Image-to-image translation can be framed as a transport problem: the goal is to transport the distribution of source images (e.g., composite images) to the distribution of target images (e.g., relit images). In the proposed Latent Bridge Matching (LBM) method, given paired images, we encode the source and target images into a latent space and build a stochastic path, called a Brownian bridge, between them. In particular, the stochasticity of these paths distinguishes the method from flow matching and allows it to reach a wider diversity of samples.
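As a minimal sketch of the Brownian bridge construction above, the point at time t between two latents has mean given by linear interpolation and variance that vanishes at both endpoints. The function name, the `sigma` parameter, and the NumPy setting are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sample_bridge(z0, z1, t, sigma=1.0, rng=None):
    """Sample a point on the Brownian bridge between latents z0 and z1 at
    time t in [0, 1]: mean (1 - t) * z0 + t * z1, std sigma * sqrt(t * (1 - t))."""
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * z0 + t * z1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(z0.shape)
```

Note that at t = 0 and t = 1 the noise term vanishes, so the bridge is pinned exactly to the source and target latents, which is what makes it a bridge rather than a free diffusion.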
The training procedure, detailed in the figure above, is as follows. First, we draw a pair of images and encode them into the latent space using a pre-trained VAE, giving the corresponding latents. We then build a Brownian bridge between these two latents. A timestep is drawn from a well-chosen distribution, and the latent on the trajectory at that timestep is computed. This sample is passed to the denoiser, which predicts the drift of the associated Stochastic Differential Equation (SDE).
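The training step above can be sketched as a regression of the denoiser output onto the drift of the bridge SDE, which for a Brownian bridge pinned at z1 is (z1 - z_t) / (1 - t). This is a hedged NumPy sketch under that assumption; the function names, the uniform timestep distribution, and the plain MSE objective are illustrative placeholders, not the paper's exact choices:

```python
import numpy as np

def bridge_matching_loss(denoiser, z0, z1, sigma=1.0, rng=None):
    """One LBM-style training step (sketch): sample a timestep, sample the
    bridge point z_t, and regress the denoiser onto the bridge drift."""
    rng = np.random.default_rng() if rng is None else rng
    t = rng.uniform(0.0, 0.999)  # timestep from a chosen distribution (uniform here)
    noise = rng.standard_normal(z0.shape)
    z_t = (1.0 - t) * z0 + t * z1 + sigma * np.sqrt(t * (1.0 - t)) * noise
    target = (z1 - z_t) / (1.0 - t)  # drift of the Brownian bridge SDE toward z1
    pred = denoiser(z_t, t)
    return np.mean((pred - target) ** 2)
```

At inference, a single step with the predicted drift suffices to transport a source latent toward the target distribution, which is what enables the 1-NFE sampling reported above.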
During training, we also introduce a pixel loss that consists of decoding the estimated target latent and comparing it to the ground-truth target image. We found that LPIPS works well in practice and speeds up the shift toward the target domain. To scale with the image size, we adopt a random-cropping strategy and compute the loss only on a patch when the image size exceeds a certain threshold. This bounds the memory footprint of the pixel loss so that it does not hinder training efficiency.
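The random-cropping strategy for the pixel loss can be sketched as below. The function name and the default threshold are assumptions, and a plain MSE stands in for LPIPS (the actual perceptual loss would come from a pretrained network, e.g. the `lpips` package):

```python
import numpy as np

def patch_pixel_loss(decoded, target, max_size=256, loss_fn=None, rng=None):
    """Pixel-space loss with random cropping (sketch): if the images are
    larger than `max_size`, compare only a random patch to bound memory.
    `loss_fn` stands in for a perceptual loss such as LPIPS (MSE by default)."""
    rng = np.random.default_rng() if rng is None else rng
    loss_fn = loss_fn or (lambda a, b: np.mean((a - b) ** 2))
    h, w = decoded.shape[-2:]
    if max(h, w) > max_size:
        top = rng.integers(0, max(h - max_size, 0) + 1)
        left = rng.integers(0, max(w - max_size, 0) + 1)
        decoded = decoded[..., top:top + max_size, left:left + max_size]
        target = target[..., top:top + max_size, left:left + max_size]
    return loss_fn(decoded, target)
```

Because the crop is re-drawn at every step, the loss still covers the whole image in expectation while keeping the per-step decode-and-compare cost constant as resolution grows.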
[Figure: qualitative comparisons of original images and generated images (1 NFE).]
@article{chadebec2025lbm,
  title={LBM: Latent Bridge Matching for Fast Image-to-Image Translation},
  author={Clément Chadebec and Onur Tasar and Sanjeev Sreetharan and Benjamin Aubin},
  year={2025},
  journal={arXiv preprint arXiv:2503.07535},
}