Stylization and style transfer are fundamental tasks in image editing, particularly in professional illustration creation: they transform the visual style of an image while preserving its core content. Text-to-image (T2I) generative models have proven successful at creating visually stunning images from textual descriptions, and recent advances in diffusion models have opened the door to personalized styling in image generation.
This project aims to provide a unified codebase to evaluate training-free stylization models. Various models from the literature were evaluated on Style-Rank, an evaluation dataset of images that we compiled from the most popular stylization papers. On top of evaluating the different models, we also propose Inversion-InstantStyle, a small improvement over InstantStyle that computes a starting latent with DDIM Inversion and adds noise to it. See the Inversion-InstantStyle demo and the method's technical diagram below.
This project makes it possible to benchmark several training-free stylization methods on the aggregated Style-Rank dataset and to compute the corresponding quantitative metrics.
We provide Style-Rank, an evaluation dataset of images that we compiled using reference images from the most popular stylization papers. Our codebase can also be used with your own dataset to evaluate the models on more specific use cases, such as enterprise applications. Note that the corresponding original licenses still apply to each image in this dataset. The evaluated methods, together with the implementations we rely on, are listed below:
Model | Arxiv | Code | Project Page | Implementation |
---|---|---|---|---|
StyleAligned | Arxiv | Code | Project Page | Official |
VisualStyle | Arxiv | Code | Project Page | Official |
IP-Adapter | Arxiv | Code | Project Page | Diffusers |
InstantStyle | Arxiv | Code | Project Page | Diffusers |
CSGO | Arxiv | Code | Project Page | Official |
Style-Shot | Arxiv | Code | Project Page | Official |
On top of the above-mentioned open-source methods, we also provide a new method, Inversion-InstantStyle, that simply combines DDIM Inversion, renoising and InstantStyle. In more detail, the reference style image is first inverted with DDIM Inversion to obtain a starting latent, noise is then added to this latent, and generation is finally run with InstantStyle conditioning, as sketched below.
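As a rough illustration, the sketch below shows how such a pipeline can be assembled on top of diffusers. The SDXL checkpoint, IP-Adapter weights, prompt and noise level are illustrative assumptions, and `ddim_invert` is a hypothetical helper standing in for the DDIM Inversion loop; this is not the exact implementation shipped in this repository.

```python
# Minimal sketch of the Inversion-InstantStyle idea, assuming SDXL + IP-Adapter
# (InstantStyle per-block scales) from diffusers and a hypothetical `ddim_invert` helper.
import torch
from diffusers import DDIMScheduler, StableDiffusionXLPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# InstantStyle: apply the IP-Adapter only to the style-sensitive attention blocks.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale({"up": {"block_0": [0.0, 1.0, 0.0]}})

style_image = load_image("reference_style.png")  # placeholder path

# 1) DDIM Inversion: map the style image back to a starting latent.
#    `ddim_invert` is hypothetical; the inversion loop is not a diffusers one-liner.
inverted_latents = ddim_invert(pipe, style_image, num_steps=50)

# 2) Renoising: add fresh Gaussian noise to the inverted latent so the prompt can impose
#    new content (the noise level chosen here is an assumption, not a fixed recipe).
pipe.scheduler.set_timesteps(50)
noisy_latents = pipe.scheduler.add_noise(
    inverted_latents, torch.randn_like(inverted_latents), pipe.scheduler.timesteps[:1]
)

# 3) Denoise from the renoised latent with InstantStyle conditioning on the style image.
image = pipe(
    prompt="a robot reading a book",  # example prompt only
    ip_adapter_image=style_image,
    latents=noisy_latents,
    num_inference_steps=50,
).images[0]
image.save("stylized.png")
```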
The following metrics are computed to assess the quality of the models:

- **ImageReward** - reward score between the prompt and the generated image, computed with the ImageReward model
- **Clip-Text** - similarity between the prompt (embedded using `ClipTextModel`) and the generated image (embedded using `ClipVisionModel`) - using the implementation from Transformers
- **Clip-Image** - similarity between the reference image and the generated image (both embedded using `ClipVisionModel`) - using the implementation from Transformers
- **Dinov2** - similarity between the reference image and the generated image (both embedded using `Dinov2Model`) - using the implementation from Dino
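For reference, the snippet below sketches how the cosine-similarity metrics can be computed with the Transformers library. The checkpoints, file names and the 0-100 scaling are assumptions and may differ from the exact settings used in the benchmark; the ImageReward score comes from a dedicated learned reward model and is not reproduced here.

```python
# Sketch of the Clip-Text, Clip-Image and Dinov2 metrics with Transformers.
# Checkpoints, file names and the 0-100 scaling are assumptions, not the benchmark's settings.
import torch
from PIL import Image
from transformers import AutoImageProcessor, CLIPModel, CLIPProcessor, Dinov2Model

prompt = "a robot reading a book"                              # example evaluation prompt
generated = Image.open("generated.png").convert("RGB")         # image produced by a model
reference = Image.open("reference_style.png").convert("RGB")   # style reference image

# CLIP embeddings for the prompt and both images.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
with torch.no_grad():
    text_emb = clip.get_text_features(**clip_proc(text=[prompt], return_tensors="pt", padding=True))
    gen_emb = clip.get_image_features(**clip_proc(images=generated, return_tensors="pt"))
    ref_emb = clip.get_image_features(**clip_proc(images=reference, return_tensors="pt"))

# Clip-Text: prompt vs. generated image; Clip-Image: reference vs. generated image.
clip_text = 100 * torch.cosine_similarity(text_emb, gen_emb).item()
clip_image = 100 * torch.cosine_similarity(ref_emb, gen_emb).item()

# Dinov2: cosine similarity between the CLS embeddings of the reference and generated images.
dino = Dinov2Model.from_pretrained("facebook/dinov2-base")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
with torch.no_grad():
    gen_cls = dino(**dino_proc(images=generated, return_tensors="pt")).last_hidden_state[:, 0]
    ref_cls = dino(**dino_proc(images=reference, return_tensors="pt")).last_hidden_state[:, 0]
dinov2 = 100 * torch.cosine_similarity(ref_cls, gen_cls).item()

print(f"Clip-Text: {clip_text:.2f} | Clip-Image: {clip_image:.2f} | Dinov2: {dinov2:.2f}")
```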
Following the instructions provided in our benchmark codebase, we compiled the results of the different methods listed above (including Inversion-InstantStyle) on the provided dataset.
Model | ImageReward ↑ | Clip-Text ↑ | Clip-Image ↑ | Dinov2 ↑ |
---|---|---|---|---|
StyleAligned | -1.26 | 19.26 | 68.72 | 36.29 |
VisualStyle | -0.72 | 22.12 | 66.68 | 20.80 |
IP-Adapter | -2.03 | 15.01 | 83.66 | 40.50 |
Style-Shot | -0.38 | 21.34 | 65.04 | 23.04 |
CSGO | -0.29 | 22.16 | 61.73 | 16.85 |
InstantStyle | -0.13 | 22.78 | 66.43 | 18.48 |
Inversion-InstantStyle | -1.30 | 18.90 | 76.60 | 49.42 |
The results are in line with those reported in the respective stylization papers. Note that the metrics can fluctuate depending on the prompts and seeds chosen for evaluation.
@misc{benaroche2024stylerank, title={Style-Rank: Benchmarking stylization for diffusion models}, author={Eyal Benaroche and Clement Chadebec and Onur Tasar and Benjamin Aubin}, year={2024} }