Starting from heterogeneous open sources totaling 2.9B raw image-text pairs, we apply successive stages of pre-filtering, safety filtering, deduplication, and domain-based filtering, followed by multi-VLM re-captioning and synthetic-data augmentation, to obtain a final dataset of 104.9M high-quality and safe image-text pairs. The full curation pipeline is illustrated in Fig. 1 and the final distribution of data sources in MONET is shown in Fig. 2.
Fig 1: Curation pipeline of MONET. Each stage removes images that fail the corresponding quality, safety, or source-governance checks, while the surviving pool flows to the next stage.
We now describe the details of each stage in the curation pipeline:
Deduplication. Deduplication is crucial to ensure diversity and prevent memorization and overfitting. We use a two-stage strategy combining exact and near-duplicate detection.
URL and perceptual hashing. We start by removing exact URL duplicates, then apply DCT-based perceptual hashing (pHash) [Venkatesan et al., 2000] to detect near-exact copies that differ only in compression or scaling. These steps are applied first to each source individually and then to the merged safe pool.
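As a rough illustration of this stage, the sketch below removes exact URL duplicates and then drops images whose DCT-based hash matches an already kept image. It is a minimal, dependency-free sketch and not the pipeline's actual code: it assumes each image has already been decoded, converted to grayscale, and resized to 32×32, and the names `phash`, `hamming`, `dedup`, and the `max_distance` parameter are hypothetical.

```python
import math

def dct_1d(values):
    """Unnormalised DCT-II of a 1-D sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def dct_2d(matrix):
    """2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def phash(gray, hash_size=8):
    """64-bit hash from the low-frequency DCT block of a square grayscale image.

    `gray` is a list of rows of pixel intensities (assumed pre-resized, e.g. 32x32).
    """
    coeffs = dct_2d(gray)
    low = [coeffs[r][c] for r in range(hash_size) for c in range(hash_size)]
    # Threshold at the median of the low frequencies, ignoring the DC term.
    rest = sorted(low[1:])
    median = rest[len(rest) // 2]
    bits = 0
    for value in low:
        bits = (bits << 1) | (1 if value > median else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def dedup(records, max_distance=0):
    """Keep the first record per URL, then the first record per perceptual hash.

    `records` is an iterable of (url, gray) pairs. The linear scan over kept
    hashes is quadratic overall; it is only meant for illustration.
    """
    seen_urls = set()
    kept = []  # list of (url, hash)
    for url, gray in records:
        if url in seen_urls:
            continue
        seen_urls.add(url)
        h = phash(gray)
        if any(hamming(h, kept_hash) <= max_distance for _, kept_hash in kept):
            continue
        kept.append((url, h))
    return [url for url, _ in kept]
```

Because the hash only compares low-frequency coefficients against their median, a uniformly rescaled copy of an image maps to the same hash, while `max_distance` can be raised to tolerate a few flipped bits.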
Near-duplicate detection. We compute 512-d Self-Supervised Copy Detection (SSCD) [Pizzi et al., 2022] embeddings with the public sscd_disc_mixup model and retrieve the k=64 nearest neighbors of each image using a FAISS index [Douze et al., 2024]. Pairs whose cosine similarity exceeds 0.75 (the operating point recommended by the SSCD authors for 90% precision on DISC) are collapsed, keeping the representative with the highest resolution and aesthetic score.
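The collapsing step can be sketched as follows. This is a toy version under stated assumptions rather than the actual implementation: it compares all pairs by brute force instead of querying a FAISS index for k=64 neighbors, it treats "collapsed" as merging connected components with union-find, and it breaks ties by resolution first and aesthetic score second. The function name and the dictionary keys (`id`, `emb`, `pixels`, `aesthetic`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def collapse_near_duplicates(items, threshold=0.75):
    """Return the id of one representative per near-duplicate group.

    `items` is a list of dicts with keys 'id', 'emb', 'pixels', 'aesthetic'.
    """
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pairwise comparison (assumption: stands in for the k-NN search).
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if cosine(items[i]["emb"], items[j]["emb"]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)

    # Assumed ordering: highest pixel count first, aesthetic score as tie-break.
    return [
        max(group, key=lambda it: (it["pixels"], it["aesthetic"]))["id"]
        for group in groups.values()
    ]
```

For example, two images with nearly parallel embeddings end up in one group and only the higher-resolution one is kept, while an image with an orthogonal embedding forms its own group.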
Fig 2: Distribution of data sources in the MONET dataset.
We report summary statistics of the resulting dataset: caption length, aesthetic score, aspect ratio, and pixel count in megapixels (MP):
[Figure panels: caption length distribution, aesthetic score distribution, aspect ratio distribution, pixel area distribution]
Fig 3: Distribution of caption length, aesthetic score, aspect ratio, and pixel area across the MONET dataset.
We also study the content and image style distributions of MONET.
For content distribution, we leverage CLIP embeddings for zero-shot classification. We define ∼2.7k classes and encode them with the prompt "a photo of a {class}", where {class} denotes the class name. Image-class similarities are computed via cosine similarity between image and text embeddings, and the top-5 classes are retained.
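The zero-shot procedure can be sketched as below. This is an illustrative sketch only: it assumes the image and prompt embeddings have already been produced by a CLIP model (no model is loaded here), and `build_prompts` and `top_k_classes` are hypothetical helper names.

```python
import math

def build_prompts(class_names):
    """Map each class name to the text prompt that would be encoded."""
    return {name: f"a photo of a {name}" for name in class_names}

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_classes(image_emb, class_embs, k=5):
    """Rank classes by cosine similarity to the image embedding.

    `class_embs` maps class name -> text embedding of its prompt
    (assumed precomputed). Returns the k best (name, score) pairs.
    """
    img = normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(img, normalize(emb)))
        for name, emb in class_embs.items()
    }
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy 2-d embeddings standing in for real CLIP outputs.
toy_classes = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
best = top_k_classes([0.9, 0.1], toy_classes, k=2)
print([name for name, _ in best])
```

With real embeddings, `k=5` and the ∼2.7k class prompts would replace the toy values used here.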
For image style, we use Qwen3-VL-8B-Instruct [Yang et al., 2025][Bai et al., 2023] to classify a random subset of 1.5M images (limited for cost reasons) into 15 style classes.
[Figure panels: CLIP top-5 categories, image style]
MONET is shipped with a retrieval interface that lets you query the dataset by text or image. Given a natural-language prompt or a reference image, the system returns the nearest neighbors in embedding space together with their metadata (aesthetic scores, captions, source, resolution, ...). This makes it straightforward to audit the training distribution, curate domain-specific subsets, or identify near-duplicate images before fine-tuning.
To visualise the structure of the MONET dataset, we project the DINOv2 embeddings of a random sample into two dimensions with UMAP and PCA. The interactive view lets you pan, zoom, and click through samples to inspect at a glance how content, style, and source are distributed across the corpus. It also reveals a rich cluster structure spanning photographic styles, artistic media, object categories, and scene types, which suggests that the dataset covers a broad range of internet imagery while remaining non-redundant after our deduplication pipeline.
We validate the effectiveness of MONET by training a 4B-parameter text-to-image model completely from scratch on it. We rely on the latent diffusion framework [Rombach et al., 2022] with a denoiser inspired by MMDiT [Esser et al., 2024] and using a deep-compression VAE (DCVAE) [Chen et al., 2025]; text conditioning is injected using Qwen3-4B. Some samples from our model are shown in Fig. 3, and quantitative results on the GenEval [Ghosh et al., 2023] and DPG [Hu et al., 2024] benchmarks are reported in Table 1.
Fig. 3: Samples generated by our 4B model trained exclusively on MONET.
| Model | Num. Params (B) | GenEval: Single obj. | GenEval: Two obj. | GenEval: Counting | GenEval: Colors | GenEval: Position | GenEval: Color attr. | GenEval: Overall ↑ | DPG: Global | DPG: Entity | DPG: Attribute | DPG: Relation | DPG: Other | DPG: Overall ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SD1.5 | 0.9 | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 | 0.43 | 74.63 | 74.23 | 75.39 | 73.49 | 67.81 | 63.18 |
| PixArt-α | 0.6 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 | 71.11 |
| Emu3-Gen | 8.0 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.60 |
| SDXL | 2.6 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| SD3 Medium | 2.0 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 | 0.62 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| FLUX.1 [Dev] | 12.0 | 0.98 | 0.81 | 0.74 | 0.79 | 0.22 | 0.45 | 0.66 | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 | 83.84 |
| DALL-E 3 | -- | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| SANA-1.5 | 4.8 | 0.99 | 0.85 | 0.77 | 0.87 | 0.34 | 0.54 | 0.72 | -- | -- | -- | -- | -- | 85.00 |
| Lumina-Image 2.0 | 2.6 | -- | 0.87 | 0.67 | -- | -- | 0.62 | 0.73 | -- | 91.97 | 90.20 | 94.85 | -- | 87.20 |
| Janus-Pro-7B | 7.0 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| HiDream-I1-Full | 17.0 | 1.00 | 0.98 | 0.79 | 0.91 | 0.60 | 0.72 | 0.83 | 76.44 | 90.22 | 89.48 | 93.74 | 91.83 | 85.89 |
| Z-Image | 6.0 | 1.00 | 0.94 | 0.78 | 0.93 | 0.62 | 0.77 | 0.84 | 93.39 | 91.22 | 93.16 | 92.22 | 91.52 | 88.14 |
| Qwen-Image | 20.0 | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 | 0.87 | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| Ours | 4.1 | 1.00 | 0.90 | 0.73 | 0.88 | 0.35 | 0.62 | 0.74 | 84.80 | 91.76 | 89.70 | 94.16 | 79.60 | 85.56 |
Table 1: Results on the GenEval and DPG benchmarks. Our 4B model trained on the MONET dataset achieves competitive performance against models of similar size trained on closed-source data.
To lower the barrier to reproducible text-to-image research, we also release NANO-T2I, a minimal and hackable codebase to train a T2I flow-matching model end-to-end on MONET. The reference recipe is a 1.3B DiT-style model with a Qwen3-4B text encoder and a latent VAE backbone, trained from scratch on a single H200 GPU for under $300 in two sequential phases (512×512 then 1024×1024). Every architectural and optimization choice lives in a single YAML config, making it easy to swap components or scale the recipe to your own compute budget. The repository also ships with a Gradio demo so you can interactively sample from your trained checkpoints.
@article{aubin2026monet,
title = {MONET: A Massive, Open, Non-redundant and Enriched Text-to-image Dataset},
author = {Aubin, Benjamin and Quintana, Gonzalo I{\~n}aki and Tasar, Onur and Sreetharan, Sanjeev and Czerwinska, Urszula and Henry, Damien and Chadebec, Cl{\'e}ment},
journal= {arXiv preprint arXiv:2605.21272},
year = {2026},
note = {Jasper Research}
}