Abstract

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of ∼104.9M image-text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

Dataset construction

Starting from heterogeneous open sources totaling 2.9B raw image-text pairs, we apply successive stages of pre-filtering, safety filtering, deduplication, and domain-based filtering, followed by multi-VLM re-captioning and synthetic-data augmentation, to obtain a final dataset of 104.9M high-quality and safe image-text pairs. The full curation pipeline is illustrated in Fig. 1 and the final distribution of data sources in MONET is shown in Fig. 2.

Fig 1: Curation pipeline of MONET. Each stage removes images that fail the corresponding quality, safety, or source-governance checks, while the surviving pool flows to the next stage.

We now describe the details of each stage in the curation pipeline:

  • Data sourcing. MONET is built from the following existing open-source datasets that have a permissive license: LAION [Schuhmann et al., 2021; 2022], COYO [Byeon et al., 2022], CommonCatalog CC-BY [Gokaslan et al., 2021], Megalith10M [Bohan et al., 2024], CC12M [Changpinyo et al., 2021], and Diffusion-Aesthetic-4K [Zhang et al., 2025].
  • Pre-filtering. For the two largest sources, LAION and COYO, we first filter out images with a resolution below 512² pixels and an aesthetic score lower than 5.0, concentrating computational resources on images that meet our baseline quality requirements.
  • Safety filtering. Since the data come mainly from the Web, we apply strict safety filters to the merged pool. First, we restrict LAION-2B-en samples to those also present in the vetted Re-LAION-2B-en-safe release [LAION, 2024], which removes 1.29M images flagged during the Re-LAION safety revision. Second, we apply an ensemble of open-source Not-Safe-For-Work (NSFW) detectors (Falcon [Falcon AI, 2024] and Bumble [Bumble-Tech, 2024]) together with an internal classifier, under a conservative union rule: an image is removed if any classifier flags it. Finally, we audit the result using DINOv2 [Oquab et al., 2023] embeddings by manually inspecting the 100 nearest neighbors of a small seed set of NSFW images; no additional harmful content is detected.
  • Deduplication. Deduplication is crucial to ensure diversity and prevent memorization and overfitting. We use a two-stage strategy combining exact and near-duplicate detection.

    URL and perceptual hashing. We start by removing exact URL duplicates, then apply DCT-based perceptual hashing (pHash) [Venkatesan et al., 2000] to detect near-exact copies that differ only in compression or scaling. These steps are applied first to each source individually and then to the merged safe pool.

    Near-duplicate detection. We compute 512-d Self-Supervised Copy Detection (SSCD) [Pizzi et al., 2022] embeddings with the public sscd_disc_mixup [Pizzi et al., 2022] model and retrieve the k=64 nearest neighbors per image using a FAISS index [Douze et al., 2024]. Pairs whose cosine similarity exceeds 0.75 (operating point recommended by the SSCD authors at 90% precision on DISC [Pizzi et al., 2022]) are collapsed, keeping the representative with the highest resolution and aesthetic score.

  • Source governance. A final round of exclusion-based filters enforces resolution, source, and watermark standards. We first remove images with resolution below 512² pixels and images originating from a blocklist of domains that includes known stock-photo providers such as dreamstime, shutterstock, freepik, getty, and unsplash. Finally, we discard images flagged with high watermark probability by an internal detector. These exclusion controls are not a representation of legal clearance; they are source-governance signals that reduce the prevalence of images from known restrictive providers.
  • Multi-VLM re-captioning & embedding computation. Caption quality and diversity are both crucial for T2I models. We re-caption MONET with four previously benchmarked vision-language models (VLMs): Florence2-Large [Xiao et al., 2016], InternVL3-8B [Zhu et al., 2025], ShareGPT4V-7B [Chen et al., 2024], and Gemini-2.5-flash-lite [Comanici et al., 2025]. Florence2-Large produces short, concept-level captions that closely match typical user prompts, while the three remaining models yield long, fine-grained descriptions. To accelerate downstream use, each MONET image is shipped with pre-computed embeddings (DINOv2-vitg14, CLIP-vit-base-patch32 [Radford et al., 2021], and SSCD), structured annotations (YOLO-v9e object detection [Redmon et al., 2023; Jocher et al., 2023], YOLO-v9e image classification [Jocher et al., 2023], and MediaPipe [Lugaresi et al., 2019] face detection), and pre-computed SANA VAE [Xie et al., 2025] latents, avoiding repeated raw-pixel processing.
  • Synthetic-data augmentation. We complement real data with synthetic images generated by FLUX.1-schnell [Black Forest Labs., 2024], FLUX.2-klein-4B [Black Forest Labs., 2025], and Z-Image [Z-Image Team et al., 2025], chosen as top-performing T2I models released under the permissive Apache 2.0 license, which allows redistribution and use of their outputs for training. Prompts are drawn from the re-captioning outputs and from an open-source prompt collection [k-mktr, 2026], then upsampled with Qwen3-4B [Yang et al., 2025] under a system prompt that removes unsafe content. The generated images are then filtered with the same NSFW and watermark filters as the real data.
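
The near-duplicate collapsing rule from the deduplication stage can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline's actual code: it replaces the FAISS k-nearest-neighbor search with a brute-force pairwise comparison, groups matches transitively with union-find (one possible reading of "collapsed"), and breaks ties by resolution first and aesthetic score second. The names `Image` and `collapse_near_duplicates` and the toy 2-d vectors are hypothetical; only the 0.75 threshold and the resolution/aesthetic preference come from the text.

```python
import math
from dataclasses import dataclass

SIM_THRESHOLD = 0.75  # cosine-similarity operating point quoted in the text


@dataclass
class Image:
    uid: str
    embedding: list  # stand-in for a copy-detection descriptor (toy 2-d here)
    pixels: int      # width * height
    aesthetic: float


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def collapse_near_duplicates(images, threshold=SIM_THRESHOLD):
    """Keep one representative per group of mutually linked near-duplicates."""
    parent = list(range(len(images)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(n^2) comparison; a real pipeline would query a k-NN index.
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            if cosine(images[i].embedding, images[j].embedding) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, img in enumerate(images):
        groups.setdefault(find(i), []).append(img)

    # Assumed tie-break: largest pixel count, then highest aesthetic score.
    return [max(g, key=lambda im: (im.pixels, im.aesthetic)) for g in groups.values()]


pool = [
    Image("a", [1.0, 0.0], 512 * 512, 5.5),
    Image("b", [0.9, 0.1], 1024 * 1024, 5.2),  # near-duplicate of "a"
    Image("c", [0.0, 1.0], 768 * 768, 6.0),    # unrelated image
]
kept = collapse_near_duplicates(pool)
kept_ids = sorted(im.uid for im in kept)
```

In this toy pool, "a" and "b" exceed the threshold and are merged, and "b" is kept because it has the larger pixel count; "c" survives untouched.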

Fig 2: Distribution of data sources in the MONET dataset.

Dataset analysis

We report four statistics of the resulting dataset: caption length, aesthetic score, aspect ratio, and pixel count in megapixels (MP):

[Figure panels: caption length distribution, aesthetic score distribution, aspect ratio distribution, pixel area distribution]

Fig 3: Distribution of caption length, aesthetic score, aspect ratio, and pixel area across the MONET dataset.

We also study the content and image style distributions of MONET.

For content distribution, we leverage CLIP embeddings for zero-shot classification. We define ∼2.7k classes and encode them with the prompt "a photo of a {class}", where {class} denotes the class name. Image-class similarities are computed via cosine similarity between image and text embeddings, and the top-5 classes are retained for each image.
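
The zero-shot labelling step described above can be sketched in a few lines. This is a minimal illustration, not the code used to produce the figures: the 3-d vectors stand in for real CLIP image and text embeddings, and the helper names `normalize` and `top_k_classes` as well as the three example classes are hypothetical. Only the prompt template, the cosine-similarity scoring, and the top-k retention come from the text.

```python
import math

PROMPT_TEMPLATE = "a photo of a {}"  # prompt format quoted in the text


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def top_k_classes(image_emb, class_embs, k=5):
    """Rank class names by cosine similarity to the image embedding."""
    img = normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(img, normalize(emb)))
        for name, emb in class_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


class_names = ["dog", "cat", "car"]
prompts = [PROMPT_TEMPLATE.format(name) for name in class_names]

# Placeholder vectors: in practice these would come from a CLIP text encoder
# applied to `prompts` and a CLIP image encoder applied to the image.
class_embs = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image_emb = [0.9, 0.2, 0.1]

top2 = top_k_classes(image_emb, class_embs, k=2)
```

With ∼2.7k classes the same ranking is usually computed as a single matrix product between normalized image and text embedding matrices, followed by a top-k selection per row.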

For image style, we use Qwen3-VL-8B-Instruct [Yang et al., 2025; Bai et al., 2023] to classify a random subset of 1.5M images (limited for cost reasons) into 15 style classes.

CLIP top-5 categories

Image style

Visualising MONET

MONET Retrieval

MONET is shipped with a retrieval interface that lets you query the dataset by text or image. Given a natural-language prompt or a reference image, the system returns the nearest neighbors in embedding space together with their metadata (aesthetic scores, captions, source, resolution, ...). This makes it straightforward to audit the training distribution, curate domain-specific subsets, or identify near-duplicate images before fine-tuning.

MONET UMAP

To visualise the structure of the MONET dataset, we project the DINOv2 embeddings of a random sample into two dimensions with UMAP in combination with a PCA projection. The interactive view lets you pan, zoom, and click through samples to inspect at a glance how content, style, and source are distributed across the corpus. It also reveals a rich cluster structure spanning photographic styles, artistic media, object categories, and scene types, which suggests that the dataset covers a broad range of internet imagery while remaining non-redundant after our deduplication pipeline.

Downstream validation

Training a 4B T2I backbone

We validate the effectiveness of MONET by training a 4B-parameter text-to-image model completely from scratch on it. We rely on the latent diffusion framework [Rombach et al., 2022] with a denoiser inspired by MMDiT [Esser et al., 2024] and using a deep-compression VAE (DCVAE) [Chen et al., 2025]; text conditioning is injected using Qwen3-4B. Some samples from our model are shown in Fig. 3, and quantitative results on the GenEval [Ghosh et al., 2023] and DPG [Hu et al., 2024] benchmarks are reported in Table 1.

Fig. 3: Samples generated by our 4B model trained exclusively on MONET.

                              GenEval                                                                    DPG
Model             Params (B)  Single obj.  Two obj.  Counting  Colors  Position  Color attr.  Overall ↑  Global  Entity  Attribute  Relation  Other  Overall ↑
SD1.5             0.9         0.97         0.38      0.35      0.76    0.04      0.06         0.43       74.63   74.23   75.39      73.49     67.81  63.18
PixArt-α          0.6         0.98         0.50      0.44      0.80    0.08      0.07         0.48       74.97   79.32   78.60      82.57     76.96  71.11
Emu3-Gen          8.0         0.98         0.71      0.34      0.81    0.17      0.21         0.54       85.21   86.68   86.84      90.22     83.15  80.60
SDXL              2.6         0.98         0.74      0.39      0.85    0.15      0.23         0.55       83.27   82.43   80.91      86.76     80.41  74.65
SD3 Medium        2.0         0.98         0.74      0.63      0.67    0.34      0.36         0.62       87.90   91.01   88.83      80.70     88.68  84.08
FLUX.1 [Dev]      12.0        0.98         0.81      0.74      0.79    0.22      0.45         0.66       74.35   90.00   88.96      90.87     88.33  83.84
DALL-E 3          --          0.96         0.87      0.47      0.83    0.43      0.45         0.67       90.97   89.61   88.39      90.58     89.83  83.50
SANA-1.5          4.8         0.99         0.85      0.77      0.87    0.34      0.54         0.72       --      --      --         --        --     85.00
Lumina-Image 2.0  2.6         --           0.87      0.67      --      --        0.62         0.73       --      91.97   90.20      94.85     --     87.20
Janus-Pro-7B      7.0         0.99         0.89      0.59      0.90    0.79      0.66         0.80       86.90   88.90   89.40      89.32     89.48  84.19
HiDream-I1-Full   17.0        1.00         0.98      0.79      0.91    0.60      0.72         0.83       76.44   90.22   89.48      93.74     91.83  85.89
Z-Image           6.0         1.00         0.94      0.78      0.93    0.62      0.77         0.84       93.39   91.22   93.16      92.22     91.52  88.14
Qwen-Image        20.0        0.99         0.92      0.89      0.88    0.76      0.77         0.87       91.32   91.56   92.02      94.31     92.73  88.32
Ours              4.1         1.00         0.90      0.73      0.88    0.35      0.62         0.74       84.80   91.76   89.70      94.16     79.60  85.56

Table 1: Results on the GenEval and DPG benchmarks. Our 4B model trained on the MONET dataset achieves competitive performance against models of similar size trained on closed-source data.

Training your own T2I

NANO-T2I

To lower the barrier to reproducible text-to-image research, we also release NANO-T2I, a minimal and hackable codebase to train a T2I flow-matching model end-to-end on MONET. The reference recipe is a 1.3B DiT-style model with a Qwen3-4B text encoder and a latent VAE backbone, trained from scratch on a single H200 GPU for under $300 in two sequential phases (512×512 then 1024×1024). Every architectural and optimization choice lives in a single YAML config, making it easy to swap components or scale the recipe to your own compute budget. The repository also ships with a Gradio demo so you can interactively sample from your trained checkpoints.
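
For readers new to flow matching, the training target can be sketched as follows. This is a generic linear-path (rectified-flow style) formulation chosen for illustration; the text does not state which flow-matching variant, time convention, or loss weighting NANO-T2I uses, so every name here (`flow_matching_pair`, `mse`) and the toy 3-d "latent" are assumptions rather than the codebase's API.

```python
import random


def flow_matching_pair(data, noise, t):
    """Build the noisy input x_t and the velocity target for a time t in [0, 1].

    Assumed convention: t = 0 is pure noise, t = 1 is the clean latent,
    and the path between them is a straight line.
    """
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    target = [d - n for d, n in zip(data, noise)]  # constant velocity along the path
    return x_t, target


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
latent = [0.5, -1.0, 2.0]                      # stand-in for a VAE latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]  # Gaussian noise sample
t = rng.random()                               # uniformly sampled time step

x_t, v_target = flow_matching_pair(latent, noise, t)

# A real model would predict the velocity from (x_t, t, text embedding);
# here two trivial "predictions" bracket the loss.
loss_perfect = mse(v_target, v_target)
loss_zero = mse([0.0] * len(latent), v_target)
```

In a training loop, the denoiser's prediction replaces the placeholder predictions and the loss is averaged over a batch of latents, noise samples, and time steps.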


BibTeX

@article{aubin2026monet,
  title   = {MONET: A Massive, Open, Non-redundant and Enriched Text-to-image Dataset},
  author  = {Aubin, Benjamin and Quintana, Gonzalo I{\~n}aki and Tasar, Onur and Sreetharan, Sanjeev and Czerwinska, Urszula and Henry, Damien and Chadebec, Cl{\'e}ment},
  journal = {arXiv preprint arXiv:2605.21272},
  year    = {2026},
  note    = {Jasper Research}
}