AI-GenBench:

A New Ongoing Benchmark for AI-Generated Image Detection

Lorenzo Pellegrini1, Davide Cozzolino2, Serafino Pandolfini1, Davide Maltoni1,
Matteo Ferrara1, Luisa Verdoliva2, Marco Prati3, Marco Ramilli3

1 MI@BioLab - Department of Computer Science and Engineering, University of Bologna, Cesena, Italy
2 GRIP - Department of Electrical Engineering and Information Technologies, University of Naples Federico II, Naples, Italy
3 IdentifAI, Italy

Presented at Verimedia workshop, IJCNN 2025



Abstract

The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present AI-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, AI-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. AI-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, AI-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors that keep pace with the rise of new synthetic generators.

Framework

Unlike traditional approaches that evaluate models on static datasets, AI-GenBench introduces a temporal evaluation framework for AI-generated image detection. In this setting, detection models are incrementally trained on synthetic images ordered by the historical release of their generative models. This setup tests how well detectors can generalize to new generation techniques, such as the transition from GANs to diffusion models.
The goal of AI-GenBench is to provide a benchmark for assessing the robustness of detection models across both past and future image generation methods. It includes training and evaluation datasets covering a wide range of image generators released between 2017 and 2024, spanning from early GANs to the latest diffusion-based models. The benchmark also offers a PyTorch Lightning–based framework for training and evaluating detection models, publicly released and maintained on GitHub.
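The sliding-window protocol described above can be sketched as follows. This is an illustrative, simplified sketch, not the official AI-GenBench code: the generator names, release years, and window size are placeholder assumptions, and the real benchmark covers many more generators released between 2017 and 2024.

```python
# Hypothetical subset of generators with their release years (illustrative only).
generators = [
    ("ProGAN", 2017),
    ("StyleGAN", 2018),
    ("StyleGAN2", 2019),
    ("Stable Diffusion", 2022),
    ("Midjourney v5", 2023),
    ("FLUX", 2024),
]

def sliding_windows(gens, window=2):
    """Yield (past, next) generator splits in historical order.

    At each step the detector is incrementally trained on all generators
    released so far (the "past period") and evaluated both on that period
    and on the unseen "next period" of newer generators.
    """
    ordered = sorted(gens, key=lambda g: g[1])
    for i in range(window, len(ordered), window):
        past = [name for name, _ in ordered[:i]]
        future = [name for name, _ in ordered[i:i + window]]
        yield past, future

for past, future in sliding_windows(generators):
    print(f"train on {past} -> evaluate on {future}")
```

The key property this captures is that evaluation always includes generators the detector has never seen, mimicking the real-world situation where new synthesis methods appear after a detector is deployed.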

Generators included in the framework

Leaderboard

This leaderboard reports evaluation results on the AI-GenBench benchmark. To submit a candidate algorithm for evaluation, please contact us! The only requirements are that both:

  • the method's codebase, and
  • a report or paper describing the method

are publicly available. Please note that you may freely use the dataset to train and evaluate your model without following the sliding-window benchmark protocol; however, only methods that follow the benchmark protocol will be included in the leaderboard. Below we report the Area Under the ROC Curve (AUROC) of the methods that have been evaluated on the benchmark so far.

Model Name        Author / Team                 Submission Date   # Parameters   AUROC (Past / Next / Whole Period)   References
ViT-L/14 DINOv2   Baseline from paper authors   Jul/2025          304M           99.1% / 94.2% / 97.9%
ViT-L/14 CLIP     Baseline from paper authors   Jul/2025          304M           98.1% / 92.0% / 97.0%
ResNet-50 CLIP    Baseline from paper authors   Jul/2025          38M            89.9% / 81.8% / 88.9%

BibTeX

If you use this benchmark and/or code in your research, please cite our paper:


@misc{pellegrini2025aigenbenchnewongoingbenchmark,
      title={AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection}, 
      author={Lorenzo Pellegrini and Davide Cozzolino and Serafino Pandolfini and Davide Maltoni and Matteo Ferrara and Luisa Verdoliva and Marco Prati and Marco Ramilli},
      year={2025},
      eprint={2504.20865},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.20865}, 
}