Grand Challenge on
Detecting Cheapfakes

Organized and sponsored by

University of Bergen
Technical University of Munich

Challenge Description

Cheapfake is a general term that encompasses many non-AI "cheap" manipulations of multimedia content, ranging from edits made with software such as Photoshop to the alteration of an image's context. Cheapfakes are in fact known to be more prevalent than deepfakes. In this challenge, we aim to benchmark algorithms that detect out-of-context (image, caption1, caption2) pairs in news items, based on a recently compiled dataset. Note that we do not consider digitally altered or "fake" images; rather, we focus on detecting the misuse of real photographs with conflicting image captions. Our current scope is limited to the English language, as the test dataset does not include captions in other languages.


  1. Binary detection
    The main task of the challenge is to detect whether an (image, caption1, caption2) pair is out-of-context or not-out-of-context, using the dataset provided by the organizers. The performance of participant algorithms is evaluated based on detection accuracy.
  2. Accuracy vs. latency
    In certain scenarios, having an idea about potential fakes in real time and with a low-complexity algorithm can be more important than the detection accuracy itself. This speaks to the tradeoff between effectiveness and efficiency. As described below, we take this aspect into consideration by introducing additional efficiency metrics to complement the more traditional accuracy-related metrics such as F1-score and MCC.

Task, Dataset and Test Environment

This challenge invites participants to design and implement a detector for out-of-context image captions that might accompany news images. In this task, the participants are asked to detect whether an (image, caption1, caption2) pair is out-of-context (misleading) or not-out-of-context. Given the test dataset, their algorithm must output a class label for each pair: 1/Out-of-context or 0/Not-out-of-context.
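As a concrete illustration of the expected input/output contract, a submission's inference loop might look like the sketch below. The `predict` function, file name, and captions are placeholders for a participant's actual detector, not part of the challenge specification:

```python
def predict(image_path, caption1, caption2):
    # Placeholder detector: a real submission replaces this with its model.
    # Must return 1 (out-of-context) or 0 (not-out-of-context).
    return 0

# One hypothetical test pair: the same photo with two conflicting captions.
pairs = [("img_001.jpg",
          "Flood hits the city center in June 2019.",
          "Storm surge submerges the same street in 2021.")]
labels = [predict(img, c1, c2) for img, c1, c2 in pairs]
```

Whatever the internal architecture, the externally visible behavior is just this mapping from (image, caption1, caption2) triples to binary labels.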

The authors of the paper have created a large dataset of 200K images, which they have matched with 450K textual captions from a variety of news websites, blogs, and social media posts. The images are gathered from a wide variety of articles, with a special focus on topics where the spread of misinformation is prominent, as shown in the figure below.

The dataset was gathered from two primary sources: news websites and fact-checking websites. It was collected in two steps: (1) Using publicly available news channel APIs, the authors scraped images along with the corresponding captions. (2) The authors then reverse-searched these images using Google’s Cloud Vision API to find different contexts across the web in which the image was shared. This yielded several captions with varying context for each image (2-4 captions per image).

From the paper:

The core idea of the method is a self-supervised training strategy where we only need captioned images; we do not require any explicit out-of-context annotations which would be potentially difficult to annotate in large numbers. We can then establish the image captions from the data as matches, and random captions from other images as non-matches. Using these matches vs non-matches as loss function, we are able to learn co-occurrence patterns of images with textual descriptions to determine whether the image appears to be out-of-context with respect to textual claims. During training, the method only learns to selectively align individual objects in an image with textual claims, without explicit out-of-context supervision. At test time, we are able to correlate these alignment predictions between the two captions for the input image. If both texts correspond to same object but their meaning is semantically different, we infer that the image is used out-of-context.
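The matches-vs-non-matches training signal from the quoted description can be sketched as a max-margin objective over toy embeddings. The vectors, scoring function, and margin below are illustrative stand-ins, not the paper's actual encoders or loss:

```python
def dot(u, v):
    # Similarity score between an image embedding and a caption embedding.
    return sum(a * b for a, b in zip(u, v))

def margin_loss(img_vec, match_vec, nonmatch_vec, margin=1.0):
    # Push the true caption's score above a random caption's by `margin`.
    return max(0.0, margin - dot(img_vec, match_vec) + dot(img_vec, nonmatch_vec))

# Toy embeddings: the matching caption points the same way as the image.
img = [1.0, 0.0]
match = [0.9, 0.1]      # caption taken from the same article (match)
nonmatch = [0.1, 0.9]   # caption sampled from another image (non-match)

loss = margin_loss(img, match, nonmatch)
```

Here the match scores 0.9 and the non-match 0.1, so the residual loss is 0.2; training would drive it to zero. At test time the analogous alignment scores for caption1 and caption2 can be compared to decide whether the image is used out-of-context.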

For this challenge, a part of the above dataset will be made public as the public test dataset, on which the participants can train and test their algorithms. Another part of the dataset will be augmented with new samples and modified to create the hidden test dataset, which will not be made publicly available, and will be used by the challenge organizers to evaluate the submissions.

Participants are free to develop their algorithms in any language or platform they prefer. For the evaluation, a submission in the form of a Docker image is required. This image should include all required dependencies and should run with the latest version of Docker (releases for Linux/Mac/Windows are available here). Note that the data should not be included within the Docker image itself, as it will be injected by us. Assume that the test dataset will be located at /mmsys21cheapfakes. A sample Dockerfile (along with instructions) is provided here.
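A container entrypoint that respects the fixed mount point might be structured as follows. Only the path /mmsys21cheapfakes is specified by the organizers; the annotation file name (`test.json`) and its JSON-lines layout are assumptions for the sake of the sketch:

```python
import json
import tempfile
from pathlib import Path

def load_pairs(data_dir):
    """Read (image, caption1, caption2) triples from an assumed JSON-lines file."""
    pairs = []
    with open(Path(data_dir) / "test.json") as f:
        for line in f:
            item = json.loads(line)
            pairs.append((item["img"], item["caption1"], item["caption2"]))
    return pairs

# At evaluation time the organizers mount the hidden set at the fixed path:
# pairs = load_pairs("/mmsys21cheapfakes")
# Local smoke test with one synthetic record:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "test.json").write_text(
        json.dumps({"img": "0.jpg", "caption1": "a", "caption2": "b"}) + "\n")
    pairs = load_pairs(d)
```

Keeping the data path as a parameter like this makes it easy to test the container locally on the public set before submission.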

Prospective participants can get access to the public test dataset by filling out this form.

Evaluation Criteria

The evaluation will be based on a custom metric derived by the organizers, comprising the E1 and E2 scores described below.

Effectiveness Score (E1)

Given the following definitions:
  • True Positives (TP): Number of pairs correctly identified as fake/out-of-context
  • True Negatives (TN): Number of pairs correctly identified as real
  • False Positives (FP): Number of pairs incorrectly identified as fake/out-of-context
  • False Negatives (FN): Number of pairs incorrectly identified as real

Authors are asked to evaluate the effectiveness of their algorithm according to the following criteria, and include the results in their submission.

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 * (Recall * Precision) / (Recall + Precision)
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
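The five effectiveness metrics follow directly from the confusion-matrix counts. A minimal reference implementation (the function name and the example counts are illustrative):

```python
import math

def detection_metrics(tp, tn, fp, fn):
    """Compute the effectiveness metrics from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Example: 200 test pairs, of which 100 are truly out-of-context.
m = detection_metrics(tp=80, tn=70, fp=30, fn=20)
```

For these counts, accuracy is 0.75 and MCC is about 0.50; note that MCC is undefined (division by zero) when any row or column of the confusion matrix is empty.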

E1 will be a function of Accuracy, F1-score and MCC, to be calculated by the organizers.

Efficiency Score (E2)

Authors are asked to evaluate the efficiency of their algorithm according to the following criteria and include the results in their submission.
  • Latency: Runtime in ms for the complete public test dataset
  • Complexity-1: Number of parameters in the model
  • Complexity-2: Model size in MB
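Latency and model size can be measured with the standard library alone; the parameter count (Complexity-1) depends on the framework used (e.g. summing trainable weights in the chosen toolkit), so only the other two are sketched here, with a trivial function and a dummy 1 MB "model" file standing in for a real detector:

```python
import os
import tempfile
import time

def measure_latency_ms(fn, *args):
    """Wall-clock runtime of one full pass over the test set, in milliseconds."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000.0

def model_size_mb(path):
    """Size of the serialized model file in MB."""
    return os.path.getsize(path) / (1024 * 1024)

# Demo: time a trivial workload and size a 1 MB dummy model file.
latency = measure_latency_ms(sum, range(100000))
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1024 * 1024))
size = model_size_mb(f.name)
os.unlink(f.name)
```

In a real submission, `fn` would be the detector's inference loop over the complete public test dataset, and the path would point at the serialized model shipped inside the Docker image.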

E2 will be a function of latency, complexity-1 and complexity-2, to be calculated by the organizers.

Important Dates

Submission Guidelines

Submissions should describe the implemented algorithm in sufficient detail in a short technical paper, prepared according to the guidelines provided for the Open Dataset and Software Track. A link to the code (preferably a GitHub repository) must also be included in the paper. Complete your submission here.

The authors will be notified after a review process, and the authors of accepted papers need to prepare a camera-ready version so that their papers can be published in the ACM Digital Library.

Winners will be chosen by a committee appointed by the challenge organizers, and the results will be final. If contributions of sufficient quality are not received, some or all of the awards may not be granted. The challenge is open to any individual, commercial, or academic institution.


The winner will be awarded 5,000 USD and the runner-up will be awarded 2,500 USD.


If you have any questions, please post them in the Google Group of the challenge or send an email to the organizers. You can also join the #2021-gc-cheapfakes channel in the ACM MMSys Slack workspace for discussions.