Grand Challenge on
Detecting Cheapfakes

Organized and sponsored by

University of Bergen
Technical University of Munich

Challenge Description

Cheapfake is a general term that encompasses many non-AI ("cheap") manipulations of multimedia content, including the use of editing software for image manipulations, speeding up or slowing down of videos, and the deliberate alteration of context in news captions. Cheapfakes are in fact known to be more prevalent than deepfakes. In this challenge, we aim to benchmark algorithms that can be used to detect out-of-context {image, caption1, caption2} triplets in news items, based on a recently compiled dataset. Note that we do not consider digitally altered ("fake") images, but rather focus on detecting the misuse of real photographs with conflicting image captions. Our current scope is limited to the English language, as the test dataset does not include captions in other languages.

An overview of this challenge, including the information provided on this page and further details, can be found in the challenge overview paper.


  1. Binary detection performance:
    The first goal of the task is to achieve high detection performance, i.e., to correctly detect whether {image, caption1, caption2} triplets are out-of-context or not-out-of-context. This speaks to effectiveness. Participant models are evaluated based on the Effectiveness Score (E1) described below.
  2. Latency and complexity:
    In certain scenarios, flagging the potential misuse of images in real time and with minimal resources can be more important than the detection performance itself. This speaks to efficiency. We take this aspect into consideration by introducing an additional goal: low latency and low complexity. Participant models are evaluated based on the Efficiency Score (E2) described below.

Task, Dataset and Test Environment


This challenge invites participants to develop a model using the dataset provided by the organizers, for the detection of out-of-context image captions that might be accompanying news images. For each sample in a given test split, their model must detect whether the {image, caption1, caption2} triplet is out-of-context or not-out-of-context, and output the corresponding class label: 1=out-of-context or 0=not-out-of-context.


The authors of this paper have created a large dataset of 200K images, which they have matched with 450K textual captions from a variety of news websites, blogs, and social media posts. The images are gathered from a wide variety of articles, with a special focus on topics where the spread of misinformation is prominent, as shown in the figure below.

For this challenge, a part of the above dataset is sampled and assigned as the public dataset. The public dataset, consisting of the training, validation and public test splits, is provided openly to participants for training and testing their algorithms. Prospective participants can get access to the public test dataset by filling out this form.

The remaining part of the dataset is augmented with new samples and modified to create the hidden test split, which is not made publicly available, and will be used by the challenge organizers to evaluate the submissions.

Test Environment

Participants are free to develop their algorithms in any language or platform they prefer. However, a submission in the form of a Docker image is required for evaluation. This image should include all required dependencies and should run with the latest version of Docker (releases for Linux/Mac/Windows are available here). Note that data should not be included within the Docker image itself, as it will be injected by us. Assume that the test dataset will be located at /mmsys21cheapfakes. Sample Docker file instructions can be found in the official GitHub repository for the challenge.
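As an illustration of how a container entry point might consume the injected data, the following minimal Python sketch reads triplets from the dataset directory and collects one class label per sample. Only the directory /mmsys21cheapfakes comes from the text above. The annotation file name (test.json), the one-JSON-object-per-line layout, the field names (image, caption1, caption2), and the predict callable are all hypothetical and should be replaced with whatever the official repository specifies.

```python
import json
import os

DATA_DIR = "/mmsys21cheapfakes"  # dataset location stated in the challenge text


def load_triplets(annotation_path):
    """Yield (image, caption1, caption2) from a JSON-lines file.

    The file layout and the field names are assumptions made for this
    sketch; they are not specified on this page.
    """
    with open(annotation_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield record["image"], record["caption1"], record["caption2"]


def run(predict, data_dir=DATA_DIR, annotation_file="test.json"):
    """Apply predict() to every triplet and return the list of labels.

    predict is a hypothetical callable taking (image_path, caption1,
    caption2) and returning 1 (out-of-context) or 0 (not-out-of-context).
    """
    labels = []
    annotation_path = os.path.join(data_dir, annotation_file)
    for image_rel, caption1, caption2 in load_triplets(annotation_path):
        label = predict(os.path.join(data_dir, image_rel), caption1, caption2)
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        labels.append(label)
    return labels
```

Keeping the data directory as a parameter lets the same script run against a local copy of the public test split during development and against the injected location during evaluation.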

Baseline Model

Prospective participants can consult the original paper for details about the authors’ proposed model. Their core idea is a self-supervised training strategy where only captioned images are needed: no explicit out-of-context annotations, which could be difficult to acquire in large numbers, are required during training. The image captions from the dataset are established as matches, and random captions from other images as non-matches. Using these matches and non-matches as the training signal for the loss function, the authors are able to learn the co-occurrence patterns of images with textual descriptions in order to determine whether an image appears to be out-of-context with respect to textual claims. During training, their method only learns to selectively align individual objects in an image with textual claims, without explicit out-of-context supervision. At test time, they correlate these alignment predictions between the two captions for the input image. If both texts correspond to the same object but their meaning is semantically different, they infer that the image is used out-of-context.
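The test-time rule described above (same object, different meaning) can be sketched as follows. This is an illustrative reading, not the authors' implementation: representing each caption's aligned object as an (x1, y1, x2, y2) bounding box, using intersection-over-union (IoU) to decide whether two captions refer to the same object, taking a precomputed caption similarity score in [0, 1], and both threshold values are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_out_of_context(box1, box2, caption_similarity,
                      iou_threshold=0.5, sim_threshold=0.5):
    """Return 1 (out-of-context) or 0 (not-out-of-context).

    box1 and box2 are the image regions aligned with caption1 and
    caption2. caption_similarity is a semantic similarity score between
    the two captions, produced by some text model outside this sketch.
    The thresholds are illustrative assumptions.
    """
    same_object = iou(box1, box2) > iou_threshold
    different_meaning = caption_similarity < sim_threshold
    return 1 if same_object and different_meaning else 0
```

For example, two captions aligned with the same region but with a similarity of 0.2 yield label 1, while the same region with a similarity of 0.9 yields label 0.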

Interested participants can use this model as a baseline for developing their own models for the challenge.

Evaluation Criteria

In order to rank participant models, two aggregate scores will be used: E1 and E2.

Effectiveness Score (E1)

Given the following definitions:
  • True Positives (TP): Number of samples correctly identified as out-of-context
  • True Negatives (TN): Number of samples correctly identified as not-out-of-context
  • False Positives (FP): Number of samples incorrectly identified as out-of-context
  • False Negatives (FN): Number of samples incorrectly identified as not-out-of-context

The effectiveness of participant models will be evaluated according to the following 5 metrics: accuracy, precision, recall, F1-score, and Matthews Correlation Coefficient (MCC). Authors are asked to calculate the 5 metrics for their model and include these values in their manuscript. E1 will be a function of Accuracy, F1-score and MCC, to be calculated by the organizers.

  • Accuracy: (TP + TN) / (TP + FP + FN + TN)
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1-score: 2 * (Recall * Precision) / (Recall + Precision)
  • MCC: (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
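The five formulas above can be computed directly from the confusion-matrix counts. The following Python sketch is one way to do so; the function names are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this example, not something the challenge text specifies.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), with 1 = out-of-context as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def effectiveness_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F1-score and MCC.

    Any metric whose denominator is zero is reported as 0.0
    (an assumption of this sketch).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * (recall * precision) / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }
```

A typical use is effectiveness_metrics(*confusion_counts(y_true, y_pred)), where y_true and y_pred are lists of 0/1 labels over the public test split.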

Efficiency Score (E2)

The efficiency of participant models will be evaluated according to the following 3 metrics: latency, number of parameters, and model size. Participants are asked to calculate the 3 metrics for their model and include these values in their manuscript. E2 will be a function of latency, complexity-1 and complexity-2, to be calculated by the organizers.

  • Latency: Average runtime per sample* (ms)
  • Complexity-1: Number of trainable parameters in the model (million)
  • Complexity-2: Model size (MB)

(*) Arithmetic mean of the runtime per sample, calculated over all samples in the public test split.
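The three efficiency metrics can be measured along the following lines. This is a minimal sketch using only the Python standard library: predict is a hypothetical callable that processes one sample, the parameter count is derived from a list of parameter tensor shapes that would have to be obtained from the framework in use, and treating 1 MB as 10^6 bytes is an assumption, since the text above does not define the unit.

```python
import math
import os
import time


def mean_latency_ms(predict, samples):
    """Arithmetic mean of the wall-clock runtime per sample, in milliseconds."""
    if not samples:
        raise ValueError("at least one sample is required")
    total_seconds = 0.0
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        total_seconds += time.perf_counter() - start
    return 1000.0 * total_seconds / len(samples)


def params_in_millions(shapes):
    """Number of parameters, in millions, given each tensor's shape.

    shapes is an iterable of tuples such as (1000, 1000); only the
    shapes of trainable parameters should be passed in.
    """
    return sum(math.prod(shape) for shape in shapes) / 1e6


def model_size_mb(path):
    """Size of the serialized model file in MB (assuming 1 MB = 10^6 bytes)."""
    return os.path.getsize(path) / 1e6
```

Timing each call individually, rather than timing the whole loop, keeps data-loading overhead outside predict out of the reported latency; whether that overhead should be included is a choice to state in the manuscript.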

Important Dates

Submission Guidelines

Submissions should be prepared according to the guidelines provided below:

  • The manuscript should provide enough details for the implemented algorithm in a short technical paper.
  • The manuscript should include references to the public repository for the source code and the Docker image, as well as the challenge overview paper.
  • Page count: Up to 6 pages plus an optional page for references only.
  • Style: Single blind using the ACM proceedings template. LaTeX users may prefer the “authordraft” template alternative for the first submission.
  • ACM header: “MMSys’21, Sept. 28-Oct. 1, 2021, Istanbul, Turkey” on the left, and author names (on even pages) / title (on odd pages, except the first page) on the right.
  • Format: Portable Document Format (PDF)
  • Online submission: Complete your submission here.

Authors will be notified after a review process. The authors of accepted papers will need to prepare a camera-ready version so that their papers can be published in the ACM Digital Library.

The challenge is open to any individual, commercial or academic institution. Winners will be chosen by a committee appointed by the challenge organizers and the decision will be final. The results will be announced during the ACM Multimedia Systems Conference (MMSys'21). If contributions of sufficient quality are not received, then some or all of the awards may not be granted.


The winner will be awarded 5,000 USD and the runner-up will be awarded 2,500 USD.


If you have any questions, please post them in the Google Group of the challenge or contact the organizers by email. You can also join the #2021-gc-cheapfakes channel in the ACM MMSys Slack workspace for discussions.