Cheapfake is a general term that encompasses many non-AI "cheap" manipulations of multimedia content, including the use of editing software such as Photoshop, but also the alteration of context, etc. Cheapfakes are in fact known to be more prevalent than deepfakes. In this challenge, we aim to benchmark algorithms that can be used to detect out-of-context (image, caption1, caption2) pairs in news items, based on a recently compiled dataset. Note that we do not consider digitally altered, or "fake", images, but rather focus on detecting the misuse of real photographs with conflicting image captions. Our current scope is limited to the English language, as the test dataset does not include captions in other languages.
- Binary detection
The main task of the challenge is to detect whether an (image, caption1, caption2) pair is out-of-context or not-out-of-context, using the dataset provided by the organizers. The performance of participant algorithms is evaluated based on the detection accuracy.
- Accuracy vs. latency
In certain scenarios, having an idea about potential fakes in real time and with a low-complexity algorithm can be more important than the detection accuracy itself. This speaks to the tradeoff between effectiveness and efficiency. As described below, we take this aspect into consideration by introducing additional efficiency metrics to complement the more traditional accuracy-related metrics such as F1-score and MCC.
Task, Dataset and Test Environment
This challenge invites participants to design and implement a detector for out-of-context image captions that might be accompanying news images. In this task, the participants are asked to detect whether an (image, caption1, caption2) pair is out-of-context (misleading) or not-out-of-context. Given the test dataset, their algorithm must output a class label corresponding to 1/Out-of-context or 0/Not-out-of-context.
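The input/output contract described above can be sketched as follows. This is only an illustration under stated assumptions: the names `label_pairs` and `placeholder_detector` are hypothetical, the in-memory sample layout is not the dataset's actual file format, and the word-overlap rule is a trivial stand-in rather than a recommended detection method.

```python
from typing import Callable, Iterable, List, Tuple

# Class labels as defined in the task description.
OUT_OF_CONTEXT = 1
NOT_OUT_OF_CONTEXT = 0

# Assumed in-memory layout of one sample: (image_path, caption1, caption2).
Sample = Tuple[str, str, str]


def label_pairs(
    samples: Iterable[Sample],
    detector: Callable[[str, str, str], int],
) -> List[int]:
    """Run a detector over all samples and return one 0/1 label per sample."""
    labels = []
    for image_path, caption1, caption2 in samples:
        label = int(detector(image_path, caption1, caption2))
        if label not in (NOT_OUT_OF_CONTEXT, OUT_OF_CONTEXT):
            raise ValueError(f"detector returned an invalid label: {label}")
        labels.append(label)
    return labels


def placeholder_detector(image_path: str, caption1: str, caption2: str) -> int:
    """Hypothetical stand-in: flag a pair when the captions share no words.

    It ignores the image entirely and exists only to make the sketch runnable.
    """
    words1 = set(caption1.lower().split())
    words2 = set(caption2.lower().split())
    return NOT_OUT_OF_CONTEXT if words1 & words2 else OUT_OF_CONTEXT
```

A real submission would replace `placeholder_detector` with its own model while keeping the one-label-per-pair output.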
The authors of the dataset have collected a large set of 200K images, which they have matched with 450K textual captions from a variety of news websites, blogs, and social media posts. The images are gathered from a wide variety of articles, with a special focus on topics where the spread of misinformation is prominent, as shown in the figure below.
The dataset was gathered from two primary sources: news websites and fact-checking websites. It was collected in two steps: (1) Using publicly available news channel APIs, the authors scraped images along with the corresponding captions. (2) The authors then reverse-searched these images using Google’s Cloud Vision API to find different contexts across the web in which the image was shared. Thus, they have obtained several captions per image with varying context (2-4 captions per image).
For this challenge, a part of the above dataset will be made public as the public test dataset, on which the participants can train and test their algorithms. Another part of the dataset, called the hidden test dataset, will not be made publicly available, and it will be used by the challenge organizers to evaluate the submissions.
Participants are free to develop their algorithms in any language or platform they prefer. For the evaluation, a submission in the form of a Docker image is required. This image should include all the required dependencies and should be possible to run using the latest version of Docker (releases for Linux/Mac/Windows are available here). Note that the data should not be included within the Docker image itself, as it will be injected by us. Assume that the test dataset will be located at /mmsys21cheapfakes. A sample Docker file is to be provided here.
The evaluation will be based on a custom metric derived by the organizers, comprising the E1 and E2 scores described below.
Effectiveness Score (E1)
Given the following definitions:
- True Positives (TP): Number of pairs correctly identified as fake/out-of-context
- True Negatives (TN): Number of pairs correctly identified as real/not-out-of-context
- False Positives (FP): Number of pairs incorrectly identified as fake/out-of-context
- False Negatives (FN): Number of pairs incorrectly identified as real/not-out-of-context
Authors are asked to evaluate the effectiveness of their algorithm according to the following criteria, and include the results in their submission.
|Metric|Definition|
|---|---|
|Accuracy|(TP + TN) / (TP + FP + FN + TN)|
|Precision|TP / (TP + FP)|
|Recall|TP / (TP + FN)|
|F1-score|2 * (Recall * Precision) / (Recall + Precision)|
|MCC|(TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))|
E1 will be a function of Accuracy, F1-score and MCC, to be calculated by the organizers.
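For self-reporting, the five effectiveness metrics can be computed directly from the confusion counts. The sketch below is a minimal illustration, not the organizers' evaluation code: the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention. It does not compute E1 itself, whose exact form is determined by the organizers.

```python
import math
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN), treating label 1 (out-of-context) as positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def effectiveness_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Accuracy, Precision, Recall, F1-score and MCC from the counts.

    Assumption: a metric whose denominator is zero is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * (recall * precision) / pr_sum if pr_sum else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }
```

Typical usage would be `effectiveness_metrics(*confusion_counts(y_true, y_pred))`, where `y_true` and `y_pred` are lists of 0/1 labels in the same order.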
Efficiency Score (E2)
Authors are asked to evaluate the efficiency of their algorithm according to the following criteria and include the results in their submission.
- Latency: Runtime in ms for the complete test dataset
- Complexity-1: Number of parameters in the model
- Complexity-2: Model size in MB
E2 will be a function of latency, complexity-1 and complexity-2 to be calculated by the organizers.
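The three efficiency figures could be gathered along the following lines. This is a framework-agnostic sketch under assumptions: the helper names are hypothetical, the parameter count is derived from a list of weight-tensor shapes that the participant supplies, and 1 MB is taken as 1024 * 1024 bytes, a convention the challenge text does not specify. `math.prod` requires Python 3.8 or later.

```python
import math
import os
import time
from typing import Callable, Iterable, List, Sequence, Tuple


def measure_latency_ms(
    predict_fn: Callable[..., int],
    samples: Iterable[Tuple],
) -> Tuple[float, List[int]]:
    """Return (total wall-clock runtime in ms, predictions) over all samples."""
    start = time.perf_counter()
    predictions = [predict_fn(*sample) for sample in samples]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, predictions


def count_parameters(weight_shapes: Iterable[Sequence[int]]) -> int:
    """Complexity-1: total number of parameters, given each weight tensor's shape."""
    return sum(math.prod(shape) for shape in weight_shapes)


def model_size_mb(model_path: str) -> float:
    """Complexity-2: size of the serialized model file in MB (assumed 1024 * 1024 bytes)."""
    return os.path.getsize(model_path) / (1024 * 1024)
```

Wall-clock time depends on the hardware and on what else is running, so reported latencies are best accompanied by a short description of the test machine.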
Submissions should describe the implemented algorithm in sufficient detail in a short technical paper, prepared according to the guidelines provided for the Open Dataset and Software Track. A link to the code (preferably a GitHub project) must also be included in the paper. Complete your submission here.
Authors will be notified after the review process. Authors of accepted papers must then prepare a camera-ready version so that their papers can be published in the ACM Digital Library.
Winners are to be chosen by a committee appointed by the challenge organizers and results will be final. If contributions of sufficient quality are not received, then some or all of the awards may not be granted. The challenge is open to any individual, commercial or academic institution.
The winner will be awarded 5,000 USD and the runner-up will be awarded 2,500 USD.
If you have any questions, please send an email to firstname.lastname@example.org.