Places2: A Large-Scale Database for Scene Understanding

Introduction

The goal of this challenge is to identify the scene category depicted in a photograph. The data for this task comes from the Places2 dataset, which contains more than 10 million images belonging to more than 400 unique scene categories. Specifically, the challenge data is divided into 8 million images for training, 36K images for validation, and 328K images for testing, drawn from 365 scene categories. Note that the number of training images per category is non-uniform, ranging from 4,000 to 40,000, to mimic the more natural frequency with which these scenes occur.

For each image, algorithms will produce a list of at most 5 scene categories in descending order of confidence. The quality of a labeling is evaluated based on the predicted label that best matches the ground-truth label for the image. The idea is to allow an algorithm to identify multiple scene categories in an image, given that many environments admit multiple labels (e.g., a bar can also be a restaurant) and that humans often describe a place using different words (e.g., forest path, forest, woods).
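Producing the ranked list described above amounts to picking the highest-scoring categories from a model's output. The following is a minimal sketch, assuming the network emits one confidence score per scene category (365 for this challenge) and that categories are identified by integer index; the function name `top_k_labels` is hypothetical and is not part of the official development kit.

```python
from typing import List, Sequence


def top_k_labels(scores: Sequence[float], k: int = 5) -> List[int]:
    """Return the indices of the k highest-scoring categories, most confident first.

    Assumes `scores` holds one confidence value per scene category.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort category indices by descending score. Python's sort is stable,
    # so categories with equal scores keep ascending-index order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:k]


# Example with 6 hypothetical categories:
# top_k_labels([0.1, 0.5, 0.05, 0.3, 0.02, 0.03]) returns [1, 3, 0, 2, 5]
```

A full sort is simple and fast enough for 365 scores per image; `heapq.nlargest` would be an alternative if the number of categories were much larger.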

Dates

May 20, 2016: Development kit, data, and evaluation software made available
Sep. 9, 2016, 5pm PDT: Submission deadline
Sep. 26, 2016: Challenge results released
Oct. 8, 2016: Winner(s) present at the ECCV 2016 ImageNet and COCO Joint Visual Recognition Workshop

Organizers

Download

You can download the data from the Download page. Note that the Places Challenge 2016 data differs from previous challenge data, so you must download the new data and train your networks on it. You may use the baseline Places-CNNs we have released (either Places205-CNNs or Places365-CNNs) as the starting point for training your networks.

Please register to obtain permission to submit results and to receive the application form for accessing GPU resources (provided by NVIDIA and IBM Cloud).

Evaluation

For each image, an algorithm will produce 5 labels \( l_j, j=1,...,5 \). The ground-truth labels for the image are \( g_k, k=1,...,n \), where n is the number of scene classes labeled in the image. The error of the algorithm for that image is

\[ e= \frac{1}{n} \cdot \sum_k \min_j d(l_j,g_k). \]

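Here \( d(x,y)=0 \) if \( x=y \) and 1 otherwise. The overall error score for an algorithm is the average error over all test images. Note that for this version of the competition n=1, that is, there is one ground-truth label per image, so the per-image error is 0 if the ground-truth label appears among the predicted labels and 1 otherwise.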
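The per-image error and its average over the test set can be sketched in Python directly from the formula. This is a minimal illustration, not the official evaluation software from the development kit; the function names `image_error` and `overall_error` are hypothetical, and labels are assumed to be integer category indices.

```python
from typing import Sequence


def image_error(predicted: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Per-image error e = (1/n) * sum_k min_j d(l_j, g_k).

    d(x, y) is 0 when x == y and 1 otherwise, so min_j d(l_j, g_k) is 0
    exactly when g_k appears among the predicted labels.
    """
    if len(predicted) > 5:
        raise ValueError("at most 5 predicted labels are allowed per image")
    if not ground_truth:
        raise ValueError("each image needs at least one ground-truth label")
    total = 0
    for g in ground_truth:
        # An empty prediction list matches nothing, so its distance is 1.
        total += min((0 if l == g else 1 for l in predicted), default=1)
    return total / len(ground_truth)


def overall_error(predictions: Sequence[Sequence[int]],
                  ground_truths: Sequence[Sequence[int]]) -> float:
    """Average the per-image error over all test images."""
    if len(predictions) != len(ground_truths):
        raise ValueError("need one prediction list per test image")
    if not predictions:
        raise ValueError("no test images to evaluate")
    errors = [image_error(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(errors) / len(errors)
```

With n=1 this reduces to the familiar top-5 error: for example, `overall_error([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], [[3], [11]])` is 0.5, because the first image's label is among its predictions and the second image's is not.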

Past Results

Challenge 2015 results

Citation

If you are reporting results of the challenge or using the dataset, please cite:

arXiv:1610.02055

Contact

Email Bolei Zhou or Aditya Khosla if you have any questions or comments.
