- Summary: There were 92 valid submissions in total from 27 teams. Hikvision won 1st place with a top-5 error of 0.0901, MW won 2nd place with 0.1019, and Trimps-Soushen won 3rd place with 0.1030. Congratulations to all the teams. See below for the leaderboard and the team information.
- Rule: Each team may use only the data provided in the Places2 Challenge 2016 to train its networks. Standard pre-trained CNN models trained on ImageNet-1.2 million and previous Places datasets are allowed. Each team can submit at most 5 prediction results. Ranks are based on the top-5 classification error of each submission.
- Scene classification with provided training data
- Team information
- Yellow background - winner in this task according to this metric & authors are willing to reveal the method
- White background - authors are willing to reveal the method
- Grey background - authors chose not to reveal the method
|Team name||Entry description||Top-5 classification error|
|MW||Model ensemble 2||0.1019|
|MW||Model ensemble 3||0.1019|
|MW||Model ensemble 1||0.1023|
|Trimps-Soushen||With extra data.||0.1030|
|SIAT_MMLAB||10 models fusion||0.1043|
|SIAT_MMLAB||7 models fusion||0.1044|
|SIAT_MMLAB||fusion with softmax||0.1044|
|SIAT_MMLAB||learning weights with cnn||0.1044|
|SIAT_MMLAB||6 models fusion||0.1049|
|MW||Single model B||0.1073|
|MW||Single model A||0.1076|
|NTU-SC||Product of 5 ensembles (top-5)||0.1085|
|NTU-SC||Product of 3 ensembles (top-5)||0.1086|
|NTU-SC||Sum of 3 ensembles (top-5)||0.1086|
|NTU-SC||Sum of 5 ensembles (top-3)||0.1086|
|NTU-SC||Single ensemble of 5 models (top-5)||0.1088|
|Samsung Research America: General Purpose Acceleration Group||Simple Ensemble, 3 Inception v3 models w/various hyper param changes, 32 multi-crop (60.11 top-1, 88.98 top-5 on val)||0.1113|
|fusionf||Fusion with average strategy (12 models)||0.1115|
|fusionf||Fusion with scoring strategy (14 models)||0.1117|
|fusionf||Fusion with average strategy (13 models)||0.1118|
|YoutuLab||weighted average1 at scale level using greedy search||0.1125|
|YoutuLab||weighted average at model level using greedy search||0.1127|
|YoutuLab||weighted average2 at scale level using greedy search||0.1129|
|fusionf||Fusion with scoring strategy (13 models)||0.1130|
|fusionf||Fusion with scoring strategy (12 models)||0.1132|
|YoutuLab||simple average using models in entry 3||0.1139|
|Samsung Research America: General Purpose Acceleration Group||Model A0, weakly scaled, multi-crop. (59.61 top-1, 88.64 top-5 on val)||0.1142|
|Samsung Research America: General Purpose Acceleration Group||Ensemble B, 3 Inception v3 models w/various hyper param changes + Inception v4 res2, 128 multi-crop||0.1152|
|YoutuLab||average on base models||0.1162|
|Samsung Research America: General Purpose Acceleration Group||Model A2, weakly scaled, single-crop & mirror. (58.84 top-1, 88.09 top-5 on val)||0.1188|
|Samsung Research America: General Purpose Acceleration Group||Model A1, weakly scaled, single-crop. (58.65 top-1, 88.07 top-5 on val)||0.1193|
|Rangers||ensemble model 1||0.1208|
|Rangers||ensemble model 2||0.1212|
|Everphoto||ensemble by learned weights - 1||0.1213|
|Everphoto||ensemble by product strategy||0.1218|
|Everphoto||ensemble by learned weights - 2||0.1218|
|Everphoto||ensemble by average strategy||0.1223|
|MIPAL_SNU||Ensemble of two ResNet-50 with balanced sampling||0.1232|
|KPST_VB||Ensemble of Model I and II||0.1235|
|Rangers||single model result of 69||0.1240|
|Everphoto||ensemble by product strategy (without specialist models)||0.1242|
|KPST_VB||Model II with adjustment||0.1250|
|Rangers||single model result of 66||0.1253|
|KPST_VB||Ensemble of Model I and II with adjustment||0.1253|
|SJTU-ReadSense||Ensemble 5 models with learnt weights||0.1272|
|SJTU-ReadSense||Ensemble 5 models with weighted validation accuracies||0.1273|
|iMCB||A combination of CNN models based on researched influential factors||0.1277|
|SJTU-ReadSense||Ensemble 6 models with learnt weights||0.1278|
|SJTU-ReadSense||Ensemble 4 models with learnt weights||0.1287|
|iMCB||A combination of CNN models with a strategy w.r.t. validation accuracy||0.1299|
|Choong||Based on VGG16, features are extracted from multiple layers. ROI proposal network is not applied. Every neuron from each feature layer is center of ROI candidate.||0.1310|
|SIIT_KAIST||101-depth single model (val. error 12.90%)||0.1310|
|DPAI Vison||An ensemble model||0.1355|
|isia_ICT||spectral clustering on confusion matrix||0.1355|
|isia_ICT||fusion of 4 models with average strategy||0.1357|
|Viz Insight||Multiple Deep Metaclassifiers||0.1386|
|DPAI Vison||Single Model||0.1425|
|isia_ICT||2 models with size of 288||0.1433|
|Faceall-BUPT||A single model with 150 crops||0.1471|
|iMCB||A Single Model||0.1506|
|SJTU-ReadSense||A single model (based on Inception-BN) trained on the Places365-Challenge dataset||0.1511|
|OceanVision||A result obtained by VGG-16||0.1635|
|OceanVision||A result obtained by alexnet||0.1867|
|OceanVision||A result obtained by googlenet||0.1867|
|ABTEST||GoogLeNet Model trained on LSUN dataset and fine-tuned on Places2||0.3245|
|Vladimir Iglovikov||VGG16 trained on 128x128||0.3552|
|Vladimir Iglovikov||VGG19 trained on 128x128||0.3593|
|Vladimir Iglovikov||average of VGG16 and VGG19 trained on 128x128||0.3712|
|Vladimir Iglovikov||Resnet 50 trained on 128x128||0.4577|
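The ranking metric in the table above, top-5 classification error, counts a prediction as correct when the ground-truth label is among the five highest-scoring classes. The following Python sketch illustrates the metric together with a plain score-averaging fusion of the kind several entries describe. The function names (`top5_error`, `average_fusion`) and the toy scores are hypothetical and chosen only for illustration; this is not the organizers' evaluation code.

```python
from typing import List, Sequence


def top5_error(scores: Sequence[Sequence[float]],
               labels: Sequence[int],
               k: int = 5) -> float:
    """Fraction of images whose true label is NOT among the k top-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


def average_fusion(model_scores: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average the per-class scores of several models, image by image."""
    n_models = len(model_scores)
    n_images = len(model_scores[0])
    n_classes = len(model_scores[0][0])
    return [
        [sum(m[i][c] for m in model_scores) / n_models for c in range(n_classes)]
        for i in range(n_images)
    ]


# Toy data (hypothetical): 2 images, 6 classes, 2 models.
labels = [5, 0]
scores_a = [
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # class 5 ranked last: miss
    [0.40, 0.20, 0.15, 0.10, 0.10, 0.05],  # class 0 ranked first: hit
]
scores_b = [
    [0.02, 0.03, 0.05, 0.10, 0.20, 0.60],  # class 5 ranked first: hit
    [0.04, 0.06, 0.10, 0.20, 0.25, 0.35],  # class 0 ranked last: miss
]

fused = average_fusion([scores_a, scores_b])
print(top5_error(scores_a, labels))  # 0.5
print(top5_error(scores_b, labels))  # 0.5
print(top5_error(fused, labels))     # 0.0
```

In this toy case each model misses a different image, so the averaged scores recover both labels. Real entries fuse many more models and often weight them, which this sketch does not attempt to reproduce.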
|Team name||Team members||Abstract|
|Hikvision||Qiaoyong Zhong*, Chao Li, Yingying Zhang(#), Haiming Sun*, Shicai Yang*, Di Xie, Shiliang Pu (* indicates equal contribution)
Hikvision Research Institute
(#)ShanghaiTech University, work is done at HRI
Our work on object detection is based on Faster R-CNN. We design and validate the following improvements:
* Better network. We find that the identity-mapping variant of ResNet-101 is superior for object detection over the original version.
* Better RPN proposals. A novel cascade RPN is proposed to refine proposals' scores and location. A constrained neg/pos anchor ratio further increases proposal recall dramatically.
* Pretraining matters. We find that a pretrained global context branch increases mAP by over 3 points. Pretraining on the 1000-class LOC dataset further increases mAP by ~0.5 point.
* Training strategies. To attack the imbalance problem, we design a balanced sampling strategy over different classes. With balanced sampling, the provided negative training data can be safely added for training. Other training strategies, like multi-scale training and online hard example mining are also applied.
* Testing strategies. During inference, multi-scale testing, horizontal flipping and weighted box voting are applied.
The final mAP is 65.1 (single model) and 67 (ensemble of 6 models) on val2.
A combination of 3 Inception networks and 3 residual networks is used to make the class prediction. For localization, the same Faster R-CNN configuration described above for DET is applied. The top5 classification error rate is 3.46%, and localization error is 8.8% on the validation set.
For the scene classification task, drawing on our newly built M40-equipped GPU clusters, we have trained more than 20 models with various architectures, such as VGG, Inception, ResNet and different variants of them, in the past two months. Fine-tuning very deep residual networks from pre-trained ImageNet models, like ResNet 101/152/200, did not perform as well as we expected. According to our experiments, Inception-style networks could achieve better performance in considerably less training time. Based on this observation, deep Inception-style networks and not-so-deep residual networks have been used. In addition, we have made several improvements for training and testing. First, a new data augmentation technique is proposed to better utilize the information in the original images. Second, a new learning rate setting is adopted. Third, label shuffling and label smoothing are used to tackle the class imbalance problem. Fourth, some small tricks are used to improve performance in the test phase. Finally, we achieved a top-5 error rate below 9% on the validation set.
We utilize a fully convolutional network transferred from the VGG-16 net, with a module called the mixed context network and a refinement module appended to the end of the net. The mixed context network is constructed from a stack of dilated convolutions and skip connections. The refinement module generates predictions from the output of the mixed context network and feature maps from early layers of the FCN. The predictions are then fed into a sub-network designed to simulate a message-passing process. Compared with the baseline, our first major improvement is the mixed context network, which we find provides better features for dealing with stuff, big objects and small objects all at once. The second improvement is a memory-efficient sub-network that simulates the message-passing process. The proposed system can be trained end-to-end. On the validation set, the mean IoU of our system is 0.4099 (single model) and 0.4156 (ensemble of 3 models), and the pixel accuracy is 79.80% (single model) and 80.01% (ensemble of 3 models).
 Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
 Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." arXiv preprint arXiv:1604.03540 (2016).
 He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
 He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
 Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
 Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).
 Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
 F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in ICLR, 2016.
 J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
 S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," in ICCV, 2015.
 L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", arXiv:1606.00915, 2016.
 P. O. Pinheiro, T. Lin, R. Collobert, P. Dollar, "Learning to Refine Object Segments", arXiv:1603.08695, 2016.
|MW||Gang Sun (Institute of Software, Chinese Academy of Sciences)
Jie Hu (Peking University)
|We leverage the theory named CNA  (capacity and necessity analysis) to guide the design of CNNs. We add more layers on the larger feature maps (e.g., 56x56) to increase capacity, and remove some layers on the smaller feature maps (e.g., 14x14) to avoid ineffective architectures. We have verified the effectiveness on the models in , ResNet-like models , and Inception-ResNet-like models . In addition, we use patches cropped from the original images as training samples, selecting a random area and aspect ratio. To improve generalization, we prune the model weights periodically. Moreover, we use a balanced sampling strategy  and label-smoothing regularization  during training, to alleviate the bias from the non-uniform sample distribution among categories and from partially incorrect training labels. We use only the provided data (Places365) for training, without any additional data, and train all models from scratch. The algorithm and architecture details will be described in our arXiv paper (available online shortly).
 Xudong Cao. A practical theory for designing very deep convolutional neural networks, 2014. (unpublished)
 Li Shen, Zhouchen Lin, Qingming Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. In ECCV 2016.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. In CVPR 2016.
 Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. ArXiv:1602.07261,2016.
 Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv:1512.00567,2016.
|Trimps-Soushen||Jie Shao, Xiaoteng Zhang, Zhengyan Ding, Yixin Zhao, Yanjun Chen, Jianying Zhou, Wenfei Wang, Lin Mei, Chuanping Hu
The Third Research Institute of the Ministry of Public Security, P.R. China.
|Object detection (DET)
We use several pre-trained models, including ResNet, Inception, Inception-ResNet, etc. Taking the predicted boxes from our best model as region proposals, we average the softmax scores and the box regression outputs across all models. Other improvements include annotation refinement, box voting and feature maxout.
Object classification/localization (CLS-LOC)
Based on image classification models like Inception, Inception-Resnet, ResNet and Wide Residual Network (WRN), we predict the class labels of the image. Then we refer to the framework of "Faster R-CNN" to predict bounding boxes based on the labels. Results from multiple models are fused in different ways, using the model accuracy as weights.
Scene classification (Scene)
We adopt different kinds of CNN models such as ResNet, Inception and WRN. To improve the performance of features from multiple scales and models, we implement a cascade softmax classifier after the extraction stage.
Object detection from video (VID)
Same methods as DET task were applied to each frame. Optical flow guided motion prediction helped to reduce the false negative detections.
 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. NIPS 2015
 Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi.
 Zagoruyko S, Komodakis N. Wide Residual Networks[J]. arXiv preprint arXiv:1605.07146, 2016.
|SIAT_MMLAB||Sheng Guo, Linjie Xing,
Shenzhen Institutes of Advanced Technology, CAS.
Computer Vision Lab, ETH Zurich.
Chinese University of Hong Kong.
Jiaming Liu and Yu Qiao,
Shenzhen Institutes of Advanced Technology, CAS.
|We propose a modular framework for large-scale scene recognition, called multi-resolution CNN (MR-CNN) . This framework addresses the difficulty of characterizing scene concepts, which may be based on multi-level visual information, including local objects, spatial layout, and global context. Specifically, in this challenge submission, we utilize four resolutions (224, 299, 336, 448) as the input sizes of MR-CNN architectures. For the coarse resolutions (224, 299), we exploit the existing powerful Inception architectures (Inception v2 , Inception v4 , and Inception-ResNet ), while for the fine resolutions (336, 448), we propose new Inception architectures by making the original Inception network deeper and wider. Our final submission is the prediction result of MR-CNNs, obtained by fusing the outputs of CNNs of different resolutions.
In addition, we propose several principled techniques to reduce the over-fitting risk of MR-CNNs, including class balancing and hard sample mining. These simple yet effective training techniques enable us to further improve the generalization performance of MR-CNNs on the validation dataset. Meanwhile, we use an efficient parallel version of Caffe toolbox  to allow for the fast training of our proposed deeper and wider Inception networks.
 L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, Knowledge guided disambiguation for large-scale scene classification with Multi-Resolution CNNs, in arXiv, 2016.
 S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in ICML, 2015.
 C. Szegedy, S. Ioffe, and V. Vanhoucke, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in arXiv, 2016.
 L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal segment networks: towards good practices for deep action recognition, in ECCV, 2016.
|NTU-SC||Jason Kuen, Xingxing Wang, Bing Shuai, Xiangfei Kong, Jianxiong Yin, Gang Wang*, Alex C Kot
Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore.
|All of our scene classification models are built upon pre-activation ResNets . For scene classification using the provided RGB images, we train from scratch a ResNet-200, as well as a relatively shallow Wide-ResNet . In addition to RGB images, we make use of class activation maps  and (scene) semantic segmentation masks  as complementary cues, obtained from models pre-trained for ILSVRC image classification  and scene parsing  tasks respectively. Our final submissions consist of ensembles of multiple models.
 He, K., Zhang, X., Ren, S., & Sun, J. “Identity Mappings in Deep Residual Networks”. ECCV 2016.
 Zagoruyko, S., & Komodakis, N. “Wide Residual Networks”. BMVC 2016.
 Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. “Learning Deep Features for Discriminative Localization”. CVPR 2016.
 Shuai, B., Zuo, Z., Wang, G., & Wang, B. "Dag-Recurrent Neural Networks for Scene Labeling". CVPR 2016.
 Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. “Imagenet large scale visual recognition challenge”. International Journal of Computer Vision, 115(3), 211-252.
 Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. “Semantic Understanding of Scenes through the ADE20K Dataset”. arXiv preprint arXiv:1608.05442.
|NQSCENE||Chen Yunpeng ( NUS )
Jin Xiaojie ( NUS )
Zhang Rui ( CAS )
Li Yu ( CAS )
Yan Shuicheng ( Qihoo/NUS )
|Technique Details for the Scene Classification:
For the scene classification task, we propose the following methods to address the data imbalance issues (aka the long tail distribution issue) which benefit and boost the final performance:
1) Category-wise Data Augmentation:
We applied a category-wise data augmentation strategy, which associates each category with an adaptive augmentation level. The augmentation level is updated iteratively during training.
2) Multi-task Learning:
We proposed a multipath learning architecture to jointly learn feature representations from the ImageNet-1000 dataset and the Places-365 dataset.
Vanilla ResNet-200  is adopted with the following elementary tricks: scale and aspect ratio augmentation, over-sampling, and multi-scale (x224, x256, x288, x320) dense testing. In total, we have trained four models and fused them by averaging their scores. Training each model takes about three days using MXNet  on a cluster with forty NVIDIA M40 (12GB) GPUs.
 He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
 Chen, Tianqi, et al. "Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems." arXiv preprint arXiv:1512.01274(2015).
|Samsung Research America: General Purpose Acceleration Group||Dr. S. Eliuk (Samsung), C. Upright (Samsung), Dr. H. Vardhan (Samsung),
T. Gale (Intern, North Eastern),
S. Walsh (Intern, University of Alberta).
|The General Purpose Acceleration Group is focused on accelerating training via HPC & distributed computing. We present Distributed Training Done Right (DTDR) where standard open-source models are trained in an effective manner via a multitude of techniques involving strong / weak scaling and strict distributed training modes. Several different models are used from standard Inception v3, to Inception v4 res2, and ensembles of such techniques. The training environment is unique as we can explore extremely deep models given the model-parallel nature of our partitioning of data.
|fusionf||Nina Narodytska (Samsung Research America)
Shiva Kasiviswanathan (Samsung Research America)
Hamid Maei (Samsung Research America)
|We used several modifications of modern CNNs, including VGG, GoogleNet[2,4], and ResNet. We used several fusion strategies, including a standard averaging and scoring scheme. We also used different subsets of models in different submissions. Training was performed on the low-resolution dataset. We used balanced loading to take into account different numbers of images in each class.
 K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition.
 Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections
|YoutuLab||Xiaowei Guo, YoutuLab
Ruixin Zhang, YoutuLab
Yushi Yao, YoutuLab
Pai Peng, YoutuLab
Ke Li, YoutuLab
|We build a scene recognition system using deep CNN models. These CNN models are inspired by the original ResNet and Inception network architectures. We train these models on the challenge dataset and apply a balanced sampling strategy to handle its class imbalance. Moreover, the DSD process is applied to further improve model performance.
In this competition, we submit five entries. The first and second are combinations of single-scale results using a weighted arithmetic average whose weights are found by greedy search. The third is a combination of single-model results using the same strategy as the first entry. The fourth and fifth are combinations of single-model results using a simple average.
 K. He, X. Zhang, S. Ren, J. Sun. Identity Mappings in Deep Residual Networks. In ECCV 2016. abs/1603.05027
 C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In ICLR 2016. abs/1602.07261
 L. Shen, Z. Lin, Q. Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. abs/1512.05830
 S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, W. J. Dally. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow. abs/1607.04381
|SamExynos||Qian Zhang(Beijing Samsung Telecom R&D Center)
Peng Liu(Beijing Samsung Telecom R&D Center)
Jinbin Lin(Beijing Samsung Telecom R&D Center)
Junjun Xiong(Beijing Samsung Telecom R&D Center)
The submission is based on  and , but we modified the model, and the network has 205 layers. Due to limited time and GPUs, we trained only three CNN models for classification. The top-5 accuracy on the validation set with dense crops (scales: 224, 256, 288, 320, 352, 384, 448, 480) is 96.44% for the best single model, and 96.88% for the three-model ensemble.
The submission is based on  and ; we add 5 layers to ResNet-50 and modify the network. Due to limited time and GPUs, we trained only three CNN models for the scene classification task. The top-5 accuracy on the validation set with 72 crops is 87.79% for the best single model. The top-5 accuracy on the validation set with multiple crops is 88.70% for the three-model ensemble.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Identity Mappings in Deep Residual Networks. ECCV 2016.
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv preprint arXiv:1602.07261 (2016)
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. "Rethinking the Inception Architecture for Computer Vision". arXiv preprint arXiv:1512.00567 (2015)
|Rangers||Y. Q. Gao,
W. H. Luo,
X. J. Deng,
W. D. Chen,
|Everphoto||Yitong Wang, Zhonggan Ding, Zhengping Wei, Linfu Wen
|Our method is based on DCNN approaches.
We use 5 models with different input scales and different network structures as basic models. They are derived from GoogleNet, VGGNet and ResNet.
We also utilize the idea of dark knowledge  to train several specialist models, and use these specialist models to reassign probability scores and refine the basic outputs.
Our final results are based on the ensemble of refined outputs.
 Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
|MIPAL_SNU||Sungheon Park and Nojun Kwak (Graduate School of Convergence Science and Technology, Seoul National University)||We trained two ResNet-50  networks. One network used 7x7 mean pooling, and the other used multiple mean poolings with various sizes and positions. We also used balanced sampling strategy which is similar to  to deal with the imbalanced training set.
 He, Kaiming, et al. "Deep residual learning for image recognition." CVPR, 2016.
 Shen, Li, et al. "Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks." arXiv, 2015.
|KPST_VB||Nguyen Hong Hanh
|In this work, we used a pre-trained ResNet200 (ImageNet) and retrained the network on Places365 Challenge data (256 by 256). We also estimated scene probability using the output of the pre-trained ResNet200 and the scene vs. object (ImageNet 1000 class) distribution on the training data. For classification, we used an ensemble of two networks with multiple crops, adjusted by scene probability.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
*Our work was performed with a deep learning analysis tool (Deep SDK by KPST).
|SJTU-ReadSense||Qinchuan Zhang, Shanghai Jiao Tong University
Junxuan Chen, Shanghai Jiao Tong University
Thomas Tong, ReadSense
Leon Ding, ReadSense
Hongtao Lu, Shanghai Jiao Tong University
|We train two CNN models from scratch. Model A, based on Inception-BN  with one auxiliary classifier, is trained on the Places365-Challenge dataset  and achieves 15.03% top-5 error on the validation dataset. Model B, based on ResNet  with a depth of 50 layers, is trained on the Places365-Standard dataset and, due to limited time, fine-tuned for 2 epochs on the Places365-Challenge dataset; it achieves 16.3% top-5 error on the validation dataset. We also fuse features extracted from 3 baseline models  on the Places365-Challenge dataset and train two fully connected layers with a softmax classifier. Moreover, we adopt the "class-aware" sampling strategy proposed by  for models trained on the Places365-Challenge dataset to tackle the non-uniform distribution of images over the 365 categories. We implement model A using Caffe  and conduct all other experiments using MXNet  to deploy a larger batch size on a GPU.
We train all models with a 224x224 crop randomly sampled from a 256x256 image or its horizontal flip, with the per-pixel mean subtracted. We apply 12-crop testing  for evaluation on the validation and test datasets.
We ensemble multiple models with weights (learnt on validation dataset or top-5 validation accuracies), and achieve 12.79% (4 models), 12.69% (5 models), 12.57% (6 models) top-5 error on validation dataset.
 S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 Places: An Image Database for Deep Scene Understanding. B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. Arxiv, 2016.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
 L. Shen, Z. Lin, Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. arXiv:1512.05830, 2015.
 Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
 T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS, 2015.
 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
@Shanghai Jiao Tong University (SJTU)
(* indicates equal contribution)
|In this competition, we submit five entries.
The first model is a single model, which achieved 15.24% top-5 error on the validation dataset. It is an Inception-V3 model that is modified and trained on both the challenge and standard datasets. At test time, images are resized to 337*337 and then a 12-crop technique is used to obtain the 299*299 inputs to the model, which improves performance.
The second model is a fusion-feature model (FeatureFusion_2L), which achieved 13.74% top-5 error on the validation dataset. It is a two-layer fusion-feature network whose input is the combination of fully-connected-layer features extracted from several well-performing CNNs (i.e. pretrained models, such as ResNet, VGG, GoogLeNet). It turns out to be effective in reducing the error rate.
The third model is also a fusion-feature network (FeatureFusion_3L), which achieved 13.95% top-5 error on the validation dataset. Compared with the second model, it is a three-layer fusion-feature network that contains two fully-connected layers.
The fourth is the combination of CNN models with a strategy w.r.t. validation accuracy, which achieved 13% top-5 error on the validation dataset. It combines the probabilities provided by the softmax layers of three CNNs, in which the influential factor of each CNN is determined by its validation accuracy.
The fifth is the combination of CNN models based on researched influential factors, which achieved 12.65% top-5 error on the validation dataset. Six CNNs are taken into consideration: four of them (Inception-V2, Inception-V3, FeatureFusion_2L and FeatureFusion_3L) are trained by us and the other two are pretrained. The influential factors of these models are optimized through extensive experiments.
 Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." arXiv preprint arXiv:1512.00567 (2015).
B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. "Places: An Image Database for Deep Scene Understanding." Arxiv, 2016.
 B. Zhou, A. Lapedriza, J. Xiao, A. Torralba and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database." Advances in Neural Information Processing Systems 27 (NIPS), 2014.
|Choong||Choong Hwan Choi (KAIST)||Abstract
Ensemble of Deep learning model based on VGG16 & ResNet
Based on VGG16, features are extracted from multiple layers. ROI proposal network is not applied. Every neuron from each feature layer is center of ROI candidate.
 Liu, Wei, et al. "SSD: Single Shot Multibox Detector"
 K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition"
 Kaiming He, et al., "Deep Residual Learning for Image Recognition"
|SIIT_KAIST||Sihyeon Seong (KAIST)
Byungju Kim (KAIST)
Junmo Kim (KAIST)
|We used ResNet (101 layers / 4 GPUs) as our baseline model. Starting from the model pre-trained on the ImageNet classification dataset (provided by ), we re-tuned the model on the Places365 dataset (256-resized small dataset). Then, we further fine-tuned the model based on the following ideas:
i) Analyzing correlations between labels : We calculated correlations between each pair of predictions p(i), p(j) where i, j are classes. Then, highly correlated label pairs are extracted by thresholding the correlation coefficients.
ii) Additional semantic label generation : Using the correlation table from i), we further generated super/subclass labels by clustering them. Additionally, we generated 170 binary labels for separations of confusing classes, which maximize margins between highly correlated label pairs.
iii) Boosting-like multi-loss terms :
A large number of loss terms are combined for classifying the labels generated in ii).
 He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
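Step i) of the SIIT_KAIST description can be illustrated with a minimal sketch. It assumes the correlation is the Pearson coefficient computed over per-image class scores, and the function names, the toy scores, and the threshold of 0.5 are all hypothetical; the source does not state the values or the exact statistic used.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlated_label_pairs(probs, threshold=0.5):
    """Return (i, j, r) for class pairs whose scores correlate above threshold.

    probs: list of per-image score vectors, one score per class.
    threshold: hypothetical cut-off, not taken from the source.
    """
    num_classes = len(probs[0])
    # Transpose so that each column holds one class's scores over all images.
    columns = [[row[c] for row in probs] for c in range(num_classes)]
    pairs = []
    for i, j in combinations(range(num_classes), 2):
        r = pearson(columns[i], columns[j])
        if r > threshold:
            pairs.append((i, j, r))
    return pairs


# Toy scores for 4 images and 3 classes: classes 0 and 1 rise and fall together.
toy_probs = [
    [0.45, 0.40, 0.15],
    [0.10, 0.15, 0.75],
    [0.40, 0.45, 0.15],
    [0.05, 0.10, 0.85],
]
pairs = correlated_label_pairs(toy_probs, threshold=0.5)
```

In this toy example only the pair (0, 1) passes the threshold; such pairs would be the candidates for the extra binary labels described in step ii).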
|DPAI Vison||Object detection: Chris Li, Savion Zhao, Bin Liu, Yuhang He, Lu Yang, Cena Liu
Scene classification: Lu Yang, Yuhang He, Cena Liu, Bin Liu, Bo Yu
Scene parsing: Bin Liu, Lu Yang, Yuhang He, Cena Liu, Bo Yu, Chris Li, Xiongwei Xia
Object detection from video: Bin Liu, Cena Liu, Savion Zhao, Yuhang He, Chris Li
|Object detection: Our method is based on Faster R-CNN and an extra classifier. (1) Data processing: we equalized the data by deleting many examples from the three dominating classes (person, dog, and bird) and adding extra data for classes with fewer than 1000 training examples; (2) COCO pre-training; (3) iterative bounding box regression + multi-scale (train/test) + random image flipping (train/test); (4) multi-model ensemble: ResNet-101 and Inception-v3; (5) an extra classifier with 200 classes, which helps improve recall and refine the detection scores of the final boxes.
 He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.
 Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.
Scene classification: We trained the models with Caffe. The final entry is an ensemble of Inception-V3 and Inception-V4 networks, four models in total. Its top-1 error on validation is 0.431 and its top-5 error is 0.129. The single model is a modified Inception-V3, with a top-1 error of 0.434 and a top-5 error of 0.133 on validation.
 Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093. 2014.
 C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
 C. Szegedy, S. Ioffe, V. Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261, 2016.
Scene parsing: We trained 3 models on a modified DeepLab (Inception-v3, ResNet-101, ResNet-152) and used only the ADEChallengeData2016 data. Multi-scale input, image cropping, image flipping, and contrast transformation are used for data augmentation, and denseCRF is used as post-processing to refine object boundaries. On validation, the combination of the 3 models achieved 0.3966 mIoU and 0.7924 pixel accuracy.
 L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. 2016. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915.
 B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. 2016. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442.
Object detection from video: Our method is based on Faster R-CNN and an extra classifier. We train Faster R-CNN based on ResNet-101 with the provided training data. We also train an extra classifier with 30 classes, which helps improve recall and refine the detection scores of the final boxes.
|isia_ICT||Xinhan Song, Institute of Computing Technology
Chengpeng Chen, Institute of Computing Technology
Shuqiang Jiang, Institute of Computing Technology
|For convenience, we use the 4 provided models as our base models for the subsequent fine-tuning and network adaptation. In addition, because the Challenge Dataset is non-uniform and contains a very large number of images, we use only the Standard Dataset for all the following steps.
First, we fuse these models with an averaging strategy as the baseline. Then we add an SPP layer to VGG16 and ResNet152, respectively, so that the models can be fed larger images. After fine-tuning the models, we again fuse them with an averaging strategy; we submit only the result for size 288.
We also perform spectral clustering on the confusion matrix extracted from the validation data to obtain 20 clusters; that is, the 365 classes are separated into 20 clusters, mainly according to their correlation with one another. To classify the classes within the same cluster more precisely, we train an extra classifier for each cluster by fine-tuning the networks with all layers fixed except fc8, and finally combine them into a single network.
D. Yoo, S. Park, J. Lee and I. Kweon. “Multi-scale pyramid pooling for deep convolutional representation”. In CVPR Workshop 2015
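The averaging strategy used by isia_ICT for model fusion can be sketched as follows. This is a minimal illustration that assumes "average strategy" means the arithmetic mean of per-class scores across models; the function names and the toy scores are hypothetical and not taken from the team's code.

```python
def average_fusion(model_scores):
    """Average per-class scores across models for one image.

    model_scores: list of score vectors, one per model, all of equal length.
    """
    num_models = len(model_scores)
    return [sum(scores) / num_models for scores in zip(*model_scores)]


def top_k(scores, k=5):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]


# Toy example: three hypothetical models scoring one image over six classes.
scores_a = [0.10, 0.50, 0.05, 0.20, 0.10, 0.05]
scores_b = [0.30, 0.30, 0.05, 0.20, 0.10, 0.05]
scores_c = [0.20, 0.40, 0.10, 0.15, 0.10, 0.05]

fused = average_fusion([scores_a, scores_b, scores_c])
prediction = top_k(fused, k=3)
```

An image counts as a top-5 hit when its ground-truth class appears among the five indices returned by `top_k(fused, k=5)`; the toy example uses `k=3` only because it has six classes.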
|NUIST||Jing Yang, Hui Shuai, Zhengbo Yu, Rongrong Fan, Qiang Ma, Qingshan Liu, Jiankang Deng||1. Inception v2 is used in the VID task, which runs in almost real time on a GPU.
2. Cascaded region regression is used to detect and track different instances.
3. Context inference between instances within each video.
4. Online detector and tracker updates to improve recall.
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).
|Viz Insight||Biplab Ch Das, Samsung R&D Institute Bangalore
Shreyash Pandey, Samsung R&D Institute Bangalore
|Ensembling approaches have been known to outperform individual classifiers on standard classification tasks [No free Lunch Theorem :)]
In our approach we trained state of the art classifiers including variations of:
Each of these classifiers was trained on different views of the provided Places2 challenge data.
Multiple deep meta-classifiers were trained on the confidences of the labels predicted by the above classifiers, yielding a non-linear ensemble
in which the weights of the neural network are learned so as to maximize scene recognition accuracy.
To impose further consistency between objects and scenes, a state-of-the-art classifier trained on ImageNet was adapted to Places via a zero-shot learning approach.
We did not use any external data for training the classifiers. However, we balanced the data so that the classifiers would give unbiased results, so some of the data remained unused.
|Faceall-BUPT||Xuankun HUANG, BUPT, CHINA
Jiangqi ZHANG, BUPT, CHINA
Zhiqun HE, BUPT, CHINA
Junfei ZHUANG, BUPT, CHINA
Zesang HUANG, BUPT, CHINA
Yongqiang Yao, BUPT, CHINA
Kun HU, BUPT, CHINA
Fengye XIONG, BUPT, CHINA
Hongliang BAI, Beijing Faceall co., LTD
Wenjian FENG, Beijing Faceall co., LTD
Yuan DONG, BUPT, CHINA
|We trained ResNet-101, ResNet-152 and Inception-v3 for object classification. Multi-view testing and model ensembling are used to generate the final classification results.
For the localization task, we trained a Region Proposal Network (RPN) to generate proposals for each image, and we fine-tuned two models with object-level annotations of 1,000 classes, adding a background class to the network. Test images are then segmented into 300 regions by the RPN, and each region is classified by the fine-tuned model into one of 1,001 classes. The final bounding box is generated by merging the bounding rectangles of three regions.
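The final merging step of the localization pipeline can be sketched as below. The source does not say which three regions are merged or how; this sketch assumes the three highest-scoring regions for the predicted class are taken and replaced by their smallest enclosing rectangle. All names and coordinates are hypothetical.

```python
def bounding_rectangle(boxes):
    """Smallest axis-aligned rectangle that encloses every box.

    Each box is (x_min, y_min, x_max, y_max).
    """
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def merge_top_regions(regions, num_regions=3):
    """Merge the highest-scoring regions into a single box.

    regions: list of (score, box) pairs for one predicted class.
    num_regions: how many regions to merge (three in the description above).
    """
    best = sorted(regions, key=lambda r: r[0], reverse=True)[:num_regions]
    return bounding_rectangle([box for _, box in best])


regions = [
    (0.90, (40, 30, 120, 110)),
    (0.85, (50, 20, 130, 100)),
    (0.60, (45, 35, 125, 115)),
    (0.10, (0, 0, 300, 300)),  # low-scoring region, not among the top three
]
merged_box = merge_top_regions(regions)
```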
# Object detection
We use Faster R-CNN with the publicly available ResNet-101. Beyond the baseline, we adopt multi-scale RoIs to obtain features containing richer context information. For testing, we use 3 scales and merge the results using the simple strategy introduced last year.
No validation data is used for training, and flipped images are used in only a third of the training epochs.
# Object detection from video
We use Faster R-CNN with Resnet-101 to do this as in the object detection task. One fifth of the images are tested with 2 scales. No tracking techniques are used because of some mishaps.
# Scene classification
We trained a single Inception-v3 network with multi-scale training and tested it with multi-view evaluation over 150 crops.
On validation the top-5 error is about 14.56%.
# Scene parsing
We trained 6 models with network structures inspired by fcn8s and dilatedNet at 3 scales (256, 384, 512). We then test with flipped images using pre-trained fcn8s and dilatedNet. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552.
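The two scene parsing metrics quoted above, pixel-wise accuracy and mean class-wise IoU, can be computed as in the following minimal sketch. It works on flattened label maps and skips classes that appear in neither the prediction nor the ground truth; the official challenge evaluation may treat ignored or unlabeled pixels differently, so this is an illustration rather than the organisers' script.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / len(target)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        intersection = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Toy flattened label maps with 8 pixels and 3 classes.
target = [0, 0, 1, 1, 2, 2, 2, 2]
pred = [0, 1, 1, 1, 2, 2, 0, 2]

accuracy = pixel_accuracy(pred, target)
miou = mean_iou(pred, target, num_classes=3)
```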
|OceanVision||Zhibin Yu, Ocean University of China
Chao Wang, Ocean University of China
ZiQiang Zheng, Ocean University of China
Haiyong Zheng, Ocean University of China
|Our homepage: http://vision.ouc.edu.cn/~zhenghaiyong/
We are interested in scene classification and aim to build a network for this problem.
|ABTEST||Ankan Bansal||We used a 22-layer GoogLeNet model to classify scenes. The model was trained on the LSUN dataset and then fine-tuned on the Places dataset for 365 categories. We did not use any intelligent data selection techniques. The network is simply trained using all the available data without considering the data distribution across classes.
Before training on LSUN, this network was trained using the Places205 dataset. The model was trained till it saturated at around 85% (Top-1) accuracy on the validation dataset of the LSUN challenge. Then the model was fine-tuned on the 365 categories in the Places2 challenge.
We did not use the trained models provided by the organisers to initialise our network.
 Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
 Yu, Fisher, et al. "Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop." arXiv preprint arXiv:1506.03365 (2015).
|Vladimir Iglovikov||Vladimir Iglovikov
||Hardware: Nvidia Titan X
Software: Keras with Theano backend
Time spent: 5 days
All models trained on 128x128 resized from "small 256x256" dataset.
 Modified VGG16 => validation top5 error => 0.36
 Modified VGG19 => validation top5 error => 0.36
 Modified Resnet 50 => validation top5 error => 0.46
 Average of  and  => validation top5 error 0.35
Relu => Elu
Optimizer => Adam
Batch Normalization added to VGG16 and VGG19
|scnu407||Li Shiqi, South China Normal University
Zheng Weiping, South China Normal University
Wu Jinhui, South China Normal University
|We believe that the spatial relationships between objects in an image can be treated as a kind of time-series data. Therefore, we first use VGG16 to extract image features and then append four LSTM layers, one for each of the four directions in which the feature map is scanned.|
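To make the scnu407 idea concrete, the sketch below shows one way a 2-D feature map could be turned into four sequences, one per scanning direction, each of which would then be fed to its own LSTM. The traversal orders are an assumption, since the source does not define them, and the LSTMs themselves are omitted; `four_direction_scans` is a hypothetical helper.

```python
def four_direction_scans(feature_map):
    """Flatten a 2-D grid into four sequences, one per scan direction.

    feature_map: list of rows; each cell stands in for a feature vector.
    """
    rows = [list(row) for row in feature_map]
    columns = [list(col) for col in zip(*rows)]
    left_to_right = [cell for row in rows for cell in row]  # row-major order
    top_to_bottom = [cell for col in columns for cell in col]  # column-major order
    return {
        "left_to_right": left_to_right,
        "right_to_left": left_to_right[::-1],
        "top_to_bottom": top_to_bottom,
        "bottom_to_top": top_to_bottom[::-1],
    }


# Toy 2x3 "feature map" whose cells are labelled 1..6 for readability.
grid = [
    [1, 2, 3],
    [4, 5, 6],
]
scans = four_direction_scans(grid)
```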