Challenge 2016

Held in conjunction with ILSVRC at ECCV 2016

Results

Scene classification with provided training data

Team name Entry description Top-5 classification error
Hikvision Model D 0.0901
Hikvision Model E 0.0908
Hikvision Model C 0.0939
Hikvision Model B 0.0948
MW Model ensemble 2 0.1019
MW Model ensemble 3 0.1019
MW Model ensemble 1 0.1023
Hikvision Model A 0.1026
Trimps-Soushen With extra data. 0.103
Trimps-Soushen Ensemble 2 0.1042
SIAT_MMLAB 10 models fusion 0.1043
SIAT_MMLAB 7 models fusion 0.1044
SIAT_MMLAB fusion with softmax 0.1044
SIAT_MMLAB learning weights with cnn 0.1044
SIAT_MMLAB 6 models fusion 0.1049
Trimps-Soushen Ensemble 4 0.1049
Trimps-Soushen Ensemble 3 0.105
MW Single model B 0.1073
MW Single model A 0.1076
NTU-SC Product of 5 ensembles (top-5) 0.1085
NTU-SC Product of 3 ensembles (top-5) 0.1086
NTU-SC Sum of 3 ensembles (top-5) 0.1086
NTU-SC Sum of 5 ensembles (top-3) 0.1086
NTU-SC Single ensemble of 5 models (top-5) 0.1088
NQSCENE Four models 0.1093
NQSCENE Three models 0.1101
Samsung Research America: General Purpose Acceleration Group Simple Ensemble, 3 Inception v3 models w/various hyper param changes, 32 multi-crop (60.11 top-1, 88.98 top-5 on val) 0.1113
fusionf Fusion with average strategy (12 models) 0.1115
fusionf Fusion with scoring strategy (14 models) 0.1117
fusionf Fusion with average strategy (13 models) 0.1118
YoutuLab weighted average1 at scale level using greedy search 0.1125
YoutuLab weighted average at model level using greedy search 0.1127
YoutuLab weighted average2 at scale level using greedy search 0.1129
fusionf Fusion with scoring strategy (13 models) 0.113
fusionf Fusion with scoring strategy (12 models) 0.1132
YoutuLab simple average using models in entry 3 0.1139
Samsung Research America: General Purpose Acceleration Group Model A0, weakly scaled, multi-crop. (59.61 top-1, 88.64 top-5 on val) 0.1142
SamExynos 3 model 0.1143
Samsung Research America: General Purpose Acceleration Group Ensemble B, 3 Inception v3 models w/various hyper param changes + Inception v4 res2, 128 multi-crop 0.1152
YoutuLab average on base models 0.1162
NQSCENE Model B 0.117
Samsung Research America: General Purpose Acceleration Group Model A2, weakly scaled, single-crop & mirror. (58.84 top-1, 88.09 top-5 on val) 0.1188
NQSCENE Model A 0.1192
Samsung Research America: General Purpose Acceleration Group Model A1, weakly scaled, single-crop. (58.65 top-1, 88.07 top-5 on val) 0.1193
Trimps-Soushen Ensemble 1 0.1196
Rangers ensemble model 1 0.1208
SamExynos single model 0.121
Rangers ensemble model 2 0.1212
Everphoto ensemble by learned weights - 1 0.1213
Everphoto ensemble by product strategy 0.1218
Everphoto ensemble by learned weights - 2 0.1218
Everphoto ensemble by average strategy 0.1223
MIPAL_SNU Ensemble of two ResNet-50 with balanced sampling 0.1232
KPST_VB Model II 0.1233
KPST_VB Ensemble of Model I and II 0.1235
Rangers single model result of 69 0.124
Everphoto ensemble by product strategy (without specialist models) 0.1242
KPST_VB Model II with adjustment 0.125
KPST_VB Model I 0.1251
Rangers single model result of 66 0.1253
KPST_VB Ensemble of Model I and II with adjustment 0.1253
SJTU-ReadSense Ensemble 5 models with learnt weights 0.1272
SJTU-ReadSense Ensemble 5 models with weighted validation accuracies 0.1273
iMCB A combination of CNN models based on researched influential factors 0.1277
SJTU-ReadSense Ensemble 6 models with learnt weights 0.1278
SJTU-ReadSense Ensemble 4 models with learnt weights 0.1287
iMCB A combination of CNN models with a strategy w.r.t.validation accuracy 0.1299
Choong Based on VGG16, features are extracted from multiple layers. An ROI proposal network is not applied; every neuron from each feature layer is the center of an ROI candidate. 0.131
SIIT_KAIST 101-depth single model (val.error 12.90%) 0.131
DPAI Vison An ensemble model 0.1355
isia_ICT spectral clustering on confusion matrix 0.1355
isia_ICT fusion of 4 models with average strategy 0.1357
NUIST inception+shortcut CNN 0.137
isia_ICT MP_multiCNN_multiscale 0.1372
NUIST inception+shortcut CNN 0.1381
Viz Insight Multiple Deep Metaclassifiers 0.1386
iMCB FeatureFusion_2L 0.1396
iMCB FeatureFusion_3L 0.1404
DPAI Vison Single Model 0.1425
isia_ICT 2 models with size of 288 0.1433
Faceall-BUPT A single model with 150crops 0.1471
Baseline: ResNet152 0.1493
Baseline: VGG16 0.1499
iMCB A Single Model 0.1506
SJTU-ReadSense A single model (based on Inception-BN) trained on the Places365-Challenge dataset 0.1511
Baseline: GoogLeNet 0.1599
OceanVision A result obtained by VGG-16 0.1635
Baseline: AlexNet 0.1725
OceanVision A result obtained by alexnet 0.1867
OceanVision A result obtained by googlenet 0.1867
ABTEST GoogLeNet model trained on LSUN dataset and fine-tuned on Places2 0.3245
Vladimir Iglovikov VGG16 trained on 128x128 0.3552
Vladimir Iglovikov VGG19 trained on 128x128 0.3593
Vladimir Iglovikov average of VGG16 and VGG19 trained on 128x128 0.3712
Vladimir Iglovikov Resnet 50 trained on 128x128 0.4577
scnu407 VGG16+4D lstm 0.8831

Team information

Team name Team members Abstract
Hikvision Qiaoyong Zhong*, Chao Li, Yingying Zhang(#), Haiming Sun*, Shicai Yang*, Di Xie, Shiliang Pu (* indicates equal contribution)

Hikvision Research Institute
(#)ShanghaiTech University, work is done at HRI
[DET]
Our work on object detection is based on Faster R-CNN. We design and validate the following improvements:
* Better network. We find that the identity-mapping variant of ResNet-101 is superior for object detection over the original version.
* Better RPN proposals. A novel cascade RPN is proposed to refine proposals' scores and locations. A constrained negative/positive anchor ratio further increases proposal recall dramatically.
* Pretraining matters. We find that a pretrained global context branch increases mAP by over 3 points. Pretraining on the 1000-class LOC dataset further increases mAP by ~0.5 point.
* Training strategies. To attack the imbalance problem, we design a balanced sampling strategy over different classes. With balanced sampling, the provided negative training data can be safely added for training. Other training strategies, like multi-scale training and online hard example mining are also applied.
* Testing strategies. During inference, multi-scale testing, horizontal flipping and weighted box voting are applied.
The final mAP is 65.1 (single model) and 67 (ensemble of 6 models) on val2.

[CLS-LOC]
A combination of 3 Inception networks and 3 residual networks is used to make the class prediction. For localization, the same Faster R-CNN configuration described above for DET is applied. The top-5 classification error rate is 3.46%, and the localization error is 8.8% on the validation set.

[Scene]
For the scene classification task, supported by our newly built M40-equipped GPU clusters, we have trained more than 20 models with various architectures, such as VGG, Inception, ResNet and their variants, over the past two months. Fine-tuning very deep residual networks (e.g., ResNet-101/152/200) from pre-trained ImageNet models did not perform as well as we expected; according to our experiments, Inception-style networks reach better performance in considerably less training time. Based on this observation, we use deep Inception-style networks and not-so-deep residual networks. In addition, we have made several improvements to training and testing. First, a new data augmentation technique is proposed to better exploit the information in the original images. Second, a new learning rate schedule is adopted. Third, label shuffling and label smoothing are used to tackle the class imbalance problem. Fourth, some small tricks are used to improve performance in the test phase. Finally, we achieve a top-5 error rate below 9% on the validation set.
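
As an illustration of the label smoothing idea mentioned above (in the spirit of [6]), here is a minimal sketch in Python; it is our own example, not Hikvision's code, and the epsilon value is an assumption.

import numpy as np

def smoothed_cross_entropy(logits, label, num_classes=365, eps=0.1):
    # logits: raw scores for one image, shape (num_classes,); label: ground-truth index
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    target = np.full(num_classes, eps / num_classes)   # spread eps uniformly over all classes
    target[label] += 1.0 - eps                         # keep the remaining mass on the true class
    return -np.sum(target * np.log(probs + 1e-12))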

[Scene Parsing]
We utilize a fully convolutional network transferred from the VGG-16 net, with a module called the mixed context network and a refinement module appended to the end of the net. The mixed context network is constructed from a stack of dilated convolutions and skip connections. The refinement module generates predictions by making use of the output of the mixed context network and feature maps from early layers of the FCN. The predictions are then fed into a sub-network designed to simulate a message-passing process. Compared with the baseline, our first major improvement is the mixed context network, which we find provides better features for dealing with stuff, big objects and small objects all at once. The second improvement is a memory-efficient sub-network that simulates the message-passing process. The proposed system can be trained end-to-end. On the validation set, the mean IoU of our system is 0.4099 (single model) and 0.4156 (ensemble of 3 models), and the pixel accuracy is 79.80% (single model) and 80.01% (ensemble of 3 models).

References
[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
[2] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." arXiv preprint arXiv:1604.03540 (2016).
[3] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[4] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
[5] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
[6] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).
[7] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
[8] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in ICLR, 2016.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
[10] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," in ICCV, 2015.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", arXiv:1606.00915, 2016.
[12] P. O. Pinheiro, T. Lin, R. Collobert, P. Dollar, "Learning to Refine Object Segments", arXiv:1603.08695, 2016.
MW Gang Sun (Institute of Software, Chinese Academy of Sciences)
Jie Hu (Peking University)
We leverage the theory named CNA [1] (capacity and necessity analysis) to guide the design of CNNs. We add more layers on the larger feature maps (e.g., 56x56) to increase capacity, and remove some layers on the smaller feature maps (e.g., 14x14) to avoid ineffective architectures. We have verified the effectiveness on the models in [2], ResNet-like models [3], and Inception-ResNet-like models [4]. In addition, we use cropped patches from the original images as training samples, selecting the crop area and aspect ratio at random. To improve generalization, we prune the model weights periodically. Moreover, we use the balanced sampling strategy of [2] and label smoothing regularization [5] during training to alleviate the bias caused by the non-uniform sample distribution across categories and by partially incorrect training labels. We use the provided data (Places365) for training, do not use any additional data, and train all models from scratch. The algorithm and architecture details will be described in our arXiv paper (available online shortly).
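
A minimal sketch of the random-area, random-aspect-ratio cropping described above (our illustration; the parameter ranges are assumptions, and img is expected to be a PIL image):

import math, random

def random_area_aspect_crop(img, out_size=224, area_range=(0.08, 1.0), ratio_range=(3/4, 4/3)):
    # Sample a crop whose relative area and aspect ratio are drawn at random,
    # then resize it to the network input size.
    w, h = img.size
    for _ in range(10):                                # retry a few times, then fall back
        area = random.uniform(*area_range) * w * h
        ratio = random.uniform(*ratio_range)
        cw = int(round(math.sqrt(area * ratio)))
        ch = int(round(math.sqrt(area / ratio)))
        if 0 < cw <= w and 0 < ch <= h:
            x = random.randint(0, w - cw)
            y = random.randint(0, h - ch)
            return img.crop((x, y, x + cw, y + ch)).resize((out_size, out_size))
    return img.resize((out_size, out_size))            # fallback: resize the whole image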

[1] Xudong Cao. A practical theory for designing very deep convolutional neural networks, 2014. (unpublished)
[2] Li Shen, Zhouchen Lin, Qingming Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. In ECCV 2016.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. In CVPR 2016.
[4] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. ArXiv:1602.07261,2016.
[5] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv:1512.00567,2016.
Trimps-Soushen Jie Shao, Xiaoteng Zhang, Zhengyan Ding, Yixin Zhao, Yanjun Chen, Jianying Zhou, Wenfei Wang, Lin Mei, Chuanping Hu

The Third Research Institute of the Ministry of Public Security, P.R. China.
Object detection (DET)
We use several pre-trained models, including ResNet, Inception, Inception-ResNet, etc. Taking the predicted boxes from our best model as region proposals, we average the softmax scores and box regression outputs across all models. Other improvements include annotation refinement, box voting and feature maxout.

Object classification/localization (CLS-LOC)
Based on image classification models like Inception, Inception-Resnet, ResNet and Wide Residual Network (WRN), we predict the class labels of the image. Then we refer to the framework of "Faster R-CNN" to predict bounding boxes based on the labels. Results from multiple models are fused in different ways, using the model accuracy as weights.

Scene classification (Scene)
We adopt different kinds of CNN models such as ResNet, Inception and WRN. To improve the performance of features from multiple scales and models, we implement a cascade softmax classifier after the extraction stage.

Object detection from video (VID)
The same methods as in the DET task were applied to each frame. Optical-flow-guided motion prediction helped to reduce false negative detections.


[1] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. NIPS 2015

[2] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi.

[3] Zagoruyko S, Komodakis N. Wide Residual Networks[J]. arXiv preprint arXiv:1605.07146, 2016.
SIAT_MMLAB Sheng Guo, Linjie Xing,
Shenzhen Institutes of Advanced Technology, CAS.
Limin Wang,
Computer Vision Lab, ETH Zurich.
Yuanjun Xiong,
Chinese University of Hong Kong.
Jiaming Liu and Yu Qiao,
Shenzhen Institutes of Advanced Technology, CAS.
We propose a modular framework for large-scale scene recognition, called multi-resolution CNN (MR-CNN) [1]. This framework addresses the difficulty of characterizing scene concepts, which may rely on multi-level visual information, including local objects, spatial layout and global context. Specifically, in this challenge submission, we utilize four resolutions (224, 299, 336, 448) as the input sizes of the MR-CNN architectures. For the coarse resolutions (224, 299), we exploit the existing powerful Inception architectures (Inception v2 [2], Inception v4 [3], and Inception-ResNet [3]), while for the fine resolutions (336, 448), we propose new Inception architectures by making the original Inception network deeper and wider. Our final submission is the prediction of the MR-CNNs obtained by fusing the outputs of the CNNs at different resolutions.

In addition, we propose several principled techniques to reduce the over-fitting risk of MR-CNNs, including class balancing and hard sample mining. These simple yet effective training techniques enable us to further improve the generalization performance of MR-CNNs on the validation dataset. Meanwhile, we use an efficient parallel version of Caffe toolbox [4] to allow for the fast training of our proposed deeper and wider Inception networks.


[1] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, Knowledge guided disambiguation for large-scale scene classification with Multi-Resolution CNNs, in arXiv, 2016.

[2] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in ICML, 2015.

[3] C. Szegedy, S. Ioffe, and V. Vanhouche, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in arXiv, 2016.

[4] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal segment networks: towards good practices for deep action recognition, in ECCV, 2016.
NTU-SC Jason Kuen, Xingxing Wang, Bing Shuai, Xiangfei Kong, Jianxiong Yin, Gang Wang*, Alex C Kot


Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore.
All of our scene classification models are built upon pre-activation ResNets [1]. For scene classification using the provided RGB images, we train from scratch a ResNet-200, as well as a relatively shallow Wide-ResNet [2]. In addition to RGB images, we make use of class activation maps [3] and (scene) semantic segmentation masks [4] as complementary cues, obtained from models pre-trained for ILSVRC image classification [5] and scene parsing [6] tasks respectively. Our final submissions consist of ensembles of multiple models.

References
[1] He, K., Zhang, X., Ren, S., & Sun, J. “Identity Mappings in Deep Residual Networks”. ECCV 2016.
[2] Zagoruyko, S., & Komodakis, N. “Wide Residual Networks”. BMVC 2016.
[3] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. “Learning Deep Features for Discriminative Localization”. CVPR 2016.
[4] Shuai, B., Zuo, Z., Wang, G., & Wang, B. "Dag-Recurrent Neural Networks for Scene Labeling". CVPR 2016.
[5] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. “Imagenet large scale visual recognition challenge”. International Journal of Computer Vision, 115(3), 211-252.
[6] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. “Semantic Understanding of Scenes through the ADE20K Dataset”. arXiv preprint arXiv:1608.05442.
NQSCENE Chen Yunpeng ( NUS )
Jin Xiaojie ( NUS )
Zhang Rui ( CAS )
Li Yu ( CAS )
Yan Shuicheng ( Qihoo/NUS )
Technique Details for the Scene Classification:

For the scene classification task, we propose the following methods to address the data imbalance issue (a.k.a. the long-tail distribution issue), which boost the final performance:

1) Category-wise Data Augmentation:
We applied a category-wise data augmentation strategy, which associates each category with an adaptive augmentation level. The augmentation level is updated iteratively during training.

2) Multi-task Learning:
We proposed a multipath learning architecture to jointly learn feature representations from the ImageNet-1000 dataset and the Places-365 dataset.

Vanilla ResNet-200 [1] is adopted with the following elementary tricks: scale and aspect ratio augmentation, over-sampling, and multi-scale (x224, x256, x288, x320) dense testing. In total, we have trained four models and fused them by averaging their scores. Training each model takes about three days using MXNet [2] on a cluster with forty NVIDIA M40 GPUs (12 GB).
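
As a rough illustration of this kind of multi-scale, multi-model score averaging (our sketch, not the team's code; predict is a hypothetical helper that runs one model on the image resized to one scale and returns a 365-way softmax vector):

import numpy as np

def ensemble_multiscale_scores(models, image, predict, scales=(224, 256, 288, 320)):
    # Average class probabilities over every (model, scale) combination.
    scores = [predict(model, image, scale) for model in models for scale in scales]
    return np.mean(scores, axis=0)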

------------------------------
[1] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
[2] Chen, Tianqi, et al. "Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems." arXiv preprint arXiv:1512.01274(2015).
Samsung Research America: General Purpose Acceleration Group Dr. S. Eliuk (Samsung), C. Upright (Samsung), Dr. H. Vardhan (Samsung),
T. Gale (Intern, Northeastern),
S. Walsh (Intern, University of Alberta).
The General Purpose Acceleration Group is focused on accelerating training via HPC & distributed computing. We present Distributed Training Done Right (DTDR) where standard open-source models are trained in an effective manner via a multitude of techniques involving strong / weak scaling and strict distributed training modes. Several different models are used from standard Inception v3, to Inception v4 res2, and ensembles of such techniques. The training environment is unique as we can explore extremely deep models given the model-parallel nature of our partitioning of data.
fusionf Nina Narodytska (Samsung Research America)
Shiva Kasiviswanathan (Samsung Research America)
Hamid Maei (Samsung Research America)
We used several modifications of modern CNNs, including VGG [1], GoogLeNet [2,4], and ResNet [3]. We used several fusion strategies, including a standard averaging scheme and a scoring scheme, and we used different subsets of models in different submissions. Training was performed on the low-resolution dataset. We used balanced loading to take into account the different numbers of images in each class.

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.

[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition.

[4] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.

YoutuLab Xiaowei Guo, YoutuLab
Ruixin Zhang, YoutuLab
Yushi Yao, YoutuLab
Pai Peng, YoutuLab
Ke Li, YoutuLab
We build a scene recognition system using deep CNN models. These CNN models are inspired by the original ResNet [1] and Inception [2] network architectures. We train the models on the challenge dataset and apply a balanced sampling strategy [3] to adapt to the unbalanced challenge dataset. Moreover, the DSD [4] process is applied to further improve model performance.
In this competition, we submit five entries. The first and second are combinations of single-scale results using a weighted arithmetic average whose weights are searched with a greedy strategy. The third is a combination of single-model results using the same strategy as the first entry. The fourth and fifth are combinations of single-model results using a simple average strategy.
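
A hedged sketch of how such a greedy weight search can be implemented (our illustration, not YoutuLab's code; val_probs and val_labels are assumed to be the per-model validation softmax outputs and the ground-truth labels):

import numpy as np

def top5_accuracy(probs, labels):
    top5 = np.argsort(-probs, axis=1)[:, :5]
    return np.mean([labels[i] in top5[i] for i in range(len(labels))])

def greedy_weight_search(val_probs, val_labels, steps=20):
    # Start from the single best model, then repeatedly add (with replacement)
    # whichever model most improves the weighted-average top-5 accuracy.
    counts = np.zeros(len(val_probs))
    best = max(range(len(val_probs)), key=lambda i: top5_accuracy(val_probs[i], val_labels))
    counts[best] = 1
    for _ in range(steps):
        scores = []
        for i in range(len(val_probs)):
            trial = counts.copy()
            trial[i] += 1
            mix = sum(w * p for w, p in zip(trial, val_probs)) / trial.sum()
            scores.append(top5_accuracy(mix, val_labels))
        counts[int(np.argmax(scores))] += 1
    return counts / counts.sum()                       # final ensemble weights
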
[1] K. He, X. Zhang, S. Ren, J. Sun. Identity Mappings in Deep Residual Networks. In ECCV 2016. abs/1603.05027
[2] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In ICLR 2016. abs/1602.07261
[3] L. Shen, Z. Lin, Q. Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. abs/1512.05830
[4] S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, W. J. Dally. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow. abs/1607.04381

SamExynos Qian Zhang(Beijing Samsung Telecom R&D Center)
Peng Liu(Beijing Samsung Telecom R&D Center)
Jinbin Lin(Beijing Samsung Telecom R&D Center)
Junjun Xiong(Beijing Samsung Telecom R&D Center)
Object localization:

The submission is based on [1] and [2], but we modified the model, and the network is 205 layers deep. Due to the limits of time and GPUs, we have trained only three CNN models for classification. The top-5 accuracy on the validation set with dense crops (scales: 224, 256, 288, 320, 352, 384, 448, 480) is 96.44% for the best single model, and 96.88% for the three-model ensemble.

places365 classification:

The submission is based on [3] and [4]; we add 5 layers to ResNet-50 and modify the network. Due to the limits of time and GPUs, we have trained only three CNN models for the scene classification task. The top-5 accuracy on the validation set with 72 crops is 87.79% for the best single model, and 88.70% with multiple crops for the three-model ensemble.

[1]Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun,Identity Mappings in Deep Residual Networks. ECCV 2016.
[2]Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv preprint arXiv:1602.07261 (2016)
[3]Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. "Rethinking the Inception Architecture for Computer Vision". arXiv preprint arXiv:1512.00567 (2015)
Rangers Y. Q. Gao,
W. H. Luo,
X. J. Deng,
H. Wang,
W. D. Chen,
---
Everphoto Yitong Wang, Zhonggan Ding, Zhengping Wei, Linfu Wen

Everphoto
Our method is based on DCNN approaches.

We use 5 models with different input scales and different network structures as basic models. They are derived from GoogleNet, VGGNet and ResNet.

We also utilize the idea of dark knowledge [1] to train several specialist models, and use these specialist models to reassign probability scores and refine the basic outputs.

Our final results are based on the ensemble of refined outputs.

[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
MIPAL_SNU Sungheon Park and Nojun Kwak (Graduate School of Convergence Science and Technology, Seoul National University) We trained two ResNet-50 [1] networks. One network used 7x7 mean pooling, and the other used multiple mean poolings with various sizes and positions. We also used a balanced sampling strategy similar to [2] to deal with the imbalanced training set.

[1] He, Kaiming, et al. "Deep residual learning for image recognition." CVPR, 2016.

[2] Shen, Li, et al. "Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks." arXiv, 2015.
KPST_VB Nguyen Hong Hanh
Seungjae Lee
Junhyeok Lee
In this work, we used a pre-trained ResNet-200 (ImageNet) [1] and retrained the network on the Places365 Challenge data (256 by 256). We also estimated scene probability using the output of the pre-trained ResNet-200 and the scene vs. object (ImageNet 1000-class) distribution on the training data. For classification, we used an ensemble of the two networks with multiple crops, adjusted by the scene probability.

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

*Our work was performed with a deep learning analysis tool (Deep SDK by KPST).
SJTU-ReadSense Qinchuan Zhang, Shanghai Jiao Tong University
Junxuan Chen, Shanghai Jiao Tong University
Thomas Tong, ReadSense
Leon Ding, ReadSense
Hongtao Lu, Shanghai Jiao Tong University
We train two CNN models from scratch. Model A, based on Inception-BN [1] with one auxiliary classifier, is trained on the Places365-Challenge dataset [2] and achieves 15.03% top-5 error on the validation set. Model B, based on ResNet [3] with a depth of 50 layers, is trained on the Places365-Standard dataset and fine-tuned for 2 epochs on the Places365-Challenge dataset due to time limits, achieving 16.3% top-5 error on the validation set. We also fuse features extracted from the 3 baseline models [2] on the Places365-Challenge dataset and train two fully connected layers with a softmax classifier. Moreover, we adopt the "class-aware" sampling strategy proposed by [4] for models trained on the Places365-Challenge dataset to tackle the non-uniform distribution of images over the 365 categories. We implement model A using Caffe [5] and conduct all other experiments using MXNet [6] to allow a larger batch size per GPU.
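
A rough sketch of the "class-aware" sampling idea [4] referenced above (our illustration under an assumed data layout, not the team's code):

import random
from collections import defaultdict

def class_aware_batches(samples, batch_size):
    # samples: list of (image_path, label) pairs; pick a class uniformly at random,
    # then an image within that class, so head classes no longer dominate each batch.
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    classes = list(by_class.keys())
    while True:
        yield [(random.choice(by_class[c]), c)
               for c in (random.choice(classes) for _ in range(batch_size))]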

We train all models with a 224x224 crop randomly sampled from a 256x256 image or its horizontal flip, with the per-pixel mean subtracted. We apply 12 crops [7] for evaluation on the validation and test sets.

We ensemble multiple models with weights (learnt on the validation set or set from top-5 validation accuracies), and achieve 12.79% (4 models), 12.69% (5 models) and 12.57% (6 models) top-5 error on the validation set.

[1] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[2] Places: An Image Database for Deep Scene Understanding. B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. Arxiv, 2016.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[4] L. Shen, Z. Lin , Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. arXiv:1512.05830, 2015.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[6] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C.n Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS, 2015.
[7] C. Szegedy,W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
iMCB *Yucheng Xing,
*Yufeng Zhang,
Zhiqin Chen,
Weichen Xue,
Haohua Zhao,
Liqing Zhang
@Shanghai Jiao Tong University (SJTU)

(* indicates equal contribution)
In this competition, we submit five entries.

The first entry is a single model, which achieved 15.24% top-5 error on the validation set. It is an Inception-V3 [1] model that is modified and trained on both the challenge and standard datasets [2]. At test time, images are resized to 337x337 and a 12-crop scheme is used to obtain the 299x299 inputs to the model, which improves performance.

The second entry is a fusion-feature model (FeatureFusion_2L), which achieved 13.74% top-5 error on the validation set. It is a two-layer fusion-feature network whose input is the combination of fully-connected-layer features extracted from several well-performing CNNs (i.e., pretrained models [3] such as ResNet, VGG and GoogLeNet). As a result, it turns out to be effective in reducing the error rate.

The third entry is also a fusion-feature network (FeatureFusion_3L), which achieved 13.95% top-5 error on the validation set. Compared with the second model, it is a three-layer fusion-feature network that contains two fully connected layers.

The fourth entry is a combination of CNN models with a strategy based on validation accuracy, which achieved 13% top-5 error on the validation set. It combines the probabilities from the softmax layers of three CNNs, where the influence factor of each CNN is determined by its validation accuracy.

The fifth entry is a combination of CNN models based on tuned influence factors, which achieved 12.65% top-5 error on the validation set. Six CNNs are taken into consideration; four of them (Inception-V2, Inception-V3, FeatureFusion_2L and FeatureFusion_3L) are trained by us and the other two are pretrained. The influence factors of these models are tuned based on extensive experiments.


[1] Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." arXiv preprint arXiv:1512.00567 (2015).

[2]B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. "Places: An Image Database for Deep Scene Understanding." Arxiv, 2016.

[3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database."Advances in Neural Information Processing Systems 27 (NIPS), 2014.
Choong Choong Hwan Choi (KAIST) Abstract
Ensemble of deep learning models based on VGG16 & ResNet.
Based on VGG16, features are extracted from multiple layers. An ROI proposal network is not applied; every neuron in each feature layer is the center of an ROI candidate.
References:
[1] Liu, Wei, et. al. "SSD: Single Shot Multibox Detector"
[2] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition"
[3] Kaiming He, et. al., "Deep Residual Learning for Image Recognition"

SIIT_KAIST Sihyeon Seong (KAIST)
Byungju Kim (KAIST)
Junmo Kim (KAIST)
We used ResNet [1] (101 layers / 4 GPUs) as our baseline model. Starting from the model pre-trained on the ImageNet classification dataset (provided by [2]), we re-tuned the model on the Places365 dataset (the 256-resized small dataset). Then we further fine-tuned the model based on the following ideas:

i) Analyzing correlations between labels: We calculated correlations between each pair of predictions p(i), p(j), where i, j are classes. Highly correlated label pairs are then extracted by thresholding the correlation coefficients (a small sketch of this step is given after item iii).

ii) Additional semantic label generation: Using the correlation table from i), we further generated super/subclass labels by clustering. Additionally, we generated 170 binary labels for separating confusing classes, which maximize the margins between highly correlated label pairs.

iii) Boosting-like multi-loss terms: A large number of loss terms are combined for classifying the labels generated in ii).
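
A small sketch of the correlation analysis in step i) (our illustration; preds is assumed to be an array of per-image class scores on a held-out set):

import numpy as np

def correlated_label_pairs(preds, threshold=0.5):
    # preds: (num_images, 365) softmax outputs; correlate the per-class score columns
    corr = np.corrcoef(preds.T)                        # (365, 365) correlation matrix
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if corr[i, j] > threshold]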

[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[2] https://github.com/facebook/fb.resnet.torch
DPAI Vison Object detection: Chris Li, Savion Zhao, Bin Liu, Yuhang He, Lu Yang, Cena Liu
Scene classification: Lu Yang, Yuhang He, Cena Liu, Bin Liu, Bo Yu
Scene parsing: Bin Liu, Lu Yang, Yuhang He, Cena Liu, Bo Yu, Chris Li, Xiongwei Xia
Object detection from video: Bin Liu, Cena Liu, Savion Zhao, Yuhang He, Chris Li
Object detection: Our method is based on Faster R-CNN plus an extra classifier. (1) Data processing: data equalization by removing many examples from the three dominant classes (person, dog, and bird), and adding extra data for classes with fewer than 1000 training images; (2) COCO pre-training; (3) iterative bounding box regression + multi-scale (train/test) + random image flipping (train/test); (4) multi-model ensemble: ResNet-101 and Inception-v3; (5) an extra classifier over the 200 classes, which helps to improve recall and refine the detection scores of the final boxes.
[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.
[2] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.

Scene classification: We trained the models with Caffe [1] and use an ensemble of Inception-V3 [2] and Inception-V4 [3]. In total, we integrated four models; the top-1 error on validation is 0.431 and the top-5 error is 0.129. The single model is a modified Inception-V3 [2]; its top-1 error on validation is 0.434 and its top-5 error is 0.133.
[1] Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093. 2014.
[2]C.Szegedy,V.Vanhoucke,S.Ioffe,J.Shlens,andZ.Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[3] C.Szegedy,S.Ioffe,V.Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261, 2016.

Scene parsing: We trained 3 models on a modified DeepLab [1] (Inception-v3, ResNet-101, ResNet-152) and used only the ADEChallengeData2016 [2] data. Multi-scale inputs, image cropping, image flipping and contrast transformation are used for data augmentation, and denseCRF is used as post-processing to refine object boundaries. On validation, combining the 3 models achieved 0.3966 mIoU and 0.7924 pixel accuracy.
[1] L. Chen, G. Papandreou, I. K.; Murphy, K.; and Yuille, A. L. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In arXiv preprint arXiv:1606.00915.
[2] B. Zhou, H. Zhao, X. P. S. F. A. B., and Torralba, A. 2016. Semantic understanding of scenes through the ade20k dataset. In arXiv preprint arXiv:1608.05442.

Object detection from video: Our method is based on Faster R-CNN plus an extra classifier. We train Faster R-CNN based on ResNet-101 with the provided training data. We also train an extra classifier over the 30 classes, which helps to improve recall and refine the detection scores of the final boxes.
isia_ICT Xinhan Song, Institute of Computing Technology
Chengpeng Chen, Institute of Computing Technology
Shuqiang Jiang, Institute of Computing Technology
For convenience, we use the 4 provided models as our basic models, which are used for the subsequent fine-tuning and network adaptation. Besides, considering the non-uniform distribution and the tremendous number of images in the Challenge dataset, we only use the Standard dataset for all the following steps.
First, we fuse these models with an average strategy as the baseline. We then add an SPP layer to VGG16 and to ResNet152 respectively, so that the models can be fed with images at a larger scale. After fine-tuning the models, we again fuse them with the average strategy, and we only submit the result at size 288.
We also perform spectral clustering on the confusion matrix computed on the validation data to get 20 clusters, which means the 365 classes are grouped into 20 clusters mainly according to their confusion relationships. To classify the classes within the same cluster more precisely, we train an extra classifier for each cluster, implemented by fine-tuning the network with all layers fixed except the fc8 layer and finally combining them into one network.
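
A hedged sketch of the clustering step (our illustration; the team's actual tooling is not stated, and confusion is assumed to be a 365x365 matrix of confusion counts from the validation data):

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_classes(confusion, n_clusters=20):
    affinity = confusion + confusion.T                 # symmetric affinity between classes
    np.fill_diagonal(affinity, 0)
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return sc.fit_predict(affinity)                    # cluster id for each of the 365 classes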

D. Yoo, S. Park, J. Lee and I. Kweon. “Multi-scale pyramid pooling for deep convolutional representation”. In CVPR Workshop 2015
NUIST Jing Yang, Hui Shuai, Zhengbo Yu, Rongrong Fan, Qiang Ma, Qingshan Liu, Jiankang Deng 1. Inception v2 [1] is used in the VID task, which is almost real time on a GPU.
2. Cascaded region regression is used to detect and track different instances.
3. Context inference is performed between instances within each video.
4. The detector and tracker are updated online to improve recall.

[1]Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[2]Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
[3]Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).
Viz Insight Biplab Ch Das, Samsung R&D Institute Bangalore
Shreyash Pandey, Samsung R&D Institute Bangalore
Ensembling approaches have been known to outperform individual classifiers on standard classification tasks [No free Lunch Theorem :)]

In our approach we trained state of the art classifiers including variations of:
1.ResNet
2.VGGNet
3.AlexNet
4.SqueezeNet
5.GoogleNet

Each of these classifiers was trained on a different view of the provided Places2 challenge data.

Multiple deep metaclassifiers were trained on the confidences of the labels predicted by the above classifiers, successfully accomplishing a non-linear ensemble in which the weights of the neural network are set to maximize scene recognition accuracy.

To impose further consistency between objects and scenes, a state-of-the-art classifier trained on ImageNet was adapted to Places via a zero-shot learning approach.

We did not use any external data for training the classifiers. However, we balanced the data so that the classifiers produce unbiased results; as a result, some of the data remained unused.
Faceall-BUPT Xuankun HUANG, BUPT, CHINA
Jiangqi ZHANG, BUPT, CHINA
Zhiqun HE, BUPT, CHINA
Junfei ZHUANG, BUPT, CHINA
Zesang HUANG, BUPT, CHINA
Yongqiang Yao, BUPT, CHINA
Kun HU, BUPT, CHINA
Fengye XIONG, BUPT, CHINA
Hongliang BAI, Beijing Faceall co., LTD
Wenjian FENG, Beijing Faceall co., LTD
Yuan DONG, BUPT, CHINA
# Classification/Localization
We trained ResNet-101, ResNet-152 and Inception-v3 models for object classification. Multi-view testing and model ensembling are used to generate the final classification results.
For the localization task, we trained a Region Proposal Network to generate proposals for each image, and we fine-tuned two models with object-level annotations of the 1,000 classes. Moreover, a background class is added to the network. Test images are then segmented into 300 regions by the RPN, and these regions are classified by the fine-tuned model into one of 1,001 classes. The final bounding box is generated by merging the bounding rectangles of three regions (a small sketch of this merging step is given below).
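
A minimal sketch of that merging step (our illustration, assuming boxes in [x1, y1, x2, y2] format): the final box is simply the smallest rectangle enclosing the selected regions.

def merge_boxes(boxes):
    # boxes: list of [x1, y1, x2, y2]; return their enclosing bounding rectangle
    return [min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes)]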

# Object detection
We use Faster R-CNN with the publicly available ResNet-101. Beyond the baseline, we adopt multi-scale RoI pooling to obtain features containing richer context information. For testing, we use 3 scales and merge the results using the simple strategy introduced last year.

No validation data is used for training, and flipped images are used in only a third of the training epochs.

# Object detection from video
We use Faster R-CNN with ResNet-101, as in the object detection task. One fifth of the images are tested with 2 scales. No tracking techniques are used because of some mishaps.

# Scene classification
We trained a single Inception-v3 network with multi-scale inputs and tested with a multi-view of 150 crops.
On validation, the top-5 error is about 14.56%.

# Scene parsing
We trained 6 models with network structures inspired by FCN-8s and DilatedNet at 3 scales (256, 384, 512). We then test with flipped images using the pre-trained FCN-8s and DilatedNet. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552.
OceanVision Zhibin Yu Ocean University of China
Chao Wang Ocean University of China
ZiQiang Zheng Ocean University of China
Haiyong Zheng Ocean University of China
Our homepage: http://vision.ouc.edu.cn/~zhenghaiyong/

We are interested in scene classification, and we aim to build a network for this problem.
ABTEST Ankan Bansal We used a 22-layer GoogLeNet [1] model to classify scenes. The model was trained on the LSUN [2] dataset and then fine-tuned on the Places dataset for 365 categories. We did not use any intelligent data selection techniques; the network is simply trained using all the available data without considering the data distribution across classes.

Before training on LSUN, this network was trained on the Places205 dataset. The model was trained until it saturated at around 85% top-1 accuracy on the validation set of the LSUN challenge. Then the model was fine-tuned on the 365 categories of the Places2 challenge.

We did not use the trained models provided by the organisers to initialise our network.

References:
[1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[2] Yu, Fisher, et al. "Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop." arXiv preprint arXiv:1506.03365 (2015).
Vladimir Iglovikov Vladimir Iglovikov

Hardware: Nvidia Titan X
Software: Keras with Theano backend
Time spent: 5 days

All models trained on 128x128 resized from "small 256x256" dataset.

[1] Modified VGG16 => validation top5 error => 0.36
[2] Modified VGG19 => validation top5 error => 0.36
[3] Modified Resnet 50 => validation top5 error => 0.46
[4] Average of [1] and [2] => validation top5 error 0.35


Main changes:
Relu => Elu
Optimizer => Adam
Batch Normalization added to VGG16 and VGG19
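
A rough sketch of these modifications in current Keras syntax (the original work used Keras with a Theano backend; the exact layer counts and sizes here are illustrative, not the author's configuration):

from tensorflow import keras
from tensorflow.keras import layers

def vgg_block(x, filters, convs):
    # VGG-style block with the described changes: BatchNorm after each conv, ELU instead of ReLU
    for _ in range(convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("elu")(x)
    return layers.MaxPooling2D(2)(x)

inputs = keras.Input(shape=(128, 128, 3))
x = inputs
for filters, convs in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
    x = vgg_block(x, filters, convs)
x = layers.Flatten()(x)
x = layers.Dense(4096, activation="elu")(x)
outputs = layers.Dense(365, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
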
scnu407 Li Shiqi South China Normal University
Zheng Weiping South China Normal University
Wu Jinhui South China Normal University
We believe that the spatial relationships between objects in an image form a kind of sequential data. Therefore, we first use VGG16 to extract image features, and then add 4 LSTM layers on top, with the four LSTM layers scanning the feature map in four directions.
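
One possible reading of this architecture, sketched in PyTorch (our assumption about how the four scan directions over the VGG16 feature map are formed; not the team's code):

import torch
import torch.nn as nn

class FourWayLSTMHead(nn.Module):
    def __init__(self, in_channels=512, hidden=256, num_classes=365):
        super().__init__()
        self.lstms = nn.ModuleList([nn.LSTM(in_channels, hidden, batch_first=True)
                                    for _ in range(4)])
        self.fc = nn.Linear(4 * hidden, num_classes)

    def forward(self, feat):                           # feat: (B, C, H, W) VGG16 feature map
        cols = feat.mean(dim=2).permute(0, 2, 1)       # (B, W, C): left-to-right column sequence
        rows = feat.mean(dim=3).permute(0, 2, 1)       # (B, H, C): top-to-bottom row sequence
        seqs = [cols, torch.flip(cols, dims=[1]),      # scan in four directions
                rows, torch.flip(rows, dims=[1])]
        finals = [lstm(seq)[1][0][-1] for lstm, seq in zip(self.lstms, seqs)]  # last hidden states
        return self.fc(torch.cat(finals, dim=1))       # (B, num_classes)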