Challenge 2015

Held in conjunction with ILSVRC at ICCV 2015




Task A: Scene classification with provided training data

Team name | Entry description | Classification error
WM | Fusion with product strategy | 0.168715
WM | Fusion with learnt weights | 0.168747
WM | Fusion with average strategy | 0.168909
WM | A single model (model B) | 0.172876
WM | A single model (model A) | 0.173527
SIAT_MMLAB | 9 models | 0.173605
SIAT_MMLAB | 13 models | 0.174645
SIAT_MMLAB | more models | 0.174795
SIAT_MMLAB | 13 models | 0.175417
SIAT_MMLAB | 2 models | 0.175868
Qualcomm Research | Weighted fusion of two models. Top 5 validation error is 16.45%. | 0.175978
Qualcomm Research | Ensemble of two models. Top 5 validation error is 16.53%. | 0.176559
Qualcomm Research | Ensemble of seven models. Top 5 validation error is 16.68% | 0.176766
Trimps-Soushen | score combine with 5 models | 0.179824
Trimps-Soushen | score combine with 8 models | 0.179997
Trimps-Soushen | top10 to top5, label combine with 9 models | 0.180714
Trimps-Soushen | top10 to top5, label combine with 7 models | 0.180984
Trimps-Soushen | single model, bn07 | 0.182357
ntu_rose | test_4 | 0.193367
ntu_rose | test_2 | 0.193645
ntu_rose | test_5 | 0.19397
ntu_rose | test_3 | 0.194262
Mitsubishi Electric Research Laboratories | average of VGG16 trained with the standard cross entropy loss and VGG16 trained with weighted cross entropy loss. | 0.194346
Mitsubishi Electric Research Laboratories | VGG16 trained with weighted cross entropy loss. | 0.199268
HiVision | Single model with 5 scales | 0.199777
DeepSEU | Just one CNN model | 0.200572
Qualcomm Research | Ensemble of two models, trained with dense augmentation. Top 5 validation error is 19.20% | 0.20111
HiVision | Single model with 3 scales | 0.201796
GatorVision | modified VGG16 network | 0.20268
SamExynos | A Combination of Multiple ConvNets (7 Nets) | 0.204197
SamExynos | A Combination of Multiple ConvNets (6 Nets) | 0.205457
UIUCMSR | VGG-16 model trained using the entire training data | 0.206851
SamExynos | A Single ConvNet | 0.207594
UIUCMSR | Using filter panorama in the very bottom convolutional layer in CNNs | 0.207925
UIUCMSR | Using filter panorama in the top convolutional layer in CNNs | 0.208972
ntu_rose | test_1 | 0.211503
DeeperScene | A single deep CNN model tuned on the validation set | 0.241738
THU-UTSA-MSRA | run4 | 0.253109
THU-UTSA-MSRA | run5 | 0.254369
THU-UTSA-MSRA | run1 | 0.256104
SIIT_KAIST-TECHWIN | averaging three models | 0.261284
SIIT_KAIST-TECHWIN | averaging two models | 0.266788
SIIT_KAIST-ETRI | Modified GoogLeNet and test augmentation. | 0.269862
THU-UTSA-MSRA | run2 | 0.271185
NECTEC-MOONGDO | Alexnet with retrain 2 | 0.275558
NECTEC-MOONGDO | Alexnet with retrain 1 | 0.27564
SIIT_KAIST-TECHWIN | single model | 0.280223
HanGil | deep ISA network for Places2 recognition | 0.282688
FACEALL-BUPT | Fine-tune Model 1 for another 1 epoch and correct the output vector size from 400 to 401; 10 crops, top-5 error 31.42% on validation | 0.32725
FACEALL-BUPT | GoogLeNet, with input resize to 128*128, removed Inception_5, 10 crops, top-5 error 37.19% on validation | 0.38872
FACEALL-BUPT | GoogLeNet, with input resize to 128*128 and reduced kernel numbers and sizes, 10 crops, top-5 error 38.99% on validation | 0.407011
Henry Machine | No deep learning. Traditional practice: feature engineering and classifier design. | 0.417073
THU-UTSA-MSRA | run3 | 0.987563
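
As a reading aid for the error column, here is a minimal sketch of a top-k classification error. It assumes the reported error is a top-5 error, which is consistent with the "Top 5 validation error" figures quoted in several entries but is not stated in the table itself. The function name and the toy data are illustrative only.

```python
def top_k_error(ranked_predictions, true_labels, k=5):
    """Fraction of images whose true label is absent from the first k guesses."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    misses = sum(
        1
        for guesses, label in zip(ranked_predictions, true_labels)
        if label not in guesses[:k]
    )
    return misses / len(true_labels)


# Toy data (made up): three images, integer class ids, five ranked guesses each.
predictions = [
    [3, 7, 1, 9, 4],  # true label 7 is among the guesses
    [2, 5, 8, 0, 6],  # true label 1 is missing
    [4, 0, 2, 3, 1],  # true label 4 is the first guess
]
labels = [7, 1, 4]
error = top_k_error(predictions, labels, k=5)
```

Under this definition an entry is penalised only when none of its five guesses for an image matches the ground-truth label.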

Task B: Scene classification with additional training data

Team name | Entry description | Description of outside data used | Classification error
NEIOP | We pretrained a VGG16 model on Places205 database and then finetuned the model on Places2 database. | Places205 database | 0.203539
Isia_ICT | Combination of different models | Places1 | 0.220239
Isia_ICT | Combination of different models | Places1 | 0.22074
Isia_ICT | Combination of different models | Places1 | 0.22074
ZeroHero | Zero-shot scene recognition with 15K object categories | 15K object categories from ImageNet, textual data from YFCC100M. | 0.572784

Team information

Team name (with project link where available) Team members Abstract
AK47 Na Li
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Hongxiang Hu
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Our algorithm is based on Fast R-CNN.
We fine-tuned the Fast R-CNN network using data selected from the ILSVRC2015 training set. Each test frame was then fed to the network to obtain the predictions.
We also tried several ways of exploiting the similarity of neighboring frames. First, we compared object proposals generated by different methods (Selective Search, RIGOR, EdgeBoxes, MOP, GOP) and finally chose EdgeBoxes.
We then tried adding the object proposals of the preceding and following frames to those of the middle frame to exploit their correlation. Our experiments showed that this works.
We also tried several object tracking algorithms, such as optical flow and streamlines, but unfortunately we did not apply any of them in our model.
In any case, we learned a lot from this competition. Thank you for organizing it!
We will come back!
ART Vision Rami Hagege
Ilya Bogomolny
Erez Farhan
Arkadi Musheyev
Adam Kelder
Ziv Yavo
Elad Meir
Roee Francos
The problem of classification and segmentation of objects in videos is one of the biggest challenges in computer vision, demanding simultaneous solutions of several fundamental problems. Most of these fundamental problems are yet to be solved separately. Perhaps the most challenging task in this context, is the task of object detection and classification. In this work, we utilized the feature-extraction capabilities of Deep Neural Network in order to construct robust object classifiers, and accurately localize them in the scene. On top of that, we use time and space analysis in order to capture the tracklets of each detected object in time. The results show that our system is able to localize multiple objects in different scenes, while maintaining track stability over time.
DeeperScene Qinchuan Zhang, Xinyun Chen, Junwen Bai, Junru Shao, Shanghai Jiao Tong University, China For the scene classification task, our model is based on a convolutional neural network framework implemented in Caffe. We initialize our model with the parameters of the vgg_19 model trained on the ILSVRC classification task [1]. Because ImageNet is an object-centric dataset [3], the deep features learnt by convolutional neural networks trained on ImageNet are not competitive enough for scene classification, so we further train our model on Places2 [4]. Moreover, since our experiments indicate that “msra” initialization of filter weights for rectifiers is a more robust method for training extremely deep rectifier networks [2], we use it to initialize some fully-connected layers.
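
The “msra” initialization mentioned above draws weights from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in), as proposed in [2]. The following is a minimal sketch for a fully-connected layer; the function names and layer sizes are illustrative and are not taken from the team's code.

```python
import math
import random


def msra_std(fan_in):
    """Standard deviation used by the "msra" rule: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def msra_init(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in)."""
    rng = random.Random(seed)
    std = msra_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


# Illustrative layer sizes (assumed, not from the abstract).
weights = msra_init(fan_in=512, fan_out=64)
```

The factor of 2 compensates for a rectifier zeroing roughly half of its inputs, which keeps the activation variance approximately constant from layer to layer.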

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[2] He K, Zhang X, Ren S, et al. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv:1502.01852, 2015.

[3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems 27 (NIPS), 2014.

[4] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. Places2: A Large-Scale Database for Scene Understanding. Arxiv, 2015.
DeepSEU Johnny S. Yang, Southeast University
Yongjie Chu, Southeast University
A VGG-like model was trained for this scene classification task. We use only the resized 256x256 image data to train the model. In the training phase, multi-scale random crops are used for data augmentation. The procedure generally follows the VGG paper; for example, the batch size was set to 128 and the learning rate was initially set to 0.01. The only difference is that we do not use the Gaussian method for weight initialization. We propose a new weight initialization method, which converges slightly faster than the MSRA weight filler. In the test phase, we convert the fully connected layers into convolutional layers, and the resulting fully convolutional network is applied over the whole image. Multi-scale images are used to evaluate dense predictions. The top-5 classification accuracy we obtained on the validation set is 80.0%.
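
The test-time conversion described above can be illustrated with a small sketch: a fully connected layer that expects a C x k x k input is treated as a k x k convolution and slid over a larger feature map, and the resulting grid of class scores is averaged. This is a toy, pure-Python illustration under assumed shapes; the function names and the tiny weights are hypothetical and do not come from the team's model.

```python
def fc_forward(weights, bias, patch):
    """Fully connected layer applied to a flattened C x k x k patch."""
    flat = [v for channel in patch for row in channel for v in row]
    return [
        sum(w * x for w, x in zip(w_row, flat)) + b
        for w_row, b in zip(weights, bias)
    ]


def dense_apply(weights, bias, fmap, k):
    """Slide the FC layer, viewed as a k x k convolution, over a C x H x W map."""
    height, width = len(fmap[0]), len(fmap[0][0])
    grid = []
    for top in range(height - k + 1):
        row_scores = []
        for left in range(width - k + 1):
            patch = [
                [row[left:left + k] for row in channel[top:top + k]]
                for channel in fmap
            ]
            row_scores.append(fc_forward(weights, bias, patch))
        grid.append(row_scores)
    return grid  # (H-k+1) x (W-k+1) grid of class-score vectors


def spatial_average(grid):
    """Average the class scores over all spatial positions."""
    cells = [cell for row in grid for cell in row]
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(len(cells[0]))]


# Toy example: 1 channel, 3x3 feature map, FC layer trained on 2x2 inputs, 2 classes.
demo_weights = [[1, 0, 0, 0], [0, 0, 0, 1]]
demo_bias = [0, 0]
demo_fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
demo_scores = spatial_average(dense_apply(demo_weights, demo_bias, demo_fmap, k=2))
```

When the feature map is exactly k x k, the grid has a single cell and the result equals the ordinary fully connected output, which is why the conversion leaves the trained weights unchanged.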
FACEALL-BUPT Hongliang BAI, Beijing Faceall co., LTD
Wenjian FENG, Beijing Faceall co., LTD
Tao FU, Beijing Faceall co., LTD
This is the third time we have participated in ILSVRC. This year, we start with the GoogLeNet [1] model and apply it to all four tasks. Details are given below.

Task 1 Object Classification/Localization
We use GoogLeNet with batch normalization and PReLU for object classification. Three models are trained. The first uses the original GoogLeNet architecture, but all ReLU layers are replaced with PReLU layers. The second model is the one described in Ref. [2]. The third model is fine-tuned from the second model with multi-scale training. Multi-scale testing and model ensembling are used to generate the final classification result. 144 crops of an image [1] are used to evaluate one network. We also tried the 150-crop method described in Ref. [3]; the performance is almost the same, and merging the results of 144 crops and 150 crops does not bring much additional improvement. The top-5 errors of the three models on validation are 8.89%, 8.00% and 7.79%, respectively. Using an ensemble of the three models, we decreased the top-5 error rate to 7.03%. Ensembling more models could improve the performance further, but we did not have enough time and GPUs to do that.
To generate a bounding box for each label of an image, we first fine-tune the second classification model with object-level annotations of 1,000 classes from the ImageNet CLS-LOC training data. In addition, a background class is added to the network. Test images are then segmented into ~300 regions by selective search, and these regions are classified by the fine-tuned model into one of 1,001 classes. We select the top-3 regions with the highest class probabilities produced by the classification model. A new bounding box is generated by finding the minimal bounding rectangle of the three regions. The localization error is about 32.9% on validation. We also tried the third classification model, for which the localization error on validation is 32.76%. After merging the two aforementioned results, the localization error decreases to 31.53%.
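
The box-generation step above (take the three highest-scoring regions and enclose them in one rectangle) can be sketched as follows. This is a minimal illustration that assumes boxes are given as (x1, y1, x2, y2) tuples with one score per region for the label in question; the function names and the sample regions are hypothetical.

```python
def enclosing_box(boxes):
    """Smallest axis-aligned rectangle containing every (x1, y1, x2, y2) box."""
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def localize(scored_regions, top_n=3):
    """Merge the top_n highest-scoring regions into a single bounding box."""
    best = sorted(scored_regions, key=lambda region: region[0], reverse=True)[:top_n]
    return enclosing_box([box for _, box in best])


# Made-up regions: (class probability, box).
regions = [
    (0.9, (10, 20, 50, 60)),
    (0.2, (0, 0, 300, 300)),
    (0.8, (30, 10, 70, 40)),
    (0.7, (5, 25, 40, 80)),
]
box = localize(regions)
```

In this example the low-scoring full-image region is ignored, so the merged box stays close to the three confident regions.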

Task 2 Object Detection
We employ the well-known Fast R-CNN framework [4]. First, we tried the AlexNet model pre-trained on the CLS-LOC dataset with image-level annotations. When training on the object detection dataset, we run SGD for 12 epochs, then lower the learning rate from 0.001 to 0.0001 and train for another 4 epochs. The other settings are the same as in the original Fast R-CNN method. This approach achieves 34.9% mAP on validation. Then we apply GoogLeNet within the Fast R-CNN framework. The pooling layer after the inception 4 layers is replaced by an ROI pooling layer. This trial achieves about 34% mAP on validation. In another trial, we move the ROI pooling layer from pool4 to pool5 and enlarge the input size from 600 (max 1000) to 786 (max 1280). The pooled width and height are set to 4x4 instead of 1x1. The mAP is about 37.8% on validation. It is worth noting that the last model needs about 6 GB of GPU memory to train and 1.5 GB to test. It has nearly the same test speed as AlexNet but better performance. We employ a simple strategy to merge the three results and obtain 38.7% mAP.

Task 3 Scene Classification
The Places2 training set has about 8 million images. We reduce the input size of images from 256x256 to 128x128 and try small network architectures to accelerate training. However, due to the large number of images, we still could not finish training before the submission deadline. We trained two models whose architectures are modified from the original GoogLeNet. The first one only removes the inception 5 layers. We trained this model for only about 10 epochs. Its top-5 error on validation is about 37.19%. The second model enlarges the stride of the conv1 layer from 2 to 4 and reduces the kernel number of the conv2 layer from 192 to 96. For the remaining inception layers, every kernel number is set to half of its original value. This model is trained for about 12 epochs and achieves 38.99% top-5 error on validation. Unfortunately, about one week ago, we found that the size of the final output vector had been set to 400 instead of 401 due to an oversight. We corrected the error and fine-tuned the first model for about 1 epoch. The top-5 error of this model on validation is about 31.42%.

Task 4 Object Detection from Video
A simple method for this task is to perform object detection in all frames, but this does not exploit the spatio-temporal constraints or context information between consecutive frames in a video. Thus, we combine object detection and object tracking for this track. First, key frames are selected from a video and objects are detected in them using Fast R-CNN. There are about 1.3 million frames in the training set. Due to temporal continuity, we select one frame out of every 25 to train an object detection model, so 52,922 frames are used to train the model. Similar to the approach in the Object Detection track, we run SGD for 12 epochs, then lower the learning rate from 0.001 to 0.0001 and train for another 8 epochs. Training takes 1 day for AlexNet and 3 days for GoogLeNet on a single K40. More details about the object detection can be found in the description of Task 2. During testing, if a video has fewer than 50 frames, we choose two frames in the middle of the video. If a video has more than 50 frames, we choose one frame every 25 frames. This results in 6,861 of 176,126 frames on validation and 12,329 of 315,176 frames on test. We run object detection on these frames and keep the objects whose confidence scores are larger than 0.2 (AlexNet) and 0.4 (GoogLeNet). Then, the detected objects are tracked to generate the results for the other frames. The tracking method we use is TLD [5]. After tracking, we generate the final results for evaluation. It is worth noting that we set most of the parameters empirically because we had no time to validate them. Three entries are provided. The first uses AlexNet and achieves 19.1% mAP on validation. The second uses GoogLeNet and achieves 24.6% mAP on validation. The third merges the two results with a simple strategy and achieves 25.3% mAP on validation.
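
The test-time key-frame rule described above can be sketched as a small function. Several details are assumptions, because the abstract does not specify them: which two middle frames are taken for short videos, whether sampling starts at frame 0, and how a video of exactly 50 frames is handled (treated here like a long video). The function name is hypothetical.

```python
def select_key_frames(num_frames, stride=25, short_video_limit=50):
    """Return the 0-based indices of the frames on which the detector is run."""
    if num_frames <= 0:
        return []
    if num_frames < short_video_limit:
        # Short video: two adjacent frames around the midpoint (assumed choice).
        middle = num_frames // 2
        return sorted({max(middle - 1, 0), middle})
    # Long video: one frame every `stride` frames, starting at frame 0 (assumed).
    return list(range(0, num_frames, stride))


short_video = select_key_frames(40)
long_video = select_key_frames(100)
```

Detections on the selected frames would then be propagated to the remaining frames by the tracker, which is outside the scope of this sketch.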

[1] Szegedy C, Liu W, Jia Y, et al. Going Deeper With Convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9.
[2] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[J]. arXiv preprint arXiv:1502.03167, 2015.
[3] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[4] Girshick R. Fast R-CNN[J]. arXiv preprint arXiv:1504.08083, 2015.
[5] Kalal, Z.; Mikolajczyk, K.; Matas, J., "Tracking-Learning-Detection," in Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.34, no.7, pp.1409-1422, July 2012
GatorVision Dave Ojika, University of Florida
Liu Chujia, University of Florida
Rishab Goel, University of Florida
Vivek Viswanath, University of Florida
Arpita Tugave, University of Florida
Shruti Sivakumar, University of Florida
Dapeng Wu, University of Florida
We implement a Caffe-based convolutional neural network using the Places2 dataset for a large-scale visual recognition environment. We trained a network based on the VGG ConvNet with 13 weight layers and 3 by 3 kernels, with 3 fully connected layers. All convolutional layers are followed with a ReLU layer. Due to the very large amount of time required to train the model with deeper layers, we deployed Caffe on a multiple GPU cluster environment and leveraged cuDNN libraries to improve training time.

[1] Chen Z, Lam O, Jacobson A, et al. Convolutional neural network-based place recognition[J]. arXiv preprint arXiv:1411.1509, 2014.
[2] Zhou B, Khosla A, Lapedriza A, et al. Object detectors emerge in deep scene cnns[J]. arXiv preprint arXiv:1412.6856, 2014.
[3] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
HanGil Gil-Jin Jang, School of Electronics Engineering, Kyungpook National University, Daegu, Republic of Korea
Han-Gyu Kim, School of Computing, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
A novel deep network architecture based on independent subspace analysis (ISA) is proposed. We extract 4096-dimensional features with a baseline AlexNet trained on the Places2 database, and the proposed architecture is applied on top of this feature extraction network. Every 4 nodes of the 4096 feature nodes are grouped into a single subspace, resulting in 1024 individual subspaces. The output of each subspace is the square root of the sum of the squares of its components, and the architecture is repeated 3 times to generate 256 nodes before connecting to the final network output of 401 categories.
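
The subspace output described above (the square root of the sum of squares within each group of 4 nodes) can be sketched as a single pooling step. This minimal example assumes that groups are formed from consecutive nodes and shows only one 4096 to 1024 reduction; any learned layers between the repeated stages are not described in the abstract and are not modelled here. The function name is hypothetical.

```python
import math


def subspace_pool(features, group_size=4):
    """L2-pool consecutive groups: sqrt of the sum of squares of each group."""
    if len(features) % group_size != 0:
        raise ValueError("feature length must be a multiple of group_size")
    return [
        math.sqrt(sum(v * v for v in features[i:i + group_size]))
        for i in range(0, len(features), group_size)
    ]


# A constant 4096-dimensional vector stands in for the AlexNet features.
pooled = subspace_pool([0.5] * 4096)
```

Each output is the Euclidean norm of its group, so it is invariant to the sign of the individual components within a subspace.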
Henry Machine Henry Shu (Home)
Jerry Shu (Home)
Fundamentally different from deep learning/ConvNet/ANN/representation learning, Henry machine was trained using the traditional methodology: feature engineering --> classifier design --> prediction paradigm. The intent is to encourage continued research interest in many traditional methods in spite of the current popularity of deep learning.

The most recent (as of Nov 14, 2015) top-5 and top-1 accuracies of Henry machine for the Scene401 validation dataset are 68.53% and 36.15%, respectively.

Here are some characteristics of Henry machine.

- The features used are our own modified version of a selection of features in the literature. These are engineered features, not learned features. The feature extraction was done for all the 8.1M training, 380K test, and 20K validation images on their original, high-resolution version with the original aspect ratio left unchanged, i.e., no image resizing or rescaling was applied. The entire feature extraction step using our own implementation took 7 days to complete on a home-brewed CPU cluster (listed below), which consists of 12 low-end to mid-range computers borrowed from family and friends. Five of the twelve computers are laptops.

- We did not have time to study and implement many strong features in the literature. The accuracy of Henry machine is expected to increase once we include more such features. We believe that the craftsmanship of feature engineering encodes the expertise and ingenuity of the human designer behind it, and cannot be dispensed away (at least not yet) in spite of the current popularity of deep learning and representation learning-based approaches. Our humble goal with Henry machine is to encourage continued research interest in the craftsmanship of feature engineering.

- The training of Henry machine for Scene401 was also done using the home-brewed CPU cluster, and took 21 days to complete (not counting algorithm design/development/debugging time).

- While Henry machine was trained using traditional classification methodology, the classifier itself was devised by us from scratch instead of using conventionally available methods such as SVM. We did not use SVM because it suffers from several drawbacks. For example, SVM is fundamentally a binary classifier, and applying its maximum-margin formulation in a multiclass setting, especially with such a large number of classes (401 and 1000), could be ad hoc. Also, the output of SVM is inherently non-probabilistic, which makes it less convenient to use in a probabilistic setting. The training algorithm behind Henry machine is our attempt to address these and several other issues (not mentioned here), while at the same time to make it efficient to train, with small memory footprint, on mom-and-pop computers at home, using only CPU's. However, Henry machine is still very far from perfect, and our training algorithm still needs a lot of improvement. More details of the training and prediction algorithm will be available in our publication.

- As Nov 13 was fast approaching, we were pressed for time. The delay was mainly due to hardware heat stress from many triple-digit-temperature days in Sep and Oct here in California. At the end of Oct, we bought two GTX 970 graphics cards and implemented high-performance CUDA code (driver version 7.5) to help us finish the final prediction phase in time.

- We will also release the performance report of Henry machine on the ImageNet1000 CLS-LOC validation dataset.

- The source code for building Henry machine, including the feature extraction part, was primarily written in Octave (version 4.0.0) and C++ (gcc 4.8.4) on linux machines (Ubuntu x86_64 14.04).

Here is the list of the CPU's of the home-brewed cluster:
* Pentium D 2.8GHz, 3G DDR (2005 Desktop)
* Pentium D 2.8GHz, 3.26G DDR (2005 Desktop)
* Pentium 4 3.0GHz, 3G DDR (2005 Desktop)
* Core 2 Duo T5500 1.66GHz, 3G DDR2 (2006 Laptop)
* Celeron E3200 2.4GHz, 3G DDR2 (2009 Desktop)
* Core i7-720QM 1.6GHz, 20G DDR3 (2009 Laptop)
* Xeon E5620 2.4GHz (x2), 16G DDR3 (2010 Server)
* Core i5-2300 2.8GHz, 6G DDR3 (2011 Desktop)
* Core i7-3610QM 2.3GHz, 16G DDR3 (2012 Laptop)
* Pending info (2012 Laptop)
* Core i7-4770K 3.5GHz, 32G DDR3 (2013 Desktop)
* Core i7-4500U 1.8GHz, 8G DDR3 (2013 Laptop)
HiVision Q.Y. Zhong, S.C. Yang, H.M. Sun, G. Zheng, Y. Zhang, D. Xie and S.L. Pu [DET] We follow the Fast R-CNN [1] framework for detection. EdgeBoxes is used for generating object proposals. A detection model is fine-tuned based on a pre-trained VGG16 [3] model on ILSVRC2012 CLS dataset. During testing, predictions on test images and their flipped version are combined by non-maximum suppression. Validation mAP is 42.3%.

[CLS-LOC] We train different models for classification and localization separately, i.e. GoogLeNet [4,5] for classification and VGG16 for bounding box regression. The final models achieve 28.9% top-5 cls-loc error and 6.62% cls error on the validation set.

[Scene] Due to the limit of time and GPUs, we have just trained one CNN model for the scene classification task, namely VGG19, based on the resized 256x256 image datasets. The top-5 accuracy on the validation set with single center crop is 79.9%. In test phase, a multi-scale dense evaluation is adopted for the prediction, whose accuracy on the validation set is 80.7%.

[VID] First we apply Fast R-CNN with RPN proposals [2] to detect objects frame by frame. Then a Multi-Object Tracking (MOT) [6] method is utilized to associate the detections for each snippet. Validation mAP is 43.1%.

[1] R. Girshick. Fast R-CNN. arXiv 1504.08083, 2015.
[2] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 1506.01497, 2015.
[3] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 1409.1556, 2014.
[4] C. Szegedy, W. Liu, Y. Jia, et al. Going Deeper with Convolutions. arXiv 1409.4842, 2014.
[5] S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 1502.03167, 2015.
[6] Bae S H, Yoon K J. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2014: 1218-1225.
Isia_ICT Xiangyang Li
Xinhang Song
Luis Herranz
Shuqiang Jiang
As the number of training images per category is non-uniform, we sample 4020 images for each class from the training dataset. We use this uniformly distributed subset to train our convolutional neural networks. In order to reuse the semantic information in the 205-category Places dataset [1], we also use models trained on that dataset to extract visual features for the classification task. Although the mid-level representations in convolutional neural networks are rich, their geometric invariance properties are poor [2], so we use multi-scale features. Specifically, we convert all the layers in the convolutional neural network to convolutional layers and use the resulting fully convolutional network to extract features with different input sizes. We use max pooling to pool the features to the same size as that obtained with the fixed input size of 227 used to train the network. Finally, we combine features extracted from models that not only have different architectures but are also pre-trained on different datasets. We use the concatenated features to classify the scene images. For efficiency, we use a logistic regression classifier composed of two fully-connected layers with 4096 and 401 units, respectively, followed by a softmax layer, trained on the sampled training examples.
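
The class-balanced sampling at the start of this abstract can be sketched as below. The abstract does not say how classes with fewer than 4020 images are handled; topping them up by sampling with replacement is an assumption made here, and the function name and toy data are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(labelled_images, per_class, seed=0):
    """Draw the same number of training images from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in labelled_images:
        by_class[label].append(image)

    sample = []
    for label in sorted(by_class):
        images = by_class[label]
        if len(images) >= per_class:
            chosen = rng.sample(images, per_class)
        else:
            # Assumption: small classes keep all images and are topped up
            # with randomly repeated ones.
            extra = [rng.choice(images) for _ in range(per_class - len(images))]
            chosen = images + extra
        sample.extend((image, label) for image in chosen)
    return sample


# Toy dataset with an imbalanced class distribution.
data = [("a%d" % i, "beach") for i in range(10)] + [("b%d" % i, "attic") for i in range(2)]
picked = balanced_sample(data, per_class=4)
```

With per_class set to 4020, the same function would produce the uniformly distributed subset that the abstract describes.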

[1] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, And A. Oliva. “Learning deep features for scene recognition using places database”. In NIPS 2014.
[2] D. Yoo, S. Park, J. Lee and I. Kweon. “Multi-scale pyramid pooling for deep convolutional representation”. In CVPR Workshop 2015.
Mitsubishi Electric Research Laboratories Ming-Yu Liu, Mitsubishi Electric Research Laboratories
Teng-Yok Lee, Mitsubishi Electric Research Laboratories
The submitted result is computed using the VGG16 network, which contains 13 convolutional layers and 3 fully connected layers. After the network is trained for several epochs using the training procedure described in the original paper, we fine-tune the network using a weighted cross entropy loss, where the weights are determined per class and are based on their fitting errors. At test time, we perform multi-resolution testing. The images are resized to three different resolutions. 10 crops are extracted at each resolution, and the final score is the average of the scores of the 30 crops.
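
A class-weighted cross entropy loss of the kind mentioned above can be sketched as follows. The abstract does not give the formula used to derive the weights from the fitting errors, so the weights are simply an input here; averaging over the batch (rather than dividing by the sum of the weights) is also an assumption. Function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, class_weights):
    """Batch mean of class_weights[y] * -log p(y | x)."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        probability = softmax(logits)[label]
        total += -class_weights[label] * math.log(probability)
    return total / len(labels)


# Two-class toy example: uniform logits, class 0 weighted twice as heavily.
loss = weighted_cross_entropy([[0.0, 0.0]], [0], class_weights=[2.0, 1.0])
```

Setting every weight to 1 recovers the standard cross entropy loss, so the weighted and unweighted models in the results table differ only in this per-class factor under the assumptions above.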

NECTEC-MOONGDO Sanparith Marukatat, Ithipan Methasate, Nattachai Watcharapinchai, Sitapa Rujikietgumjorn

IMG Lab, National Electronics and Computer Technology Center, Thailand
We first built an AlexNet [NIPS2012_4824] model using Caffe [jia2014caffe] on ImageNet's object classification dataset.
Our AlexNet reaches 55.76% accuracy on object classification.
Then we replaced the classification layers (fc6 and fc7) with larger ones.
Based on a comparison of the training set sizes of the object classification dataset and the Places2 dataset, we doubled the number of hidden nodes in these two layers.
We connected this structure to a new output layer with 401 nodes for the 401 classes in the Places2 dataset.
The Places2 training dataset was split into 2 parts.
The first part was used to adjust the weights of the new layers.
On the first part, the model was trained for 1,000,000 iterations in total.
The learning rate was initialized to 0.001 and then divided by 10 every 200,000 iterations.
This yielded 43.18% accuracy on the validation set.
Then we retrained the whole convolutional network using the second part.
We set the learning rate of the new layers to be 100 times higher than that of the lower layers.
This raised the validation accuracy to 43.31%.
We also trained a network on the Places2 training dataset from the beginning, but the method described above achieved a better result.
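
The learning rate settings described above can be sketched as a step schedule plus a per-layer multiplier. This assumes that the learning rate is divided by 10 at every 200,000-iteration boundary, which is one reading of the abstract; the function names are hypothetical, and the multiplier is applied relative to the lower layers' rate.

```python
def step_learning_rate(iteration, base_lr=0.001, step_size=200_000, gamma=0.1):
    """Learning rate after `iteration` steps of a step-decay schedule."""
    return base_lr * gamma ** (iteration // step_size)


def layer_learning_rate(lower_layer_lr, is_new_layer, new_layer_multiplier=100.0):
    """New layers learn `new_layer_multiplier` times faster than lower layers."""
    return lower_layer_lr * new_layer_multiplier if is_new_layer else lower_layer_lr


# Learning rate at the start of each 200,000-iteration stage of the 1,000,000-iteration run.
stage_rates = [step_learning_rate(stage * 200_000) for stage in range(5)]
```

Under these assumptions the rate falls from 0.001 in the first stage to 0.0000001 in the last one.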

[jia2014caffe] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.
[NIPS2012_4824] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, pp. 1097-1105. Curran Associates, Inc., 2012.
NEIOP NEIOPs We pretrained a VGG16 model on Places205 database and then finetuned the model on Places2 database. All images are resized to 224 by N. Multi-scale & multi-crop are used at the testing stage.
ntu_rose wang xingxing(NTU ROSE),wang zhenhua(NTU ROSE), yin jianxiong(NTU ROSE), gu jiuxiang(NTU ROSE), wang gang(NTU ROSE), Alex Kot(NTU ROSE), Jenny Chen(Tencent) For the scene task, we first train VGG-16 [5], VGG-19 [5] and Inception-BN [1] models. We then use a CNN Tree to learn fine-grained features [6]. Finally, we combine the CNN Tree model with the VGG-16, VGG-19 and Inception-BN models to produce the final prediction.

[1] Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167
[2] Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, Gang Sun. Deep Image: Scaling up Image Recognition. arXiv preprint arXiv:1501.02876
[3] Andrew G. Howard. Some Improvements on Deep Convolutional Neural Network Based Image Classification. arXiv preprint arXiv:1312.5402
[4] Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556
[5] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159
[6] Zhenhua Wang, Xingxing Wang, Gang Wang. Learning Fine-grained Features via a CNN Tree for Large-scale Classification. arXiv preprint
Qualcomm Research Daniel Fontijne
Koen van de Sande
Eren Gögle
Blythe Towal
Anthony Sarah
Cees Snoek
We present NeoNet, an inception-style [1] deep convolutional neural network ensemble that forms the basis for our work on object detection, object localization and scene classification. Where traditional deep nets in the ImageNet challenge are image-centric, NeoNet is object-centric. We emphasize the notion of objects during pseudo-positive mining, in the improved box proposals [2], in the augmentations, during batch-normalized pre-training of features, and via bounding box regression at run time [3].

[1] S. Ioffe & C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML 2015.
[2] K.E.A. van de Sande et al. Segmentation as Selective Search for Object Recognition. In ICCV 2011
[3] R. Girshick. Fast R-CNN. In ICCV 2015.
SamExynos Qian Zhang(Beijing Samsung Telecom R&D Center)
Peng Liu(Beijing Samsung Telecom R&D Center)
Wei Zheng(Beijing Samsung Telecom R&D Center)
Zhixuan Li(Beijing Samsung Telecom R&D Center)
Junjun Xiong(Beijing Samsung Telecom R&D Center)
Our submissions are trained with modified versions of [1] and [2]. We use the structure of [1] but remove the batch normalization layers, and ReLU is replaced by PReLU [3]. In addition, a modified version of latent semantic representation learning is integrated into the structure of [1].

[1] Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
[2] Xin Li, Yuhong Guo. Latent Semantic Representation Learning for Scene Classification. ICML 2014.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CVPR 2015.
SIAT_MMLAB Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
We propose a new scene recognition system with deep convolutional models. Specifically, we address this problem from four aspects:

(i) Multi-scale CNNs: we utilize Inception2 architecture as our main exploration network structure due to its performance and efficiency. We propose a multi-scale CNN framework, where we train CNNs from image patches of two resolutions (224 cropped from 256, and 336 cropped from 384). For CNN at low resolution, we use the same network as Inception2. For CNN at high resolution, we design a deeper network based on Inception2.

(ii) Handling label ambiguity: as the scene labels are not mutually exclusive and some categories are easily confused, we propose two methods to handle this problem. First, according to the confusion matrix on the validation dataset, we merge some scene categories into a single super-category. Second, we apply the Places205 scene model to the images and use its soft output as another target to guide the training of the CNN.

(iii) Better CNN optimization: we train the CNNs with a large batch size (1024). We also decrease the learning rate exponentially. Moreover, we design a locally-supervised learning method to learn the weights of the CNNs.

(iv) Combining CNNs of different architectures: considering the complementarity of networks with different architectures, we also fuse the prediction results of several networks: VGGNet13, VGGNet16, VGGNet19 and MSRANet-B.
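
The second idea in (ii), using the soft output of another model as an extra training target, can be sketched as a two-term loss. This is only a minimal NumPy sketch under stated assumptions: the auxiliary head with 205 outputs, the function names, and the `soft_weight` value are hypothetical and are not taken from the entry.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def hard_label_loss(logits, labels):
    """Standard cross entropy against integer scene labels."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

def soft_target_loss(aux_logits, teacher_probs):
    """Cross entropy against a teacher distribution (soft targets)."""
    logp = log_softmax(aux_logits)
    return -(teacher_probs * logp).sum(axis=1).mean()

def combined_loss(logits, labels, aux_logits, teacher_probs, soft_weight=0.5):
    """Hard-label loss plus a weighted soft-target loss (weight is an assumption)."""
    return hard_label_loss(logits, labels) + soft_weight * soft_target_loss(aux_logits, teacher_probs)

# Toy batch: 401 scene classes for the main head, 205 for the assumed auxiliary head.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 401))
aux_logits = rng.normal(size=(8, 205))
labels = rng.integers(0, 401, size=8)
teacher_probs = np.exp(log_softmax(rng.normal(size=(8, 205))))  # stand-in for Places205 output
print(combined_loss(logits, labels, aux_logits, teacher_probs))
```

Keeping the two terms on separate heads avoids having to map the 205 teacher categories onto the 401 target categories; whether the team did this is not stated.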

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[2] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.

[5] Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[6] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream convnets. CoRR, abs/1507.02159, 2015.

SIIT_KAIST-ETRI Youngsoo Kim (KAIST), Heechul Jung (KAIST), Jeongwoo Ju (KAIST), Byungju Kim (KAIST), Yeakang Lee (KAIST), Junmo Kim (KAIST), Joongwon Hwang (ETRI), Young-Suk Yoon (ETRI), Yuseok Bae (ETRI)
For this work, we use a modified GoogLeNet and test-time augmentation such as cropping, scaling, rotation and projective transformation.
The network was pre-trained on the localization dataset. This work was supported by Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis.
SIIT_KAIST-TECHWIN Youngsoo Kim (KAIST), Heechul Jung (KAIST), Jeongwoo Ju (KAIST), Byungju Kim (KAIST), Sihyeon Seong (KAIST), Junho Yim (KAIST), Gayoung Lee (KAIST), Yeakang Lee (KAIST), Minju Jung (KAIST), Junmo Kim (KAIST), Soonmin Bae (Hanwha Techwin), Jayeong Ku (Hanwha Techwin), Seokmin Yoon (Hanwha Techwin), Hwalsuk Lee (Hanwha Techwin), Jaeho Jang (Hanwha Techwin)
Our method for scene classification is based on deep convolutional neural networks.
We used networks pre-trained on the ILSVRC2015 localization dataset and retrained them on the 256x256 version of the Places2 dataset.
At test time, we used ten-crop data augmentation and combined four slightly different models.
This work was supported by Hanwha Techwin.
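
The ten-crop test augmentation mentioned in this entry is commonly implemented as four corner crops plus a center crop, each with its horizontal flip. The sketch below assumes that standard layout; the 224-pixel crop size is an assumption (the entry only gives the 256x256 image size), and `ten_crop` is a hypothetical helper name.

```python
import numpy as np

def ten_crop(image, crop):
    """Return 10 crops of an (H, W, C) image: 4 corners + center, and their horizontal flips."""
    h, w = image.shape[:2]
    tops = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    crops = [image[t:t + crop, l:l + crop] for t, l in zip(tops, lefts)]
    crops += [c[:, ::-1] for c in crops]  # mirror each crop along the width axis
    return np.stack(crops)

# Stand-in for one 256x256 RGB test image.
image = np.zeros((256, 256, 3), dtype=np.uint8)
batch = ten_crop(image, 224)
print(batch.shape)  # (10, 224, 224, 3)
```

The class scores of the ten crops are then typically averaged to give one prediction per image.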
THU-UTSA-MSRA Liang Zheng, University of Texas at San Antonio
Shengjin Wang, Tsinghua University
Qi Tian, University of Texas at San Antonio
Jingdong Wang, Microsoft Research
Our team submits results on the scene classification task using the Places2 dataset [4].

We trained three CNNs on the Places2 dataset using GoogLeNet [1], with Caffe [2] used for training. The first model is obtained by fine-tuning the GoogLeNet model trained on the Places dataset [3]. The classification accuracy of this model is: top-1 accuracy = 42.96%, top-5 accuracy = 75.35%. This model is obtained after 3,320,000 mini-batches with a batch size of 32. The learning rate is set to 0.001 and gamma to 0.1, with a step size of 750,000.

The second and the third models are trained using the "quick" and the default solvers provided with GoogLeNet, respectively, and both models are trained from scratch. Specifically, for the second model, we use a base learning rate of 0.01, gamma = 0.7, and step size = 320,000. We run this model for 4,500,000 mini-batches, each of size 32. For this model, our result on the validation set is: top-1 accuracy = 43.41%, top-5 accuracy = 75.37%. The only change to the GoogLeNet architecture is that the last fully connected layer has 401 outputs. When model outputs are averaged, the averaging is done after the softmax calculation.
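
The base learning rate, gamma and step size quoted above match the parameters of Caffe's "step" learning-rate policy, where the rate is base_lr * gamma^floor(iter / stepsize). Assuming that policy was used (the entry does not name it explicitly), the learning rate at the end of training for the first two models can be worked out with a small sketch; `step_lr` is a hypothetical helper name.

```python
def step_lr(base_lr, gamma, step_size, iteration):
    """Learning rate under a 'step' policy: base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)

# First model: fine-tuned, base_lr 0.001, gamma 0.1, step 750,000, 3,320,000 mini-batches.
print(step_lr(0.001, 0.1, 750_000, 3_320_000))   # 4 completed steps -> about 1e-7

# Second model: "quick" solver, base_lr 0.01, gamma 0.7, step 320,000, 4,500,000 mini-batches.
print(step_lr(0.01, 0.7, 320_000, 4_500_000))    # 14 completed steps -> about 6.8e-5
```

Under this reading, the fine-tuned model finishes with a far smaller learning rate than the second model, which decays more gently but more often.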

We submit the results of each single model as the first three runs (run 1, run 2, and run 3). Run 4 is the averaged result of the fine-tuned GoogLeNet and the quick GoogLeNet. Run 5 is the averaged result of all three models.

[1] Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).

[2] Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.

[3] Zhou, Bolei, et al. "Learning deep features for scene recognition using places database." Advances in Neural Information Processing Systems. 2014.

[4] Zhou, Bolei, et al. "Places2: A Large-Scale Database for Scene Understanding." arXiv, 2015.
Trimps-Soushen Jie Shao*, Xiaoteng Zhang*, Jianying Zhou*, Zhengyan Ding*, Wenfei Wang, Lin Mei, Chuanping Hu (* indicates equal contribution)

(The Third Research Institute of the Ministry of Public Security, P.R. China.)
Object detection:
Our models were trained based on Fast R-CNN and Faster R-CNN. 1) More training signals were added, including negative classes and objectness. Some models were first trained on 489 subcategories and then fine-tuned on the 200 categories. 2) Pooling layers were replaced with strided convolutional layers for more accurate localization. 3) Extra data from Microsoft COCO and more anchors were used. 4) An iterative scheme alternates between scoring the proposals and refining their localizations with bounding box regression. 5) Various models were combined with weighted NMS.
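
The entry does not define "weighted NMS" further. One common reading is a greedy NMS in which each kept box is replaced by the score-weighted average of the boxes it suppresses. The sketch below implements that reading only; the IoU threshold, the function names and the assumption of positive scores and positive-area boxes are all illustrative and not taken from the entry.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def weighted_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS where each kept box is the score-weighted mean of its cluster."""
    order = np.argsort(-scores)
    boxes, scores = boxes[order], scores[order]
    remaining = np.ones(len(boxes), dtype=bool)
    kept_boxes, kept_scores = [], []
    for i in range(len(boxes)):
        if not remaining[i]:
            continue
        # Boxes not yet assigned that overlap the current top-scoring box.
        cluster = remaining & (iou(boxes[i], boxes) >= iou_threshold)
        cluster[i] = True
        weights = scores[cluster]
        kept_boxes.append((boxes[cluster] * weights[:, None]).sum(axis=0) / weights.sum())
        kept_scores.append(scores[i])
        remaining &= ~cluster
    return np.array(kept_boxes), np.array(kept_scores)

# Detections pooled from several models (hypothetical values).
boxes = np.array([[0, 0, 10, 10], [1, 0, 11, 10], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.75, 0.25, 0.5])
print(weighted_nms(boxes, scores))
```

Compared with plain NMS, the overlapping detections contribute to the final coordinates instead of being discarded.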

Object localization:
Different data augmentation methods were used, including random crops, multiple scales, contrast and color jittering. Some models were trained while maintaining the aspect ratio of the input images, while others were not. In the test phase, whole uncropped images are densely processed at various scales. The fused classification result is then generated from the scores and labels jointly. On the localization side, we follow the framework of Fast R-CNN and use an iterative scheme like the one in the detection task. We then select the top-k regions and average their coordinates as the output. Results from multiple models are fused in different ways, using the model accuracy as weights.
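
The final localization step (keep the top-k regions and average their coordinates) can be sketched in a few lines. The value of k, the unweighted mean and the function name are assumptions; the entry does not give them.

```python
import numpy as np

def average_top_k_boxes(boxes, scores, k=3):
    """Average the coordinates of the k highest-scoring boxes, given as (x1, y1, x2, y2)."""
    top = np.argsort(-scores)[:k]
    return boxes[top].mean(axis=0)

# Candidate regions for one predicted class (hypothetical values).
boxes = np.array([[0, 0, 10, 10], [2, 2, 12, 12], [100, 100, 110, 110]], dtype=float)
scores = np.array([0.9, 0.8, 0.1])
print(average_top_k_boxes(boxes, scores, k=2))
```

Averaging only makes sense when the top-k regions cover the same object, which is why the low-scoring outlier is excluded here by choosing k = 2.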

Object detection from video:
We use the same models as in the object detection task. Some of these models were fine-tuned using VID data. We also try a main-object constraint by considering the whole snippet rather than a single frame.

Scene classification:
Based on both MSRA-net and BN-GoogLeNet, plus several improvements: 1) Choose a subset of the data in a stochastic way at each epoch, ensuring that each class has a roughly equal number of images. This both accelerates training and increases model diversity. 2) To use the whole image and object-part information simultaneously, three patches of different sizes (whole image at 224x224, crop of 160x160, and crop of 112x112) were fed into the network and concatenated at the last convolutional layer. 3) Enlarge the MSRA-net to 25 layers, and change the input of some BN-nets from 224x224 to 270x270. 4) Use dense sampling and multi-crop (50x3) for testing.
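
Improvement 1) above, drawing a fresh class-balanced subset at every epoch, might look like the following sketch. The per-class quota, the seeding scheme and the function name are assumptions; classes with fewer images than the quota simply contribute all of their images, which is one way to get "roughly equal" counts.

```python
import random
from collections import defaultdict

def balanced_epoch(labels, per_class, seed):
    """Return shuffled sample indices with at most `per_class` randomly chosen images per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    subset = []
    for indices in by_class.values():
        subset.extend(rng.sample(indices, min(per_class, len(indices))))
    rng.shuffle(subset)
    return subset

# Imbalanced toy dataset: class 0 has 100 images, class 1 has 5, class 2 has 50.
labels = [0] * 100 + [1] * 5 + [2] * 50
for epoch in range(2):
    subset = balanced_epoch(labels, per_class=10, seed=epoch)
    print(epoch, len(subset))
```

Using a different seed per epoch means each epoch sees a different subset of the large classes, which is where the extra model diversity mentioned in the entry would come from.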
UIUCMSR Yingzhen Yang (UIUC), Wei Han (UIUC), Nebojsa Jojic (Microsoft Research), Jianchao Yang, Honghui Shi (UIUC), Shiyu Chang (UIUC), Thomas S. Huang (UIUC)
Abstract:
We develop a new architecture for deep Convolutional Neural Networks (CNNs), named Filter Panorama Convolutional Neural Networks (FPCNNs), for this scene classification competition. Convolutional layers are essential parts of CNNs, and each layer comprises a set of trainable filters. To enhance the representation capability of the convolutional layers with more filters while maintaining almost the same parameter size, the filters of one convolutional layer (or possibly several convolutional layers) of FPCNNs are replaced by a filter panorama, wherein each window of the filter panorama serves as a filter. With the densely extracted overlapping windows from the filter panorama, a significantly larger filter set is obtained without the risk of overfitting, since the parameter size of the filter panorama is the same as that of the original filters in CNNs.

The idea of the filter panorama is inspired by the epitome [1], which was developed in the computer vision and machine learning literature for learning a condensed version of Gaussian Mixture Models (GMMs). In the epitome, the Gaussian means are represented by a two-dimensional matrix wherein each window contains the parameters of the Gaussian means for one Gaussian component. The same structure is adopted for representing the Gaussian covariances. With almost the same parameter space as GMMs, the epitome possesses significantly more Gaussian components than its GMM counterpart, since many more Gaussian means and covariances can be extracted densely from the mean and covariance matrices of the epitome. Therefore, the generalization and representation capability of the epitome exceeds that of GMMs with almost the same parameter space, while circumventing potential overfitting.

The above characteristics of the epitome encourage us to arrange filters in a similar way in FPCNNs. More precisely, we construct a three-dimensional matrix named the filter panorama for the convolutional layer of FPCNNs, wherein each window of the filter panorama plays the same role as a filter in the convolutional layer of CNNs. The filter panorama is designed such that the number of non-overlapping windows in the filter panorama is almost equal to the number of filters in the corresponding convolutional layer of CNNs. By densely extracting overlapping windows from the filter panorama, we obtain many more filters that vary smoothly in the spatial domain, and neighboring filters share weights in their overlapping region. These smoothly varying filters tend to activate on more types of features that also exhibit small variations in the input volume, increasing the chance of extracting more robust deformation-invariant features by the subsequent max-pooling layer [2].
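
To make the parameter-sharing argument concrete, here is a minimal NumPy sketch of extracting filters from a panorama. The layout (channels, height, width), the toy sizes (a 3x24x24 panorama standing in for 64 filters of size 3x3x3) and the function name are assumptions for illustration, not the authors' implementation. It relies on `sliding_window_view`, available in NumPy 1.20 and later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def panorama_filters(panorama, k, stride=1):
    """Extract k x k windows from a (C, H, W) panorama as filters of shape (N, C, k, k)."""
    windows = sliding_window_view(panorama, (k, k), axis=(1, 2))  # (C, H-k+1, W-k+1, k, k)
    windows = windows[:, ::stride, ::stride]
    c, rows, cols = windows.shape[:3]
    return windows.transpose(1, 2, 0, 3, 4).reshape(rows * cols, c, k, k)

# A conventional layer with 64 filters of size 3x3x3 has 64 * 27 = 1728 weights.
# A 3 x 24 x 24 panorama has the same number of weights.
panorama = np.random.default_rng(0).normal(size=(3, 24, 24))

tiled = panorama_filters(panorama, k=3, stride=3)   # non-overlapping windows
dense = panorama_filters(panorama, k=3, stride=1)   # overlapping windows
print(panorama.size, tiled.shape, dense.shape)
```

In this toy setting the non-overlapping windows reproduce the original filter count (64), while dense extraction yields 22 * 22 = 484 filters from the same 1728 weights, with neighboring filters sharing most of their entries.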

In addition to the superior representation capability, the filter panorama inherently enables a better visualization of the filters by forming a “panorama” in which adjacent filters change smoothly across the spatial domain and similar filters group together. This feature would benefit the design of the networks and make it easier to observe the characteristics of the filters learnt in the convolutional layer.

[1] N. Jojic, B. J. Frey, and A. Kannan. Epitomic Analysis of Appearance and Shape. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 34-43, 2003.
[2] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. ICCV, pages 2146–2153, 2009.
Junlong Liu, USTC
We are graduate students from USTC, trying to do some research in computer vision and deep learning.
WM Li Shen (University of Chinese Academy of Sciences)
Zhouchen Lin (Peking University)
We exploit a partially overlapping optimization strategy to improve convolutional neural networks, alleviating the optimization difficulty at lower layers and favoring better discrimination at higher layers. We have verified its effectiveness on VGG-like architectures [1]. We also apply two modifications to the network architectures. Model A has 22 weight layers in total, adding three 3x3 convolutional layers to VGG-19 [1] and replacing the last max-pooling layer with an SPP layer [2]. Model B integrates multi-scale information combination. Moreover, we apply a balanced sampling strategy during training to tackle the non-uniform distribution of class samples. The algorithm and architecture details will be described in our arXiv paper (available online shortly).

In this competition, we submit five entries. The first is a single model (model A), which achieved 16.33% top-5 error on the validation dataset. The second is a single model (model B), which achieved 16.36% top-5 error on the validation dataset. The third is a combination of multiple CNN models with an averaging strategy. The fourth is a combination of these CNN models with a product strategy. The fifth is a combination of multiple CNN models with learnt weights.
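
The three fusion strategies named above (average, product, learnt weights) can be illustrated on per-model class probabilities. This is a sketch under assumptions: the function names are hypothetical, the product is renormalized so the result is again a distribution, and the weights are simply given as input because the entry does not say how they were learnt.

```python
import numpy as np

def fuse_average(probs):
    """Average fusion. probs has shape (models, images, classes)."""
    return probs.mean(axis=0)

def fuse_product(probs, eps=1e-12):
    """Product fusion, computed in the log domain and renormalized per image."""
    log_sum = np.log(probs + eps).sum(axis=0)
    fused = np.exp(log_sum - log_sum.max(axis=1, keepdims=True))
    return fused / fused.sum(axis=1, keepdims=True)

def fuse_weighted(probs, weights):
    """Weighted-sum fusion with one non-negative weight per model."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, probs, axes=1)

# Two models, one image, two classes (hypothetical probabilities).
probs = np.array([[[0.5, 0.5]],
                  [[0.9, 0.1]]])
print(fuse_average(probs))
print(fuse_product(probs))
print(fuse_weighted(probs, [1.0, 3.0]))
```

The product rule rewards classes on which all models agree and penalizes any class that a single model considers very unlikely, which is one reason it can behave differently from plain averaging.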

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR 2015.
[2] K. He, X. Zhang, S. Ren and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV 2014.
ZeroHero Svetlana Kordumova, UvA; Thomas Mensink, UvA; Cees Snoek, UvA
ZeroHero recognizes scenes without using any scene images as training data. Instead of using attributes for zero-shot recognition, we recognize a scene using a semantic word embedding that is spanned by a skip-gram model of thousands of object categories [1]. From the 22K-category ImageNet dataset, we subsample the 15K object categories for which more than 200 training examples are available. Using those, we train an inception-style convolutional neural network [2]. An unseen test image is represented as the sparsified set of prediction scores of the last network layer with softmax normalization. For the embedding space, we learn a 500-dimensional word2vec model [3], which is trained on the title, description and tag text of the 100M Flickr photos in the YFCC100M dataset [4]. The similarity between object and scene affinities in the semantic space is computed with cosine similarity over their word2vec representations, pooled with Fisher word vectors [1]. For each test image we predict the five highest scoring scenes.
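
The zero-shot scoring described above can be approximated by weighting object-to-scene word similarities with the image's object scores. The sketch below is a simplification under stated assumptions: it uses plain cosine similarity between single word vectors instead of Fisher word vector pooling, random vectors stand in for the word2vec embeddings, and the sparsification level `k` and all function names are hypothetical.

```python
import numpy as np

def sparsify_top_k(object_probs, k):
    """Keep the k largest object scores and zero out the rest."""
    sparse = np.zeros_like(object_probs)
    top = np.argsort(-object_probs)[:k]
    sparse[top] = object_probs[top]
    return sparse

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def zero_shot_scene_scores(object_probs, object_vectors, scene_vectors, k=10):
    """Score each scene as the object-score-weighted sum of object-to-scene similarities."""
    sparse = sparsify_top_k(object_probs, k)
    affinity = cosine_similarity_matrix(object_vectors, scene_vectors)  # (objects, scenes)
    return sparse @ affinity

# Random stand-ins: 1000 object classes, 401 scene names, 500-dimensional embeddings.
rng = np.random.default_rng(0)
object_vectors = rng.normal(size=(1000, 500))
scene_vectors = rng.normal(size=(401, 500))
object_probs = rng.dirichlet(np.ones(1000))

scores = zero_shot_scene_scores(object_probs, object_vectors, scene_vectors)
top5 = np.argsort(-scores)[:5]  # indices of the five highest scoring scenes
print(top5)
```

No scene image is needed at any point: the only supervision comes from the object classifier and the text-derived embedding.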


[1] M. Jain, J.C. van Gemert, T. Mensink, and C.G.M. Snoek. Objects2action: Classifying and localizing actions without any video example. In ICCV, 2015.

[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions. CVPR, 2015.

[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[4] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, L.-J. Li. The New Data and New Challenges in Multimedia Research. arXiv:1503.01817, 2015.