ICCV19-Paper-Review

Summaries of ICCV 2019 papers.

Hierarchical Self-Attention Network for Action Localization in Videos

Action localization is the task of recognizing the actions of one or more agents from a sequence of observations in a video and determining where those actions occur. An action is typically composed of multiple semantic sub-actions that appear in a consistent order, even though the sub-actions may vary in appearance and duration. The main aim is therefore to recognize the sub-actions of the overall action in every frame. From an application perspective, action localization has a wide range of real-life uses, such as automatic human action monitoring, video surveillance, and video captioning.
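To make the task concrete, here is a minimal, illustrative sketch (not from the paper) of how per-frame sub-action labels can be grouped into temporal segments. The label names and the helper function are hypothetical.

```python
# Illustrative only: action localization can be viewed as assigning a label to
# every frame and then grouping consecutive frames that share a label into
# temporal segments. The sub-action names below are made up for the example.

def frames_to_segments(frame_labels, background="background"):
    """Group consecutive identical frame labels into (label, start, end) segments."""
    segments = []
    start, current = None, None
    for i, label in enumerate(frame_labels):
        if label != background and start is None:
            start, current = i, label
        elif start is not None and label != current:
            segments.append((current, start, i - 1))
            start, current = (i, label) if label != background else (None, None)
    if start is not None:
        segments.append((current, start, len(frame_labels) - 1))
    return segments

# Example: an action decomposed into hypothetical sub-actions.
labels = ["run", "run", "jump", "jump", "dunk", "background", "background"]
print(frames_to_segments(labels))
# [('run', 0, 1), ('jump', 2, 3), ('dunk', 4, 4)]
```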

This paper presents a state-of-the-art action localization model. The main components of the proposed architecture are illustrated below. AL_pic1.jpg

Fig. 1: The pipeline of the proposed architecture
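As a rough illustration of the hierarchical self-attention idea, the sketch below applies self-attention first within short clips of frame features and then across clip summaries. All shapes, dimensions, and layer choices are assumptions for illustration only and do not reproduce the authors' exact architecture.

```python
# Speculative sketch: attention within clips (sub-action level), then across
# clip summaries (action level), followed by a per-clip classifier.

import torch
import torch.nn as nn

class HierarchicalSelfAttention(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4, num_classes=24):
        super().__init__()
        self.clip_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats, clip_len=8):
        # frame_feats: (batch, num_frames, feat_dim); num_frames divisible by clip_len
        b, t, d = frame_feats.shape
        clips = frame_feats.reshape(b * (t // clip_len), clip_len, d)
        clips, _ = self.clip_attn(clips, clips, clips)          # attention within each clip
        clip_summary = clips.mean(dim=1).reshape(b, t // clip_len, d)
        video, _ = self.video_attn(clip_summary, clip_summary, clip_summary)  # attention across clips
        return self.classifier(video)                            # per-clip class scores

feats = torch.randn(2, 32, 256)               # 2 videos, 32 frames of 256-d features (assumed)
scores = HierarchicalSelfAttention()(feats)   # -> (2, 4, 24)
print(scores.shape)
```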

You can find the SlideShare slides here.

Experimental Results:

Datasets:

Two datasets are used for the experiments: UCF101-24 and J-HMDB. Both action localization datasets cover a variety of characteristics, which makes them suitable for the experiments.

UCF101-24 :

This dataset contains 3194 annotated videos and 24 action classes. It encompasses a variety of actor viewpoints, illumination conditions, and camera movements. Most of the videos in this dataset are untrimmed. You can find the dataset here.

J-HMDB dataset:

This dataset is composed of 928 trimmed videos and 21 action classes. Challenges encountered in this dataset include occlusion, background clutter, and high inter-class similarity. You can find the dataset here.

Table 1:

Action localization results on UCF101-24 with various combinations of strategies. table1_AL.jpg

Table 2:

Action localization results on J-HMDB with various combinations of strategies. table2_AL.jpg

Comparison with State of the Art Works:

The proposed architecture is compared with ten baselines, including Zolfaghari et al. [5], Alwando et al. [5], Singh et al. [8], CPLA [9], T-CNN [10], ACT [11], TPN [12], RTP + RTN [13], Gu et al. [15], and Duarte et al. [17], in terms of video mAP at different IoU thresholds on UCF101-24 and J-HMDB. The results are provided in the tables below.

Please note that ‘[x]’ denotes the reference number of each baseline.
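For readers unfamiliar with the metric, video mAP counts a predicted action tube as a true positive only if its IoU with a ground-truth tube of the same class reaches the chosen threshold. The snippet below illustrates just the thresholding step using simple 1-D temporal intervals as an assumption; the actual metric uses spatio-temporal tubes and per-class average precision.

```python
# Illustration of IoU thresholding with (start, end) frame intervals, end inclusive.

def temporal_iou(pred, gt):
    """IoU of two (start, end) frame intervals."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

def is_true_positive(pred, gt, threshold=0.5):
    return temporal_iou(pred, gt) >= threshold

print(temporal_iou((10, 29), (15, 34)))           # 15 / 25 = 0.6
print(is_true_positive((10, 29), (15, 34), 0.5))  # True at IoU threshold 0.5
```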

Table 3:

Comparison of the action localization performance on UCF101-24. The best results are bold-faced. table3_AL.jpg

Table 4:

Comparison of the action localization performance on J-HMDB. The best results are bold-faced. table4_AL.jpg

Except for [15, 17], the proposed model outperforms all of the aforementioned methods, thanks to the bidirectional self-attention and the fusion strategy, which substantially boost its performance.
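As a loose illustration of a fusion strategy (an assumption on our part, not the paper's exact scheme), the sketch below simply averages per-frame class scores produced by two temporal passes.

```python
# Hypothetical late fusion: combine per-frame class scores from two passes/streams.

import numpy as np

def fuse_scores(forward_scores, backward_scores, weight=0.5):
    """Weighted average of two (num_frames, num_classes) score arrays."""
    return weight * forward_scores + (1.0 - weight) * backward_scores

fwd = np.random.rand(32, 24)   # scores from a forward temporal pass (assumed shape)
bwd = np.random.rand(32, 24)   # scores from a backward temporal pass (assumed shape)
fused = fuse_scores(fwd, bwd)
print(fused.shape)             # (32, 24)
```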