Spatiotemporal Knowledge Distillation for Efficient Estimation of Aerial Video Saliency

Jia Li¹

Kui Fu¹,³

Shengwei Zhao²,⁴

Shiming Ge²

CVTEAM

¹State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University, Beijing, China

²Institute of Information Engineering, Chinese Academy of Sciences

³Peng Cheng Laboratory

⁴School of Cyber Security, University of Chinese Academy of Sciences

TIP 2020


Abstract

The performance of video saliency estimation techniques has achieved significant advances along with the rapid development of Convolutional Neural Networks (CNNs). However, devices like cameras and drones may have limited computational capability and storage space, so that the direct deployment of complex deep saliency models becomes infeasible. To address this problem, this paper proposes a dynamic saliency estimation approach for aerial videos via spatiotemporal knowledge distillation. In this approach, five components are involved, including two teachers, two students and the desired spatiotemporal model. The knowledge of spatial and temporal saliency is first separately transferred from the two complex and redundant teachers to their simple and compact students, while the input scenes are also degraded from high-resolution to low-resolution to remove the probable data redundancy so as to greatly speed up the feature extraction process. After that, the desired spatiotemporal model is further trained by distilling and encoding the spatial and temporal saliency knowledge of two students into a unified network. In this manner, the inter-model redundancy can be removed for the effective estimation of dynamic saliency on aerial videos. Experimental results show that the proposed approach is comparable to 11 state-of-the-art models in estimating visual saliency on aerial videos, while its speed reaches up to 28,738 FPS and 1,490.5 FPS on the GPU and CPU platforms, respectively.
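The first distillation stage described in the abstract (a compact student working on low-resolution input while learning from a teacher that sees the high-resolution scene) can be illustrated with a minimal sketch. It is not the paper's implementation: the average-pooling `downsample`, the mean-squared-error objective, the weight `alpha`, and all function names are assumptions chosen for illustration, and saliency maps are represented as plain nested lists.

```python
from typing import List

Map = List[List[float]]  # a 2-D saliency map or grayscale frame


def downsample(frame: Map, factor: int) -> Map:
    """Average-pool a map by `factor` (an assumed way to degrade resolution)."""
    h, w = len(frame), len(frame[0])
    if h % factor or w % factor:
        raise ValueError("map size must be divisible by factor")
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [frame[y][x]
                     for y in range(i, i + factor)
                     for x in range(j, j + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def mse(a: Map, b: Map) -> float:
    """Mean squared error between two maps of equal size."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def distillation_loss(student: Map, teacher: Map, ground_truth: Map,
                      alpha: float = 0.5) -> float:
    """Hypothetical objective: mix imitation of the teacher's map with
    fidelity to the ground truth. `alpha` is an assumed weight."""
    return alpha * mse(student, teacher) + (1 - alpha) * mse(student, ground_truth)


# Teacher output at high resolution, pooled to the student's resolution.
teacher_hi = [[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]]
teacher_lo = downsample(teacher_hi, 2)
student_lo = [[0.5, 1.0], [1.0, 0.0]]
truth_lo = [[1.0, 1.0], [1.0, 0.0]]
loss = distillation_loss(student_lo, teacher_lo, truth_lo)
```

Under the same assumptions, the second stage could reuse `distillation_loss` with the spatial and temporal students' maps taking the teacher role for the unified spatiotemporal model.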

Qualitative comparisons

Representative frames of the models on AVS1K. (a) Video frame, (b) Ground truth, (c) HFT, (d) SP, (e) PNSP, (f) SSD, (g) LDS, (h) eDN, (i) iSEEL, (j) DVA, (k) SalNet, (l) STS, (m) SKD.

BibTeX Citation

@article{li2019spatiotemporal,
  title={Spatiotemporal knowledge distillation for efficient estimation of aerial video saliency},
  author={Li, Jia and Fu, Kui and Zhao, Shengwei and Ge, Shiming},
  journal={IEEE Transactions on Image Processing},
  volume={29},
  pages={1902--1914},
  year={2019},
  publisher={IEEE}
}