A Transfer Learning-Based Multi-cues Multi-scale Spatial–Temporal Modeling for Effective Video-Based Crowd Counting and Density Estimation Using a Single-Column 2D-Atrous Net
Abstract
Crowd count and density estimation (CCDE) is an emerging research area and a useful tool for crowd analysis and behavior modeling. Existing video-based CCDE approaches employ spatial–temporal modeling, but they fail to address several major issues, such as scale variation caused by perspective distortion within a frame and across a volume of frames, and the minimization of background influence during spatial–temporal modeling. To address these issues, we design a transfer learning-based multi-cue multi-scale spatial–temporal model for video-based CCDE. The proposed model uses a pre-trained Inception-V3 to extract multi-scale features from four different video cues: the color frame, the foreground map of the frame, a volume of frames, and a volume of foreground maps. The foreground maps are obtained using a Gaussian mixture model. The extracted multi-cue multi-scale features are then concatenated and fed into a single-column 2D-Atrous-Net, which estimates crowd density by regressing against ground-truth density maps. Experiments are conducted on two benchmark datasets, Mall and Venice. The proposed model outperforms state-of-the-art techniques, yielding an effective CCDE model with lower MAE and RMSE. © 2021, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
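The 2D-Atrous-Net named in the abstract is built on atrous (dilated) convolutions, which enlarge a filter's receptive field by inserting zeros between kernel taps without adding parameters. The following is a minimal NumPy sketch of that operation; the function name, loop-based implementation, and zero padding scheme are illustrative assumptions, not the authors' code.

```python
import numpy as np

def atrous_conv2d(image, kernel, rate=2):
    """Naive single-channel 2D atrous (dilated) convolution, 'same' padding.

    The dilation `rate` inserts (rate - 1) zeros between kernel taps,
    so a k x k kernel covers an effective (k-1)*rate + 1 window.
    """
    kh, kw = kernel.shape
    # Effective kernel extent after dilation, and the padding that keeps
    # the output the same size as the input.
    eff_h = (kh - 1) * rate + 1
    eff_w = (kw - 1) * rate + 1
    pad_h, pad_w = eff_h // 2, eff_w // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))

    out = np.zeros(image.shape, dtype=float)
    H, W = image.shape
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    # Sample the padded input at dilated offsets.
                    acc += kernel[u, v] * padded[i + u * rate, j + v * rate]
            out[i, j] = acc
    return out
```

In practice a deep-learning framework's built-in dilated convolution would be used; this sketch only makes explicit how the dilation rate spreads the kernel taps, which is what lets a single-column network capture multi-scale context in the density map regression.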