Context-Aware CNNs for Person Head Detection
Person detection is a key problem for many computer vision tasks. While face detection has reached maturity, detecting people under full variation of camera view-points, human poses, lighting conditions and occlusions is still a difficult challenge. In this work we focus on detecting human heads in natural scenes. Starting from the recent R-CNN object detector, we extend it in two ways. First, we leverage person-scene relations and propose a global CNN model trained to predict positions and scales of heads directly from the full image. Second, we explicitly model pairwise relations among the objects via energy-based model where the potentials are computed with a CNN framework. Our full combined model complements R-CNN with contextual cues derived from the scene. To train and test our model, we introduce a large dataset with 369,846 human heads annotated in 224,740 movie frames. We evaluate our method and demonstrate improvements of person head detection compared to several recent baselines on three datasets. We also show improvements of the detection speed provided by our model.