Weakly supervised spatiotemporal violence detection in surveillance video

Thumbnail Image
Journal Title
Journal ISSN
Volume Title
Universidad Católica San pablo
Violence Detection in surveillance video is an important task to prevent social and personal security issues. Usually, traditional surveillance systems need a human operator to monitor a large number of cameras, leading to problems such as miss detections and false positive detections. To address this problem, in last years, researchers have been proposing computer vision-based methods to detect violent actions. The violence detection task could be considered a sub-task of the action recognition task but violence detection has been less investigated. Although a lot of action recognition works were proposed for human behavior analysis, there are just a few CCTV-based surveillance methods for analyzing violent actions. In the literature of violence detection, most of the methods tackle the problem as a classication task, where a short video is labeled as violent or non-violent. Just a few methods tackle the problem as a spatiotemporal detection task, where the method should detect spatially and temporally violent actions. We assume that the lack of such methods is due the exorbitant cost of annotating, at frame-level, current violence datasets. In this work, we propose a spatiotemporal violence detection method using a weakly supervised approach to train the model using only video-level labels. Our proposal uses a Deep Learning model following a Fast-RCNN (Girshick, 2015) style architecture extended temporally. Our method starts by generating spatiotemporal proposals leveraging a pre-trained person detector and motion appearance to build such proposals called action tubes. An action tube is dened as a set of temporally related bounding boxes that enclose and track a person doing an action. Then, a video with the action tubes is fed to the model to extract spatiotemporal features, and nally, we train a tube classier based on Multiple-instance learning (Liu et al., 2012). The spatial localization relies on the pre-trained person detector and motion regions extracted from dynamic images (Bilen et al., 2017). A dynamic image summarizes the movement of a set of frames to an image. Meanwhile, temporal localization is done by the action tubes by grouping spatial regions over time. We evaluate the proposed method on four publicly available datasets such as Hockey Fight, RWF-2000, RLVSD and UCFCrime2Local. Our proposal achieves an accuracy score of 97:3%, 88:71%, and 92:88% for violence detection in the Hockey Fight, RWF-2000, and RLVSD datasets, respectively; which are very close to the state-of-the-art methods. Besides, our method is able to detect spatial locations in video frames. To validate our spatiotemporal violence detection results, we use the UCFCrime2Local dataset. The proposed approach reduces the spatiotemporal localization error to 31:92%, which demonstrates the feasibility of the approach to detect and track violent actions.