dataset was obtained in 13 separate video sequences, each with a different number of images. It consists of 15,224 single-channel images with dimensions of 164 × 129 pixels. The training set and test set consist of 6,159 and 9,065 images, respectively. In evaluations, only unoccluded pedestrians are considered. Portmann et al. [12] released a new dataset containing 4,381 images of humans and animals, such as dogs and horses. The dataset consists of 9 outdoor sequences recorded from different perspectives and at different temperatures. [13] presented the TIV dataset, consisting of seven separate scenes, two of which are indoor. The full resolution is 1024 × 1024. So far, the TIV dataset contains 63,782 images recording thousands of objects, and it is still being updated. In [14], the authors captured nine video sequences (the LITIV dataset). The LITIV dataset consists of videos of various tracking situations acquired at 30 frames per second by a visible and a thermal camera, with different zoom settings and at different locations. The KAIST multispectral pedestrian dataset [15] includes images taken under varying lighting conditions in various traffic scenes (i.e., data collected both during daytime and at night). The dataset is made up of over 95,000 aligned RGB-thermal image pairs; 50,200 images are used for training, and the rest are used for testing. There are 103,128 annotations corresponding to 1,182 pedestrians. Detailed information about the thermal image datasets is given in Table I.

The pedestrians in these datasets exhibit only a limited range of postures, such as walking, standing, and cycling. In post-disaster SAR missions, however, victims appear in far more varied poses: more often they are lying on the ground, squatting, leaning against collapsed buildings, or buried in rubble. To address this issue, we collected a new thermal image dataset captured by UAV in which the persons exhibit diverse postures. Dataset development is presented in Section III-B.

Pedestrian detection

Pedestrian detection is a longstanding application in the computer vision field, and many algorithms have been proposed over time. In the last few years, with the advancement of technology, especially the development of graphics processing units (GPUs), pedestrian detection algorithms based on convolutional neural networks have become more and more popular. These algorithms have proven to be more effective than traditional geometric or statistical methods. Object detectors based on deep neural networks can be divided into two groups, single-stage and two-stage detectors; the main difference is whether an extra region proposal module is required.

1) Two-stage detectors: Two-stage detectors are all region-based; the R-CNN (Region-based Convolutional Neural Networks) family is a typical representative of such detectors [15], [16]. The detection happens in two steps: (1) first, the model uses a region proposal network to propose a set of regions of interest; because the potential bounding box candidates are practically infinite, the proposed regions are sparse. (2) Then the region proposals are sent to a classifier for object classification. Two-stage detectors apply region proposal modules to obtain high-quality region proposals, thus achieving good detection accuracy. However, two-stage detectors require a huge computation and run-time memory footprint, which makes detection relatively slow.
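To make the two-step pipeline concrete, the following is a minimal sketch using torchvision's off-the-shelf Faster R-CNN implementation. This is an illustrative example only, not the detector evaluated in this paper; the confidence threshold and the replication of the single-channel thermal input to three channels are assumptions for the sketch.

```python
# Minimal sketch of two-stage detection with torchvision's Faster R-CNN.
# Requires torchvision >= 0.13 for the `weights=` argument.
import torch
import torchvision

# The region proposal network (stage 1) and the classification/box-refinement
# head (stage 2) are both wrapped inside this single torchvision model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy single-channel thermal image, replicated to three channels
# because the pretrained backbone expects RGB input (assumption).
thermal = torch.rand(1, 164, 129)      # (C, H, W), float values in [0, 1]
image = thermal.repeat(3, 1, 1)

with torch.no_grad():
    predictions = model([image])       # list with one dict per input image

# Each dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
boxes = predictions[0]["boxes"]
scores = predictions[0]["scores"]
keep = scores > 0.5                    # confidence threshold (assumption)
print(boxes[keep])
```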
2) Single-stage detectors: Single-stage detectors, on the other hand, skip the region proposal stage and run detection directly across an image via dense sampling of potential locations. The typical representatives of such detectors are the YOLO (You Only Look Once) series of models [17]–[19], SSD (Single Shot MultiBox Detector) [20], and RetinaNet [21]. Single-stage detectors encapsulate all computations into a single network, which lets them run much faster than two-stage detectors, although they may reach lower accuracy.

However, in UAV-aided post-disaster rescue scenarios, the onboard microcomputer of a UAV has limited computing capacity. To address this issue and achieve real-time performance, different techniques have been explored in the literature. Tijtgat et al. designed a UAV warning system and compared the inference time and accuracy of the YOLOv2 and Tiny YOLOv2 networks on the Jetson TX2 platform [22]. He et al. proposed an Asymptotic Soft Filter Pruning method to accelerate the inference of deep neural networks [23]. Zhou et al. proposed a method to allocate the inference computation of each network layer to different devices in an embedded system [24]. In this paper, we apply YOLOv3 [18] as the base architecture and combine it with other backbone networks, such as MobileNetV1 [25] and MobileNetV3 [26], to test inference time and accuracy on our thermal image dataset. We apply optimization steps to prune the unnecessary filters of these models and minimize the model size. After that, to ensure that detection accuracy is not lost, we use knowledge distillation to fine-tune the pruned models.
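The following is a simplified sketch of soft filter pruning in PyTorch: after each training epoch, the convolutional filters with the smallest L2 norms are zeroed but kept in the network, so they may recover in later epochs. The fixed pruning rate is an illustrative assumption; the Asymptotic Soft Filter Pruning method of [23] instead raises the pruning rate gradually during training.

```python
# Simplified sketch of soft filter pruning: zero out the conv filters
# with the smallest L2 norms. The asymptotic schedule of [23] gradually
# raises the pruning rate; here it is fixed (assumption).
import torch
import torch.nn as nn

def soft_prune_conv(conv: nn.Conv2d, prune_rate: float = 0.3) -> None:
    """Zero the `prune_rate` fraction of filters with the smallest L2 norm.

    Filters are only zeroed, not removed, so they can regain nonzero
    weights in later epochs (the "soft" part of soft filter pruning).
    """
    with torch.no_grad():
        # One L2 norm per output filter, shape (out_channels,).
        norms = conv.weight.flatten(1).norm(p=2, dim=1)
        num_pruned = int(conv.out_channels * prune_rate)
        if num_pruned == 0:
            return
        _, idx = torch.topk(norms, num_pruned, largest=False)
        conv.weight[idx] = 0.0
        if conv.bias is not None:
            conv.bias[idx] = 0.0

# Usage: apply to every conv layer of a model after each training epoch.
model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3))
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        soft_prune_conv(m, prune_rate=0.3)
```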
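Likewise, the following is a minimal sketch of a knowledge-distillation loss in the classification setting, where the pruned student network is trained against the softened outputs of the unpruned teacher. The temperature and weighting values are illustrative assumptions, and distilling a full YOLOv3 detector would additionally involve the box-regression and objectness outputs.

```python
# Minimal sketch of knowledge-distillation fine-tuning: the pruned
# (student) model learns to match the temperature-softened class
# probabilities of the unpruned (teacher) model. Temperature and alpha
# are illustrative assumptions, not values from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """Weighted sum of a soft (teacher-matching) and a hard (label) loss."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2   # standard T^2 scaling keeps gradient magnitudes comparable
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

# Usage with dummy logits for a 2-class (person / background) problem:
student = torch.randn(8, 2, requires_grad=True)
teacher = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```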