…tune the pruned model. The details are described in Section III.

III. Methodology

In this section, we first present the designed UAV-aided rescue system and the dataset collection process. We then describe the deep learning model used in our rescue system and our proposed method for pruning and fine-tuning the model.

A. System overview

The drone used for the SAR mission is the DJI Matrice 210 RTK V2 (Figure 2). This drone is designed for industrial use, with a flight time of up to 33 minutes and a range of up to 5 miles. It has an advanced power management system and carries two batteries during flight, which provides sufficient power and enhances flight safety. It also has a battery heating system that can significantly extend the flight time in cold temperatures, so the drone can be applied in many scenarios.

The DJI Matrice 210 is compatible with Zenmuse gimbal cameras (such as the Z30, X4S, and XT2) and can provide high-resolution visible-light as well as thermal images and videos. For image acquisition, a Zenmuse XT2 gimbal camera is mounted on the drone; it combines a visual camera with a FLIR longwave infrared thermal camera and delivers thermal and visible-light imagery simultaneously.

The video stream captured by the Zenmuse XT2 is sent to the onboard microcomputer Manifold 2-G, DJI's second-generation microcomputer for DJI SDK developers. Its processor is an NVIDIA Jetson TX2, which has an onboard Pascal GPU with 256 CUDA cores.

The UAV-aided SAR system designed in this paper works as follows. In the event of a disaster, the drone flies over the disaster area at low altitude and searches the affected areas using the Zenmuse XT2 thermal camera. The thermal camera transmits the captured video stream to the onboard microcomputer (Manifold 2-G) for survivor detection. When a victim is identified, the location information is reported to the rescue team.

B. Dataset development

In this subsection, we first describe the dataset collection, then explain the data pre-processing, and finally outline the properties of our dataset.

1) Dataset collection: To simulate the postures of people after a disaster, our data were collected in various locations (on our campus, on grass, and on a beach). Participants were asked to exhibit various poses, such as walking, squatting, lying down, or leaning on a building. All the actors are researchers in our laboratory. The data were collected at various times during day and night to simulate disaster scenes under different lighting conditions. During data collection, in order to find appropriate settings for our UAV, we experimented with various thermal camera angles and flight altitudes while capturing videos of these poses, ensuring that the poses are clearly recognizable and distinguishable. Based on these experiments, we set the camera angle to 30 or 90 degrees and the flight altitude to between 15 and 40 meters.

2) Pre-processing of the dataset: After capturing the videos, there are two main steps for preprocessing the dataset: frame selection and pedestrian annotation.

* Frame selection: The dataset for thermal-image pedestrian detection is collected in video form. Because the videos contain 30 frames per second, consecutive frames are nearly identical; to avoid duplication, we keep only one frame out of every 12 for training and testing the convolutional neural networks (a minimal extraction sketch is given after this list).

* Pedestrian annotation: One of the most time-consuming steps in preparing a dataset is the annotation of objects in images. Several free image annotation tools exist, such as Labelbox and LabelImg, and the choice of tool depends on the training method used. Our dataset contains only one object class, and we chose LabelImg to label pedestrians in the thermal images. The annotations are stored as XML files.
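The frame-selection step can be sketched as follows. This is a minimal OpenCV example written only for illustration; reading the videos with cv2.VideoCapture and the output file naming are assumptions, not the authors' actual tooling.

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, step=12):
    """Keep one frame out of every `step` frames of a 30 fps thermal video
    to reduce near-duplicate images in the training/testing set."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of the video sequence
            break
        if index % step == 0:           # keep every 12th frame
            cv2.imwrite(str(out / f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```

With step=12 and 30 fps video, this keeps roughly 2.5 frames per second, which matches the reduction from 77,365 raw frames to 6,447 selected images reported below.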
3) Dataset summary and statistics: The new UAV thermal image dataset for pedestrian detection contains a total of 3 video sequences and 77,365 frames at 640x512 resolution. We extract one image every 12 frames to avoid repetition, which yields a total of 6,447 thermal images. The sequences were captured using a DJI Matrice 210 flying at altitudes ranging from 15 to 40 meters with a camera angle of 30 or 90 degrees. Table I compares the details of our dataset with other datasets.

Fig. 4: Sensitivity of the layers on the thermal image dataset.

C. Pedestrian detection methods

In this paper, we apply the DJI Matrice 210 with the onboard microcomputer Manifold 2-G to detect survivors in post-disaster SAR missions. The target platform has limited hardware resources yet still needs to achieve real-time performance, so we selected the single-stage YOLO-V3 detector family for its excellent processing speed.

YOLO is one of the most advanced real-time object detection systems, and YOLO-V3 is the third version of the YOLO algorithm [18]. YOLO-V3 uses a binary cross-entropy loss to calculate the classification loss, which reduces computational complexity by avoiding the softmax function and replacing the mean-squared-error loss. Therefore, it is easy for YOLO-V3 to achieve real-time performance on a computer with a GPU. However, on embedded devices such as the DJI Manifold, the YOLO-V3 model runs slowly. The YOLOV3-MobileNet series uses MobileNet instead of Darknet as the backbone network of YOLO-V3; MobileNet uses depthwise separable convolutions to build lightweight deep neural networks.

The platform Manifold 2-G has limited computation capacity, so we need to compress the network to reduce the model size and accelerate inference. Much recent research prunes the weights of individual layers; however, such weight pruning removes a significant number of parameters from the fully connected layers and causes a large loss of accuracy. In this paper, we instead analyze the sensitivity of each layer based on the method proposed in [27] and then determine how to prune the network.

Within a convolutional layer, the filters are sorted from high to low according to their L1-norm; filters with smaller L1-norms are considered less important and are pruned first. When two convolutional layers have the same proportion of filters pruned, the layer whose accuracy is affected more is considered more sensitive. Therefore, different proportions of filters are pruned according to the sensitivity of each convolutional layer.

As shown in Figure 4, the X-axis is the filter prune ratio and the Y-axis is the loss of accuracy; each colored line represents a convolutional layer in the network. Each time an mAP (mean Average Precision) loss value is selected on the Y-axis, it corresponds to a set of prune ratios on the X-axis, as shown by the solid black line. By moving the solid black line, we find a set of reasonable prune ratios that meets the accuracy-loss constraint.
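To make the L1-norm ranking and the per-layer sensitivity scan of Figure 4 concrete, the following is a minimal PyTorch-style sketch. The paper does not specify its framework, so the module handling, the evaluate() mAP routine, and the masking-by-zeroing shortcut are assumptions for illustration rather than the authors' implementation; actual pruning would physically remove the filters.

```python
import torch
import torch.nn as nn

def l1_rank_filters(conv: nn.Conv2d) -> torch.Tensor:
    """Sort the output filters of a conv layer from largest to smallest L1-norm."""
    # conv.weight has shape (out_channels, in_channels, kH, kW); the L1-norm of
    # each filter is the sum of the absolute values of its weights.
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(norms, descending=True)

def sensitivity_scan(model, conv_layers, prune_ratios, evaluate):
    """For each conv layer, zero out its lowest-L1 filters at several prune
    ratios and record the mAP drop on the validation set (cf. Fig. 4)."""
    baseline = evaluate(model)                            # mAP with no pruning
    sensitivity = {}
    for name, conv in conv_layers.items():
        order = l1_rank_filters(conv)
        original = conv.weight.detach().clone()
        losses = []
        for ratio in prune_ratios:
            n_prune = int(ratio * conv.out_channels)
            with torch.no_grad():
                if n_prune > 0:
                    conv.weight[order[-n_prune:]] = 0.0   # mask least-important filters
            losses.append(baseline - evaluate(model))     # accuracy loss at this ratio
            with torch.no_grad():
                conv.weight.copy_(original)               # restore before the next trial
        sensitivity[name] = losses                        # one curve per layer in Fig. 4
    return sensitivity
```

Here conv_layers could be built as {name: m for name, m in model.named_modules() if isinstance(m, nn.Conv2d)} and prune_ratios set to, e.g., 0.1 through 0.9. From the resulting per-layer curves, a target mAP loss on the Y-axis selects one prune ratio per layer, exactly as the solid black line in Figure 4 does.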
We prune each convolutional layer separately with different prune ratios and observe the accuracy loss on the validation dataset. If the curve rises slowly, the corresponding convolutional layer is relatively insensitive, and we give priority to pruning the filters of such insensitive layers.

After pruning the network, the model size is significantly reduced, but the accuracy of the network also drops. In order to restore the recognition rate of the network, we need to fine-tune it. In this paper, we use knowledge distillation [28]. The main idea of knowledge distillation is to use a complex network with a high recognition rate as the teacher model and a small network as the student, and then use the teacher network to retrain the student network. With the help of the teacher network's knowledge, the accuracy of the student network can be improved.

In this paper, we use the most straightforward knowledge distillation technique, which replaces the labels of the student network with the predictions of the teacher network. This replacement allows the student network to learn from a network whose activation regions are already defined, which makes learning easier. The process diagram is shown in Figure 5. The trained YOLOV3-Darknet network is chosen as the teacher network, and we fine-tune the pruned YOLOV3-MobileNetV1 and YOLOV3-MobileNetV3 networks with it.
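The label-replacement distillation can be sketched as below. This is only an illustrative sketch under assumed interfaces, namely a teacher(images) call returning per-image boxes, scores, and class labels, and a criterion implementing the usual YOLO detection loss; it is not the paper's actual training code, and the confidence threshold is a hypothetical parameter.

```python
import torch

def distillation_targets(teacher, images, conf_thresh=0.5):
    """Run the frozen teacher (YOLOV3-Darknet) on a batch and keep its
    confident detections as pseudo ground truth for the student."""
    teacher.eval()
    with torch.no_grad():
        detections = teacher(images)        # assumed: list of (boxes, scores, labels) per image
    targets = []
    for boxes, scores, labels in detections:
        keep = scores > conf_thresh         # discard low-confidence teacher boxes
        targets.append({"boxes": boxes[keep], "labels": labels[keep]})
    return targets

def fine_tune_step(student, teacher, images, optimizer, criterion):
    """One fine-tuning step: the pruned student (YOLOV3-MobileNet) is trained
    against the teacher's predictions instead of the human annotations."""
    targets = distillation_targets(teacher, images)
    loss = criterion(student(images), targets)   # standard YOLO detection loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this scheme the training loop itself is unchanged; only the annotation source is swapped from the XML labels to the teacher's detections, which is why it is the most straightforward form of distillation.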