Good afternoon, everyone! I am glad to present our work here. My presentation consists of four sections, as follows. We quantitatively analyze the contribution of each component of our method to the captioning performance in table 2. To explore the effectiveness of the ASFA module, we compare the single ORT-based caption model with the ORT & ASFA-based caption model in table 2. Clearly, the ORT & ASFA-based caption model outperforms the single ORT-based model on all metrics. Specifically, the ORT & ASFA-based model obtains a large improvement of nearly 4% over the single ORT-based model. These results demonstrate the effectiveness of our ASFA in highlighting the salient fine-grained objects by calculating their importance weights within the image.

To further explore the effectiveness of the AORA module, we compare AORA based on one-sided and two-sided relations with the original ORT network, as reported in the third and fourth rows of table 2. Both the one-sided and the two-sided relation AORA obtain improvements over the ORT network, which may be attributed to the fact that our module assigns weights to the relations so as to activate the important object relations. Besides, the two-sided relation AORA indeed achieves some improvement over the one-sided relation AORA, which also demonstrates that calculating the relation weights between objects one by one further improves the effectiveness of AORA. Moreover, the two-sided relation AORA can calculate the importance weight of each object relation pair more flexibly.

To investigate this further, we evaluate the performance of the ORT network-based model equipped with both ASFA and AORA; both the one-sided and the two-sided relation AORA are considered. As shown in the fifth and sixth rows of table 2, the performance of the ORT network-based caption model is further improved. That is, the combination of ASFA and AORA is effective for the transformer-based model.

Generating the object relations by encoding can be regarded as a process of extracting the positional information of the objects within images. That is, the visual objects in an image can be given positional orders, such as by object size, or left-to-right or top-to-bottom based on bounding box coordinates. Note that our proposed AORA module can therefore be regarded as a flexible positional encoding block for Transformer-based image caption models. As such, we evaluate the positional encoding performance of our AORA module and provide a comparison between AORA and other related object orderings in table 3. The positional encoding ordered by box size ranks the bounding boxes by area, from largest to smallest. The left-to-right and top-to-bottom orderings sort the bounding boxes according to the x-coordinate and y-coordinate of their centroids, respectively. Geometry attention calculates the object relations by combining the spatial geometry relation with the semantic information. As shown in table 3, our proposed AORA achieves better CIDEr-D scores than the other related positional encoding methods, which demonstrates that adaptively calculating the important object relations effectively benefits the image captioning task.
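As a rough, illustrative sketch only, and not code from the paper, the fixed ordering baselines compared in table 3 could be implemented as follows. The (x1, y1, x2, y2) box format and the function names are assumptions made here for illustration.

```python
# Illustrative sketch of the fixed object-ordering baselines compared in table 3.
# Assumption: each detected object comes with a bounding box (x1, y1, x2, y2);
# the resulting rank of each box would then feed a standard positional encoding
# of a Transformer-based caption model.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def order_by_box_size(boxes: List[Box]) -> List[int]:
    """Rank boxes by area, from largest to smallest."""
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    return sorted(range(len(boxes)), key=lambda i: areas[i], reverse=True)


def order_left_to_right(boxes: List[Box]) -> List[int]:
    """Rank boxes by the x-coordinate of their centroids."""
    cx = [(x1 + x2) / 2.0 for x1, _, x2, _ in boxes]
    return sorted(range(len(boxes)), key=lambda i: cx[i])


def order_top_to_bottom(boxes: List[Box]) -> List[int]:
    """Rank boxes by the y-coordinate of their centroids."""
    cy = [(y1 + y2) / 2.0 for _, y1, _, y2 in boxes]
    return sorted(range(len(boxes)), key=lambda i: cy[i])


if __name__ == "__main__":
    # Two toy boxes: a large box on the right, a small box on the upper left.
    boxes = [(50.0, 10.0, 200.0, 150.0), (5.0, 5.0, 40.0, 30.0)]
    print(order_by_box_size(boxes))    # [0, 1]
    print(order_left_to_right(boxes))  # [1, 0]
    print(order_top_to_bottom(boxes))  # [1, 0]
```

Unlike these fixed, hand-crafted orderings, AORA adaptively calculates the importance weights of the object relations rather than committing to a single order.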
In addition to the above numerical results, we also show several qualitative examples of the image captions generated by different models. In this study, the captions generated by our model are compared with the human-annotated ground-truth captions and the captions generated by the ORT network. As shown in figure 4, our proposed model can effectively focus on the important objects and salient object relations compared with the ORT-based model. For example, our model accurately focuses on the key objects in the figure, i.e., the “crowd of people” and the “clock tower”, while the ORT network-based model ignores the “clock tower”. Furthermore, the ORT network-based model also generates a poor description of “two men” in the figure, since there is only one man riding on the back of the elephant in the image rather than a group of men. By adaptively taking the spatial geometry relation between the “elephant” and the man riding on it into account, our caption model generates a more accurate description, which also distinguishes the different object relations between them. That’s all. Thanks!