Good afternoon, everyone! I am glad to present our work here. My presentation consists of four sections, as follows.

The image captioning task aims at describing the main contents of an image in natural language sentences. It combines techniques from computer vision and natural language processing and is one of the important tasks in multi-modality processing. Image captioning methods usually adopt the encoder-decoder framework: a vision network first encodes the input image into an intermediate representation, and a language model then decodes that representation into a textual description. The vision encoder recognizes the main descriptive objects and embeds them into feature vectors as the intermediate representation; the language model then translates the embedded feature vectors into a word sequence that presents the semantic contents of the input image. Recent methods extract visual region features with an object detector and generate the descriptive text sequence by decoding these features with a language model, e.g., Long Short-Term Memory or Transformer networks (a minimal code sketch of this generic pipeline is given at the end of this section).

Transformer-based image captioning models have delivered remarkable performance by exploiting region features, but they do not properly consider the fine-grained differences among objects, which directly affects the generation of specific and accurate text descriptions. The improvement of transformer-based image captioning models benefits to a large extent from multi-head self-attention, which enables them to model the correlation and interaction of features. However, this attention mechanism ignores the fine-grained differences among objects, which makes current transformer-based image captioning models inept at modeling the importance of objects and the relations between objects.

Next, we introduce the fine-grained differences from two respects. First, transformer-based deep image captioning models cannot distinguish the importance of objects in fine-grained foregrounds, which we call the soft foreground; that is, objects in the same foreground contribute differently to the captioning task. For example, there are crowds of people in the images, yet only the two "men" playing tennis and the "woman" riding a horse are important foreground objects, rather than the people around them. Second, different object relations within an image play different roles in describing its content from different perspectives. Rather than modeling the object relation by simply adding the semantic relation and the spatial geometry relation, the object relation should be generated adaptively according to the main contents of the image. That is, the dominant source of the object relation differs from image to image. For example, in one image the semantic relation has a greater impact than the spatial geometry relation, whereas in the second row of images the "woman"-"horse" pair is more strongly supported by the spatial geometry relation than by the semantic relation.
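As a rough illustration of the generic framework just described, the sketch below wires pre-extracted region features (for example, from a Faster R-CNN detector) into a standard Transformer encoder-decoder using PyTorch's built-in modules. All module and parameter names here are hypothetical, and the sketch omits positional encoding and beam search; it is not the implementation proposed in this work.

```python
# Minimal sketch of a region-feature encoder-decoder captioner (hypothetical names).
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed_regions = nn.Linear(feat_dim, d_model)   # project detector features
        self.embed_words = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, caption_tokens):
        # region_feats: (B, N, feat_dim) detector outputs; caption_tokens: (B, T) word ids
        src = self.embed_regions(region_feats)
        tgt = self.embed_words(caption_tokens)               # positional encoding omitted
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            caption_tokens.size(1)).to(src.device)
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)                              # per-step vocabulary logits

# Usage (teacher forcing): logits = RegionCaptioner(vocab_size=10000)(feats, tokens[:, :-1])
```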
Besides, the semantic relation between the "woman" and the "people" in the background is usually not emphasized in the descriptive content.

In this work, we explicitly take the importance of fine-grained objects and object relations into account for image captioning, and we propose effective strategies to highlight the contributions of the fine-grained objects and object relations that dominate the descriptive content. Given an image, our proposed network generates a sentence describing its main visual contents. First, object appearance features and geometry coordinates are extracted by a pre-trained Faster R-CNN. A linear embedding layer then produces the feature representation used as the object appearance features. After that, attentive soft-foreground features are computed from the appearance features and fed into the AORA module, which computes the attentive features with adaptive relation based on the relation-attentive features. Note that each multi-head adaptive object relation attention layer is followed by its own position-wise feed-forward network. After stacking this attention-plus-feed-forward layer six times, the encoded attentive features are decoded by a standard Transformer decoder to generate the descriptive sentences.

Technically, we propose an Adaptive Soft-Foreground Attention (ASFA) mechanism to distinguish the fine-grained object features that dominate the descriptive content. Specifically, ASFA assigns weights to different objects based on global information, which significantly enhances the importance of the relevant objects. An Adaptive Object Relation Attention (AORA) mechanism is also presented to calculate the important relations of fine-grained objects within the same image. AORA generates the mutual relations between pairs of objects by adaptive fusion, so as to regulate the relations between fine-grained objects (illustrative code sketches of both mechanisms are given at the end of this section).

Extensive experiments on the benchmark dataset MS-COCO with Karpathy's splits demonstrate the state-of-the-art performance of our proposed model for image captioning. We evaluate each image captioning model and summarize the results under cross-entropy training and self-critical training in Table 1. In general, our caption model performs better than the compared models, except for X-Transformer in terms of the METEOR and CIDEr metrics when using the cross-entropy loss for optimization. Specifically, our method effectively improves the performance on BLEU-1. We think this is partly because our model highlights the important fine-grained objects within the images, which benefits object recognition. As for the performance on BLEU-4, our model focuses on the fine-grained differences of objects on the visual side of the image captioning task and, compared with X-Transformer, makes no particular effort on the natural language side; therefore the caption fluency reflected by BLEU-4 is not improved as much as BLEU-1. Furthermore, FineFormer with the two-sided relation AORA achieves a larger improvement than with the one-sided relation AORA, since the two-sided relation AORA allows a more flexible, pairwise weight assignment on the important fine-grained object relations.
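For concreteness, here is one plausible way to weight object features by global image context, in the spirit of the Adaptive Soft-Foreground Attention described above. The gating formulation and all names are assumptions made for illustration, not the exact equations of this work.

```python
# Sketch of a soft-foreground gate: each embedded region feature is re-weighted by a
# scalar conditioned on the global (mean-pooled) image context, so that likely
# foreground objects are emphasized. Hypothetical formulation.
import torch
import torch.nn as nn

class SoftForegroundGate(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, obj_feats):
        # obj_feats: (B, N, d_model) embedded object appearance features
        global_ctx = obj_feats.mean(dim=1, keepdim=True).expand_as(obj_feats)
        score = self.gate(torch.cat([obj_feats, global_ctx], dim=-1))   # (B, N, 1)
        weight = torch.sigmoid(score)                                   # per-object importance
        return obj_feats * weight                                       # attentive soft-foreground features
```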
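Likewise, the sketch below shows one way an adaptive object-relation attention could bias multi-head attention between object pairs with a fused semantic and geometric relation, mixed by a learned per-head weight. The pairwise geometry encoding and the fusion scheme are illustrative assumptions, not the paper's exact AORA design.

```python
# Sketch of an adaptive object-relation attention: the attention logits between object
# pairs combine a semantic relation term (query-key similarity) and a spatial geometry
# term (pairwise box-coordinate interaction), with a learned per-head fusion weight.
# Hypothetical formulation.
import torch
import torch.nn as nn

class AdaptiveRelationAttention(nn.Module):
    def __init__(self, d_model=512, nhead=8, geo_dim=4):
        super().__init__()
        self.nhead, self.d_head = nhead, d_model // nhead
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.geo_proj = nn.Linear(2 * geo_dim, nhead)       # pairwise geometry -> per-head bias
        self.fuse = nn.Parameter(torch.zeros(nhead))        # adaptive semantic/geometry mix
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, boxes):
        # x: (B, N, d_model) attentive soft-foreground features; boxes: (B, N, geo_dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.nhead, self.d_head).transpose(1, 2) for t in (q, k, v))
        sem = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5            # (B, H, N, N) semantic relation
        pair = torch.cat([boxes.unsqueeze(2).expand(B, N, N, -1),
                          boxes.unsqueeze(1).expand(B, N, N, -1)], dim=-1)
        geo = self.geo_proj(pair).permute(0, 3, 1, 2)                   # (B, H, N, N) geometry relation
        alpha = torch.sigmoid(self.fuse).view(1, -1, 1, 1)              # per-head fusion weight
        attn = torch.softmax(sem + alpha * geo, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```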