Introduction

The paper was published in June 2021 by Microsoft researchers from the Redmond division.

Object detection is one of the image processing tasks performed by deep learning algorithms. It answers the question of what objects are located in a given image and where.

Detectors have been developed by many researchers over a long period of time, yet most of them still share a de facto standard structure that unifies the previous works:

This common structure is Backbone + Neck (sometimes) + Head.

[Figure: RetinaNet architecture]

Note: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.

The structure above is that of RetinaNet, and the Backbone + Head structure is clearly visible. It is similar to many other single-shot networks such as SSD (Single Shot MultiBox Detector) and YOLO.

Here (a) is the backbone and (c) is the head. The image is processed in the order Backbone -> Head: the backbone extracts the features of the image, and the box and class branches of the head identify where the objects are located and what they are.

Here (b) is the part called the Neck. A neck usually improves the performance of the model, but it is not essential.
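The Backbone -> Neck (optional) -> Head flow described above can be sketched in a few lines of plain Python. This is only a toy illustration and not RetinaNet or any real detector: the names toy_backbone, toy_neck, toy_head and detect are hypothetical, feature maps are plain nested lists, and the "box" and "class" branches are reduced to picking the strongest activation and thresholding it.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[float]]                 # H x W grid of pixel intensities
Features = List[List[List[float]]]        # one feature map per scale
Detection = Tuple[str, Tuple[int, int]]   # (class label, (row, col))


def toy_backbone(image: Image) -> Features:
    """Toy backbone: the image itself plus a 2x-subsampled copy."""
    half = [row[::2] for row in image[::2]]
    return [image, half]


def toy_neck(features: Features) -> Features:
    """Toy neck: rescale every feature map by its peak value.

    A real neck (for example a feature pyramid) fuses scales instead;
    this stand-in only shows where such a refinement step would sit.
    """
    refined = []
    for fmap in features:
        peak = max(max(row) for row in fmap) or 1.0  # avoid dividing by zero
        refined.append([[v / peak for v in row] for row in fmap])
    return refined


def toy_head(features: Features) -> List[Detection]:
    """Toy head: one detection per scale."""
    detections = []
    for fmap in features:
        # "Box branch": where is the strongest response?
        value, row, col = max(
            (v, r, c) for r, line in enumerate(fmap) for c, v in enumerate(line)
        )
        # "Class branch": what is it? (assumed threshold of 0.5)
        label = "object" if value > 0.5 else "background"
        detections.append((label, (row, col)))
    return detections


def detect(
    image: Image,
    backbone: Callable[[Image], Features],
    head: Callable[[Features], List[Detection]],
    neck: Optional[Callable[[Features], Features]] = None,
) -> List[Detection]:
    """Run Backbone -> (Neck) -> Head; the neck is skipped when absent."""
    features = backbone(image)
    if neck is not None:
        features = neck(features)
    return head(features)
```

Calling detect(image, toy_backbone, toy_head) runs the pipeline without a neck, and passing neck=toy_neck inserts the refinement step between the two. In this toy setup a weak peak that falls below the threshold without the neck is labelled "object" once the neck rescales it, which mirrors the point that a neck helps but is not required.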

Many studies have shown that the performance of a detector often depends on how well the head is designed.

Then what are the conditions for a good head? According to the authors, "the challenges in developing a good object detection head can be summarized into three categories": the head should be scale-aware, spatial-aware, and task-aware.

The authors point out that most previous work has focused on solving only one of the three conditions, which means that a head satisfying all three had not yet been studied or built.

So an open problem remained. This motivated the authors to create a head that satisfies all three conditions: the Dynamic Head.