MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer


This paper addresses the question: "Can the strengths of CNNs and ViTs be combined to build a light-weight, low-latency model for mobile vision tasks?"

On the ImageNet-1k dataset, MobileViT reaches 78.4% top-1 accuracy, which is 3.2% and 6.2% higher than MobileNetv3 and DeIT, respectively, at a similar parameter count (~6M).

Introduction

Self-attention-based models (the ViT family) have emerged as an alternative to CNNs on vision tasks and now outperform them, but they are perceived as costly in terms of model size and latency.

However, many real-world applications require models that run on resource-constrained mobile devices. So far, the models that can run on mobile devices have all been CNN-based.

According to the authors, current ViT-based models underperform light-weight CNNs at a similar parameter count.

For example, DeIT with 5-6M parameters is about 3% less accurate than MobileNetv3.

In addition, because most ViT-based models lack image-specific inductive biases, they need many parameters, are hard to optimize, and require heavy data augmentation and L2 regularization.

The strengths and weaknesses of CNN-based and ViT-based models are as follows.

| | CNN based | ViT based |
| --- | --- | --- |
| Pros | Light-weight and fast; easy to train; spatial inductive bias | Long-range dependency; high performance |
| Cons | Local (limited receptive field) | Big & slow; hard to train (data aug., regularization) |

The authors set out to build a model for mobile vision tasks that combines the strengths of CNNs and transformers.

In particular, they designed "MobileViT" with a focus on being light-weight, general-purpose, and low-latency.

According to the authors, this is the first time a light-weight ViT has achieved performance on par with light-weight CNNs across a variety of vision tasks using a simple training recipe.

The key characteristics of MobileViT can be summarized as follows.

Better performance: At a similar parameter count, MobileViT outperforms existing CNN-based models on mobile vision tasks.
Generalization capability (the gap between training and evaluation metrics): Existing ViT-based models (with or without convolutions) generalize worse than CNN-based models, even with intensive augmentation.
Robust: Hyper-parameter tuning is time- and resource-consuming, so a good model should be robust to its hyper-parameters. Unlike existing ViT-based models, MobileViT trains with basic augmentation and is less sensitive to L2 regularization.

Related Works

Light-weight CNNs

MobileNets (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019), ShuffleNetv2 (Ma et al., 2018), ESPNetv2 (Mehta et al., 2019), MixNet (Tan & Le, 2019b), and MNASNet (Tan et al., 2019).

Vision transformers

ViT (Dosovitskiy et al., 2021), DeIT (Touvron et al., 2021a)

Subsequent works show that this substandard optimizability is due to the lack of spatial inductive biases in ViTs.

e.g., Graham et al., 2021; Dai et al., 2021; Liu et al., 2021; Wang et al., 2021; Yuan et al., 2021b; Chen et al., 2021b

Models that incorporate the strengths of convolutions:

ViT-C (Xiao et al., 2021), CvT (Wu et al., 2021), BoTNet (Srinivas et al., 2021), ConViT (d’Ascoli et al., 2021), PiT (Heo et al., 2021)

Most of these models are heavy-weight and need more parameters than CNN-based models to reach similar performance.

Proposed Method

Figure 1: Visual transformers vs. Mobile ViT

MobileViT consists of the following components (Figure 1(b)):

n x n convolution
MobileNetv2 block (MV2)
MobileViT block

MobileViT block

The MobileViT block aims to capture both local and global information with fewer parameters.


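First, to capture local spatial information, a standard $n \times n$ convolution and a $1 \times 1$ (point-wise) convolution are applied in sequence to produce $\mathbf{X}_{L} \in \mathbb{R}^{H \times W \times d}$, where $d > C$ and $C$ is the number of input channels.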

Next, to learn global representations, the block goes through three steps: unfold, Transformer, and fold.

$\mathbf{X}_{L}$ is unfolded into $N$ non-overlapping flattened patches $\mathbf{X}_{U} \in \mathbb{R}^{P \times N \times d}$, which are then fed to the transformer.

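Here $P = wh$, where $w$ and $h$ are the width and height of a patch, and $N = HW/P$, where $W$ and $H$ are the width and height of the tensor being unfolded.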
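As a quick check of these definitions, the arithmetic can be sketched in a few lines of Python. The helper name `patch_counts` and the example sizes (a 32 x 32 feature map with 2 x 2 patches) are illustrative assumptions, not values taken from the paper.

```python
def patch_counts(H, W, h, w):
    """Return (P, N): pixels per patch and number of patches.

    Hypothetical helper for illustration; assumes the patches tile the
    H x W map exactly (non-overlapping, no padding).
    """
    if H % h != 0 or W % w != 0:
        raise ValueError("H and W must be divisible by h and w")
    P = h * w            # P = wh
    N = (H * W) // P     # N = HW / P
    return P, N


P, N = patch_counts(H=32, W=32, h=2, w=2)
print(P, N)  # 4 256
```

With these sizes, each of the $P = 4$ pixel positions yields a sequence of $N = 256$ tokens, one per patch.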

$$ \mathbf{X}_{G}(p)=\operatorname{Transformer}\left(\mathbf{X}_{U}(p)\right), \quad 1 \leq p \leq P $$
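The equation says that the transformer is applied separately for each pixel position $p$, over the sequence of $N$ patches at that position. The sketch below illustrates only this indexing. As a loudly labeled assumption, it swaps the real Transformer for a projection-free single-head self-attention (no learned weights, no MLP, no normalization); `self_attention` and `apply_per_position` are hypothetical names.

```python
import math


def self_attention(seq):
    """Stand-in for a Transformer layer: softmax(Q K^T / sqrt(d)) V with Q = K = V = seq."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)                                  # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [wt / total for wt in weights]
        out.append([sum(wt * v[i] for wt, v in zip(weights, seq)) for i in range(d)])
    return out


def apply_per_position(X_U, layer=self_attention):
    """X_G(p) = layer(X_U(p)) for every pixel position p (index 0..P-1 here)."""
    return [layer(X_U_p) for X_U_p in X_U]


# Toy input with P = 2 positions, N = 3 patches, d = 2 channels.
X_U = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # tokens at position p = 0
    [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]],   # tokens at position p = 1
]
X_G = apply_per_position(X_U)
print(len(X_G), len(X_G[0]), len(X_G[0][0]))  # 2 3 2
```

The output keeps the $P \times N \times d$ layout of the input, and tokens at different positions $p$ never interact inside this step.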

Because MobileViT preserves both the order of the patches and the spatial order of the pixels within each patch, $\mathbf{X}_{G} \in \mathbb{R}^{P \times N \times d}$ can be folded back into the original shape.
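A minimal sketch of why folding is possible: the unfold step is a pure rearrangement, so it has an exact inverse. The functions `unfold` and `fold` below are hypothetical helpers written with nested Python lists, and the row-major ordering of patches and of pixels inside a patch is an assumption made for illustration; the authors' implementation may order them differently.

```python
def unfold(X, h, w):
    """Rearrange an H x W grid of feature vectors into a P x N layout."""
    H, W = len(X), len(X[0])
    assert H % h == 0 and W % w == 0, "patches must tile the grid exactly"
    cols = W // w                              # number of patches per row
    P, N = h * w, (H // h) * cols
    X_U = [[None] * N for _ in range(P)]
    for y in range(H):
        for x in range(W):
            p = (y % h) * w + (x % w)          # pixel position inside its patch
            n = (y // h) * cols + (x // w)     # index of the patch
            X_U[p][n] = X[y][x]
    return X_U


def fold(X_U, H, W, h, w):
    """Inverse of unfold: put every (p, n) entry back at its (y, x) location."""
    cols = W // w
    X = [[None] * W for _ in range(H)]
    for p, tokens in enumerate(X_U):
        for n, vec in enumerate(tokens):
            y = (n // cols) * h + p // w
            x = (n % cols) * w + p % w
            X[y][x] = vec
    return X


# Toy 4 x 4 map whose "features" are just the (y, x) coordinates.
X = [[(y, x) for x in range(4)] for y in range(4)]
X_U = unfold(X, h=2, w=2)
print(X_U[0])                        # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(fold(X_U, 4, 4, 2, 2) == X)    # True
```

Row `X_U[0]` collects the top-left pixel of every patch, which is the sequence the transformer sees for that pixel position.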