Introduction
This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performance.

The main contributions of the paper are claimed to be:
- We explore combinations of visual and semantic features, identified by the vision model (VM) and the language model (LM), and demonstrate their benefits. To the best of our knowledge, multi-modal feature enhancement with bi-directional fusion is a novel component that has not been explored before (a minimal sketch follows this list).
- We propose a new STR method, named MATRN, that contains three major components for better combination of the two modalities: spatial encoding for semantics, multi-modal feature enhancement, and a visual clue masking strategy. Thanks to the effective contributions of the proposed components, MATRN achieves state-of-the-art performance on seven STR benchmarks.
- We provide empirical analyses that illustrate how our components improve STR performance, as well as how MATRN addresses existing challenges.
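
To make the bi-directional fusion idea concrete, the following is a minimal PyTorch sketch of cross-attention-based enhancement in both directions, where each modality attends to the other. The dimensions, module structure, and the use of `nn.MultiheadAttention` are illustrative assumptions rather than MATRN's exact implementation.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Toy sketch of bi-directional multi-modal feature enhancement:
    visual features are enhanced by attending to semantic features,
    and semantic features are enhanced by attending to visual ones.
    All sizes are assumptions, not MATRN's actual configuration."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # semantic -> visual direction: visual features act as queries
        self.vis_from_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # visual -> semantic direction: semantic features act as queries
        self.sem_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis, sem):
        # vis: (B, H*W, D) flattened visual feature map from the VM
        # sem: (B, T, D)   per-character semantic features from the LM
        vis_enh, _ = self.vis_from_sem(query=vis, key=sem, value=sem)
        sem_enh, _ = self.sem_from_vis(query=sem, key=vis, value=vis)
        # residual connections preserve each modality's original information
        return self.norm_v(vis + vis_enh), self.norm_s(sem + sem_enh)
```

Both enhanced outputs can then be combined for the final character prediction, which is the sense in which the fusion is bi-directional rather than a one-way refinement of one modality by the other.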
Related Works
The development of STR algorithms can be characterized by a few major lines of work:
- Models like ViTSTR mainly focused on the visual feature processing module, without explicitly modeling or training a language module.

- Subsequent approaches, such as SRN and ABINet, utilized a separate language model and fused visual and semantic features for the final sequence prediction (a gated-fusion sketch appears after this list).


- Bhunia et al. proposed a multi-stage decoder that refers to visual features multiple times to enhance semantic features.

- Finally, VisionLAN proposed a language-aware visual mask that refers to semantic features to enhance the visual features. Given a masked character position in the word, the masking module occludes the visual feature maps corresponding to that character's region during training (see the masking sketch after this list).
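
For reference, the fusion step in methods like ABINet can be summarized as a learned gate that mixes aligned visual and semantic features before classification. The sketch below assumes per-character features of equal dimension and an illustrative class count; it is a simplified reading of such fusion modules, not a verbatim reimplementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of gated visual-semantic fusion in the style of ABINet:
    a sigmoid gate decides, per feature, how much to trust the visual
    versus the semantic stream. Sizes are illustrative assumptions."""

    def __init__(self, d_model=512, num_classes=37):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        self.cls = nn.Linear(d_model, num_classes)

    def forward(self, f_vis, f_sem):
        # f_vis, f_sem: (B, T, D) aligned per-character features
        g = torch.sigmoid(self.gate(torch.cat([f_vis, f_sem], dim=-1)))
        fused = g * f_vis + (1 - g) * f_sem   # element-wise gating
        return self.cls(fused)                # per-position character logits
```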
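
The language-aware masking idea can likewise be illustrated with a toy function that occludes the visual features of one character during training. The equal-width slot layout below is a simplifying assumption; VisionLAN locates the character region via learned attention rather than a uniform split.

```python
import torch

def mask_character_region(feat_map: torch.Tensor, char_idx: int, max_len: int) -> torch.Tensor:
    """Toy sketch of character-region masking in the spirit of VisionLAN:
    zero the feature-map columns roughly covering the char_idx-th character,
    assuming horizontal text divided into max_len equal-width slots.

    feat_map: (B, C, H, W) visual feature map
    char_idx: position of the character to occlude
    max_len:  assumed maximum word length
    """
    _, _, _, w = feat_map.shape
    slot = w / max_len
    start, end = int(char_idx * slot), int((char_idx + 1) * slot)
    masked = feat_map.clone()
    masked[..., start:end] = 0.0  # occlude the selected character region
    return masked
```

Training under such occlusion pushes the remaining visual features, together with the language model, to recover the missing character, which is the intuition behind using semantic clues to enhance visual features.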
