Untitled

What does it do?

sequential entity tagging and extraction

Problem

form documents pose unique challenges compared to natural language documents because of their structural characteristics (serialization errors are hard to address!)

Novelty / solution to the problem

Rich attention → replace ETC’s attention mechanism
- leverages the spatial relationship between tokens in a form for more precise attention score calculation
Super-Tokens → embedding
- builds a super-token for each word by embedding representations from its neighboring tokens through graph convolutions
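
A minimal sketch of the super-token idea, under stated assumptions: the graph here is a plain k-nearest-neighbor graph over box centers and the update is a single mean-aggregation step. Both are simplifications chosen for illustration, not the graph construction or GCN layer from the paper. `knn_adjacency`, `super_tokens`, and `weight` are hypothetical names.

```python
import numpy as np

def knn_adjacency(centers: np.ndarray, k: int) -> np.ndarray:
    """Symmetric 0/1 adjacency linking each token to its k nearest tokens (assumes k < n)."""
    n = centers.shape[0]
    # Pairwise Euclidean distances between box centers.
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a token is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected

def super_tokens(embeddings: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One round of mean-aggregation message passing followed by a ReLU."""
    adj_self = adj + np.eye(adj.shape[0])          # keep each token's own embedding
    degree = adj_self.sum(axis=1, keepdims=True)
    aggregated = (adj_self @ embeddings) / degree  # mean over self + neighbors
    return np.maximum(aggregated @ weight, 0.0)

# Three hypothetical tokens on one line; the third is far from the first two.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
adj = knn_adjacency(centers, k=1)
tokens = super_tokens(np.eye(3), adj, np.eye(3))
```

Each output row mixes a token's own embedding with those of its layout neighbors, so the representation no longer depends only on the serialized order.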

Pretraining

MLM(masked language modeling) only!
steps
1. OCR : document -> OCR words + bboxes
2. BERT-multilingual vocabulary (tokenize the extracted OCR words) : OCR words + bboxes -> tokens
3. GCN (embedding: graph construction & message passing(?)) : tokens & bboxes(2d coordinates) -> super-tokens (graph embedding)
4. ETC(extended transformer construction) w/ Rich Attention : super-tokens -> entity BIOES logits
5. Viterbi (decode and obtain the final entities for output.): entity BIOES logits -> entity extraction outputs
setup
- max sequence length : 1024
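
Step 5 can be sketched as constrained Viterbi decoding. This is a simplified illustration, assuming a single entity type (tags `O`, `B`, `I`, `E`, `S`) and hard transition constraints with no learned transition scores; the real setup has one BIOES tag set per entity type. `viterbi_bioes` and `allowed` are hypothetical names.

```python
import numpy as np

TAGS = ["O", "B", "I", "E", "S"]

def allowed(prev: str, curr: str) -> bool:
    """BIOES constraint: an open span (B or I) must continue with I or E."""
    if prev in ("B", "I"):
        return curr in ("I", "E")
    return curr in ("O", "B", "S")

def viterbi_bioes(logits: np.ndarray) -> list:
    """Return the highest-scoring valid tag sequence for logits of shape (n, 5)."""
    n, t = logits.shape
    trans = np.array([[0.0 if allowed(p, c) else -np.inf for c in TAGS] for p in TAGS])
    # A sequence can only start with O, B, or S, and cannot end inside a span.
    start = np.array([0.0 if tag in ("O", "B", "S") else -np.inf for tag in TAGS])
    end = np.array([0.0 if tag in ("O", "E", "S") else -np.inf for tag in TAGS])
    score = start + logits[0]
    back = np.zeros((n, t), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + trans        # cand[p, c]: best score ending in p, then c
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logits[i]
    best = int((score + end).argmax())
    path = [best]
    for i in range(n - 1, 0, -1):            # follow back-pointers to the first token
        best = int(back[i, best])
        path.append(best)
    return [TAGS[j] for j in reversed(path)]

# Greedy per-token argmax would give B, O, E, which is not a valid BIOES sequence.
logits = np.array([
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
])
decoded = viterbi_bioes(logits)
```

Here the decoder picks `I` for the middle token even though `O` has the higher logit, because only `B, I, E` forms a valid span with the surrounding tags.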

RichAtt(Rich Attention)

replace ETC’s attention with rich attention
why not just use ETC?
- ETC uses relative positional encoding → however, token offsets measured on the error-prone serialization may limit the power of positional encoding
Rich attention
- avoids the deficiencies of absolute and relative embeddings by avoiding embeddings entirely → computes the order of and log distance between pairs of tokens with respect to the x and y axes on the layout grid
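
A small sketch of the per-axis order and log-distance features, assuming token positions are taken from bounding-box centers (the notes do not say which box coordinate is used, so that choice is an assumption). `order_and_log_distance` and the sample `boxes` are hypothetical.

```python
import numpy as np

def order_and_log_distance(coords: np.ndarray):
    """coords: shape (n,), token positions along one axis.

    Returns order[i, j] = 1 if coords[i] < coords[j] else 0,
    and log_dist[i, j] = ln(1 + |coords[i] - coords[j]|).
    """
    diff = coords[None, :] - coords[:, None]   # diff[i, j] = coords[j] - coords[i]
    order = (diff > 0).astype(np.int64)
    log_dist = np.log1p(np.abs(diff))
    return order, log_dist

# Hypothetical boxes as (x_min, y_min, x_max, y_max): two tokens on one line,
# and a third directly below the first.
boxes = np.array([
    [10, 10, 50, 20],
    [60, 10, 120, 20],
    [10, 40, 50, 50],
], dtype=float)
centers_x = (boxes[:, 0] + boxes[:, 2]) / 2
centers_y = (boxes[:, 1] + boxes[:, 3]) / 2

order_x, dist_x = order_and_log_distance(centers_x)
order_y, dist_y = order_and_log_distance(centers_y)
```

Tokens 0 and 2 are far apart in a line-by-line serialization but have zero distance on the x axis, which is the kind of layout relationship that serialized offsets miss.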
Attention Score ( $S_{ij}$)
- $S_{ij} = q_i^\top k_j + S^{o}_{ij} + S^{d}_{ij}$, where $S^{o}_{ij}$ is the order term (cross entropy) and $S^{d}_{ij}$ is the distance term ($L_2$ loss)
- $q_i = \mathrm{affine}^{(q)}(h_i), \quad k_j = \mathrm{affine}^{(k)}(h_j)$
  - for each pair of token representations
    - $h_i^l, h_j^l$ ( $l$ : each attention head )
    - actual order : $o_{ij} = \mathrm{int}(i < j)$
    - log-distance : $d_{ij} = \ln(1 + |i - j|)$
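  - attention score between word i and word j: i < j → $S^{o}_{ij} = o_{ij}\ln(p_{ij})$ → higher attention score → a word gives a higher attention score to words on its left. (In English, adjectives sit to the left of nouns, so when a sentence is ordered <adjective, noun>, the noun attends more to the adjective right before it.)
  - when the distance between word i and word j is short, $d_{ij}$ is small and $S^{d}_{ij}$ becomes larger (its absolute value shrinks) → higher attention score → a word gives a high attention score to words close to itself.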
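
A numeric sketch of the score above for the 1D case. It assumes the order term is the negative binary cross entropy between $o_{ij}$ and a predicted probability $p_{ij}$, and the distance term is a negative scaled squared error between $d_{ij}$ and a predicted log-distance $\mu_{ij}$. The scale `theta` and the function name `rich_attention_scores` are hypothetical, and `p` and `mu` are passed in directly here rather than predicted from the token pair as they would be in the model.

```python
import numpy as np

def rich_attention_scores(q, k, p, mu, theta=1.0):
    """q, k: (n, d) queries and keys; p, mu: (n, n) predicted order probability
    and predicted log-distance for each token pair (assumed inputs)."""
    n = q.shape[0]
    idx = np.arange(n)
    order = (idx[:, None] < idx[None, :]).astype(float)        # o_ij = 1 if i < j
    log_dist = np.log1p(np.abs(idx[:, None] - idx[None, :]))   # d_ij = ln(1 + |i - j|)
    p = np.clip(p, 1e-9, 1.0 - 1e-9)                           # avoid log(0)
    s_content = q @ k.T                                        # q_i^T k_j
    s_order = order * np.log(p) + (1.0 - order) * np.log(1.0 - p)
    s_dist = -0.5 * theta**2 * (log_dist - mu) ** 2
    return s_content + s_order + s_dist

n, d = 3, 2
q = np.zeros((n, d))
k = np.zeros((n, d))
uniform_p = np.full((n, n), 0.5)

# If the model expects every pair to be adjacent (mu = 0), farther tokens are penalized more.
scores = rich_attention_scores(q, k, uniform_p, mu=np.zeros((n, n)))
```

Both bias terms are at most zero, so they can only lower a content score: a pair is left untouched when the predicted order and distance match the actual ones, and penalized otherwise.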
what do those bias terms do?
- penalizing attention edges for violating soft order/distance constraints
- → the model learns logical implication rules, for example:
- “Lazy” is on the right, so it does not modify “crow”, and its attention edge is penalized.
- “Sly” is many tokens away, so its attention edge is penalized.
- “Cunning” does not receive a large penalty, so by the process of elimination above (from the Negative Error) it is selected as the most likely adjective to attend to.

Super-Token by Graph Learning