Preliminary
- Typical LLM training pipeline
- Pre-training autoregressively (AR) on large amounts of data
- Instruction-tuning to align the model with human preferences
- Supervised fine-tuning (SFT), RLHF
- Closed LLMs (ChatGPT, Bard, Claude) are typically trained heavily on human preferences → their performance is generally hard to reproduce and requires significant cost and human annotation
- Alignment has commonly been done with RLHF + PPO, but these days DPO (Direct Preference Optimization) is increasingly used (see the loss sketch after this list)
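As a concrete reference for that last point, below is a minimal PyTorch sketch of the DPO objective (Rafailov et al., 2023). It assumes the summed log-probabilities of the chosen (y_w) and rejected (y_l) responses have already been computed under the policy being trained and under a frozen reference model; the function name, argument names, and the default beta are illustrative choices, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
             policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
             ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x),   shape (batch,)
             ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x),   shape (batch,)
             beta: float = 0.1) -> torch.Tensor:
    """DPO: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # The margin measures how much more the policy prefers y_w over y_l
    # than the frozen reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

Unlike RLHF with PPO, this is an offline, supervised-style loss over preference pairs: no separate reward model or on-policy sampling loop is needed, which is a large part of why DPO has become popular.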
LLaMa-2, 2-Chat (LLaMa-1)
LLaMa-1 link: https://arxiv.org/pdf/2302.13971.pdf
LLaMa-2, 2-Chat link: https://arxiv.org/pdf/2307.09288.pdf
Scales
- 7B, 13B, 34B (trained but not released), 70B
Pre-training
LLaMa-1 pre-training data (sampling proportion, disk size); see the mixture-sampling sketch after the list
- CommonCrawl (67%, 3.3TB)
- C4 (15%, 783GB): cleaned version of CommonCrawl
- GitHub (4.5%, 328GB)
- Wikipedia (4.5%, 83GB)
- Gutenberg and Books3 (4.5%, 85GB)
- ArXiv (2.5%, 92GB)
- StackExchange (2%, 78GB)
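For illustration, here is a small Python sketch of how a pre-training loader could pick the source corpus for each document according to the proportions above. Only the mixture weights come from the LLaMa-1 paper; the function name, the dictionary, and the demo loop are hypothetical, not the actual LLaMa data pipeline.

```python
import random

# Sampling proportions from the LLaMa-1 paper; only the mixture weights are shown,
# the readers behind each source are omitted.
SAMPLING_PROPORTIONS = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Gutenberg and Books3": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names = list(SAMPLING_PROPORTIONS)
    weights = [SAMPLING_PROPORTIONS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in SAMPLING_PROPORTIONS}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    print(counts)  # counts roughly follow the 67% / 15% / ... mixture above
```

In the paper, these weights mean the small, high-quality sources (Wikipedia, Books) are seen for roughly two epochs during pre-training, while the larger sources are seen approximately once.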