Preliminary
- Typical LLM training pipeline
- Pre-training autoregressively (AR) on large amounts of data
- Instruction-tuning to align the model with human preferences
- Supervised fine-tuning (SFT), RLHF
- Closed LLMs (ChatGPT, Bard, Claude) are typically trained heavily on human preferences → their performance is generally hard to reproduce and requires significant cost and human annotation
- Alignment has commonly been done with RLHF + PPO, but these days DPO (Direct Preference Optimization) is increasingly used (see the loss sketch after this list)
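As a concrete reference for that last point, below is a minimal PyTorch sketch of the DPO objective (Rafailov et al., 2023). It assumes the summed log-probabilities of the chosen (y_w) and rejected (y_l) responses have already been computed under the policy being trained and under a frozen reference model; the function name, argument names, and the default beta are illustrative choices, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
             policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
             ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x),   shape (batch,)
             ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x),   shape (batch,)
             beta: float = 0.1) -> torch.Tensor:
    """DPO: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # The margin measures how much more the policy prefers y_w over y_l
    # than the frozen reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

Unlike RLHF with PPO, this is an offline, supervised-style loss over preference pairs: no separate reward model or on-policy sampling loop is needed, which is a large part of why DPO has become popular.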
LLaMa-2, 2-Chat (LLaMa-1)
LLaMa-1 link: https://arxiv.org/pdf/2302.13971.pdf
LLaMa-2, 2-Chat link: https://arxiv.org/pdf/2307.09288.pdf
Scales
- 7B, 13B, 34B (trained but not released), 70B
Pre-training
LLaMa-1 pre-training data (sampling proportion, disk size); see the mixture-sampling sketch after the list
- CommonCrawl (67%, 3.3TB)
- C4 (15%, 783GB): cleaned version of CommonCrawl
- GitHub (4.5%, 328GB)
- Wikipedia (4.5%, 83GB)
- Gutenberg and Books3 (4.5%, 85GB)
- ArXiv (2.5%, 92GB)
- StackExchange (2%, 78GB)
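For illustration, here is a small Python sketch of how a pre-training loader could pick the source corpus for each document according to the proportions above. Only the mixture weights come from the LLaMa-1 paper; the function name, the dictionary, and the demo loop are hypothetical, not the actual LLaMa data pipeline.

```python
import random

# Sampling proportions from the LLaMa-1 paper; only the mixture weights are shown,
# the readers behind each source are omitted.
SAMPLING_PROPORTIONS = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Gutenberg and Books3": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names = list(SAMPLING_PROPORTIONS)
    weights = [SAMPLING_PROPORTIONS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in SAMPLING_PROPORTIONS}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    print(counts)  # counts roughly follow the 67% / 15% / ... mixture above
```

In the paper, these weights mean the small, high-quality sources (Wikipedia, Books) are seen for roughly two epochs during pre-training, while the larger sources are seen approximately once.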