CLIP

Contrastive Language-Image Pre-training

CLIP: Contrastive Language-Image Pre-training

paper :
https://arxiv.org/abs/2103.00020
code :
https://github.com/openai/CLIP
referenced blog :
https://xoft.tistory.com/67

Intro

CLIP : text와 image 간의 관계를 사용하는 다양한 task에 적용 가능
CLIP은 large batch-size를 필요로 하므로
요즘은 softmax 대신 sigmoid 함수를 사용하는 SigLIP 많이 사용

Contrastive Pre-training

contrastive learning :
labeling 없는 self-supervised learning 기법 중 하나로,
같은 class라면 embedding distance를 최소화하고
다른 class라면 embedding distance를 최대화한다
- contrastive loss
  \(L_{cont}^m(x_i, x_j, f) = 1 \{ y_i = y_j \} \| f(x_i) - f(x_j) \|^2 + 1 \{ y_i \neq y_j \} \text{max}(0, m - \| f(x_i) - f(x_j) \|^2)\)
- triplet loss
  \(L_{trip}^m(x, x^{+}, x^{-}, f) = max(0, \| f(x) - f(x^{+}) \|^2 - \| f(x) - f(x^{-}) \|^2 + m)\)
- \(N+1\) - Tuplet loss
  \(L_{tupl}(x, x^{+}, \{ x_{i}^{-} \}_{i=1}^{N-1}, f) = log(1 + \Sigma_{i=1}^{N-1}\text{exp}(f(x)^T f(x_{i}^{-}) - f(x)^T f(x^{+}))) = - log(\frac{\text{exp}(f(x)^T f(x^{+}))}{\text{exp}(f(x)^T f(x^{+})) + \Sigma_{i=1}^{N-1} \text{exp}(f(x)^T f(x_{i}^{-}))})\)

cosine similarity matrix가 identity matrix (I) 에 가깝도록 학습

image-text pair로 구성된 dataset에 대해
image, text를 각각 encoder로 embedding한 뒤
같은 pair는 거리 최소화하고
다른 pair는 거리 최대화하도록
constrative learning으로 두 encoder를 학습

Application

dataset classifier 만들기 또는 zero-shot prediction 등에
pre-trained CLIP model 사용 가능