Object Detection


Computer vision notes from the course
[IN2375 Computer Vision - Detection, Segmentation and Tracking], taken at the Technical University of Munich

Old Approach

Template matching with sliding window

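loss : MSE (SSD, sum of squared differences) / NCC (normalized cross-correlation) / ZNCC (zero-mean NCC), each measuring the difference between the raw image patch under the window and the template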

Drawbacks : self-occlusions / change in appearance / change in position / change in scale or aspect ratio
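
The sliding-window matching above can be sketched in a few lines. This is only an illustrative sketch, assuming grayscale images stored as nested lists; the helper names `ssd`, `zncc` and `match_template` are hypothetical and not from the course.

```python
import math

def ssd(patch, template):
    """Sum of squared differences; lower means a better match."""
    return sum((p - t) ** 2 for p, t in zip(patch, template))

def zncc(patch, template):
    """Zero-mean normalized cross-correlation in [-1, 1]; higher is better."""
    mp = sum(patch) / len(patch)
    mt = sum(template) / len(template)
    num = sum((p - mp) * (t - mt) for p, t in zip(patch, template))
    den = math.sqrt(sum((p - mp) ** 2 for p in patch)
                    * sum((t - mt) ** 2 for t in template))
    return num / den if den > 0 else 0.0

def match_template(image, template, score=ssd, best=min):
    """Slide the template over the image; return (row, col) of the best window."""
    H, W = len(image), len(image[0])
    h, w = len(template), len(template[0])
    flat_t = [v for row in template for v in row]
    scores = {}
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            patch = [image[r + i][c + j] for i in range(h) for j in range(w)]
            scores[(r, c)] = score(patch, flat_t)
    return best(scores, key=scores.get)

image = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 3, 4, 0],
    [0, 0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]
pos_ssd = match_template(image, template)                         # minimize SSD
pos_zncc = match_template(image, template, score=zncc, best=max)  # maximize ZNCC
```

Both scores locate the template at the same window here. ZNCC subtracts the mean and normalizes, so it tolerates brightness and contrast changes, but neither score helps with the scale or aspect-ratio changes listed above.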

Viola-Jones detector (1-stage)

HOG (Histogram of Oriented Gradients) (1-stage)

Proposals(RoIs) by selective search or edge box (2-stage)

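+) In NMS, the confidence score of \(b_i\) is compared with those of the similar boxes \(b_j\) to decide whether to remove \(b_i\). If a box can be compared and removed (N) only when its IoU exceeds \(IoU_{threshold}\), then fewer boxes are suppressed as the threshold grows: high \(IoU_{threshold}\) -> less FN, more FP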
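
The effect of the threshold is easy to see in a small greedy NMS sketch. This assumes boxes in `(x1, y1, x2, y2)` format and is a simplified single-class illustration, not the course's reference implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept_low = nms(boxes, scores, iou_threshold=0.5)   # overlapping box 1 is removed
kept_high = nms(boxes, scores, iou_threshold=0.9)  # all three boxes survive
```

The first two boxes overlap with IoU of about 0.68, so the lower threshold removes the duplicate while the higher one keeps it, which is the "more FP" side of the trade-off.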

Detection Evaluation

TP : predicted positive (BB), and the prediction was correct (true)
FN : predicted negative (no BB), and the prediction was wrong (false)

precision / recall / F1-score
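
As a quick reference, the three metrics follow directly from the TP / FP / FN counts. The function name and the example counts below are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # of all predicted BBs, how many are correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # of all ground-truth BBs, how many were found
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 6 correct boxes, 2 spurious boxes, 4 missed objects.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
```

With these counts precision is 0.75 and recall is 0.6.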

Sort the predictions P by confidence score, then compute AP (average precision)
mAP is the AP averaged over the object categories
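
A minimal sketch of that computation follows. It assumes each detection has already been marked as TP or FP by IoU matching against the ground truth, and it uses the all-point interpolated area under the precision-recall curve, which is one common convention; benchmarks differ in the details.

```python
def average_precision(detections, num_gt):
    """detections: list of (confidence, is_true_positive); num_gt: number of GT boxes."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: make precision non-increasing when read from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for prec, rec in zip(precisions, recalls):
        ap += (rec - prev_recall) * prec
        prev_recall = rec
    return ap

def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (detections, num_gt)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps)

dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
ap = average_precision(dets, num_gt=2)
m_ap = mean_average_precision({"car": (dets, 2), "person": ([(0.9, True)], 1)})
```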

R-CNN (2-stage)

R-CNN : extract RoI -> crop&warp -> CNN -> SVM & BB reg

Advantages : CNN / transfer learning

Drawbacks :

  • slow (~2k proposals per image are each warped and forwarded through the CNN)
    -> solved in Fast R-CNN with SPP

  • object proposal algorithm is fixed
    -> solved in Faster R-CNN with RPN

  • not end-to-end (CNN and SVM & BB reg are trained separately)
    -> solved in Faster R-CNN with RPN

Fast R-CNN : CNN -> extract RoI -> SPP (RoI Pooling) -> fc -> classifier & BB reg

SPP (= Spatial Pyramid Pooling) : placed right before the fc layer, it produces a fixed-length output, so any input size is allowed

Solves only drawback 1. of R-CNN
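
Why pooling into a fixed number of bins removes the input-size constraint can be sketched as follows. This is a simplified single-channel illustration on nested lists; the bin-edge rule (floor start, ceil end) and the helper names are assumptions, not the exact scheme of any particular implementation.

```python
def bin_edges(length, n_bins, i):
    """Start (inclusive) and end (exclusive) of bin i when splitting `length` into n_bins."""
    start = (i * length) // n_bins
    end = -((-(i + 1) * length) // n_bins)  # ceil division
    return start, end

def roi_max_pool(feature, roi, out_size):
    """Max-pool the region roi = (r0, c0, r1, c1), end-exclusive, to out_size x out_size."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    out = []
    for i in range(out_size):
        rs, re = bin_edges(h, out_size, i)
        row = []
        for j in range(out_size):
            cs, ce = bin_edges(w, out_size, j)
            row.append(max(feature[r0 + r][c0 + c]
                           for r in range(rs, re) for c in range(cs, ce)))
        out.append(row)
    return out

def spp(feature, levels=(1, 2)):
    """Concatenate pooled grids of several sizes into one fixed-length vector."""
    H, W = len(feature), len(feature[0])
    vec = []
    for n in levels:
        for row in roi_max_pool(feature, (0, 0, H, W), n):
            vec.extend(row)
    return vec

small = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]                    # 3 x 3 feature map
large = [[c + 10 * r for c in range(7)] for r in range(5)]   # 5 x 7 feature map
vec_small, vec_large = spp(small), spp(large)                # both have length 1 + 4
roi_out = roi_max_pool(large, (1, 2, 4, 6), 2)               # one RoI -> 2 x 2 grid
```

Both feature maps yield a vector of the same length, which is what the fc layer needs. RoI pooling applies the same idea with a single pyramid level to each proposal on the shared feature map.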

Faster R-CNN : CNN -> RPN (loss 1., 2.) -> RoI Pooling -> fc -> classifier & BB reg (loss 3., 4.)

RPN (= Region Proposal Network) : output shape (H, W, 5n)

  • n anchors per location
  • 1 confidence score per anchor
  • 4 normalized anchor coordinates per anchor

Solves all of drawbacks 1., 2., 3. of R-CNN
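
To make the (H, W, 5n) shape concrete, here is a hedged sketch of how n anchors per location could be generated and how four predicted offsets could be applied to one anchor. The stride, scales and ratios are illustrative values, and the `decode` parameterization (center shift scaled by anchor size, log-scale width and height) is an assumption following the common Faster R-CNN convention.

```python
import math

def make_anchors(H, W, stride, scales, ratios):
    """Return H * W * n anchors as (cx, cy, w, h), with n = len(scales) * len(ratios)."""
    anchors = []
    for y in range(H):
        for x in range(W):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center in image pixels
            for s in scales:
                for ratio in ratios:            # ratio = width / height
                    w = s * math.sqrt(ratio)
                    h = s / math.sqrt(ratio)    # keeps the area at s * s
                    anchors.append((cx, cy, w, h))
    return anchors

def decode(anchor, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h)."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = deltas
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

H, W = 2, 3
scales, ratios = (32, 64), (0.5, 1.0, 2.0)
anchors = make_anchors(H, W, stride=16, scales=scales, ratios=ratios)
n = len(scales) * len(ratios)
values_per_location = 5 * n   # 1 confidence score + 4 coordinates for each anchor
```

With 2 scales and 3 ratios, n = 6, so each of the H x W locations carries 30 output values.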

FPN (= Feature Pyramid Network) :

define RPN on each level of FPN
mitigates the scale variance problem
the high-scale pyramid level also detects small objects, so more TP and more FP
But, drawback : model complexity

1-stage

YOLO (= You Only Look Once) : merge losses 3., 4. of Faster R-CNN into losses 1., 2.!

output shape (S, S, (5+C)n) instead of (H, W, 5n)

Advantages : efficient, faster
Drawbacks : less accurate (coarse grid resolution)
single scale (cannot detect small objects, vulnerable to scale variation)
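
The (S, S, (5+C)n) output can be read as n box predictions per grid cell, each carrying 5 box values plus C class scores. The sketch below splits one cell's vector accordingly; the ordering inside each chunk (confidence first, then 4 coordinates, then class scores) is an assumption for illustration and varies between implementations.

```python
def output_size(S, n, num_classes):
    """Total number of values in an (S, S, (5 + C) * n) output tensor."""
    return S * S * (5 + num_classes) * n

def split_cell(cell, n, num_classes):
    """Split one grid cell's vector of length (5 + C) * n into n box predictions."""
    step = 5 + num_classes
    assert len(cell) == step * n
    boxes = []
    for k in range(n):
        chunk = cell[k * step:(k + 1) * step]
        boxes.append({
            "confidence": chunk[0],           # objectness of this box
            "box": tuple(chunk[1:5]),         # 4 box coordinates
            "class_scores": list(chunk[5:]),  # C class scores
        })
    return boxes

# Hypothetical setting: 7 x 7 grid, 2 boxes per cell, 3 classes.
total = output_size(S=7, n=2, num_classes=3)
cell = list(range(16))                        # stand-in for one cell's (5 + 3) * 2 values
preds = split_cell(cell, n=2, num_classes=3)
```

Because classification and box regression come out of the same tensor, a single forward pass is enough, which is where the speed advantage comes from.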

SSD (= Single Shot Multibox Detector) : use multi-scale!

Advantages : solves the drawbacks of YOLO
Drawbacks : still less accurate than two-stage detectors due to class imbalance
data augmentation is important

class imbalance problem :

a two-stage detector can filter out most negative anchors in its first stage, but
a one-stage detector cannot, which causes the class imbalance problem

Remedies :

  • hard negative mining : add the samples that were hard, to reduce FP errors
  • focal loss = \(-(1-p)^r \log(p)\) : the factor \((1-p)^r\) shrinks the contribution of the abundant easy examples (p ~ 1) and puts the weight on the scarce hard examples (p ~ 0)
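
The down-weighting effect of focal loss is easy to check numerically. In this sketch `p` is the predicted probability of the true class and `r` is the exponent from the formula; the default `r=2.0` is only a commonly used value, chosen here for illustration.

```python
import math

def cross_entropy(p):
    """Plain cross-entropy for the true class with predicted probability p."""
    return -math.log(p)

def focal_loss(p, r=2.0):
    """Focal loss -(1 - p)^r * log(p); r = 0 recovers plain cross-entropy."""
    return -((1.0 - p) ** r) * math.log(p)

easy, hard = 0.9, 0.1
easy_ratio = focal_loss(easy) / cross_entropy(easy)  # (1 - 0.9)^2, about 0.01
hard_ratio = focal_loss(hard) / cross_entropy(hard)  # (1 - 0.1)^2, about 0.81
```

The easy example keeps about 1% of its cross-entropy loss while the hard one keeps about 81%, so the many easy negatives no longer dominate the total loss.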

RetinaNet : existing 1-stage approach + multi-scale (FPN) + focal loss

accuracy : YOLO < SSD < two-stage detector < RetinaNet

spatial transformer :

the grid generator produces the sampling grid, which is then sampled with bilinear interpolation
= localisation & certain transformation
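
The grid generator and bilinear sampler can be sketched as below. In an actual spatial transformer the 2x3 affine matrix `theta` would be predicted by the localisation network; here it is passed in by hand. Pixel-unit coordinates and border clamping are simplifying assumptions, and the function names are hypothetical.

```python
import math

def affine_grid(theta, out_h, out_w):
    """Grid generator: map every output pixel (x, y) to a source location (sx, sy)."""
    grid = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            sx = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            sy = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((sx, sy))
        grid.append(row)
    return grid

def bilinear_sample(image, x, y):
    """Sample the image at a real-valued location using its 4 nearest pixels."""
    H, W = len(image), len(image[0])
    x = min(max(x, 0.0), W - 1.0)   # clamp to the image border
    y = min(max(y, 0.0), H - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    ax, ay = x - x0, y - y0         # fractional offsets act as interpolation weights
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom

def spatial_transform(image, theta, out_h, out_w):
    grid = affine_grid(theta, out_h, out_w)
    return [[bilinear_sample(image, sx, sy) for sx, sy in row] for row in grid]

image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shift = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # sample half a pixel right and down
same = spatial_transform(image, identity, 3, 3)
shifted = spatial_transform(image, shift, 2, 2)
```

Bilinear interpolation is what keeps the sampling step differentiable with respect to the sampling locations, so the transformation parameters can be learned by backpropagation.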
