Unsupervised Learning of Depth and Ego-Motion from Video
paper :
https://arxiv.org/abs/1704.07813
code :
https://github.com/tinghuiz/SfMLearner
Key summary :
- monocular camera
- unsupervised learning by reconstruction loss (view synthesis)
- view synthesis (reconstruct target view from source view) by projection and warping
- when projecting, use the depth and pose obtained below
- use 3 networks
- single-view depth CNN
- multi-view pose CNN
- explainability soft mask
single-view depth estimation by per-pixel depth map
multi-view camera motion (= ego-motion = pose) by 6-DoF transformation matrices
unsupervised learning : uses view synthesis (reconstruction term) as supervision instead of direct GT data
Assumption :
The scenes we are interested in are mostly rigid, so changes across different frames are dominated by camera motion
\(p\) : index of target view’s pixel coordinates
\(s\) : index of source views
\(I_{t}(p)\) : target view
\(\hat I_{s}(p)\) : source view warped to target coordinate frame (= reconstructed target view) using predicted depth \(\hat D_{t}\) and \(4 \times 4\) camera transformation matrix \(\hat T_{t \rightarrow s}\) and source view \(I_{s}\)
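Putting these definitions together, the view synthesis objective of the paper is the L1 photometric error between the target view and each reconstructed target view, summed over all source views and pixels:

\[ \mathcal{L}_{vs} = \sum_{s} \sum_{p} \left| I_{t}(p) - \hat I_{s}(p) \right| \]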
Obtain the depth prediction \(\hat D_{t}\) from the target view (single view) via the Depth CNN
Obtain the \(4 \times 4\) camera transformation matrix \(\hat T_{t \rightarrow s}\) from the target & source views (multi-view) via the Pose CNN
Project : to find the corresponding location, project the target view's pixel coordinates onto the source view's coordinates
Why is a depth map needed in the middle when projecting the target view's pixel coordinates onto the source view's coordinates???
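For reference, the projection in the paper (its Eq. 2, as I recall) maps a homogeneous target pixel \(p_{t}\) to its source-view location \(p_{s}\) using the camera intrinsics \(K\) together with the predicted depth and pose:

\[ p_{s} \sim K \, \hat T_{t \rightarrow s} \, \hat D_{t}(p_{t}) \, K^{-1} \, p_{t} \]

As far as I understand, the depth \(\hat D_{t}(p_{t})\) is what lifts the 2D pixel into a 3D point before the \(4 \times 4\) transform and re-projection.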
Obtain values by interpolation (bilinear sampling; see the sketch after the assumptions below), then warp to the target coordinates (= reconstructed target view)
Assumption :
objects are static except the camera (changes are dominated by camera motion)
Only if the objects do not move can the Depth CNN and Pose CNN project with respect to the same coordinates.
there is no occlusion/disocclusion between target view and source view
If an object is occluded and invisible in either the target view or any of the source views, there is no projection information, which hinders training.
surface is Lambertian so that photo-consistency error is meaningful
Assume the surface is isotropic, i.e., looks equally bright from any viewing direction \(\rightarrow\) a photo-consistency difference then indicates a different surface.
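Since the projected coordinates \(p_{s}\) are continuous values, the source view has to be sampled between pixel centers; the paper does this with differentiable bilinear sampling. Below is a minimal numpy sketch of plain bilinear interpolation just to illustrate the idea; it is not the actual SfMLearner implementation, which performs the sampling differentiably inside the graph.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinear interpolation of img (H, W, C) at a continuous location (x, y).
    Plain numpy illustration of the sampling used in the warping step."""
    H, W = img.shape[:2]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    x0, y0 = max(x0, 0), max(y0, 0)
    wx, wy = x - np.floor(x), y - np.floor(y)        # fractional offsets
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

# e.g. sample a tiny "source view" at a projected, non-integer pixel location
src = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
print(bilinear_sample(src, 1.3, 2.7))
```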
To improve robustness, train an additional network that predicts an explainability soft mask \(\hat E_{s}\) (= per-pixel weight), and incorporate it into the reconstruction loss term.
Since deep-learning models are black boxes, explainability is an important factor.
To prevent the trivial solution \(\hat E_{s} = 0\), add a regularization term that encourages nonzero predictions of \(\hat E_{s}\).
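In formula form, the mask-weighted reconstruction loss is

\[ \mathcal{L}_{vs} = \sum_{s} \sum_{p} \hat E_{s}(p) \left| I_{t}(p) - \hat I_{s}(p) \right| \]

and the regularization term \(\mathcal{L}_{reg}(\hat E_{s})\) is, if I remember correctly, a cross-entropy loss against a constant label 1 at every pixel, so the network is penalized for marking everything as unexplainable.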
Since the reconstruction loss comes directly from pixel intensity differences, training is hindered when \(p_{s}\), i.e., the point projected with the GT depth & pose, lies in a low-texture region or far from the current estimate (a common issue in motion estimation)
\(\rightarrow\) Solution 1. use a conv. encoder-decoder with a small bottleneck
\(\rightarrow\) Solution 2. add multi-scale and smoothness loss terms
(less sensitive to architecture choices, so this paper adopts Solution 2.)
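As far as I remember, the smoothness term is the L1 norm of the second-order gradients of the predicted depth map, and the final objective sums the three terms over scales \(l\) and source views \(s\) with weights \(\lambda_{s}\) and \(\lambda_{e}\):

\[ \mathcal{L}_{final} = \sum_{l} \left( \mathcal{L}_{vs}^{l} + \lambda_{s} \mathcal{L}_{smooth}^{l} + \lambda_{e} \sum_{s} \mathcal{L}_{reg}(\hat E_{s}^{l}) \right) \]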
Single-view Depth CNN
Multi-view Pose CNN (blue part of the figure below) : 6 channels (3 Euler angles + 3D translation vector) for each source view, which are then converted to a \(4 \times 4\) transformation matrix
How is it converted to a transformation matrix??? (a rough conversion sketch is after this list)
Explainability soft mask (= reconstruction weight per pixel) (red part of the figure above) : 2 channels for each source view at each prediction layer
It is a per-pixel weight, so why are 2 channels needed for the explainability mask???
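Regarding the conversion question above: a 6-DoF vector of 3 Euler angles plus a 3D translation can be turned into a \(4 \times 4\) rigid transform in the standard way sketched below (numpy; the Euler-angle ordering/convention here is an assumption, not necessarily the one used in the SfMLearner code).

```python
import numpy as np

def pose_vec_to_mat(pose):
    """Convert a 6-DoF pose vector [tx, ty, tz, rx, ry, rz] (Euler angles in radians)
    into a 4x4 rigid transformation matrix. Generic sketch of the standard
    conversion; the exact angle convention in SfMLearner may differ."""
    tx, ty, tz, rx, ry, rz = pose
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx                      # one common Euler convention (ZYX)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [tx, ty, tz]
    return T

print(pose_vec_to_mat([0.1, 0.0, 0.5, 0.0, 0.02, 0.0]))
```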
Train : BN, Adam optimizer, monocular camera (one camera lens), resize input image
Test : arbitrary input image size
ATE : Absolute Trajectory Error (a rough computation sketch is below, after this list)
left/right turning magnitude : coordinate difference in the side direction between the start and ending frames at test time
Mean Odom. : mean of car motion for 5-frame snippets from GT odometry dataset
ORB-SLAM(full) : recover odometry using all frames for loop closure and re-localization
ORB-SLAM (short) : as in Ours, use 5-frame snippets as input
\(\rightarrow\) Ours outperforms ORB-SLAM (short) especially for small left/right turning magnitudes (when the car is mostly driving forward), so Ours could be expected to replace the local estimation module of a monocular SLAM system
(I haven't read the SLAM papers yet. Let's read them.)
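Regarding ATE above: since monocular predictions are only defined up to scale, a single scale factor is fitted so the predicted 5-frame trajectory best aligns with ground truth before measuring the position error. The snippet below is a rough numpy sketch of that idea; the exact alignment and error aggregation in the official evaluation code may differ.

```python
import numpy as np

def snippet_ate(pred_xyz, gt_xyz):
    """Rough ATE for one 5-frame snippet: align both trajectories at the first
    frame, fit a single least-squares scale factor (monocular scale ambiguity),
    then report the mean position error."""
    pred = pred_xyz - pred_xyz[0]
    gt = gt_xyz - gt_xyz[0]
    scale = np.sum(gt * pred) / np.sum(pred * pred)   # least-squares scale
    return np.mean(np.linalg.norm(scale * pred - gt, axis=1))

pred = np.cumsum(np.random.rand(5, 3), axis=0)        # fake predicted trajectory
gt = 2.0 * pred + 0.01 * np.random.randn(5, 3)        # fake GT at a different scale
print(snippet_ate(pred, gt))
```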
explainability = per-pixel weight (a kind of confidence) for reconstruction
row 1 ~ 3 : due to motion (dynamic objects are unexplainable)
row 4 ~ 5 : due to occlusion/visibility (disappeared objects are unexplainable)
row 6 ~ 7 : due to other factors (e.g. depth CNN has low confidence on thin structures)
unsupervised learning from monocular sequences : using view synthesis (reconstruction) as supervision, comparable performance is achieved even without GT supervision
dynamic objects (X) / occlusion (X) / must be a Lambertian surface / vast open scenes (X) / when objects are close to the front of the camera (X) / thin structures (X)
\(\rightarrow\) the explainability mask (= a kind of per-pixel reconstruction confidence) was introduced to mitigate the above limitations, but it is only an implicit consideration
assumes that the camera intrinsics K are given, so it does not generalize to random videos with unknown camera types
predicts a simplified 3D depth map of the surface (not a full 3D volumetric representation)
The questions scattered throughout are comments I left because I have not understood those parts yet.
I will update them later once I re-read and understand.
If anyone knows the answers, please leave a comment!