Trying Out Object Detection with SSD - TadaoYamaokaの開発日記

In a previous entry I tried object detection with YOLOv2. SSD (Single Shot MultiBox Detector) is another deep learning algorithm that, like YOLO, detects the regions of objects in an image.

The YOLOv2 paper reports that YOLOv2 is more accurate, but SSD also appears to be quite accurate, so I gave it a try.

The original SSD implementation uses Caffe, but it relies on a version of Caffe different from the one that can be built on Windows. I tried building it, but the build did not succeed.

I looked for implementations in Python frameworks and found two for TensorFlow, one for Keras, and one for Chainer.

TensorFlow implementations:
  ssd_tensorflow
  SSD-Tensorflow
Keras implementation:
  ssd_keras
Chainer implementation:
  chainer-SSD

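Personally I am most used to Chainer, but that implementation had not been updated as recently as the others and did not appear to be maintained, so I decided to try the TensorFlow implementation that was being actively updated.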

github.com

Raw TensorFlow code is verbose, so I would rather not touch it much, but this implementation uses TF-Slim, which makes it somewhat easier to read.

A trained model and a demo are also distributed, and running the demo as described under "SSD minimal example" in README.md worked without problems.

For the execution environment I used TensorFlow installed on Windows.

Using this, I ran object detection on the same video I had tried with YOLOv2 the other day.

www.nicovideo.jp

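Anime is not in the training data to begin with, so a comparison may not be meaningful, but compared with YOLOv2 there were more missed detections, more false detections in places with no object, and more cases where a character was labeled aeroplane.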
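
One cheap way to check how many of these false detections are low-confidence ones is to filter the detections by score before drawing them. The following is only a minimal sketch: select_by_score and min_score are hypothetical names that are not part of the SSD implementation, and the example scores are made up.

```python
def select_by_score(scores, min_score=0.6):
    """Return the indices of detections whose score is at least min_score."""
    return [i for i, s in enumerate(scores) if s >= min_score]


# Hypothetical scores for three detections (not measured values).
example_scores = [0.52, 0.91, 0.75]
keep = select_by_score(example_scores, min_score=0.6)
print(keep)  # -> [1, 2]
```

Assuming the outputs of process_image are NumPy arrays, the result could be applied as rclasses[keep], rscores[keep], rbboxes[keep] before calling write_bboxes. Raising the threshold trades false detections for more missed detections, so it would not address both problems at once.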

The execution speed was 45.8 FPS. This is slower than YOLOv2, but close to real-time speed.
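
To put the FPS figure in perspective, it can be converted to an average time per frame. This is a small arithmetic sketch: frame_time_ms is a hypothetical helper, and the 30 fps source frame rate is an assumption for illustration, since the actual frame rate of the video is not stated here.

```python
def frame_time_ms(fps):
    """Average time spent per frame, in milliseconds, for a given FPS."""
    return 1000.0 / fps


measured_fps = 45.8        # value reported in this entry
assumed_source_fps = 30.0  # hypothetical; the video's real frame rate is not stated

print('{:.1f} ms per frame'.format(frame_time_ms(measured_fps)))        # -> 21.8 ms per frame
print('{:.1f} ms per frame'.format(frame_time_ms(assumed_source_fps)))  # -> 33.3 ms per frame
```

Note that the measured figure is an average over the whole run, so it also includes the time spent reading and writing video frames, not only the detection itself.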

The TensorFlow implementation did not include code for running detection on video, so I wrote it myself.
The video-capable code I wrote is shown below.

Replace the final "In [11]" cell of "SSD minimal example" with the following code.

VOC_LABELS = {
    0: 'none',
    1: 'aeroplane',
    2: 'bicycle',
    3: 'bird',
    4: 'boat',
    5: 'bottle',
    6: 'bus',
    7: 'car',
    8: 'cat',
    9: 'chair',
    10: 'cow',
    11: 'diningtable',
    12: 'dog',
    13: 'horse',
    14: 'motorbike',
    15: 'person',
    16: 'pottedplant',
    17: 'sheep',
    18: 'sofa',
    19: 'train',
    20: 'tvmonitor',
}

import random
import cv2

# One random color per class label.
colors = [(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)) for _ in range(len(VOC_LABELS))]

def write_bboxes(img, classes, scores, bboxes):
    """Draw bounding boxes and class labels onto img in place.

    bboxes holds normalized (ymin, xmin, ymax, xmax) coordinates.
    Largely inspired by SSD-MXNET!
    """
    height = img.shape[0]
    width = img.shape[1]
    for i in range(classes.shape[0]):
        cls_id = int(classes[i])
        if cls_id >= 0:
            score = scores[i]
            # Scale the normalized coordinates to pixels.
            ymin = int(bboxes[i, 0] * height)
            xmin = int(bboxes[i, 1] * width)
            ymax = int(bboxes[i, 2] * height)
            xmax = int(bboxes[i, 3] * width)
            # Outline of the detected region.
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), colors[cls_id], 2)
            class_name = VOC_LABELS[cls_id]
            # Filled strip used as the background of the label text.
            cv2.rectangle(img, (xmin, ymin - 6), (xmin + 180, ymin + 6), colors[cls_id], -1)
            cv2.putText(img, '{:s} | {:.3f}'.format(class_name, score),
                        (xmin, ymin + 6),
                        cv2.FONT_HERSHEY_PLAIN, 1,
                        (255, 255, 255))


import time

vid = cv2.VideoCapture('path/to/movie')
if not vid.isOpened():
    raise IOError("Couldn't open video file or webcam. If you're "
                  "trying to open a webcam, pass the device index as an integer.")

vidw = vid.get(cv2.CAP_PROP_FRAME_WIDTH)
vidh = vid.get(cv2.CAP_PROP_FRAME_HEIGHT)
fps = vid.get(cv2.CAP_PROP_FPS)

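# Note: the 'mp4v' codec is written into an .avi container here; change the
# codec or the file extension if the resulting file does not play.
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
out = cv2.VideoWriter('path/to/output.avi', fourcc, fps, (int(vidw), int(vidh)))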

prev_time = time.time()
frame_cnt = 0
while True:
    retval, img = vid.read()
    if not retval:
        print("Done!")
        break
    
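    # cv2 returns frames in BGR order. If process_image expects RGB input
    # (an assumption worth checking against the notebook), convert first with
    # cv2.cvtColor(img, cv2.COLOR_BGR2RGB).
    rclasses, rscores, rbboxes = process_image(img)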
    write_bboxes(img, rclasses, rscores, rbboxes)
    out.write(img)
    frame_cnt += 1

# Average FPS over the whole run, including video decoding and encoding.
curr_time = time.time()
exec_time = curr_time - prev_time
print('FPS:{0}'.format(frame_cnt/exec_time))

vid.release()
out.release()
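
In write_bboxes, the boxes are treated as normalized (ymin, xmin, ymax, xmax) values and scaled to pixels. The sketch below isolates that conversion and additionally clamps the result to the frame, in case a value falls slightly outside the range 0 to 1. to_pixel_box is a hypothetical helper introduced for illustration, not part of the SSD implementation, and the example box is made up.

```python
def to_pixel_box(bbox, height, width):
    """Convert a normalized (ymin, xmin, ymax, xmax) box to clamped pixel
    coordinates in the (xmin, ymin, xmax, ymax) order used for drawing."""
    ymin, xmin, ymax, xmax = bbox

    def clamp(value, upper):
        # Truncate to an integer pixel index and keep it inside [0, upper].
        return max(0, min(int(value), upper))

    return (clamp(xmin * width, width - 1),
            clamp(ymin * height, height - 1),
            clamp(xmax * width, width - 1),
            clamp(ymax * height, height - 1))


# Hypothetical box that extends past the right edge of a 200x100 frame.
print(to_pixel_box((0.1, 0.2, 0.5, 1.2), height=100, width=200))  # -> (40, 10, 199, 50)
```

The returned tuple can be split into the two corner points passed to cv2.rectangle, for example (xmin, ymin) and (xmax, ymax).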