Trying INT8 with TensorRT - TadaoYamaoka's Development Diary

Tensor Cores from the Turing generation onward support INT8.
INT8 is therefore available on the GeForce RTX 2080 Ti as well, so I tried it out.

Because the Tensor Cores of the V100 do not support INT8, dlshogi had no INT8 support until now. AWS G4 instances, however, provide the NVIDIA T4 Tensor Core GPU, so I decided to add INT8 support.

dlshogi's INT8 support will be covered in a separate article; this article describes how to run the INT8 sample code.

Sample code

Download TensorRT and extract the archive; the INT8 sample is in sampleINT8 under the samples directory.

The same code is on GitHub, but when trying it on Windows, the samples in the downloaded zip are easier to use because they already come with Windows project files.

The sample code is explained in README.md.
It is easier to read on GitHub.

About calibration

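INT8 inference quantizes values, so a dynamic range must be set for each layer.
The dynamic range can be set individually for each layer, or it can be measured from data (calibration).
sampleINT8 is a sample that performs calibration.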
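
To make the idea of a dynamic range concrete, here is a minimal sketch in plain Python. It assumes simple max-abs calibration with a symmetric scale, which is not the entropy calibration that sampleINT8 uses; the function names and the activation values are hypothetical and only illustrate how a measured range turns floats into 8-bit integers.

```python
# Illustrative only: max-abs calibration and symmetric INT8 quantization.
# Assumption: this is NOT the entropy calibration used by sampleINT8.


def calibrate_max_abs(batches):
    """Return the dynamic range: the largest absolute value seen in the data."""
    dynamic_range = 0.0
    for batch in batches:
        for value in batch:
            dynamic_range = max(dynamic_range, abs(value))
    return dynamic_range


def quantize(value, dynamic_range):
    """Map a float to an integer in [-127, 127] using a symmetric scale."""
    if dynamic_range <= 0.0:
        raise ValueError("dynamic_range must be positive")
    scale = dynamic_range / 127.0
    q = round(value / scale)
    # Values outside the calibrated range saturate at the INT8 limits.
    return max(-127, min(127, q))


def dequantize(q, dynamic_range):
    """Map a quantized integer back to an approximate float."""
    return q * (dynamic_range / 127.0)


# Hypothetical activations collected while running calibration batches.
calibration_batches = [[0.1, -0.5, 0.25], [1.0, -2.0, 0.75]]
dynamic_range = calibrate_max_abs(calibration_batches)

# 3.0 lies outside the measured range, so it is clipped.
quantized = [quantize(v, dynamic_range) for v in [2.0, -2.0, 0.5, 3.0]]
print(dynamic_range, quantized)
```

Values beyond the measured range are clipped, which is why the choice of calibration data affects accuracy.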

Calibration is described in the following document.
Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

Downloading the MNIST model

Running this sample requires the MNIST dataset and an MNIST model trained with Caffe.

Downloading the dataset

As described in README.md, use get_mnist.sh to download the MNIST dataset.
It requires bash and wget, so on Windows use MSYS2 or WSL.
Place the dataset in "data\mnist".

Trained Caffe model

Installing Caffe and running the training is a lot of work, so I downloaded a trained model from the following repository, which I found by searching.
https://github.com/t-kuha/nn_models/find/master

The following two files are required:

caffe/mnist_lenet/lenet.prototxt
caffe/mnist_lenet/lenet_iter_10000.caffemodel

Download both and place them in "data\mnist".
Rename them as follows, respectively:

deploy.prototxt
mnist_lenet.caffemodel
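
The copy-and-rename step can also be scripted. The following is a minimal sketch that uses only the Python standard library; the helper name place_model_files and the directory arguments are hypothetical, while the file names are the ones listed above.

```python
import shutil
from pathlib import Path

# Downloaded file name -> name expected in data\mnist (from the lists above).
RENAMES = {
    "lenet.prototxt": "deploy.prototxt",
    "lenet_iter_10000.caffemodel": "mnist_lenet.caffemodel",
}


def place_model_files(download_dir, data_dir):
    """Copy the two model files into data_dir under their new names."""
    download_dir = Path(download_dir)
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    placed = []
    for src_name, dst_name in RENAMES.items():
        src = download_dir / src_name
        if not src.is_file():
            raise FileNotFoundError(f"missing downloaded file: {src}")
        dst = data_dir / dst_name
        shutil.copyfile(src, dst)
        placed.append(dst)
    return placed
```

For example, assuming the repository was cloned into a directory named nn_models next to the TensorRT data directory, `place_model_files("nn_models/caffe/mnist_lenet", "data/mnist")` would produce the two renamed files.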

Results

After building and running, the MNIST inference times for FP32, FP16, and INT8 are displayed as follows.

&&&& RUNNING TensorRT.sample_int8 # H:\src\TensorRT-7.0.0\samples\sampleINT8\\..\..\bin\sample_int8.exe
[09/12/2020-10:49:18] [I] Building and running a GPU inference engine for INT8 sample
[09/12/2020-10:49:18] [I] FP32 run:1800 batches of size 32 starting at 16
[09/12/2020-10:49:21] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/12/2020-10:49:21] [W] [TRT] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[09/12/2020-10:49:21] [I] Processing next set of max 100 batches
(omitted)
[09/12/2020-10:49:22] [I] Processing next set of max 100 batches
[09/12/2020-10:49:22] [I] Top1: 0.998542, Top5: 1
[09/12/2020-10:49:22] [I] Processing 57600 images averaged 0.00654066 ms/image and 0.207457 ms/batch.
[09/12/2020-10:49:22] [I] FP16 run:1800 batches of size 32 starting at 16
[09/12/2020-10:49:30] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/12/2020-10:49:30] [W] [TRT] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[09/12/2020-10:49:30] [I] Processing next set of max 100 batches
(omitted)
[09/12/2020-10:49:31] [I] Processing next set of max 100 batches
[09/12/2020-10:49:31] [I] Top1: 0.998542, Top5: 1
[09/12/2020-10:49:31] [I] Processing 57600 images averaged 0.00548979 ms/image and 0.174126 ms/batch.
[09/12/2020-10:49:31] [I] INT8 run:1800 batches of size 32 starting at 16
[09/12/2020-10:49:31] [I] [TRT] Reading Calibration Cache for calibrator: EntropyCalibration2
[09/12/2020-10:49:31] [I] [TRT] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[09/12/2020-10:49:31] [I] [TRT] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[09/12/2020-10:49:38] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/12/2020-10:49:38] [W] [TRT] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[09/12/2020-10:49:38] [I] Processing next set of max 100 batches
(omitted)
[09/12/2020-10:49:38] [I] Processing next set of max 100 batches
[09/12/2020-10:49:38] [I] Top1: 0.998524, Top5: 1
[09/12/2020-10:49:38] [I] Processing 57600 images averaged 0.00454334 ms/image and 0.144106 ms/batch.
&&&& PASSED TensorRT.sample_int8 # H:\src\TensorRT-7.0.0\samples\sampleINT8\\..\..\bin\sample_int8.exe

The inference times and accuracy are as follows.

Mode	Inference time (ms/image)	Inference time (ms/batch)	Accuracy (Top1)
FP32	0.00654066	0.207457	0.998542
FP16	0.00548979	0.174126	0.998542
INT8	0.00454334	0.144106	0.998524

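Going from FP16 to INT8, inference is about 1.2 times faster (0.00548979 ms/image to 0.00454334 ms/image).
Accuracy drops slightly with INT8 (Top1 0.998542 to 0.998524).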
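
The speedup can be checked directly from the log. The sketch below parses the per-mode summary lines and computes the ratios; the regular expressions and the function name parse_results are my own assumptions based on the log format shown above, not part of the sample.

```python
import re

# Summary lines copied from the log above (other lines omitted).
LOG = """\
[09/12/2020-10:49:18] [I] FP32 run:1800 batches of size 32 starting at 16
[09/12/2020-10:49:22] [I] Top1: 0.998542, Top5: 1
[09/12/2020-10:49:22] [I] Processing 57600 images averaged 0.00654066 ms/image and 0.207457 ms/batch.
[09/12/2020-10:49:22] [I] FP16 run:1800 batches of size 32 starting at 16
[09/12/2020-10:49:31] [I] Top1: 0.998542, Top5: 1
[09/12/2020-10:49:31] [I] Processing 57600 images averaged 0.00548979 ms/image and 0.174126 ms/batch.
[09/12/2020-10:49:31] [I] INT8 run:1800 batches of size 32 starting at 16
[09/12/2020-10:49:38] [I] Top1: 0.998524, Top5: 1
[09/12/2020-10:49:38] [I] Processing 57600 images averaged 0.00454334 ms/image and 0.144106 ms/batch.
"""

RUN_RE = re.compile(r"\[I\] (FP32|FP16|INT8) run:")
TOP_RE = re.compile(r"Top1: ([0-9.]+), Top5: ([0-9.]+)")
TIME_RE = re.compile(r"averaged ([0-9.eE+-]+) ms/image and ([0-9.eE+-]+) ms/batch")


def parse_results(log_text):
    """Collect Top1 accuracy and timings for each precision mode."""
    results = {}
    mode = None
    for line in log_text.splitlines():
        match = RUN_RE.search(line)
        if match:
            mode = match.group(1)
            results[mode] = {}
            continue
        if mode is None:
            continue
        match = TOP_RE.search(line)
        if match:
            results[mode]["top1"] = float(match.group(1))
            continue
        match = TIME_RE.search(line)
        if match:
            results[mode]["ms_per_image"] = float(match.group(1))
            results[mode]["ms_per_batch"] = float(match.group(2))
    return results


results = parse_results(LOG)
fp16_to_int8 = results["FP16"]["ms_per_image"] / results["INT8"]["ms_per_image"]
fp32_to_int8 = results["FP32"]["ms_per_image"] / results["INT8"]["ms_per_image"]
print(f"FP16 -> INT8 speedup: {fp16_to_int8:.2f}x")
print(f"FP32 -> INT8 speedup: {fp32_to_int8:.2f}x")
```

Dividing the per-image times gives the speedup factor, so a value above 1 means the lower-precision mode is faster.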

Summary

This article explained how to run the TensorRT INT8 sample code.
There was almost no information about it online, so it took quite some effort (especially preparing the Caffe model).

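For a variable-batch-size ONNX model exported from PyTorch, things did not go as in this sample and took even more effort, but I will cover that in a separate article.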