I. TensorRT Basic Concepts
TensorRT (formerly GIE) is a C++ library that targets the Jetson TX1 and Pascal-architecture GPUs and supports FP16, i.e. half-precision arithmetic. Because it trades precision for speed, it accelerates inference noticeably while keeping accuracy essentially unchanged, often more than doubling throughput.
II. TensorRT Programming Workflow
Using TensorRT involves two main phases:
Build phase: construct a network definition (network), run the optimizations, and produce an inference engine.
Execution phase: the engine runs inference (only input/output buffers need to be supplied); inference is launched synchronously via IExecutionContext::execute or asynchronously via IExecutionContext::enqueue, as in the short snippet below.
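A minimal sketch of the two launch paths (assuming a context, a void* buffers[] array of device pointers, and a cudaStream_t stream have already been set up, as in the code later in this post):
// Synchronous launch: returns after inference for this batch has completed
context->execute(batchSize, buffers);
// Asynchronous launch: queues the work on the CUDA stream and returns immediately
context->enqueue(batchSize, buffers, stream, nullptr);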
Layers currently supported (custom layers are reportedly coming in TensorRT 2.1, which is good news):
Convolution: 2D
Activation: ReLU, tanh and sigmoid
Pooling: max and average
ElementWise: sum, product or max of two tensors
LRN: cross-channel only
Fully-connected: with or without bias
SoftMax: cross-channel only
Deconvolution
1. Two main header files and two libraries:
Header directory: /usr/include/x86_64-linux-gnu/
NvCaffeParser.h NvInfer.h
Library directory: /usr/lib/x86_64-linux-gnu
libnvcaffe_parser.so libnvinfer.so
Demo directory: /usr/src/gie_samples
Documentation directory: /usr/share/doc/gie
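A minimal smoke test for this setup (the g++ command line and the CUDA paths are assumptions; adjust them to your install):
// gie_check.cpp: confirms the GIE headers can be found and the libraries linked
#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <cuda_runtime_api.h>
#include <iostream>
int main()
{
    int n = 0;
    cudaGetDeviceCount(&n); // simple CUDA runtime call to verify linking
    std::cout << "CUDA devices visible: " << n << std::endl;
    return 0;
}
// Assumed build command:
// g++ gie_check.cpp -o gie_check -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lnvinfer -lnvcaffe_parser -lcudart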
2. Main program flow (only the key parts of the code are shown):
Since CUDA is used, cuda_runtime_api.h must be included.
Taking sampleMNIST as an example, starting from main():
Convert the Caffe model to a GIE model and create a serialized engine;
Read in the image to be processed;
Use the Caffe parser to parse the mean file (.binaryproto) and subtract the mean from the input image (see the sketch after this list);
Deserialize gieModelStream into an engine and create the context in which the engine executes;
Run inference: doInference(*context, data, prob, 1); // arguments: context, input, output, batch size
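The mean-subtraction step is the only one of these not expanded on below, so here is a minimal sketch of it. The file name mnist_mean.binaryproto, the constants INPUT_H/INPUT_W, and the fileData buffer holding the raw image pixels follow sampleMNIST and are assumptions here:
// Parse the mean file with the Caffe parser and subtract it from the input image
ICaffeParser* parser = createCaffeParser();
IBinaryProtoBlob* meanBlob = parser->parseBinaryProto("mnist_mean.binaryproto");
parser->destroy(); // the blob remains valid after the parser is destroyed
const float* meanData = reinterpret_cast<const float*>(meanBlob->getData());
float data[INPUT_H * INPUT_W];
for (int i = 0; i < INPUT_H * INPUT_W; i++)
    data[i] = float(fileData[i]) - meanData[i]; // fileData: raw 8-bit pixels read from the image
meanBlob->destroy();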
Key step 1: Convert the Caffe model into a GIE model so the CUDA engine can run inference:
void caffeToGIEModel(const std::string& deployFile, const std::string& modelFile, const std::vector<std::string>& outputs, unsigned int maxBatchSize, std::ostream& gieModelStream)
// Create the builder
IBuilder* builder = createInferBuilder(gLogger);
// Parse the Caffe model into the network, then mark the outputs
INetworkDefinition* network = builder->createNetwork();
ICaffeParser* parser = createCaffeParser();
const IBlobNameToTensor* blobNameToTensor = parser->parse(deployFile.c_str(), modelFile.c_str(), *network, DataType::kFLOAT);
// Specify which tensors are the outputs
for (auto& s : outputs) // outputs is the vector of strings passed in, holding the names of the blobs to output
network->markOutput(*blobNameToTensor->find(s.c_str()));
// Create the engine: give the builder the maximum batch size and workspace size, then hand it the network
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(1 << 20); // workspace size used by the sample
ICudaEngine* engine = builder->buildCudaEngine(*network);
// Destroy the network and parser once they are no longer needed
network->destroy();
parser->destroy();
// Serialize the engine, then destroy the engine and the builder
engine->serialize(gieModelStream);
engine->destroy();
builder->destroy();
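A short usage sketch of the function above (the file names and OUTPUT_BLOB_NAME, which is "prob" in sampleMNIST, are assumptions here):
// Build and serialize the MNIST engine with a maximum batch size of 1
std::stringstream gieModelStream;
caffeToGIEModel("mnist.prototxt", "mnist.caffemodel", std::vector<std::string>{OUTPUT_BLOB_NAME}, 1, gieModelStream);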
Key step 2: Deserialize the model stream into a CUDA engine and create the context used to run inference
gieModelStream.seekg(0, gieModelStream.beg);
IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(gieModelStream);
IExecutionContext* context = engine->createExecutionContext();
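The gLogger object passed to createInferBuilder and createInferRuntime above is a user-supplied implementation of the ILogger interface. A minimal sketch, assuming using namespace nvinfer1 as in the sample; filtering out INFO-level messages is just one reasonable choice:
// Minimal logger: print everything except INFO-level messages
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;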
Key step 3: Run inference with void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
// Get the engine from the context
const ICudaEngine& engine = context.getEngine();
// Get the number of binding indices; with exactly one input and one output this returns 2
assert(engine.getNbBindings() == 2);
// Look up the input and output binding indices by tensor name
int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME),
outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
// Allocate GPU buffers; cudaMalloc works much like malloc
void* buffers[2];
CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));
// Create a CUDA stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// *** DMA the input to the GPU, run inference on the batch asynchronously, then DMA the output back to the CPU ***
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize*OUTPUT_SIZE*sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream); // wait for the stream to finish
// Finally release the stream and the buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[inputIndex]));
CHECK(cudaFree(buffers[outputIndex]));
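After doInference returns in main(), the objects created during deserialization can be released as well (a sketch following the sample's cleanup order):
// Release the execution context, engine and runtime created in key step 2
context->destroy();
engine->destroy();
runtime->destroy();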