[System] [Altera SoC Experience Tour] + Running a Complete Convolutional Neural Network on the DE1

zhaoyongke 2015-5-30 18:18 Original poster

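It has been a while since my last post; I have spent the time debugging another board, a Stratix V. Without further ado, here is the latest progress on the project. My previous post ([Altera SoC Experience Tour] + Officially Entering OpenCL Mode) briefly introduced what a neural network is. This time I will walk through an example that uses a convolutional neural network to classify images automatically.

The strength of neural networks: they learn from data automatically, with no need for complex hand-built models and algorithms. The weaknesses: they are supervised learners and need a large amount of labeled data; when the training data is too small for the model they overfit easily and generalize poorly; and when the parameter count is very large, training converges slowly (it can take months or even years). To overcome these weaknesses, people have tapped every kind of computing resource: multi-core CPUs, GPUs, DSPs, ASICs, FPGAs, and even analog circuits. A CPU implementation of a CNN is convenient to debug but far too slow, so most people choose the faster GPU. Most current open-source frameworks support GPUs, for example Caffe from UC Berkeley (Caffe website) and Google Convnet (convnet2). In February this year Microsoft announced a CNN accelerator built on Stratix V that processes CIFAR10 images at more than 2,300 per second; links: (Microsoft FPGA-CNN accelerator) (technical report).

Here we also use the CIFAR10 image data and run a CNN demo on a Cyclone V board. Because the board has so few compute resources (just over 80 DSP blocks), implementing the whole network in the FPGA fabric is not realistic; only the basic compute units are implemented on the FPGA, and the HPS schedules them. Performance is not expected to be high; the numbers are given later. So what do CIFAR10 images look like? Here is a picture first!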

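Anyone interested can download the data from the official site (CIFAR10 website). As mentioned above, a CNN is a supervised learning system that needs a large amount of labeled data, and CIFAR10 is exactly such an open dataset: 60,000 images in 10 classes (shown on the left of the figure above), 6,000 images per class. The dataset is not especially large, which makes it suitable for an embedded platform. A much larger dataset is ImageNet-1000 (ImageNet website), with more than 1.2 million high-resolution images; my download takes up nearly 200 GB of disk space (I had to grit my teeth and delete my other rmvb and avi files!). Some will ask whether we can do without these datasets and train a CNN on the photos in our smartphones. The answer is yes, but such a dataset is very "non-uniform" and its sampling is not "complete" enough, so the trained model is a "biased estimate" of the true model. The two datasets above have survived all kinds of scrutiny and are recognized in academia as high-quality; the annual ILSVRC competition, for instance, is built on ImageNet data. So much for the data; now for the model. First, a look at a classic CNN structure: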

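This is the world's first example of a CNN put to practical use: automatic recognition of handwritten characters. In this CNN the input is a 32 x 32 two-dimensional image. After passing through convolution layers, subsampling layers (also called pooling), and fully connected layers (full connection, also called inner product), it yields a set of class probabilities, and the class with the highest probability is taken as the model's classification of the input image. So implementing a CNN only requires three basic algorithms: convolution, subsampling, and matrix multiplication. In addition, the output of each layer may optionally pass through a nonlinearity; the common choices are ReLU and Sigmoid, of which ReLU is cheaper to compute and more widely used. The Caffe framework ships a model tailored to the cifar10 data, written in proto (prototxt) format, and our demo is based on it. Its contents are as follows: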

name: "CIFAR10_quick_test"
input: "data"
input_dim: 1
input_dim: 3
input_dim: 32
input_dim: 32
layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  blobs_lr: 1
  blobs_lr: 2
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
  }
}
layers {
  name: "pool1"
  type: POOLING
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layers {
  name: "relu1"
  type: RELU
  bottom: "pool1"
  top: "pool1"
}
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "pool1"
  top: "conv2"
  blobs_lr: 1
  blobs_lr: 2
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
  }
}
layers {
  name: "relu2"
  type: RELU
  bottom: "conv2"
  top: "conv2"
}
layers {
  name: "pool2"
  type: POOLING
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: AVE
    kernel_size: 3
    stride: 2
  }
}
layers {
  name: "conv3"
  type: CONVOLUTION
  bottom: "pool2"
  top: "conv3"
  blobs_lr: 1
  blobs_lr: 2
  convolution_param {
    num_output: 64
    pad: 2
    kernel_size: 5
    stride: 1
  }
}
layers {
  name: "relu3"
  type: RELU
  bottom: "conv3"
  top: "conv3"
}
layers {
  name: "pool3"
  type: POOLING
  bottom: "conv3"
  top: "pool3"
  pooling_param {
    pool: AVE
    kernel_size: 3
    stride: 2
  }
}
layers {
  name: "ip1"
  type: INNER_PRODUCT
  bottom: "pool3"
  top: "ip1"
  blobs_lr: 1
  blobs_lr: 2
  inner_product_param {
    num_output: 64
  }
}
layers {
  name: "ip2"
  type: INNER_PRODUCT
  bottom: "ip1"
  top: "ip2"
  blobs_lr: 1
  blobs_lr: 2
  inner_product_param {
    num_output: 10
  }
}
layers {
  name: "prob"
  type: SOFTMAX
  bottom: "ip2"
  top: "prob"
}
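
The pooling and ReLU layers in this definition are simple enough to sketch on the host side. What follows is a minimal Python sketch, not the author's code: `pool2d` and `relu` are hypothetical names, each plane is a flat row-major list, the output size uses ceil rounding (as Caffe does for pooling), and edge windows are assumed to be clipped to the image, with AVE averaging only the in-bounds elements.

```python
import math


def pool2d(plane, size, kernel=3, stride=2, mode="MAX"):
    """Pool one size x size plane stored as a flat row-major list.

    Returns (pooled_plane, out_size). Windows that overhang the right or
    bottom edge are clipped to the image (an assumption of this sketch).
    """
    out_size = int(math.ceil((size - kernel) / stride)) + 1
    out = []
    for oy in range(out_size):
        for ox in range(out_size):
            y0, x0 = oy * stride, ox * stride
            y1, x1 = min(y0 + kernel, size), min(x0 + kernel, size)
            window = [plane[y * size + x]
                      for y in range(y0, y1)
                      for x in range(x0, x1)]
            if mode == "MAX":
                out.append(max(window))
            else:  # "AVE"
                out.append(sum(window) / len(window))
    return out, out_size


def relu(plane):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in plane]


if __name__ == "__main__":
    plane = [float(v) for v in range(16)]       # a 4 x 4 test plane
    print(pool2d(plane, 4, mode="MAX"))
    print(pool2d(plane, 4, mode="AVE"))
    print(relu([-1.0, 0.0, 2.5]))
```

With kernel_size 3 and stride 2, this rounding takes a 32 x 32 plane to 16 x 16, then 8 x 8, then 4 x 4.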

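As can be seen, the model above has three convolution layers (conv1, conv2, conv3), each followed by a subsampling layer (pool1, pool2, pool3), and then two fully connected layers (ip1, ip2). The last layer, prob, is a SOFTMAX classification layer that computes the class probabilities; we do not need to worry about it here. The three figures below tally the parameter count, data volume, and amount of computation for each layer of the CNN model.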
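
The per-layer numbers can also be estimated from the prototxt alone. The Python sketch below is an illustration under stated assumptions, not the script behind the figures: `LAYERS` and `layer_stats` are hypothetical names, pooling output sizes use ceil rounding, a multiply-accumulate (MAC) is counted once per weight per output element, biases are counted as parameters, and pooling is counted as zero MACs. ReLU layers are left out because they change neither shape nor parameter count.

```python
import math

# (name, kind, num_output, kernel_size, stride, pad), transcribed from the prototxt.
LAYERS = [
    ("conv1", "conv", 32, 5, 1, 2),
    ("pool1", "pool", None, 3, 2, 0),
    ("conv2", "conv", 32, 5, 1, 2),
    ("pool2", "pool", None, 3, 2, 0),
    ("conv3", "conv", 64, 5, 1, 2),
    ("pool3", "pool", None, 3, 2, 0),
    ("ip1", "ip", 64, None, None, None),
    ("ip2", "ip", 10, None, None, None),
]


def layer_stats(channels=3, size=32):
    """Return {layer: {"shape", "params", "macs"}} for a channels x size x size input."""
    stats = {}
    for name, kind, n_out, k, s, p in LAYERS:
        if kind == "conv":
            out_size = (size + 2 * p - k) // s + 1
            params = n_out * channels * k * k + n_out      # weights + biases
            macs = out_size * out_size * n_out * channels * k * k
            channels, size = n_out, out_size
        elif kind == "pool":
            size = int(math.ceil((size - k) / s)) + 1      # ceil rounding
            params, macs = 0, 0
        else:  # inner product over the flattened input
            fan_in = channels * size * size
            params = n_out * fan_in + n_out
            macs = n_out * fan_in
            channels, size = n_out, 1
        stats[name] = {"shape": (channels, size, size),
                       "params": params, "macs": macs}
    return stats


if __name__ == "__main__":
    for name, info in layer_stats().items():
        print(name, info["shape"], info["params"], info["macs"])
```

Under these assumptions conv2 comes out as the layer with the most multiply-accumulates, followed by conv3 and conv1, while ip1 holds the most parameters.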

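The convolution layers have few parameters but a large data volume; the fully connected layers are the opposite, with more parameters but very little data. The computation tally shows that conv2 is the most expensive layer, followed by conv3 and conv1. The fully connected layers cost less than the convolution layers, but not so little that they can be ignored. The remaining layers (pool1, pool2, and the ReLU at each stage) involve so little computation that this design does not implement them as OpenCL kernels; they run directly on the CPU. In summary, we focus on two algorithms, convolution and matrix multiplication, which implement the convolution layers and the fully connected layers respectively. On the DE1-SoC I used the OpenCL BSP provided by Terasic, which supports developing for the FPGA in C. The kernel function for the convolution layer is as follows: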

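__attribute__((num_compute_units(4)))
__kernel
void conv(__global float * a, __global float * b, __global float * c, const int M, const int N, const int K)
{
    // One work-item per output pixel (gx, gy).
    int gx = get_global_id(0);
    int gy = get_global_id(1);
    float tmp = 0.0f;
    // Slide the K x K filter b over the input a, whose row stride is M.
    for (int x = 0; x < K; x++)
    {
        for (int y = 0; y < K; y++)
        {
            tmp += a[(gx + x) * M + (gy + y)] * b[x * K + y];
        }
    }
    // The listing in the post is cut off at this point. The store below is
    // an assumption: output c is row-major with row stride N.
    c[gx * N + gy] = tmp;
}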
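
A host-side reference helps when checking this kernel and shows how the HPS might assemble a full layer from it. The following Python sketch rests on assumptions and is not the author's host code: `conv2d_valid`, `pad_plane`, and `conv_layer` are hypothetical names; the input is M x M row-major; the output is taken to be N x N row-major with N = M - K + 1 (the post does not spell out the output layout); and a layer is built by zero-padding each input channel and summing the single-channel results per output channel.

```python
def conv2d_valid(a, b, M, K):
    """Single-channel 'valid' correlation with the kernel's index arithmetic."""
    N = M - K + 1                       # assumed output width
    c = [0.0] * (N * N)
    for gx in range(N):                 # plays the role of get_global_id(0)
        for gy in range(N):             # plays the role of get_global_id(1)
            tmp = 0.0
            for x in range(K):
                for y in range(K):
                    tmp += a[(gx + x) * M + (gy + y)] * b[x * K + y]
            c[gx * N + gy] = tmp
    return c


def pad_plane(plane, size, pad):
    """Zero-pad a size x size plane; returns (padded_plane, padded_size)."""
    M = size + 2 * pad
    out = [0.0] * (M * M)
    for r in range(size):
        for col in range(size):
            out[(r + pad) * M + (col + pad)] = plane[r * size + col]
    return out, M


def conv_layer(planes, filters, biases, size, K, pad):
    """planes[i]: input channel i; filters[o][i]: K*K filter; biases[o]: bias."""
    padded = [pad_plane(p, size, pad) for p in planes]
    outputs = []
    for o, bias in enumerate(biases):
        M = padded[0][1]
        N = M - K + 1
        acc = [bias] * (N * N)
        for i, (plane, M) in enumerate(padded):
            part = conv2d_valid(plane, filters[o][i], M, K)
            acc = [s + t for s, t in zip(acc, part)]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    a = [float(v) for v in range(1, 10)]          # 3 x 3 input
    b = [1.0, 0.0, 0.0, 1.0]                      # 2 x 2 filter
    print(conv2d_valid(a, b, 3, 2))
```

Like the kernel, this does not flip the filter, so strictly speaking it computes a correlation.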

The fully connected layer is computed as a matrix multiplication; the kernel function is as follows:

__attribute__((num_compute_units(4)))
__kernel
void gemm(__global float * a, __global float * b, __global float * c, const int M, const int N, const int K)
{
    int gx = get_global_id(0);
    int gy = get_global_id(1);
    int sx = get_global_size(0);
    int sy = get_global_size(1);
    // Each work-item strides over the M x N output by the global size,
    // so any launch size covers the whole result.
    for (int x = gx; x < M; x += sx)
    {
        for (int y = gy; y < N; y += sy)
        {
            float tmp = 0.0f;
            // Dot product over the shared dimension K.
            for (int z = 0; z < K; z++)
            {
                tmp += a[z * M + x] * b[y * K + z];
            }
            c[y * M + x] = tmp;
        }
    }
}
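
The indexing in this kernel can be checked against a plain reference on the host. The Python sketch below uses hypothetical names (`gemm_reference`, `gemm_work_items`) and one reading of the layout that the post does not state: a is K x M (one column per image), b is N x K (one row of weights per output), and c is N x M. The second function emulates an sx-by-sy launch to show that the strided loops cover every output element.

```python
def gemm_reference(a, b, M, N, K):
    """c[y*M + x] = sum over z of a[z*M + x] * b[y*K + z], as in the kernel."""
    c = [0.0] * (N * M)
    for x in range(M):
        for y in range(N):
            tmp = 0.0
            for z in range(K):
                tmp += a[z * M + x] * b[y * K + z]
            c[y * M + x] = tmp
    return c


def gemm_work_items(a, b, M, N, K, sx, sy):
    """Emulate a launch of sx * sy work-items, each striding by (sx, sy)."""
    c = [0.0] * (N * M)
    for gx in range(sx):                    # get_global_id(0)
        for gy in range(sy):                # get_global_id(1)
            for x in range(gx, M, sx):
                for y in range(gy, N, sy):
                    tmp = 0.0
                    for z in range(K):
                        tmp += a[z * M + x] * b[y * K + z]
                    c[y * M + x] = tmp
    return c


if __name__ == "__main__":
    # K = 3 inputs, N = 2 outputs, M = 2 images (columns of a).
    a = [1.0, 10.0,
         2.0, 20.0,
         3.0, 30.0]
    b = [1.0, 0.0, 0.0,
         0.0, 1.0, 1.0]
    print(gemm_reference(a, b, 2, 2, 3))
    print(gemm_work_items(a, b, 2, 2, 3, sx=1, sy=1))
```

With M = 1 this reduces to a matrix-vector product, which is what a fully connected layer needs for a single image.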

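Compiling the kernel functions requires the Altera SDK for OpenCL; I used version 14.0.0.200 with a two-month license. Compilation is done with the command-line tool aoc and produces a *.aocx file. The OpenCL compilation report gives the resource usage: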

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   83%                     ;
; Dedicated logic registers              ;   46%                     ;
; Memory blocks                          ;   57%                     ;
; DSP blocks                             ;   25%                     ;
+----------------------------------------+---------------------------;

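Logic and memory resources are clearly heavily used while the DSP blocks are far from exhausted, which suggests there is still room for optimization. Building the host program requires SoC EDS; I used version 14.0.2.274, again from the command line: run make in the project directory, and the result is an executable named cnn. Copy these two files (the *.aocx file and cnn) to the SD card, set up the board as described in my earlier blog posts, and copy the CNN model and the CIFAR10 data to the SD card as well. Power on the board, mount the SD card at /mnt, and run cnn. The output is as follows: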

Please input the number of images(1~100):100
Loading data...OK!
Constructing CNN...OK!
Begin calculation...Elapsed Time = 141.861 s.
Real Label = 3(cat), Calc Label = 3(cat), error count = 0
Real Label = 8(ship), Calc Label = 8(ship), error count = 0
Real Label = 8(ship), Calc Label = 8(ship), error count = 0
Real Label = 0(airplane), Calc Label = 0(airplane), error count = 0
Real Label = 6(frog), Calc Label = 6(frog), error count = 0
Real Label = 6(frog), Calc Label = 6(frog), error count = 0
Real Label = 1(automobile), Calc Label = 1(automobile), error count = 0
Real Label = 6(frog), Calc Label = 6(frog), error count = 0
Real Label = 3(cat), Calc Label = 3(cat), error count = 0
Real Label = 1(automobile), Calc Label = 1(automobile), error count = 0
Real Label = 0(airplane), Calc Label = 0(airplane), error count = 0
Real Label = 9(truck), Calc Label = 9(truck), error count = 0
Real Label = 5(dog), Calc Label = 5(dog), error count = 0
Real Label = 7(horse), Calc Label = 7(horse), error count = 0
Real Label = 9(truck), Calc Label = 9(truck), error count = 0
Real Label = 8(ship), Calc Label = 8(ship), error count = 0
Real Label = 5(dog), Calc Label = 5(dog), error count = 0
Real Label = 7(horse), Calc Label = 7(horse), error count = 0
Real Label = 8(ship), Calc Label = 8(ship), error count = 0
Real Label = 6(frog), Calc Label = 6(frog), error count = 0
Real Label = 7(horse), Calc Label = 7(horse), error count = 0
Real Label = 0(airplane), Calc Label = 2(bird), error count = 1
Real Label = 4(deer), Calc Label = 4(deer), error count = 1
Real Label = 9(truck), Calc Label = 9(truck), error count = 1
Real Label = 5(dog), Calc Label = 4(deer), error count = 2
Real Label = 2(bird), Calc Label = 3(cat), error count = 3
Real Label = 4(deer), Calc Label = 4(deer), error count = 3
Real Label = 0(airplane), Calc Label = 0(airplane), error count = 3
Real Label = 9(truck), Calc Label = 9(truck), error count = 3
Real Label = 6(frog), Calc Label = 6(frog), error count = 3
Real Label = 6(frog), Calc Label = 6(frog), error count = 3
Real Label = 5(dog), Calc Label = 5(dog), error count = 3
Real Label = 4(deer), Calc Label = 4(deer), error count = 3
Real Label = 5(dog), Calc Label = 5(dog), error count = 3
Real Label = 9(truck), Calc Label = 9(truck), error count = 3
Real Label = 2(bird), Calc Label = 3(cat), error count = 4
Real Label = 4(deer), Calc Label = 7(horse), error count = 5
Real Label = 1(automobile), Calc Label = 9(truck), error count = 6
Real Label = 9(truck), Calc Label = 9(truck), error count = 6
Real Label = 5(dog), Calc Label = 5(dog), error count = 6
Real Label = 4(deer), Calc Label = 4(deer), error count = 6
Real Label = 6(frog), Calc Label = 6(frog), error count = 6
Real Label = 5(dog), Calc Label = 5(dog), error count = 6
Real Label = 6(frog), Calc Label = 6(frog), error count = 6
Real Label = 0(airplane), Calc Label = 0(airplane), error count = 6
Real Label = 9(truck), Calc Label = 9(truck), error count = 6
Real Label = 3(cat), Calc Label = 5(dog), error count = 7
Real Label = 9(truck), Calc Label = 9(truck), error count = 7
Real Label = 7(horse), Calc Label = 7(horse), error count = 7
Real Label = 6(frog), Calc Label = 6(frog), error count = 7
Real Label = 9(truck), Calc Label = 9(truck), error count = 7
Real Label = 8(ship), Calc Label = 8(ship), error count = 7
Real Label = 0(airplane), Calc Label = 2(bird), error count = 8
Real Label = 3(cat), Calc Label = 3(cat), error count = 8
Real Label = 8(ship), Calc Label = 8(ship), error count = 8
Real Label = 8(ship), Calc Label = 8(ship), error count = 8
Real Label = 7(horse), Calc Label = 7(horse), error count = 8
Real Label = 7(horse), Calc Label = 7(horse), error count = 8
Real Label = 4(deer), Calc Label = 3(cat), error count = 9
Real Label = 6(frog), Calc Label = 3(cat), error count = 10
Real Label = 7(horse), Calc Label = 7(horse), error count = 10
Real Label = 3(cat), Calc Label = 5(dog), error count = 11
Real Label = 6(frog), Calc Label = 6(frog), error count = 11
Real Label = 3(cat), Calc Label = 3(cat), error count = 11
Real Label = 6(frog), Calc Label = 6(frog), error count = 11
Real Label = 2(bird), Calc Label = 2(bird), error count = 11
Real Label = 1(automobile), Calc Label = 1(automobile), error count = 11
Real Label = 2(bird), Calc Label = 2(bird), error count = 11
Real Label = 3(cat), Calc Label = 3(cat), error count = 11
Real Label = 7(horse), Calc Label = 9(truck), error count = 12
Real Label = 2(bird), Calc Label = 2(bird), error count = 12
Real Label = 6(frog), Calc Label = 6(frog), error count = 12
Real Label = 8(ship), Calc Label = 8(ship), error count = 12
Real Label = 8(ship), Calc Label = 8(ship), error count = 12
Real Label = 0(airplane), Calc Label = 0(airplane), error count = 12
Real Label = 2(bird), Calc Label = 2(bird), error count = 12
Real Label = 9(truck), Calc Label = 0(airplane), error count = 13
Real Label = 3(cat), Calc Label = 3(cat), error count = 13
Real Label = 3(cat), Calc Label = 2(bird), error count = 14
Real Label = 8(ship), Calc Label = 8(ship), error count = 14
Real Label = 8(ship), Calc Label = 8(ship), error count = 14
Real Label = 1(automobile), Calc Label = 1(automobile), error count = 14
Real Label = 1(automobile), Calc Label = 1(automobile), error count = 14
Real Label = 7(horse), Calc Label = 7(horse), error count = 14
Real Label = 2(bird), Calc Label = 2(bird), error count = 14
Real Label = 5(dog), Calc Label = 7(horse), error count = 15
Real Label = 2(bird), Calc Label = 2(bird), error count = 15
Real Label = 7(horse), Calc Label = 7(horse), error count = 15
Real Label = 8(ship), Calc Label = 8(ship), error count = 15
Real Label = 9(truck), Calc Label = 9(truck), error count = 15
Real Label = 0(airplane), Calc Label = 0(airplane), error count = 15
Real Label = 3(cat), Calc Label = 4(deer), error count = 16
Real Label = 8(ship), Calc Label = 8(ship), error count = 16
Real Label = 6(frog), Calc Label = 6(frog), error count = 16
Real Label = 4(deer), Calc Label = 4(deer), error count = 16
Real Label = 6(frog), Calc Label = 6(frog), error count = 16
Real Label = 6(frog), Calc Label = 6(frog), error count = 16
Real Label = 0(airplane), Calc Label = 2(bird), error count = 17
Real Label = 0(airplane), Calc Label = 0(airplane), error count = 17
Real Label = 7(horse), Calc Label = 7(horse), error count = 17
Classify Score = 83 %.

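The execution flow above is as follows. First the number of test samples is entered (1 to 100). Because the SDRAM on the FPGA side of the DE1 board is small, it is hard to load the whole test set (10,000 images), so at most 100 images are loaded per run. The data is then loaded into HPS memory and the CNN model is constructed; OpenCL initialization also happens during construction. Once the model is built, the input images are passed through the CNN one by one, the resulting classifications are compared with the labels, the misclassifications are counted, and the classification accuracy is computed. In this test the accuracy reached 83%, which matches the Caffe result.

From these tests we can conclude: (1) OpenCL makes it easy to port algorithms written in a high-level language; (2) porting a CNN requires taking the actual hardware into account and tailoring the model and data to fit; (3) the Cyclone V has limited logic resources (85K logic elements, of which the OpenCL kernels use 83%), so to go faster one can either choose a higher-performance device (such as Stratix V or Arria 10) or build the compute system by hand in RTL. I am currently building a parallel, pipelined CNN accelerator in RTL on Stratix V; anyone interested is welcome to get in touch. Writing this up took effort, so I am requesting that the post be marked as featured.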
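
The log gives enough to put numbers on both accuracy and speed. Here is a small Python sketch of the arithmetic, assuming the reported elapsed time covers the processing of all 100 images (the log does not break the time down any further):

```python
elapsed_s = 141.861    # "Elapsed Time" reported in the log
num_images = 100       # number of images entered at the prompt
errors = 17            # final "error count" in the log

accuracy = 100.0 * (num_images - errors) / num_images
seconds_per_image = elapsed_s / num_images
images_per_second = num_images / elapsed_s

print("accuracy          = %.0f %%" % accuracy)
print("seconds per image = %.3f" % seconds_per_image)
print("images per second = %.3f" % images_per_second)
```

That works out to roughly 1.4 s per image, or about 0.7 images per second, against the more than 2,300 images per second quoted earlier for the Microsoft Stratix V accelerator.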

This post was last edited by zhaoyongke on 2015-5-30 17:53

Replies (42)

#2 Mr_Hertz

Learned something new, and here's my support!

Posted 2015-5-30 19:47

#3 zhaoyongke

Quote: Mr_Hertz posted on 2015-5-30 19:47
Learned something new, and here's my support!

Thanks for the support!

Caffe please.

Posted 2015-5-30 23:27

#4 574433742

This looks really impressive. I want to study it and then try to replicate it...

Sharing builds a better future...

Posted 2015-5-31 10:48

#5 zhaoyongke

Quote: 574433742 posted on 2015-5-31 10:48
This looks really impressive. I want to study it and then try to replicate it...

Let's improve together.

Caffe please.

Posted 2015-6-1 09:19

#6 zgbkdlm

Pretty awesome~

Posted 2015-6-1 14:53

#7 zhaoyongke

Quote: zgbkdlm posted on 2015-6-1 14:53
Pretty awesome~

Caffe please.

Posted 2015-6-1 17:16

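#8 chenzhufly

I don't fully understand it, but it sounds impressive! Here to learn.

Life is oil, salt, soy sauce, and vinegar with a little sugar added; happiness is staying cheerfully busy from morning till night =================================== Be a simple person, steady and pragmatic, not lost in fantasy, not troubling yourself needlessly

Posted 2015-6-1 17:24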

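#9 zhaoyongke

Quote: chenzhufly posted on 2015-6-1 17:24
I don't fully understand it, but it sounds impressive! Here to learn.

Looking forward to moderator Zhu's detailed explanation of high-speed HPS-FPGA communication over AXI.

Caffe please.

Posted 2015-6-1 21:14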

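#10 chenzhufly

Hasn't alex already started a course on this?
I am working hard on learning it too.

Life is oil, salt, soy sauce, and vinegar with a little sugar added; happiness is staying cheerfully busy from morning till night =================================== Be a simple person, steady and pragmatic, not lost in fantasy, not troubling yourself needlessly

Posted 2015-6-1 21:26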

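#11 zhaoyongke

Quote: chenzhufly posted on 2015-6-1 21:26
Hasn't alex already started a course on this?
I am working hard on learning it too.

Oh, so it has already been posted; I haven't read it yet.

Caffe please.

Posted 2015-6-2 10:34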

#12 qq365317289

OP, very impressive~ If it is convenient, could you leave an email address so we can talk?~

Posted 2015-6-5 16:12

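#13 ZT_15

Hello, blogger! I am a senior undergraduate majoring in microelectronics, and my graduation project is to accelerate a CNN on an FPGA. I was recently busy with graduate-school interviews and admission transfers, so I only started the project half a month ago. After going through all kinds of material I have grasped the ideas behind CNNs and implemented one in Matlab that can do simple image recognition. I understand that the convolution and fully connected layers carry a very heavy computational load and are the key places to consider when accelerating a CNN on an FPGA. But this is my first contact with OpenCL, and implementing the convolution and fully connected layers in OpenCL is very difficult for me. My graduation depends on it, and I am truly anxious. Today I stumbled upon this forum and saw that you have been working on this very problem, which felt like the clouds parting. Kind blogger, could you send me the complete OpenCL implementation of the CNN? I promise to respect your research work and will not pass it on to anyone. I know it is not right to ask this, but I really have no other option. I am a newly registered user and cannot send you a private message, so if you see this reply, please have mercy. My email is 782097656@qq.com. I would be very grateful!
This post was last edited by ZT_15 on 2015-6-5 16:57

Posted 2015-6-5 16:18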

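#14 zhaoyongke

Quote: qq365317289 posted on 2015-6-5 16:12
OP, very impressive~ If it is convenient, could you leave an email address so we can talk?~

Sent you a private message.

Caffe please.

Posted 2015-6-10 14:39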

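#15 zhaoyongke

Quote: ZT_15 posted on 2015-6-5 16:18
Hello, blogger! I am a senior undergraduate majoring in microelectronics, and my graduation project is to accelerate a CNN on an FPGA. I was recently busy with graduate-school interviews and admission transfers, so I only started the project half a month ago. After going through all kinds of material I have grasped the ideas behind CNNs and implemented one in Matlab that can do simple image recognition. I understand that the convolution and fully connected layers carry a very heavy computational load and are the key places to consider when accelerating a CNN on an FPGA. But this is my first contact with OpenCL, and implementing the convolution and fully connected layers in OpenCL is very difficult for me. My graduation depends on it, and I am truly anxious. Today I stumbled upon this forum and saw that you have been working on this very problem, which felt like the clouds parting. Kind blogger, could you send me the complete OpenCL implementation of the CNN? I promise to respect your research work and will not pass it on to anyone. I know it is not right to ask this, but I really have no other option. I am a newly registered user and cannot send you a private message, so if you see this reply, please have mercy. My email is [omitted]. I would be very grateful!

Sorry, I can't make it public for now.

Caffe please.

Posted 2015-6-10 14:40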