python实现矩阵乘法,c语言实现矩阵乘法

原始Python代码：

用np.random.randint随机生成两个1到100内的100*100的数组，做矩阵相乘。

import numpy as npimport timefrom numba import jitarr_a = np.random.randint(1,100,10000).reshape((100,100))arr_b = np.random.randint(1,100,10000).reshape((100,100))def mutiply(arr_a,arr_b): res = np.zeros((arr_a.shape[0], arr_b.shape[1])) for i in range(arr_a.shape[0]): for j in range(arr_b.shape[1]): for k in range(arr_b.shape[0]): res[i,j] += arr_a[i,k] * arr_b[k,j] return resstart = time.time()print("Result:n",mutiply(arr_a,arr_b))print("耗时%s" %(time.time() - start)) Result: [[243963. 274319. 263109. ... 263299. 271558. 258586.] [236866. 255407. 244862. ... 228297. 227939. 244239.] [236049. 245324. 260419. ... 249052. 252747. 254496.] ... [243759. 258651. 232168. ... 225277. 235829. 246914.] [254883. 277709. 260438. ... 254744. 260553. 265045.] [267688. 286480. 282399. ... 261162. 267804. 277926.]]耗时1.0142886638641357 1.使用Numba

Numba官方网站：Numba: A High Performance Python Compiler

Numba 是 Python 的实时编译器，最适合使用 NumPy 数组和函数以及循环的代码。使用 Numba 最常见的方法是通过它的 decorator 集合（应该是一些@注解），这些 decorator 可以应用到函数中，指示 Numba 编译它们。当对 Numba-decorated 函数进行调用时，它被编译成机器代码“ just-in-time”以便执行，随后代码可以以本机代码速度运行！

如果你的代码是面向数字科学运算的（做很多数学运算），大量使用 NumPy 和/或有很多循环，那么 Numba 通常是一个不错的选择。

加速方法：在方法上面加@jit注解

@jit的作用：Numba 提供了几个用于代码生成的实用程序，其核心特性是 Numba.jit () 。使用这个修饰符，您可以通过 Numba 的 JIT 编译器将一个函数标记为使用cuda优化该函数。其中jit()有几种参数用来控制不同的模式。共有两种不同的编译模式，即nopython模式以及对象模式，Nopython 编译模式的行为基本上是编译修饰函数，这样它就可以在没有 Python 解释器参与的情况下完全运行，这是最常用的方法。

使用nopython模式：在jit的参数里令nopython = True即可启用该模式。

import numpy as npimport timefrom numba import jitarr_a = np.random.randint(1,100,10000).reshape((100,100))arr_b = np.random.randint(1,100,10000).reshape((100,100))@jit(nopython=True)def mutiply(arr_a,arr_b): res = np.zeros((arr_a.shape[0], arr_b.shape[1])) for i in range(arr_a.shape[0]): for j in range(arr_b.shape[1]): for k in range(arr_b.shape[0]): res[i,j] += arr_a[i,k] * arr_b[k,j] return resstart = time.time()print("Result:n",mutiply(arr_a,arr_b))print("耗时%s" %(time.time() - start)) Result: [[212568. 235228. 263089. ... 248437. 235243. 266715.] [230020. 224216. 253386. ... 219960. 235063. 259665.] [211376. 216862. 239213. ... 213518. 222902. 231084.] ... [221239. 250413. 260120. ... 245681. 238919. 257253.] [224442. 209056. 244029. ... 234404. 227210. 264708.] [220948. 223777. 253604. ... 229385. 238134. 245019.]]耗时0.4268648624420166 2.使用Pycuda

Pycuda文档：设备接口 - PyCUDA 2020.1 文档

demo参考：PyCUDA矩阵乘法的精度代码 - VoidCC

使用Pycuda需要自己写C++/C的kernal函数，然后再通过get_function去调用SourceModule里面的核函数，代码如下：

import pycuda.driver as cudaimport pycuda.toolsimport pycuda.autoinitimport numpy as npimport numpy.linalg as lafrom pycuda.compiler import SourceModuleimport timemod = SourceModule("""__global__ void matrixMultiply(float * A, float * B, float * C, int A_shape_0,int A_shape_1,int B_shape_1) { float cValue = 0; int Row = blockIdx.y * blockDim.y + threadIdx.y; int Col = blockIdx.x * blockDim.x + threadIdx.x; if ((Row < A_shape_0) && (Col < B_shape_1)) { for (int k = 0; k < A_shape_1; k++) { cValue += A[Row*A_shape_1 + k] * B[k*B_shape_1 + Col]; } C[Row*B_shape_1 + Col] = cValue; }}""")matrixMultiply = mod.get_function("matrixMultiply")n = 100A = np.random.randint(0,100,10000).reshape(100,100).astype(np.float32)B = np.random.randint(0,100,10000).reshape(100,100).astype(np.float32)C = np.zeros((100,100)).astype(np.float32)BLOCK_SIZE = 10# 在设备上申请存储空间A_gpu = cuda.mem_alloc(A.nbytes)B_gpu = cuda.mem_alloc(B.nbytes)C_gpu = cuda.mem_alloc(C.nbytes)# 将数组从host拷贝到显卡cuda.memcpy_htod(A_gpu, A)cuda.memcpy_htod(B_gpu, B)# 设定grid大小if n%BLOCK_SIZE != 0: grid=(n//BLOCK_SIZE+1,n//BLOCK_SIZE+1,1)else: grid=(n//BLOCK_SIZE,n//BLOCK_SIZE,1)# call gpu functionstart = time.time()matrixMultiply(A_gpu, B_gpu, C_gpu,np.int32(A.shape[0]),np.int32(A.shape[1]),np.int32(B.shape[1]), block=(BLOCK_SIZE,BLOCK_SIZE,1), grid=grid);# copy back the resultcuda.memcpy_dtoh(C, C_gpu)print("Result:n",C)print("耗时%s" %(time.time() - start)) Result: [[219468. 214786. 230702. ... 245646. 236251. 250875.] [227736. 221473. 224950. ... 247127. 247688. 246141.] [223986. 193710. 221462. ... 231594. 245623. 234833.] ... [249705. 238607. 253167. ... 253975. 284177. 246474.] [207058. 212837. 217770. ... 219180. 261689. 224773.] [213341. 231024. 251518. ... 229844. 268992. 245802.]]耗时0.002018451690673828 3.使用Pybind11 调用C++写的CUDA代码

Pybind11 是一个轻量级的 C++ 库，用于将你的 C++ 代码暴露给 Python 调用（反之也可，但主要还是前者）。Pybind11 借鉴了 Boost::Python 库的设计，但使用了更为简洁的实现方式，使用了大量 C++11 的新特性，更易于使用。

pybind安装参考：pybind11使用 - 简书

demo参考：使用python 调用 pybind11封装的 cuda C++ 动态链接库 - 简书

C++代码：

写出cuda核函数，例子中用到了numpy在c++中的表示。调用主要用到PYBIND11 _MODULE(example, m)该方法，参数中第一个为导出的包名，第二个参数为模块实例对象，pybind11的模块实例对象提供了 def()函数。

#include<pybind11/pybind11.h>#include <iostream>#include <stdio.h>#include <stdlib.h>#include <cuda_runtime.h>#include <device_launch_parameters.h>#include <pybind11/numpy.h>namespace py = pybind11;__global__ void matrix_glbal_mul(float* arr_a, float* arr_b, float* res, int a_shape_1){//a_shape_0，a_shape_1分别为第一个数组的行数和列数，b_shape_1为第二个数组的列数int x = threadIdx.x + blockIdx.x * blockDim.x; // 定位到res的列索引int y = threadIdx.y + blockIdx.y * blockDim.y; // 定位到res的行索引float Pvalue = 0;for (int k = 0; k < a_shape_1; ++k)Pvalue += arr_a[y * a_shape_1 + k] * arr_b[k * a_shape_1 + x];res[y * a_shape_1 + x] = Pvalue;}py::array_t<float> np_multiply(py::array_t<float> &arr_a, py::array_t<float> &arr_b){//可通过此函数传入python中的numpy.ndarray数据，在C++中表现为py::array_t<T>格式。py::buffer_info bufA = arr_a.request(), bufB = arr_b.request();//request方法活得对py::array_t<T>的绑定，包括维度、数据指针、size、shape等参数const int a_shape_0 = bufA.shape[0], a_shape_1 = bufA.shape[1], b_shape_1 = bufB.shape[1];//分别是A的行数、列数、B的列数std::cout << a_shape_0 << a_shape_1 << b_shape_1 << std::endl;auto result = py::array_t<float>(a_shape_0 * b_shape_1);result.resize({ a_shape_0, b_shape_1 });py::buffer_info bufResult = result.request();float *ptrA = (float *)bufA.ptr,*ptrB = (float *)bufB.ptr,*ptrResult = (float *)bufResult.ptr; //获得数据指针float *d_a, *d_b, *d_res;cudaMalloc((void **)&d_a, a_shape_0 * a_shape_1 * sizeof(float));cudaMalloc((void **)&d_b, a_shape_1 * b_shape_1 * sizeof(float));cudaMalloc((void **)&d_res, a_shape_0 * b_shape_1 * sizeof(float));cudaMemcpy(d_a, ptrA, a_shape_0 * a_shape_1 * sizeof(float), cudaMemcpyHostToDevice);cudaMemcpy(d_b, ptrB, a_shape_1 * b_shape_1 * sizeof(float), cudaMemcpyHostToDevice);//constexpr const int TP = 10;//dim3 block(TP, TP);//dim3 grid(a_shape_0 / TP, b_shape_1 / TP); constexpr const int TP = 16; dim3 block(TP, TP); dim3 grid((a_shape_0 + TP - 1) / TP, (b_shape_1 + TP - 1) / TP);matrix_glbal_mul <<<grid, block >>> (d_a, d_b, d_res,a_shape_1);cudaMemcpy(ptrResult, d_res, a_shape_0 * b_shape_1 * sizeof(float), cudaMemcpyDeviceToHost);cudaFree(d_a);cudaFree(d_b);cudaFree(d_res);return result;}PYBIND11_MODULE(example, m) {m.doc() = "pybind11 example module";m.def("matrix_glbal_mul", &matrix_glbal_mul, "Multuply tow arrays");m.def("np_multiply", &np_multiply, "Multuply tow arrays");}

python调用代码：

import numpy as npimport demo.example as exampleimport timearr_a = np.random.randint(1,100,10000).reshape((100,100))arr_b = np.random.randint(1,100,10000).reshape((100,100))start = time.time()res = example.np_multiply(arr_a,arr_b)print("Result:n",res)print("耗时%s" %(time.time() - start)) Result: [[279828. 259870. 266260. ... 254709. 227848. 250391.] [237871. 228993. 244860. ... 235741. 207431. 227064.] [268107. 233281. 259488. ... 252508. 220149. 248723.] ... [276107. 237983. 269437. ... 253083. 233473. 255776.] [251326. 214743. 248869. ... 231401. 200128. 224235.] [300701. 283541. 292042. ... 289940. 255317. 274050.]]耗时0.14971685409545898