Phosphorus' Blog

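A Summary of Problems Installing CUDA on Windows

Over the past few days a project required me to install CUDA and set up an nvcc development environment on my Windows desktop. I did not expect the installation to take more than a day, and it only worked after I had forced my way through a string of errors. Below I list every problem I ran into, together with its fix, as a reference for future installs.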

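Black screen on boot after installation

This was the first problem I hit when installing CUDA ~~(and for a while it had me convinced my graphics card had died)~~; in the end it turned out that the driver installed by the CUDA installer was incompatible with my card.

Solution:

Boot into Safe Mode, open Device Manager, and use "Roll Back Driver" on the display adapter to return to the previous driver version. After a reboot the system runs normally again.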

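nvcc reports "cannot find cl.exe" when compiling

A fairly obvious error: on Windows, nvcc depends on the MSVC toolchain (there is currently no MinGW support at all), so a complete set of MSVC Build Tools is required. (I ended up using the Visual C++ 2015 v140 toolset for desktop.)

Solution:

Install the MSVC component of Visual Studio, then add the directory containing cl.exe to the PATH environment variable.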

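CUDA does not run correctly

In my case none of the __global__ functions had any effect, and cudaGetDeviceCount reported a very large device count (most likely an uninitialized value, since the call itself had failed).

My first guess was a problem with the CUDA Runtime. Checking the return value showed that cudaGetDeviceCount returned error code 35 (cudaErrorInsufficientDriver), whose message is "CUDA driver version is insufficient for CUDA runtime version". So the cause was a mismatch between the driver version and the CUDA version.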
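The two versions named in that message are the ones cudaDriverGetVersion and cudaRuntimeGetVersion report, each packed into a single integer as 1000 × major + 10 × minor (for example 9010 for 9.1). The host-only sketch below decodes such a value and applies what is, roughly, the comparison behind the message for the CUDA releases discussed here. The helper names formatCudaVersion and driverSufficient are hypothetical, not part of the CUDA API.

```cpp
#include <string>

// Decode CUDA's packed version integer (1000 * major + 10 * minor).
// Hypothetical helper for illustration; not a CUDA API function.
std::string formatCudaVersion(int packed) {
    int major = packed / 1000;
    int minor = (packed % 1000) / 10;
    return std::to_string(major) + "." + std::to_string(minor);
}

// Assumption: the runtime needs a driver that supports at least the
// runtime's own version, so both packed integers can be compared directly.
bool driverSufficient(int driverVersion, int runtimeVersion) {
    return driverVersion >= runtimeVersion;
}
```

Feeding the two integers from cudaDriverGetVersion and cudaRuntimeGetVersion into driverSufficient gives a quick check before any kernel is launched; with illustrative values of 9010 for the driver and 9020 for the runtime it returns false.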

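Solution:

Install a CUDA version that matches the current driver. For example, for my driver, 391.35, the best fit is CUDA 9.1: it is the newest release in the table below whose minimum Windows driver version (391.29) is not above 391.35.

The following driver compatibility table comes from NVIDIA's website: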

CUDA Toolkit	Linux x86_64 Driver Version	Windows x86_64 Driver Version
CUDA 10.0.130	>= 410.48	>= 411.31
CUDA 9.2 (9.2.148 Update 1)	>= 396.37	>= 398.26
CUDA 9.2 (9.2.88)	>= 396.26	>= 397.44
CUDA 9.1 (9.1.85)	>= 390.46	>= 391.29
CUDA 9.0 (9.0.76)	>= 384.81	>= 385.54
CUDA 8.0 (8.0.61 GA2)	>= 375.26	>= 376.51
CUDA 8.0 (8.0.44)	>= 367.48	>= 369.30
CUDA 7.5 (7.5.16)	>= 352.31	>= 353.66
CUDA 7.0 (7.0.28)	>= 346.46	>= 347.62
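Reading the table by hand is easy to get wrong, so here is a small host-only sketch that does the lookup for the Windows column. The rows are copied from the table above; the names ToolkitRow, kWindowsTable and newestToolkitFor are hypothetical, and the table only covers the releases listed here.

```cpp
#include <string>
#include <vector>

// One row of the compatibility table (Windows column only).
struct ToolkitRow {
    std::string toolkit;  // CUDA Toolkit release
    int minMajor;         // minimum Windows driver version, part before the dot
    int minMinor;         // minimum Windows driver version, part after the dot
};

// Rows copied from the table above, newest toolkit first.
static const std::vector<ToolkitRow> kWindowsTable = {
    {"CUDA 10.0.130", 411, 31},
    {"CUDA 9.2 (9.2.148 Update 1)", 398, 26},
    {"CUDA 9.2 (9.2.88)", 397, 44},
    {"CUDA 9.1 (9.1.85)", 391, 29},
    {"CUDA 9.0 (9.0.76)", 385, 54},
    {"CUDA 8.0 (8.0.61 GA2)", 376, 51},
    {"CUDA 8.0 (8.0.44)", 369, 30},
    {"CUDA 7.5 (7.5.16)", 353, 66},
    {"CUDA 7.0 (7.0.28)", 347, 62},
};

// Returns the newest listed toolkit whose minimum driver version is not
// above the given driver, or an empty string if the driver is too old
// for every row.
std::string newestToolkitFor(int driverMajor, int driverMinor) {
    for (const ToolkitRow& row : kWindowsTable) {
        bool ok = driverMajor > row.minMajor ||
                  (driverMajor == row.minMajor && driverMinor >= row.minMinor);
        if (ok) return row.toolkit;
    }
    return "";
}
```

For the driver mentioned above, newestToolkitFor(391, 35) returns "CUDA 9.1 (9.1.85)", which matches the choice made by hand.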

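Assorted strange compilation errors

With the environment above in place, I still ran into two unexpected errors at compile time. Both turned out to have the same cause: older CUDA releases do not support newer compiler toolsets.

Parts of the error output:

...\include\type_traits(603): error: expression must have a constant value

fatal error -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!

If you see these errors, don't panic: it (usually) just means that the toolset you are using is too new.

Solution:

Download and install an older toolset (such as Visual C++ 2015 v140), or add an older toolset package by hand in the Visual Studio Installer. Then change the PATH entry for cl.exe so that it points at the older toolset. (Visual Studio users can simply switch the project's platform toolset instead.)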
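To tell which toolset the cl.exe on PATH actually belongs to, it helps to look at the compiler's _MSC_VER macro. The sketch below maps an _MSC_VER value to a Visual Studio release and checks it against the versions named in the error message. The function names are hypothetical, and the range check only mirrors the wording of the message: as an assumption worth verifying against the release's documentation, a given CUDA release may still reject later updates of a version it lists.

```cpp
#include <string>

// Maps an _MSC_VER value to the Visual Studio release it belongs to.
// Hypothetical helper for illustration.
std::string visualStudioRelease(int mscVer) {
    if (mscVer < 1700) return "older than 2012";
    if (mscVer < 1800) return "2012";
    if (mscVer < 1900) return "2013";
    if (mscVer < 1910) return "2015";  // v140 toolset
    if (mscVer < 1920) return "2017";  // v141 toolset
    return "newer than 2017";
}

// True if the compiler falls in the range named by the error message
// (2012, 2013, 2015 and 2017). A real CUDA release may be stricter.
bool listedInErrorMessage(int mscVer) {
    return mscVer >= 1700 && mscVer < 1920;
}
```

Compiling a one-line program that prints _MSC_VER with the cl.exe found on PATH and passing the value to visualStudioRelease shows whether the PATH change took effect; the v140 toolset used in this post reports 1900.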

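Verifying the result

With the problems above out of the way, we can test whether CUDA works properly.

Here is a test program I wrote myself (it is fairly long):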

#include <cuda_runtime.h>
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <cstdio>
#include <ctime>
#include <windows.h>

using std::cout;
using std::endl;
typedef float calc_type;

void randomize(calc_type * array, int len) { // fill the array with random values in [0, 1]
	for(int i = 0 ; i < len ; i ++) array[i] = rand()/((calc_type)RAND_MAX);
}

void cpuAddition(calc_type * a, calc_type *b, calc_type *c, int len) { // element-wise addition on the CPU
	for(int i = 0 ; i < len ; i ++) c[i] = a[i] + b[i];
}

__global__ void gpuAddition(calc_type * a, calc_type *b, calc_type *c, int len) { // element-wise addition on the GPU
	int idx = threadIdx.x + blockIdx.x * blockDim.x; // global index of this thread
	if(idx < len) c[idx] = a[idx] + b[idx]; // threads past the end of the arrays do nothing
}

__global__ void helloWorldMulti() {
	// the value printed after "grid" is blockIdx.x, the index of the block within the grid
	printf("Hello World from gpu thread %d grid %d\n", threadIdx.x, blockIdx.x);
}

const int arrSize = 1024 * 1024 * 128;

int main(int argc, char ** argv) {
	srand((unsigned)time(NULL));
	cudaDeviceProp device;
	cudaError_t err = cudaGetDeviceProperties(&device, 0); // query the properties of device 0
	if(err != cudaSuccess) { // e.g. error 35 when the driver is too old for the runtime
		cout << "CUDA error " << (int)err << " : " << cudaGetErrorString(err) << endl;
		return 1;
	}
	int driver, runtime;
	cudaDriverGetVersion(&driver);
	cudaRuntimeGetVersion(&runtime);
	cout << "Device : \"" << device.name << "\"" << endl;
	cout << "	CUDA Runtime Version : " << runtime / 1000 << "." << (runtime % 100) / 10 << endl;
	cout << "	Device CUDA Capability : " << device.major << "." << device.minor << endl;
	cout << "	Memory : " << (float)device.totalGlobalMem / pow(1024.0, 3) << " GigaByte(s)" << endl;
	cout << "	Constant Memory : " << device.totalConstMem << " Byte(s)" << endl;
	cout << "	L2 Cache Size : " << device.l2CacheSize << " Byte(s)" << endl;
	cout << "	GPU Clock Rate : " << device.clockRate / 1000 << " MHz" << endl;
	cout << "	Memory Clock Rate : " << device.memoryClockRate / 1000 << " MHz" << endl;
	cout << "	Memory Bus Width : " << device.memoryBusWidth << "-bit" << endl;
	cout << "	Shared Memory per Block : " << device.sharedMemPerBlock << " Byte(s)" << endl;
	cout << "	Warp Size : " << device.warpSize << endl;
	cout << "	Maximum threads per block : " << device.maxThreadsPerBlock << endl;
	cout << "	Maximum Dimensions of block : (" << device.maxThreadsDim[0] << ", " << device.maxThreadsDim[1] << ", " << device.maxThreadsDim[2] << ")" << endl;
	cout << "	Maximum Dimensions of grid : (" << device.maxGridSize[0] << ", " << device.maxGridSize[1] << ", " << device.maxGridSize[2] << ")" << endl;
	
	cout << endl << endl << "Multi-thread tests :" << endl;
	helloWorldMulti<<<2, 2>>>(); // launch 2 blocks of 2 threads each (4 threads in total) as an output test
	cudaDeviceSynchronize();
	
	getchar();
	
	cout << endl << "Addition tests : " << endl; // addition test
	calc_type *a, *b, *c;
	// allocate and initialize the host data
	a = (calc_type *)malloc(sizeof(calc_type) * arrSize);
	b = (calc_type *)malloc(sizeof(calc_type) * arrSize);
	c = (calc_type *)malloc(sizeof(calc_type) * arrSize);
	randomize(a, arrSize);
	randomize(b, arrSize);
	calc_type *ga, *gb, *gc, *gr;
	gr = (calc_type *)malloc(sizeof(calc_type) * arrSize); // host buffer for the GPU result
	cudaMalloc(&ga, sizeof(calc_type) * arrSize);
	cudaMalloc(&gb, sizeof(calc_type) * arrSize);
	cudaMalloc(&gc, sizeof(calc_type) * arrSize);
	DWORD c1, c2, g1, g2;
	
	c1 = GetTickCount();
	cpuAddition(a, b, c, arrSize);
	c2 = GetTickCount();
	
	const int threads = 1024; // threads per block
	g1 = GetTickCount();
	// copy the inputs into device memory
	cudaMemcpy(ga, a, sizeof(calc_type) * arrSize, cudaMemcpyHostToDevice);
	cudaMemcpy(gb, b, sizeof(calc_type) * arrSize, cudaMemcpyHostToDevice);
	gpuAddition<<<(arrSize + threads - 1) / threads, threads>>>(ga, gb, gc, arrSize); // round the block count up
	err = cudaDeviceSynchronize(); // wait for the kernel and pick up any launch or execution error
	g2 = GetTickCount();
	if(err != cudaSuccess)
		cout << "CUDA error " << (int)err << " : " << cudaGetErrorString(err) << endl;
	
	cudaMemcpy(gr, gc, sizeof(calc_type) * arrSize, cudaMemcpyDeviceToHost);
	cudaFree(ga);
	cudaFree(gb);
	cudaFree(gc);
	int errors = 0;
	// verify the GPU result against the CPU result
	for(int i = 0 ; i < arrSize ; i ++ )
		if( fabs(gr[i] - c[i]) > 1e-6 ) {
			errors ++; // count each mismatch and print the two values
			cout << gr[i] << ' ' << c[i] << endl;
		}
	cout << errors << " Error(s) found" << endl;
	cout << "CPU Time Consumption : " << c2 << " -> " << c1 << " = " << (c2 - c1) << " ms" << endl;
	cout << "GPU Time Consumption : " << g2 << " -> " << g1 << " = " << (g2 - g1) << " ms" << endl;
	// release the host buffers
	free(a);
	free(b);
	free(c);
	free(gr);
	cudaDeviceReset();
	return 0;
}
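The grid size for a launch like the one in the listing comes from dividing the element count by the block size, which is exact here only because arrSize is a multiple of 1024. For other sizes the usual pattern is to round the block count up and let each thread check its index against the length. The host-only sketch below shows that arithmetic; blocksFor and activeThreads are hypothetical names, and activeThreads merely simulates the index guard on the CPU rather than running anything on the GPU.

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= n (rounds up).
// Assumes n + threadsPerBlock - 1 still fits in an int.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Host-side simulation of the per-thread guard "if (idx < n)":
// walks every (block, thread) pair of the rounded-up grid and counts
// how many of them would actually touch an element.
int activeThreads(int n, int threadsPerBlock) {
    int count = 0;
    int blocks = blocksFor(n, threadsPerBlock);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            int idx = thread + block * threadsPerBlock;  // same formula as in a kernel
            if (idx < n) ++count;
        }
    }
    return count;
}
```

For the array size used above, blocksFor(1024 * 1024 * 128, 1024) gives 131072 blocks with no spare threads. For 1000 elements and 256 threads per block it gives 4 blocks, of which the last 24 threads are idle.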

Compiling it with nvcc and running it gives the following output:

Device : "GeForce GTX 950"
        CUDA Runtime Version : 9.1
        Device CUDA Capability : 5.2
        Memory : 2 GigaByte(s)
        Constant Memory : 65536 Byte(s)
        L2 Cache Size : 1048576 Byte(s)
        GPU Clock Rate : 1190 MHz
        Memory Clock Rate : 3305 MHz
        Memory Bus Width : 128-bit
        Shared Memory per Block : 49152 Byte(s)
        Warp Size : 32
        Maximum threads per block : 1024
        Maximum Dimensions of block : (1024, 1024, 64)
        Maximum Dimensions of grid : (2147483647, 65535, 65535)


Multi-thread tests :
Hello World from gpu thread 0 grid 1
Hello World from gpu thread 1 grid 1
Hello World from gpu thread 0 grid 0
Hello World from gpu thread 1 grid 0


Addition tests :
0 Error(s) found
CPU Time Consumption : 49762109 -> 49761359 = 750 ms
GPU Time Consumption : 49762296 -> 49762109 = 187 ms

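As the output shows, the GPU computation runs correctly and is clearly faster than the CPU, which confirms that CUDA is set up properly. (The GPU figure covers the two host-to-device copies and the kernel, but not the copy of the result back to the host.)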
