I felt I still could not get a deep understanding of CUDA's asynchronous concurrent execution, so on the advice of an expert I read the CUDA C Programming Guide, which helped me a great deal.
Below I share my takeaways, with the original text attached:
Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task. Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate device resources are available. This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks.
With asynchronous calls, the CPU is not blocked or left idle while the GPU is running; the host thread can keep executing other parts of the program.
The following device operations are asynchronous with respect to the host:
‣ Kernel launches;
‣ Memory copies within a single device’s memory;
‣ Memory copies from host to device of a memory block of 64 KB or less;
‣ Memory copies performed by functions that are suffixed with Async;
‣ Memory set function calls.
That is, the following device operations return control to the host immediately:
1. Kernel launches;
2. Memory copies within a single device's memory;
3. Host-to-device copies of memory blocks of 64 KB or less;
4. Memory copies performed by functions suffixed with Async;
5. Memory set function calls.
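A minimal sketch of what this means in practice (the kernel name and launch configuration are placeholders): the launch call returns to the host immediately, and the host blocks only when it explicitly synchronizes.

#include <cuda_runtime.h>

__global__ void MyKernel(float* data) { /* placeholder kernel body */ }

void launchAsync(float* devPtr)
{
    MyKernel<<<100, 512>>>(devPtr);  // asynchronous: control returns to the host at once
    // ... the host thread is free to do other work here while the GPU runs ...
    cudaDeviceSynchronize();         // block only when the result is actually needed
}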
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
A programmer can globally disable asynchronous kernel launches for all CUDA applications on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1 before starting the application; this is for debugging only.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
In other words, when a profiler (Nsight, Visual Profiler) collects hardware counters, kernel launches become synchronous unless concurrent kernel profiling is enabled; and async memory copies also fall back to synchronous if the host memory they touch is not page-locked.
Some devices of compute capability 2.x and higher can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property (see Device Enumeration), which is equal to 1 for devices that support it.
GPUs of compute capability 2.x and higher can execute multiple kernels concurrently; a program checks the concurrentKernels device property, which equals 1 on devices that support this.
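A minimal sketch of this property query (device 0 is assumed; the same call also exposes asyncEngineCount, which the copy-overlap passages below rely on):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("concurrentKernels: %d\n", prop.concurrentKernels);  // 1 if concurrent kernels are supported
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);   // >0: copy/kernel overlap; 2: copies overlap both ways
    return 0;
}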
The maximum number of kernel launches that a device can execute concurrently depends on its compute capability and is listed in Table 13.
The maximum number of kernels a device can execute concurrently depends on its compute capability; see Table 13 of the guide.
A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
In other words, kernels can run concurrently only within the same CUDA context; a kernel from one context never runs concurrently with a kernel from another.
Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
So a kernel that uses many textures or a large amount of local memory is less likely to execute concurrently with other kernels.
Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it. If host memory is involved in the copy, it must be page-locked.
Some devices can overlap an asynchronous memory copy to or from the GPU with kernel execution; a program checks the asyncEngineCount device property, which is greater than zero on devices that support it. If host memory is involved in the copy, it must be page-locked or no overlap occurs.
It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.
An intra-device copy (device memory to device memory) can likewise run simultaneously with kernel execution (on devices with the concurrentKernels property) and/or with copies to or from the device (on devices with the asyncEngineCount property); it is issued with the standard memory copy functions, with source and destination addresses on the same device.
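A minimal sketch, assuming two streams created beforehand and device pointers that all reside on the same device (the kernel and all identifiers are placeholders):

#include <cuda_runtime.h>

__global__ void Process(float* data) { /* placeholder kernel body */ }

void overlapIntraDeviceCopy(float* devSrc, float* devDst, float* otherPtr,
                            size_t size, cudaStream_t copyStream,
                            cudaStream_t kernelStream)
{
    // a device-to-device copy issued in one stream ...
    cudaMemcpyAsync(devDst, devSrc, size, cudaMemcpyDeviceToDevice, copyStream);
    // ... may overlap a kernel running in another stream
    Process<<<100, 512, 0, kernelStream>>>(otherPtr);
}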
Concurrent Data Transfers
Some devices of compute capability 2.x and higher can overlap copies to and from the device. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is equal to 2 for devices that support it. In order to be overlapped, any host memory involved in the transfers must be page-locked.
Some devices of compute capability 2.x and higher can overlap copies to and from the device at the same time; a program checks the asyncEngineCount device property, which equals 2 when this is supported, and any host memory involved in the transfers must be page-locked.
Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).
Applications manage the concurrent operations described above through streams. A stream is a sequence of commands that execute in order; different streams may execute their commands out of order or concurrently with respect to one another, and since this behavior is not guaranteed, it must not be relied on for correctness.
A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
A stream is defined by creating a stream object and passing it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. For example:
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);    // create two streams
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);  // page-locked host memory, required for overlap
Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:
Each stream is then defined as a sequence of one host-to-device memory copy, one kernel launch, and one device-to-host memory copy:
for (int i = 0; i < 2; ++i) {
    // copy this stream's slice of the input to the device
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    // process it on the device
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
    // copy the result back to the same slice of hostPtr
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.
Each stream copies its portion of the input array hostPtr to inputDevPtr in device memory, processes it on the device by calling MyKernel(), and copies the result in outputDevPtr back to the same portion of hostPtr. How the streams overlap in this example depends on the device's capabilities (see Overlapping Behavior); note that hostPtr must point to page-locked host memory for any overlap to occur.
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);    // wait for the stream's work, then release it
cudaStreamDestroy() waits for all preceding commands in the given stream to complete before destroying the stream and returning control to the host thread.
cudaStreamDestroy() waits for all preceding commands in the given stream to complete, then destroys the stream and returns control to the host thread.
Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.
Kernel launches and host <-> device memory copies that do not specify a stream parameter, or equivalently set it to zero, go to the default stream and therefore execute in order.
For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and each host thread has its own default stream.
For code compiled with the --default-stream per-thread flag (or that defines CUDA_API_PER_THREAD_DEFAULT_STREAM before including the CUDA headers), the default stream is a regular stream and each host thread has its own default stream.
For code that is compiled using the --default-stream legacy compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization.
For code compiled with the --default-stream legacy flag, the default stream is a special stream called the NULL stream, and each device has a single NULL stream shared by all host threads. The NULL stream is special because it causes the implicit synchronization described in Implicit Synchronization.
For code that is compiled without specifying a --default-stream compilation flag, --default-stream legacy is assumed as the default.
That is, if no --default-stream flag is specified, --default-stream legacy is assumed.
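A minimal sketch of the two ways to opt in to per-thread default streams named above:

// Option 1: compile with  nvcc --default-stream per-thread app.cu
// Option 2: define the macro before including the CUDA headers:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda.h>
#include <cuda_runtime.h>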
There are various ways to explicitly synchronize streams with each other.
cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.
cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.
cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.
There are several ways to synchronize streams explicitly:
1. cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed;
2. cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in that stream have completed; it synchronizes the host with one specific stream while other streams keep executing on the device;
3. cudaStreamWaitEvent() takes a stream and an event as parameters and makes all commands added to the stream after the call wait until the event completes; if the stream is 0, commands added to any stream after the call wait on the event;
4. cudaStreamQuery() lets the application check whether all preceding commands in a stream have completed.
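A minimal sketch of cudaStreamWaitEvent() together with cudaStreamSynchronize(), reusing the two streams from the earlier example (the kernel is a placeholder):

#include <cuda_runtime.h>

__global__ void Step(float* data) { /* placeholder kernel body */ }

void orderAcrossStreams(cudaStream_t* stream, float* devPtr)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    Step<<<100, 512, 0, stream[0]>>>(devPtr);
    cudaEventRecord(done, stream[0]);         // mark the end of stream[0]'s work
    cudaStreamWaitEvent(stream[1], done, 0);  // stream[1] waits for the event; the host does not block
    Step<<<100, 512, 0, stream[1]>>>(devPtr);

    cudaStreamSynchronize(stream[1]);         // host waits on stream[1] only
    cudaEventDestroy(done);
}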
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
‣ a page-locked host memory allocation,
‣ a device memory allocation,
‣ a device memory set,
‣ a memory copy between two addresses to the same device memory,
‣ any CUDA command to the NULL stream,
‣ a switch between the L1/shared memory configurations described in Compute Capability 2.x and Compute Capability 3.x.
Two commands from different streams cannot run concurrently if the host thread issues any of the following between them:
1. A page-locked host memory allocation;
2. A device memory allocation;
3. A device memory set;
4. A memory copy between two addresses in the same device's memory;
5. Any CUDA command issued to the NULL stream;
6. A switch between the L1/shared memory configurations (devices of compute capability 2.x and 3.x).
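A minimal sketch of the pitfall this list describes (pointers, sizes, and streams are assumed to be set up by the caller):

#include <cuda_runtime.h>

void implicitSyncPitfall(float* d_a, float* d_b, float* h_a, float* h_b,
                         size_t size, cudaStream_t* stream)
{
    float* d_tmp;
    cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream[0]);
    cudaMalloc(&d_tmp, size);  // device memory allocation between the copies: implicit synchronization
    cudaMemcpyAsync(d_b, h_b, size, cudaMemcpyHostToDevice, stream[1]);
    // the two copies can no longer overlap; allocate everything up front instead
}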
For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
‣ Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
On devices that support concurrent kernel execution and have compute capability 3.0 or lower, any operation that needs a dependency check on a streamed kernel launch can start executing only after all thread blocks of all prior launches in the CUDA context have started, and it blocks all later kernel launches from any stream until the checked launch completes.
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
‣ All independent operations should be issued before dependent operations,
‣ Synchronization of any kind should be delayed as long as possible.
In other words, a dependency check involves every other command in the same stream as the launch being checked, plus any cudaStreamQuery() call on that stream. To keep kernels running concurrently, applications should:
1. Issue all independent operations before dependent ones, as in the breadth-first sketch below;
2. Delay any kind of synchronization as long as possible.
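Following the first guideline, the earlier two-stream example can be re-issued breadth-first so that all independent operations come before the dependent ones (same identifiers as the code sample above):

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);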
My writing skills are limited and there may be mistakes above; corrections are welcome.