Extended CUDA Library (ecuda)
2.0
This can also be found in the README file:
ecuda Extended CUDA C++ API release 2.x
These are the release notes for ecuda version 2.
WHAT IS ECUDA?
ecuda is a C++ wrapper around the CUDA C API designed to closely resemble and be functionally equivalent to the C++ Standard Template Library (STL). Specifically: algorithms, containers, and iterators.
REQUIREMENTS
ecuda is a header-only API, and the only prerequisite is the CUDA API version 5 or later. It should work with any C++ compiler, but it has been developed and tested with several versions of gcc (most recently 4.8.4) and clang 3.6. The C++11 standard is optional, but is utilized if enabled. Visual Studio 2013 on Windows 10 was also successfully tested (see the INSTALLATION section below).
A correct setup should be able to compile the tools/print_device_info.cu program without issue. You can try:
```
$ mkdir bin
$ cd bin
$ cmake ../tools
$ make
```

to identify any issues. When run, the program prints out a pretty summary of the current system's GPU hardware and capabilities.
DOCUMENTATION:
Documentation can be viewed online:
- This is generated from the source files themselves using doxygen. The base directory contains a default doxygen.cfg file that will build a local copy of the documentation in the docs/html subdirectory. Make sure you have doxygen installed and run:

```
$ doxygen doxygen.cfg
```

INSTALLATION:
Linux/MacOS:
- As long as the include/ subdirectory is visible to the compiler, the API can be installed anywhere. A default install using CMake can be done by running:
```
$ cmake .
$ sudo make install
```

This will copy the contents of the include/ subdirectory to ${CMAKE_INSTALL_PREFIX}/include (usually /usr/local/include).
Windows/Visual Studio:
- The latest free Visual Studio at the time of last update was Visual Studio Community 2015, but it is confirmed that CUDA 7.5 does not support it at this time. I managed to get everything working with Visual Studio Community 2013 on Windows 10. Here is my story:
Download and install Visual Studio Community 2013 from:
https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx
Download and install the Nvidia CUDA Toolkit from:
http://developer.nvidia.com/cuda-downloads
The order is important since the CUDA installer integrates with any installed Visual Studio versions that it supports. Also note that in the successful configuration, only the following items in the CUDA installer's custom installation were left checked:
- CUDA Toolkit 7.5
- CUDA Visual Studio Integration 7.5
The following items were already installed on the test system with equal or greater version numbers:
- Graphics Driver
- HD Audio Driver
- NVIDIA GeForce Experience
- PhysX System Software
Do whatever makes the most sense for your situation.
- Start Visual Studio and load the ecuda.sln solution file.
- The print_device_info project contains a source file that should build successfully at this point. Build the Release target with the x64 platform, and bin/x64/Release/print_device_info.exe should appear. Running this from the Windows command line should display a pretty summary of the current system's GPU hardware and capabilities.
- When building a Debug target, Visual Studio's C++ Standard Library implementation does some kind of "iterator checking" that doesn't play nice with ecuda's custom iterators, causing erroneous assertion failures to get raised at runtime. Placing this at the beginning of a program will turn this off (and suppresses a warning about macro redefinition):
```
#pragma warning(disable:4005)
#undef _HAS_ITERATOR_DEBUGGING
#pragma warning(default:4005)
```
- Since ecuda is not actively developed on Windows, please report any issues or workarounds!
BENCHMARKS AND EXAMPLES:
- The benchmarks/, test/ and t/ directories contain programs that were useful for development. They might be useful examples to see how ecuda can be used. Again, these were used during API development so they can be quite ugly and full of hacks.
- Each subdirectory contains a CMakeLists.txt file so building them should be easy if your system is properly set up. For example, to build the benchmarks/ folder, the following could be used:

```
$ mkdir -p bin/benchmarks
$ cd bin/benchmarks
$ cmake ../../benchmarks
$ make
```
- Note that a file called local-config.cmake can be created in the release root directory that contains any system-specific CMake directives (e.g. nvcc compiler flags). The local-config.cmake.example file is an example of how this file might look.
FILE DESCRIPTIONS:
| File/Directory | Description |
|---|---|
| benchmarks/ | Programs that compare CUDA and ecuda performance. |
| docs/ | Additional elements for building docs with doxygen. |
| include/ | The ecuda API header files. |
| t/ | Catch unit tests. |
| test/ | Programs to loosely test elements of the API. |
| tools/ | Utilities that utilize ecuda. |
| CMakeLists.txt | CMake configuration file |
| doxygen.cfg | doxygen configuration file |
| ecuda.config | Qt Creator project file |
| ecuda.creator | Qt Creator project file |
| ecuda.files | Qt Creator project file |
| ecuda.includes | Qt Creator project file |
| ecuda.sln | Visual Studio 2013 solution file |
| local-config.cmake.example | Example file with additional CMake directives |
| .gitignore | local files to omit from version control |
| LICENSE.txt | release license |
| MANIFEST | list of files under version control |
| README | this file |
| VERSION | current version of the API |
ecuda was written to be light-weight, intuitive and to follow the STL specification. Code should naturally follow modern C++ programming paradigms (e.g. RAII/SBRM, smart pointers). This can prevent many issues that arise from using the CUDA C API. That said, there are a few key, non-obvious concepts that you should know before using ecuda.
When passing base containers to a kernel function, declare the type as Container::kernel_argument or Container::const_kernel_argument. This is not necessary for other container constructs. This should be done even in later versions of CUDA that support pass-by-reference, since one of the features of the kernel_argument subclass is that it strips away the reference-counting smart pointer from the container, sparing some registers.
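A minimal sketch of the pattern (the header path, kernel body, and launch configuration are illustrative, not taken from the library's own examples):

```cpp
#include <ecuda/ecuda.hpp> // header path assumed

// the kernel declares the parameter as Container::kernel_argument
__global__ void doubleValues( typename ecuda::vector<double>::kernel_argument v )
{
    const std::size_t t = blockIdx.x*blockDim.x + threadIdx.x;
    if( t < v.size() ) v[t] *= 2.0;
}

// host side: the container converts to its kernel_argument form at the call site
ecuda::vector<double> v( 1000 );
doubleValues<<<4,256>>>( v );
```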
Containers implement both operator[] and operator(). The latter should be used when directly accessing a single element in a multidimensional container. For linear containers like ecuda::array and ecuda::vector the operators are equivalent. The at() method can also be used, but, as with std:: containers, it performs bounds checking, which is more expensive.
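For example, with an ecuda::matrix (a sketch; the at() signature is assumed to mirror operator()):

```cpp
ecuda::matrix<double> m( 100, 100 );

double a = m(10,20);    // preferred: direct single-element access
double b = m[10][20];   // equivalent result, but goes through an intermediate row proxy
double c = m.at(10,20); // bounds-checked, and therefore more expensive
```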
Be mindful that host copy commands that involve device memory require that memory to be contiguous. A compile-time assertion will be raised if you forget.
This rule doesn't apply in device code.
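For example (a sketch; ecuda::copy is assumed to mirror std::copy):

```cpp
ecuda::matrix<double> deviceMatrix( 100, 100 );
std::vector<double> hostVec( 100 );

// fine: a matrix row is contiguous device memory
ecuda::copy( deviceMatrix[0].begin(), deviceMatrix[0].end(), hostVec.begin() );

// a column is NOT contiguous, so an equivalent host-side copy from a column
// would trigger the compile-time assertion; inside a kernel it is allowed
```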
| Macro | Description |
|---|---|
| CUDA_CALL(...) | execute ... and if cudaSuccess is not returned, throw an ecuda::cuda_error |
| CUDA_CHECK_ERRORS() | check for a current CUDA error with cudaGetLastError and throw an ecuda::cuda_error if present |
| CUDA_CALL_KERNEL_AND_WAIT(...) | combines the above two macros with an additional call to cudaDeviceSynchronize() that waits for the kernel ... to complete |
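A typical usage sketch (the kernel name and arguments are illustrative):

```cpp
CUDA_CALL( cudaSetDevice(0) ); // throws ecuda::cuda_error on failure

myKernel<<<grid,block>>>( args );
CUDA_CHECK_ERRORS(); // throws if the launch left an error state

// or launch and block until completion in one step:
CUDA_CALL_KERNEL_AND_WAIT( myKernel<<<grid,block>>>( args ) );
```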
ecuda will behave appropriately if code is compiled for host-only execution. This is useful for prototyping. You'll have to use the __CUDACC__ define in a few places, however. Here's an example:
If compiled with just the C++ compiler (e.g. g++), the resulting program will run as expected without the GPU.
Specification is identical to the C++11 std::array. It is more efficient when the sequence size is known at compile time. However, ecuda::array doesn't require C++11.
This example requires CUDA >= 7.0 and C++11 support since it uses <array>.
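A sketch of host/device interchange with std::array (ecuda::copy is assumed to mirror std::copy):

```cpp
#include <array>

std::array<double,3> hostArray = { 1.0, 2.0, 3.0 };
ecuda::array<double,3> deviceArray;

ecuda::copy( hostArray.begin(), hostArray.end(), deviceArray.begin() );  // host -> device
ecuda::copy( deviceArray.begin(), deviceArray.end(), hostArray.begin() ); // device -> host
```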
Specification is identical to std::vector. It will automatically grow in size to accommodate new data (e.g. ecuda::vector::insert, ecuda::vector::assign, ecuda::vector::resize).
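For example (a sketch of the std::vector-like growth operations named above):

```cpp
ecuda::vector<int> v( 10 );  // 10 elements on the device
v.resize( 20 );              // grows to accommodate new data

std::vector<int> host( 5, 99 );
v.insert( v.end(), host.begin(), host.end() ); // append host data; size is now 25
v.assign( host.begin(), host.end() );          // replace contents; size is now 5
```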
A logical extension of an STL container to two dimensions. Memory is column-wise contiguous (i.e. (0,1) is followed by (0,2)). Separate threads should ideally access different columns for best memory coalescing. Utilizes memory allocation that is hardware aligned, so memory coalescing is more consistent. Rows and columns can be accessed and will have the same functionality as ecuda::vector.
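A sketch (the constructor's row/column argument order and the row_type typedef are assumptions):

```cpp
ecuda::matrix<float> m( 100, 200 );

float a = m(0,1); // element at row 0, column 1; (0,2) is adjacent in device memory

// a single row can be used like an ecuda::vector:
ecuda::matrix<float>::row_type row = m[0];
float b = row[5];
```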
A logical extension of the ecuda::matrix to three dimensions. Memory is depth-wise contiguous (i.e. (0,1,2) is followed by (0,1,3)). Separate threads should ideally access different depths (and then different columns) for best memory coalescing. Utilizes memory allocation that is hardware aligned, so memory coalescing is more consistent. XY, XZ, and YZ slices can be accessed and will have the same functionality as ecuda::matrix. Individual rows, columns, and depths can be accessed and will have the same functionality as ecuda::vector.
ecuda uses custom STL allocators to handle device memory allocation (ecuda::device_allocator and ecuda::device_pitch_allocator). One normally doesn't need to worry about these.
However, the ecuda::host_allocator can be useful when allocating memory used as "staging" for transfers between the host and device (see the CUDA API cudaHostAlloc for more discussion). You can specify this as an allocator type when a host container is being used in this way.
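For example, a pinned-memory staging buffer might look like (a sketch; ecuda::copy is assumed to mirror std::copy):

```cpp
const std::size_t n = 10000;

// page-locked ("pinned") host memory, suitable for fast host<->device transfers
std::vector< double, ecuda::host_allocator<double> > staging( n );
ecuda::vector<double> deviceVec( n );

// ... fill staging on the host, then transfer in one shot ...
ecuda::copy( staging.begin(), staging.end(), deviceVec.begin() );
```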
ecuda will catch any errors that arise from calls to the CUDA C API and throw an ecuda::cuda_error exception. This will also occur if a kernel crashes.
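Calls can therefore be wrapped in an ordinary try/catch block, for example:

```cpp
try {
    ecuda::vector<double> v( 1000000 ); // device allocation can fail
    // ... launch kernels, copy data ...
} catch( const ecuda::cuda_error& ex ) {
    std::cerr << "CUDA error: " << ex.what() << std::endl;
}
```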
A wrapper around CUDA event objects called ecuda::event makes these more C++-like.
At this point, this section is really an "anticipated" FAQ.
The Thrust library is officially supported by NVIDIA and is similar in that it makes CUDA more C++ friendly. However, the emphasis is quite different in that it aims to parallelize common algorithms like sort. It also features only two containers: thrust::host_vector and thrust::device_vector.
ecuda is focused more on the data structures themselves, making them easier to manipulate in device code and providing an intuitive relationship between device and host memory (and code).
Whether you use ecuda or Thrust (or both) depends on the focus of your project.
Good task for Thrust: sorting a very large sequence of values on the GPU.
Good task for ecuda: passing structured data (e.g. a matrix) to and from a custom kernel.
Ideally none, although ecuda does rely on the compiler to help here. In some cases, additional overhead is not easily avoidable. For example:
will never beat:
in the case where it is known that both matrices have the same dimension. This has never been an issue for me in practice.
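The two approaches being compared might look like this (a sketch; ecuda::copy is assumed to mirror std::copy, and the raw call assumes both allocations are contiguous and equally sized):

```cpp
ecuda::matrix<double> src( 100, 100 ), dest( 100, 100 );

// safe STL-style copy: validates what it can, at some potential cost
ecuda::copy( src.begin(), src.end(), dest.begin() );

// raw device-to-device copy: minimal overhead, no safety
cudaMemcpy( destPtr, srcPtr, 100*100*sizeof(double), cudaMemcpyDeviceToDevice );
```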
None where it matters, some where it shouldn't. In cases where ecuda code takes longer to execute, it is probably worth it in terms of safety and consistency. For example:
is slower than:
but only because the former will always initialize the contents (e.g. with a cudaMemset() call where it makes sense).
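That is, roughly (a sketch):

```cpp
// ecuda: allocates and initializes the contents
ecuda::vector<double> v( 1000 );

// raw CUDA: allocates only; contents are left uninitialized
double* p;
cudaMalloc( &p, 1000*sizeof(double) );
```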
Within a kernel function, typical data access and manipulation will run identically. For example:
will run just as fast as:
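For example, these two kernels should compile to essentially the same element access (a sketch):

```cpp
// ecuda container access
__global__ void scale( typename ecuda::vector<double>::kernel_argument v )
{
    const std::size_t t = blockIdx.x*blockDim.x + threadIdx.x;
    if( t < v.size() ) v[t] *= 2.0;
}

// raw pointer access
__global__ void scaleRaw( double* p, std::size_t n )
{
    const std::size_t t = blockIdx.x*blockDim.x + threadIdx.x;
    if( t < n ) p[t] *= 2.0;
}
```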
The library has been tested and compiles successfully with CUDA versions 5.0, 5.5, 6.0, and 7.0 in combination with GCC 4.8.1 and 4.8.4. CUDA 6.0 and 7.0 with GCC 4.8.2 or Clang 3.5 and CUDA 7.5 with GCC 4.8.4 also compiled successfully but no example programs were tested. CUDA <5.0 is not supported (specifically, CUDA 3.2, 4.0, 4.1, and 4.2 were tested and did not respect the preprocessor directives in __host__/__device__ methods that create a host-specific and device-specific implementation).
Some very cursory tests with Windows 10, Visual Studio 2013, and CUDA 7.5 were also done successfully. Unlike with Linux, ecuda has not been used in a production setting on Windows, however.
I've been developing and using ecuda in a production setting performing scientific computing that is heavily focused on statistics and information theory. All of the problems with version 1.0 have been addressed in this release and I'm fairly confident in its robustness at this point.
A fixed-size matrix and cube (as std::array is to std::vector) could potentially be useful. I'll likely add it when I actually need it.
Hopefully, any future work will be confined to bug fixes or addressing user difficulties.
The entire API was refined based on lessons learned. Broadly, the changes were:
The ecuda library is open source and released under the FreeBSD license.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project.
Scott Zuyderduyn, Ph.D.
Postdoctoral Research Fellow
Bader Lab
The University of Toronto
Email: scott.zuyderduyn *at* utoronto.ca
The resources and expertise of the SciNet supercomputing centre at The University of Toronto which is home to several GPU clusters. I used these extensively for my own scientific research (which spawned the creation of this library).
The support of the Bader Lab, part of the Donnelly Centre for Cellular and Biomolecular Research at The University of Toronto, where I am currently a postdoctoral fellow.