Extended CUDA Library (ecuda) 2.0
Extended CUDA Library (ecuda) Documentation

Overview

This can also be found in the README file:

ecuda Extended CUDA C++ API release 2.x

These are the release notes for ecuda version 2.

WHAT IS ECUDA?

ecuda is a C++ wrapper around the CUDA C API designed to closely resemble and be functionally equivalent to the C++ Standard Template Library (STL). Specifically: algorithms, containers, and iterators.

REQUIREMENTS

ecuda is a header-only API, and the only prerequisite library is the CUDA API version 5 or later. It should work with any C++ compiler, but has been developed and tested with several versions of gcc (most recently 4.8.4) and clang 3.6. The C++11 standard is optional, but is utilized if enabled. Visual Studio 2013 on Windows 10 was also successfully tested (see the INSTALLATION section below).

A correct setup should be able to compile the tools/print_device_info.cu program without issue. You can try:

$ mkdir bin
$ cd bin
$ cmake ../tools
$ make

to identify any issues. When run, the program prints out a pretty summary of the current system's GPU hardware and capabilities.

DOCUMENTATION:

  • Documentation can be viewed online:

    https://baderlab.github.io/ecuda/

  • This is generated from the source files themselves using doxygen. The base directory contains a default doxygen.cfg file that will build a local copy of the documentation in the docs/html subdirectory. Make sure you have doxygen installed and run:
$ doxygen doxygen.cfg

INSTALLATION:

Linux/MacOS:

  • As long as the include/ subdirectory is visible to the compiler, the API can be installed anywhere. A default install using CMake can be done by running:
$ cmake .
$ sudo make install

This will copy the contents of the include/ subdirectory to ${CMAKE_INSTALL_PREFIX}/include (usually /usr/local/include).

Windows/Visual Studio:

  • The latest free Visual Studio at the time of last update was Visual Studio Community 2015, but CUDA 7.5 does not support it at this time. I managed to get everything working with Visual Studio Community 2013 on Windows 10. Here is my story:
  • Download and install Visual Studio Community 2013 from:

    https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx

  • Download and install the Nvidia CUDA Toolkit from:

    http://developer.nvidia.com/cuda-downloads

    • The order is important since the CUDA installer integrates with any installed Visual Studio versions that it supports. Also note that in the successful configuration, only the following items in the CUDA installer's custom installation were left checked:

      • CUDA Toolkit 7.5
      • CUDA Visual Studio Integration 7.5

      The following items were already installed on the test system with equal or greater version numbers:

      • Graphics Driver
      • HD Audio Driver
      • NVIDIA GeForce Experience
      • PhysX System Software

      Do whatever makes the most sense for your situation.

  • Start Visual Studio and load the ecuda.sln solution file.
  • The print_device_info project contains a source file that should build successfully at this point. Build the Release target with the x64 platform, and bin/x64/Release/print_device_info.exe should appear. Running this from the Windows command line should display a pretty summary of the current system's GPU hardware and capabilities.
  • When building a Debug target, Visual Studio's C++ Standard Library implementation does some kind of "iterator checking" that doesn't play nice with ecuda's custom iterators, causing erroneous assertion failures to get raised at runtime. Placing this at the beginning of a program will turn this off (and suppresses a warning about macro redefinition):
#pragma warning(disable:4005)
#undef _HAS_ITERATOR_DEBUGGING
#pragma warning(default:4005)
  • Since ecuda is not actively developed on Windows, please report any issues or workarounds!

BENCHMARKS AND EXAMPLES:

  • The benchmarks/, test/ and t/ directories contain programs that were useful for development. They might be useful examples to see how ecuda can be used. Again, these were used during API development so they can be quite ugly and full of hacks.
  • Each subdirectory contains a CMakeLists.txt file, so building them should be easy if your system is properly set up. For example, to build the benchmarks/ folder, the following could be used:
$ mkdir -p bin/benchmarks
$ cd bin/benchmarks
$ cmake ../../benchmarks
$ make
  • Note that a file called local-config.cmake can be created in the release root directory that contains any system-specific CMake directives (e.g. nvcc compiler flags). The local-config.cmake.example file is an example of how this file might look.

FILE DESCRIPTIONS:

  benchmarks/                Programs that compare CUDA and ecuda performance.
  docs/                      Additional elements for building docs with doxygen.
  include/                   The ecuda API header files.
  t/                         Catch unit tests.
  test/                      Programs to loosely test elements of the API.
  tools/                     Utilities that utilize ecuda.
  CMakeLists.txt             CMake configuration file.
  doxygen.cfg                doxygen configuration file.
  ecuda.config               Qt Creator project file.
  ecuda.creator              Qt Creator project file.
  ecuda.files                Qt Creator project file.
  ecuda.includes             Qt Creator project file.
  ecuda.sln                  Visual Studio 2013 solution file.
  local-config.cmake.example Example file with additional CMake directives.
  .gitignore                 Local files to omit from version control.
  LICENSE.txt                Release license.
  MANIFEST                   List of files under version control.
  README                     This file.
  VERSION                    Current version of the API.

A Few Things You Should Know

ecuda was written to be lightweight, intuitive, and to follow the STL specification. Code should naturally follow modern C++ programming paradigms (e.g. RAII/SBRM, smart pointers). This can prevent many issues that arise from using the CUDA C API. That said, there are a few key, non-obvious concepts that you should know before using ecuda.
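
For example, device memory held by an ecuda container is released automatically when the container goes out of scope, in the spirit of RAII; a minimal sketch (the function name is hypothetical):

#include <ecuda/ecuda.hpp>
void doSomeWork()
{
    ecuda::vector<double> deviceVector( 1000 ); // device memory is allocated here
    // ... use deviceVector on the host or pass it to kernels
} // device memory is freed automatically when deviceVector goes out of scope; no explicit cudaFree() is needed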

Containers as Kernel Arguments

When passing base containers to a kernel function, declare the type as Container::kernel_argument or Container::const_kernel_argument.

__global__ void kernelFunction( typename ecuda::vector<double>::const_kernel_argument src, typename ecuda::vector<double>::kernel_argument dest );

This is not necessary for other container constructs.

__global__ void kernelFunction( typename ecuda::matrix<double>::const_row_type src, typename ecuda::matrix<double>::row_type dest );

This should be done even in later versions of CUDA that support pass-by-reference, since one of the features of the kernel_argument subclass is that it strips away the reference-counting smart pointer from the container, sparing some registers.

__global__ void kernelFunction( const ecuda::vector<double>& src, ecuda::vector<double>& dest ); // NO!
__global__ void kernelFunction( typename ecuda::vector<double>::const_kernel_argument& src, typename ecuda::vector<double>::kernel_argument& dest ); // OK :/
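
At the launch site the containers themselves are passed, and the conversion to kernel_argument/const_kernel_argument happens implicitly. A minimal sketch using the first kernelFunction signature above (grid and block are placeholder launch parameters):

ecuda::vector<double> srcVector( 1000 );
ecuda::vector<double> destVector( 1000 );
// ... fill srcVector
kernelFunction<<<grid,block>>>( srcVector, destVector ); // containers convert to (const_)kernel_argument here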

Direct Access Operators

Containers implement both operator[] and operator(). The latter should be used when directly accessing a single element of a multidimensional container. For linear containers like ecuda::array and ecuda::vector the operators are equivalent. The at() method can also be used; however, as with std:: containers, bounds checking is performed, which is more expensive.

__global__ void kernelFunction( typename ecuda::matrix<double>::kernel_argument mat )
{
    // ...
    std::size_t x, y;
    // ...
    double& val1 = mat[x][y];   // slow
    double& val2 = mat.at(x,y); // faster; kernel terminates and throws ecuda::cuda_error if x or y is out of bounds
    double& val3 = mat(x,y);    // fastest
    // ...
}

Memory Contiguity

Be mindful that host copy commands that involve device memory require that memory to be contiguous. A compile-time assertion will be raised if you forget.

ecuda::matrix<double> deviceMatrix( 100, 100 );
std::vector<double> vec( 100 );
ecuda::copy( vec.begin(), vec.end(), deviceMatrix.get_row(22).begin() ); // fine
ecuda::copy( vec.begin(), vec.end(), deviceMatrix.get_column(22).begin() ); // compile-time assertion CANNOT_USE_NONCONTIGUOUS_DEVICE_ITERATOR_AS_DESTINATION_FOR_COPY

This rule doesn't apply in device code.

__global__ void kernelFunction( typename ecuda::vector<double>::const_kernel_argument src, typename ecuda::matrix<double>::kernel_argument dest )
{
    const std::size_t t = threadIdx.x;
    ecuda::copy( src.begin(), src.end(), dest.get_row(t).begin() );    // fine
    ecuda::copy( src.begin(), src.end(), dest.get_column(t).begin() ); // also fine
}

Macros

Macro                           Description
CUDA_CALL(...)                  Executes ... and throws ecuda::cuda_error if cudaSuccess is not returned.
CUDA_CHECK_ERRORS()             Checks for a pending CUDA error with cudaGetLastError and throws ecuda::cuda_error if one is present.
CUDA_CALL_KERNEL_AND_WAIT(...)  Combines the above two macros, with an additional call to cudaDeviceSynchronize() that waits for the kernel ... to complete.
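
A sketch of how these might be used together (the kernel name, arguments, and launch configuration are placeholders):

CUDA_CALL( cudaSetDevice(0) ); // any CUDA runtime call can be wrapped; throws ecuda::cuda_error on failure
myKernel<<<grid,block>>>( args );
CUDA_CHECK_ERRORS();                  // surface any launch error as an ecuda::cuda_error
CUDA_CALL( cudaDeviceSynchronize() ); // wait for the kernel and surface any runtime error
// or, equivalently for the launch itself:
CUDA_CALL_KERNEL_AND_WAIT( myKernel<<<grid,block>>>( args ) );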

Host Emulation

ecuda will behave appropriately if code is compiled for host only execution. This is useful for prototyping. You'll have to use the __CUDACC__ define in a few places, however. Here's an example:

#include <vector>
#include <ecuda/ecuda.hpp>

const std::size_t N = 100000;

__global__
void reverseVector( typename ecuda::vector<double>::kernel_argument vec )
{
#ifdef __CUDACC__
    const std::size_t t = blockIdx.x*blockDim.x+threadIdx.x;
#else
    for( std::size_t t = 0; t < vec.size(); ++t )
#endif
    if( t < (vec.size()/2) ) {
        const std::size_t u = vec.size()-t-1;
        ecuda::swap( vec[t], vec[u] );
    }
}

int main( int argc, char* argv[] )
{
    const std::size_t THREADS = 512;
    ecuda::vector<double> deviceVector( N );
    std::vector<double> hostVector( N );
    // ... initialize host vector values
    ecuda::copy( hostVector.begin(), hostVector.end(), deviceVector.begin() );
#ifdef __CUDACC__
    CUDA_CALL_KERNEL_AND_WAIT( reverseVector<<<((N+THREADS-1)/THREADS),THREADS>>>( deviceVector ) );
#else
    reverseVector( deviceVector );
#endif
    ecuda::copy( deviceVector.begin(), deviceVector.end(), hostVector.begin() );
    // ... host vector now contains result
    return 0;
}

If compiled with just the C++ compiler (e.g. g++), the resulting program will run as expected without the GPU.
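
For example, assuming the program above is saved as program.cu and the ecuda include/ directory is on the compiler's include path, the two builds might look something like this (exact flags depend on your setup):

$ nvcc -o program_gpu program.cu          # CUDA build; kernels run on the GPU
$ g++ -x c++ -o program_host program.cu   # host-only build; __CUDACC__ is not defined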

Core Concepts and Examples

Arrays

Specification is identical to the C++11 std::array. It is more efficient than ecuda::vector when the sequence size is known at compile time. However, ecuda::array doesn't require C++11.

This example requires CUDA >= 7.0 and C++11 support since it uses <array>.

#include <array>
#include <ecuda/ecuda.hpp>

const std::size_t N = 100000;

__global__
void squareRootArray( typename ecuda::array<double,N>::kernel_argument arr )
{
    const std::size_t t = blockIdx.x*blockDim.x+threadIdx.x;
    if( t < arr.size() ) {
        double x = arr[t];
        x = sqrt(x);
        arr[t] = x;
    }
}

int main( int argc, char* argv[] )
{
    const std::size_t THREADS = 512;
    std::array<double,N> hostArray;
    ecuda::array<double,N> deviceArray;
    // ... initialize host array values
    ecuda::copy( hostArray.begin(), hostArray.end(), deviceArray.begin() );
    CUDA_CALL_KERNEL_AND_WAIT( squareRootArray<<<((N+THREADS-1)/THREADS),THREADS>>>( deviceArray ) );
    ecuda::copy( deviceArray.begin(), deviceArray.end(), hostArray.begin() );
    // ... host array now contains result
    return 0;
}

Vectors

Specification is identical to std::vector. It will automatically grow in size to accommodate new data (e.g. ecuda::vector::insert, ecuda::vector::assign, ecuda::vector::resize).

#include <vector>
#include <ecuda/ecuda.hpp>

const std::size_t N = 100000;

__global__
void reverseVector( typename ecuda::vector<double>::kernel_argument vec )
{
    const std::size_t t = blockIdx.x*blockDim.x+threadIdx.x;
    if( t < (vec.size()/2) ) {
        const std::size_t u = vec.size()-t-1;
        ecuda::swap( vec[t], vec[u] );
    }
}

int main( int argc, char* argv[] )
{
    const std::size_t THREADS = 512;
    ecuda::vector<double> deviceVector( N );
    std::vector<double> hostVector( N );
    // ... initialize host vector values
    ecuda::copy( hostVector.begin(), hostVector.end(), deviceVector.begin() );
    CUDA_CALL_KERNEL_AND_WAIT( reverseVector<<<((N+THREADS-1)/THREADS),THREADS>>>( deviceVector ) );
    ecuda::copy( deviceVector.begin(), deviceVector.end(), hostVector.begin() );
    // ... host vector now contains result
    return 0;
}
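
As noted above, an ecuda::vector grows to accommodate new data; a minimal host-side sketch, assuming the usual std::vector-style overloads of resize() and assign() (values are illustrative):

ecuda::vector<double> deviceVector( 10 ); // starts with 10 elements of device memory
deviceVector.resize( 1000 );              // grows to hold 1000 elements
deviceVector.assign( 5000, 1.0 );         // grows again to hold 5000 copies of 1.0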

Matrices

A logical extension of an STL container to two dimensions. Memory is column-wise contiguous (i.e. (0,1) is followed by (0,2)). Separate threads should ideally access different columns for best memory coalescing. Utilizes memory allocation that is hardware aligned, so memory coalescing is more consistent. Rows and columns can be accessed and will have the same functionality as ecuda::vector.

#include <algorithm>
#include <vector>
#include <ecuda/ecuda.hpp>

const std::size_t ROWS = 1000;
const std::size_t COLS = 1000;

__global__
void sumMatrixColumns(
    typename ecuda::matrix<double>::const_kernel_argument mat,
    typename ecuda::vector<double>::kernel_argument vec
)
{
    const std::size_t t = blockIdx.x*blockDim.x+threadIdx.x;
    if( t < mat.number_columns() ) {
        vec[t] = ecuda::accumulate( mat.get_column(t).begin(), mat.get_column(t).end(), static_cast<double>(0) );
    }
}

int main( int argc, char* argv[] )
{
    const std::size_t THREADS = 512;
    ecuda::matrix<double> deviceMatrix( ROWS, COLS );
    std::vector<double> hostVector( COLS );
    // ... initialize host vector values
    for( std::size_t i = 0; i < ROWS; ++i ) {
        std::random_shuffle( hostVector.begin(), hostVector.end() );
        ecuda::copy( hostVector.begin(), hostVector.end(), deviceMatrix[i].begin() );
    }
    ecuda::vector<double> deviceSums( COLS );
    CUDA_CALL_KERNEL_AND_WAIT( sumMatrixColumns<<<((COLS+THREADS-1)/THREADS),THREADS>>>( deviceMatrix, deviceSums ) );
    ecuda::copy( deviceSums.begin(), deviceSums.end(), hostVector.begin() );
    // ... host vector now contains result
    return 0;
}

Cubes

A logical extension of the ecuda::matrix to three dimensions. Memory is depth-wise contiguous (i.e. (0,1,2) is followed by (0,1,3)). Separate threads should ideally access different depths (and then different columns) for best memory coalescing. Utilizes memory allocation that is hardware aligned, so memory coalescing is more consistent. XY, XZ, and YZ slices can be accessed and will have the same functionality as ecuda::matrix. Individual rows, columns, and depths can be accessed and will have the same functionality as ecuda::vector.

#include <algorithm>
#include <vector>
#include <ecuda/ecuda.hpp>

const std::size_t ROWS = 100;
const std::size_t COLS = 100;
const std::size_t DEPS = 1024;

__global__
void sumMatrix(
    typename ecuda::cube<double>::const_kernel_argument cbe,
    typename ecuda::vector<double>::kernel_argument vec
)
{
    const std::size_t t = blockIdx.x*blockDim.x+threadIdx.x;
    if( t < cbe.number_depths() ) {
        vec[t] = ecuda::accumulate( cbe.get_xy(t).begin(), cbe.get_xy(t).end(), static_cast<double>(0) );
    }
}

int main( int argc, char* argv[] )
{
    const std::size_t THREADS = 512;
    ecuda::cube<double> deviceCube( ROWS, COLS, DEPS );
    for( std::size_t i = 0; i < ROWS; ++i ) {
        typename ecuda::cube<double>::slice_yz_type sliceYZ = deviceCube.get_yz(i);
        std::vector<double> hostMatrix( COLS*DEPS );
        // ... initialize host matrix
        ecuda::copy( hostMatrix.begin(), hostMatrix.end(), sliceYZ.begin() );
    }
    ecuda::vector<double> deviceSums( DEPS );
    CUDA_CALL_KERNEL_AND_WAIT( sumMatrix<<<((DEPS+THREADS-1)/THREADS),THREADS>>>( deviceCube, deviceSums ) );
    std::vector<double> hostSums( DEPS );
    ecuda::copy( deviceSums.begin(), deviceSums.end(), hostSums.begin() );
    // ... host vector now contains result
    return 0;
}

Allocators

ecuda uses custom STL allocators to handle device memory allocation (ecuda::device_allocator and ecuda::device_pitch_allocator). One normally doesn't need to worry about these.

However, the ecuda::host_allocator can be useful when allocating memory used as "staging" for transfers between the host and device (see the CUDA API cudaHostAlloc for more discussion). You can specify this as an allocator type when a host container is being used in this way.

ecuda::vector<double> deviceVector( 1000 );
// ... do stuff
std::vector< double, ecuda::host_allocator<double> > hostVector1( 1000 ); // underlying memory allocated using cudaHostAlloc
std::vector<double> hostVector2( 1000 ); // underlying memory allocated using standard "new"
ecuda::copy( deviceVector.begin(), deviceVector.end(), hostVector1.begin() ); // faster
ecuda::copy( deviceVector.begin(), deviceVector.end(), hostVector2.begin() ); // slower
std::copy( hostVector1.begin(), hostVector1.begin()+500, hostVector1.begin()+500 ); // slower
std::copy( hostVector2.begin(), hostVector2.begin()+500, hostVector2.begin()+500 ); // faster

Exceptions

ecuda will catch any errors that arise from calls to the CUDA C API and throw an ecuda::cuda_error exception. This will also occur if a kernel crashes.

try {
    CUDA_CALL_KERNEL_AND_WAIT( kernelFunction<<<grid,block>>>( ... ) );
} catch( ecuda::cuda_error& ex ) {
    std::cerr << "kernel failed: " << ex.what() << " (code: " << ex.get_error_code() << ")" << std::endl;
}

Events

ecuda::event is a wrapper around CUDA event objects that makes them more C++-like.

// record kernel execution time
ecuda::event start, stop;
start.record(); // record start time
myKernel<<<10,1000>>>( ... );
stop.record(); // record stop time
stop.synchronize(); // kernel execution is asynchronous, wait until it finishes
std::cerr << "EXECUTION TIME: " << ( stop-start ) << " milliseconds" << std::endl;

Frequently Asked Questions

At this point, this section is really an "anticipated" FAQ.

Why not Thrust?

The Thrust library is officially supported by NVidia and is similar in that it makes CUDA more C++ friendly. However, the emphasis is quite different in that it aims to parallelize common algorithms like sort. It also features only two containers: thrust::host_vector and thrust::device_vector.

ecuda is focused more on the data structures themselves, making them easier to manipulate in device code and providing an intuitive relationship between device and host memory (and code).

Whether you use ecuda or Thrust (or both) depends on the focus of your project.

Good task for Thrust:

// I have 1000 measurements from 1000 experiments, and I want to sort each experiment
for( std::size_t i = 0; i < 1000; ++i ) {
    thrust::device_vector<double> measurements( 1000 );
    // ... load measurements for i-th experiment
    thrust::sort( measurements.begin(), measurements.end() ); // parallelized!
}

Good task for ecuda:

// I have 1000 measurements from 1000 experiments, and I want to run my fancy stats on each experiment
ecuda::matrix<double> data( 1000, 1000 );
ecuda::vector<double> result( 1000 );
// ... load measurements
CUDA_CALL_KERNEL_AND_WAIT( runStatistics<<<1,1000>>>( data, result ) );
__global__ void runStatistics( const ecuda::matrix<double>::kernel_argument data, ecuda::vector<double>::kernel_argument result )
{
    const int t = threadIdx.x;
    if( t < data.number_columns() ) {
        double fancyStat;
        // ... calculate fancy stat
        result[t] = fancyStat;
    }
}

How much overhead?

Ideally none, although ecuda does pray that the compiler will help with this. In some cases, additional overhead is not easily avoidable. For example:

__global__ void kernelFunction( typename ecuda::matrix<int64_t>::kernel_argument mat1, typename ecuda::matrix<double>::kernel_argument mat2 );

will never beat:

__global__ void kernelFunction( int64_t* mat1, double* mat2, const size_t pitch, const size_t nr, const size_t nc );

in the case where it is known that both matrices have the same dimensions. This has never been an issue for me in practice.

How much of a performance penalty?

None where it matters, some where it shouldn't. In cases where ecuda code takes longer to execute, it is probably worth it in terms of safety and consistency. For example:

ecuda::matrix<double> mat( 100, 100 );
// ... set values

is slower than:

const size_t rows = 100;
const size_t cols = 100;
double* mat;
size_t pitch;
CUDA_CALL( cudaMallocPitch( (void**)&mat, &pitch, cols*sizeof(double), rows ) );
// ... set values

but only because the former will always initialize the contents (e.g. with a cudaMemset() call where it makes sense).
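
To make the comparison fair, the raw version would need its own initialization step, something like the following (assuming zero-initialization is what is wanted):

CUDA_CALL( cudaMemset2D( mat, pitch, 0, cols*sizeof(double), rows ) );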

Within a kernel function, typical data access and manipulation will run identically. For example:

__global__ void reverseSequence( typename ecuda::vector<double>::kernel_argument v )
{
    const size_t t = threadIdx.x;
    if( t < (v.size()/2) ) ecuda::swap( v[t], v[v.size()-t-1] );
}

will run just as fast as:

__global__ void reverseSequence( double* seq, const size_t len )
{
    const size_t t = threadIdx.x;
    if( t < (len/2) ) {
        const size_t u = len-t-1;
        double tmp = seq[t];
        seq[t] = seq[u];
        seq[u] = tmp;
    }
}

Compatibility

The library has been tested and compiles successfully with CUDA versions 5.0, 5.5, 6.0, and 7.0 in combination with GCC 4.8.1 and 4.8.4. CUDA 6.0 and 7.0 with GCC 4.8.2 or Clang 3.5 and CUDA 7.5 with GCC 4.8.4 also compiled successfully but no example programs were tested. CUDA <5.0 is not supported (specifically, CUDA 3.2, 4.0, 4.1, and 4.2 were tested and did not respect the preprocessor directives in __host__/__device__ methods that create a host-specific and device-specific implementation).

Some very cursory tests with Windows 10, Visual Studio 2013, and CUDA 7.5 were also done successfully. Unlike with Linux, ecuda has not been used in a production setting on Windows, however.

Future Work

I've been developing and using ecuda in a production setting performing scientific computing that is heavily focused on statistics and information theory. All of the problems with version 1.0 have been addressed in this release and I'm fairly confident in its robustness at this point.

A fixed-size matrix and cube (like std::vector is to std::array) could potentially be useful. I'll likely add it when I actually need it.

Hopefully, any future work will be confined to bug fixes or addressing user difficulties.

Changes from v.1.0

The entire API was refined based on lessons learned. Broadly, the changes were:

License

The ecuda library is open source and released under the FreeBSD license.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies,
either expressed or implied, of the FreeBSD Project.

Author

Scott Zuyderduyn, Ph.D.
Postdoctoral Research Fellow
Bader Lab
The University of Toronto

Email: scott.zuyderduyn *at* utoronto.ca

Acknowledgements

The resources and expertise of the SciNet supercomputing centre at The University of Toronto, which is home to several GPU clusters. I used these extensively for my own scientific research (which spawned the creation of this library).

The support of the Bader Lab, part of the Donnelly Centre for Cellular and Biomolecular Research at The University of Toronto, where I am currently a postdoctoral fellow.