This post demonstrates how to call the CUDA FFT routines (CUFFT) from a FORTRAN application, using the native CUDA interface and our bindings.

CUFFT usage

NVIDIA's CUFFT library follows the conventions of the FFTW library: an FFT is described by a plan, which is created, executed and then destroyed.
For example, executing a 2D FFT over a 256×256 data set involves the following steps.

General GPU steps:

  1. Select the GPU device to work with
  2. Allocate enough device memory to store data
  3. Transfer input data to device

FFT steps:

  1. Create FFT plan with specific dimensions
  2. Execute FFT on device with input and output parameters
  3. Destroy FFT plan

After computing steps:

  1. Copy results back to CPU memory (RAM)
  2. Release device memory

Let’s code

General GPU steps

There are two ways to select the device we want to work with: the CUDA driver interface and the CUDA runtime interface.

Selecting a device through the driver interface is a bit more complicated, but it offers more flexibility.

! Initialize CUDA, default flags
call cuInit(0)
! Get a reference to the 1st device in the system
! recognized by CUDA
call cuDeviceGet(idev, 0)
! Now, create a new context and bind it to the
! device we got before
call cuCtxCreate(ictx, 0, idev)

This code fragment covers step 1 of the general GPU steps: it selects the 1st device in the system as the one to work with.

Device memory is allocated with the CUDA function cuMemAlloc, which takes the size in bytes.
For example:

! Allocate memory for an array of inx * iny single-precision
! complex elements (two 4-byte reals per element)
call cuMemAlloc(iptr, inx * iny * 4 * 2)

This maps to step 2 of general GPU steps.
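The size passed to cuMemAlloc is a byte count, so it helps to check the arithmetic once. The following is a minimal standalone sketch, assuming the default COMPLEX kind is two 4-byte reals (the usual case for gfortran and ifort without kind-changing flags). The names bytes_per_elem, nbytes and probe are illustrative only and are not part of the bindings; the program is free-form, so it would be saved with a .f90 extension.

```fortran
program buffer_size
  implicit none
  ! Grid dimensions of the example in this post
  integer, parameter :: inx = 256, iny = 256
  ! Assumption: one COMPLEX element = two 4-byte reals
  integer, parameter :: bytes_per_elem = 4 * 2
  integer :: nbytes
  complex :: probe

  ! Total byte count, as passed to cuMemAlloc and the copy routines
  nbytes = inx * iny * bytes_per_elem

  print *, 'bytes per element (assumed):  ', bytes_per_elem
  ! storage_size returns bits, so divide by 8 to get bytes
  print *, 'bytes per element (compiler): ', storage_size(probe) / 8
  print *, 'buffer size in bytes:         ', nbytes
end program buffer_size
```

For the 256×256 example the buffer size comes to 524288 bytes. If the two "bytes per element" lines disagree, the factor 4 * 2 used in the calls would need adjusting.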

To copy data from CPU memory to the GPU (the device), we call cuMemcpyHtoD, which stands for Host->Device copy.

! Assume that data was declared as COMPLEX data(inx, iny)
call cuMemcpyHtoD(iptr, data, inx * iny * 4 * 2)

This maps to step 3 of general GPU steps.

With that, the data is prepared on the GPU and we are ready to run the FFT routine.

FFT steps

Using the CUFFT library is relatively easy, as the following example shows.

! Create the FFT plan. The dimensions of the FFT are
! specified at this stage, so the plan can be reused later.
! The last parameter selects the type of FFT to perform:
! Real->Complex, Complex->Real or Complex->Complex.
! Complex->Complex is hexadecimal 29 (0x29 in C), written
! here as decimal 41 because 0x29 is not valid FORTRAN.
! A named constant can be defined for this purpose.
call cufftPlan2d(iplan, inx, iny, 41)

This maps to step 1 of FFT steps, to create an FFT plan.
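The transform type passed as the last argument of cufftPlan2d is hexadecimal 29 (decimal 41), and the execute call takes -1 or 1 as the direction. As a sketch of how these magic numbers could be given names, the module below defines them as FORTRAN parameters. The numeric values are those of the cufftType enumeration and the direction macros in NVIDIA's cufft.h; the module name cufft_constants and the small driver program are hypothetical and do not come from the bindings.

```fortran
module cufft_constants
  implicit none
  ! Transform types (values from cufft.h)
  integer, parameter :: CUFFT_R2C = 42   ! hex 2A, Real->Complex
  integer, parameter :: CUFFT_C2R = 44   ! hex 2C, Complex->Real
  integer, parameter :: CUFFT_C2C = 41   ! hex 29, Complex->Complex
  ! Transform directions (values from cufft.h)
  integer, parameter :: CUFFT_FORWARD = -1
  integer, parameter :: CUFFT_INVERSE = 1
end module cufft_constants

program show_constants
  use cufft_constants
  implicit none
  ! Print the values that would replace the literals in the calls
  print *, 'CUFFT_C2C     =', CUFFT_C2C
  print *, 'CUFFT_FORWARD =', CUFFT_FORWARD
  print *, 'CUFFT_INVERSE =', CUFFT_INVERSE
end program show_constants
```

With such a module in scope, the plan could be created with CUFFT_C2C in place of the literal value, which makes the intent of the call easier to read.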

Once we have the plan, we can execute the requested FFT and get the results back:

! Execute the FFT according to our plan. Specifying
! iptr for both input and output means an in-place FFT.
! It is possible to store the results in a different buffer.
! The last value denotes the direction of the FFT, where
! -1 is forward and 1 is inverse.
call cufftExecC2C(iplan, iptr, iptr, -1)

This maps to step 2 of FFT steps.

Once the FFT has been executed and we have finished working with it, it is time to release the resources consumed by the FFT library.

! Destroy the FFT plan
call cufftDestroy(iplan)

Here we completed our FFT steps.

After computing steps

The GPU computations are now over, so we can copy the results back to CPU memory for further processing.

! Use the Device->Host function to copy the
! computed data from GPU to CPU.
call cuMemcpyDtoH(data, iptr, inx * iny * 4 * 2)

This maps to step 1 of after computing steps. After this copy command, the data computed by the GPU is available in the “data” array variable.
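One way to gain confidence in the values copied back is to compare them against a direct evaluation of the same transform on the CPU for a small grid. The sketch below is not part of the original code: it evaluates a forward (direction -1), unnormalized 2D DFT by brute force, assuming that convention matches the one used by cufftExecC2C. The program name dft_reference, the 4×4 grid and the impulse test input are all illustrative choices.

```fortran
program dft_reference
  implicit none
  integer, parameter :: nx = 4, ny = 4
  real, parameter :: twopi = 6.2831853
  complex :: a(nx, ny), ref(nx, ny)
  integer :: j, k, m, n
  real :: phase

  ! Test input: a single unit impulse at the first element
  a = (0.0, 0.0)
  a(1, 1) = (1.0, 0.0)

  ! Direct evaluation of the forward, unnormalized 2D DFT:
  ! ref(m,n) = sum over j,k of a(j,k) * exp(-i*2*pi*(j*m/nx + k*n/ny))
  ! with all indices counted from zero
  do n = 1, ny
    do m = 1, nx
      ref(m, n) = (0.0, 0.0)
      do k = 1, ny
        do j = 1, nx
          phase = -twopi * (real((j-1)*(m-1)) / nx + real((k-1)*(n-1)) / ny)
          ref(m, n) = ref(m, n) + a(j, k) * cmplx(cos(phase), sin(phase))
        end do
      end do
    end do
  end do

  ! An impulse at the origin transforms to a constant value of one
  print *, 'max deviation from 1:', maxval(abs(ref - (1.0, 0.0)))
end program dft_reference
```

To use it as a check, the impulse would be replaced by the same input that was sent to the GPU, and ref compared element by element with the array returned by cuMemcpyDtoH, allowing for single-precision rounding. The brute-force loop costs on the order of (nx*ny)^2 operations, so it is only practical for small test grids.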

Now we release the GPU resources used during our computation:

! Free the GPU memory we allocated previously
call cuMemFree(iptr)
! Destroy the CUDA context. This happens anyway
! when the process exits, but it is a good habit
! to do it explicitly.
call cuCtxDestroy(ictx)

That's it: the code is complete, and we have used the GPU to compute an FFT.

Final words

This example showed how to compute an FFT on the GPU using NVIDIA's CUDA framework. The FFT is a very important tool for many applications and scientific computations, and the GPU can significantly improve FFT performance compared to the CPU.


If you are using gfortran, g77, g95 or ifort under Linux, compile the above FORTRAN code by simply issuing the command:

gfortran fft.f cuda.o cufft.o -lcufft -lcuda

where gfortran can be replaced by the compiler of your choice. The cufft and cuda libraries (linked with -lcufft and -lcuda) come as part of the NVIDIA CUDA Toolkit release and driver, so they are present on any machine that has these installed. The files cuda.o and cufft.o contain the bridge code needed for FORTRAN to C communication.

10 Replies to “Using CUDA FFT from FORTRAN”

  1. Hi, nice tutorial, thanks.
    We are analysing how to port a CFD code that involves
    a lot of hepta-diagonal matrix solves (either with red-black
    solver or Stone’s SIP).
    Do you know about resources for doing those computations
    on GPUs?
    Thanks, Gabriel

  2. Hello,

    Great tutorial. I am new to CUDA and I am still in the process of learning how to use it. Do you have any test programs that will just take a 2D or 3D matrix, make a Fourier transform, multiply it with some filter in k space (like -k^2) and then take the inverse Fourier transform?

    It would make my process of using cufft and fortran on windows very easy.



  3. Useful tutorial but I cannot get -lcufft to find the /usr/local/cuda/lib64/ and related libraries installed by the cuda sdk toolkit etc. I have tried putting the literal path in the gfortran statement as in:
    gfortran testb.f95 cuda.o cufft.o -lcuda -l/usr/local/lib64/
    and variations on this theme to no avail

    any suggestions are welcome (p.s all the supplied cuda examples compiled with their appropriate makefiles work well but they are , of course all C and my electron diffraction simulation programs are in good ol fortran)


  4. Oops! made a typing error in my Sep 4 comment,

    meant to type:
    gfortran testb.f95 cuda.o cufft.o -lcuda -l/usr/local/cuda/lib64/

    testb.f95 is just your sample program with some test data from me.


  5. After more trials have found that the required form is :

    gfortran test.f cuda.o cufft.o -lcuda -L/usr/local/cuda/lib64 -lcufft -o test

    even though the library path has been set as per cuda instructions.


  6. Hello,

    Nice job. Finally managed to make it work. I have some questions.

    If I put the code to run, how many cores is it going to use on the video card? Is it going to use only one core, or is the transform parallelised so that it will use all available cores?

    If I want to make a convolution, would it be possible to transfer the data to the video card, make the transform, multiply by the convolution kernel, make the inverse transform and then transfer the data to the host?

    If I have the cygwin environment, are the *.o files supplied for Linux going to work in cygwin?


I’m using CUDA 2.1 and I can’t find cuda.o or cufft.o to compile a simple Fortran code to test FFT on GPU,
    can you help me?

    Fedele STABILE

Excuse me for my previous e-mail: I’ve found the object files in the package on your web-site.
    Now I’m trying to use it with a real test.

