Abstract: The Lattice Boltzmann Method (LBM) was one of the first simulation models to run successfully on Graphics Processing Units (GPUs). Earlier LBM implementations written in the NVIDIA Compute Unified Device Architecture programming language required two steps (collision-propagation and exchange) to maximize memory bandwidth. This article presents a parallel single-step implementation of the Lattice Boltzmann Method with fully coalesced memory access using shared memory. The resulting code, running on a low-cost personal computer GPU, processed more than 800 million cell updates per second, approaching the performance of High Performance Computing clusters. Substantial reductions in calculation time were achieved: the GPU ran up to 240 times faster than a CPU executing the same model. The code was tested on the numerical calculation of the flow in a two-dimensional channel with a sudden expansion. The accuracy of the results was validated against a proven finite-element solver.
INTRODUCTION
One of the novelties in recent parallel computing technology is the Graphics Processing Unit (GPU) (Anderson et al., 2008). Essentially, a GPU is the chip used by graphics cards to render pixels on the screen. Modern GPUs are optimized for executing a single instruction simultaneously over each element of a large set (SIMT, Single Instruction Multiple Thread). A GPU can execute a large number of threads in parallel, operating as a co-processor of the host CPU. In this way, a fraction of the computationally intensive operations is streamed to the GPU through functions executed as multiple parallel threads. The GPU has its own RAM, and data can be copied to and from the CPU memory by means of optimized direct access methods (NVIDIA, 2010). Recently, several studies have shown that running Cellular Automata (CA) on GPUs is a valid approach to simulating fluids (Goodnight, 2007; Tolke, 2007; Zhao, 2007; Kuznik et al., 2010).
The Lattice Boltzmann Method (LBM) consists of two independent steps named collision and propagation (or advection). The number of data transfers can be reduced by executing both steps in the same loop (Wellein et al., 2006). In most high-performance CPU implementations (Wellein et al., 2006; Zhao, 2007; Petkov et al., 2009; Guo et al., 2009), propagation is performed as the last step of the iteration loop, resulting in non-local write operations (data scatter). However, most GPU implementations of LBM based on this scheme do not take full advantage of the grouped memory calls provided by the NVIDIA Compute Unified Device Architecture (CUDA) (NVIDIA, 2010). More efficient CUDA LBM implementations use a single loop for collision and propagation (Tolke, 2007; Kuznik et al., 2010), but the way memory coalescence is achieved requires a second loop to complete the LBM calculations, which in turn reduces the overall performance.
In this paper, a more efficient, single-loop CUDA implementation is presented. A pull-scheme LBM is developed using a newer CUDA version with fewer memory alignment restrictions. Flow simulations were run on an NVIDIA GeForce GPU, achieving high performance and nearly 60% of the theoretical hardware bandwidth.
MATERIALS AND METHODS
The NVIDIA GeForce GPU: For the present study, carried out in 2010-2011, two different NVIDIA GeForce graphics cards were used: the 8800 GT from the G80 series and the GTX 260 from the GTX200 series. These GPUs were developed by NVIDIA together with the CUDA programming model (NVIDIA, 2008). An NVIDIA GPU is composed of a set of multiprocessors with SIMD architecture, as shown in Fig. 1. In every clock cycle, each processor within a multiprocessor executes the same instruction simultaneously over different data. Table 1 details the main characteristics of the GPUs.
CUDA is a technology developed by NVIDIA to give compute-intensive applications access to the processing capabilities of GPUs through a programming interface. The GPU, called the device, works as a co-processor of the main CPU, called the host. The host and the device each keep their own RAM.
The BGK Lattice Boltzmann scheme: LBM is an explicit streaming-collision iterative scheme defined on a regular grid. It is essentially a mesoscopic kinetic model with discrete internal velocities, supported on regular discrete time and space domains. In the continuum limit, its macroscopic properties approximate a set of transport equations to second order (Chen and Doolen, 1998).
Table 1: GPU hardware specifications (NVIDIA, 2008)
Fig. 1: SIMD multiprocessors with shared memory inside the G80 NVIDIA GPU (NVIDIA, 2008)
Fig. 2: Space of discrete velocities ei in LBM-D2Q9, corresponding to the population functions fi
In the present study, the BGK variant of LBM is used, which solves the Navier-Stokes equations in two dimensions (Bhatnagar et al., 1954; Chen et al., 1991; Qian et al., 1992). Defined on a regular grid of square cells, the evolution rule is given by:
$$f_i(\mathbf{x}+\mathbf{e}_i\,\Delta t,\; t+\Delta t) \;=\; f_i(\mathbf{x},t) \;-\; \frac{1}{\tau}\left[f_i(\mathbf{x},t)-f_i^{(eq)}(\mathbf{x},t)\right] \qquad (1)$$
where fi(x, t) is the population function at (x, y, t) of particles having velocity ei and τ is the relaxation time. In two dimensions, only 9 velocities are allowed (Fig. 2). The following equilibrium function fi(eq)(x, t) ensures that the Navier-Stokes equations are recovered to second order (Chen and Doolen, 1998):
$$f_i^{(eq)}(\mathbf{x},t) \;=\; w_i\,\rho\left[1 + 3\,\frac{\mathbf{e}_i\cdot\mathbf{u}}{c^2} + \frac{9}{2}\,\frac{(\mathbf{e}_i\cdot\mathbf{u})^2}{c^4} - \frac{3}{2}\,\frac{\mathbf{u}\cdot\mathbf{u}}{c^2}\right] \qquad (2)$$
Where:
$$w_0 = \frac{4}{9}, \qquad w_i = \frac{1}{9} \;\;(i=1,\dots,4), \qquad w_i = \frac{1}{36} \;\;(i=5,\dots,8) \qquad (3)$$
and:
$$\rho = \sum_{i=0}^{8} f_i, \qquad \rho\,\mathbf{u} = \sum_{i=0}^{8} f_i\,\mathbf{e}_i \qquad (4)$$
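The equilibrium function and the macroscopic moments can be sketched as follows. This is a minimal NumPy illustration of the D2Q9 scheme; the direction ordering and function names are our own convention, not necessarily the paper's, and lattice units with c = 1 are assumed.

```python
import numpy as np

# D2Q9 lattice: discrete velocities e_i and weights w_i.
# Ordering (rest, axis, diagonal) is one common convention.
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36])

def equilibrium(rho, u):
    """Equilibrium populations: w_i*rho*[1 + 3 e.u + 9/2 (e.u)^2 - 3/2 u.u]."""
    eu = E @ u  # e_i . u for every direction
    return W * rho * (1.0 + 3.0*eu + 4.5*eu**2 - 1.5*(u @ u))

def moments(f):
    """Density and velocity recovered from the populations."""
    rho = f.sum()
    u = (f[:, None] * E).sum(axis=0) / rho
    return rho, u
```

A useful sanity check is that the equilibrium populations reproduce the density and velocity they were built from, which the D2Q9 weights guarantee exactly.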
A second-order-accurate bounce-back condition was imposed on the closed boundaries of the grid to simulate solid walls, and special velocity-pressure conditions were imposed on the inlet and outlet boundaries to enforce a parabolic profile (Zou and He, 1997).
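The core of a bounce-back wall is simply reflecting each population into its opposite direction. A minimal sketch, assuming the D2Q9 ordering 0:(0,0), 1:(1,0), 2:(0,1), 3:(-1,0), 4:(0,-1), 5:(1,1), 6:(-1,1), 7:(-1,-1), 8:(1,-1); adapt the table if your ordering differs.

```python
# Index of the opposite direction for each D2Q9 direction
OPP = [0, 3, 4, 1, 2, 7, 8, 5, 6]

def bounce_back(f_node):
    """Reflect every population at a solid node into its opposite direction."""
    return [f_node[OPP[i]] for i in range(9)]
```

Reflecting twice must restore the original state, which makes the rule easy to verify.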
CUDA LBM implementation: To maximize performance, it is convenient for the size of the data structures to be a multiple of 16 (NVIDIA, 2010). Accordingly, a rectangular 32x96 grid of square cells was represented by means of 18 floating-point arrays: nine arrays hold each fi at time t and the other nine hold each fi at t+1. These arrays are created on the CPU and afterwards copied to the GPU. The geometry, including the boundary conditions, is specified by an integer array containing the cell types detailed in Table 2.
In every update, values are read from one set of arrays and written to the other, with the propagation folded into the read. Propagation is performed first in each iteration loop, resulting in a pull scheme for the update process (Wellein et al., 2006).
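The pull scheme can be illustrated with a short NumPy sketch: each cell *reads* the population fi of its upstream neighbour at x - ei instead of pushing its own values downstream. Periodic wrap-around (np.roll) stands in for the paper's boundary handling just to keep the example short; the direction ordering is an assumed convention.

```python
import numpy as np

# D2Q9 velocities; ordering is an assumed convention
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]

def pull_stream(f_src):
    """Pull-scheme propagation: gather f_i from the upstream neighbour."""
    f_dst = np.empty_like(f_src)
    for i, (ex, ey) in enumerate(E):
        # the value arriving at (x, y) left the neighbour at (x-ex, y-ey)
        f_dst[i] = np.roll(f_src[i], shift=(ex, ey), axis=(0, 1))
    return f_dst
```

Because every destination cell is written exactly once from gathered reads, this is the access pattern that maps onto coalesced global-memory writes in the CUDA kernel.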
CUDA enables grouping threads into blocks that communicate efficiently by sharing data through fast shared memory and by coordinating data access with synchronization points.
Table 2: Cell types
To maximize the hardware potential, each block should contain at least 64 and no more than 512 threads (Tolke, 2007). Threads from different blocks cannot communicate synchronously and safely with each other. Nevertheless, all blocks executing the same code can be grouped into grids, maximizing the number of threads triggered by a single call. To take full advantage of the hardware, the number of blocks in a single grid should exceed 16 (Tolke, 2007). Each thread has its own local memory, each block has memory shared by its own threads and all threads can access the global GPU memory. Accessing data in shared memory takes approximately 4 clock cycles, whereas accessing global GPU memory takes 400 to 600 cycles. The thread scheduler can execute independent arithmetic instructions during the waiting time. Therefore, it is convenient to trigger several thread blocks simultaneously, so that when one block is waiting for data, another block takes its place on the multiprocessor.
Shared memory can be used for thread communication or, as in this implementation, as a user-managed cache to improve global memory bandwidth (NVIDIA, 2010). The effective memory bandwidth also depends substantially on the access pattern. Since the global memory is slow and not cached, accesses to it should be minimized by copying data to local or shared memory for later processing. Provided that threads from the same block access the GPU memory simultaneously at aligned addresses, groups of 16 threads can share a single access. This mechanism is called coalescence and can substantially increase memory bandwidth. In grids ordered by rows, coalescence is ensured if neighbouring threads sweep the grid by columns. For GPU devices of compute capability 1.0 and 1.1 (G80), threads need to access words in sequence and coalescing only happens if the half-warp addresses a single segment (NVIDIA, 2010). In turn, for GPU devices of compute capability 1.2 and higher (GTX200), threads can access words in any order, including the same words, and a single memory transaction is issued for each segment. Tolke (2007) proposed executing the advection step in shared memory, avoiding unaligned accesses at the cost of an additional sweep. However, this procedure is no longer needed in later versions of CUDA. In the present study we show a single-step implementation that runs fully coalesced on GTX200 devices and also performs well on G80 devices, allowing certain unaligned accesses during the advection step. The update of a cell proceeds as follows:
• Advection: copy the states of the neighbour cells from global memory to shared memory, i.e., read fi(x-ei, t) for each direction i
• Synchronize
• Apply boundary conditions to complete the set of states if needed (shared memory)
• Calculate the macroscopic averages
• Calculate the collision step towards fi(eq) (shared memory)
• Synchronize
• Copy the new values back to global memory, i.e., write fi(x, t+1)
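The steps above can be sketched as one fused pass in NumPy. This is not the CUDA kernel itself but a serial model of its logic: pull advection, macroscopic averages, then BGK relaxation, all in a single update of a fresh array. Periodic boundaries stand in for the paper's wall/inlet/outlet cells, and the relaxation time value is an arbitrary example.

```python
import numpy as np

E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9]*4 + [1/36]*4)

def lbm_step(f_old, tau=0.8):
    """One fused pull-advection + BGK-collision pass over the whole grid."""
    # 1. advection (pull): gather f_i from the upstream neighbour x - e_i
    f = np.empty_like(f_old)
    for i, (ex, ey) in enumerate(E):
        f[i] = np.roll(f_old[i], shift=(ex, ey), axis=(0, 1))
    # 2. macroscopic averages
    rho = f.sum(axis=0)
    ux = (f * E[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * E[:, 1, None, None]).sum(axis=0) / rho
    # 3. BGK collision toward the local equilibrium
    usq = ux**2 + uy**2
    for i in range(9):
        eu = E[i, 0]*ux + E[i, 1]*uy
        feq = W[i] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*usq)
        f[i] -= (f[i] - feq) / tau
    return f
```

Two invariants make this easy to test: a uniform fluid at rest is a fixed point of the update, and the total mass is conserved.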
Since the grid used in this study was small (approximately 3,000 cells), all cells are updated in parallel, with each thread updating a single cell, generating 32 blocks of 64, 96 or 128 threads.
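The one-thread-per-cell launch configuration amounts to dividing the cell count by the block size. A trivial helper (function name and the even-divisibility assumption are ours) shows the arithmetic:

```python
def launch_config(nx, ny, threads_per_block):
    """Blocks and threads for a one-thread-per-cell kernel launch.
    Assumes the cell count divides evenly into blocks."""
    total = nx * ny
    assert total % threads_per_block == 0
    return total // threads_per_block, threads_per_block
```

For the 32x96 grid of the paper, 96-thread blocks give exactly 32 blocks, meeting the guideline of more than 16 blocks per grid.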
RESULTS
The LBM implemented on the GPU was applied to simulate the flow in a two-dimensional rectangular channel with a sudden expansion (Fig. 3). The channel length is 3 length units; the inlet and outlet widths are 0.5 and 1, respectively.
Fig. 3: Channel flow with a sudden expansion
Fig. 4(a-b): Contour map of the x-component of the velocity calculated with (a) Finite Elements and (b) CUDA LBM
The expansion is located at position 1 from the inlet. The inlet velocity profile is parabolic with a maximum central velocity of 0.3. At the exit the pressure is uniform and the velocity profile is forced parallel to the inlet velocity (i.e., uy = 0). The exit density is ρ = 1 and the viscosity is ν = 1. The corresponding Reynolds number is 100.
To check accuracy, the results were compared against the solution produced by a finite element (FE) simulator running on the CPU. This fluid solver is an equal-order FE model with subgrid-scale stabilization. The channel was discretized with a regular mesh of 1821 nodes. Figures 4 and 5 show the contour maps of the velocity components obtained with FEM and CUDA LBM. The velocity profile of the x component expands smoothly and the maximum velocity decreases as the flow develops. There is a transition region after the expansion where the y component of the velocity peaks (blue region in Fig. 5), corresponding to the flow development after the expansion. The yellow region in Fig. 5 indicates a recirculation pattern. The general trend is the same in both calculations.
Figure 6 compares the normalized horizontal velocity profiles at two different axial positions downstream of the expansion. It can be seen that there is good agreement between both solutions.
The measured execution time of the FE application was about 13.28 sec for a 100 sec simulation (7.5 times that of the CPU LBM and 100 times that of the CUDA LBM). It should be mentioned that the FE solver is a general-purpose simulation tool with a much wider range of applications and was not optimized for this particular problem. Its execution time is reported here only to show that it is not exceeded by the LBM simulations.
Fig. 5(a-b): Contour map of the y-component of the velocity calculated with (a) Finite Elements and (b) CUDA LBM
Fig. 6: Normalized profile of the horizontal component of the velocity at positions x = 1.1 (immediately after the expansion) and x = 1.5. Lattice Boltzmann over GPU (solid), Finite Elements (dashed)
In order to test the performance of the GPU implementation, a comparison was made against an implementation of the same LBM on the CPU, written in the C programming language. The same CPU, an Intel Core2 Quad 2.4 GHz with 8 GB RAM, was used in both cases. The performance of an LBM implementation can be measured as the mean time per iteration over the whole grid or as lattice-site updates per microsecond (MLUPS) (Lammers and Kuster, 2007). Table 3 compares the performance of each implementation for different platforms, scaling the problem to various domain grid sizes. Depending on the grid size and the parallelization scheme, GPU calculations can be up to 240 times faster than the CPU.
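For clarity, the MLUPS throughput metric used in Table 3 is simply total cell updates divided by elapsed time in microseconds; a one-line helper (names ours) makes the definition explicit:

```python
def mlups(nx, ny, iterations, elapsed_seconds):
    """Lattice-site updates per microsecond (MLUPS) for an nx-by-ny grid."""
    return nx * ny * iterations / (elapsed_seconds * 1e6)
```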
The peak performance achieved was 158 MLUPS using the GeForce 8800 GT and 891 MLUPS with the GeForce GTX 260. In contrast, Tolke's (2007) implementation of LBM on GPU achieved less than 1 MLUPS running a Multi-Relaxation-Time LBM. On the other hand, Mazzeo and Coveney (2008) achieved less than 6 MLUPS on a 1.7 GHz CPU. Only multicore supercomputers (e.g., the 8-core Hitachi SR8000-F1) with intra-node parallelization could reach 150 MLUPS (Pohl et al., 2004).
Table 3: Performance results averaged over 1000 iterations
The maximum memory bandwidth used in the simulations was estimated as the number of bytes per cell and time step read from and written to the GPU global memory, multiplied by the maximum MLUPS. At 891 MLUPS this gives 65.934 GB/sec for the GTX 260, which is about 60% of the theoretical memory bandwidth. In comparison, Kuznik et al. (2010) implemented LBM with a two-step approach on the more powerful NVIDIA GTX 280 GPU, reaching 947 MLUPS, which is less than 50% of that card's memory bandwidth.
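The bandwidth estimate above is a straightforward product. A sketch of the arithmetic follows; the per-cell byte count is implementation-dependent and assumed here: the 18 single-precision populations alone (9 read plus 9 written) account for 72 bytes per cell update, before any geometry data.

```python
def effective_bandwidth_gb_s(mlups, bytes_per_cell_update):
    """Device-memory traffic (GB/s) implied by a throughput in MLUPS.
    bytes_per_cell_update is an implementation-specific assumption."""
    return mlups * bytes_per_cell_update / 1000.0
```

With 72 bytes per update, 891 MLUPS corresponds to roughly 64 GB/s of global-memory traffic, in line with the ~60% of theoretical bandwidth reported.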
CONCLUSIONS
The performance of an NVIDIA GPU running a CUDA implementation of an LBM solver for the Navier-Stokes equations was studied. The results show that the GPU can speed up the calculation more than 240 times compared with the same model running on a CPU. The GPU achieved nearly 890 MLUPS, which is comparable with the performance of a multicore supercomputer. The numerical results were compared against a finite element solution, showing good agreement. It should be stressed that the key to speeding up LBM simulations on GPUs is a memory management strategy that optimizes the access pattern. The reverse (pull) advection-collision LBM scheme implemented in this work, using shared memory for local calculations, proved to be a step forward in this area. Although the case studied here is relatively simple, the results are encouraging, showing that the GPU is a powerful and economical tool for fluid simulation.