
5. Accelerating LAMMPS performance

This section describes various methods for improving LAMMPS performance for different classes of problems running on different kinds of machines.

There are two thrusts to the discussion that follows. The first is to use code options that implement alternate algorithms which can speed up a simulation. The second is to use one of the several accelerator packages provided with LAMMPS, which contain code optimized for certain kinds of hardware, including multi-core CPUs, GPUs, and Intel Xeon Phi coprocessors.

The Benchmark page of the LAMMPS web site gives performance results for the various accelerator packages discussed in Section 5.2, for several of the standard LAMMPS benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms.



5.1 Measuring performance

Before trying to make your simulation run faster, you should understand how it currently performs and where the bottlenecks are.

The best way to do this is to run your actual system (the real number of atoms) for a modest number of timesteps (say 100 steps) on several different processor counts, including a single processor if possible. Do this for an equilibrated version of your system, so that the 100-step timings are representative of a much longer run. There is typically no need to run for 1000s of timesteps to get accurate timings; you can simply extrapolate from short runs.
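For example, a minimal sketch of such a scan, assuming an MPI build of LAMMPS named lmp_mpi and an input script in.myscript (both names are placeholders) whose final command is a short run such as "run 100":

mpirun -np 1 lmp_mpi -in in.myscript
mpirun -np 4 lmp_mpi -in in.myscript
mpirun -np 16 lmp_mpi -in in.myscript

Comparing the loop times reported at the end of each run gives the serial baseline and an estimate of parallel scaling.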

For the set of runs, look at the timing data printed to the screen and log file at the end of each LAMMPS run. The discussion of screen and log file output elsewhere in the manual gives an overview of what these timings mean.

Running on one (or a few) processors should give a good estimate of the serial performance and of which portions of the timestep take the most time. Running the same problem on a few different processor counts should give an estimate of parallel scalability, i.e. if the simulation runs 16x faster on 16 processors, it is 100% parallel efficient; if it runs 8x faster on 16 processors, it is 50% efficient.
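Stated as a formula (a restatement of the examples above), if T_1 is the loop time on one processor and T_P is the loop time on P processors, then

parallel efficiency = T_1 / (P * T_P) * 100%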

The most important data to look at in the timing info is the timing breakdown and relative percentages. For example, trying different options for speeding up the long-range solvers will have little impact if they only consume 10% of the run time. If the pairwise time is dominating, you may want to look at GPU or OMP versions of the pair style, as discussed below. Comparing how the percentages change as you increase the processor count gives you a sense of how different operations within the timestep are scaling. Note that if you are running with a Kspace solver, there is additional output on the breakdown of the Kspace time. For PPPM, this includes the fraction spent on FFTs, which can be communication intensive.

Another important detail in the timing info is the histograms of atom counts and neighbor counts. If these vary widely across processors, you have a load-imbalance issue. This often results in inaccurate relative timing data, because processors have to wait for slower processors to catch up when communication occurs. Thus the reported times for "Communication" or "Other" may be higher than they really are, due to load imbalance. If this is an issue, you can uncomment the MPI_Barrier() lines in src/timer.cpp and recompile LAMMPS to obtain synchronized timings.


5.2 General strategies

NOTE: this section 5.2 is still a work in progress

Here is a list of general ideas for improving simulation performance. Most of them are only applicable to certain models and certain bottlenecks in the current performance, so let the timing data you generate be your guide. It is hard, if not impossible, to predict how much difference these options will make, since it is a function of problem size, number of processors used, and your machine. There is no substitute for identifying performance bottlenecks, and trying out various options.

2-FFT PPPM, also called analytic differentiation or ad PPPM, uses 2 FFTs instead of the 4 FFTs used by the default ik differentiation PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to achieve the same accuracy as 4-FFT PPPM. For problems where the FFT cost is the performance bottleneck (typically large problems running on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.
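A minimal sketch of selecting 2-FFT PPPM in an input script; the cutoff and accuracy values are placeholders:

pair_style lj/cut/coul/long 10.0
kspace_style pppm 1.0e-4
kspace_modify diff ad      # analytic-differentiation (2-FFT) PPPM; "diff ik" restores the default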

Staggered PPPM performs calculations using two different meshes, one shifted slightly with respect to the other. This can reduce force aliasing errors and increase the accuracy of the method, but also doubles the amount of work required. For high relative accuracy, using staggered PPPM allows one to halve the mesh size in each dimension as compared to regular PPPM, which can give around a 4x speedup in the kspace time. However, for low relative accuracy, using staggered PPPM gives little benefit and can be up to 2x slower in the kspace time. For example, the rhodopsin benchmark was run on a single processor, and results for kspace time vs. relative accuracy for the different methods are shown in the figure below. For this system, staggered PPPM (using ik differentiation) becomes useful at relative accuracies slightly greater than 1e-5 and above.
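A minimal sketch of enabling staggered PPPM in an input script; the accuracy value is a placeholder:

kspace_style pppm/stagger 1.0e-5
kspace_modify diff ik      # staggered PPPM with ik differentiation, as used in the benchmark above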

IMPORTANT NOTE: Using staggered PPPM may not give the same increase in accuracy of energy and pressure as it does in forces, so some caution must be used if energy and/or pressure are quantities of interest, such as when using a barostat.


5.3 Packages with optimized styles

Accelerated versions of various pair styles, fixes, computes, and other commands have been added to LAMMPS, which typically run faster than the standard non-accelerated versions. Some require appropriate hardware to be present on your system, e.g. GPUs or Intel Xeon Phi coprocessors.

All of these commands are in packages provided with LAMMPS. An overview of packages is given in the packages section of the manual. These are the accelerator packages currently in LAMMPS, either as standard or user packages:

USER-CUDA for NVIDIA GPUs
GPU for NVIDIA GPUs as well as OpenCL support
USER-INTEL for Intel CPUs and Intel Xeon Phi
KOKKOS for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP for OpenMP threading
OPT generic CPU optimizations

Any accelerated style has the same name as the corresponding standard style, except that a suffix is appended. Otherwise, the syntax for the command that uses the style is identical, the functionality is the same, and the numerical results it produces should also be the same, except for precision and round-off effects.

For example, all of these styles are accelerated variants of the Lennard-Jones pair_style lj/cut: lj/cut/cuda, lj/cut/gpu, lj/cut/intel, lj/cut/kk, lj/cut/omp, and lj/cut/opt.

To see which accelerated styles are currently available, see the style listings in the commands section of the manual. The doc pages for individual commands (e.g. pair_style lj/cut or fix nve) also list any accelerated variants available for that style.
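As a minimal sketch, an accelerated variant can be requested either explicitly by name in the input script or via a suffix on the command line; the cutoff and script name are placeholders, and the USER-OMP package is assumed to be installed:

pair_style lj/cut/omp 2.5

or, equivalently, without editing the input script:

lmp_machine -sf omp -in in.script

The "-sf omp" switch appends the "omp" suffix to every style in the script that has an OMP variant.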

To use an accelerator package in LAMMPS, and one or more of the styles it provides, follow these general steps (a short example for the USER-OMP package is sketched after the list). Details vary from package to package and are explained in the individual accelerator doc pages, listed above:

build the accelerator library (only for the USER-CUDA and GPU packages)
install the accelerator package, e.g. make yes-opt, make yes-user-intel, etc
add compile/link flags to Makefile.machine in src/MAKE (only for the USER-INTEL, KOKKOS, USER-OMP packages)
re-build LAMMPS via make machine
run a LAMMPS simulation via lmp_machine < in.script
enable the accelerator package via the "-c on" and "-k on" command-line switches (only for the USER-CUDA and KOKKOS packages)
set any needed options for the package via the "-pk" command-line switch or the package command (only if defaults need to be changed)
use accelerated styles in your input script via the "-sf" command-line switch or the suffix command
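A minimal sketch of these steps for the USER-OMP package, assuming a conventional MPI build via src/MAKE/Makefile.mpi; the thread and processor counts are placeholders:

cd src
make yes-user-omp
make mpi
mpirun -np 4 lmp_mpi -sf omp -pk omp 2 -in in.script

No accelerator library is needed in this case; the only extra build requirement is an OpenMP compile/link flag (e.g. -fopenmp) in the Makefile being used.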

The first 4 steps can be done as a single command, using the src/Make.py tool. The Make.py tool is discussed in Section 2.4 of the manual, and its use is illustrated in the individual accelerator sections. Typically these steps only need to be done once, to create an executable that uses one or more accelerator packages.

The last 4 steps can all be done from the command-line when LAMMPS is launched, without changing your input script, as illustrated in the individual accelerator sections. Or you can add package and suffix commands to your input script.
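For example, a minimal sketch of the input-script alternative to the command-line switches, assuming the USER-OMP package is installed; the thread count and cutoff are placeholders:

package omp 4
suffix omp
pair_style lj/cut 2.5

This is equivalent to launching LAMMPS with the "-pk omp 4" and "-sf omp" command-line switches.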

IMPORTANT NOTE: With a few exceptions, you can build a single LAMMPS executable with all of its accelerator packages installed. Note that the USER-INTEL and KOKKOS packages require you to choose one of their options when building, i.e. CPU or Phi for USER-INTEL, and OpenMP, Cuda, or Phi for KOKKOS. Here are the exceptions; you cannot build a single executable with:

See the examples/accelerate/README and make.list files for sample Make.py commands that build LAMMPS with any or all of the accelerator packages. As an example, here is a command that builds with all the GPU-related packages installed (USER-CUDA, GPU, KOKKOS with Cuda), including settings to build the needed auxiliary USER-CUDA and GPU libraries for Kepler GPUs:

Make.py -j 16 -p omp gpu cuda kokkos -cc nvcc wrap=mpi -cuda mode=double arch=35 -gpu mode=double arch=35 -kokkos cuda arch=35 lib-all file mpi

The examples/accelerate directory also has input scripts that can be used with all of the accelerator packages. See its README file for details.

Likewise, the bench directory has FERMI, KEPLER, and PHI sub-directories with Make.py commands and input scripts for using all the accelerator packages on various machines. See the README files in those directories.

As mentioned above, the Benchmark page of the LAMMPS web site gives performance results for the various accelerator packages for several of the standard LAMMPS benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms.

The list of packages above gives a brief summary of what each provides. The individual accelerator doc pages explain in detail how to build LAMMPS with each package, how to run with it, and what performance to expect.


5.4 Comparison of various accelerator packages

NOTE: this section still needs to be re-worked with additional KOKKOS and USER-INTEL information.

This section compares and contrasts the various accelerator options, since there are multiple ways to perform OpenMP threading, run on GPUs, and run on Intel Xeon Phi coprocessors.

Three of these packages (USER-CUDA, GPU, and KOKKOS) can accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways.

As a consequence, for a particular simulation on specific hardware, one package may be faster than the others. We give guidelines below, but the best way to determine which package is faster for your input script is to try each of them on your machine. See the benchmarking section below for examples where this has been done.

Guidelines for using each package optimally:

Differences between the two packages:

Examples:

The LAMMPS distribution has two directories with sample input scripts for the GPU and USER-CUDA packages.

These contain input scripts for identical systems, so they can be used to benchmark the performance of both packages on your system.