The USER-INTEL package was developed by Mike Brown at Intel Corporation. It provides a capability to accelerate simulations by offloading neighbor list and non-bonded force calculations to Intel(R) Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package). Additionally, it supports running simulations in single, mixed, or double precision with vectorization, even if a coprocessor is not present, i.e. on an Intel(R) CPU. The same C++ code is used for both cases. When offloading to a coprocessor, the routine is run twice, once with an offload flag.
The USER-INTEL package can be used in tandem with the USER-OMP package. This is useful when offloading pair style computations to coprocessors, so that other styles not supported by the USER-INTEL package, e.g. bond, angle, dihedral, improper, and long-range electrostatics, can run simultaneously in threaded mode on the CPU cores. Since fewer MPI tasks than CPU cores will typically be invoked when running with coprocessors, this frees the extra CPU cores for useful computation.
If LAMMPS is built with both the USER-INTEL and USER-OMP packages installed, this mode of operation is made easier to use, because the "-suffix intel" command-line switch or the suffix intel command will both set a second-choice suffix to "omp", so that styles from the USER-OMP package will be used if available, after first testing if a style from the USER-INTEL package is available.
When using the USER-INTEL package, you must choose at build time whether you are building for CPU-only acceleration or for using the Xeon Phi in offload mode.
Here is a quick overview of how to use the USER-INTEL package for CPU-only acceleration:

specify these CCFLAGS in your Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, and (if building on the architecture you will run on) -xHost
specify -fopenmp with LINKFLAGS in your Makefile.machine
include the USER-INTEL package and build LAMMPS
specify how many OpenMP threads per MPI task to use
use USER-INTEL styles in your input script
Note that many of these settings can only be used with the Intel compiler, as discussed below.
Using the USER-INTEL package to offload work to the Intel(R) Xeon Phi(TM) coprocessor is the same except for these additional steps:

add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
add the flag -offload to LINKFLAGS in your Makefile.machine
specify how many coprocessors per node and how many coprocessor threads per MPI task to use
The latter two steps in the first case and the last step in the coprocessor case can be done using the "-pk intel" and "-sf intel" command-line switches respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the package intel or suffix intel commands respectively to your input script.
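For example, assuming an executable named lmp_machine and an input script named in.script (illustrative names only), this command line:

mpirun -np 4 lmp_machine -sf intel -pk intel 1 -in in.script

has the same effect as omitting the two switches and instead adding these lines near the top of in.script:

package intel 1
suffix intel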
Required hardware/software:
To use the offload option, you must have one or more Intel(R) Xeon Phi(TM) coprocessors.
Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in vectorization or give poor performance.
Use of an Intel C++ compiler is recommended, but not required (though g++ will not recognize some of the settings, so they cannot be used). The compiler must support the OpenMP interface.
Building LAMMPS with the USER-INTEL package:
You must choose at build time whether to build for CPU acceleration or to use the Xeon Phi in offload mode.
You can do either in one line, using the src/Make.py script, described in Section 2.4 of the manual. Type "Make.py -h" for help. If run from the src directory, these commands will create src/lmp_intel_cpu and lmp_intel_phi using src/MAKE/Makefile.mpi as the starting Makefile.machine:
Make.py -p intel omp -intel cpu -o intel_cpu -cc icc file mpi
Make.py -p intel omp -intel phi -o intel_phi -cc icc file mpi
Note that this assumes that your MPI and its mpicxx wrapper are using the Intel compiler. If they are not, you should leave off the "-cc icc" switch.
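For example, if mpicxx wraps a non-Intel compiler, the CPU build line above would become:

Make.py -p intel omp -intel cpu -o intel_cpu file mpi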
Or you can follow these steps:
cd lammps/src
make yes-user-intel
make yes-user-omp (if desired)
make machine
Note that if the USER-OMP package is also installed, you can use styles from both packages, as described below.
The Makefile.machine needs a "-fopenmp" flag for OpenMP support in both the CCFLAGS and LINKFLAGS variables. You also need to add -DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS.
If you are compiling on the same architecture that will be used for the runs, adding the flag -xHost to CCFLAGS will enable vectorization with the Intel(R) compiler.
In order to build with support for an Intel(R) Xeon Phi(TM) coprocessor, the flag -offload should be added to the LINKFLAGS line and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
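As an illustration of the flags just described (a sketch only; the optimization level and any machine-specific flags are placeholders, not taken from the LAMMPS distribution), the relevant lines of a Makefile.machine for an offload-capable build might look like:

CCFLAGS =   -O2 -fopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
LINKFLAGS = -O2 -fopenmp -offload

For a CPU-only build, simply omit -DLMP_INTEL_OFFLOAD and -offload.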
Example makefiles Makefile.intel_cpu and Makefile.intel_phi are included in the src/MAKE/OPTIONS directory with settings that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.
If using an Intel compiler, it is recommended that Intel(R) Compiler 2013 SP1 update 1 be used. Newer versions have some performance issues that are being addressed. If using Intel(R) MPI, version 5 or higher is recommended.
Running with the USER-INTEL package from the command line:
The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
If you plan to compute (any portion of) pairwise interactions using USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU, you need to choose how many OpenMP threads per MPI task to use. Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.
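For example, on a node with 16 physical cores, 4 MPI tasks x 4 OpenMP threads/task = 16 uses each core exactly once, whereas 8 MPI tasks x 4 threads/task = 32 would oversubscribe the cores.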
If LAMMPS was built with coprocessor support for the USER-INTEL package, you also need to specify the number of coprocessors/node and the number of coprocessor threads per MPI task to use. Note that coprocessor threads (which run on the coprocessor) are totally independent from OpenMP threads (which run on the CPU). The default values for the settings that affect coprocessor threads are typically fine, as discussed below.
Use the "-sf intel" command-line switch, which will automatically append "intel" to styles that support it. If a style does not support it, an "omp" suffix is tried next. OpenMP threads per MPI task can be set via the "-pk intel Nphi omp Nt" or "-pk omp Nt" command-line switches, which set Nt = # of OpenMP threads per MPI task to use. The "-pk omp" form is only allowed if LAMMPS was also built with the USER-OMP package.
Use the "-pk intel Nphi" command-line switch to set Nphi = # of Xeon Phi(TM) coprocessors/node, if LAMMPS was built with coprocessor support. All the available coprocessor threads on each Phi will be divided among MPI tasks, unless the tptask option of the "-pk intel" command-line switch is used to limit the coprocessor threads per MPI task. See the package intel command for details.
CPU-only without USER-OMP (but using Intel vectorization on CPU):
lmp_machine -sf intel -in in.script                # 1 MPI task
mpirun -np 32 lmp_machine -sf intel -in in.script  # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes)
CPU-only with USER-OMP (and Intel vectorization on CPU):
lmp_machine -sf intel -pk omp 16 -in in.script               # 1 MPI task with 16 threads on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script   # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script  # ditto on 8 16-core nodes
CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
lmp_machine -sf intel -pk intel 1 omp 16 -in in.script                      # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script            # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script          # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script  # ditto on 8 16-core nodes
mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script          # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task
Note that if the "-sf intel" switch is used, it also invokes two default commands: package intel 1, followed by package omp 0. These both set the number of OpenMP threads per MPI task via the OMP_NUM_THREADS environment variable. The first command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and the precision mode to "mixed", as one of its option defaults). The latter command is not invoked if LAMMPS was not built with the USER-OMP package. The Nphi = 1 value for the first command is ignored if LAMMPS was not built with coprocessor support.
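In other words, for a build with coprocessor support and the USER-OMP package installed, running with only the "-sf intel" switch is equivalent to adding these two lines to the input script:

package intel 1
package omp 0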
Using the "-pk intel" or "-pk omp" switches explicitly allows for direct setting of the number of OpenMP threads per MPI task, and additional options for either of the USER-INTEL or USER-OMP packages. In particular, the "-pk intel" switch sets the number of coprocessors/node and can limit the number of coprocessor threads per MPI task. The syntax for these two switches is the same as the package omp and package intel commands. See the package command doc page for details, including the default values used for all its options if these switches are not specified, and how to set the number of OpenMP threads via the OMP_NUM_THREADS environment variable if desired.
Or run with the USER-INTEL package by editing an input script:
The discussion above for the mpirun/mpiexec command, MPI tasks/node, OpenMP threads per MPI task, and coprocessor threads per MPI task is the same.
Use the suffix intel command, or you can explicitly add an "intel" suffix to individual styles in your input script, e.g.
pair_style lj/cut/intel 2.5
You must also use the package intel command, unless the "-sf intel" or "-pk intel" command-line switches were used. It specifies how many coprocessors/node to use, as well as other OpenMP threading and coprocessor options. Its doc page explains how to set the number of OpenMP threads via an environment variable if desired.
If LAMMPS was also built with the USER-OMP package, you must also use the package omp command to enable that package, unless the "-sf intel" or "-pk omp" command-line switches were used. It specifies how many OpenMP threads per MPI task to use, as well as other options. Its doc page explains how to set the number of OpenMP threads via an environment variable if desired.
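Putting the pieces together, a minimal input-script fragment might look like the following sketch; the coprocessor count and thread counts are illustrative and should be adjusted to your hardware and build:

package intel 1 omp 4        # 1 coprocessor/node (ignored by CPU-only builds), 4 OpenMP threads/task
package omp 4                # only if LAMMPS was built with the USER-OMP package
suffix intel
pair_style lj/cut 2.5        # becomes lj/cut/intel via the suffix command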
Speed-ups to expect:
If LAMMPS was not built with coprocessor support when including the USER-INTEL package, then accelerated styles will run on the CPU using vectorization optimizations and the specified precision. This may give a substantial speed-up for a pair style, particularly if mixed or single precision is used.
If LAMMPS was built with coprocessor support, the pair styles will run on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The performance of a Xeon Phi versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/coprocessor, and the precision used on the coprocessor (double, single, mixed).
See the Benchmark page of the LAMMPS web site for performance of the USER-INTEL package on different hardware.
Guidelines for best performance on an Intel(R) Xeon Phi(TM) coprocessor:
Restrictions:
When offloading to a coprocessor, hybrid styles that require skip lists for neighbor builds cannot be offloaded. Using hybrid/overlay is allowed. Only one intel accelerated style may be used with hybrid styles. Special_bonds exclusion lists are not currently supported with offload, however, the same effect can often be accomplished by setting cutoffs for excluded atom types to 0. None of the pair styles in the USER-INTEL package currently support the "inner", "middle", "outer" options for rRESPA integration via the run_style respa command; only the "pair" option is supported.
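As a sketch of the zero-cutoff workaround mentioned above (assuming a lj/cut pair style and that interactions between atom types 1 and 2 are the ones to be excluded), one could write:

pair_coeff 1 2 0.0 1.0 0.0   # epsilon sigma cutoff; the 0.0 cutoff effectively excludes type 1-2 interactions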