Alumni Project

Performance Optimization of Sparse Kernels in TOPS

PIs: J. Demmel2 , J. Dongarra4 , V. Eijkhout4 , S. Li2,3 , B. Smith1 , R. Vuduc3 1 Argonne National Lab, 2 Lawrence Berkeley National Lab, 3 U. California-Berkeley, 4 U. Tennessee

Summary

To deliver as high a percentage of per-processor peak performance as is practical on the hierarchical memory architectures on which most SciDAC scientists make their production runs, the Terascale Optimal PDE Simulations (TOPS) project is researching innovative strategies for tuning the kernels that arise most often in solving sparse linear systems from DOE simulations.

TOPS is automatically tuning performance of sparse matrix kernels that dominate many scientific and engineering applications. Given a sparse matrix (i.e., its sparsity pattern and other properties like symmetry), the operation to be performed (e.g., sparse matrix vector multiply or SpMV, triangular solve, SOR smoother sweep) and the processor architecture, it is possible to build a custom data structure and implementation that can substantially increase performance. The speedups from uniprocessor tuning depend strongly on the matrix, the operation to be performed, and architecture, but improvement of several factors are possible.

Straightforward implementations of standard operations on large data objects can run slowly on contemporary hierarchical memory processors, where the main memory latency is 100 or more times greater than the processor clock period. For operations with no cache locality whatsoever, this bounds performance at 1% of peak or less.

Local discretizations (e.g., finite elements) of partial differential equations (PDEs) typically give rise to sparse linear systems that are much less amenable to obtaining a high percentage of peak than dense systems using tuned BLAS (Basic Linear Algebra Subroutines). SpMV is the most important kernel in iterative methods for these PDEs. Its memory traffic pattern is also analogous to evaluating a differential operator on a grid function, a grid transfer operation in a multilevel method, or accumulating the right-hand side in a triangular solve: each matrix element is used only once, magnifying the importance of tuning these operations. The added complication in the SOR sweep is the sequentiality, making it harder to optimize by rearranging operations.

Multicomponent systems of PDEs, when ordered most rapidly by unknowns at a gridpoint, have a sparse structure of dense blocks. Furthermore, density increases through fill-in in incomplete factors often employed as preconditioners, or in exact LU factorizations. By exploiting effects like these through hand-tuning, researchers using the now TOPS-supported PETSc software were able to obtain up to 25% of peak uniprocessor performance on hierarchical memory machines in an implicit aerodynamics computation en route to a Bell Prize in 1999, for a code whose original uniprocessor performance was less than 5%. We are automating such optimizations in TOPS.

Whereas matrices arising in PDE simulations are sparse, vectors representing gridfunctions are dense. Some linear algebraic methods operate on multiple vector columns simultaneously, providing another source of density to exploit. Historically, block iterative methods of this type have not enjoyed great favor. However, tests performed by TOPS researchers show improved computation rates for block algorithms due to their superior memory locality—an effect that may overcome convergence rate disadvantages in many problem-architecture combinations.

Among the locality-enhancing techniques for SpMV and SOR being studied by TOPS researchers are register- and cache-level blocking, multiplication by multiple dense vectors, symmetry, splitting for variable block sizes, diagonal substructure, reordering to create block structure, and combinations. These ideas are being applied to other kernels as well.

A 100 ´ 100 spy plot taken from the diagonal of a larger matrix arising in an accelerator cavity design problem being in conjunction with researchers at the Stanford Linear Accelerator Center.Potential single processor speedups for SpMV over a blocked, naturally ordered, nonsymmetric implementation:
Figure 1. ( Left) A 100 ´ 100 spy plot taken from the diagonal of a larger matrix arising in an accelerator cavity design problem being in conjunction with researchers at the Stanford Linear Accelerator Center. The nonzero pattern in the natural ordering consists of the green asterisks and red circles. A judicious reordering of rows and columns enhances locality by creating block structure , as shown in the pattern consisting of green asterisks and blue dots. This reordering is computed by approximating a solution to the Traveling Salesman Problem (TSP). ( Right) Potential single processor speedups for SpMV over a blocked, naturally ordered, nonsymmetric implementation: a combination of a good reordering, symmetry, and register-level blocking is 1.6–3.3 ´ faster across a variety of platforms . The percentage of uniprocessor peak achieved when all of these optimizations are applied can be as high as 23% (on an IBM Power 4).

A current goal is to speed up important DOE applications using these tuned implementations. The proof-of-concept experiment summarized in Figure 1 shows how the uniprocessor performance of SpMV may be improved for a matrix extracted from a DOE accelerator cavity design application. Reordering the rows and columns, based on approximating a solution to the Traveling Salesman Problem, actually, thereby enables register-level blocking optimizations. The combination of exploiting the nonzero symmetry and the block structure created by this reordering leads to speedups of 1.6–3.3 ´ over a blocked, naturally ordered, non-symmetric implementation on a variety of platforms, and up to 23% of peak (on an IBM Power 4). Integration into the full-scale parallel applications is ongoing.

The TOPS project webpage may be found at http://www.tops-scidac.org .

For further information on this subject contact:
Prof. David E. Keyes, Project Lead
Columbia University
Phone: 212-854-1120
david.keyes@columbia.edu

back to project page

 


Home  |  ASCR  |  Contact Us  |  DOE disclaimer