Online Book Reader

Home Category

Beautiful Code [198]

By Root 5120 0
Factor diagonal and subdiagonal blocks and test for exact

* singularity.

CALL DGETF2( M-J+1, JB, A( J, J ), LDA, IPIV( J ), IINFO )

* Adjust INFO and the pivot indices.

IF( INFO.EQ.0 .AND. IINFO.GT.0 ) INFO = IINFO + J - 1

DO 10 I = J, MIN( M, J+JB-1 )

IPIV( I ) = J - 1 + IPIV( I )

10 CONTINUE

* Apply interchanges to columns 1:J-1.

CALL DLASWP( J-1, A, LDA, J, J+JB-1, IPIV, 1 )

*

IF( J+JB.LE.N ) THEN

* Apply interchanges to columns J+JB:N.

CALL DLASWP( N-J-JB+1, A( 1, J+JB ), LDA, J, J+JB-1, IPIV, 1 )

* Compute block row of U.

CALL DTRSM( 'Left', 'Lower', 'No transpose', 'Unit', JB,

$ N-J-JB+1, ONE, A( J, J ), LDA, A( J, J+JB ), LDA )

IF( J+JB.LE.M ) THEN

* Update trailing submatrix.

CALL DGEMM( 'No transpose', 'No transpose', M-J-JB+1,

$ N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,

$ A( J, J+JB ), LDA, ONE, A( J+JB, J+JB ), LDA )

END IF

END IF

20 CONTINUE

END IF

RETURN

end

Most of the computational work in the algorithm from Example 14-3 is contained in three routines:

DGEMM

Matrix-matrix multiplication

DTRSM

Triangular solve with multiple righthand sides

DGETF2

Unblocked LU factorization for operations within a block column

One of the key parameters in the algorithm is the block size, called NB here. If NB is too small or too large, poor performance can result—hence the importance of the ILAENV function, whose standard implementation was meant to be replaced by a vendor implementation encapsulating machine-specific parameters upon installation of the LAPACK library. At any given point of the algorithm, NB columns or rows are exposed to a well-optimized Level-3 BLAS. If NB is 1, the algorithm is equivalent in performance and memory access patterns to the LINPACK's version.

Matrix-matrix operations offer the proper level of modularity for performance and transportability across a wide range of computer architectures, including parallel systems with memory hierarchy. This enhanced performance is primarily due to a greater opportunity for reusing data. There are numerous ways to accomplish this reuse of data to reduce memory traffic and to increase the ratio of floating-point operations to data movement through the memory hierarchy. This improvement can bring a three-to ten-fold improvement in performance on modern computer architectures.

The jury is still out concerning the productivity of writing and reading the LAPACK code: how hard is it to generate the code from its mathematical description? The use of vector notation in LINPACK is arguably more natural than LAPACK's matrix formulation. The mathematical formulas that describe algorithms are usually more complex if only matrices are used, as opposed to mixed vector-matrix notation.

How Elegant Code Evolves with Hardware The Case of Gaussian Elimination > Recursive LU

14.6. Recursive LU

Setting the block size parameter for the LAPACK's LU might seem like a trivial matter at first. But in practice, it requires a lot of tuning for various precisions and matrix sizes. Many users end up leaving the setting unchanged, even if the tuning has to be done only once at installation. This problem is exacerbated by the fact that not just one but many LAPACK routines use a blocking parameter.

Another issue with LAPACK's formulation of LU is the factorization of tall and narrow panels of columns performed by the DGETF2 routine. It uses Level-1 BLAS and was found to become a bottleneck as the processors became faster throughout the 1990s without corresponding increases in memory bandwidth.

A solution came from a rather unlikely direction: divide-and-conquer recursion. In place of LAPACK's looping constructs, the newer recursive LU algorithm splits the work in half, factorizes the left part of the matrix, updates the rest of the matrix, and factorizes the right part. The use of Level-1 BLAS is reduced to an acceptable minimum, and most of the calls to Level-3 BLAS operate on larger portions of the matrix than LAPACK's algorithm. And, of course, the block size does not have to be tuned anymore.

Recursive LU required the use of Fortran 90,

Return Main Page Previous Page Next Page

®Online Book Reader