Beautiful Code [199]
Fortran 90 compilers use the LDA parameter to avoid copying data into a contiguous buffer when calling external routines such as the BLAS. Without LDA, the compiler has to assume the worst case: the input matrix a might not be contiguous and would need to be copied to a temporary contiguous buffer so that the BLAS call does not make an out-of-bounds memory access. With LDA, the compiler passes array pointers to the BLAS without any copies.
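The reason LDA suffices is Fortran's column-major layout: element a(i, j) of an array declared with leading dimension lda sits, in a 0-based flat view, at index i + j*lda, so any submatrix is just an offset into the parent's storage with the same column stride. A small Python sketch of that addressing (the idx helper and the values are our illustration, not from the chapter):

```python
# Column-major (Fortran-order) addressing with an explicit leading
# dimension: a submatrix keeps the parent's stride, so no copy is needed.

def idx(i, j, lda):
    """Flat index of element (i, j), 0-based, column stride lda."""
    return i + j * lda

lda = 4
parent = [0] * (lda * 4)          # a 4x4 matrix stored column by column
for j in range(4):
    for i in range(4):
        parent[idx(i, j, lda)] = 10 * i + j   # value encodes (row, col)

# The trailing 2x2 submatrix starting at (2, 2) -- the analogue of
# a(nleft+1, nleft+1) in the Fortran code -- is just an offset into the
# same flat storage, still addressed with the parent's lda:
offset = idx(2, 2, lda)
sub = lambda i, j: parent[offset + idx(i, j, lda)]

print(sub(0, 0), sub(1, 1))  # -> 22 33, i.e. elements (2,2) and (3,3)
```

This is exactly what passing a(nleft+1, nleft+1) together with the original lda does in the calls below: the callee sees a smaller matrix but walks the same memory.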
Example 14-4 shows recursive LU factorization.
Example 14-4. Recursive variant (Fortran 90 coding)
      recursive subroutine rdgetrf(m, n, a, lda, ipiv, info)
      implicit none

      integer, intent(in) :: m, n, lda
      double precision, intent(inout) :: a(lda,*)
      integer, intent(out) :: ipiv(*)
      integer, intent(out) :: info

      integer :: mn, nleft, nright, i
      double precision :: tmp

      double precision :: pone, negone, zero
      parameter (pone=1.0d0)
      parameter (negone=-1.0d0)
      parameter (zero=0.0d0)

      intrinsic min

      integer idamax
      external dgemm, dtrsm, dlaswp, idamax, dscal

      mn = min(m, n)

      if (mn .gt. 1) then
         nleft = mn / 2
         nright = n - nleft

         call rdgetrf(m, nleft, a, lda, ipiv, info)

         if (info .ne. 0) return

         call dlaswp(nright, a(1, nleft+1), lda, 1, nleft, ipiv, 1)

         call dtrsm('L', 'L', 'N', 'U', nleft, nright, pone, a, lda,
     $              a(1, nleft+1), lda)

         call dgemm('N', 'N', m-nleft, nright, nleft, negone,
     $              a(nleft+1, 1), lda, a(1, nleft+1), lda, pone,
     $              a(nleft+1, nleft+1), lda)

         call rdgetrf(m-nleft, nright, a(nleft+1, nleft+1), lda,
     $                ipiv(nleft+1), info)

         if (info .ne. 0) then
            info = info + nleft
            return
         end if

         do i = nleft+1, m
            ipiv(i) = ipiv(i) + nleft
         end do

         call dlaswp(nleft, a, lda, nleft+1, mn, ipiv, 1)

      else if (mn .eq. 1) then

         i = idamax(m, a, 1)
         ipiv(1) = i
         tmp = a(i, 1)

         if (tmp .ne. zero .and. tmp .ne. -zero) then

            call dscal(m, pone/tmp, a, 1)

            a(i, 1) = a(1, 1)
            a(1, 1) = tmp

         else

            info = 1

         end if

      end if

      return
      end
There is a certain degree of elegance in the recursive variant. No loops are exposed in the routine. Instead, the algorithm is driven by the recursive nature of the method (see the paper by F. G. Gustavson listed under "Further Reading").
The recursive LU algorithm consists of four basic steps, illustrated in Figure 14-2:

1. Split the matrix into two rectangles (m × n/2); if the left part ends up being only a single column, scale it by the reciprocal of the pivot and return.

2. Apply the LU algorithm to the left part.

3. Apply transformations to the right part (perform the triangular solve A12 = L^-1 * A12 and the matrix multiplication A22 = A22 - A21 * A12).

4. Apply the LU algorithm to the right part.
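The four steps map directly onto the structure of rdgetrf. Here is a pure-Python sketch of the same recursion (our illustration, not the chapter's code): list-of-lists storage with 0-based indices, where the inner loops stand in for the dlaswp, dtrsm, dgemm, and dscal calls, and (r0, c0) plays the role of the submatrix pointers:

```python
# Recursive LU factorization with partial pivoting, mirroring rdgetrf.
# (r0, c0) is the top-left corner of the current m-by-n block.
# ipiv[r0+i] records, 0-based and relative to the block, which row the
# block's row i was swapped with.  Returns 0 on success, or (as in
# LAPACK) a positive index when a zero pivot is found.

def rdgetrf(a, ipiv, r0, c0, m, n):
    mn = min(m, n)
    if mn > 1:
        nleft = mn // 2
        nright = n - nleft

        # Factor the left m-by-nleft panel recursively.
        info = rdgetrf(a, ipiv, r0, c0, m, nleft)
        if info != 0:
            return info

        # Apply the panel's row swaps to the right part (dlaswp).
        for i in range(nleft):
            p = ipiv[r0 + i]
            if p != i:
                for j in range(c0 + nleft, c0 + n):
                    a[r0 + i][j], a[r0 + p][j] = a[r0 + p][j], a[r0 + i][j]

        # Triangular solve A12 = L^-1 * A12 (dtrsm, unit lower triangular).
        for j in range(c0 + nleft, c0 + n):
            for i in range(1, nleft):
                a[r0 + i][j] -= sum(a[r0 + i][c0 + k] * a[r0 + k][j]
                                    for k in range(i))

        # Schur complement A22 = A22 - A21 * A12 (dgemm).
        for i in range(nleft, m):
            for j in range(c0 + nleft, c0 + n):
                a[r0 + i][j] -= sum(a[r0 + i][c0 + k] * a[r0 + k][j]
                                    for k in range(nleft))

        # Factor the trailing (m-nleft)-by-nright part recursively.
        info = rdgetrf(a, ipiv, r0 + nleft, c0 + nleft, m - nleft, nright)
        if info != 0:
            return info + nleft

        # Make the trailing pivots relative to this block, then apply
        # their swaps back to the left columns (dlaswp).
        for i in range(nleft, mn):
            ipiv[r0 + i] += nleft
        for i in range(nleft, mn):
            p = ipiv[r0 + i]
            if p != i:
                for j in range(c0, c0 + nleft):
                    a[r0 + i][j], a[r0 + p][j] = a[r0 + p][j], a[r0 + i][j]

    elif mn == 1:
        # Single column: pick the largest pivot (idamax), swap it to the
        # top, and scale the rest of the column by 1/pivot (dscal).
        i = max(range(m), key=lambda k: abs(a[r0 + k][c0]))
        ipiv[r0] = i
        tmp = a[r0 + i][c0]
        if tmp == 0.0:
            return 1
        a[r0 + i][c0] = a[r0][c0]
        a[r0][c0] = tmp
        for k in range(1, m):
            a[r0 + k][c0] /= tmp
    return 0

A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
ipiv = [0] * 3
rdgetrf(A, ipiv, 0, 0, 3, 3)
print(ipiv)  # -> [2, 2, 2]: rows 0 and 1 each swap with row 2 in turn
```

After the call, A holds the unit lower triangle of L below its diagonal and U on and above it, exactly as dgetrf leaves its argument.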
Figure 14-2. Recursive LU factorization
Most of the work is performed in the matrix multiplications, which operate on successive matrices of size n/2, n/4, n/8, etc. The implementation in Example 14-4 can show about a 10 percent improvement in performance over the LAPACK implementation given in Example 14-3.
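The dominance of the multiplications can be made concrete by summing approximate flop counts over the recursion tree. The sketch below is our illustration, using the standard estimates (about 2mnk flops for a dgemm of those dimensions, about nleft² flops per column for a unit-triangular dtrsm), not figures from the chapter:

```python
# Approximate flop totals for rdgetrf(m, n), split by BLAS routine.

def flops(m, n):
    mn = min(m, n)
    if mn <= 1:
        return (0, 0, m)                      # (gemm, trsm, other); base
                                              # case idamax + dscal is O(m)
    nleft = mn // 2
    nright = n - nleft
    g1, t1, o1 = flops(m, nleft)              # left panel
    gemm = 2 * (m - nleft) * nright * nleft   # Schur complement update
    trsm = nleft * nleft * nright             # triangular solve
    g2, t2, o2 = flops(m - nleft, nright)     # trailing part
    return (g1 + g2 + gemm, t1 + t2 + trsm, o1 + o2)

g, t, o = flops(1024, 1024)
print(f"gemm fraction of total work: {g / (g + t + o):.2f}")
```

For a square 1024 × 1024 factorization roughly three quarters of all flops land in dgemm, with dtrsm accounting for most of the rest; the proportions stay essentially fixed as n grows, because each recursive level splits the work the same way.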
In a sense, any of the previous renditions of the LU algorithm could be considered a step backward in terms of code elegance. But divide-and-conquer recursion was a tremendous leap forward (even setting aside the modest performance gains). The recursive algorithm for matrix factorization can now be taught to students alongside other recursive algorithms, such as various kinds of sorting methods.
By changing just the size of matrix parts, it is possible to achieve the same memory access pattern as in LINPACK or LAPACK. Setting nleft to 1 makes the code operate on vectors, just as in LINPACK, whereas setting nleft to NB>1 makes it behave like LAPACK's blocked code. In both cases, the original recursion