Beautiful Code [203]
The multithreaded version of the algorithm recognizes the existence of a so-called critical path in the algorithm: a portion of the code whose execution depends on previous calculations and can block the progress of the algorithm. The LAPACK's LU does not treat this critical portion of the code in any special way: the DGETF2 subroutine is called by a single thread and doesn't allow much parallelization even at the BLAS level. While one thread calls this routine, the other ones wait idly. And since the performance of DGETF2 is bound by memory bandwidth (rather than processor speed), this bottleneck will exacerbate scalability problems as systems with more cores are introduced.
The multithreaded version of the algorithm attacks this problem head-on by introducing the notion of look-ahead: calculating things ahead of time to avoid potential stagnation in the progress of the computations. This of course requires additional synchronization and bookkeeping not present in the previous versions—a trade-off between code complexity and performance. Another aspect of the multithreaded code is the use of recursion in the panel factorization. It turns out that the use of recursion can give even greater performance benefits for tall panel matrices than it does for the square ones.
Example 14-6 shows a factorization suitable for multithreaded execution.
Example 14-6. Factorization for multithreaded execution (C code)
Code View: Scroll / Show All
void SMP_dgetrf(int n, double *a, int lda, int *ipiv, int pw,
int tid, int tsize, int *pready,ptm *mtx, ptc *cnd) {
int pcnt, pfctr, ufrom, uto, ifrom, p;
double *pa = a, *pl, *pf, *lp;
pcnt = n / pw; /* number of panels */
pfctr = tid + (tid ? 0 : tsize); /* first panel that should be factored by this
thread after the very first panel (number 0) gets factored */
/* this is a pointer to the last panel */
lp = a + (size_t)(n - pw) * (size_t)lda;
/* for each panel (that is used as source of updates) */
for (ufrom = 0; ufrom < pcnt; ufrom++, pa += (size_t)pw * (size_t)(lda + 1)){
p = ufrom * pw; /* column number */
/* if the panel to be used for updates has not been factored yet; 'ipiv'
does not be consulted, but it is to possibly avoid accesses to 'pready'*/
if (! ipiv[p + pw - 1] || ! pready[ufrom]) {
if (ufrom % tsize == tid) { /* if this is this thread's panel */
pfactor( n - p, pw, pa, lda, ipiv + p, pready, ufrom, mtx, cnd );
} else if (ufrom < pcnt - 1) { /* if this is not the last panel */
LOCK( mtx );
while (! pready[ufrom]) { WAIT( cnd, mtx ); }
UNLOCK( mtx );
}
}
/* for each panel to be updated */
for (uto = first_panel_to_update( ufrom, tid, tsize ); uto < pcnt;
uto += tsize) {
/* if there are still panels to factor by this thread and preceding panel
has been factored; test to 'ipiv' could be skipped but is in there to
decrease number of accesses to 'pready' */
if (pfctr < pcnt && ipiv[pfctr * pw - 1] && pready[pfctr - 1]) {
/* for each panel that has to (still) update panel 'pfctr' */
for (ifrom = ufrom + (uto > pfctr ? 1 : 0); ifrom < pfctr; ifrom++) {
p = ifrom * pw;
pl = a + (size_t)p * (size_t)(lda + 1);
pf = pl + (size_t)(pfctr - ifrom) * (size_t)pw * (size_t)lda;
pupdate( n - p, pw, pl, pf, lda, p, ipiv, lp );
}
p = pfctr * pw;
pl = a + (size_t)p * (size_t)(lda + 1);
pfactor( n - p, pw, pl, lda, ipiv + p, pready, pfctr, mtx, cnd );
pfctr += tsize; /* move to this thread's next panel */
}
/* if panel 'uto' hasn't been factored (if it was, it certainly has been
updated, so no update is necessary) */
if (uto > pfctr || ! ipiv[uto * pw]) {
p = ufrom * pw;
pf = pa + (size_t)(uto - ufrom) * (size_t)pw * (size_t)lda;
pupdate( n - p, pw, pa, pf, lda, p, ipiv, lp );
}
}
}
The algorithm