High Performance Computing - Charles Severance [88]
Interestingly, when using user-space threads, an attempt to lock an already-locked mutex, semaphore, or lock can cause a thread context switch. This gives the thread that “owns” the lock a better chance to make progress toward the point where it will unlock the critical section. Also, the act of unlocking a mutex can cause a thread waiting for the mutex to be dispatched by the thread library.
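The behavior described above can be sketched with a POSIX threads mutex. This is a minimal illustration, not one of the book’s listings, and the names (increment_worker, count_mutex) are made up for the example: several threads increment a shared counter, and any thread that finds the mutex already locked blocks in pthread_mutex_lock( ) until the owner reaches pthread_mutex_unlock( ), which may dispatch one of the waiters.

   #include <pthread.h>
   #include <stdio.h>

   #define N_THREADS 4
   #define N_ITERS   100000

   static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
   static long counter = 0;     /* shared state protected by count_mutex */

   /* Each thread increments the shared counter inside a critical section.
      If the mutex is already held, pthread_mutex_lock() blocks this thread
      until the owner unlocks; the unlock may wake a waiting thread. */
   static void *increment_worker(void *arg) {
      int i;
      for (i = 0; i < N_ITERS; i++) {
         pthread_mutex_lock(&count_mutex);    /* may block if contended */
         counter++;                           /* critical section */
         pthread_mutex_unlock(&count_mutex);  /* may dispatch a waiter */
      }
      return NULL;
   }

   int main(void) {
      pthread_t tids[N_THREADS];
      int i;
      for (i = 0; i < N_THREADS; i++)
         pthread_create(&tids[i], NULL, increment_worker, NULL);
      for (i = 0; i < N_THREADS; i++)
         pthread_join(tids[i], NULL);
      /* Because every increment is protected, none are lost. */
      printf("counter=%ld\n", counter);   /* prints counter=400000 */
      return 0;
   }

Without the mutex, the increments of the four threads would interleave and some updates would be lost; with it, the final count is always N_THREADS * N_ITERS.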
Barriers
Barriers are different from critical sections. Sometimes in a multithreaded application, you need all threads to arrive at a point before allowing any thread to execute beyond that point. An example of this is a time-based simulation. Each task processes its portion of the simulation but must wait until all of the threads have completed the current time step before any thread can begin the next time step. Typically threads are created, and then each thread executes a loop with one or more barriers in the loop. The rough pseudocode for this type of approach is as follows:
   main() {
      for (ith=0; ith<NUM_THREADS; ith++) create_thread(work_routine);
      wait_for_all_threads_to_complete();
   }

   work_routine() {
      for (ts=0; ts<10000; ts++) {    /* Time Step Loop */
         /* Compute total forces on particles */
         wait_barrier();
         /* Update particle positions based on the forces */
         wait_barrier();
      }
      return;
   }

In a sense, our SpinFunc( ) function implements a barrier. It sets a variable initially to 0. Then as threads arrive, the variable is incremented in a critical section. Immediately after the critical section, the thread spins until the precise moment that all the threads are in the spin loop, at which time all threads exit the spin loop and continue on.

For a critical section, only one processor can be executing in the critical section at a time. For a barrier, all processors must arrive at the barrier before any of the processors can leave.

A Real Example

In all of the above examples, we have focused on the mechanics of shared memory, thread creation, and thread termination. We have used the sleep( ) routine to slow things down sufficiently to see interactions between processes. But we want to go very fast, not just learn threading for threading’s sake.

The example code below uses the multithreading techniques described in this chapter to speed up a sum of a large array. The hpcwall routine is from the section called “Introduction”. This code allocates a four-million-element double-precision array and fills it with random numbers between 0 and 1. Then using one, two, three, and four threads, it sums up the elements in the array:

   #define _REENTRANT    /* basic 3-lines for threads */
   #include <stdio.h>
   #include <stdlib.h>
   #include <pthread.h>

   #define MAX_THREAD 4
   void *SumFunc(void *);
   int ThreadCount;                  /* Threads on this try */
   double GlobSum;                   /* A global variable */
   int index[MAX_THREAD];            /* Local zero-based thread index */
   pthread_t thread_id[MAX_THREAD];  /* POSIX Thread IDs */
   pthread_attr_t attr;              /* Thread attributes NULL=use default */
   pthread_mutex_t my_mutex;         /* MUTEX data structure */

   #define MAX_SIZE 4000000
   double array[MAX_SIZE];           /* What we are summing...
   */

   void hpcwall(double *);

   main() {
      int i,retval;
      pthread_t tid;
      double single,multi,begtime,endtime;

      /* Initialize things */
      for (i=0; i<MAX_SIZE; i++) array[i] = drand48();
      pthread_attr_init(&attr);
      pthread_mutex_init(&my_mutex, NULL);
      pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

      /* Single threaded sum */
      GlobSum = 0;
      hpcwall(&begtime);
      for (i=0; i<MAX_SIZE; i++) GlobSum = GlobSum + array[i];
      hpcwall(&endtime);
      single = endtime - begtime;
      printf("Single sum=%lf time=%lf\n",GlobSum,single);

      /* Use different numbers of threads to accomplish the same thing */
      for (ThreadCount=2; ThreadCount<=MAX_THREAD; ThreadCount++) {
         printf("Threads=%d\n",ThreadCount);
         GlobSum = 0;
         hpcwall(&begtime);
         for (i=0; i<ThreadCount; i++) {
            index[i] = i;
            retval = pthread_create(&tid,&attr,SumFunc,(void
*