

# Symmetric Multiprocessors: Synchronization and Sequential Consistency

Arvind  
Computer Science and Artificial Intelligence Lab  
M.I.T.

> Based on the material prepared by Arvind and Krste Asanovic

# Symmetric Multiprocessors



6.823 L16- 3 Arvind

# Synchronization

The need for synchronization arises whenever there are parallel processes in a system (even in a uniprocessor system)

*Forks and Joins:* In parallel programming a parallel process may want to wait until several events have occurred

*Producer-Consumer:* A consumer process must wait until the producer process has produced data

*Exclusive use of a resource:* The operating system has to ensure that only one process uses a resource at a given time





### A Producer-Consumer Example



Problems?



November 7, 2005


# A Producer-Consumer Example

Producer posting Item x:

Load(R<sub>tail</sub>, tail)  
Store(R<sub>tail</sub>, x)  (1)  
R<sub>tail</sub> = R<sub>tail</sub> + 1  
Store(tail, R<sub>tail</sub>)  (2)

Consumer:

Load(R<sub>head</sub>, head)  
spin: Load(R<sub>tail</sub>, tail)  (3)  
if R<sub>head</sub> == R<sub>tail</sub> goto spin  
Load(R, R<sub>head</sub>)  (4)  
R<sub>head</sub> = R<sub>head</sub> + 1  
Store(head, R<sub>head</sub>)  
process(R)

Can the tail pointer get updated before the item x is stored?

Programmer assumes that if (3) happens after (2), then (4) happens after (1).

Problem sequences are:

- 2, 3, 4, 1
- 4, 1, 2, 3
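The two problem sequences can be replayed with a small Python model. This is only a sketch under stated assumptions: labels 1 and 2 are taken to be the producer's two Stores (item, then tail), labels 3 and 4 the consumer's Loads of tail and of the item, and the `run` helper, the `"stale"` placeholder and the initial queue state are hypothetical. In this model the in-order sequence delivers `x`, while both problem sequences hand the consumer the old contents of the slot.

```python
def run(order):
    """Apply the four labeled memory operations in the given global order.

    Returns the value the consumer obtains at (4), or None if the consumer
    saw an empty queue at (3) and would have kept spinning.
    """
    # Hypothetical initial state: empty queue, slot 0 still holds old data.
    mem = {"head": 0, "tail": 0, 0: "stale"}
    r_tail_producer = mem["tail"]   # producer's Load(R_tail, tail)
    r_head = mem["head"]            # consumer's Load(R_head, head)
    r_tail_consumer = None
    r = None
    for op in order:
        if op == 1:                 # (1) Store(R_tail, x)
            mem[r_tail_producer] = "x"
        elif op == 2:               # (2) Store(tail, R_tail), after the increment
            mem["tail"] = r_tail_producer + 1
        elif op == 3:               # (3) spin: Load(R_tail, tail)
            r_tail_consumer = mem["tail"]
        elif op == 4:               # (4) Load(R, R_head)
            r = mem[r_head]
    if r_tail_consumer == r_head:
        return None                 # queue looked empty, the load is discarded
    return r


for order in [(1, 2, 3, 4), (2, 3, 4, 1), (4, 1, 2, 3)]:
    print(order, "->", run(order))
```

The model deliberately ignores program order, since the point of the slide is that hardware or a compiler may not respect it.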



#### Sequential Consistency: A Memory Model



"A system is *sequentially consistent* if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program" (*Leslie Lamport*)

Sequential Consistency = arbitrary *order-preserving interleaving* of memory references of sequential programs



## Sequential Consistency

Sequential concurrent tasks: T1, T2

Shared variables: X, Y (initially X = 0, Y = 10)

| T1 | T2 |
|----|----|
| Store(X, 1) (X = 1) | Load(R<sub>1</sub>, Y) |
| Store(Y, 11) (Y = 11) | Store(Y', R<sub>1</sub>) (Y' = Y) |
| | Load(R<sub>2</sub>, X) |
| | Store(X', R<sub>2</sub>) (X' = X) |

What are the legitimate answers for X' and Y'?

If Y' is 11 then X' cannot be 0
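One way to answer the question is to enumerate every order-preserving interleaving of T1 and T2 and collect the resulting values. The following Python sketch does that; the operation encoding (`store_const`, `load`, `store_reg`) and the helper names are illustrative assumptions, not part of the lecture.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest


# (kind, destination, source) triples for the two tasks.
T1 = [("store_const", "X", 1), ("store_const", "Y", 11)]
T2 = [("load", "R1", "Y"), ("store_reg", "Y'", "R1"),
      ("load", "R2", "X"), ("store_reg", "X'", "R2")]


def execute(schedule):
    """Run one interleaving against a single shared memory; return (X', Y')."""
    mem = {"X": 0, "Y": 10}
    regs = {}
    for kind, dst, src in schedule:
        if kind == "store_const":
            mem[dst] = src
        elif kind == "load":
            regs[dst] = mem[src]
        else:  # store_reg
            mem[dst] = regs[src]
    return mem["X'"], mem["Y'"]


outcomes = {execute(s) for s in interleavings(T1, T2)}
print(sorted(outcomes))
```

Under these assumptions the set of (X', Y') pairs contains (0, 10), (1, 10) and (1, 11) but never (0, 11), which matches the observation that Y' = 11 rules out X' = 0.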



## Sequential Consistency

Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies.

What are these in our example?

| T1 | T2 |
|----|----|
| Store(X, 1) (X = 1) | Load(R<sub>1</sub>, Y) |
| Store(Y, 11) (Y = 11) | Store(Y', R<sub>1</sub>) (Y' = Y) |
| | Load(R<sub>2</sub>, X) |
| | Store(X', R<sub>2</sub>) (X' = X) |

Does (can) a system with caches or out-of-order execution capability provide a *sequentially consistent* view of the memory?

(more on this later)



## Multiple Consumer Example




#### Locks or Semaphores (E. W. Dijkstra, 1965)

A *semaphore* is a non-negative integer, with the following operations:

P(s): *if s > 0, decrement s by 1; otherwise wait*

V(s): increment s by 1 and wake up one of the waiting processes

P's and V's must be executed atomically, i.e., without

- interruptions or
- *interleaved accesses to s* by other processors

Process i:

P(s)  
< critical section>  
V(s)

*The initial value of s determines the maximum number of processes in the critical section*
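The P and V operations can be sketched in Python on top of `threading.Condition`, which supplies the required atomicity and the wait and wake-up mechanism. The class and the `max_concurrency` harness are illustrative assumptions (Python's standard library already ships its own `threading.Semaphore`); the harness records how many workers are ever inside the critical section at once.

```python
import threading


class Semaphore:
    """A non-negative counter s with atomic P and V operations."""

    def __init__(self, s):
        assert s >= 0
        self.s = s
        self._cond = threading.Condition()

    def P(self):
        with self._cond:
            while self.s == 0:      # otherwise wait
                self._cond.wait()
            self.s -= 1             # if s > 0, decrement s by 1

    def V(self):
        with self._cond:
            self.s += 1             # increment s by 1
            self._cond.notify()     # wake up one of the waiting processes


def max_concurrency(initial, workers=8, rounds=50):
    """Return (peak number of workers inside the section, final value of s)."""
    sem = Semaphore(initial)
    guard = threading.Lock()        # protects the bookkeeping only
    inside = 0
    peak = 0

    def worker():
        nonlocal inside, peak
        for _ in range(rounds):
            sem.P()
            with guard:
                inside += 1
                peak = max(peak, inside)
            with guard:
                inside -= 1
            sem.V()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak, sem.s
```

With an initial value of 2, the peak never exceeds 2, and the counter returns to 2 once every worker has executed its final V.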



## Implementation of Semaphores

Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design...

Simpler solution:

atomic read-modify-write instructions

Examples: *m is a memory location, R is a register* 

Fetch&Add(m, $R_v$, R):  
$R \leftarrow M[m];$  
$M[m] \leftarrow R + R_v;$

Swap(m, R):  
$R_t \leftarrow M[m];$  
$M[m] \leftarrow R;$  
$R \leftarrow R_t;$
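The semantics of these two instructions can be mimicked in Python. In this sketch a `threading.Lock` stands in for the hardware's guarantee that the read and the write happen as one indivisible step; the `Memory` class and the `parallel_count` helper are hypothetical names introduced for illustration.

```python
import threading


class Memory:
    """Word-addressed memory whose read-modify-write operations are atomic."""

    def __init__(self):
        self._words = {}
        self._bus = threading.Lock()   # models hardware atomicity

    def load(self, m):
        with self._bus:
            return self._words.get(m, 0)

    def fetch_and_add(self, m, r_v):
        """R <- M[m]; M[m] <- R + R_v. Returns R."""
        with self._bus:
            r = self._words.get(m, 0)
            self._words[m] = r + r_v
            return r

    def swap(self, m, r):
        """R_t <- M[m]; M[m] <- R; R <- R_t. Returns the new value of R."""
        with self._bus:
            r_t = self._words.get(m, 0)
            self._words[m] = r
            return r_t


def parallel_count(n_threads=4, n_increments=1000):
    """Several threads bump one shared counter using Fetch&Add only."""
    mem = Memory()

    def worker():
        for _ in range(n_increments):
            mem.fetch_and_add("counter", 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.load("counter")
```

Because each increment is a single atomic step, no update is lost and `parallel_count()` returns `n_threads * n_increments`.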




#### Multiple Consumers Example using the Test&Set Instruction



Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's

What if the process stops or is swapped out while in the critical section?
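As a rough illustration of a Test&Set-based critical section with several consumers, here is a Python sketch. It assumes the usual definition of Test&Set (read M[m] and, if it was 0, set it to 1, all atomically), which the text above does not spell out; the class and function names are hypothetical, and a `threading.Lock` again stands in for hardware atomicity.

```python
import threading
import time


class TestAndSetLock:
    """P and V built from an (assumed) atomic Test&Set on one memory word."""

    def __init__(self):
        self._m = 0                        # 0 = free, 1 = busy
        self._atomic = threading.Lock()    # models hardware atomicity

    def test_and_set(self):
        """R <- M[m]; if R == 0 then M[m] <- 1. Returns R."""
        with self._atomic:
            r = self._m
            if r == 0:
                self._m = 1
            return r

    def P(self):
        while self.test_and_set() != 0:    # spin until the old value was 0
            time.sleep(0)                  # let other threads run

    def V(self):
        self._m = 0                        # an ordinary store releases the lock


def consume_all(items, n_consumers=4):
    """Several consumers remove items from one queue, one at a time."""
    lock = TestAndSetLock()
    queue = list(items)
    head = 0
    taken = []

    def consumer():
        nonlocal head
        while True:
            lock.P()
            if head == len(queue):         # nothing left to consume
                lock.V()
                return
            item = queue[head]
            head += 1
            lock.V()
            taken.append(item)             # process(R) outside the critical section

    threads = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken
```

If a consumer were suspended between `P()` and `V()`, every other consumer would spin until it resumed, which is the weakness the closing question points at.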



# Nonblocking Synchronization

Compare&Swap(m, $R_t$, $R_s$):  
*if* ($R_t$ == M[m])  
*then* M[m] = $R_s$; $R_s$ = $R_t$; status ← success;  
*else* status ← fail;

(status is an *implicit* argument)

try: Load($R_{head}$, head)  
spin: Load($R_{tail}$, tail)  
if $R_{head} == R_{tail}$ goto spin  
Load($R$, $R_{head}$)  
$R_{newhead} = R_{head} + 1$  
Compare&Swap(head, $R_{head}$, $R_{newhead}$)  
if (status == fail) goto try  
process($R$)
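The retry loop above can be sketched in Python. Here `Word.compare_and_swap` returns the status directly instead of using an implicit status register, a lock models the atomicity of the instruction, and the queue is assumed to be fully produced beforehand so that a consumer stops (rather than spins) when it is empty. All names are illustrative.

```python
import threading


class Word:
    """One memory word supporting an atomic Compare&Swap."""

    def __init__(self, value=0):
        self.value = value
        self._atomic = threading.Lock()    # models hardware atomicity

    def compare_and_swap(self, expected, new):
        """If M[m] == expected, store new and report success; else fail."""
        with self._atomic:
            if self.value == expected:
                self.value = new
                return True
            return False


def nonblocking_consume(items, n_consumers=4):
    queue = list(items)
    head = Word(0)
    tail = len(queue)                      # assumption: producer already finished
    taken = []

    def consumer():
        while True:
            r_head = head.value            # try: Load(R_head, head)
            if r_head == tail:             # empty queue: stop instead of spinning
                return
            r = queue[r_head]              # Load(R, R_head)
            # Claim the slot only if nobody advanced head in the meantime.
            if head.compare_and_swap(r_head, r_head + 1):
                taken.append(r)            # process(R)
            # On failure, fall through and retry from "try".

    threads = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken
```

No consumer ever holds a lock across the dequeue, so a consumer that is swapped out cannot block the others; it merely fails its Compare&Swap when it resumes.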



#### Load-reserve & Store-conditional

Special register(s) to hold reservation flag and address, and the outcome of store-conditional

Load-reserve(R, m):  
<flag, adr> $\leftarrow$ <1, m>;  
R $\leftarrow$ M[m];

Store-conditional(m, R):  
*if* <flag, adr> == <1, m>  
*then* cancel other procs' reservation on m; M[m] $\leftarrow$ R; status $\leftarrow$ succeed;  
*else* status $\leftarrow$ fail;

try: Load-reserve($R_{head}$, head)  
spin: Load($R_{tail}$, tail)  
if $R_{head} == R_{tail}$ goto spin  
Load(R, $R_{head}$)  
$R_{head} = R_{head} + 1$  
Store-conditional(head, $R_{head}$)  
if (status == fail) goto try  
process(R)
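The reservation mechanism can be modeled with a few lines of single-threaded Python, which makes the interleaving explicit. The `LRSCMemory` class, its per-processor reservation list and the `demo` scenario are assumptions made for illustration; real hardware typically tracks reservations per cache line rather than per word.

```python
class LRSCMemory:
    """Shared memory plus one <flag, adr> reservation register per processor."""

    def __init__(self, n_procs):
        self.words = {}
        self.reservation = [(0, None)] * n_procs

    def load_reserve(self, p, m):
        """<flag, adr> <- <1, m>; R <- M[m]. Returns R."""
        self.reservation[p] = (1, m)
        return self.words.get(m, 0)

    def store_conditional(self, p, m, value):
        """Store only if processor p still holds a reservation on m."""
        if self.reservation[p] != (1, m):
            return False                       # status <- fail
        # Cancel other processors' reservations on m.
        self.reservation = [
            (0, None) if (q != p and adr == m) else (flag, adr)
            for q, (flag, adr) in enumerate(self.reservation)
        ]
        self.words[m] = value
        return True                            # status <- succeed


def demo():
    """Two processors race to advance head; returns the observed statuses."""
    mem = LRSCMemory(n_procs=2)
    h0 = mem.load_reserve(0, "head")           # processor 0 reads head = 0
    h1 = mem.load_reserve(1, "head")           # processor 1 reads head = 0
    first = mem.store_conditional(1, "head", h1 + 1)   # processor 1 wins
    second = mem.store_conditional(0, "head", h0 + 1)  # processor 0 lost its reservation
    h0 = mem.load_reserve(0, "head")           # goto try
    third = mem.store_conditional(0, "head", h0 + 1)
    return first, second, third, mem.words["head"]
```

In this scenario the second Store-conditional fails because the first one cancelled the competing reservation, and the retry then succeeds, leaving `head` at 2.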



## Performance of Locks

Blocking atomic read-modify-write instructions, *e.g.*, Test&Set, Fetch&Add, Swap

vs

Non-blocking atomic read-modify-write instructions, *e.g.*, Compare&Swap, Load-reserve/Store-conditional

vs

Protocols based on ordinary Loads and Stores

Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores



#### Issues in Implementing Sequential Consistency



Implementation of SC is complicated by two issues:

- Out-of-order execution capability

| Instruction pair | Can be reordered? |
|--------------------|-------------------|
| Load(a); Load(b) | yes |
| Load(a); Store(b) | yes if a ≠ b |
| Store(a); Load(b) | yes if a ≠ b |
| Store(a); Store(b) | yes if a ≠ b |

- Caches

Caches can prevent the effect of a store from being seen by other processors



#### Memory Fences: Instructions to sequentialize memory accesses

Processors with *relaxed or weak memory models* (i.e., those that permit Loads and Stores to different addresses to be reordered) need to provide *memory fence* instructions to force the serialization of memory accesses

Examples of processors with relaxed memory models:

- Sparc V8 (TSO, PSO): Membar
- Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
- PowerPC (WO): Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required
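The effect of a fence can be illustrated with a toy store-buffer model in Python. Two processors each execute Store(own flag, 1) followed by Load(other flag). A store first enters a private buffer (event `S`) and only later drains to memory (event `D`); without a fence the Load (event `L`) may overtake the drain, and with a fence between the Store and the Load it may not. The event encoding and function names are assumptions made for this sketch, not a model of any particular processor.

```python
from itertools import product


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest


def outcomes(fence):
    """Return the set of (R1, R2) values the two Loads can observe."""
    # Per-processor event orders: with a fence the buffer drains before the Load.
    orders = [("S", "D", "L")] if fence else [("S", "D", "L"), ("S", "L", "D")]
    results = set()
    for o1, o2 in product(orders, repeat=2):
        t1 = [(1, e) for e in o1]
        t2 = [(2, e) for e in o2]
        for schedule in interleavings(t1, t2):
            mem = {1: 0, 2: 0}          # flag of processor 1, flag of processor 2
            regs = {}
            for proc, event in schedule:
                if event == "D":        # buffered store becomes globally visible
                    mem[proc] = 1
                elif event == "L":      # load the other processor's flag
                    regs[proc] = mem[3 - proc]
            results.add((regs[1], regs[2]))
    return results


print("no fence:", sorted(outcomes(fence=False)))
print("fence:   ", sorted(outcomes(fence=True)))
```

In this model the outcome (0, 0), where neither processor sees the other's store, is possible only without the fence; that outcome is exactly what breaks flag-based mutual exclusion protocols on relaxed hardware.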



# Using Memory Fences



#### Data-Race Free Programs a.k.a. Properly Synchronized Programs

| Process 1           | Process 2           |
|---------------------|---------------------|
| Acquire(mutex);     | Acquire(mutex);     |
| < critical section> | < critical section> |
| Release(mutex);     | Release(mutex);     |

Synchronization variables (e.g. mutex) are disjoint from data variables

Accesses to writable shared data variables are protected in critical regions

 $\Rightarrow$  no data races except for locks

(Formal definition is elusive)

In general, it cannot be proven whether a program is data-race free.




#### Fences in Data-Race Free Programs

| Process 1           | Process 2           |
|---------------------|---------------------|
| Acquire(mutex);     | Acquire(mutex);     |
| <i>membar;</i>      | <i>membar;</i>      |
| < critical section> | < critical section> |
| <i>membar;</i>      | <i>membar;</i>      |
| Release(mutex);     | Release(mutex);     |

- Relaxed memory model allows reordering of instructions by the compiler or the processor as long as the reordering is not done across a fence
- The processor also should not speculate or prefetch across fences






## Mutual Exclusion Using Load/Store

A protocol based on two shared variables c1 and c2. Initially, both c1 and c2 are 0 *(not busy)* 

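| Process 1                    | Process 2                    |
|------------------------------|------------------------------|
| ...                          | ...                          |
| c1=1;                        | c2=1;                        |
| L: *if* c2=1 *then go to* L  | L: *if* c1=1 *then go to* L  |
| < critical section>          | < critical section>          |
| c1=0;                        | c2=0;                        |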

What is wrong?

Deadlock!
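The deadlock can be found mechanically by exploring every reachable state of the two-process protocol under sequentially consistent interleaving. The following Python sketch is a tiny explicit-state search; the program-counter encoding and the function names are assumptions made for illustration, with processes numbered 0 and 1 instead of 1 and 2.

```python
def step(state, i):
    """Successor state when process i takes one step, or None if it cannot."""
    pc, c = list(state[0]), list(state[1])
    j = 1 - i
    if pc[i] == 0:            # c_i = 1
        c[i] = 1
    elif pc[i] == 1:          # L: if c_j = 1 then go to L
        if c[j] == 1:
            return None       # still spinning, no progress
    elif pc[i] == 2:          # leave the critical section: c_i = 0
        c[i] = 0
    else:
        return None           # process has finished
    pc[i] += 1
    return (tuple(pc), tuple(c))


def explore():
    """Return (deadlocked states, states with both processes in the section)."""
    start = ((0, 0), (0, 0))
    seen, frontier = {start}, [start]
    deadlocks, violations = [], []
    while frontier:
        s = frontier.pop()
        successors = [t for t in (step(s, 0), step(s, 1)) if t is not None]
        if s[0] == (2, 2):                     # pc == 2 means "in critical section"
            violations.append(s)
        if not successors and s[0] != (3, 3):  # stuck before both have finished
            deadlocks.append(s)
        for t in successors:
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return deadlocks, violations


print(explore())
```

The search reports a single stuck state, in which both flags are 1 and both processes spin at L, and no state with both processes in the critical section: the protocol is safe but can deadlock.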



#### Mutual Exclusion: second attempt

To avoid *deadlock*, let a process give up the reservation (i.e. Process 1 sets c1 to 0) while waiting.



- Deadlock is not possible, but with low probability a *livelock* may occur.
- An unlucky process may never get to enter the critical section ⇒ starvation




#### A Protocol for Mutual Exclusion (T. Dekker, 1966)

A protocol based on 3 shared variables c1, c2 and turn. Initially, both c1 and c2 are 0 *(not busy)* 



- turn = *i* ensures that only process *i* can wait
- variables c1 and c2 ensure mutual exclusion

Solution for n processes was given by Dijkstra and is quite tricky!



### Analysis of Dekker's Algorithm

Scenario 1 and Scenario 2 trace different interleavings of the same code:

| Process 1                                   | Process 2                                   |
|---------------------------------------------|---------------------------------------------|
| c1=1;                                       | c2=1;                                       |
| turn = 1;                                   | turn = 2;                                   |
| L: *if* c2=1 & turn=1 *then go to* L        | L: *if* c1=1 & turn=2 *then go to* L        |
| < critical section>                         | < critical section>                         |
| c1=0;                                       | c2=0;                                       |
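The analysis can be cross-checked by exhaustively exploring the three-variable protocol (set the flag, set turn, then wait while the other flag is set and turn is one's own) under sequentially consistent interleaving. This Python sketch numbers the processes 0 and 1, treats each statement as one atomic step, and tries both initial values of turn; these modeling choices and all names are assumptions made for illustration.

```python
def successors(state):
    """Yield every state reachable by letting one process take one step."""
    pcs, c, turn = state
    for i in (0, 1):
        j = 1 - i
        pc, flags, t = list(pcs), list(c), turn
        if pcs[i] == 0:          # c_i = 1
            flags[i] = 1
        elif pcs[i] == 1:        # turn = i
            t = i
        elif pcs[i] == 2:        # L: if c_j = 1 & turn = i then go to L
            if c[j] == 1 and turn == i:
                continue         # process i keeps waiting
        elif pcs[i] == 3:        # leave the critical section: c_i = 0
            flags[i] = 0
        else:
            continue             # process i has finished
        pc[i] += 1
        yield (tuple(pc), tuple(flags), t)


def check():
    """Return (number of reachable states, exclusion violations, stuck states)."""
    starts = [((0, 0), (0, 0), t) for t in (0, 1)]
    seen, frontier = set(starts), list(starts)
    both_critical, stuck = [], []
    while frontier:
        s = frontier.pop()
        nxt = list(successors(s))
        if s[0] == (3, 3):                 # pc == 3 means "in critical section"
            both_critical.append(s)
        if not nxt and s[0] != (4, 4):     # nobody can move, yet not both done
            stuck.append(s)
        for n in nxt:
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return len(seen), both_critical, stuck


print(check())
```

Within this model the search finds neither a state with both processes in the critical section nor a stuck state. The result relies on sequential consistency: the store-buffer reordering discussed earlier would invalidate it.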


#### N-process Mutual Exclusion: Lamport's Bakery Algorithm

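```
// Process i
// Initially num[j] = 0, for all j

// Entry Code
choosing[i] = 1;
num[i] = max(num[0], ..., num[N-1]) + 1;    // take the next ticket
choosing[i] = 0;
for (j = 0; j < N; j++) {
    while (choosing[j]);                    // wait while process j is picking a ticket
    while (num[j] &&                        // process j is competing, and
           ((num[j] < num[i]) ||            // holds a smaller ticket, or
            (num[j] == num[i] && j < i)));  // ties but has the lower index
}

// Exit Code
num[i] = 0;
```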
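A runnable approximation of the bakery algorithm can be written with Python threads. This sketch assumes CPython, where reads and writes of list elements by different threads behave as if sequentially consistent; on hardware with a relaxed memory model the same logic would need fences. The `run_bakery` harness and its counters are hypothetical and exist only to observe whether two threads are ever inside the critical section together.

```python
import threading
import time


def run_bakery(n=3, rounds=5):
    """Run n threads through the bakery lock; return (peak occupancy, counter)."""
    choosing = [0] * n
    num = [0] * n
    inside = 0
    peak = 0
    counter = 0

    def lock(i):
        choosing[i] = 1
        num[i] = max(num) + 1              # take the next ticket
        choosing[i] = 0
        for j in range(n):
            while choosing[j]:             # wait while j is picking a ticket
                time.sleep(0)              # yield so thread j can make progress
            # Wait while j competes with a smaller ticket, or an equal ticket
            # and a lower index (lexicographic comparison of the pairs).
            while num[j] and (num[j], j) < (num[i], i):
                time.sleep(0)

    def unlock(i):
        num[i] = 0

    def worker(i):
        nonlocal inside, peak, counter
        for _ in range(rounds):
            lock(i)
            inside += 1                    # critical section starts
            peak = max(peak, inside)
            counter += 1
            inside -= 1                    # critical section ends
            unlock(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak, counter
```

If mutual exclusion holds, the peak occupancy is 1 and the counter equals `n * rounds`, since the unprotected updates never overlap.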




## Next time

Effect of caches on Sequential Consistency






## Thank you!