



Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

> Based on the material prepared by Krste Asanovic and Arvind





## Supercomputers

Definition of a supercomputer:

- Fastest machine in world at given task
- A device to turn a compute-bound problem into an I/O bound problem
- Any machine costing \$30M+
- Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer



## Supercomputer Applications

Typical application areas

- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics
- Cryptography

All involve huge computations on large data sets

*In 70s-80s, Supercomputer = Vector Machine* 

## Loop Unrolled Code Schedule

Unrolled loop:

| Unrolled loop code |
|--------------------|
| loop: ld f1, 0(r1) |
| ld f2, 8(r1)       |
| ld f3, 16(r1)      |
| ld f4, 24(r1)      |
| add r1, 32         |
| fadd f5, f0, f1    |
| fadd f6, f0, f2    |
| fadd f7, f0, f3    |
| fadd f8, f0, f4    |
| sd f5, 0(r2)       |
| sd f6, 8(r2)       |
| sd f7, 16(r2)      |
| sd f8, 24(r2)      |
| add r2, 32         |
| bne r1, r3, loop   |

Schedule:

|       | Int1   | Int2 | M1    | M2 | FP+     | FPx |
|-------|--------|------|-------|----|---------|-----|
| loop: |        |      | ld f1 |    |         |     |
|       |        |      | ld f2 |    |         |     |
|       |        |      | ld f3 |    |         |     |
|       | add r1 |      | ld f4 |    | fadd f5 |     |
|       |        |      |       |    | fadd f6 |     |
|       |        |      |       |    | fadd f7 |     |
|       |        |      |       |    | fadd f8 |     |
|       |        |      | sd f5 |    |         |     |
|       |        |      | sd f6 |    |         |     |
|       |        |      | sd f7 |    |         |     |
|       | add r2 | bne  | sd f8 |    |         |     |



## Vector Supercomputers

Epitomized by Cray-1, 1976:

- Scalar Unit
  - Load/Store Architecture
- Vector Extension
  - Vector Registers
  - Vector Instructions
- Implementation
  - Hardwired Control
  - Highly Pipelined Functional Units
  - Interleaved Memory System
  - No Data Caches
  - No Virtual Memory



## Cray-1 (1976)

Joel Emer November 30, 2005 6.823, L22-6

#### Core unit of the Cray 1 computer

Image removed due to copyright restrictions.

To view image, visit http://www.craycyber.org/memory/scray.php.







## Vector Code Example

| # C code             | # Scalar Code    | # Vector Code     |
|----------------------|------------------|-------------------|
| for (i=0; i<64; i++) | LI R4, 64        | LI VLR, 64        |
| C[i] = A[i] + B[i];  | loop:            | LV V1, R1         |
|                      | L.D F0, 0(R1)    | LV V2, R2         |
|                      | L.D F2, 0(R2)    | ADDV.D V3, V1, V2 |
|                      | ADD.D F4, F2, F0 | SV V3, R3         |
|                      | S.D F4, 0(R3)    |                   |
|                      | DADDIU R1, 8     |                   |
|                      | DADDIU R2, 8     |                   |
|                      | DADDIU R3, 8     |                   |
|                      | DSUBIU R4, 1     |                   |
|                      | BNEZ R4, loop    |                   |



## Vector Instruction Set Advantages

- Compact
  - one short instruction encodes N operations
- Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
- Scalable
  - can run same code on more parallel pipelines (lanes)

## Vector Arithmetic Execution

- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)



V3 <- V1 \* V2







## Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency

- Bank busy time: minimum number of cycles between successive accesses to the same bank
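The effect of bank busy time on strided accesses can be sketched with a small model. This is a simplified illustration, not a description of the actual Cray-1 memory controller: it assumes word address `a` maps to bank `a % 16`, that one address issues per cycle when the target bank is free, and that an access to a busy bank stalls until the bank recovers. The function name `strided_issue_cycles` is hypothetical.

```c
#include <stddef.h>

#define NUM_BANKS 16 /* bank count from the slide */
#define BANK_BUSY 4  /* cycles a bank stays busy after an access */

/* Returns the number of cycles needed to issue n element addresses
 * with the given stride, under the assumed model: one address per
 * cycle, stalling whenever the target bank is still busy. */
unsigned long strided_issue_cycles(unsigned long base,
                                   unsigned long stride, size_t n)
{
    unsigned long bank_free[NUM_BANKS] = {0}; /* first cycle each bank is free */
    unsigned long cycle = 0;                  /* next cycle an address may issue */

    for (size_t i = 0; i < n; i++) {
        unsigned bank = (unsigned)((base + i * stride) % NUM_BANKS);
        if (cycle < bank_free[bank])
            cycle = bank_free[bank];          /* stall until the bank recovers */
        bank_free[bank] = cycle + BANK_BUSY;  /* bank is busy for BANK_BUSY cycles */
        cycle++;                              /* this address has issued */
    }
    return cycle;
}
```

Under this model a unit-stride access of 64 elements never revisits a bank within 4 cycles, while a stride of 16 sends every element to the same bank and pays the full bank busy time on each access.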















## Vector Chaining Advantage

- Without chaining, must wait for last element of result to be written before starting dependent instruction



- With chaining, can start dependent instruction as soon as first result appears
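A rough timing comparison for two dependent vector instructions can be sketched as follows. This is a deliberately simplified model: it assumes one element per cycle per functional unit, pipeline latencies `lat1` and `lat2`, and it ignores dead time and instruction issue overhead. The function names and the latency values in the usage note are illustrative assumptions, not figures from the lecture.

```c
/* Cycles until the last result of the second (dependent) instruction
 * is produced, for vectors of n elements. */

/* Without chaining: the second instruction starts only after the last
 * element of the first result has been written (lat1 + n cycles). */
unsigned unchained_cycles(unsigned n, unsigned lat1, unsigned lat2)
{
    return (lat1 + n) + (lat2 + n);
}

/* With chaining: the second instruction starts as soon as the first
 * result element appears (after lat1 cycles), so the two element
 * streams overlap. */
unsigned chained_cycles(unsigned n, unsigned lat1, unsigned lat2)
{
    return lat1 + lat2 + n;
}
```

For example, with n = 64 and assumed latencies of 12 and 6 cycles, the unchained pair takes 146 cycles in this model and the chained pair takes 82.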





## Vector Startup

Two components of vector startup penalty:

- functional unit latency (time through pipeline)
- dead time or recovery time (time before another vector instruction can start down pipeline)









## Vector Memory-Memory versus Vector Register Machines

- Vector memory-memory instructions hold all vector operands in main memory
- The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
- Cray-1 ('76) was first vector register machine





## Vector Memory-Memory vs. Vector Register Machines

- Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
- VMMAs make it difficult to overlap execution of multiple vector operations, why?
  - M
- VMMAs incur greater startup latency
  - Scalar code was faster on CDC Star-100 for vectors < 100 elements
  - For Cray-1, vector/scalar breakeven point was around 2 elements
- ⇒ Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures

(we ignore vector memory-memory from now on)







## Vector Scatter/Gather

Want to vectorize loops with indirect accesses:

```
for (i=0; i<N; i++)
    A[i] = B[i] + C[D[i]]
```

Indexed load instruction (Gather):

| Instruction       | Comment                      |
|-------------------|------------------------------|
| LV vD, rD         | # Load indices in D vector   |
| LVI vC, rC, vD    | # Load indirect from rC base |
| LV vB, rB         | # Load B vector              |
| ADDV.D vA, vB, vC | # Do add                     |
| SV vA, rA         | # Store result               |
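The element-wise meaning of the gather sequence can be sketched in C, one loop per vector instruction. This is a scalar model for illustration only; the function name `gather_add`, the temporary arrays, and the maximum vector length `MVL` of 64 are assumptions, and the vector length is clamped to `MVL`.

```c
#include <stddef.h>

#define MVL 64 /* assumed maximum vector length */

/* Scalar model of: LV vD; LVI vC; LV vB; ADDV.D vA; SV vA.
 * Computes A[i] = B[i] + C[D[i]] for one vector of vl elements. */
void gather_add(double *A, const double *B, const double *C,
                const size_t *D, size_t vl)
{
    size_t vD[MVL];
    double vC[MVL], vB[MVL], vA[MVL];

    if (vl > MVL)
        vl = MVL;                                       /* one vector at most */

    for (size_t i = 0; i < vl; i++) vD[i] = D[i];         /* LV  vD, rD      */
    for (size_t i = 0; i < vl; i++) vC[i] = C[vD[i]];     /* LVI vC, rC, vD  */
    for (size_t i = 0; i < vl; i++) vB[i] = B[i];         /* LV  vB, rB      */
    for (size_t i = 0; i < vl; i++) vA[i] = vB[i] + vC[i]; /* ADDV.D         */
    for (size_t i = 0; i < vl; i++) A[i] = vA[i];         /* SV  vA, rA      */
}
```

Only the indexed load touches memory in a data-dependent pattern; the other loads and the store remain unit-stride.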



## Vector Scatter/Gather

Scatter example:

```
for (i=0; i<N; i++)
    A[B[i]]++;
```

Is following a correct translation?
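The translation in question is not reproduced in these notes. Assuming it is the usual gather, increment, scatter sequence (gather `A[B[i]]` into a vector register, add 1 to every element, scatter back through the same indices), the following scalar model suggests why it can differ from the original loop: when `B` contains duplicate indices, every duplicate gathers the same old value, so the increments are not accumulated. The function names and `MVL` are hypothetical.

```c
#include <stddef.h>

#define MVL 64 /* assumed maximum vector length */

/* Reference semantics of the original loop. */
void scalar_increment(int *A, const size_t *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[B[i]]++;
}

/* Scalar model of an assumed gather / add / scatter translation
 * for one vector of vl elements (clamped to MVL). */
void gather_scatter_increment(int *A, const size_t *B, size_t vl)
{
    int vA[MVL];

    if (vl > MVL)
        vl = MVL;

    for (size_t i = 0; i < vl; i++) vA[i] = A[B[i]]; /* gather old values */
    for (size_t i = 0; i < vl; i++) vA[i] += 1;      /* vector add of 1   */
    for (size_t i = 0; i < vl; i++) A[B[i]] = vA[i]; /* scatter results   */
}
```

With `B = {1, 1, 2}` and `A` initially zero, the scalar loop leaves `A[1] == 2`, while the gather/scatter model leaves `A[1] == 1`.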



## Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:

```
for (i=0; i<N; i++)
    if (A[i]>0)
        A[i] = B[i];
```

Solution: Add vector mask (or flag) registers

- vector version of predicate registers, 1 bit per element

...and maskable vector instructions

- vector operation becomes NOP at elements where mask bit is clear

Code example:

| Instruction    | Comment                               |
|----------------|---------------------------------------|
| CVM            | # Turn on all elements                |
| LV vA, rA      | # Load entire A vector                |
| SGTVS.D vA, F0 | # Set bits in mask register where A>0 |
| LV vA, rB      | # Load B vector into A under mask     |
| SV vA, rA      | # Store A back to memory under mask   |
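The masked sequence can be modeled in C with one loop per vector instruction and an explicit mask array. This is a sketch under stated assumptions: the scalar register used in the compare is taken to hold 0.0, `MVL` is an assumed maximum vector length, and the function name `masked_copy_if_positive` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MVL 64 /* assumed maximum vector length */

/* Scalar model of the masked code: where A[i] > 0, set A[i] = B[i].
 * Operates on one vector of vl elements (clamped to MVL). */
void masked_copy_if_positive(double *A, const double *B, size_t vl)
{
    bool mask[MVL];
    double vA[MVL];

    if (vl > MVL)
        vl = MVL;

    for (size_t i = 0; i < vl; i++) mask[i] = true;        /* CVM: all on      */
    for (size_t i = 0; i < vl; i++) vA[i] = A[i];          /* LV vA, rA        */
    for (size_t i = 0; i < vl; i++) mask[i] = vA[i] > 0.0; /* set mask if A>0  */
    for (size_t i = 0; i < vl; i++)
        if (mask[i]) vA[i] = B[i];                         /* LV under mask    */
    for (size_t i = 0; i < vl; i++)
        if (mask[i]) A[i] = vA[i];                         /* SV under mask    */
}
```

Elements whose mask bit is clear are neither loaded from `B` nor stored back, so `A` keeps its old value there.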



## Masked Vector Instructions

### Simple Implementation

- execute all N operations, turn off result writeback according to mask

| Mask bit | Operand A | Operand B |
|----------|-----------|-----------|
| M[7]=1   | A[7]      | B[7]      |
| M[6]=0   | A[6]      | B[6]      |
| M[5]=1   | A[5]      | B[5]      |
| M[4]=1   | A[4]      | B[4]      |
| M[3]=0   | A[3]      | B[3]      |

Elements 2 and 1 (M[2]=0, M[1]=1) are in the pipeline; C[0] is at the write data port, with M[0]=0 driving Write Enable.

### Density-Time Implementation

- scan mask vector and only execute elements with non-zero masks





## Compress/Expand Operations

- Compress packs the elements whose mask bit is set from one vector register contiguously at the start of the destination vector register
  - population count of mask vector gives packed vector length
- Expand performs inverse operation

Used for density-time conditionals and also for general selection operations
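Compress and expand can be sketched as scalar loops over a mask. This assumes that the elements selected for packing are those whose mask bit is set; the function names `compress` and `expand` and their signatures are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Pack the elements of src whose mask bit is set into the start of dst.
 * Returns the packed length, which equals the population count of mask. */
size_t compress(double *dst, const double *src, const bool *mask, size_t vl)
{
    size_t k = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

/* Inverse: spread packed elements back to the positions whose mask bit
 * is set. Positions with a clear mask bit in dst are left unchanged. */
void expand(double *dst, const double *packed, const bool *mask, size_t vl)
{
    size_t k = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = packed[k++];
}
```

A density-time style conditional could compress the active elements, operate on the shorter packed vector, and expand the results back into place.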



## Vector Reductions

Problem: Loop-carried dependence on reduction variables

```
sum = 0;
for (i=0; i<N; i++)
    sum += A[i];  # Loop-carried dependence on sum
```

Solution: Re-associate operations if possible, use binary tree to perform reduction

```
# Rearrange as:
sum[0:VL-1] = 0                    # Vector of VL partial sums
for (i=0; i<N; i+=VL)              # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1];    # Vector sum
# Now have VL partial sums in one vector register
do {
    VL = VL/2;                     # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1]  # Halve no. of partials
} while (VL>1)
```
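The re-associated reduction can be written out in plain C as a sketch. It assumes a vector length `VL` of 8 (any power of two would do) and handles a final partial chunk when N is not a multiple of `VL`, which the pseudocode above leaves implicit. The function name `vector_sum` is hypothetical, and because floating-point addition is not associative the result may differ slightly from the sequential sum for non-integer data.

```c
#include <stddef.h>

#define VL 8 /* assumed vector length; must be a power of two */

/* Sum n elements using VL partial sums followed by a binary-tree
 * reduction of the partials. */
double vector_sum(const double *A, size_t n)
{
    double partial[VL] = {0};
    size_t i = 0;

    /* Stripmine: each pass models one vector add of VL elements. */
    for (; i + VL <= n; i += VL)
        for (size_t j = 0; j < VL; j++)
            partial[j] += A[i + j];

    /* Remainder chunk: a vector add with a shorter vector length. */
    for (size_t j = 0; i + j < n; j++)
        partial[j] += A[i + j];

    /* Tree reduction: halve the number of partial sums each step. */
    for (size_t vl = VL / 2; vl >= 1; vl /= 2)
        for (size_t j = 0; j < vl; j++)
            partial[j] += partial[j + vl];

    return partial[0];
}
```

The tree step takes log2(VL) vector adds instead of VL dependent scalar adds.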

## A Modern Vector Super: NEC SX-6 (2003)

- CMOS Technology
  - 500 MHz CPU, fits on single chip
  - SDRAM main memory (up to 64GB)
- Scalar unit
  - 4-way superscalar with out-of-order and speculative execution
  - 64KB I-cache and 64KB data cache
- Vector unit
  - 8 foreground VRegs + 64 background VRegs (256x64bit elements/VReg)
  - 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
  - 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
  - 1 load & store unit (32x8 byte accesses/cycle)
  - 32 GB/s memory bandwidth per processor
- SMP structure
  - 8 CPUs connected to memory through crossbar
  - 256 GB/s shared memory bandwidth (4096 interleaved banks)

Image removed due to copyright restrictions.

Image available in Kitagawa, K., S. Tagaya, Y. Hagihara, and Y. Kanoh. "A hardware overview of SX-6 and SX-7 supercomputer." *NEC Research & Development Journal* 44, no. 1 (Jan 2003):2-7.



## Multimedia Extensions

- Very short vectors added to existing ISAs for micros
- Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
- Newer designs have 128-bit registers (Altivec, SSE2)
- Limited instruction set:
  - no vector length control
  - no strided load/store or scatter/gather
  - unit-stride loads must be aligned to 64/128-bit boundary
- Limited vector register length:
  - requires superscalar dispatch to keep multiply/add/load units busy
  - loop unrolling to hide latencies increases register pressure
- Trend towards fuller vector support in microprocessors
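The idea of splitting a 64-bit register into 4x16b lanes can be illustrated in portable C. This is a software emulation for illustration only, not how any particular multimedia extension is implemented: real extensions perform the lane-wise add in hardware with a single instruction. The function name `add_4x16` is hypothetical, and each lane wraps around on overflow.

```c
#include <stdint.h>

/* Add four independent 16-bit lanes packed into 64-bit words.
 * Carries are prevented from crossing lane boundaries, so each lane
 * wraps modulo 2^16 on its own. */
uint64_t add_4x16(uint64_t a, uint64_t b)
{
    const uint64_t high = 0x8000800080008000ULL; /* top bit of each lane */

    /* Add the low 15 bits of every lane; any carry lands in bit 15 of
     * the same lane and cannot spill into the next one. */
    uint64_t low_sum = (a & ~high) + (b & ~high);

    /* Fold in the top bits with XOR, discarding the carry out of
     * each lane. */
    return low_sum ^ ((a ^ b) & high);
}
```

For instance, a lane holding 0xFFFF plus a lane holding 0x0001 yields 0x0000 in that lane without disturbing its neighbours.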