

## Prime Minister of Italy (1867-1869)



(G. Moore image) https://www.businesswire.com/news/home/20161102005463/en/Gordon-and-Betty-Moore-Foundation-Announces-Inaugural-Moore-Inventor-Fellows

(Projection figure from 1965) G. E. Moore, "Cramming more components onto integrated circuits" (1965) (Projection figure from 1975) G. E. Moore, "Progress in digital integrated electronics", Electron devices meeting (1975)

**Progress in Digital Integrated Electronics (1975)** 



## Moore's Law - Historical Data



Image credits/references:

(Moore's law) H. Ritchie and M. Roser, <u>https://ourworldindata.org/technological-change</u> (last-accessed 2023-01) (Rising costs) <u>https://www.fabricatedknowledge.com/p/the-rising-tide-of-semiconductor</u>

K. Flamm, "Measuring Moore's law: evidence from price, cost, and quality indexes", University of Chicago Press (2019) (Jensen Huang image) https://nvidianews.nvidia.com/bios/jensen-huang (Pat Gelsinger image) https://www.concordia.net/community/patrick-gelsinger/



# Dennard Scaling



References/Notes/Image credits:

(image source) https://www.ibm.com/blogs/think/2019/11/ibms-robert-h-dennard-and-the-chip-that-changed-the-world/ R. H. Dennard et al., "Design of ion-implanted MOSFET's with very small physical dimensions", IEEE Journal of solid-state circuits (1974) https://en.wikipedia.org/wiki/Dennard\_scaling

S. Borkar et al., "The future of microprocessors", Communications of the ACM (2011)

# The End Of Dennard Scaling

## Microprocessor trends



References/Image credits: (Image credit) <u>https://github.com/karlrupp/microprocessor-trend-data</u> Data up to 2010 collected/plotted by M. Horowitz et al., data from 2010-2019 by K, Rupp M. Bohr, "A 30 year retrospective on Dennard's MOSFET scaling paper", IEEE Solid-State Circuits Society Newsletter (2007)





# Amdahl's Law



**Reference:** 

G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities", Proceedings of the Spring Joint Computer Conference (1967) J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach", Chap. 1 (2017)



# Gustafson's Law



**Reference:** 

https://en.wikipedia.org/wiki/Gustafson's\_law

J. Gustafson, "Reevaluating Amdahl's law", Comms. of the ACM (1988) https://en.wikipedia.org/wiki/Parkinson's\_law

# Memory Models For Parallel Computing

Since the end of Dennard scaling ( $\approx$ 2005), multicore computers have become pervasive

| Shared              | Memory     |             |  |
|---------------------|------------|-------------|--|
| Key idea: any core  | e can dire | ctly access |  |
| any location in a s | hared add  | dress space |  |
| Typical hardware:   | phones     | laptops     |  |
| "Hard to build, ea  | sy to prog | gram"       |  |

References:

J. Hennessy and D. Patterson. "Computer Architecture: A Quantitative Approach", Chap. 5 (2017)

- Important design choice with multiple cores/multiple multiprocessors: how to organise memory

## **Distributed** Memory

- Key idea: each core sends message (over network)
- to access memory belonging to another core
- Typical hardware: compute clusters

"Easy to build, hard to program"



## Shared Memory Variants



References:

J. Hennessy and D. Patterson. "Computer Architecture: A Quantitative Approach", Chap. 5 (2017)



# Forms Of Parallelism

Data-level parallelism

Distribute data across processors

Processors perform the same task on

different subsets of data in parallel

We will focus on task-level parallelism

**References:** https://en.wikipedia.org/wiki/Data\_parallelism https://en.wikipedia.org/wiki/Task\_parallelism





with a

shared memory model

# Task-Parallel Platforms

## Implementing task-level parallelism with threads

Task parallelism can be implemented with threads ("virtual processors") that share memory However, this has proven difficult to program: Scheduling/load-balancing is a challenging job



Task-parallel platforms

Add abstraction layer on top of threads Programmer specifies which tasks can run in parallel (but not where they run) Platform manages scheduling, balancing etc.



# Fork-Join Parallelism



Most task-parallel platforms support fork-join Cilk which tasks must run in parallel



Fork-join



## Fork-join parallelism (1963)

References:

(M. Conway image) <u>http://www.melconway.com/Home/pdf/committees.pdf</u> M. Conway, "A multiprocessor system design", Fall Joint Computer Conference (1963) https://en.wikipedia.org/wiki/Fork-join\_model (CLRS) T. Cormen et al., "Introduction to algorithms", Chap 26, MIT press (2022) R. Blumofe et al., "Cilk: An efficient multithreaded runtime system", ACM SigPlan (1995) (OneTBB) <u>https://github.com/oneapi-src/oneTBB</u>





**OpenMP** 

- Spawn: "forks" executes function while caller continues to run in parallel
- Sync: "joins" waits for spawned threads to finish before proceeding
- Key concept: programmer only specifies which tasks can run in parallel, not
- Parallel sections can fork recursively until reaching a given task granularity



# An Example: Fibonacci

## Fibonacci Numbers

def fib(n): if n < 2: return n x = fib(n - 1)y = fib(n - 2)return x + y





References:

(CLRS) T. Cormen et al., "Introduction to algorithms", Chap 26, MIT press (2022)

# Parallel Code



spawn says main thread can execute in parallel with the spawned child, not that it must sync says parent must wait for all spawned children to finish (join) spawn/sync express the logical parallelism of the tasks It is the responsibility of the scheduler to assign the tasks to processors

**References:** 

(CLRS) T. Cormen et al., "Introduction to algorithms", Chap 26, MIT press (2022) R. Blumofe et al., "Cilk: An efficient multithreaded runtime system", ACM SigPlan (1995) (nogil python) https://nogil.dev/

# Computation DAG





References:

(CLRS) T. Cormen et al., "Introduction to algorithms", Chap 26, MIT press (2022) https://www.csd.uwo.ca/~mmorenom/cs3101\_Winter\_2015/Multithreaded\_Parallelism\_and\_Performance\_Measures.pdf

## Parallel Computation Analysis: Assumptions

## Assumptions for analysis

- 1. We have an ideal parallel computer:
- multiple processors
- sequentially consistent memory
- 2. Processors have equal computing power
- 3. No overhead for scheduling

References/image credits:

T. Cormen et al., "Introduction to algorithms", Chap 26, MIT press (2022)

(Sequential consistency) L. Lamport, "How to make a multiprocessor computer that correctly executes multiprocess programs", IEEE ToC (1979) (Image of L. Lamport) https://en.wikipedia.org/wiki/Leslie\_Lamport#/media/File:Leslie\_Lamport.jpg

Sequentially consistent (Lamport, 1979): Instruction execution preserves the partial ordering of DAG Attain via sequential processors; FIFO memory (processors communicate through memory)





# Work/Span Analysis

Let  $T_P$  denote runtime of program on P processors Work  $(T_1)$ : time to execute program on 1 processor Span  $(T_{\infty})$ : time to execute program on  $\infty$  processors The span is the sum of the runtimes of strands on the "critical path" - the longest path in the computation DAG

Work law:  $T_P \ge T_1/P$  (P processors can achieve at most P units of work per time step) Linear speedup:  $T_1/T_P = \Theta(P)$  Perfect linear Parallelism:  $T_1/T_{\infty}$ 

**References:** 

T. Cormen et al., "Introduction to algorithms", Chap 26, MIT press (2022) https://en.wikipedia.org/wiki/Analysis\_of\_parallel\_algorithms



- Span law:  $T_P \ge T_{\infty}$  (with  $\infty$  processors, we can emulate the P processors and leave rest idle)

**r speedup:** 
$$T_1/T_P = P$$
 **super-linear impossible in this m**

(maximum possible speed up with any number of processors)





## Reference:

T. Cormen et al., "Introduction to algorithms", Chap 26, MIT press (2022)

Parallel  

$$X$$

$$Y$$
Work:  $T_1(X \cup Y) = T_1(X) + T_1(Y)$ 
Span:  $T_{\infty}(X \cup Y) = \max(T_{\infty}(X), T_{\infty}(Y))$ 

$$T_{\infty}(n) \text{ for } par_fib(n):$$

$$(n) = \max(T_{\infty}(n-1), T_{\infty}(n-2)) + \Theta(1)$$

$$T_{\infty}(n-1) + \Theta(1) \qquad T_{\infty} = \Theta(n)$$

callelism: 
$$\frac{T_1(n)}{T_{\infty}(n)} = \Theta\left(\left(\frac{1+\sqrt{5}}{2}\right)^n/n\right)$$
 Grows fast we have the set of the set

