class: center, middle, title-slide-cs49365

## CSci 493.65 Parallel Computing
## Chapter 1: Introduction

.author[
Stewart Weiss
]
.license[
Copyright 2021-24 Stewart Weiss. Unless noted otherwise, all content is released under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/). Background image: George Washington Bridge at Dusk, by Stewart Weiss.
]

---
name: cc-notice
template: default
layout: true

.bottom-left[© Stewart Weiss. CC-BY-SA.]

---
name: tinted-slide
template: cc-notice
layout: true
class: tinted

---
name: objectives
### Objectives of This Course

This course has two primary goals:

- To introduce you to the major concepts and ideas of parallel computing, and
- To give you the basic ability to write simple parallel programs using MPI, OpenMP, and Pthreads.

--

So it has both theoretical and practical objectives. But we begin with the basics: what is parallel computing, and why bother with it?

---
name: intro
### Introduction

- People invented computers to solve problems faster than they could be solved by hand.

--

- Computers have gotten faster and faster, as predicted by Gordon Moore (1965, Moore's Law).

--

- But individual processors are not capable of solving the most significant computational problems, nor will they ever be, because of the problems' inherent computational complexity.

--

- Parallelism in computers came into existence in the 1970s.

---
name: parallelcomputing
### What is Parallel Computing?

.redbold[Parallel computing] is the use of multiple processors or computers working together on a common task.

- Each processor works on its section of the problem or data.
- Processors can exchange information.

---
### Sequential Versus Parallel Computing

- An ordinary sequential program is designed for .redbold[sequential] (serial) computation:

--

  - It runs on a single computer with a single processor.

--

  - A problem is decomposed into a sequence of discrete instructions.
  - Instructions are executed one after another.
  - Only one instruction can be executed at a time.

--

- A parallel program is designed for .redbold[parallel] computation:

--

  - It runs on multiple processors.
  - A problem is decomposed into discrete components that can be solved concurrently.
  - Each component is decomposed into a sequence of instructions.
  - Instructions from each component execute simultaneously on different processors.

---
### Why Do We Need It?

A single CPU has limits on

- performance (time and speed)
- available memory (problem size)

--

With parallel computing

- we can solve problems that are too large to fit on a single CPU
- and/or solve problems that can't be solved in a reasonable amount of time

--

We can

- solve larger problems (like more nodes in the Traveling Salesman Problem),
- solve the same problem in less time,
- solve problems with greater accuracy (like more accurate prediction or estimation), or
- solve many more cases of the same problem (different "what if" scenarios).

---
### Hard Problems

Some problems cannot be solved on ordinary computers; they need supercomputers, those with many, many, many processors in them.
Just a few examples:

- modeling computational fluid dynamics and turbulence
- materials design: finding new superconductors, immunological agents
- genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
- natural language understanding, automated reasoning
- forecasting severe weather events
- predicting global warming

---
### Definitions

- .redbold[Serial or sequential code] allows only a single thread of execution to work on a single data item at any one time.
- .redbold[Parallel code] can have multiple computations in progress at any instant of time. These could be
  - a single thread of execution operating on multiple data items simultaneously,
  - multiple serial threads of execution in a single executable,
  - multiple executables (think separate programs) all working on the same problem, or
  - any combination of the above.

---
### Definitions

- A .redbold[task] is the name we use for a single instance of an executable. Each task has its own virtual address space and may have multiple threads.
- A .redbold[parallel computer] is a computer containing more than one processor. Parallel computers can be categorized as either multicomputers or multiprocessors.
- A .redbold[multicomputer] is a computer that contains two or more computers connected by an interconnection network.
- A .redbold[centralized multiprocessor], also known as a symmetrical multiprocessor (SMP), is a computer in which the processors share access to a single, global memory.
- A .redbold[multi-core processor] is a particular type of multiprocessor in which the individual processors (called ".redbold[cores]") are in a single integrated circuit.

---
### Definitions

- A .redbold[node] is a discrete unit of a computer system that typically runs its own instance of the operating system.
- A .redbold[cluster] is a collection of machines or nodes that function in some way as a single resource.
- .redbold[Parallel programming] is programming in a language that allows one to explicitly indicate how the different parts of the computation can be executed concurrently by different processors.

---
### Data Dependence Graphs

A .redbold[data dependence graph] is a directed graph G = <V, E> in which each vertex represents a task to be completed, and an edge from vertex s to vertex t exists if and only if task s must be completed before task t can be started. When a task t cannot be started until s completes, we say t .redbold[depends] on s.

Example
.center[
]

So the task "execute" depends on the task "parse", but "parse" and "cleanup" do not depend on each other.
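---
### Data Dependence Graphs

The dependence relation is easy to represent in code. Here is a minimal sketch (not from the original notes; the task names and the `may_run_in_parallel` helper are made up for illustration) that encodes the example above as a matrix, where `depends_on[t][s]` is true exactly when task t depends on task s:

```C
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical tasks from the example above. */
enum { PARSE, EXECUTE, CLEANUP, NUM_TASKS };
static const char *name[NUM_TASKS] = { "parse", "execute", "cleanup" };

/* depends_on[t][s] is true iff task t cannot start until task s completes. */
static const bool depends_on[NUM_TASKS][NUM_TASKS] = {
    /*              parse  execute cleanup */
    /* parse   */ { false, false,  false },
    /* execute */ { true,  false,  false },  /* execute depends on parse   */
    /* cleanup */ { false, false,  false },  /* cleanup depends on nothing */
};

/* Two tasks may run in parallel if neither depends on the other.
   (A complete test would also follow transitive dependences.)     */
static bool may_run_in_parallel(int a, int b)
{
    return !depends_on[a][b] && !depends_on[b][a];
}

int main(void)
{
    printf("%s and %s in parallel? %s\n", name[PARSE], name[CLEANUP],
           may_run_in_parallel(PARSE, CLEANUP) ? "yes" : "no");
    printf("%s and %s in parallel? %s\n", name[PARSE], name[EXECUTE],
           may_run_in_parallel(PARSE, EXECUTE) ? "yes" : "no");
    return 0;
}
```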
---
### Data Dependence Graphs and Parallelism

When tasks do not depend on each other, they can be run in parallel. Data dependence graphs show the parallelism possible in a problem: the longest path through the graph is the longest sequence of tasks that must execute one after another, so it shows the limits of parallelization.

.center[
]

Here the path 7, 11, 10 has length 3, as do several other paths. But tasks 7, 5, and 3 can run in parallel, as can 11 and 8.
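---
### Data Dependence Graphs and Parallelism

To make the "longest path" idea concrete, here is a small sketch (mine, not from the notes) that computes the length of the longest chain of dependent tasks in a dependence graph whose tasks are numbered in topological order. The five-task graph below is a made-up example, not the one in the figure:

```C
#include <stdio.h>

#define NUM_TASKS 5

/* A made-up dependence graph, numbered so that every edge goes from a
   lower-numbered task to a higher-numbered one (topological order).
   dep[t][s] == 1 means task t depends on task s.                      */
static const int dep[NUM_TASKS][NUM_TASKS] = {
    {0,0,0,0,0},   /* task 0: no dependences          */
    {1,0,0,0,0},   /* task 1 depends on task 0        */
    {1,0,0,0,0},   /* task 2 depends on task 0        */
    {0,1,1,0,0},   /* task 3 depends on tasks 1 and 2 */
    {0,0,0,1,0},   /* task 4 depends on task 3        */
};

int main(void)
{
    int chain[NUM_TASKS]; /* longest chain of dependent tasks ending at t */
    int longest = 0;

    for (int t = 0; t < NUM_TASKS; t++) {
        chain[t] = 1;                      /* the task itself */
        for (int s = 0; s < t; s++)
            if (dep[t][s] && chain[s] + 1 > chain[t])
                chain[t] = chain[s] + 1;
        if (chain[t] > longest)
            longest = chain[t];
    }
    /* No schedule, no matter how many processors it uses, can finish
       in fewer than `longest` task steps.                             */
    printf("critical path length = %d tasks\n", longest);
    return 0;
}
```

No matter how many processors are available, this graph needs at least 4 task steps, because tasks 0, 1, 3, 4 must run one after another.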
---
### 1.5.2 Data Parallelism

A data dependence graph exhibits .redbold[data parallelism] when it contains instances of a task that applies the same sequence of operations to different elements of a data set. It can be

- fine-grained, as in

  ```C
  for ( i = 0; i < 4000; i++ )
      A[i]++;
  ```

  which increments each element of an array named `A`,

- or coarse-grained, as in

  ```C
  sort(A, 0, 999);
  sort(A, 1000, 1999);
  sort(A, 2000, 2999);
  sort(A, 3000, 3999);
  ```

  which applies a function named `sort` to multiple ranges of an array named `A`.
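---
### Data Parallelism

As a preview of OpenMP (covered later in the course), here is one way the fine-grained example above could be expressed in parallel. This is just a sketch: the `#pragma omp parallel for` directive asks the compiler to divide the independent loop iterations among threads. It can be compiled with `gcc -fopenmp`.

```C
#include <stdio.h>

#define N 4000

int A[N];

int main(void)
{
    /* Fine-grained data parallelism: the N increments are independent,
       so the iterations can be divided among the threads in any way.   */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i]++;

    printf("A[0] = %d, A[N-1] = %d\n", A[0], A[N-1]);
    return 0;
}
```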
---
### Functional Parallelism

A data dependence graph exhibits .redbold[functional parallelism] if it contains independent tasks that apply different operations to possibly different data elements.

.center[
]

In this dependence graph, the compiler has various tasks that could execute in parallel, so it exhibits functional parallelism.

---
### Data versus Functional Parallelism

- Partitioning the problem by task (i.e., functional parallelism):
  - Each process performs a different "function" or executes a different code section.
  - First identify the functions, then look at their data requirements.
  - Commonly programmed with message-passing libraries.
- Partitioning the problem by data (i.e., data parallelism):
  - Each process does the same work on a unique piece of data.
  - First divide the data. Each process then becomes responsible for whatever work is needed to process its data.
  - Data placement is an essential part of a data-parallel algorithm.
  - Usually more scalable than functional parallelism (it can handle larger and larger problem sizes).

The next slide sketches the task-partitioning (functional) approach in code.
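---
### Functional Parallelism

As a preview of Pthreads (also covered later in the course), here is a minimal sketch of the task-partitioning approach: two threads apply different operations (a sum and a product) to different data at the same time. The function and variable names are made up for illustration. It can be compiled with `gcc -pthread`.

```C
#include <stdio.h>
#include <pthread.h>

#define N 1000

/* Two unrelated pieces of work: different operations on different data. */
static double a[N], b[N];
static double sum, product;

static void *compute_sum(void *arg)
{
    (void)arg;                     /* unused */
    sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += a[i];
    return NULL;
}

static void *compute_product(void *arg)
{
    (void)arg;                     /* unused */
    product = 1.0;
    for (int i = 0; i < N; i++)
        product *= b[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 1.0; }

    pthread_t t1, t2;
    /* The two tasks do not depend on each other, so they may run
       at the same time on different processors.                   */
    pthread_create(&t1, NULL, compute_sum, NULL);
    pthread_create(&t2, NULL, compute_product, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("sum = %g, product = %g\n", sum, product);
    return 0;
}
```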
---
### Pipelining

Pipelining is a form of partial parallelism. When a sequence of instructions is executed many times with different data, it can be staged like an assembly line. Code like this:

```C
sum = 0;
for ( i = 0; i < 4; i++ )
    sum = sum + a[i];
```

lends itself to pipelining. The loop is unrolled into four instructions:

```
sum0 = a[0];
sum1 = sum0 + a[1];
sum2 = sum1 + a[2];
sum3 = sum2 + a[3];
```

which become like stages on an assembly line:

.center[
]

---
### Paths Towards Parallel Programming

How do we introduce parallel computation into sequential programming languages, or get programmers to write parallel code when their languages are sequential?

- Modifying compilers so that they detect parallelism and generate parallel code.
  - Compilers that do this are called .redbold[parallelizing compilers].
  - Typically 90% of execution time is spent in 10% of the code, so finding and parallelizing that code saves a lot of time.
  - Downsides: the programmer does not help much and can actually hinder the compiler, and the analysis is difficult when the code makes heavy use of pointers.

---
### Paths Towards Parallel Programming

- Providing libraries and/or preprocessors that extend the features of a sequential language to allow parallel computation.
  - MPI is an example of such a library. Programmers write code that uses the library, which is linked into the program. (A small example appears at the end of this chapter.)
  - OpenMP is an example of a modification of the compiler, with new preprocessor directives as well as libraries that allow the programmer to specify parallel code.

---
### Paths Towards Parallel Programming

- Creating new parallel languages.
  - Many new languages have been written to support some type of parallelism. They number in the dozens, perhaps more than a hundred.
  - Many are not useful for all purposes, and most are not widely supported across multiple hardware platforms.
  - Some popular ones: Clojure, Haskell, Parlog, C#, C*, Java, Python, Fortran M, Occam.
  - Wikipedia has a long list of them.
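---
### Paths Towards Parallel Programming

To give a flavor of the library approach mentioned above, here is the classic minimal MPI program (a preview; MPI is covered in detail later in the course). The same executable is started as several processes, and each one learns its rank by calling the library. It is compiled with the `mpicc` wrapper and launched with, for example, `mpirun -np 4 ./hello`.

```C
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime         */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes are there? */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut the MPI runtime down     */
    return 0;
}
```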