class: center, middle, title-slide-cs49365

## CSci 493.65/795.24 Parallel Computing
## Chapter 2: Parallel Architectures and Interconnection Networks

.author[
Stewart Weiss
]

.license[
Copyright 2021-24 Stewart Weiss. Unless noted otherwise all content is released under a
[Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
Background image: George Washington Bridge at Dusk, by Stewart Weiss.
]

---
name: cc-notice
template: default
layout: true

.bottom-left[© Stewart Weiss. CC-BY-SA.]

---
name: tinted-slide
template: cc-notice
layout: true
class: tinted

---
name: intro

### Introduction

Things that operate in parallel but work together must occasionally communicate to achieve their goals, such as

- computers working together
- processes working together
- people working together

When a system of parallel, discrete entities works together and communicates, those entities form a .redbold[network].

This chapter looks at the various ways in which the components of parallel systems can be connected.

---
name: motivation

### Example

Imagine three processes that work together. One reads a file containing numbers whose sum is to be calculated. It sends half the file to a second process and the remaining half to a third process. These two processes add their numbers independently and send their sums to the first process, which adds them and prints the result.

If we call them processes 1, 2, and 3 and represent them as nodes, their communication pattern looks like this:

.center[
*(figure: processes 1, 2, and 3, with edges from 1 to 2 and from 1 to 3)*
]

1 and 2 communicate, and 1 and 3 communicate, but not 2 and 3.
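---

### Example: A Sketch in Code

This communication pattern can be expressed with MPI, the message-passing library summarized at the end of this chapter. A minimal sketch, with an in-memory array standing in for the file and ranks 0, 1, and 2 playing processes 1, 2, and 3:

```c
#include <mpi.h>
#include <stdio.h>

#define N 8

/* Run with exactly three processes: mpirun -np 3 ./sum */
int main(int argc, char *argv[])
{
    int rank, nums[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* "process 1" */
        int sum1, sum2;
        MPI_Send(nums,       N/2, MPI_INT, 1, 0, MPI_COMM_WORLD); /* first half  */
        MPI_Send(nums + N/2, N/2, MPI_INT, 2, 0, MPI_COMM_WORLD); /* second half */
        MPI_Recv(&sum1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&sum2, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("total = %d\n", sum1 + sum2);
    } else {                             /* "processes 2 and 3" */
        int half[N/2], sum = 0;
        MPI_Recv(half, N/2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N/2; i++) sum += half[i];
        MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```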
---
name: topology

### Network Topology

The preceding example illustrates a network topology. A .redbold[network topology] formalizes the way in which a set of nodes is "connected".

- A network topology is essentially a discrete graph: a set of nodes connected by edges.
- Edges may or may not have direction. If information can flow back and forth, they are undirected, but if it flows in only a single direction, they are directed and are often drawn with arrows.
- Distance does not exist: edges do not have length.
- A network topology is a type of mathematical topological space.

---
name: topo-example

### Network Topology

Suppose that in the preceding example, process 2 sends half its data to processes 4 and 5, and process 3 sends half its data to 6 and 7. Each of 4, 5, 6, and 7 adds up its numbers independently. Processes 4 and 5 send their sums to 2, and 6 and 7 send theirs to 3. Then processes 2 and 3 send their sums to 1, which sums them and prints the result. The communication pattern looks like this:

.center[
*(figure: processes 1 through 7 arranged as a binary tree)*
]
---
name: topo-terminology

### Some Terminology

A .redbold[path] from a node S to a node T is a sequence of nodes S = N0, N1, N2, ..., Nk = T such that there is an edge from Ni to Ni+1 for each i.

- The .redbold[length] of the path is the number of edges. If S = T, the path has length 0.

The .redbold[distance] between two nodes S and T is the length of the __shortest path__ between them.

.center[
*(figure: a graph in which the shortest path from S to T has two edges)*
]

The distance from S to T is 2.

The .redbold[degree] of a node is the number of edges incident to that node, whether incoming, outgoing, or undirected.
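---

### Some Terminology: Computing Distance

Distance can be computed with breadth-first search, and the diameter defined on the next slide is just the maximum distance over all pairs of nodes. A minimal sketch, assuming a small undirected graph stored as an adjacency matrix:

```c
#include <stdio.h>
#include <string.h>

#define N 5  /* a small example graph: edges 0-1, 1-2, 1-3, 2-4, 3-4 */
static const int adj[N][N] = {
    {0,1,0,0,0}, {1,0,1,1,0}, {0,1,0,0,1}, {0,1,0,0,1}, {0,0,1,1,0}
};

/* Breadth-first search from s: returns the distance to t, or -1. */
int distance(int s, int t)
{
    int dist[N], queue[N], head = 0, tail = 0;
    memset(dist, -1, sizeof dist);          /* -1 marks "not yet visited" */
    dist[s] = 0;
    queue[tail++] = s;
    while (head < tail) {
        int u = queue[head++];
        if (u == t) return dist[u];
        for (int v = 0; v < N; v++)
            if (adj[u][v] && dist[v] == -1) {
                dist[v] = dist[u] + 1;      /* one edge farther than u */
                queue[tail++] = v;
            }
    }
    return -1;                              /* t is unreachable from s */
}

int main(void)
{
    printf("distance(0,4) = %d\n", distance(0, 4));  /* prints 3 */
    return 0;
}
```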
---
name: topo-properties

### Properties of a Topology

Properties of a topology affect the performance, scalability, cost, and feasibility of using it.

- .redbold[Size]: This is the number of nodes in the topology.
- .redbold[Diameter]: This is the maximum distance between any two nodes in the topology.
- .redbold[Bisection Width]: This is the minimum number of edges that must be removed to split the set of nodes into two sets of equal size, or, if the number of nodes is odd to start, into two sets whose sizes differ by 1.
- .redbold[Degree]: This is the maximum degree of any node in the topology.

Although in mathematical topologies edges do not have length, another property of topologies that describe physical networks, such as CPUs on a chip or board or computers connected together, is the maximum length of any edge among them.

---
name: topologies

### Specific Topologies

Certain regular topologies arise as a result of algorithms or design decisions that are made in parallel systems. Some that we study include:

- Binary tree
- Fully-connected (also called completely-connected)
- Mesh and torus
- Hypercube (also called a binary n-cube)
- Butterfly

---
name: bintrees

### Binary Tree Network Topology

In a binary tree network, the number of nodes, N, is equal to 2^k - 1 for some k, and these nodes are arranged in a complete binary tree of depth k-1.

- They have sizes 1, 3, 7, 15, 31, and so on.
- Rarely used as a physical network among cores.
- Mostly occurs when tasks communicate using certain types of algorithms, such as our earlier example.

---
name: fully-connected

### Fully-Connected Network Topology

Every node is connected to every other node.

- Rarely used to connect cores.
- Often represents possible connections between tasks or processes.

---
name: mesh

### Mesh Network Topology

A mesh network looks like a grid in two dimensions, or a wire mesh cube in three dimensions, but is harder to visualize when there are more than 3 dimensions. Formally, it is called a .redbold[lattice]. It does not have to have the same size in each dimension.

.center[
*(figure: two- and three-dimensional meshes)*
]
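---

### Mesh Network Topology: Node Numbering

Meshes map naturally onto linear process ranks. A minimal sketch, assuming row-major numbering of a p x q mesh; the `wrap` flag anticipates the torus on the next slide, where rows and columns wrap around:

```c
#include <stdio.h>

/* In a p x q mesh stored in row-major order, node (r,c) has rank r*q + c.
   With wrap = 1 the edges wrap around, turning the mesh into a torus. */
void neighbors(int r, int c, int p, int q, int wrap)
{
    const int dr[] = {-1, 1, 0, 0}, dc[] = {0, 0, -1, 1};
    for (int k = 0; k < 4; k++) {
        int nr = r + dr[k], nc = c + dc[k];
        if (wrap) {                                    /* torus: wrap around */
            nr = (nr + p) % p;
            nc = (nc + q) % q;
        } else if (nr < 0 || nr >= p || nc < 0 || nc >= q) {
            continue;                                  /* mesh: off the edge */
        }
        printf("(%d,%d) -> (%d,%d), rank %d\n", r, c, nr, nc, nr * q + nc);
    }
}

int main(void)
{
    neighbors(0, 0, 3, 4, 0);  /* corner of a 3x4 mesh: only 2 neighbors */
    neighbors(0, 0, 3, 4, 1);  /* same node on the 3x4 torus: 4 neighbors */
    return 0;
}
```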
---
name: torus

### Torus Network Topology

When a mesh has two dimensions, if we connect the nodes at the top and bottom by edges, and the nodes on the left and right by edges, we have a .redbold[torus]. It looks like a "doughnut".

---
name: hypercube

### Hypercube (Binary n-Cube)

A binary n-cube, also called a hypercube, is a network with N = 2^n nodes arranged as the vertices of an n-dimensional cube. A hypercube is simply a generalization of an ordinary cube.

In a 3D cube, there are N = 8 = 2^3 nodes, each connected to 3 nodes, one in each dimension. A square is a 2D cube. It has N = 4 = 2^2 nodes, each connected to 2 nodes, one in each dimension. In general, in an n-cube, each node is connected to n other nodes, one in each dimension.

- These arise very often in various parallel algorithms, where nodes represent tasks.

.left-column-small[
This is a 4-cube:
]
.right-column-large[
*(figure: a 4-cube)*
]
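---

### Hypercube Adjacency

One reason hypercubes arise so often in algorithms is that adjacency has a simple arithmetic form: label the nodes 0 through 2^n - 1, and two nodes are neighbors exactly when their binary labels differ in one bit. A minimal sketch:

```c
#include <stdio.h>

/* Print the n neighbors of a node in a binary n-cube.  Flipping bit d
   of a node's label crosses dimension d of the cube. */
void hypercube_neighbors(unsigned node, int n)
{
    for (int d = 0; d < n; d++)
        printf("node %u <-> node %u (dimension %d)\n",
               node, node ^ (1u << d), d);
}

int main(void)
{
    hypercube_neighbors(5, 4);  /* neighbors of node 0101 in a 4-cube:
                                   0100, 0111, 0001, 1101 */
    return 0;
}
```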
---
name: butterfly

### Butterfly Network Topology

A butterfly network topology consists of (k+1)2^k nodes arranged in k+1 ranks (i.e., rows), each containing n = 2^k nodes. k is called the order of the network. The ranks are labeled 0 through k. The columns are labeled 0 through 2^k - 1.

A butterfly of order 0 has 1 node. A butterfly of order 1 has 4 nodes - 2 rows with 2 nodes each. A butterfly of order 2 has 12 nodes - 3 rows with 4 nodes each.

---
name: butterfly-connections

### Butterfly Network Topology Connections

Each node is connected to the node above it and below it in its column. Nodes are also connected to nodes not in their columns.

- Let a xor b represent the bitwise exclusive-or of a and b. In each rank i, from 0 through k-1, each node [i,j] is connected to node [i+1, j xor 2^(k-i-1)].

It is easier to see how this works if you write the column numbers in binary using k bits. Consider k = 3. Node [1,3] is [1, 011]. 2^(k-i-1) = 2^(3-1-1) = 2 = 010. The bitwise xor of j = 3 = 011 with 010 is 001 = 1, so node [1, 011] has an edge to node [2, 001] = [2,1].

.center[
*(figure: a butterfly network)*
]

There is a way to "grow" butterfly networks recursively. See the lecture notes.
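---

### Butterfly Connections in Code

The connection rule above translates directly into a few lines of code. A minimal sketch that enumerates every edge of an order-k butterfly:

```c
#include <stdio.h>

/* Enumerate the edges of a butterfly network of order k.
   Node [i,j]: rank i (0..k), column j (0..2^k - 1). */
void butterfly_edges(int k)
{
    int n = 1 << k;                           /* n = 2^k columns */
    for (int i = 0; i < k; i++)               /* ranks 0 .. k-1 */
        for (int j = 0; j < n; j++) {
            /* straight edge, within the column */
            printf("[%d,%d] - [%d,%d]\n", i, j, i + 1, j);
            /* cross edge: [i,j] - [i+1, j xor 2^(k-i-1)] */
            printf("[%d,%d] - [%d,%d]\n", i, j, i + 1, j ^ (1 << (k - i - 1)));
        }
}

int main(void)
{
    butterfly_edges(2);  /* order 2: 3 ranks of 4 nodes, 16 edges */
    return 0;
}
```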
---
name: interconnection-networks

### Interconnection Networks

An .redbold[interconnection network] is a system of links that connects one or more devices to each other for the purpose of inter-device communication.

In the context of computer architecture, an interconnection network is used primarily to connect processors to processors, or to allow multiple processors to connect to one or more shared memory modules (as in SMPs). Sometimes they are used to connect processors with locally attached memories to each other.

The interconnection network has a significant effect on the cost, applicability, scalability, reliability, and performance of a parallel computer.

---
name: interconnection-network-props

### Interconnection Network Properties

An interconnection network may be classified as shared or switched.

A .redbold[shared network] can have at most one message on it at any time. For example, a .redbold[bus] is a shared network, as is traditional Ethernet.

A .redbold[switched network] allows point-to-point messages among pairs of nodes and therefore supports the transfer of multiple concurrent messages. It is a collection of interconnected switches. Switched Ethernet is a switched network.

.left-column[
*(figure: a shared network and a switched network)*
]
.right-column[
Shared networks are inferior to switched networks in terms of performance and scalability. In the example to the left, the switched network allows two messages to be in transit simultaneously.
]

---
name: interconnect-topologies

### Interconnection Network Topologies

Because interconnection networks can connect different types of devices, they have been used in different ways.

.left-column[
In one approach, each node is a switch, and exactly one device is connected to that switch. This is called a .redbold[direct topology]. The mesh to the right is a direct topology.
]
.right-column[
*(figure: a mesh used as a direct topology)*
]
.below-column[
The convention is that switches are drawn as circles, and devices, such as processors, computers, or memories, are drawn as squares.
]

---
name: interconnect-topologies-2

### Interconnection Network Topologies

Because interconnection networks can connect different types of devices, they have been used in different ways.

.left-column[
In another approach, called an .redbold[indirect topology], the number of switches is greater than the number of device nodes. The switches are used for routing messages from one processor or device to another. The binary tree to the right is used as a switching network in which the processors are connected only to the leaf nodes.
]
.right-column[
*(figure: a binary tree used as an indirect topology)*
]
---
name: interconnect-topologies-3

### Interconnection Network Topologies

- Meshes are almost always used as a direct topology, with a processor attached to each switch.
- Binary trees are always indirect topologies, acting as a switching network to connect a bank of processors to each other.
- Butterfly networks are always indirect topologies; the processors are connected to rank 0, and either memory modules or switches back to the processors are connected to the last rank.
- Hypercube networks are always direct topologies.

---
name: multiprocessors

### Multiprocessors

A .redbold[multiprocessor] is a computer with multiple CPUs and a .bluebold[shared address space].

- In a multiprocessor, the same address generated on two different CPUs refers to the same memory location.
- Changes made by one CPU to a memory location are seen by all other CPUs.
- Processes running on separate CPUs can access the same logical addresses.

Most multiprocessors are one of two types:

- those in which the shared memory is physically in one place, and
- those in which it is distributed among the processors.

Regardless of where physical memory is located, parallel programs running on multiprocessors can take advantage of shared memory, as the sketch on the next slide illustrates.
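---

### Multiprocessors: Shared Memory in Code

A minimal, illustrative Pthreads sketch of what a shared address space means: both threads, possibly running on different CPUs, update one variable through the same address:

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;   /* one variable, one address, shared by every thread */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *add_million(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* updates by one CPU are seen by all */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_million, NULL);
    pthread_create(&t2, NULL, add_million, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 2000000: same memory location */
    return 0;
}
```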
---
name: smp

### UMA Multiprocessors

When the shared memory is a physically central memory in one place, it is called any of:

- .redbold[centralized multiprocessor],
- .redbold[symmetric multiprocessor], or
- .redbold[Uniform Memory Access (UMA) multiprocessor]

.left-column[
Schematically, it looks like this:
*(figure: schematic of a UMA multiprocessor)*
]
.right-column[
In more detail, like this:
*(figure: a UMA multiprocessor in more detail)*
]
---
name: numa

### NUMA Multiprocessors

If a multiprocessor does not have a centralized memory, but instead has physical memory distributed across the separate CPUs, it is called either:

- .redbold[distributed multiprocessor], or a
- .redbold[Non-Uniform Memory Access (NUMA) multiprocessor]

The CPUs can still access a shared set of addresses, but how they do so is different than if the memory were centralized. Schematically, a NUMA multiprocessor looks like this:

.center[
*(figure: schematic of a NUMA multiprocessor)*
]

- Data that is in the memory attached to a CPU is .redbold[local] to that CPU.
- Accessing local data is much faster than accessing non-local data, because memory accesses that are not local must traverse the network that connects the CPUs.
- Coding efficiently for NUMA machines requires understanding where data might be located, to reduce the cost of non-local memory accesses (see the sketch on the next slide).
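---

### NUMA: An Example of Locality-Aware Coding

One common idiom is to rely on first-touch page placement, which many operating systems use on NUMA hardware: a page of memory is placed near the CPU that first writes to it. A hedged OpenMP sketch (the first-touch policy itself is an assumption about the OS):

```c
#include <stdlib.h>

#define N 100000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);

    /* Initialize in parallel: with first-touch placement, each page ends
       up in memory local to the thread (CPU) that touches it first. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Reuse the same static schedule, so each thread mostly accesses
       data that is local to its CPU. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```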
---
name: multicomputers

### Multicomputers

The important distinction between a multiprocessor and a multicomputer is that a multicomputer is a distributed-memory, multiple-CPU computer in which .bluebold[there is no shared address space among the separate CPUs].

- Each CPU has its own address space and can access only its own local memory, which is called .redbold[private memory].
- The same address on two different CPUs refers to two different memory locations.
- Processes running on multicomputers must share data through a message-passing interface; they cannot directly access the data used by other processes.

---
name: multicomputer-types

### Types of Multicomputers

A .redbold[commercial multicomputer] is a multicomputer designed, manufactured, and intended to be sold as a multicomputer. A .redbold[commodity cluster] is a multicomputer put together from off-the-shelf components.

A commercial multicomputer's interconnection network and processors are optimized to work with each other, providing low-latency, high-bandwidth connections between the computers, at a higher price tag than a commodity cluster. Commodity clusters generally have lower performance, with higher latency and lower bandwidth in the interprocessor connections.

---
name: asymmetric-multicomputers

### Asymmetrical Multicomputers

Some multicomputers, called .redbold[asymmetrical multicomputers], are designed with a special front-end computer and back-end computers. The front-end is the master and gateway for the machine, and the back-end CPUs are for computation. Users log in to the front-end, their jobs are run on the back-end processors, and all I/O takes place through the front-end machine.

.center[
*(figure: an asymmetrical multicomputer with a front-end and back-end hosts)*
]

- The front-end is a bottleneck.
- It does not scale well.
- There are other problems as well.
---
name: symmetric-multicomputers

### Symmetrical Multicomputers

A .redbold[symmetrical multicomputer] is one in which all of the hosts are identical and are connected to each other through an interconnection network. Users can log in to any host, and the file system and I/O devices are equally accessible from every host.

.center[
*(figure: a symmetrical multicomputer)*
]

- No front-end bottleneck, and highly scalable.
- Not a suitable environment for running large-scale parallel programs, because there is little control over the loads on the hosts.
---
name: flynn-taxonomy

### Flynn's Taxonomy

Flynn (1966) categorized parallel hardware based upon a classification scheme with two orthogonal parameters: the .redbold[instruction stream] and the .redbold[data stream].

- An instruction stream is a sequence of instructions that flows through a processor.
- A data stream is a sequence of data items that are operated on by a processor.

In .redbold[Flynn's Taxonomy], a machine is classified by whether it has a single or multiple instruction streams, and whether it has single or multiple data streams.

---
name: flynn-taxonomy-2

### Flynn's Taxonomy

There are four possibilities:

- .redbold[SISD]: single instruction, single data;
  - example: a conventional uni-processor.
- .redbold[SIMD]: single instruction, multiple data;
  - like the MMX or SSE instructions in the x86 processor series, processor arrays, and pipelined vector processors. SIMD multiprocessors issue a single instruction that operates on multiple data items simultaneously.
- .redbold[MISD]: multiple instruction, single data;
  - very rare, but one example is the U.S. Space Shuttle flight controller.
- .redbold[MIMD]: multiple instruction, multiple data; SMPs, clusters.
  - This is the most common multiprocessor.
  - MIMD multiprocessors are more complex and expensive.
  - Multiprocessors and multicomputers fall into this category.

---
name: programming-models

### Parallel Programming Models

Parallel programming models are an abstraction that is independent of the actual hardware. They describe how tasks (processes or threads) can interact with each other and share data. There are several different models:

- .redbold[Shared memory without threads] (processes arrange sharing of memory)
- .redbold[Shared memory with threading] (Pthreads, OpenMP)
- .redbold[Distributed memory / Message Passing] (MPI)
- .redbold[Data parallel hybrid]
  - combining the message-passing model (MPI) with the threading model (OpenMP)
  - using MPI with GPU programming
- .redbold[Single Program Multiple Data (SPMD)]
  - MPI uses this model
- .redbold[Multiple Program Multiple Data (MPMD)]
  - MPI also supports this model

---
name: summary

### Models Summarized

POSIX Threads (Pthreads)

- Library based; requires parallel coding
- C language only
- Ubiquitous, portable
- Explicit parallelism; significant programmer attention to detail

OpenMP

- Compiler-directive based; can use serial code
- Jointly defined by a consortium of major computer hardware and software vendors
- Portable / multi-platform, including Unix and Windows platforms
- Available in C/C++ and Fortran implementations
- Relatively easy to use - provides for "incremental parallelism"

---
name: summary-2

### Models Summarized

Distributed memory / Message Passing

- Tasks use their own local memory during computation.
- Multiple tasks can reside on the same physical host and/or across any number of hosts.
- Tasks exchange data by sending and receiving messages.
- Message Passing Interface (MPI) is the "de facto" industry standard for message passing. MPI implementations exist for virtually all popular parallel computing platforms.
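---

### Message Passing: A Minimal Example

To make the message-passing model concrete, a minimal MPI sketch in which process 0 sends one integer to process 1:

```c
#include <mpi.h>
#include <stdio.h>

/* Run with two processes: mpirun -np 2 ./msg */
int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);  /* arrived by message,
                                                   not through shared memory */
    }

    MPI_Finalize();
    return 0;
}
```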