class: center, middle, title-slide-cs49365

## CSci 493.65/795.24 Parallel Computing
## Chapter 2: Parallel Architectures and Interconnection Networks

.author[
Stewart Weiss
]

.license[
Copyright 2021-24 Stewart Weiss. Unless noted otherwise all content is released under a
[Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
Background image: George Washington Bridge at Dusk, by Stewart Weiss.
]

---
name: cc-notice
template: default
layout: true

.bottom-left[© Stewart Weiss. CC-BY-SA.]

---
name: tinted-slide
template: cc-notice
layout: true
class: tinted

---
name: intro

### Introduction

Things that operate in parallel but work together must occasionally communicate to achieve their goals, such as

- computers working together
- processes working together
- people working together

When a system of parallel, discrete entities works together and communicates, those entities form a .redbold[network].

This chapter looks at the various ways in which the components of parallel systems can be connected.

---
name: motivation

### Example

Imagine three processes that work together. One reads a file containing numbers whose sum is to be calculated. It sends half the file to a second process and the remaining half to a third process. These two processes add their numbers independently and send their sums to the first process, which adds them and prints the result.

If we call them processes 1, 2, and 3 and represent them as nodes, their communication pattern looks like this:

.center[
*(figure: processes 1, 2, and 3, with edges from 1 to 2 and from 1 to 3)*
]

1 and 2 communicate, and 1 and 3 communicate, but not 2 and 3.
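---

### Example: A Sketch in Code

This communication pattern can be expressed with MPI, the message-passing library summarized at the end of this chapter. A minimal sketch, with an in-memory array standing in for the file and ranks 0, 1, and 2 playing processes 1, 2, and 3:

```c
#include <mpi.h>
#include <stdio.h>

#define N 8

/* Run with exactly three processes: mpirun -np 3 ./sum */
int main(int argc, char *argv[])
{
    int rank, nums[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* "process 1" */
        int sum1, sum2;
        MPI_Send(nums,       N/2, MPI_INT, 1, 0, MPI_COMM_WORLD); /* first half  */
        MPI_Send(nums + N/2, N/2, MPI_INT, 2, 0, MPI_COMM_WORLD); /* second half */
        MPI_Recv(&sum1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&sum2, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("total = %d\n", sum1 + sum2);
    } else {                             /* "processes 2 and 3" */
        int half[N/2], sum = 0;
        MPI_Recv(half, N/2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N/2; i++) sum += half[i];
        MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```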
---
name: topology

### Network Topology

The preceding example illustrates a network topology. A .redbold[network topology] formalizes the way in which a set of nodes is "connected".

- A network topology is essentially a discrete graph: a set of nodes connected by edges.
- Edges may or may not have direction. If information can flow back and forth, they are undirected, but if it flows in only a single direction, they are directed and are often drawn with arrows.
- Distance does not exist: edges do not have length.
- A network topology is a type of mathematical topological space.

---
name: topo-example

### Network Topology

Suppose that in the preceding example, process 2 sends half its data to processes 4 and 5, and process 3 sends half its data to 6 and 7. Each of 4, 5, 6, and 7 adds up its numbers independently. Processes 4 and 5 send their sums to 2, and 6 and 7 send theirs to 3. Then processes 2 and 3 send their sums to 1, which sums them and prints the result. The communication pattern looks like this:

.center[
*(figure: processes 1 through 7 arranged as a binary tree)*
]
---
name: topo-terminology

### Some Terminology

A .redbold[path] from a node S to a node T is a sequence of nodes S = N0, N1, N2, ..., Nk = T such that there is an edge from Ni to Ni+1 for each i.

- The .redbold[length] of the path is the number of edges. If S = T, the path has length 0.

The .redbold[distance] between two nodes S and T is the length of the __shortest path__ between them.

.center[
*(figure: a graph in which the shortest path from S to T has two edges)*
]

The distance from S to T is 2.

The .redbold[degree] of a node is the number of edges incident to that node, whether incoming, outgoing, or undirected.
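---

### Some Terminology: Computing Distance

Distance can be computed with breadth-first search, and the diameter defined on the next slide is just the maximum distance over all pairs of nodes. A minimal sketch, assuming a small undirected graph stored as an adjacency matrix:

```c
#include <stdio.h>
#include <string.h>

#define N 5  /* a small example graph: edges 0-1, 1-2, 1-3, 2-4, 3-4 */
static const int adj[N][N] = {
    {0,1,0,0,0}, {1,0,1,1,0}, {0,1,0,0,1}, {0,1,0,0,1}, {0,0,1,1,0}
};

/* Breadth-first search from s: returns the distance to t, or -1. */
int distance(int s, int t)
{
    int dist[N], queue[N], head = 0, tail = 0;
    memset(dist, -1, sizeof dist);          /* -1 marks "not yet visited" */
    dist[s] = 0;
    queue[tail++] = s;
    while (head < tail) {
        int u = queue[head++];
        if (u == t) return dist[u];
        for (int v = 0; v < N; v++)
            if (adj[u][v] && dist[v] == -1) {
                dist[v] = dist[u] + 1;      /* one edge farther than u */
                queue[tail++] = v;
            }
    }
    return -1;                              /* t is unreachable from s */
}

int main(void)
{
    printf("distance(0,4) = %d\n", distance(0, 4));  /* prints 3 */
    return 0;
}
```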
---
name: topo-properties

### Properties of a Topology

Properties of a topology affect the performance, scalability, cost, and feasibility of using it.

- .redbold[Size]: This is the number of nodes in the topology.
- .redbold[Diameter]: This is the maximum distance between any two nodes in the topology.
- .redbold[Bisection Width]: This is the minimum number of edges that must be removed to split the set of nodes into two sets of equal size, or, if the number of nodes is odd to start, into two sets whose sizes differ by 1.
- .redbold[Degree]: This is the maximum degree of any node in the topology.

Although in mathematical topologies edges do not have length, another property of topologies that describe physical networks, such as CPUs on a chip or board or computers connected together, is the maximum length of any edge among them.

---
name: topologies

### Specific Topologies

Certain regular topologies arise as a result of algorithms or design decisions that are made in parallel systems. Some that we study include:

- Binary tree
- Fully-connected (also called completely-connected)
- Mesh and torus
- Hypercube (also called a binary n-cube)
- Butterfly

---
name: bintrees

### Binary Tree Network Topology

In a binary tree network, the number of nodes, N, is equal to 2^k - 1 for some k, and these nodes are arranged in a complete binary tree of depth k-1.

- They have sizes 1, 3, 7, 15, 31, and so on.
- Rarely used as a physical network among cores.
- Mostly occurs when tasks communicate using certain types of algorithms, such as our earlier example.

---
name: fully-connected

### Fully-Connected Network Topology

Every node is connected to every other node.

- Rarely used to connect cores.
- Often represents possible connections between tasks or processes.

---
name: mesh

### Mesh Network Topology

A mesh network looks like a grid in two dimensions, or a wire mesh cube in three dimensions, but is harder to visualize when there are more than 3 dimensions. Formally, it is called a .redbold[lattice]. It does not have to have the same size in each dimension.

.center[
*(figure: two- and three-dimensional meshes)*
]
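---

### Mesh Network Topology: Node Numbering

Meshes map naturally onto linear process ranks. A minimal sketch, assuming row-major numbering of a p x q mesh; the `wrap` flag anticipates the torus on the next slide, where rows and columns wrap around:

```c
#include <stdio.h>

/* In a p x q mesh stored in row-major order, node (r,c) has rank r*q + c.
   With wrap = 1 the edges wrap around, turning the mesh into a torus. */
void neighbors(int r, int c, int p, int q, int wrap)
{
    const int dr[] = {-1, 1, 0, 0}, dc[] = {0, 0, -1, 1};
    for (int k = 0; k < 4; k++) {
        int nr = r + dr[k], nc = c + dc[k];
        if (wrap) {                                    /* torus: wrap around */
            nr = (nr + p) % p;
            nc = (nc + q) % q;
        } else if (nr < 0 || nr >= p || nc < 0 || nc >= q) {
            continue;                                  /* mesh: off the edge */
        }
        printf("(%d,%d) -> (%d,%d), rank %d\n", r, c, nr, nc, nr * q + nc);
    }
}

int main(void)
{
    neighbors(0, 0, 3, 4, 0);  /* corner of a 3x4 mesh: only 2 neighbors */
    neighbors(0, 0, 3, 4, 1);  /* same node on the 3x4 torus: 4 neighbors */
    return 0;
}
```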
---
name: torus

### Torus Network Topology

When a mesh has two dimensions, if we connect the nodes at the top and bottom by edges, and the nodes on the left and right by edges, we have a .redbold[torus]. It looks like a "doughnut".

---
name: hypercube

### Hypercube (Binary n-Cube)

A binary n-cube, also called a hypercube, is a network with N = 2^n nodes arranged as the vertices of an n-dimensional cube. A hypercube is simply a generalization of an ordinary cube.

In a 3D cube, there are N = 8 = 2^3 nodes, each connected to 3 nodes, one in each dimension. A square is a 2D cube. It has N = 4 = 2^2 nodes, each connected to 2 nodes, one in each dimension. In general, in an n-cube, each node is connected to n other nodes, one in each dimension.

- These arise very often in various parallel algorithms, where nodes represent tasks.

.left-column-small[
This is a 4-cube:
]
.right-column-large[
*(figure: a 4-cube)*
]
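---

### Hypercube Adjacency

One reason hypercubes arise so often in algorithms is that adjacency has a simple arithmetic form: label the nodes 0 through 2^n - 1, and two nodes are neighbors exactly when their binary labels differ in one bit. A minimal sketch:

```c
#include <stdio.h>

/* Print the n neighbors of a node in a binary n-cube.  Flipping bit d
   of a node's label crosses dimension d of the cube. */
void hypercube_neighbors(unsigned node, int n)
{
    for (int d = 0; d < n; d++)
        printf("node %u <-> node %u (dimension %d)\n",
               node, node ^ (1u << d), d);
}

int main(void)
{
    hypercube_neighbors(5, 4);  /* neighbors of node 0101 in a 4-cube:
                                   0100, 0111, 0001, 1101 */
    return 0;
}
```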
---
name: butterfly

### Butterfly Network Topology

A butterfly network topology consists of (k+1)2^k nodes arranged in k+1 ranks (i.e., rows), each containing n = 2^k nodes. k is called the order of the network. The ranks are labeled 0 through k. The columns are labeled 0 through 2^k - 1.

A butterfly of order 0 has 1 node. A butterfly of order 1 has 4 nodes - 2 rows with 2 nodes each. A butterfly of order 2 has 12 nodes - 3 rows with 4 nodes each.

---
name: butterfly-connections

### Butterfly Network Topology Connections

Each node is connected to the node above it and below it in its column. Nodes are also connected to nodes not in their columns.

- Let a xor b represent the bitwise exclusive-or of a and b. In each rank i, from 0 through k-1, each node [i,j] is connected to node [i+1, j xor 2^(k-i-1)].

It is easier to see how this works if you write the column numbers in binary using k bits. Consider k = 3. Node [1,3] is [1, 011]. 2^(k-i-1) = 2^(3-1-1) = 2 = 010. The bitwise xor of j = 3 = 011 with 010 is 001 = 1, so node [1, 011] has an edge to node [2, 001] = [2,1].

.center[
*(figure: a butterfly network)*
]

There is a way to "grow" butterfly networks recursively. See the lecture notes.
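---

### Butterfly Connections in Code

The connection rule above translates directly into a few lines of code. A minimal sketch that enumerates every edge of an order-k butterfly:

```c
#include <stdio.h>

/* Enumerate the edges of a butterfly network of order k.
   Node [i,j]: rank i (0..k), column j (0..2^k - 1). */
void butterfly_edges(int k)
{
    int n = 1 << k;                           /* n = 2^k columns */
    for (int i = 0; i < k; i++)               /* ranks 0 .. k-1 */
        for (int j = 0; j < n; j++) {
            /* straight edge, within the column */
            printf("[%d,%d] - [%d,%d]\n", i, j, i + 1, j);
            /* cross edge: [i,j] - [i+1, j xor 2^(k-i-1)] */
            printf("[%d,%d] - [%d,%d]\n", i, j, i + 1, j ^ (1 << (k - i - 1)));
        }
}

int main(void)
{
    butterfly_edges(2);  /* order 2: 3 ranks of 4 nodes, 16 edges */
    return 0;
}
```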
---
name: interconnection-networks

### Interconnection Networks

An .redbold[interconnection network] is a system of links that connects one or more devices to each other for the purpose of inter-device communication.

In the context of computer architecture, an interconnection network is used primarily to connect processors to processors, or to allow multiple processors to connect to one or more shared memory modules (as in SMPs). Sometimes they are used to connect processors with locally attached memories to each other.

The interconnection network has a significant effect on the cost, applicability, scalability, reliability, and performance of a parallel computer.

---
name: interconnection-network-props

### Interconnection Network Properties

An interconnection network may be classified as shared or switched.

A .redbold[shared network] can have at most one message on it at any time. For example, a .redbold[bus] is a shared network, as is traditional Ethernet.

A .redbold[switched network] allows point-to-point messages among pairs of nodes and therefore supports the transfer of multiple concurrent messages. It is a collection of interconnected switches. Switched Ethernet is a switched network.

.left-column[
*(figure: a shared network and a switched network)*
]
.right-column[
Shared networks are inferior to switched networks in terms of performance and scalability. In the example to the left, the switched network allows two messages to be in transit simultaneously.
]

---
name: interconnect-topologies

### Interconnection Network Topologies

Because interconnection networks can connect different types of devices, they have been used in different ways.

.left-column[
In one approach, each node is a switch, and exactly one device is connected to that switch. This is called a .redbold[direct topology]. The mesh to the right is a direct topology.
]
.right-column[
*(figure: a mesh used as a direct topology)*
]
.below-column[
The convention is that switches are drawn as circles, and devices, such as processors, computers, or memories, are drawn as squares.
]

---
name: interconnect-topologies-2

### Interconnection Network Topologies

Because interconnection networks can connect different types of devices, they have been used in different ways.

.left-column[
In another approach, called an .redbold[indirect topology], the number of switches is greater than the number of device nodes. The switches are used for routing messages from one processor or device to another. The binary tree to the right is used as a switching network in which the processors are connected only to the leaf nodes.
]
.right-column[
*(figure: a binary tree used as an indirect topology)*
]
---
name: interconnect-topologies-3

### Interconnection Network Topologies

- Meshes are almost always used as a direct topology, with a processor attached to each switch.
- Binary trees are always indirect topologies, acting as a switching network to connect a bank of processors to each other.
- Butterfly networks are always indirect topologies; the processors are connected to rank 0, and either memory modules or switches back to the processors are connected to the last rank.
- Hypercube networks are always direct topologies.

---
name: multiprocessors

### Multiprocessors

A .redbold[multiprocessor] is a computer with multiple CPUs and a .bluebold[shared address space].

- In a multiprocessor, the same address generated on two different CPUs refers to the same memory location.
- Changes made by one CPU to a memory location are seen by all other CPUs.
- Processes running on separate CPUs can access the same logical addresses.

Most multiprocessors are one of two types:

- those in which the shared memory is physically in one place, and
- those in which it is distributed among the processors.

Regardless of where physical memory is located, parallel programs running on multiprocessors can take advantage of shared memory, as the sketch on the next slide illustrates.
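---

### Multiprocessors: Shared Memory in Code

A minimal, illustrative Pthreads sketch of what a shared address space means: both threads, possibly running on different CPUs, update one variable through the same address:

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;   /* one variable, one address, shared by every thread */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *add_million(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* updates by one CPU are seen by all */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_million, NULL);
    pthread_create(&t2, NULL, add_million, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 2000000: same memory location */
    return 0;
}
```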
---
name: smp

### UMA Multiprocessors

When the shared memory is a physically central memory in one place, it is called any of:

- .redbold[centralized multiprocessor],
- .redbold[symmetric multiprocessor], or
- .redbold[Uniform Memory Access (UMA) multiprocessor]

.left-column[
Schematically, it looks like this:
*(figure: schematic of a UMA multiprocessor)*
]
.right-column[
In more detail, like this:
*(figure: a UMA multiprocessor in more detail)*
]
---
name: numa

### NUMA Multiprocessors

If a multiprocessor does not have a centralized memory, but instead has physical memory distributed across the separate CPUs, it is called either:

- .redbold[distributed multiprocessor], or a
- .redbold[Non-Uniform Memory Access (NUMA) multiprocessor]

The CPUs can still access a shared set of addresses, but how they do so is different than if the memory were centralized. Schematically, a NUMA multiprocessor looks like this:

.center[
*(figure: schematic of a NUMA multiprocessor)*
]

- Data that is in the memory attached to a CPU is .redbold[local] to that CPU.
- Accessing local data is much faster than accessing non-local data, because memory accesses that are not local must traverse the network that connects the CPUs.
- Coding efficiently for NUMA machines requires understanding where data might be located, to reduce the cost of non-local memory accesses (see the sketch on the next slide).
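---

### NUMA: An Example of Locality-Aware Coding

One common idiom is to rely on first-touch page placement, which many operating systems use on NUMA hardware: a page of memory is placed near the CPU that first writes to it. A hedged OpenMP sketch (the first-touch policy itself is an assumption about the OS):

```c
#include <stdlib.h>

#define N 100000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);

    /* Initialize in parallel: with first-touch placement, each page ends
       up in memory local to the thread (CPU) that touches it first. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Reuse the same static schedule, so each thread mostly accesses
       data that is local to its CPU. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```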
---
name: multicomputers

### Multicomputers

The important distinction between a multiprocessor and a multicomputer is that a multicomputer is a distributed-memory, multiple-CPU computer in which .bluebold[there is no shared address space among the separate CPUs].

- Each CPU has its own address space and can access only its own local memory, which is called .redbold[private memory].
- The same address on two different CPUs refers to two different memory locations.
- Processes running on multicomputers must share data through a message-passing interface; they cannot directly access the data used by other processes.

---
name: multicomputer-types

### Types of Multicomputers

A .redbold[commercial multicomputer] is a multicomputer designed, manufactured, and intended to be sold as a multicomputer. A .redbold[commodity cluster] is a multicomputer put together from off-the-shelf components.

A commercial multicomputer's interconnection network and processors are optimized to work with each other, providing low-latency, high-bandwidth connections between the computers, at a higher price tag than a commodity cluster. Commodity clusters generally have lower performance, with higher latency and lower bandwidth in the interprocessor connections.

---
name: asymmetric-multicomputers

### Asymmetrical Multicomputers

Some multicomputers, called .redbold[asymmetrical multicomputers], are designed with a special front-end computer and back-end computers. The front-end is the master and gateway for the machine, and the back-end CPUs are for computation. Users log in to the front-end, their jobs are run on the back-end processors, and all I/O takes place through the front-end machine.

.center[
*(figure: an asymmetrical multicomputer with a front-end and back-end hosts)*
]

- The front-end is a bottleneck.
- It does not scale well.
- There are other problems as well.
---
name: symmetric-multicomputers

### Symmetrical Multicomputers

A .redbold[symmetrical multicomputer] is one in which all of the hosts are identical and are connected to each other through an interconnection network. Users can log in to any host, and the file system and I/O devices are equally accessible from every host.

.center[
*(figure: a symmetrical multicomputer)*
]

- No front-end bottleneck, and highly scalable.
- Not a suitable environment for running large-scale parallel programs, because there is little control over the loads on the hosts.
---
name: flynn-taxonomy

### Flynn's Taxonomy

Flynn (1966) categorized parallel hardware based upon a classification scheme with two orthogonal parameters: the .redbold[instruction stream] and the .redbold[data stream].

- An instruction stream is a sequence of instructions that flows through a processor.
- A data stream is a sequence of data items that are operated on by a processor.

In .redbold[Flynn's Taxonomy], a machine is classified by whether it has a single or multiple instruction streams, and whether it has single or multiple data streams.

---
name: flynn-taxonomy-2

### Flynn's Taxonomy

There are four possibilities:

- .redbold[SISD]: single instruction, single data;
  - example: a conventional uni-processor.
- .redbold[SIMD]: single instruction, multiple data;
  - like the MMX or SSE instructions in the x86 processor series, processor arrays, and pipelined vector processors. SIMD multiprocessors issue a single instruction that operates on multiple data items simultaneously.
- .redbold[MISD]: multiple instruction, single data;
  - very rare, but one example is the U.S. Space Shuttle flight controller.
- .redbold[MIMD]: multiple instruction, multiple data; SMPs, clusters.
  - This is the most common multiprocessor.
  - MIMD multiprocessors are more complex and expensive.
  - Multiprocessors and multicomputers fall into this category.

---
name: programming-models

### Parallel Programming Models

Parallel programming models are an abstraction that is independent of the actual hardware. They describe how tasks (processes or threads) can interact with each other and share data. There are several different models:

- .redbold[Shared memory without threads] (processes arrange sharing of memory)
- .redbold[Shared memory with threading] (Pthreads, OpenMP)
- .redbold[Distributed memory / Message Passing] (MPI)
- .redbold[Data parallel hybrid]
  - combining the message-passing model (MPI) with the threading model (OpenMP)
  - using MPI with GPU programming
- .redbold[Single Program Multiple Data (SPMD)]
  - MPI uses this model
- .redbold[Multiple Program Multiple Data (MPMD)]
  - MPI also supports this model

---
name: summary

### Models Summarized

POSIX Threads (Pthreads)

- Library based; requires parallel coding
- C language only
- Ubiquitous, portable
- Explicit parallelism; significant programmer attention to detail

OpenMP

- Compiler-directive based; can use serial code
- Jointly defined by a consortium of major computer hardware and software vendors
- Portable / multi-platform, including Unix and Windows platforms
- Available in C/C++ and Fortran implementations
- Relatively easy to use - provides for "incremental parallelism"

---
name: summary-2

### Models Summarized

Distributed memory / Message Passing

- Tasks use their own local memory during computation.
- Multiple tasks can reside on the same physical host and/or across any number of hosts.
- Tasks exchange data by sending and receiving messages.
- Message Passing Interface (MPI) is the "de facto" industry standard for message passing. MPI implementations exist for virtually all popular parallel computing platforms.
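---

### Message Passing: A Minimal Example

To make the message-passing model concrete, a minimal MPI sketch in which process 0 sends one integer to process 1:

```c
#include <mpi.h>
#include <stdio.h>

/* Run with two processes: mpirun -np 2 ./msg */
int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);  /* arrived by message,
                                                   not through shared memory */
    }

    MPI_Finalize();
    return 0;
}
```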