

Wednesday, June 26, 2019


Advanced Computer Architecture Pdf

Language:English, Spanish, Indonesian
Genre:Science & Research
Published (Last):19.02.2016


Instead, our presentation focuses on core concepts likely to be found in any new machine. The key criterion remains that of selecting ideas that have been examined and utilized successfully enough to permit their discussion in quantitative terms. Our intent has always been to focus on material that is not available in equivalent form from other sources, so we continue to emphasize advanced content wherever possible.

Indeed, there are several systems here whose descriptions cannot be found in the literature. Readers interested strictly in a more basic introduction to computer architecture should read Computer Organization and Design. Chapter 1 has been beefed up in this edition.

It includes formulas for static power, dynamic power, integrated circuit costs, reliability, and availability. We go into more depth than prior editions on the use of the geometric mean and the geometric standard deviation to capture the variability of the mean.
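As a rough illustration of the two statistics mentioned above, the following sketch computes a geometric mean and a geometric standard deviation over a list of performance ratios. The function names and the sample ratios are assumptions made for this example, and the population form of the standard deviation (dividing by n) is one common convention, not necessarily the one the book uses.

```python
import math


def geometric_mean(ratios):
    """Geometric mean: exponential of the arithmetic mean of the logs."""
    logs = [math.log(r) for r in ratios]
    return math.exp(sum(logs) / len(logs))


def geometric_stdev(ratios):
    """Geometric standard deviation: exponential of the (population)
    standard deviation of the logs. It is a multiplicative factor,
    so a value of 1.0 means no variability at all."""
    logs = [math.log(r) for r in ratios]
    mean_log = sum(logs) / len(logs)
    variance = sum((x - mean_log) ** 2 for x in logs) / len(logs)
    return math.exp(math.sqrt(variance))


if __name__ == "__main__":
    # Hypothetical benchmark ratios (reference time / measured time).
    sample = [1.0, 4.0, 16.0]
    print(geometric_mean(sample), geometric_stdev(sample))
```

Because both quantities are computed in the log domain, the result does not depend on which machine is chosen as the reference, which is the usual argument for preferring them when summarizing ratios.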

Our hope is that these topics can be used through the rest of the book. In addition to the classic quantitative principles of computer design and performance measurement, the benchmark section has been upgraded to use the new SPEC suite. Our view is that the instruction set architecture is playing less of a role today than in , so we moved this material to Appendix B. It still uses the MIPS64 architecture.

Chapters 2 and 3 cover the exploitation of instruction-level parallelism in high-performance processors, including superscalar execution, branch prediction, speculation, dynamic scheduling, and the relevant compiler technology. As mentioned earlier, Appendix A is a review of pipelining in case you need it.

Chapter 3 surveys the limits of ILP. New to this edition is a quantitative evaluation of multithreading. While the last edition contained a great deal on Itanium, we moved much of this material to Appendix G, indicating our view that this architecture has not lived up to the early claims.

Given the switch in the field from exploiting only ILP to an equal focus on thread- and data-level parallelism, we moved multiprocessor systems up to Chapter 4, which focuses on shared-memory architectures. The chapter begins with the performance of such an architecture.

It then explores symmetric and distributed-memory architectures, examining both organizational principles and performance. Topics in synchronization and memory consistency models are next. The example is the Sun T1 ("Niagara"), a radical design for a commercial product.

It reverted to a single-instruction issue, 6-stage pipeline microarchitecture. It put 8 of these on a single chip, and each supports 4 threads. Hence, software sees 32 threads on this single, low-power chip. As mentioned earlier, Appendix C contains an introductory review of cache principles, which is available in case you need it. This shift allows Chapter 5 to start with 11 advanced optimizations of caches. The chapter includes a new section on virtual machines, which offers advantages in protection, software management, and hardware management.

The example is the AMD Opteron, giving both its cache hierarchy and the virtual memory scheme for its recently expanded bit addresses. Chapter 6, "Storage Systems," has an expanded discussion of reliability and availability, a tutorial on RAID with a description of RAID 6 schemes, and rarely found failure statistics of real systems. Rather than go through a series of steps to build a hypothetical cluster as in the last edition, we evaluate the cost, performance, and reliability of a real cluster. This brings us to Appendices A through L.

As mentioned earlier, Appendices A and C are tutorials on basic pipelining and caching concepts. Readers relatively new to pipelining should read Appendix A before Chapters 2 and 3, and those new to caching should read Appendix C before Chapter 5.

Appendix E, on networks, has been extensively revised by Timothy M. Pinkston and Jose Duato. Appendix F, updated by Krste Asanovic, includes a description of vector processors. We think these two appendices are some of the best material we know of on each topic.

Appendix H describes parallel processing applications and coherence protocols for larger-scale, shared-memory multiprocessing.

Appendix I, by David Goldberg, describes computer arithmetic. Appendix K collects the "Historical Perspective and References" from each chapter of the third edition into a single appendix.

It attempts to give proper credit for the ideas in each chapter and a sense of the history surrounding the inventions. We like to think of this as presenting the human drama of computer design.

It also supplies references that the student of architecture may want to pursue. If you have time, we recommend reading some of the classic papers in the field that are mentioned in these sections. It is both enjoyable and educational. Appendix L available at textbooks. There is no single best order in which to approach these chapters and appendices, except that all readers should start with Chapter 1.

If you don't want to read everything, here are some suggested sequences:. Appendix D can be read at any time, but it might work best if read after the ISA and cache sequences. Appendix I can be read whenever arithmetic moves you. The material we have selected has been stretched upon a consistent framework that is followed in each chapter. We start by explaining the ideas of a chapter. These ideas are followed by a "Crosscutting Issues" section, a feature that shows how the ideas covered in one chapter interact with those given in other chapters.

This is followed by a "Putting It All Together" section that ties these ideas together by showing how they are used in a real machine. Next in the sequence is "Fallacies and Pitfalls," which lets readers learn from the mistakes of others. We show examples of common misunderstandings and architectural traps that are difficult to avoid even when you know they are lying in wait for you. The "Fallacies and Pitfalls" sections are among the most popular sections of the book.

Each chapter ends with a "Concluding Remarks" section. Each chapter ends with case studies and accompanying exercises. Authored by experts in industry and academia, the case studies explore key chapter concepts and verify understanding through increasingly challenging exercises. Instructors should find the case studies sufficiently detailed and robust to allow them to create their own additional exercises. We hope this helps readers to avoid exercises for which they haven't read the corresponding section, in addition to providing the source for review.

Note that we provide solutions to the case study. Exercises are rated, to give the reader a sense of the amount of time required to complete an exercise. A second set of alternative case study exercises is available for instructors who register at textbooks. This second set will be revised every summer, so that early every fall, instructors can download a new set of exercises and solutions to accompany the case studies in the book. Additional resources are available at textbooks.

The instructor site accessible to adopters who register at textbooks. New materials and links to other resources available on the Web will be added on a regular basis. Finally, it is possible to make money while reading this book. Talk about cost-performance! If you read the Acknowledgments that follow, you will see that we went to great lengths to correct mistakes.

Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any remaining resilient bugs, please contact the publisher by electronic mail ca4bugs mkp.

We process the bugs and send the checks about once a year or so, so please be patient. We welcome general comments to the text and invite you to send them to a separate email address at ca4comments mkp. Once again this book is a true co-authorship, with each of us writing half the chapters and an equal share of the appendices. We can't imagine how long it would have taken without someone else doing half the work, offering inspiration when the task seemed hopeless, providing the key insight to explain a difficult concept, supplying reviews over the weekend of chapters, and commiserating when the weight of our other obligations made it hard to pick up the pen.

These obligations have escalated exponentially with the number of editions, as one of us was President of Stanford and the other was President of the Association for Computing Machinery. Thus, once again we share equally the blame for what you are about to read. Although this is only the fourth edition of this book, we have actually created nine different versions of the text: Along the way, we have received help from hundreds of reviewers and users.

Each of these people has helped make this book better. Thus, we have chosen to list all of the people who have made contributions to some version of this book. Like prior editions, this is a community effort that involves scores of volunteers. Without their help, this edition would not be nearly as polished. Ziavras, New Jersey Institute of Technology. Kirischian, Ryerson University; Timothy M. Pinkston, University of Southern California. Andrea C. Wood, University of Wisconsin-Madison Chapter 4.

Finally, a special thanks once again to Mark Smotherman of Clemson University, who gave a final technical reading of our manuscript. Mark found numerous bugs and ambiguities, and the book is much cleaner as a result. This book could not have been published without a publisher, of course. For this fourth edition, we particularly want to thank Kimberlee Honjo, who coordinated surveys, focus groups, manuscript reviews and appendices, and Nate McFadden, who coordinated the development and review of the case studies.

Our warmest thanks to our editor, Denise Penrose, for her leadership in our continuing writing saga. We must also thank our university staff, Margaret Rowland and Cecilia Pracher, for countless express mailings, as well as for holding down the fort at Stanford and Berkeley while we worked on the book.

Our final thanks go to our wives for their suffering through increasingly early mornings of reading, thinking, and writing.

Rules of Thumb 1. Bandwidth Rule: Bandwidth grows by at least the square of the improvement in latency. Dependability Rule: Design with no single point of failure. In Praise of Computer Architecture: Colwell, Intel lead architect "Not only does the book provide an authoritative reference on the concepts that all computer architects should be familiar with, but it is also a good starting point for investigations into emerging areas in the field.

You don't need the 4th edition of Computer Architecture'' (Michael D. Smith, Harvard University). Hill, University of Wisconsin-Madison. Hennessy is the president of Stanford University, where he has been a member of the faculty since in the departments of electrical engineering and computer science. He has also received seven honorary doctorates. After completing the project in , he took a one-year leave from the university to cofound MIPS Computer Systems, which developed one of the first commercial RISC microprocessors.

After being acquired by Silicon Graphics in , MIPS Technologies became an independent company in , focusing on microprocessors for the embedded marketplace. As of,over million MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches.

Patterson has been teaching computer architecture at the University of California, Berkeley, since joining the faculty in , where he holds the Pardee Chair of Computer Science.

He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies. These projects earned three dissertation awards from the ACM.

His current research projects are the RAD Lab, which is inventing technology for reliable, adaptive, distributed Internet services, and the Research Accelerator for Multiple Processors (RAMP) project, which is developing and distributing low-cost, highly scalable, parallel computers based on FPGAs and open-source hardware and software. Hennessy Stanford University David A. Wood University of Wisconsin-Madison. All rights reserved. Published Fourth edition Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks.

Computer architecture: Hennessy, David A. Patterson ; with contributions by Andrea C. Patterson, David A. Arpaci-Dusseau, Andrea C. This latest edition expands the coverage of threading and multiprocessing, virtualization

Contents 2. Hwu and John W. Wood Chapter 5 Memory Hierarchy Design 5. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau Appendix A Pipelining: Basic and Intermediate Concepts A. Preface Why We Wrote This Book Through four editions of this book, our goal has been to describe the basic principles underlying what will be tomorrow's technological developments.

This Edition The fourth edition of Computer Architecture: As the first figure in the book documents, after 16 years of doubling performance every 18 months, single-processor performance improvement has dropped to modest annual improvements. There were many reasons for this change: Topic Selection and Organization As before, we have taken a conservative approach to topic selection, for there are many more interesting ideas in the field than can reasonably be covered in a treatment of basic principles.

Appendix D, updated by Thomas M. Conte, consolidates the embedded material in one place.

In Section 2. Famous topologies include the ring, tree, mesh, torus, hypercube, cube. Various communication patterns are demanded among the nodes, such as one-to-one. Important issues for multicomputers include message-routing schemes, network flow control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques. In Part IV, we will study the programming issues of three generations of multicomputers.

Representative Multicomputers Three message-passing multicomputers are summarized in Table 1. However, message passing imposes a hardship on programmers to distribute the computations and data sets over the nodes or to establish efficient communication among nodes. Until intelligent compilers and efficient distributed OSs become available, multicomputers will continue to lack programmability. Application drivers listed in the table, one entry per machine: general sparse matrix methods, parallel data manipulation; scientific number crunching with scalar nodes, database; scientific and academic applications.

The Intel i860 and some custom-designed VLSI processors are used as building blocks in these machines. Most multicomputers are being upgraded to yield a higher degree of parallelism with enhanced processors. We will study various massively parallel systems in Part III where the tradeoffs between scalability and programmability are analyzed. The SIMDs appeal more to special-purpose applications. Furthermore, the boundary between multiprocessors and multicomputers has become blurred in recent years. Eventually, the distinctions may vanish.

The architectural trend for future general-purpose computers is in favor of MIMD configurations with distributed memories having a globally shared virtual address space. He considers shared-memory multiprocessors as having a single address space. Scalable multiprocessors or multicomputers must use distributed shared memory. Unscalable multiprocessors use centrally shared memory. Multicomputers use distributed memories with multiple address spaces. They are scalable with distributed memory.

Centralized multicomputers are yet to appear. Many of the identified example systems will be treated in subsequent chapters. The evolution of fast LAN (local area network)-connected workstations will create "commodity supercomputing".

Bell advocates high-speed workstation clusters interconnected by high-speed switches in lieu of special-purpose multicomputers. The CM-5 development has already moved in this direction. We classify supercomputers either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it will be directly executed by the scalar processor using the scalar functional pipelines. If the instruction is decoded as a vector operation, it will be sent to the vector control unit.

This control unit will supervise the flow of vector data between the main memory and vector functional pipelines. The vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into a vector processor. Two pipeline vector supercomputer models are described below.
Vector Processor Models

Vector registers are used to hold the vector operands, intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers. All vector registers are programmable in user instructions. Each vector register is equipped with a component counter which keeps track of the component registers used in successive pipeline cycles. The length of each vector register is usually fixed, say, sixty-four bit component registers in a vector register in a Cray Series supercomputer.

Other machines, like the Fujitsu VP Series, use reconfigurable vector registers to dynamically match the register length with that of the vector operands. In general, there are fixed numbers of vector registers and functional pipelines in a vector processor. Therefore, both resources must be reserved in advance to avoid resource conflicts between different vector operations.

Several vector-register based supercomputers are summarized in Table 1. A memory-to-memory architecture differs from a register-to-register architecture in the use of a vector stream unit to replace the vector registers. Vector operands and results are directly retrieved from the main memory in superwords, say, bits as in the Cyber. Pipelined vector supercomputers started with uniprocessor models such as the Cray 1. Recent supercomputer systems offer both uniprocessor and multiprocessor models such as the Cray Y-MP Series.

Most high-end mainframes offer multiprocessor
Representative Supercomputers Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers. Pipeline chaining possible. C scalar optimization! The latest C3 Series is based on GaAs technology. The VAX processors use a hybrid architecture. The Cray Y-MP family offers both vector and multiprocessing capabilities. Siegel These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.

[Figure: memory modules Mem. 0 through Mem. N-1 attached to an interconnection network]
One can describe a particular SIMD machine architecture by specifying the 5-tuple.

An example SIMD machine is partially specified below: Listed below is a partial specification of the 5-tuple for this machine: The PEs receive instructions from the CU. Multiple PEs can be built on a single chip due to the simplicity of each PE. The 32 PEs are interconnected by an X-Net mesh, which is a 4-neighbor mesh augmented with diagonal dual-stage links. The CM-2 implements 16 PEs as a mesh on a single chip.

Each PE mesh chip is placed at one vertex of a dimensional hypercube. Globally, a large mesh 64 x 64 is formed by interconnecting these small meshes on chips.
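To make the hypercube placement concrete, here is a minimal sketch of how neighbor addresses and hop distances work in a d-dimensional binary hypercube. The dimension `d` is left as a parameter because the source does not state it here, and the function names are hypothetical.

```python
def hypercube_neighbors(node, d):
    """Return the d neighbors of `node` in a d-dimensional hypercube.

    Vertices are labeled 0 .. 2**d - 1; two vertices are adjacent
    exactly when their labels differ in one bit position."""
    if not 0 <= node < (1 << d):
        raise ValueError("node label out of range for this dimension")
    return [node ^ (1 << bit) for bit in range(d)]


def hypercube_distance(a, b):
    """Minimum number of hops between two vertices: the Hamming
    distance between their binary labels."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # A 3-cube has 8 vertices; vertex 5 (binary 101) has three neighbors.
    print(hypercube_neighbors(5, 3))
    print(hypercube_distance(0, 7))
```

The same bit-flip rule applies at any dimension, so the worst-case distance between two chips grows only as d, the logarithm of the number of vertices.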

Fortran 90 and modified versions of C, Lisp, and other synchronous programming languages have been developed to program SIMD machines. The ideal models provide a convenient framework for developing parallel algorithms without worry about the implementation details or physical constraints. The models can be applied to obtain theoretical performance bounds on parallel computers or to estimate VLSI complexity on chip area and execution time before the chip is fabricated.


The abstract models are also useful in scalability and programmability analysis, when real machines are compared with an idealized parallel machine without worrying about communication overhead among processing nodes.

We define first the time and space complexities. Computational tractability is reviewed for solving difficult problems on computers.

These complexity models facilitate the study of asymptotic behavior of algorithms implementable on parallel computers. Time and Space Complexities The complexity of an algorithm for solving a problem of size s on a computer is determined by the execution time and the storage space required. The time complexity is a function of the problem size. The time complexity function in order notation is the asymptotic time complexity of the algorithm.

Usually, the worst-case time complexity is considered. The asymptotic space complexity refers to the data storage of large problems. Note that the program code storage requirement and the storage for input data are not considered in this. The time complexity of a serial algorithm is simply called serial complexity. The time complexity of a parallel algorithm is called parallel complexity. Intuitively, the parallel complexity should be lower than the serial complexity, at least asymptotically.

We consider only deterministic algorithms, in which every operational step is uniquely defined in agreement with the way programs are executed on real computers. A nondeterministic algorithm contains operations resulting in one outcome in a set of possible outcomes. There exist no real computers that can execute nondeterministic algorithms. Therefore, all algorithms or machines considered in this book are deterministic, unless otherwise noted.

The set of problems having polynomial-complexity algorithms is called P-class (for polynomial class). The set of problems solvable by nondeterministic algorithms in polynomial time is called NP-class (for nondeterministic polynomial class). Since deterministic algorithms are special cases of the nondeterministic ones, we know that P ⊂ NP. The P-class problems are computationally tractable, while the NP − P-class problems are intractable. Whether P = NP or P ≠ NP is still an open problem in computer science.

To simulate a nondeterministic algorithm with a deterministic algorithm may require exponential time. Therefore, intractable NP-class problems are also said to have exponential-time complexity. Therefore, both problems belong to the P-class. Nonpolynomial algorithms have been developed for the traveling salesperson problem with complexity O(n^2 2^n) and for the knapsack problem with complexity O(2^(n/2)). So far, deterministic polynomial algorithms have not been found for these problems.
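One well-known exact method for the traveling salesperson problem with running time on the order of n^2 2^n is the Held-Karp dynamic program. The text does not say which algorithm it has in mind, so the sketch below is only an illustration of how such a bound can arise; the function name and the distance-matrix input format are assumptions.

```python
from itertools import combinations


def held_karp(dist):
    """Length of the shortest tour that starts and ends at city 0.

    dist[i][j] is the cost of travelling from city i to city j.
    There are about 2**n subsets, n possible end cities per subset,
    and n candidate predecessors per end city, which gives roughly
    n**2 * 2**n basic steps."""
    n = len(dist)
    if n == 1:
        return 0
    # best[(mask, j)]: cheapest path from city 0 that visits exactly the
    # cities whose bits are set in mask (cities 1..n-1) and ends at j.
    best = {(1 << j, j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev_mask = mask & ~(1 << j)
                best[(mask, j)] = min(
                    best[(prev_mask, k)] + dist[k][j]
                    for k in subset if k != j
                )
    full = (1 << n) - 2  # every city except city 0
    return min(best[(full, j)] + dist[j][0] for j in range(1, n))


if __name__ == "__main__":
    # Small symmetric instance, made up for illustration.
    cities = [
        [0, 10, 15, 20],
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0],
    ]
    print(held_karp(cities))
```

The table of partial tours is what makes the method exponential in space as well as time, which is why it is practical only for small n.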

Therefore, exponential-complexity problems belong to the NP-class. Thus NP-complete problems are considered the hardest ones to solve. Only approximation algorithms were derived for solving some of the NP-complete problems.
[Figure: the nondeterministic polynomial-time class, the P class, and the NP-complete class]
PRAM Models Conventional uniprocessor computers have been modeled as random-access machines (RAM) by Shepherdson and Sturgis. A parallel random-access machine (PRAM) model has been developed by Fortune and Wyllie for modeling idealized parallel computers with zero synchronization or memory access overhead.

This PRAM model will be used for parallel algorithm development and for scalability and complexity analysis. Each processor can access any memory location in unit time. The shared memory can be distributed among the processors or centralized in one place. The n processors [also called processing elements (PEs) by other authors] operate on a synchronized read-memory, compute, and write-memory cycle. With shared memory, the model must specify how concurrent read and concurrent write of memory are handled.

Four memory-update options are possible: Exclusive read (ER): This allows at most one processor to read from any memory location in each cycle, a rather restrictive policy. In order to avoid confusion, some policy must be set up to resolve the write conflicts.

Various combinations of the above options lead to several variants of the PRAM model as specified below. Since CR does not create a conflict problem, variants differ mainly in how they handle the CW conflicts.
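The sketch below shows one way to simulate the write phase of a single PRAM cycle under different conflict rules. The policy names used here ("exclusive", "common", "minimum", "sum") are assumptions chosen for illustration from commonly described PRAM variants; they are not taken from the text, which does not list its policies at this point.

```python
from collections import defaultdict


def resolve_writes(requests, policy):
    """Resolve all writes issued in one synchronized PRAM cycle.

    requests: list of (pe_id, address, value) tuples.
    Returns {address: value actually stored}.
    Raises ValueError when the chosen policy forbids the pattern."""
    by_address = defaultdict(list)
    for pe_id, address, value in requests:
        by_address[address].append((pe_id, value))

    stored = {}
    for address, writers in by_address.items():
        values = [value for _, value in writers]
        if policy == "exclusive":
            # Exclusive write: at most one writer per location.
            if len(writers) > 1:
                raise ValueError(f"write conflict at address {address}")
            stored[address] = values[0]
        elif policy == "common":
            # Concurrent writes allowed only if all values agree.
            if len(set(values)) != 1:
                raise ValueError(f"disagreeing writes at address {address}")
            stored[address] = values[0]
        elif policy == "minimum":
            # The writer with the lowest PE index wins.
            stored[address] = min(writers, key=lambda w: w[0])[1]
        elif policy == "sum":
            # Conflicting values are combined by addition.
            stored[address] = sum(values)
        else:
            raise ValueError(f"unknown policy {policy!r}")
    return stored


if __name__ == "__main__":
    cycle = [(2, 0, 7), (0, 0, 9), (1, 1, 5)]  # hypothetical requests
    print(resolve_writes(cycle, "minimum"))
    print(resolve_writes(cycle, "sum"))
```

Reads need no such arbitration, since several processors reading the same cell all see the same value, which is why the variants are distinguished mainly by their write rule.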

This is the most restrictive PRAM model proposed. Concurrent reads to the same memory location are allowed. The conflicting writes are resolved by one of the following four policies (Fortune and Wyllie). Assume n^3 PEs are available initially.

To visualize the algorithm, assume the memory is organized as a three-dimensional array with inputs A and B stored in two planes. Also, for sake of explanation, assume a three-dimensional indexing of the PEs.

In step 1, n product terms corresponding to each output are computed using n PEs in O(1) time. In step 2, these are added to produce an output in O(log n) time.

Listed below are programs for each PE(i,j,k) to execute. All n^3 PEs operate in parallel for n^3 multiplications.
Step 1:
1. Read A(i,k)
2. Read B(k,j)
3. Compute A(i,k) x B(k,j)
4. Store in C(i,j,k)
Step 2: Each PE is responsible for computing log n product terms and summing them up.
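The two steps can be mimicked sequentially to check the round count. This is a sketch under the assumption of one PE per (i,j,k) triple and a pairwise tree reduction along k; the loops stand in for PEs that would all act in the same cycle, and the function name is hypothetical.

```python
def pram_matmul(A, B):
    """Multiply two n x n matrices the way n**3 PRAM PEs would.

    Returns (product, rounds), where rounds is the number of
    synchronized addition rounds used in step 2."""
    n = len(A)
    # Step 1: PE(i,j,k) reads A(i,k) and B(k,j), multiplies them, and
    # stores the product in C(i,j,k). One parallel step on a PRAM.
    C = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]

    # Step 2: tree reduction along k. In each round, disjoint pairs that
    # are `stride` apart are added, so about log2(n) rounds suffice.
    rounds = 0
    stride = 1
    while stride < n:
        for i in range(n):
            for j in range(n):
                for k in range(0, n - stride, 2 * stride):
                    C[i][j][k] += C[i][j][k + stride]
        stride *= 2
        rounds += 1

    product = [[C[i][j][0] for j in range(n)] for i in range(n)]
    return product, rounds


if __name__ == "__main__":
    result, rounds = pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, rounds)
```

Within one round every addition touches a different pair of cells, so no two PEs write the same location and the reduction needs no concurrent-write support.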

Discrepancy with Physical Models PRAM models idealized parallel computers, in which all memory references and program executions by multiple processors are synchronized without extra cost. In reality, such parallel machines do not exist. However, PRAM allows different instructions to be executed on different processors simultaneously.

This particular model will be used in defining scalability in Chapter 3. For complexity analysis or performance comparison, various PRAM variants offer an ideal model of parallel computers.

Therefore, computer scientists use the PRAM model more often than computer engineers. The PRAM model will be used for scalability and performance studies in Chapter 3 as a theoretical reference machine. For regularly structured parallelism, the PRAM can model much better than practical machine models.

Therefore, sometimes PRAMs can indicate an upper bound on the performance of real parallel computers. Let s be the problem size involved in the computation.

The latency T is the time required from when inputs are applied until all outputs are produced for a single problem instance. Figure 1. The chip is represented by the base area in the two horizontal dimensions.

The vertical dimension corresponds to time. Therefore, the three-dimensional solid represents the history of the computation performed by the chip. Memory Bound on Chip Area A There are many computations which are memory-bound, due to the need to process large data sets. To implement this type of computation in silicon, one is limited by how densely information bit cells can be placed on the chip.

The amount of information processed by the chip can be visualized as information flow upward across the chip area. Each bit can flow through a unit area of the horizontal chip slice.

Thus, the chip area bounds the amount of memory bits stored on the chip. As information flows through the chip for a period of time T, the number of input bits cannot exceed the volume. The area A corresponds to data into and out of the entire surface of the silicon chip. The height T of the volume can be visualized as a number of snapshots on the chip, as computing time elapses.

The volume represents the amount of information flowing through the chip during the entire course of the computation. The bisection is represented by the vertical slice cutting across the shorter dimension of the chip area.

The distance of this dimension is at most √A for a square chip. The height of the cross section is T. The bisection area represents the maximum amount of information exchange between the two halves of the chip circuit during the time period T.

The cross-section area √A · T limits the communication bandwidth of a computation. Charles Seitz has given another interpretation of the AT^2 result. This implies that the cost of computation for a two-dimensional chip decreases with the execution time allowed.
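The three geometric bounds described above can be collected in one place. In this summary the symbols M (bits stored), I (bits input or output), and B (bits exchanged across the bisection), as well as the technology-dependent constants c_1, c_2, c_3, are notation introduced here for illustration; only A and T come from the text.

```latex
\begin{align*}
M   &\le c_1\, A            && \text{(bits stored on the chip area)}\\
I   &\le c_2\, A\, T        && \text{(bits flowing through the surface during time } T\text{)}\\
B   &\le c_3\, \sqrt{A}\, T && \text{(bits crossing the bisection during time } T\text{)}\\
B^2 &\le c_3^2\, A\, T^2    && \text{(squaring the bisection bound)}
\end{align*}
```

Read the last line for a computation that must move a fixed B bits across the bisection: the product AT^2 is bounded from below, so allowing twice the execution time permits roughly one quarter of the area, which is consistent with the remark that cost decreases with the execution time allowed.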

When three-dimensional multilayer silicon chips are used, Seitz asserted that the
The 2-D mesh architecture is shown in Fig. Inter-PE communication is done through the broadcast buses. Thus the total chip area needed is O(n^2) for an n x n mesh with broadcast buses.

Initially the input matrix elements A(i,j) and B(i,j) are stored in PE(i,j) with no duplicated data. The memory is distributed among all the PEs. Each PE can access only its own local memory. The following parallel algorithm shows how to perform the dot-product operations in generating all the output elements C(i,j). The Stanford Dash (Lenoski, Hennessy et al.). Cache coherence is enforced with distributed directories.

The Fujitsu VPP is a processor system with a crossbar interconnect. Following the Ultracomputer are two large-scale multiprocessors, both using multistage networks but with different interstage connections to be studied in Chapters 2 and 7. Among the systems listed in Fig. The rest are

Since then, Intel has produced a series of medium-grain hypercube computers the iPSCs. The nCUBE 2 also assumes a hypercube configuration.

The latest Intel system is the Paragon to be studied in Chapter 7. Detailed studies can be found in Chapter 8. Multivector Track These are traditional vector supercomputers. The CDC was the first vector dual-processor system.

Two subtracks were derived from the CDC The Cray and Japanese supercomputers all followed the register-to-register architecture. Cray 1 pioneered the multivector development in It is supposed to work as a back-end accelerator engine compatible with the existing Cray Y-MP Series. The other subtrack used memory-to-memory architecture in building vector supercomputers.

Since the production of both machines has been discontinued now, we list them here simply for completeness in tracking different supercomputer architectures.

Both tracks will be studied in Chapter 9. The following introduction covers only basic definitions and milestone systems built today. The conventional von Neumann machines are built with processors that execute a single context by each processor at a time.

In other words, each processor maintains a single thread of control with limited hardware resources. In a multithreaded architecture, each processor can execute multiple contexts at the same time. The term multithreading implies that there are multiple threads of control in each processor. Multithreading offers an effective mechanism for hiding long latency in building large-scale multiprocessors.

The latest multithreaded multiprocessor projects are the Tera computer (Alverson, Smith et al.) Until then, all multiprocessors studied use single-threaded processors as building blocks.

The Dataflow Track We will introduce the basic concepts of dataflow computers in Section 2. Some experimental dataflow systems are described in Section 9. The key idea is to use a dataflow mechanism, instead of a control-flow mechanism as in von Neumann machines, to direct the program flow.

Fine-grain, instruction-level parallelism is exploited in dataflow computers. As listed in Fig. The concept later inspired the development of "dynamic" dataflow computers. A series of tagged-token architectures was developed at MIT by Arvind and coworkers. We will describe the tagged-token architecture in Section 2.3. Another important subtrack of dynamic dataflow computers is represented by the Manchester machine (Gurd and Watson). We will study the EM5 (Sakai et al.)

These dataflow machines are still in the research stage. A recent collection of papers on architectural alternatives for exploiting parallelism can be found in the ... For a treatment of earlier parallel processing computers, readers are referred to [Hwang84] and Briggs. The layered classification of parallel computers was proposed in [Ni91]. Systolic array was introduced by [Kung78] and Leiserson. Multiprocessor issues were characterized by [Gajski85] and Pier and by [Dubois88], Scheurich, and Briggs.

Multicomputer technology was assessed by [Athas88] and Seitz. An introduction to existing parallel computers can be found in [Trew91] and Wilson. Key references of various computer systems, listed in Bell's taxonomy Fig. Additional references to these case-study machines can be found in later chapters. A collection of reviews, a bibliography, and indexes of resources in parallel systems can be found in [ACM91] with an introduction by Charles Seitz. SIMD machines were modeled by [Siegel79].

Readers are referred to the following journals and conference records for information on recent developments: Exercises Problem 1. Problem 1. Assume a one-cycle delay for each memory access. However, the speed of the memory subsystem remains unchanged, and consequently two clock cycles are needed per memory access. The program consists of four major types of instructions. The instruction mix and the number of cycles (CPI) needed for each instruction type are given below based on the result of a program trace experiment:

Assume that initially each location holds one input value. Explain how you would make the algorithm processor time optimal. The corresponding algorithm must be shown, similar to that in Example 1. The mapping is a one-to-one correspondence. Compiler techniques are needed to get around the control dependence in order to exploit more parallelism. Resource Dependence This is different from data or control dependence, which demands the independence of the work to be done. Resource dependence is concerned with the conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas, among parallel events.

If the conflicts involve workplace storage, we call it storage dependence. In the case of storage dependence, each task must work on independent storage locations or use protected access such as locks or monitors to be described in Chapter 11 to shared writable data.

The transformation of a sequentially coded program into a parallel executable form can be done manually by the programmer using explicit parallelism, or by a compiler detecting implicit parallelism automatically. In both approaches, the decomposition of programs is the primary objective. Program partitioning determines whether a given program can be partitioned or split into pieces that can execute in parallel or follow a certain prespecified order of execution.

Some programs are inherently sequential in nature and thus cannot be decomposed into parallel branches. The detection of parallelism in programs requires a check of the various dependence relations.


Bernstein's Conditions Bernstein revealed a set of conditions based on which two processes can execute in parallel. A process is a software entity corresponding to the abstraction of a program fragment defined at various processing levels. We define the input set I_i of a process P_i as the set of all input variables needed to execute the process. Similarly, the output set O_i consists of all output variables generated after execution of the process P_i. Input variables are essentially operands which can be fetched from memory or registers, and output variables are the results to be stored in working registers or memory locations.

Formally, these conditions are stated as follows: two processes P_1 and P_2 can execute in parallel if I_1 ∩ O_2 = ∅, I_2 ∩ O_1 = ∅, and O_1 ∩ O_2 = ∅. The program flow graph displays the patterns of simultaneously executable operations. Parallelism in a program varies during the execution period. It often limits the sustained performance of the processor. Example 2. There are eight instructions (four loads and four arithmetic operations) to be executed in three consecutive machine cycles.
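The three disjointness tests in Bernstein's conditions (neither process writes what the other reads, and the two output sets do not overlap) can be checked mechanically. The sketch below is illustrative only: the function name `can_run_in_parallel` and the three sample statements are assumptions made for the example, not taken from the source.

```python
def can_run_in_parallel(i1, o1, i2, o2):
    """Return True if two processes satisfy all three Bernstein conditions."""
    return (i1.isdisjoint(o2)       # P2 does not write anything P1 reads
            and i2.isdisjoint(o1)   # P1 does not write anything P2 reads
            and o1.isdisjoint(o2))  # the two never write the same variable


# Hypothetical statements, each treated as one process (input set, output set):
# P1: C = D * E      P2: M = G + C      P3: A = B + C
p1 = ({"D", "E"}, {"C"})
p2 = ({"G", "C"}, {"M"})
p3 = ({"B", "C"}, {"A"})

p1_with_p2 = can_run_in_parallel(*p1, *p2)  # P2 reads C, which P1 writes
p2_with_p3 = can_run_in_parallel(*p2, *p3)  # both only read C
```

Here P1 and P2 fail the test because P2 consumes the result of P1, while P2 and P3 pass because sharing an input variable violates none of the conditions.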

Therefore, the parallelism varies from 4 to 2 in three cycles. With this hardware restriction, the program must execute in seven machine cycles as shown in Fig. This demonstrates a mismatch between the software parallelism and the hardware parallelism. Let us try to match the software parallelism shown in Fig. Communication latency and scheduling issues are illustrated with programming examples. The simplest measure is to count the number of instructions in a grain (program segment).

Grain size determines the basic program segment chosen for parallel processing. Grain sizes are commonly described as fine, medium, or coarse. Latency is a time measure of the communication overhead incurred between machine subsystems. For example, the memory latency is the time required by a processor to access the memory.

The time required for two processes to synchronize with each other is called the synchronization latency. Computational granularity and communication latency are closely related. We reveal their relationship below. Parallelism has been exploited at various processing levels. The lower the level, the finer the granularity of the software processes.

In general, the execution of a program may involve a combination of these levels. The actual combination depends on the application, formulation, algorithm, language, program, compilation support, and hardware limitations. We characterize below the parallelism levels and review their implementation issues from the viewpoints of a programmer and of a compiler writer. Instruction Level At instruction or statement level.

Depending on individual programs, fine-grain parallelism at this level may range from two to thousands. Butler et al. Wall finds that the average parallelism at instruction level is around five, rarely exceeding seven, in an ordinary program. For scientific applications, Kumar has measured the average parallelism in the range of to Fortran statements executing concurrently in an idealized environment.

The advantage of fine-grain computation lies in the abundance of parallelism. The exploitation of fine-grain parallelism can be assisted by an optimizing compiler which should be able to automatically detect parallelism and translate the source code to a parallel form which can be recognized by the run-time system. Instruction-level parallelism is rather tedious for an ordinary programmer to detect in a source code.

Loop Level This corresponds to the iterative loop operations. A typical loop contains less than instructions. Some loop operations, if independent in successive iterations, can be vectorized for pipelined execution or for lock-step execution on SIMD machines. Reprinted from Hwang, Proc. Loop-level parallelism is the most optimized program construct to execute on a parallel or vector computer. However, recursive loops are rather difficult to parallelize. Vector processing is mostly exploited at the loop level (level 2 in Fig.

The loop level is still considered a fine grain of computation. Procedure Level This level corresponds to medium-grain size at the task, procedural, subroutine, and coroutine levels.

A typical grain at this level contains less than instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels.

Interprocedural dependence analysis is much more involved and history-sensitive. The communication requirement is often less compared with that required in MIMD execution mode. SPMD execution mode is a special case at this level. Multitasking also belongs in this category. Significant efforts by programmers may be needed to restructure a program at this level, and some compiler assistance is also needed. Subprogram Level This corresponds to the level of job steps and related subprograms.

The grain size may typically contain thousands of instructions. Subprograms can be scheduled for different processors in ... Multiprogramming on a uniprocessor or on a multiprocessor is conducted at this level. In the past, parallelism at this level has been exploited by algorithm designers or programmers, rather than by compilers.

We do not have good compilers for exploiting medium- or coarse-grain parallelism at present. Job (Program) Level This corresponds to the parallel execution of essentially independent jobs (programs) on a parallel computer. The grain size can be as high as tens of thousands of instructions in a single program.

For supercomputers with a small number of very powerful processors, such coarse-grain parallelism is practical. Job-level parallelism is handled by the program loader and by the operating system in general. Time-sharing or space-sharing multiprocessors exploit this level of parallelism.

In fact, both time and space sharing are extensions of multiprogramming. To summarize, fine-grain parallelism is often exploited at instruction or loop levels, preferably assisted by a parallelizing or vectorizing compiler. Medium-grain parallelism at the task or job step demands significant roles for the programmer as well as compilers.

Coarse-grain parallelism at the program level relies heavily on an effective OS and on the efficiency of the algorithm used.

Shared-variable communication is often used to support fine-grain and medium-grain computations. Message-passing multicomputers have been used for medium- and coarse-grain computations. In general, the finer the grain size, the higher the potential for parallelism and the higher the communication and scheduling overhead.

Fine grain provides a higher degree of parallelism, but heavier communication overhead, as compared with coarse-grain computations. Communication Latency By balancing granularity and latency, one can achieve better performance of a computer system. Various latencies are attributed to machine architecture, implementing technology, and communication patterns involved. The architecture and technology affect the design choices for latency tolerance between subsystems.

In fact, latency imposes a limiting factor on the scalability of the machine size. For example, memory latency increases with respect to memory capacity.

Thus memory cannot be increased indefinitely without exceeding the tolerance level of the access latency. Various latency hiding or tolerating techniques will be studied in Chapter 9. The latency incurred with interprocessor communication is another important parameter for a system designer to minimize. Besides signal delays in the data path, IPC latency is also affected by the communication patterns involved.

Thus the complexity grows quadratically. This leads to a communication bound which limits the number of processors allowed in a large computer system. Communication patterns are determined by the algorithms used as well as by the architectural support provided.

Frequently encountered patterns include permutations and broadcast, multicast, and conference many-to-many communications.

The communication demand may limit the granularity or parallelism. Very often tradeoffs do exist between the two. We will study techniques that minimize communication latency, prevent deadlock, and optimize grain size throughout the book.

This grain-size problem demands determination of both the number and the size of grains or microtasks in a parallel program. Of course, the solution is both problem-dependent and machine-dependent.

The goal is to produce a short schedule for fast execution of subdivided program modules. The time complexity involves both computation and communication overheads. The program partitioning involves the algorithm designer, programmer, compiler, operating system support, etc.

We describe below a grain packing approach introduced by Kruatrachue and Lewis for parallel programming applications. In Fig. A program graph shows the structure of a program. It is very similar to the dependence graph introduced in Section 2.

Each node in the program graph corresponds to a computational unit in the program. The grain size is measured by the number of basic machine cycles (including both processor and memory cycles) needed to execute all the operations within the node. We denote each node in Fig. by a pair (n, s), where n is the node name (id) and s is the grain size of the node. Thus grain size reflects the number of computations involved in a program segment.

Fine-grain nodes have a smaller grain size, and coarse-grain nodes have a larger grain size. The edge label (v, d) between two end nodes specifies the output variable v from the source node or the input variable to the destination node, and the communication delay d between them.

This delay includes all the path delays and memory latency involved. There are 17 nodes in the fine-grain program graph Fig. The coarse-grain node is obtained by combining (grouping) multiple fine-grain nodes.

The fine grain corresponds to the following program: Each takes one cycle to address and six cycles to fetch from memory. All remaining nodes (7 to 17) are CPU operations, each requiring two cycles to complete. After packing, the coarse-grain nodes have larger grain sizes ranging from 4 to 8 as shown. The node (A, 8) in Fig. Then one combines (packs) multiple fine-grain nodes into a coarse-grain node if it can eliminate unnecessary communication delays or reduce the overall scheduling overhead.

Usually, all fine-grain operations within a single coarse-grain node are assigned to the same processor for execution. Fine-grain partition of a program often demands more interprocessor communication than that required in a coarse-grain partition.

Internal delays among fine-grain operations within the same coarse-grain node are negligible because the communication delay is contributed mainly by interprocessor delays rather than by delays within the same processor. The choice of the optimal grain size is meant to achieve the shortest schedule for the nodes on a parallel computer system. The fine-grain schedule is longer (42 time units) because more communication delays were included as shown by the shaded area.

The coarse-grain schedule is shorter (38 time units) because communication delays among nodes 12, 13, and 14 within the same node D and also the delays among 15, 16, and 17 within the node E are eliminated after grain packing. In general, dynamic multiprocessor scheduling is an NP-hard problem. Very often heuristics are used to yield suboptimal solutions. We introduce below the basic concepts behind multiprocessor scheduling using static schemes. Node Duplication In order to eliminate the idle time and to further reduce the communication delays among processors, one can duplicate some of the nodes in more than one processor.

Figure 2. This schedule contains idle time as well as long interprocessor delays (8 units) between P1 and P2. The new schedule shown in Fig. The reduction in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors. Four major steps are involved in the grain determination and the process of scheduling optimization:

Step 1. Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.

The purpose of multiprocessor scheduling is to obtain a minimal time schedule for the computations involved. The following example clarifies this concept. In this example, two 2 x 2 matrices A and B are multiplied to compute the sum of the four elements in the resulting product matrix C = A x B.

There are eight multiplications and seven additions to be performed in this program, as written below: Note that the communication delays have slowed down the parallel execution significantly, resulting in many processors idling (indicated by I), except for P1 which produces the final sum. Next we show how to use grain packing (Step 3) to reduce the communication overhead.
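The fine-grain computation in this example, eight multiplications feeding seven additions, can be sketched as follows. The original program listing is not reproduced in the text, so the function name `sum_of_product` and the pairwise (tree-shaped) order of additions are assumptions chosen to make the operation counts visible.

```python
def sum_of_product(a, b):
    """Sum the four elements of C = A x B for 2 x 2 matrices A and B."""
    # Eight independent multiplications, two for each element of C.
    products = [a[i][k] * b[k][j]
                for i in range(2) for j in range(2) for k in range(2)]
    # Pairwise additions: 4, then 2, then 1, for seven additions in total.
    level = products
    additions = 0
    while len(level) > 1:
        level = [level[m] + level[m + 1] for m in range(0, len(level), 2)]
        additions += len(level)
    return level[0], len(products), additions


total, n_mult, n_add = sum_of_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The eight products are mutually independent, so they form the widest level of the program graph; the additions narrow the available parallelism from four to two to one.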

The remaining three nodes N, O, P then form the fifth node Z. Note that there is only one level of interprocessor communication required as marked by d in Fig. Since the maximum degree of parallelism is now reduced to 4 in the program graph, we use only four processors to execute this coarse-grain program.

Dataflow computers are based on a data-driven mechanism which allows the execution of any instruction to be driven by data operand availability. Dataflow computers emphasize a high degree of parallelism at the fine-grain instructional level. Reduction computers are based on a demand-driven mechanism which initiates an operation based on the demand for its results by other computations. The data-driven chain reactions are shown in Fig. Note that no shared memory is used in the dataflow implementation.

The example does not show any time advantage of dataflow execution over control flow execution. The chain reaction control in dataflow is more difficult to implement and may result in longer overhead, as compared with the uniform operations performed by all the processors in Fig.

Thus, instruction-level parallelism of dataflow graphs can absorb the communication latency and minimize the losses due to synchronization waits. Besides token matching and I-structure, compiler technology is also needed to generate dataflow graphs for tagged-token dataflow computers. The dataflow architecture offers an ideal model for massively parallel computations because all far-reaching side effects are removed.

Side effects refer to the modification of some shared variables by unrelated operations. Such a computation has been called eager evaluation because operations are carried out immediately after all their operands become available. A demand-driven computation corresponds to lazy evaluation, because operations are executed only when their results are required by another instruction. The demand-driven approach matches naturally with the functional programming concept.
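The contrast between eager (data-driven) and lazy (demand-driven) evaluation can be illustrated with a small Python sketch. It models only the evaluation order, not a dataflow or reduction machine; the helper names `traced`, `eager_select`, and `lazy_select` are hypothetical.

```python
log = []


def traced(name, value):
    """Record that an operand was actually computed, then return it."""
    log.append(name)
    return value


def eager_select(flag, x, y):
    # Both operands were already computed before this call was entered.
    return x if flag else y


def lazy_select(flag, x_thunk, y_thunk):
    # Operands arrive as zero-argument functions and run only on demand.
    return x_thunk() if flag else y_thunk()


eager_result = eager_select(True, traced("x", 1), traced("y", 2))
eager_log = list(log)

log.clear()
lazy_result = lazy_select(True, lambda: traced("x", 1), lambda: traced("y", 2))
lazy_log = list(log)
```

Both calls return the same value, but the eager version computes the unused operand `y` while the lazy version never touches it.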

The removal of side effects in functional programming makes programs easier to parallelize. There are two types of reduction machine models, both having a recursive control mechanism as characterized below.

Reduction Machine Models In a string reduction model, each demander gets a separate copy of the expression for its own evaluation.

A long string expression is ... These routing functions can be implemented on ring, mesh, hypercube, or multistage networks. The set of all permutations forms a permutation group with respect to a composition operation. One can use cycle notation to specify a permutation function. The cycle (a, b, c) has a period of 3, and the cycle (d, e) a period of 2. One can use a crossbar switch to implement the permutation.

Multistage networks can implement some of the permutations in one or multiple passes through the network. Permutations can also be implemented with shifting or broadcast operations. The permutation capability of a network is often used to indicate the data routing capability.
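Cycle notation for a permutation, as used above, maps each element to the next one in its cycle, with the last wrapping to the first. The sketch below builds that mapping and computes the period of the whole permutation as the least common multiple of the cycle lengths; the function names are illustrative assumptions.

```python
from functools import reduce
from math import gcd


def permutation_from_cycles(cycles):
    """Build the element -> image mapping described by a list of cycles."""
    mapping = {}
    for cycle in cycles:
        for pos, item in enumerate(cycle):
            mapping[item] = cycle[(pos + 1) % len(cycle)]
    return mapping


def period(cycles):
    """Number of applications after which every element returns home."""
    return reduce(lambda x, y: x * y // gcd(x, y),
                  (len(c) for c in cycles), 1)


cycles = [("a", "b", "c"), ("d", "e")]
pi = permutation_from_cycles(cycles)
pi_period = period(cycles)
```

With one cycle of period 3 and one of period 2, the combined permutation returns every element to its starting position after 6 applications.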

When n is large, the permutation speed often dominates the performance of a data-routing network. Perfect Shuffle and Exchange Perfect shuffle is a special permutation function suggested by Harold Stone for parallel processing applications. The mapping corresponding to a perfect shuffle is shown in Fig. Its inverse is shown on the right-hand side Fig. It is symmetric with a constant node degree of 2.
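The perfect shuffle mentioned above is commonly defined as a one-bit cyclic left rotation of each k-bit address, with the inverse being the corresponding right rotation. A minimal sketch under that definition follows; the function names and the choice of k = 3 (eight addresses) are assumptions for illustration.

```python
def perfect_shuffle(x, k):
    """Rotate the k-bit address x left by one bit position."""
    mask = (1 << k) - 1
    return ((x << 1) & mask) | (x >> (k - 1))


def inverse_shuffle(x, k):
    """Rotate the k-bit address x right by one bit position."""
    return (x >> 1) | ((x & 1) << (k - 1))


k = 3
shuffled = [perfect_shuffle(x, k) for x in range(2 ** k)]
restored = [inverse_shuffle(y, k) for y in shuffled]
```

For eight addresses the mapping sends 0..7 to 0, 2, 4, 6, 1, 3, 5, 7, which is the interleaving of the two halves of a deck that gives the permutation its name.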

The IBM token ring has this topology, in which messages circulate along the ring until they reach the destination with a matching token. Pipelined or packet-switched rings have been implemented in the CDC Cyberplus multiprocessor and in the KSR-1 computer system for interprocessor communications.

By increasing the node degree from 2 to 3 or 4, we obtain two chordal rings as shown in Figs. One and two extra links are added to produce the two chordal rings, respectively. In general, the more links added, the higher the node degree and the shorter the network diameter.
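The effect of added links on network diameter can be checked with a breadth-first search. The sketch below is an illustration under assumptions: the 16-node size, the chord length of 4, and the function name `diameter` are chosen for the example and are not taken from the figures.

```python
from collections import deque


def diameter(n, offsets):
    """Diameter of an n-node ring-like network; node i links to i +/- offset."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for off in offsets:
                for v in ((u + off) % n, (u - off) % n):
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
        return max(dist.values())
    return max(eccentricity(s) for s in range(n))


ring = diameter(16, [1])              # plain ring, node degree 2
chordal = diameter(16, [1, 4])        # assumed chords of length 4, degree 4
complete = diameter(16, range(1, 9))  # every node linked to every other
```

Under these assumptions the diameter drops from 8 for the ring to 3 for the degree-4 chordal ring and to 1 for the completely connected network, at the cost of a node degree that grows from 2 to 4 to 15.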

Comparing the node ring Fig. In the extreme, the completely connected network in Fig. Barrel Shifter As shown in Fig. Obviously, the connectivity in the barrel shifter is increased over that of any chordal ring of lower node degree. But the barrel shifter complexity is still much lower than that of the completely connected network Fig.

Tree and Star A binary tree of 31 nodes in five levels is shown in Fig. For a tree of k levels, the maximum node degree is 3 and the diameter is 2(k - 1). With a constant node degree, the binary tree is a scalable architecture. However, the diameter is rather long. Note that the pure mesh as shown in Fig. The node degrees at the boundary and corner nodes are 3 or 2. The Illiac IV assumed an 8 x 8 Illiac mesh with a constant node degree of 4 and a diameter of 7. The Illiac mesh is topologically equivalent to a chordal ring of degree 4 as shown in Fig.

The torus shown in Fig. This topology combines the ring and mesh and extends to higher dimensions. The torus has ring connections along each row and along each column of the array. The torus is a symmetric topology. All added wraparound connections help reduce the diameter by one-half from that of the mesh. Systolic Arrays This is a class of multidimensional pipelined array architectures designed for implementing fixed algorithms.

What is shown in Fig. The interior node degree is 6 in this example. In general, static systolic arrays are pipelined with multidirectional flow of data streams. The commercial machine Intel iWarp system (Annaratone et al.)

The systolic array has become a popular research area ever since its introduction by Kung and Leiserson in 1978. With fixed interconnection and synchronous operation, a systolic array matches the communication structure of the algorithm. However, the structure has limited applicability and can be very difficult to program. Since this book emphasizes general-purpose computing, we will not study systolic arrays further.

Interested readers may refer to the book by S. Kung for using systolic and wavefront architectures in building VLSI array processors. A 3-cube with 8 nodes is shown in Fig. A 4-cube can be formed by interconnecting the corresponding nodes of two 3-cubes, as illustrated in Fig. The node degree of an n-cube equals n and so does the network diameter. In fact, the node degree increases linearly with respect to the dimension, making it difficult to consider the hypercube a scalable architecture.
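The claim that an n-cube has both node degree n and diameter n follows from its binary addressing: two nodes are linked exactly when their addresses differ in one bit. A minimal sketch, with hypothetical helper names:

```python
def neighbors(node, n):
    """Nodes adjacent to `node` in an n-cube: flip each address bit once."""
    return [node ^ (1 << bit) for bit in range(n)]


def distance(a, b):
    """Shortest-path length equals the number of differing address bits."""
    return bin(a ^ b).count("1")


n = 3
nodes = range(2 ** n)
degree = len(neighbors(0, n))
diam = max(distance(a, b) for a in nodes for b in nodes)

# Building a 4-cube from two 3-cubes: node x in one copy links to x + 8
# in the other, which is the extra neighbor obtained by flipping bit 3.
four_cube_neighbors_of_5 = neighbors(5, 4)
```

Each added dimension contributes one more link per node, which is the linear growth in node degree that the text identifies as the obstacle to scalability.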

Binary hypercube has been a very popular architecture for research and development in the s. The architecture has dense connections. Many other architectures, such as binary trees, meshes, etc. With poor scalability and difficulty in packaging higher-dimensional hypercubes, the hypercube architecture is gradually being replaced by other architectures.

But the CCC (cube-connected cycles) has a node degree of 3, smaller than the node degree of 6 in a 6-cube. In this sense, the CCC is a better architecture for building scalable systems if latency can be tolerated in some way. The parameter n is the dimension of the cube and k is the radix, or the number of nodes (multiplicity) along each dimension.

For simplicity, all links are assumed bidirectional. Each line in the network represents two communication channels, one in each direction. Traditionally, low-dimensional k-ary n-cubes are called tori, and high-dimensional binary n-cubes are called hypercubes. The long end-around connections in a torus can be avoided by folding the network as shown in Fig.

In this case, all links along the ring in each dimension have equal wire length when the multidimensional network ... Other boards for processors, memories, or device interfaces are plugged into the backplane board via connectors or cables.

The passive or slave devices (memories or peripherals) respond to the requests. The common bus is used on a time-sharing basis, and important busing issues include the bus arbitration, interrupts handling, coherence protocols, and transaction processing. Hierarchical bus structures for building larger multiprocessor systems are studied in Chapter 7.

Switch Modules An a x b switch module has a inputs and b outputs. In theory, a and b do not have to be equal. Table 2. Each input can be connected to one or more of the outputs. However, conflicts must be avoided at the output terminals. In other words, one-to-one and one-to-many mappings are allowed; but many-to-one mappings are not allowed due to conflicts at the output terminal.

When only one-to-one mappings permutations are allowed, we call the module an n x n crossbar switch. For example, a 2 x 2 crossbar switch can connect two possible patterns: The numbers of legitimate connection patterns for switch modules of various sizes are listed in Table 2. The small boxes and the ultimate building blocks of the subblocks are the 2 x 2 switches, each with two legitimate connection states: A 16 x 16 Baseline network is shown in Fig.
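The counts of legitimate connection patterns for an n x n switch module can be enumerated directly from the rules above. This sketch assumes that every output is driven by exactly one input (so one-to-many broadcasts are allowed but no output is left idle); under that assumption an n x n module has n^n legitimate states, of which n! are one-to-one permutations. The function name is hypothetical.

```python
from itertools import permutations, product


def connection_patterns(n):
    """Return (legitimate states, one-to-one states) for an n x n switch."""
    # Each output independently selects one source input, so no output
    # ever receives two signals at once.
    legitimate = sum(1 for _ in product(range(n), repeat=n))
    # Permutations further require every input to be used exactly once.
    one_to_one = sum(1 for _ in permutations(range(n)))
    return legitimate, one_to_one


two_by_two = connection_patterns(2)
four_by_four = connection_patterns(4)
```

For a 2 x 2 module this gives four legitimate states, of which two are the one-to-one patterns mentioned in the text.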


In Problem 2. Crossbar Network The highest bandwidth and interconnection capability are provided by crossbar networks. A crossbar network can be visualized as a single-stage switch network. Like a telephone switchboard, the crosspoint switches provide dynamic connections between (source, destination) pairs.

Each crosspoint switch can provide a dedicated connection path between a pair. The switch can be set on or off dynamically upon program demand. Two types of crossbar networks are illustrated in Fig. To build a shared-memory multiprocessor, one can use a crossbar network between the processors and memory modules Fig. This is essentially a memory-access network. The C. The 16 memory modules can ...

Considering each statement as a separate process, clearly identify input set I_i and output set O_i of each process.

Restructure the program using Bernstein's conditions in order to achieve maximum parallelism between processes. If any pair of processes cannot be executed concurrently, specify which of the three conditions is not satisfied. Prove that the Flip network is topologically equivalent to the Baseline network. Courtesy of Ken Batcher; reprinted from Proc. Each metric used by SPEC95 aggregates the benchmarks of a given suite by taking the geometric mean of the ratios of the individual benchmarks.

Parallel architecture has a higher potential to deliver scalable performance. Plagiarism is a serious offence and will always result in imposition of a penalty. His current research projects are the RAD Lab, which is inventing technology for reliable, adaptive, distributed Internet services, and the Research Accelerator for Multiple Processors (RAMP) project, which is developing and distributing low-cost, highly scalable, parallel computers based on FPGAs and open-source hardware and software.

Conte North Carolina State University. Appendix K collects the "Historical Perspective and References" from each chapter of the third edition into a single appendix.