PROGRAMMING MASSIVELY PARALLEL PROCESSORS A HANDS-ON APPROACH PDF
Programming Massively Parallel Processors: A Hands-on Approach (http:// yazik.info). Mattson, T. G., Sanders. CHAPTER 13 Parallel Programming and Computational. Thinking. million of these processors in the hands of consumers and professionals, and edu/Byelick/csf07/handouts/yazik.info OpenMP. Programming Massively Parallel Processors: A Hands-on Approach. (http:// yazik.info). Mattson, T. G., Sanders.
|Language:||English, Spanish, Indonesian|
|Genre:||Business & Career|
|ePub File Size:||25.75 MB|
|PDF File Size:||18.75 MB|
|Distribution:||Free* [*Register to download]|
Programming Massively Parallel Processors A Hands-on Approach Second Edition David B. Kirk a Programming Embedded Systems, Second Edition with C. Programming Massively Parallel Processors: A Hands-On Approach | 𝗥𝗲𝗾𝘂𝗲𝘀𝘁 𝗣𝗗𝗙 on ResearchGate | Programming Massively Parallel Processors: A. [PDF BOOK] Programming Massively Parallel Processors: A Hands-on Approach By David B. Kirk AUDIOBOOK OpenCL in Action How to.
San Mateo, CA: Industrial Light and Magic www. Intel 64 and IA Architectures optimization reference manual. Order No. Intel Corp. Kessenich, J. Madison, AL: Kirk, D. The rendering architecture of the DNVS. Lindholm, E.
A user-programmable vertex engine. In Proceedings of the 28th annual ACM conference on computer graphics and interactive techniques pp. A unified graphics and computing architecture. IEEE Micro, 28 2 , 39— Microsoft DirectX 9 programmable graphics pipeline. Redmond, WA: Microsoft Press. Microsoft DirectX specification. Microsoft Press http: Montrym, J.
A realtime graphics system. Owen, T. Mones-Hattal Eds. The GeForce IEEE Micro, 25 2 , 41— Moore, G. Cramming more components onto integrated circuits.
Electronics, 38 8 , — Nguyen, H. GPU Gems 3. Nickolls, J. Scalable parallel programming with CUDA. ACM Queue, 6 2 , 40— Parallel thread execution, ISA Version 1. CUDA Zone http: Nyland, L. Nguyen Ed. Oberman, S. A high-performance area-efficient multifunction interpolator.
Cape Cod, MA. Patterson, D. Computer organization and design: Pharr, M. GPU Gems 2: Programming techniques for high-performance graphics and general-purpose computation. Addison Wesley. Satish, N. Designing efficient sorting algorithms for manycore GPUs.
Segal, M. A specification, Version 2. Mountain View, CA: Silicon Graphics http: Sengupta, S. Scan primitives for GPU computing. Segal Eds. San Diego, CA: ACM Press. References and Further Reading Stratton, J. Volkov, V. Technical report no. Williams, S. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Parallel computing—Special issue on revolutionary technologies for acceleration of emerging petascale applications. In modern software applications, program sections often exhibit a rich amount of data parallelism, a property allowing many arithmetic operations to be safely performed on program data structures in a simultaneous manner.
The CUDA devices accelerate the execution of these applications by harvesting a large amount of data parallelism.
Because data parallelism plays such an important role in CUDA, we will first discuss the concept of data parallelism before introducing the basic features of CUDA. Rigid body physics and fluid dynamics model natural forces and movements that can be independently evaluated within small time steps.
Such independent evaluation is the basis of data parallelism in these applications. As we mentioned earlier, data parallelism refers to the program property whereby many arithmetic operations can be safely performed on the data structures in a simultaneous manner. We illustrate the concept of data parallelism with a matrix—matrix multiplication matrix multiplication, for brevity example in Figure 3.
In this example, each element of the product matrix P is generated by performing a dot product between a row of input matrix M and a column of input matrix N. In Figure 3. Note that the dot product operations for computing different matrix P elements can be simultaneously performed.
Customers who viewed this item also viewed
WIDTH 3. Therefore, matrix multiplication of large dimensions can have very large amount of data parallelism. By executing many dot products in parallel, a CUDA device can significantly accelerate the execution of the matrix multiplication over a traditional host CPU. The data parallelism in real applications is not always as simple as that in our matrix multiplication example. In a later chapter, we will discuss these more sophisticated forms of data parallelism.
The phases that exhibit little or no data parallelism are implemented in host code. The phases that exhibit rich amount of data parallelism are implemented in the device code.
A CUDA program is a unified source code encompassing both host and device code. The device code is written using ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures. The device code is typically further compiled by the nvcc and executed on a GPU device. The kernel functions or, simply, kernels typically generate a large number of threads to exploit data parallelism.
In the matrix multiplication example, the entire matrix multiplication computation can be implemented as a kernel where each thread is used to compute one element of output matrix P. In this example, the number of threads used by the kernel is a function of the matrix dimension. CUDA programmers can assume that these threads take very few cycles to generate and schedule due to efficient hardware support.
This is in contrast with the CPU threads that typically require thousands of clock cycles to generate and schedule. The execution starts with host CPU execution. When a kernel function is invoked, or launched, the execution is moved to a device GPU , where a large number of threads are generated to take advantage of abundant data parallelism.
All the threads that are generated by a kernel during an invocation are collectively called a grid. Figure 3. We will discuss how these grids are organized soon. When all threads of a kernel complete their execution, the corresponding grid terminates, and the execution continues on the host until another kernel is invoked. For simplicity, we assume that the matrices are square in shape, and the dimension of each matrix is specified by the parameter Width.
The details of Part 3 are also shown in Appendix A. Part 2 is the main focus of our 3. It calls a function, MatrixMultiplication , to perform matrix multiplication on a device. Before we explain how to use a CUDA device to execute the matrix multiplication function, it is helpful to first review how a conventional CPU-only matrix multiplication function works.
A simple version of a CPU-only matrix multiplication function is shown in Figure 3. The MatrixMultiplication function implements a straightforward algorithm that consists of three loop levels. The innermost loop iterates over variable k and steps through one row of matrix M and one column of matrix N.
The loop calculates a dot product of the row of M and the column of N and generates one element of P. Immediately after the innermost loop, the P element generated is written into the output P matrix.
This is because the M matrix elements are placed into the system memory that is ultimately accessed with a linear address. That is, every location in the system memory has an address that ranges from 0 to the largest memory location. For C programs, the placement of a 2-dimensional matrix into this linear addressed memory is done according to the row-major convention, as illustrated in Figure 3.
The rows are then placed one after another.
New Private Message
All elements of a column are first placed into consecutive locations, and all columns are then placed in their numerical order. The k term then selects the proper element within the section for row i. The outer two i and j loops in Figure 3.
Each i value identifies a row. By systematically iterating all M rows and all N columns, the function generates all P elements. We now have a complete matrix multiplication function that executes solely on the CPU.
Note that all of the code that we have shown so far is in standard C. Assume that a programmer now wants to port the matrix multiplication function into CUDA. A straightforward way to do so is to modify the MatrixMultiplication function to move the bulk of the calculation to a CUDA device. The structure of the revised function is shown in Figure 3.
Part 1 of the function allocates device GPU memory to hold copies of the M, N, and P matrices and copies these matrices over to the device memory.
Part 2 invokes a kernel that launches parallel execution of the actual matrix multiplication on the device. Part 3 copies the product matrix P from the device memory back to the host memory.
Note that the revised MatrixMultiplication function is essentially an outsourcing agent that ships input data to a device, activates the calculation on the device, and collects the results from the device.
The details of the revised function, as well as the way to compose the kernel function, will serve as illustrations as we introduce the basic features of the CUDA programming model. This reflects the reality that devices are typically hardware cards that come with their own dynamic random access memory DRAM.
In order to execute a kernel on a device, the programmer needs to allocate memory on the device and transfer pertinent data from the host memory to the allocated device memory. This corresponds to Part 1 of Figure 3. Similarly, after device execution, the programmer needs to transfer result data from the device memory back to the host memory and free up the device memory that is no longer needed.
This corresponds to Part 3 of Figure 3. The CUDA runtime system provides application programming interface API functions to perform these activities on behalf of the programmer.
From this point on, we will simply say that a piece of data is transferred from host to device as shorthand for saying that the piece of data is transferred from the host memory to the device memory. The same holds for the opposite data transfer direction.
At the bottom of the figure, we see global memory and constant memory. These are the memories that the host code can transfer data to and from the device, as illustrated by the bidirectional arrows between these memories and the host. Constant memory allows read-only access by the device code and is described in Chapter 5. For now, we will focus on the use of global memory. Note that the host memory is not explicitly shown in Figure 3.
The function cudaMalloc can be called from the host code to allocate 2 Note that we have omitted the texture memory from Figure 3. We will introduce texture memory later. The reader should be able to notice the striking similarity between cudaMalloc and the standard C runtime library malloc. By keeping the interface as close to the original C runtime libraries as possible, CUDA minimizes the time that a C programmer needs to relearn the use of these extensions.
The first parameter of the cudaMalloc function is the address of a pointer variable that must point to the allocated object after allocation. This address allows the cudaMalloc function to write the address of the allocated object into the pointer variable. The usage of this second parameter is consistent with the size parameter of the C malloc function.
We now use a simple code example illustrating the use of cudaMalloc. This is a continuation of the example in Figure 3. The programmer passes the address of Md i. After the computation, cudaFree is called with pointer Md as input to free the storage space for the M matrix from the device global memory: The C Malloc function returns a pointer to the allocated object. It takes only one parameter that specifies the size of the allocated object. The cudaMalloc function writes to the pointer variable whose address is given as the first parameter.
As a result, the cudaMalloc function takes two parameters. Furthermore, Part 3 in Figure 3. Once a program has allocated device global memory for the data objects, it can request that data be transferred from host to device. The cudaMemcpy function takes four parameters. The first parameter is a pointer to the destination location for the copy operation.
The second parameter points to the source data object to be copied. The third parameter specifies the number of bytes to be copied. The fourth parameter indicates the types of memory involved in the copy: For example, the memory copy function can be used to copy data from one location of the device memory to another location of the device memory.
For the matrix multiplication example, the host code calls the cudaMemcpy function to copy the M and N matrices from the host memory to the device memory before the multiplication and then to copy the P matrix from the device memory to the host memory after the multiplication is done.
The same function can be used to transfer data in both directions by properly ordering the source and destination pointers and using the appropriate constant for the transfer type: MatrixMultiplication , as outlined in Figure 3. We often refer to this type of host code as the stub function for invoking a kernel. After the matrix multiplication, MatrixMultiplication also copies result data from device to the host. We show a more fleshed out version of the MatrixMultiplication function in Figure 3.
Compared to Figure 3. This is accomplished with calls to the cudaMalloc and cudaMemcpy functions. The readers are encouraged to write their own function calls with the appropriate parameter values and compare their code with that shown in Figure 3.
Part 2 invokes the kernel and will be described in the following text. Part 3 reads the product data from device memory to host memory so the value will be available to main. This is accomplished with a call to the cudaMemcpy function. It then frees Md, Nd, and Pd from the device memory, which is accomplished with calls to the cudaFree functions. In CUDA, a kernel function specifies the code to be executed by all threads during a parallel phase. Because all of these threads execute the same code, CUDA programming is an instance of the well-known single-program, multiple-data SPMD parallel programming style [Atallah ], a popular programming style for massively parallel computing systems.
This keyword indicates that the function is a kernel and that it can be called from a host functions to generate a grid of threads on a device. The meanings of these keywords are summarized in Figure 3.
The function will be executed on the device and can only be called from the host to generate a grid of threads on a device. We will show the host code syntax for calling a kernel function later in Figure 3. In an SPMD system, the parallel processing units execute the same program on multiple parts of the data; however, these processing units do not have to be executing the same instruction at the same time.
In an SIMD system, all processing units are executing the same instruction at any instant. A device function executes on a CUDA device and can only be called from a kernel function or another device function. Device functions can have neither recursive function calls nor indirect function calls through pointers in them.
A host function is simply a traditional C function that executes on the host and can only be called from another host function. The programmer would add kernel functions and device functions 3. The original functions remain as host functions. Having all functions default into host functions spares the programmer the tedious work of changing all original function declarations. This combination triggers the compilation system to generate two versions of the same function. One is executed on the host and can only be called from a host function.
The other is executed on the device and can only be called from a device or kernel function. This supports a common use when the same function source code can be simply recompiled to generate a device version.
Many user library functions will likely fall into this category. Note that all threads execute the same kernel code. There needs to be a mechanism to allow them to distinguish themselves and direct themselves toward the particular parts of the data structure that they are designated to work on.
These keywords identify predefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread. Different threads will see different values in their threadIdx.
For simplicity, we will refer to a thread as ThreadthreadIdx. Note that the coordinates reflect a multidimensional organization for the threads. We will come back to this point soon. A quick comparison of Figure 3. The kernel function in Figure 3. The readers should ask where the other two levels of outer loops go. The answer is that the outer two loop levels are now replaced with the grid of threads.
The entire grid forms the equivalent of the two-level loop. Each thread in the grid corresponds to one of the iterations of the original two-level loop. The original loop variables i and j are now replaced with threadIdx. Instead of having the loop increment the values of i and j for use in each loop iteration, the CUDA threading hardware generates all of the threadIdx.
It should be clear that these indices simply take over the role of variables i and j in Figure 3. Note that we assigned threadIdx. Each thread also uses its threadIdx. This way, the threads collectively generate all the elements of the Pd matrix. When a kernel is invoked, or launched, it is executed as grid of parallel threads. Creating enough threads to fully utilize the hardware often requires a large amount of data parallelism; for example, each element of a large array might be computed in a separate thread.
Threads in a grid are organized into a two-level hierarchy, as illustrated in Figure 3. For simplicity, a small number of threads are shown in Figure 3. In reality, a grid will typically consist of many more threads.
At the top level, each grid consists of one or more thread blocks. All blocks in a grid have the same number of threads. All thread blocks must have the same number of threads organized in the same manner. The coordinates of threads in a block are uniquely defined by three thread indices: Not all applications will use all three dimensions of a thread block.
This is obviously a simplified example. In the matrix multiplication example, a grid is invoked to compute the product matrix. The code in Figure 3. Threads with the same threadIdx values from different blocks would end up accessing the same input and output data elements.
As a result, the kernel can use only one thread block. The threadIdx. Because a thread block can have only up to threads, and each thread calculates one element of the product matrix in Figure 3.
This is obviously not acceptable. As we explained before, the product matrix must have millions of elements in order to have a sufficient amount of data parallelism to benefit from execution on a device.
We will address this issue in Chapter 4 using multiple blocks. When the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters. This is illustrated in Figure 3. Two struct variables of type dim3 are declared. The second variable, dimGrid, describes the configuration of the grid.
The final line of code invokes the kernel. It provides the dimensions of the grid in terms of number of blocks and the dimensions of the blocks in terms of number of threads. CUDA extends the C language to support parallel computing. The extensions discussed in this chapter are summarized below. The extensions are summarized in Figure 3.
If a function declaration does not have any CUDA extension keyword, the function defaults into a host function. These execution configuration parameters are only used during a call to a kernel function, or a kernel launch. We discussed the execution configuration parameters that define the dimensions of the grid and the dimensions of each block. We discussed the threadIdx variable in this chapter.
In Chapter 4, we will further discuss blockIdx, gridDim, and 5 blockDim variables. They should not be confused with the user defined dimGrid and dimBlock variables that are used in the host code for the purpose of setting up the configuration parameters.
The value of these configuration parameters will ultimately become the values of gridDim and blockDim once the kernel has been launched. References and Further Reading 3. The services that we discussed in this chapter are cudaMalloc and cudaMemcpy functions. These functions allocate device memory and transfer data between the host and device on behalf of the calling program. The chapter is by no means a comprehensive account of all CUDA features.
Some of these features will be covered in the rest of the book; however, our emphasis will be on key concepts rather than details. In general, we would like to encourage the reader to always consult the CUDA Programming Guide for more details on the concepts that we cover. References and Further Reading Atallah, M. Algorithms and theory of computation handbook. Boca Raton, FL: CRC Press. CUDA programming guide 2. Stratton, J.
As we explained in Chapter 3, launching a CUDA kernel function creates a grid of threads that all execute the kernel function. That is, the kernel function specifies the C statements that are executed by each individual thread created when the kernel is launched at runtime. This chapter presents more details on the organization, resource assignment, and scheduling of threads in a grid. These threads are organized into a two-level hierarchy using unique coordinates—blockIdx for block index and threadIdx for thread index —assigned to them by the CUDA runtime system.
The blockIdx and threadIdx appear as builtin, preinitialized variables that can be accessed within kernel functions. Additional built-in variables, gridDim and blockDim, provide the dimension of the grid and the dimension of each block respectively. Figure 4. The grid in this example consists of N thread blocks, each with a blockIdx.
Each block, in turn, consists of M threads, each with a threadIdx. All blocks at the grid level are organized as a one-dimensional 1D array; all threads within each block are also organized as a 1D array. The black box of each thread block in Figure 4. In this example, access to blockDim in the kernel returns Thread 15 of Block has a threaded value of The reader should verify that every one of the threads has its own unique threaded value.
In Figure 4. If we assume that both arrays are declared with elements, then each thread will take one of the input elements and produce one of the output elements.
In general, a grid is organized as a 2D array of blocks. Each block is organized into a 3D array of threads. The exact organization of a grid is determined by the execution configuration provided at kernel launch. The first parameter of the execution configuration specifies the dimensions of the grid in terms of number of blocks. The second specifies the dimensions of each block in terms of number of threads.
Each such parameter is a dim3 type, which is essentially a C struct with three unsigned integer fields: Because grids are 2D arrays of block dimensions, the third field of the grid dimension parameter is ignored; it should be set to 1 for clarity.
The following host code can be used to launch the kernel whose organization is shown in Figure 4. Thread block 0 threadIdx. Because the grid and the blocks are 1D arrays, only the first dimension of dimBlock and dimGrid are used. The other dimensions are set to 1. The third statement is the actual kernel launch. Note that scalar values can also be used for the execution configuration parameters if a grid or block has only one dimension; for example, the same grid can be launched with one statement: The values of gridDim.
Once a kernel is launched, its dimensions cannot change. All threads in a block share the same blockIdx value. The blockIdx. Each block in Figure 4. In general, blocks are organized into 3D arrays of threads. All blocks in a grid have the same dimensions. Each threadIdx consists of three components: The number of threads in each dimension of a block is specified by the second execution configuration parameter given at the kernel launch. With the kernel, this configuration parameter can be accessed as a predefined struct variable, blockDim.
The total size of a block is limited to threads, with flexibility in distributing these elements into the three dimensions as long as the total number of threads does not exceed For example, , 1, 1 , 8, 16, 2 , and 16, 16, 2 are all allowable blockDim values, but 32, 32, 1 is not allowable because the total number of threads would be Because all blocks within a grid have the same dimensions, we only need to show one of them.
Note that, in this example, we have 4 blocks of 16 threads each, with a grand total of 64 threads in the grid. We have used these small numbers to keep the illustration simple.
Typical CUDA grids contain thousands to millions of threads. One common usage for threadIdx and blockIdx is to determine the area of data that a thread is to work on. This was illustrated by the simple matrix multiplication code in Figure 3.
We will now cover more sophisticated usage of these variables. One limitation of the simple code in Figure 3. This limitation comes from the fact that the kernel function does not use blockIdx. As a result, we are limited to using only one block of threads. Even if we used more blocks, threads from different blocks would end up calculating the same Pd element if they have the same threadIdx value. Recall that each block can have up to threads. With each thread calculating one element of Pd, we can calculate up to Pd elements with the code.
In order to accommodate larger matrices, we need to use multiple thread blocks. Conceptually, we break Pd into square tiles. All the Pd elements of a tile are computed by a block of threads. By keeping the dimensions of these Pd tiles small, we keep the total number of threads in each block under , the maximal allowable block size.
Similarly, we abbreviate blockIdx. Each thread still calculates one Pd element. The difference is that it must use its blockIdx values to identify the tile that contains its element before it uses its threadIdx values to identify its element inside the tile. That is, each thread now uses both threadIdx and blockIdx to identify the Pd element to work on. This is portrayed in Figure 4. All threads calculating the Pd elements within a tile have the same blockIdx values.
Each block handles such a section. Thus, a thread can find the 4. The Pd matrix is now divided into 4 tiles. Each dimension of Pd is now divided into sections of 2 elements. Each block needs to calculate 4 Pd elements. In the example, thread 0, 0 of block 0, 0 calculates Pd0,0, whereas thread 0, 0 of block 1, 0 calculates Pd2,0.
It is easy to verify that one can identify the Pd element calculated by thread 0, 0 of block 1, 0 with the formula given above: The reader should work through the index derivation for as many threads as it takes to become comfortable with the concept. A Hands-on Approach, Third Edition shows both student and professional alike the basic concepts of parallel programming and GPU architecture, exploring, in detail, various techniques for constructing parallel programs.
Case studies demonstrate the development process, detailing computational thinking and ending with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in-depth. For this new edition, the authors have updated their coverage of CUDA, including coverage of newer libraries, such as CuDNN, moved content that has become less important to appendices, added two new chapters on parallel patterns, and updated case studies to reflect current industry practices.
David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research.
Programming Massively Parallel Processors
By the time he began his studies at Caltech, he had already earned B. At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III.
A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide. Wen-mei W. His research interests are in the area of architecture, implementation, compilation, and algorithms for parallel computing. Hwu received his Ph. CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign.
We are always looking for ways to improve customer experience on Elsevier. We would like to ask you for a moment of your time to fill in a short questionnaire, at the end of your visit. If you decide to participate, a new browser tab will open so you can complete the survey after you have completed your visit to this website. Thanks in advance for your time. Skip to content.
Search for books, journals or webpages All Webpages Books Journals. View on ScienceDirect. David Kirk Wen-mei Hwu. Paperback ISBN: Morgan Kaufmann.
No notes for slide. A Hands-on Approach Read online 1. A Hands-on Approach Read online 2. Book Details Author: David B. Kirk ,Wen-mei W. Hwu Pages: Morgan Kaufmann Brand: English ISBN: Publication Date: Notice Message: Trying to get property of non-object Filename: A Hands-on Approach, click button download in the last page 6.
Download or read Programming Massively Parallel Processors: A Hands-on Approach OR. You just clipped your first slide!The shader stage in Figure 2. Their teams created an excellent infrastructure for this course.
During the symposium, students use presentation slots proportional to the size of the teams. It shows an example in which a world map texture is mapped onto a sphere object. This property has motivated the design of the programmable pipeline stages into massively parallel processors.
Figure 4. All blocks at the grid level are organized as a one-dimensional 1D array; all threads within each block are also organized as a 1D array. Programming techniques, tips, and tricks for realtime graphics.