Advanced computer architecture pdf

Tuesday, April 23, 2019 admin Comments(0)

Advanced Computer Architecture. Instructor: Andreas Moshovos. Fall. Some material is based on slides developed by profs. This course aims to give an introduction to some advanced aspects of computer architecture. One of the main areas that we will be considering is RISC.


Computer systems: hardware, architecture, operating system, application software. No component can be treated in isolation from the others. On Jan 1, , Jain Nitin and others published UNIT 1 Advanced Computer Architecture Introduction. Subject: ADVANCED COMPUTER ARCHITECTURE. Credits: 4 age, is no more limited to computer programmers and computer engineers. Rather than.

The example does not show any time advantage of dataflow execution over control flow execution. It is faster to access a local memory with a local processor. Pipelining is extensively applied in memory-access, instruction execution, scalar, superscalar, and vector arithmetic operations. Indeed, there are several systems here whose descriptions cannot be found in the literature. When a copy is dirty, it must be written back to global memory.

As you work through this book you will find plenty of both. The result of great architecture, whether in computer design, building design or textbook design, is to take the customer's requirements and desires and return a design that causes that customer to say, "Wow, I didn't know that was possible."

Performance and Price-Performance. Concepts and Challenges. Examples and the Algorithm. Hardware versus Software Speculation. Thread-Level Parallelism. Performance and Efficiency in Advanced Multiple-Issue Processors. The Basics. An Introduction. Virtual Memory and Virtual Machines. The Design of Memory Hierarchies. Archive Cluster. The Role of Compilers. Introduction.

Through four editions of this book, our goal has been to describe the basic principles underlying what will be tomorrow's technological developments. Our excitement about the opportunities in computer architecture has not abated, and we echo what we said about the field in the first edition: It's a discipline of keen intellectual interest, requiring the balance of marketplace forces to cost-performance-power, leading to glorious failures and some notable successes.

Our primary objective in writing our first book was to change the way people learn and think about computer architecture. We feel this goal is still valid and important.

The field is changing daily and must be studied with real examples and measurements on real computers, rather than simply as a collection of definitions and designs that will never need to be realized. We offer an enthusiastic welcome to anyone who came along with us in the past, as well as to those who are joining us now. Either way, we can promise the same quantitative approach to, and analysis of, real systems. As with earlier versions, we have strived to produce a new edition that will continue to be as relevant for professional engineers and architects as it is for those involved in advanced computer architecture and design courses.

As much as its predecessors, this edition aims to demystify computer architecture through an emphasis on cost-performance-power trade-offs and good engineering design.


We believe that the field has continued to mature and move toward the rigorous quantitative foundation of long-established scientific and engineering disciplines. The fourth edition of Computer Architecture: A Quantitative Approach may be the most significant since the first edition.

Shortly before we started this revision, Intel announced that it was joining IBM and Sun in relying on multiple processors or cores per chip for high-performance designs. As the first figure in the book documents, after 16 years of doubling performance every 18 months, sin-.

This fork in the computer architecture road means that for the first time in history, no one is building a much faster sequential processor. If you want your program to run significantly faster, say, to justify the addition of new features, you're going to have to parallelize your program. Hence, after three editions focused primarily on higher performance by exploiting instruction-level parallelism (ILP), an equal focus of this edition is thread-level parallelism (TLP) and data-level parallelism (DLP).

This historic shift led us to change the order of the chapters: The changing technology has also motivated us to move some of the content from later chapters into the first chapter.

Because technologists predict much higher hard and soft error rates as the industry moves to semiconductor processes with feature sizes 65 nm or smaller, we decided to move the basics of dependability from Chapter 7 in the third edition into Chapter 1. As power has become the dominant factor in determining how much you can place on a chip, we also beefed up the coverage of power in Chapter 1.

Of course, the content and examples in all chapters were updated, as we discuss below. In addition to technological sea changes that have shifted the contents of this edition, we have taken a new approach to the exercises in this edition. It is surprisingly difficult and time-consuming to create interesting, accurate, and unambiguous exercises that evenly test the material throughout a chapter. Alas, the Web has reduced the half-life of exercises to a few months. Rather than working out an assignment, a student can search the Web to find answers not long after a book is published.

Hence, a tremendous amount of hard work quickly becomes unusable, and instructors are denied the opportunity to test what students have learned. To help mitigate this problem, in this edition we are trying two new ideas.

First, we recruited experts from academia and industry on each topic to write the exercises. This means some of the best people in each field are helping us to create interesting ways to explore the key concepts in each chapter and test the reader's understanding of that material. Second, each group of exercises is organized around a set of case studies. Our hope is that the quantitative example in each case study will remain interesting over the years, robust and detailed enough to allow instructors the opportunity to easily create their own new exercises, should they choose to do so.

Key, however, is that each year we will continue to release new exercise sets for each of the case studies. These new exercises will have critical changes in some parameters so that answers to old exercises will no longer apply. Another significant change is that we followed the lead of the third edition of Computer Organization and Design (COD) by slimming the text to include the material that almost all readers will want to see and moving the appendices that.

There were many reasons for this change: Students complained about the size of the book, which had expanded from pages in the chapters plus pages of appendices in the first edition to chapter pages plus appendix pages in the second edition and then to chapter pages plus pages in the paper appendices and pages in online appendices. At this rate, the fourth edition would have exceeded pages both on paper and online!

Similarly, instructors were concerned about having too much material to cover in a single course. As was the case for COD, by including a CD with material moved out of the text, readers could have quick access to all the material, regardless of their ability to access Elsevier's Web site.

Hence, the current edition's appendices will always be available to the reader even after future editions appear. This flexibility allowed us to move review material on pipelining, instruction sets, and memory hierarchy from the chapters and into Appendices A, B, and C. The advantage to instructors and readers is that they can go over the review material much more quickly and then spend more time on the advanced topics in Chapters 2, 3, and 5.

It also allowed us to move the discussion of some topics that are important but are not core course topics into appendices on the CD. In this edition we have 6 chapters, none of which is longer than 80 pages, while in the last edition we had 8 chapters, with the longest chapter weighing in at pages.

This package of a slimmer core print text plus a CD is far less expensive to manufacture than the previous editions, allowing our publisher to significantly lower the list price of the book. With this pricing scheme, there is no need for a separate international student edition for European readers.

Yet another major change from the last edition is that we have moved the embedded material introduced in the third edition into its own appendix, Appendix D. We felt that the embedded material didn't always fit with the quantitative evaluation of the rest of the material, plus it extended the length of many chapters that were already running long. We believe there are also pedagogic advantages in having all the embedded information in a single appendix.

This edition continues the tradition of using real-world examples to demonstrate the ideas, and the "Putting It All Together" sections are brand new; in fact, some were announced after our book was sent to the printer. As before, we have taken a conservative approach to topic selection, for there are many more interesting ideas in the field than can reasonably be covered in a treatment of basic principles.

We have steered away from a comprehensive survey of every architecture a reader might encounter. Instead, our presentation focuses on core concepts likely to be found in any new machine. The key criterion remains that of selecting ideas that have been examined and utilized successfully enough to permit their discussion in quantitative terms. Our intent has always been to focus on material that is not available in equivalent form from other sources, so we continue to emphasize advanced content wherever possible.

Indeed, there are several systems here whose descriptions cannot be found in the literature. Readers interested strictly in a more basic introduction to computer architecture should read Computer Organization and Design. Chapter 1 has been beefed up in this edition.

It includes formulas for static power, dynamic power, integrated circuit costs, reliability, and availability. We go into more depth than prior editions on the use of the geometric mean and the geometric standard deviation to capture the variability of the mean.

Our hope is that these topics can be used through the rest of the book. In addition to the classic quantitative principles of computer design and performance measurement, the benchmark section has been upgraded to use the new SPEC suite.

Our view is that the instruction set architecture is playing less of a role today than in , so we moved this material to Appendix B. It still uses the MIPS64 architecture.

Chapters 2 and 3 cover the exploitation of instruction-level parallelism in high-performance processors, including superscalar execution, branch prediction, speculation, dynamic scheduling, and the relevant compiler technology.

As mentioned earlier, Appendix A is a review of pipelining in case you need it. Chapter 3 surveys the limits of ILP. New to this edition is a quantitative evaluation of multithreading. While the last edition contained a great deal on Itanium, we moved much of this material to Appendix G, indicating our view that this architecture has not lived up to the early claims.

Given the switch in the field from exploiting only ILP to an equal focus on thread- and data-level parallelism, we moved multiprocessor systems up to Chapter 4, which focuses on shared-memory architectures. The chapter begins with the performance of such an architecture.

It then explores symmetric and distributed-memory architectures, examining both organizational principles and performance. Topics in synchronization and memory consistency models are. The example is the Sun T1 ("Niagara"), a radical design for a commercial product.

It reverted to a single-instruction issue, 6-stage pipeline microarchitecture. It put 8 of these on a single chip, and each supports 4 threads.

Hence, software sees 32 threads on this single, low-power chip. As mentioned earlier, Appendix C contains an introductory review of cache principles, which is available in case you need it. This shift allows Chapter 5 to start with 11 advanced optimizations of caches. The chapter includes a new section on virtual machines, which offers advantages in protection, software management, and hardware management.

The example is the AMD Opteron, giving both its cache hierarchy and the virtual memory scheme for its recently expanded bit addresses. Chapter 6, "Storage Systems," has an expanded discussion of reliability and availability, a tutorial on RAID with a description of RAID 6 schemes, and rarely found failure statistics of real systems.

Rather than go through a series of steps to build a hypothetical cluster as in the last edition, we evaluate the cost, performance, and reliability of a real cluster:

All three processors request access to two different memory modules: In this case two requests can be granted. There are 18 ways (36 accepted requests) in which such a case can arise. All three processors request access to three different memory modules: In this case all three requests can be granted.

There are six ways (18 accepted requests) in which such a case can arise. From the above enumeration, it is clear that of the 27 combinations of 3 requests taken from 3 possible requests, there are 57 requests that can be accepted causing no memory contention. In general, for M memory modules and n processors, if a processor generates a request with probability r in a cycle directed to each memory with equal probability, then the expression for the bandwidth can be computed as follows.
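The enumeration above can be checked mechanically. The sketch below is a minimal illustration, assuming the commonly quoted closed form for crossbar bandwidth, BW = M(1 - (1 - r/M)^n), which is the expected number of memory modules that receive at least one request. The function names are hypothetical, and the brute-force count covers only the case r = 1 with 3 processors and 3 modules.

```python
from itertools import product


def crossbar_bandwidth(n, M, r=1.0):
    """Assumed closed form: expected number of modules receiving at least one request."""
    return M * (1.0 - (1.0 - r / M) ** n)


def enumerate_accepted(n, M):
    """Brute force for r = 1: each of the n processors requests exactly one of M modules."""
    combinations = 0
    accepted = 0
    for choice in product(range(M), repeat=n):
        combinations += 1
        # Each distinct module can grant exactly one request per cycle.
        accepted += len(set(choice))
    return combinations, accepted


combinations, accepted = enumerate_accepted(3, 3)
print(combinations, accepted)          # 27 combinations, 57 accepted requests
print(accepted / combinations)         # average accepted requests per cycle
print(crossbar_bandwidth(3, 3))        # closed form for the same case
```

Under these assumptions the brute-force average (57/27) and the closed form agree, which is a quick sanity check before applying the formula to larger M and n.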

In deriving the above expression, we have assumed that all processors generate requests for memory modules during a given cycle. A similar expression can be derived for the case whereby only a fraction of processors generate requests during a given cycle see the exercise at the end of the chapter.

It consists of M memory modules, n processors, and B buses. A given bus is dedicated to a particular processor for the duration of a bus transaction. A processor-memory transfer can use any of the available buses. The set of M arbiters accepts only one request for each memory module at any given time. Let us assume that a processor generates a request with probability r in a cycle directed to each memory with equal probability.

Therefore, out of all possible memory requests, only up to M memory requests can be accepted. Two cases have to be considered. These are the case where fewer than B different requests are being made while fewer than B buses are being used and the case where B or more different requests are made while all B buses are in use.

One such MIN is the Delta network. This assumption is made such that the results we obtained for the bandwidth of the crossbar network can be utilized. This recursive relation can be extended to compute the number of requests at the output of stage j in terms of the rate of input requests passed on from stage j - 1 as follows: It should be noted that parallel machines attempt to minimize the communication latency by increasing the interconnectivity.

In our discussion, we will show the latency caused by the time spent in switching elements. Latency caused by software overhead, routing delay, and connection delay are overlooked in this discussion.

Average distance, d_a, traveled by a message in a static network, is a measure of the typical number of links (hops) a message has to traverse as it makes its way from any source to any destination in the network. In a network consisting of N nodes, the average distance can be computed using the following relation: Consider, for example, a 4-cube network.

The average distance between two nodes in such a network can be computed as follows. We compute the distance between a given node and all other 15 nodes in the cube. These are shown in Table 3.
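The 4-cube calculation can be sketched as follows. This is a minimal illustration, assuming the usual definition of average distance as the sum of d times N_d over all distances d, divided by N - 1, where N_d is the number of nodes at distance d from the source. In a binary n-cube the distance between two nodes is the Hamming distance between their labels; the function names are hypothetical.

```python
from math import comb


def average_distance_hypercube(n):
    """Average distance in a binary n-cube using N_d = C(n, d) nodes at distance d."""
    nodes = 2 ** n
    total = sum(d * comb(n, d) for d in range(1, n + 1))
    return total / (nodes - 1)


def average_distance_brute_force(n, source=0):
    """Same quantity, computed from the Hamming distance to every other node."""
    nodes = 2 ** n
    total = sum(
        bin(source ^ other).count("1")
        for other in range(nodes)
        if other != source
    )
    return total / (nodes - 1)


print(average_distance_hypercube(4))
print(average_distance_brute_force(4))
```

For the 4-cube, 4 nodes lie at distance 1, 6 at distance 2, 4 at distance 3, and 1 at distance 4, so both functions evaluate 32/15.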

Complexity (cost) of a static network can be measured in terms of the number of links needed to realize the topology of the network. Interconnectivity of a network is a measure of the existence of alternate paths between each source-destination pair. The importance of network connectivity is that it shows the resistance of the network to node and link failures.

Consider, for example, the binary tree architecture. The failure of a node, for example, the root node, can lead to the partitioning of the network into two disjoint halves. Similarly, the failure of a link can lead to the partitioning of the network. We therefore say that the binary tree network has a node connectivity of 1 and a link connectivity of 1.

Based on the above discussion and the information provided in Chapter 2, the following two tables, Tables 3. Having presented a number of performance measures for static and dynamic networks, we now turn our attention to the important issue of parallel architecture scalability. Unless otherwise mentioned, our discussion in this section will assume the scaling up of systems.

In practice, the scalability of a system can be manifested in a number of forms. In terms of speed, a scalable system is capable of increasing its speed in proportion to the increase in the number of processors. Assume for simplicity that m is a multiple of n. The addition can then proceed as follows. The addition operation is performed simultaneously in all processors. Secondly, each pair of neighboring processors can communicate their results to one of them whereby the communicated result is added to the local result.

It is interesting to notice from the table that for the same number of processors, n, a larger instance of the same problem, m, results in an increase in the speedup, S.
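The trend described here can be reproduced with a small cost model. The sketch below is an assumption, not the source's own table: it charges one time unit per addition and one per neighbor-to-neighbor communication, so each processor first adds its m/n local numbers and the partial sums are then combined in log2(n) steps of one communication plus one addition each. The function name is hypothetical.

```python
from math import log2


def speedup(m, n):
    """Speedup of adding m numbers on n processors under the assumed unit-cost model."""
    sequential_time = m                      # roughly m additions on one processor
    parallel_time = m / n + 2 * log2(n)      # local sums, then log2(n) combine steps
    return sequential_time / parallel_time


for n in (4, 16):
    for m in (64, 128, 256):
        print(f"n={n:2d} m={m:3d} S={speedup(m, n):.2f}")
```

With n held fixed, the speedup grows as m grows because the communication term 2*log2(n) is amortized over more local work.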

This is a property of a scalable parallel system. Consider, for example, the above problem of adding m numbers on an n-cube. For example, in a highly scalable parallel system the size of the problem needs to grow linearly. The following relationship applies: It is interesting to note that a sequential algorithm running on a single processor does not suffer from such overhead.

Consider again the problem of adding m numbers using an n-cube. Recall that Gustafson has shown that by scaling up the problem size, m, it is possible to obtain near-linear speedup on as many as processors see Section 3.
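Gustafson's observation can be contrasted with a fixed-size analysis in a few lines. This is a minimal sketch, assuming the standard forms of the two laws: scaled speedup n - (n - 1)f for Gustafson-Barsis and 1 / (f + (1 - f)/n) for Amdahl. Note that f means slightly different things in each: for the scaled law it is the serial fraction of the parallel run, for Amdahl it is the serial fraction of the sequential run. The processor counts and the value of f are hypothetical.

```python
def gustafson_scaled_speedup(n, f):
    """Scaled speedup when the problem grows with n; f is the serial fraction."""
    return n - (n - 1) * f


def amdahl_speedup(n, f):
    """Fixed-size speedup; f is the serial fraction of the sequential run."""
    return 1.0 / (f + (1.0 - f) / n)


f = 0.01  # hypothetical serial fraction
for n in (16, 256, 1024):
    print(n, round(gustafson_scaled_speedup(n, f), 2), round(amdahl_speedup(n, f), 2))
```

The scaled speedup stays close to n as n grows, while the fixed-size speedup saturates near 1/f.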

In addition to the above scalability metrics, there have been a number of other unconventional metrics used by some researchers. A number of these are explained below.

Size scalability measures the maximum number of processors a system can accommodate. Application scalability refers to the ability of running application software with improved performance on a scaled-up version of the system. Consider, for example, an n-processor system used as a database server, which can handle 10, transactions per second.

This system is said to possess application scalability if the number of transactions can be increased to 20, using double the number of processors. Generation scalability refers to the ability of a system to scale up by using next-generation fast components. Heterogeneous scalability refers to the ability of a system to scale up by using hardware and software components supplied by different vendors.

These are size scalability, generation scalability, space scalability, compatibility, and competitiveness. As can be seen, three of these long-term survivability requirements have to do with different forms of scalability. As can be seen from the above introduction, scalability, regardless of its form, is a desirable feature of any parallel system.

Owing to its importance, there has been an evolving design trend, called design for scalability (DFS), which promotes the use of scalability as a major design objective. Two different approaches have evolved as DFS.

These are overdesign and backward compatibility. An illustrative example of such an approach is the design of modern processors with bit address, that is, bytes address space.

It should be noted that the current UNIX operating system supports only bit address space. With memory space overdesign, future transition to bit UNIX can be performed with minimum system changes. The other form of DFS is the backward compatibility. This approach considers the requirements for scaled-down systems.

Backward compatibility allows scaled-up components (hardware or software) to be usable with both the original and the scaled-down systems. As an example, a new processor should be able to execute code generated by old processors. Similarly, a new version of an operating system should preserve all useful functionality of its predecessor such that application software that runs under the old version must be able to run on the new version.

Having introduced a number of scalability metrics for parallel systems, we now turn our attention to the important issue of benchmark performance measurement. Benchmark programs should be designed to provide fair and effective comparisons among high-performance computing systems. For a benchmark to be meaningful, it should evaluate faithfully the performance for the intended use of the system. Whenever advertising for their new computer systems, companies usually quote the benchmark ratings of their systems as a trusted measure.

These ratings are usually used for performance comparison purposes among different competing systems. These are synthetic (not real) benchmarks intended to measure performance of real machines. The Dhrystone benchmark addresses integer performance. This makes the Dhrystone rather unreliable as a source for performance measure. The execution speed obtained using Whetstone is used solely to determine the system performance.

Two measures were derived from SPEC The SPEC92 consists of two suites: In using SPEC for performance measures, three major steps have to be taken: The tools are used to compile, run, and evaluate the benchmarks. The use of the geometric mean to obtain the average time ratio for all programs in the SPEC92 has been subject to a number of criticisms.

The premise for these criticisms is that the geometric mean is bound to cause distortion in the obtained results. For example, Table 3. As can be observed from Table 3.

It is such a drawback that causes skepticism among computer architects for the use of the geometric mean in SPEC. It was because of this observation that Giladi and Ahituv have suggested that the geometric mean be replaced by the harmonic mean. Recall that SPECpeaks are those ratings that are reported by vendors in their advertisement of new products. In addition, it has been reported that a number of tuning parameters are usually used by vendors in obtaining their reported SPECpeak and SPECbase ratings and that reproducibility of those ratings is sometimes impossible.
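The kind of distortion at issue can be illustrated with made-up numbers. The sketch below uses two hypothetical sets of benchmark time ratios (they are not SPEC data) and Python's standard `statistics` module. Both machines end up with the same geometric mean, although one of them owes its rating to a single outlier; the harmonic mean separates the two.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical time ratios (reference time / measured time) for four programs.
ratios_uniform = [2.0, 2.0, 2.0, 2.0]    # consistently twice as fast as the reference
ratios_outlier = [1.0, 1.0, 1.0, 16.0]   # one heavily tuned program, the rest unchanged

for name, ratios in (("uniform", ratios_uniform), ("outlier", ratios_outlier)):
    print(name, geometric_mean(ratios), harmonic_mean(ratios))
```

Whether the harmonic mean is the better summary depends on what is being averaged; the sketch only shows why a single large ratio can make a geometric-mean rating look better than typical behavior.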

As can be seen from the table, while some machines show superior performance to other machines based on the reported SPECbase, they show inferior performance using the SPECpeak, and vice versa. For the abovementioned observations, it became apparent to a number of computer architects that SPEC92 does not predict faithfully the performance of computers on random software for a typical user.

Performance results are therefore shown as ratios compared to that machine. Each metric used by SPEC95 is the aggregate overall benchmark of a given suite by taking the geometric mean of the ratios of the individual benchmarks. In presenting the performance results, SPEC takes the speed metrics to measure the ratios to execute a single copy of the benchmark, while the throughput metrics measure the ratios to execute multiple copies of the benchmark.

The SPECfp is obtained by taking the geometric mean of the ratios of the ten benchmarks of the CFP95, where each benchmark is compiled with aggressive optimization. It was reported that the performance of the 26 benchmarks on the systems ranges from

Two computational models: A rebuttal to a number of critical views about the effectiveness of parallel architectures has been made.

In addition, the Gustafson-Barsis law, which supports the use of multiprocessor architecture, has been introduced. A number of performance metrics for static and dynamic interconnection networks have then been provided. The metrics include the bandwidth, delay, and complexity.

A number of unconventional metrics for scalability has also been discussed. Finally, the issue of benchmark performance measurement has been introduced. Consider the case of a multiple-bus system consisting of 50 processors, 50 memory modules, and 10 buses. Assume that a processor generates a memory request with probability r in a given cycle. In deriving the expression for the bandwidth of a crossbar system, we have assumed that all processors generate requests for memory modules during a given cycle.

Derive a similar expression for the case whereby only a fraction of processors, f, generate requests during a given cycle.

Consider the two cases whereby a processor generates a memory request with probability r in a given cycle and whereby a processor can request any memory module. Consider the case of a binary n-cube having N nodes. Compute the bandwidth of such a cube given that r is the probability that a node receives an external request and n is the probability that a node generates a request either internally or passes on an external request. Assume that a fraction f of the external requests received by a node is passed onwards to another node.

Contrast the following two approaches for building a parallel system. In the second approach, a large number of simple processors are used in which each processor is capable of performing serial computations at a lower rate, F , C. Consider a parallel architecture built using processors each capable of sustaining 0.

What is the condition in terms of f under which the parallel architecture can exceed the performance of the supercomputer? What is the maximum speedup achievable by a parallel form of the algorithm? If the problem size m grows at a rate slower than Θ(n) as the number of processors increases, then the number of processors can exceed the problem size m.

Bell, G. The problem with MPPs.
Chan, Y. Computer Architecture News, 22(4), 60-70.
Cosnard, M. Evaluating speedups on distributed memory architectures. Parallel Computing, 10.
Curnow, H. A synthetic benchmark. The Computer Journal, 19(1), 43-49.
Dixit, K. SPEC developing new component benchmark suites. SPEC Newsletter, 3(4), 14-17.
The SPEC benchmarks. Parallel Computing, 17.
Eager, D.
Ein-Dor, P. CPU power and the cost of computation. Communications of the ACM, 28(2).
Gee, J. Cache performance of the SPEC92 benchmark suite. IEEE Micro, 15(4), 17-27.
Giladi, R. SPEC as a performance evaluation measure. Computer, 28(8), 33-42.
Grama, A. Measuring the scalability of parallel algorithms and architectures.
Gupta, A. Performance properties of large scale parallel systems. Journal of Parallel and Distributed Computing, 19.
Gustafson, J. Communications of the ACM, 31(5).
Henning, J. Measuring CPU performance in the new millennium.
Hill, M. What is scalability? Computer Architecture News, 18(4), 18-21.
Kumar, V. Analyzing scalability of parallel algorithms and architectures. Journal of Parallel and Distributed Computing, 22.
Lubeck, O. A benchmark comparison of three supercomputers. Computer, 18(12), 10-24.
Mirghafori, N. Truth in SPEC benchmarks. Computer Architecture News, 23(5), 34-42.
IEEE Computer, 62-76.
Smith, J. Characterizing computer performance with a single number. Communications of the ACM, 31(10).
SPEC Newsletters, 1-10.

In this category, all processors share a global memory. Communication between tasks running on different processors is performed through writing to and reading from the global memory. All interprocessor coordination and synchronization is also accomplished via the global memory. A shared memory computer system consists of a set of independent processors, a set of memory modules, and an interconnection network as shown in Figure 4.

Two main problems need to be addressed when designing a shared memory system: Performance degradation might happen when multiple processors are trying to access the shared memory simultaneously. A typical design might use caches to solve the contention problem.

However, having multiple copies of data, spread throughout the caches, might lead to a coherence problem. The copies in the caches are coherent if they are all equal to the same value.

However, if one of the processors writes over the value of one of the copies, then the copy becomes inconsistent because it no longer equals the value of the other copies. In this chapter we study a variety of shared memory systems and their solutions of the cache coherence problem. If a new request arrives while the memory is busy servicing a previous request, the memory module sends a wait signal, through the memory controller, to the processor making the new request.

In response, the requesting processor may hold its request on the line until the memory becomes free or it may repeat its request some time later. If the arbitration unit receives two requests, it selects one of them and passes it to the memory controller.

Again, the denied request can be either held to be served next or it may be repeated some time later. Based on the interconnection network used, shared memory systems can be categorized in the following categories.

All processors have equal access time to any memory location. The interconnection network used in the UMA can be a single bus, multiple buses, or a crossbar switch. A typical bus-structured SMP computer, as shown in Figure 4. In the extreme, the bus contention might be reduced to zero after the cache memories are loaded from the global memory, because it is possible for all instructions and data to be completely contained within the cache. This memory organization is the most popular among M P1 P2 Figure 4.

However, the access time to modules depends on the distance to the processor. Among these are the tree and the hierarchical bus networks. Figure 4. There is no memory hierarchy and the address space is made of all the caches. There is a cache directory D that helps in remote cache access.

The simplest network for shared memory systems is the bus.


However, the bus may get saturated if multiple processors are trying to access the shared memory via the bus simultaneously. A typical bus-based design uses caches to solve the bus contention problem.

High-speed caches connected to each processor on one side and the bus on the other side mean that local copies of instructions and data can be supplied at the highest possible rate. One of the goals of the cache is to maintain a high hit rate, or low miss rate under high processor loads.

A high hit rate means the processors are not using the bus as much. Hit rates are determined by a number of factors, ranging from the application programs being run to the manner in which cache hardware is implemented. Typically, individual processors execute less than one instruction per cycle, thus reducing the number of times they need to access memory.

Subscalar processors execute less than one instruction per cycle, and superscalar processors execute more than one instruction per cycle. In any case, we want to minimize the number of times each local processor tries to use the central bus. Otherwise, processor speed will be limited by bus bandwidth. If each processor is running at a speed of $V$ and $h$ is the cache hit rate, then misses are being generated at a rate of $V(1 - h)$.

For an $N$-processor system, misses are simultaneously being generated at a rate of $N(1 - h)V$. This leads to saturation of the bus when $N$ processors simultaneously try to access the bus.
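The saturation argument can be turned into a small calculation. The sketch below assumes the bus can serve at most B requests per second, so the system stays unsaturated while N(1 - h)V does not exceed B. The symbol B, the function names, and the numeric values are illustrative assumptions and are not taken from the text.

```python
import math


def max_processors(bus_bandwidth, speed, hit_rate):
    """Largest N satisfying N * (1 - h) * V <= B."""
    misses_per_processor = speed * (1.0 - hit_rate)
    return math.floor(bus_bandwidth / misses_per_processor)


def required_hit_rate(bus_bandwidth, speed, n_processors):
    """Smallest hit rate h satisfying N * (1 - h) * V <= B."""
    return 1.0 - bus_bandwidth / (n_processors * speed)


# Illustrative values (assumptions, not figures from the text).
V = 10_000_000  # memory references per second issued by each processor
B = 1_000_000   # requests per second the bus can serve
h = 0.97        # cache hit rate

n_max = max_processors(B, V, h)          # how many processors the bus can feed
h_needed = required_hit_rate(B, V, 30)   # hit rate needed for 30 processors
```

With these assumed values the bus saturates beyond three processors, and a 30-processor system would need a hit rate of roughly 0.9967, which shows how sensitive bus-based designs are to small changes in the hit rate.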

Thus, the system we have in mind can support only three processors! We might ask what hit rate is needed to support a processor system. Increasing h by 2.

Cache coherence algorithms are needed to maintain a level of consistency throughout the parallel system. When a task running on a processor P requests the data in memory location X, for example, the contents of X are copied to the cache, from which they are passed on to P.

When P updates the value of X in the cache, the other copy in memory also needs to be updated in order to maintain consistency. In write-through, the memory is updated every time the cache is updated, while in write-back, the memory is updated only when the block in the cache is being replaced. Table 4. Now, suppose processor Q also accesses X. What happens if Q wants to write a new value over the old value of X? There are two fundamental cache coherence policies: write-invalidate and write-update. Write-invalidate maintains consistency by reading from local caches until a write occurs.

When any processor updates the value of X through a write, posting a dirty bit for X invalidates all other copies. For example, processor Q invalidates all other copies of X when it writes a new value into its cache.

This sets the dirty bit for X. However, when processor P wants to read X, it must wait until X is updated and the dirty bit is cleared. Write-update maintains consistency by immediately updating all copies in all caches. All dirty bits are set during each write operation.

The two write policies and the two coherence policies combine into four approaches: write-update and write-through; write-update and write-back; write-invalidate and write-through; and write-invalidate and write-back. If we permit a write-update and write-through directly on global memory location X, the bus would start to get busy and ultimately all processors would be idle while waiting for writes to complete. In write-update and write-back, only copies in all caches are updated.

On the contrary, if the write is limited to the copy of X in cache Q, the caches become inconsistent on X. Setting the dirty bit prevents the spread of inconsistent values of X, but at some point, the inconsistent copies must be updated. Global memory is moved in blocks, and each block has a state associated with it, which determines what happens to the entire contents of the block.
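The difference between the two coherence policies can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions: each cache is a dictionary from address to value, a block holds a single value, and global memory is not modeled. The function name `write` and the value 10 are hypothetical.

```python
def write(caches, writer, addr, value, policy):
    """Apply a write by cache `writer` under a coherence policy.

    caches: list of dicts mapping address -> value, one dict per processor.
    policy: "invalidate" (drop remote copies) or "update" (refresh them).
    """
    if policy not in ("invalidate", "update"):
        raise ValueError(f"unknown policy: {policy}")
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache:
            continue
        if policy == "invalidate":
            del cache[addr]      # remote copy is dropped; its next read misses
        else:
            cache[addr] = value  # remote copy is refreshed in place


# P is cache 0 and Q is cache 1; both hold X = 5, then Q writes 10.
inv = [{"X": 5}, {"X": 5}]
write(inv, writer=1, addr="X", value=10, policy="invalidate")

upd = [{"X": 5}, {"X": 5}]
write(upd, writer=1, addr="X", value=10, policy="update")
```

Under write-invalidate, P no longer holds X and must fetch it again on its next read. Under write-update, both caches hold the new value immediately, at the cost of bus traffic on every write.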

A cache miss means that the requested block is not in the cache or it is in the cache but has been invalidated. Snooping protocols differ in whether they update or invalidate shared copies in remote caches in case of a write operation.

They also differ as to where to obtain the new data in the case of a cache miss. In what follows we go over some examples of snooping protocols that maintain cache coherence.

State              Description
Invalid [INV]      The copy is inconsistent.

Event              Actions
Read-Hit           Use the local copy from the cache.
Read-Miss          Fetch a copy from global memory. Set the state of this copy to Valid.
Write-Hit          Perform the write locally. Broadcast an Invalid command to all caches. Update the global memory.
Write-Miss         Get a copy from global memory. Broadcast an Invalid command to all caches. Update the local copy and set its state to Valid.
Block replacement  Since memory is always consistent, no write-back is needed when a block is replaced.

Multiple processors can read block copies from main memory safely until one processor updates its copy. At this time, all cache copies are invalidated and the memory is updated to remain consistent.
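The write-invalidate write-through actions described above can be sketched as a small simulation. It is only an illustration under stated assumptions: a block holds a single value, a copy is Valid exactly when the address is present in a cache, and the class name, method names, and the written value 10 are hypothetical.

```python
class WriteThroughInvalidateBus:
    """Toy bus with write-invalidate, write-through caches."""

    def __init__(self, memory, n_caches):
        self.memory = dict(memory)
        # One dict per cache: address -> value. Present means Valid.
        self.caches = [{} for _ in range(n_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:
            # Read-Miss: fetch a copy from global memory; it becomes Valid.
            cache[addr] = self.memory[addr]
        # Read-Hit (or the freshly fetched copy): use the local copy.
        return cache[addr]

    def write(self, cpu, addr, value):
        # Broadcast an Invalid command to all other caches.
        for i, other in enumerate(self.caches):
            if i != cpu:
                other.pop(addr, None)
        # Update the local copy (Valid) and write through to memory.
        self.caches[cpu][addr] = value
        self.memory[addr] = value


# Two processors, P = 0 and Q = 1; X starts at 5 in memory.
bus = WriteThroughInvalidateBus({"X": 5}, n_caches=2)
p_first = bus.read(0, "X")   # P misses and loads X from memory
q_first = bus.read(1, "X")   # Q misses and loads X from memory
bus.write(1, "X", 10)        # Q writes: P's copy is invalidated
p_second = bus.read(0, "X")  # P misses again and sees the new value
```

Because every write goes through to memory, a read miss can always be served from memory, which is why no write-back is needed on replacement in this scheme.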

The block states and protocol are summarized in Table 4. Example 2 Consider a bus-based shared memory with two processors P and Q as shown in Figure 4. Let us see how the cache coherence is maintained using the Write-Invalidate Write-Through protocol. Assume that X in memory was originally set to 5 and the following operations were performed in the order given:

Multiple processors can safely read these blocks from their caches until one processor updates its copy.

At this time, the writer becomes the only owner of the valid block and all other copies are invalidated. Example 3 Consider the shared memory system of Figure 4.

State                      Description
Shared Read-Only           Multiple copies can be in this state.
Exclusive Read-Write [RW]  Only one valid cache copy exists and can be read from and written to safely. Copies in other caches are invalid.

Event              Action
Read-Hit           Use the local copy from the cache.
Read-Miss          If no Exclusive Read-Write copy exists, supply a copy from global memory. Set the state of this copy to Shared Read-Only. If an Exclusive Read-Write copy exists, make a copy from the cache that set the state to Exclusive Read-Write, update global memory and local cache with the copy. Set the state to Shared Read-Only in both caches.
Write-Hit          If the copy is Exclusive Read-Write, perform the write locally. If the state is Shared Read-Only, then broadcast an Invalid to all caches. Set the state to Exclusive Read-Write.
Write-Miss         Get a copy from either a cache with an Exclusive Read-Write copy, or from global memory itself. Update the local copy and set its state to Exclusive Read-Write.
Block replacement  If a copy is in an Exclusive Read-Write state, it has to be written back to main memory if the block is being replaced. If the copy is in Invalid or Shared Read-Only states, no write-back is needed when a block is replaced.

Subsequent writes are performed using write-back. Example 4 Consider the shared memory system of Figure 4. There is also a special bus line, which is asserted to indicate that at least one other cache is sharing the block.
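The write-invalidate write-back behavior with Shared Read-Only and Exclusive Read-Write states can be sketched in the same toy style. The assumptions are that a block holds a single value, that a write miss invalidates every other copy (needed so the writer ends up as the only owner), and that the class name, method names, and test values are hypothetical.

```python
INV, RO, RW = "Invalid", "Shared Read-Only", "Exclusive Read-Write"


class WriteBackInvalidateBus:
    """Toy bus with write-invalidate, write-back caches."""

    def __init__(self, memory, n_caches):
        self.memory = dict(memory)
        # One dict per cache: address -> [state, value].
        self.lines = [{} for _ in range(n_caches)]

    def state(self, cpu, addr):
        return self.lines[cpu].get(addr, [INV, None])[0]

    def _owner(self, addr):
        for cpu in range(len(self.lines)):
            if self.state(cpu, addr) == RW:
                return cpu
        return None

    def read(self, cpu, addr):
        if self.state(cpu, addr) == INV:
            # Read-Miss: an Exclusive Read-Write owner, if any, supplies the
            # data, memory is updated, and both copies become Shared Read-Only.
            owner = self._owner(addr)
            if owner is not None:
                self.memory[addr] = self.lines[owner][addr][1]
                self.lines[owner][addr][0] = RO
            self.lines[cpu][addr] = [RO, self.memory[addr]]
        return self.lines[cpu][addr][1]

    def write(self, cpu, addr, value):
        if self.state(cpu, addr) != RW:
            # Shared Read-Only hit or Write-Miss: invalidate all other copies.
            for i, lines in enumerate(self.lines):
                if i != cpu and addr in lines:
                    lines[addr][0] = INV
        # The writer now holds the only valid copy; memory is not updated.
        self.lines[cpu][addr] = [RW, value]

    def evict(self, cpu, addr):
        # Block replacement: only an Exclusive Read-Write copy is written back.
        state, value = self.lines[cpu].pop(addr, [INV, None])
        if state == RW:
            self.memory[addr] = value
```

A short usage run: after `bus = WriteBackInvalidateBus({"X": 5}, 2)`, both caches read X, then `bus.write(1, "X", 10)` leaves memory stale at 5 until cache 0 reads X again or cache 1 evicts the block.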

Example 5 Consider the shared memory system of Figure 4. Example 6 Consider the shared memory system of Figure 4.

State              Description
Reserved [RES]     Data have been written exactly once and the copy is consistent with global memory. There is only one copy of the global memory block in one local cache.
Dirty              When a copy is dirty, it must be written back to global memory.

Event              Action
Read-Miss          If no Dirty copy exists, then supply a copy from global memory. If a Dirty copy exists, make a copy from the cache that set the state to Dirty, update global memory and local cache with the copy.
Write-Hit          If the copy is Dirty or Reserved, perform the write locally and set the state to Dirty. If the state is Valid, then broadcast an Invalid command to all caches. Update the global memory and set the state to Reserved.
Write-Miss         Get a copy from either a cache with a Dirty copy or from global memory itself. Update the local copy and set its state to Dirty.
Block replacement  If a copy is in a Dirty state, it has to be written back to main memory if the block is being replaced. If the copy is in Valid, Reserved, or Invalid states, no write-back is needed when a block is replaced.

TABLE 4.
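The write-once states can be viewed as a small state machine for a single cache line. The sketch below covers only accesses made by the local processor. It assumes the standard write-once behavior in which the first write to a Valid copy is written through (Reserved) and later writes stay local (Dirty). The function names are hypothetical, and transitions caused by other caches are deliberately left out.

```python
INVALID, VALID, RESERVED, DIRTY = "Invalid", "Valid", "Reserved", "Dirty"


def local_access(state, op):
    """Return (next_state, wrote_through) for a local read or write.

    wrote_through is True only when this access itself updates global memory.
    """
    if op == "read":
        # A Read-Miss loads a Valid copy; a Read-Hit leaves the state alone.
        return (VALID, False) if state == INVALID else (state, False)
    if op == "write":
        if state == VALID:
            # First write: invalidate other copies and write through once.
            return RESERVED, True
        # Reserved or Dirty hit, or a Write-Miss: the write stays local.
        return DIRTY, False
    raise ValueError(f"unknown operation: {op}")


def needs_write_back(state):
    """Only a Dirty copy must be written back when the block is replaced."""
    return state == DIRTY


def trace(ops, state=INVALID):
    """Apply a sequence of local operations and record each step."""
    history = []
    for op in ops:
        state, wrote_through = local_access(state, op)
        history.append((state, wrote_through))
    return history


steps = trace(["read", "write", "write", "write"])
```

Starting from Invalid, the sequence read, write, write, write passes through Valid, Reserved, Dirty, and Dirty, and only the first write reaches global memory.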

State              Description
Shared             All copies are consistent with memory.
Dirty              It is not consistent with global memory. Copy ownership.

Event              Action
Read-Hit           State does not change.
Read-Miss          If no other cache copy exists, then supply a copy from global memory. Set the state of this copy to Valid Exclusive. If a cache copy exists, make a copy from the cache. Set the state to Shared in both caches. If the cache copy was in a Dirty state, the value must also be written to memory.
Write-Hit          Perform the write locally and set the state to Dirty. If the state is Shared, then broadcast data to memory and to all caches and set the state to Shared. If other caches no longer share the block, the state changes from Shared to Valid Exclusive.
Write-Miss         The block copy comes from either another cache or from global memory. If the block comes from another cache, perform the update and update all other caches that share the block and global memory. Set the state to Shared. If the copy comes from memory, perform the write and set the state to Dirty.
Block replacement  If the copy is in Valid Exclusive or Shared states, no write-back is needed when a block is replaced.

For example, when a multistage network is used to build a large shared memory system, the broadcasting techniques used in the snoopy protocols become very expensive.

In such situations, coherence commands need to be sent to only those caches that might be affected by an update. This is the idea behind directory-based protocols. Cache coherence protocols that somehow store information on where copies of blocks reside are called directory schemes.

A directory is a data structure that maintains information on the processors that share a memory block and on its state. A central directory maintains information about all blocks in a central data structure. While a central directory includes everything in one location, it becomes a bottleneck and suffers from large search time. To alleviate this problem, the same information can be handled in a distributed fashion by allowing each memory module to maintain a separate directory. In a distributed directory, the entry associated with a memory block has only one pointer, to one of the caches that requested the block.

Each entry might also contain a dirty bit to specify whether or not a unique cache has permission to write this memory block. Most directory-based protocols can be categorized under three categories.

Full-Map Directories: In a full-map setting, each directory entry contains N pointers, where N is the number of processors.

Therefore, there could be N cached copies of a particular block shared by all processors. Set the state to Shared Clean. If the supplying cache copy was in a Dirty or Shared Dirty state, its new state becomes Shared Dirty.
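A full-map directory entry, with one pointer per processor and a dirty bit, can be sketched as follows. This is a minimal illustration: the pointers are modeled as presence flags, the class and method names are hypothetical, and the write-back of the owner's data on a later read is assumed rather than modeled.

```python
class FullMapEntry:
    """Directory entry for one memory block in a full-map scheme."""

    def __init__(self, n_processors):
        # One presence flag ("pointer") per processor.
        self.present = [False] * n_processors
        # True while a single cache has permission to write the block.
        self.dirty = False

    def add_sharer(self, cpu):
        """Record that `cpu` now caches the block for reading."""
        self.present[cpu] = True
        # A new reader ends exclusive write permission (the owner's
        # write-back is assumed to have happened and is not modeled).
        self.dirty = False

    def grant_write(self, cpu):
        """Give `cpu` write permission; return the caches to invalidate."""
        targets = [i for i, held in enumerate(self.present)
                   if held and i != cpu]
        self.present = [i == cpu for i in range(len(self.present))]
        self.dirty = True
        return targets


entry = FullMapEntry(n_processors=4)
for cpu in (0, 2, 3):
    entry.add_sharer(cpu)
to_invalidate = entry.grant_write(2)  # only caches 0 and 3 are notified
```

The point of the directory is visible in `grant_write`: invalidations go only to the caches recorded as sharers, instead of being broadcast to every cache as in the snoopy protocols.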

The direct copying of one's own writings qualifies as plagiarism if the fact that the work has been or is to be presented elsewhere is not acknowledged. Plagiarism is a serious offence and will always result in imposition of a penalty. In deciding upon the penalty the Department will take into account factors such as the year of study, the extent and proportion of the work that has been plagiarized, and the apparent intent of the student.

The penalties that can be imposed range from a minimum of a zero mark for the work (without allowing resubmission), through caution, to disciplinary measures such as suspension or expulsion.

Tutorial 1 Computer Components. Classification of computer architectures. Performance of computer architecture. Assignment 1 Pipelining with Regular Instructions.

Optimization of Pipelining. Advanced Processor Technology. Vector Instruction Types. VLIW Processors. Assignment 2 Hierarchical Memory Technology. Inclusion, Coherence and Locality. Cache Memory Organization. Cache Addressing Models. Tutorial 3 Hierarchical Bus System.

Backplane Bus Specification. Assignment 3 Arbitration, Transaction and Interrupt. Shared-Memory Multiprocessors. Distributed-Memory Multiprocessors. Tutorial 7 System Interconnect architecture. Network Properties. Static Connection Network. Dynamic Connection Network. Seminar Specimen Disk Arrays.

Attendance Policy: If the excuse is approved by the Dean, the student shall be considered to have withdrawn from the course.

Module References: Students will be expected to give the same attention to these references as given to the Module textbook(s). 1.

Sima, T. Fountain, P.