带你读《计算机组成与体系结构:性能设计(英文版·原书第10版)》之二:Performance Issues

简介: 本书以Intel x86体系结构和ARM两个处理器系列为例,将当代计算机系统性能设计问题与计算机组成的基本概念和原理紧密联系起来,介绍了当代计算机体系结构的主流技术和最新技术。本书作者曾13次获a得美国教材和学术专著作者协会颁发的年度最佳计算机科学教材奖。目前,他是一名独立顾问,为众多计算机和网络制造商、软件开发公司以及政府前沿研究机构提供服务。

点击查看第一章
点击查看第三章

Chapter2

Performance Issues

image.png

This chapter addresses the issue of computer system performance. We begin with a consideration of the need for balanced utilization of computer resources, which provides a perspective that is useful throughout the book. Next we look at contemporary computer organization designs intended to provide performance to meet cuncnt and projected demand. Finally, we look at tools and models that have been developed to provide a means of assessing comparative computer system performance.

2.1 DESIGNING FOR PERFORMANCE

Year by year, the cost of computer systems continues to drop dramatically, while the performance and capacity of those systems continue to rise equally dramatically. Today's laptops have the computing power of an IBM mainframe from 1() or 15 years ago.Thus, we have virtually “free” computer power. Processors arc so inexpensive that we now have microprocessors we throw away. The digital pregnancy test is an example (used once and then thrown away). And this continuing technological revolution has enabled the development of applications of astounding complexity and power. For example, desktop applications that require the great power of today's microprocessor-based systems include

  • Image processing
  • Three-dimensional rendering
  • Speech recognition
  • Videoconferencing
  • Multimedia authoring
  • Voice and video annotation of files
  • Simulation modeling

Workstation systems now support highly sophisticated engineering and scientific applications and have the capacity to support image and video applications. In addition. businesses arc relying on increasingly powerful servers to handle transaction and database processing and to support massive clicnt^crvcr networks (hat have replaced the huge mainframe computer centers of yesteryear. As well, cloud service providers use massive high-performance banks of servers to satisfy high-volume, high-transaction-ratc applications for a broad spectrum of clients.
What is fascinating about ail (his from the perspective of computer organization and architecture is that. on the one hand, the basic buikling blocks for today's computer miracles arc virtually the same as those of the IAS computer from over 50 years ago. while on the other hand, the techniques for squeezing the maximum performance out of the materials at hand have become increasingly sophisticated.
This observation serves as a guiding principle for the presentation in (his book. As we progress through the various elements and components of a computer, two objectives arc pursued. First. the book explains the fundamental functionality in each area under consideration, and second, the book explores those techniques required to achieve maximum performance. In the remainder of this section, we highlight some of the driving factors behind the need to design for performance.
Microprocessor Speed
What gives Intel x86 processors or IBM mainframe computers such mind-boggling power is the relentless pursuit of speed by processor chip manufacturers.Hie evolution of these machines continues to bear out Moore's law, described in Chapter 1. So long as this law holds.chipmakers can unleash a new generation of chips every three years—with four times as many transistors In memory chips, this has quadrupled the capacity of dynamic random-access memory (DRAM). still the basic technology for computer main memory, every three years. In microprocessors, the addition of new circuits, and the speed boost that comes from reducing the distances between them, has improved performance four- or fivefold every three years or so since Intel launched its x86 family in 1978.
But the raw speed of the microprocessor will not achieve its potential unless it is fed a constant stream of work to do in the form of computer instructions. Anything that gets in the way of that smooth flow undermines the power of the processor. Accordingly, while the chipmakers have been busy learning how to fabricate chips of greater and greater density, the processor designers must come up with ever more elaborate techniques for feeding the monster. Among the tcchnicucs built into contemporary processors arc the following:

  • Pipelining: The execution of an instruction involves multiple stages of operation. including fetching the instruction, decoding the opcode, fetching operands. performing a calculation, and so on. Pipelining enables a processor to work simultaneously on multiple instructions by performing a different phase for each of the multiple instructions at the same time. The processor overlaps operations by moving data or instructions into a conceptual pipe with ail stages of the pipe processing simultaneously. For example. while one instruction is being executed, the computer is decoding the next instruction. This is the same principle as seen in an assembly line.
  • Branch prediction: The processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, arc likely to be processed next. If the processor guesses right most of the time. i( can prefetch the correct instructions and buffer them so that the processor is kept busy. The more sophisticated examples of this strategy predict not just
  1. next branch but multiple branches ahead. Thus, branch prediction poten- (ially increases the amount of work available for the processor to execute.
  • Superscalar execution: This is the ability to issue more than one instruction in every processor clock cycle. In effect. multiple parallel pipelines arc used.
  • Data flow analysis: The processor analyzes which instructions arc dependent on each other's results, or data, to create an optimized schedule of instructions. In fact, instructions arc scheduled to be executed when ready, independent of the original program order. This prevents unnecessary delay.
  • Speculative execution: Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations. This enables the processor to keep its execution engines as busy as possible by executing instructions that arc likely to be needed.

These and other sophisticated techniques arc made necessary by the sheer power of the processor. Collectively they make it possible to execute many instructions per processor cycle, rather than to take many cycles per instruction.
Performance Balance
While processor power has raced ahead at breakneck speed, other critical components of the computer have not kept up.Thc result is a need to look for performance balance: an adjustmcnl/tuning of the oi^anization and architecture to compensate for the mismatch among the capabilities of the various components.
The problem created by such mismatches is particularly critical at the ir.tcr- face between processor and main memory. While processor speed has grown rapidly. the speed with which data can be transferred between main memory and the processor has lagged badly. The interface between processor and main memory is the most crucial pathway in the entire computer because i( is responsible for carrying a constant flow of program instructions and data between memory chips and the processor. If memory or the pathway fails to keep pace with the processor's insistent demands, the processor stalls in a wait state, and valuable processing time is lost.
A system architect can attack this problem in a number of ways, all of which arc reflected in contemporary computer designs. Consider the following examples:

  • Increase the number of bits that arc retrieved at one time by making DRAMs “wider" rather than "deeper" and by using wide bus data paths.
  • Change the DRAM interface to make i( more efficient by including a cache1 or other buffering scheme on the DRAM chip.
  • Reduce the frequency of memory access by incorporating increasingly com- plcx and efficient cache structures between the processor and main memory. This includes the incorporation of one or more caches on the processor chp as well as on an off-chip cache close to the processor chip.
  • Increase the intcrconnccl bandwidth between processors and memory by using higher-speed buses and a hierarchy of buses to buffer and structure data flow.

Another area of design focus is the handling of I/O devices. As computers become faster and more capable, more sophisticated applications arc developed that support the use of peripherals wilh intensive I/O demands. Figure 2.1 gives some examples of typical peripheral devices in use on personal computers and workstations. These devices create tremendous data throughput demands. While the current generation of processors can handle the data pumped out by these devices, there remains the problem of getting that data moved between processor and peripheral. Strategies here include caching and buffering schemes plus the use of higher-speed interconnection buses and more elaborate interconnection structures. In addition, the use of multiple-processor configurations can aid in satisfying I/O demands.
The key in ail this is balance. Designers constantly strive to balance the throughput and processing demands of the processor components, main memory, I/O devices, and the interconnection structures. This design must constantly be rethought to cope with two constantly evolving factors:

  • The rate at which performance is changing in the various technology areas (processor, buses, memory. peripherals) differs greatly from one type of element to another.
  • New applications and new peripheral devices constantly change the natun: of lhe demand on the system in terms of typical instruction profile and the data access patterns.

image.png

Thus, computer design is a constantly evolving art form. This book attempts to present the fundamentals on which this art form is based and to present a survey of the current state of that art.
Improvements in Chip Organization and Architecture
As designers wrestle with the challenge of balancing processor performance with (ha( of main memory and other computer components, the need to increase processor speed remains. 'Dicrc arc three approaches to achieving increased processor speed:

  • Increase the hardware speed of the processor. This increase is fundamcnlally due to shrinking the size of the logic gates on the processor chip, so that more gates can be packed together more tightly and to increasing the clock rate. With gates closer together. the propagation time for signals is significantly reduced, enabling a speeding up of the processor. An increase in clock rate means that individual operations arc executed more rapidly.
  • Increase the size and speed of caches that arc interposed between the processor and main memory. In particular, by dedicating a portion of the processor chip itself to the cache. cache access times drop significantly.
  • Make changes to the processor organization and architecture that increase the effective speed of instruction execution. Typically, this involves using parallelism in one form or another.

Traditionally, the dominant factor in performance gains has been in increases in clock speed due and logic density. However. as clock speed and logic density increase, a number of obstacles become more significant [INTE04]:

  • Power: As the density of logic and the clock speed on a chip increase. so docs the power density (Watts/cm2).The difficulty of dissipating the heat generated on high-density, high-speed chips is becoming a serious design issue [GIBBW. BORK03].
  • RC delay: The speed at which electrons can flow on a chip between transistors is limited by the resistance and capacitance of the metal wires connecting them; specifically, delay increases as the RC product increases. As components on the chip decrease in size, the wire interconnects become thinner, increasing resistance. Also, the wires arc closer together, increasing capacitance.
  • Memory latency and throughput: Memory access speed (latency) and transfer speed (throughput) lag processor speeds, as previously discussed.

Thus, there will be more emphasis on organization and arc hi tcctural approaches to improving performance. These techniques arc discussed in later chapters of the text.
Beginning in the late 1980s. and continuing for about 15 years, two main strategies have been used to increase performance beyond what can be achieved simply by increasing dock speed. First. there has been an increase in cache capacity. There arc now typically two or three levels of cache between the processor and main memory. As chip density has increased, more of the cache memory has been incorpor- alcd on the chip, enabling faster cache access. For example. the original Pentium chip devoted about 10% of on-chip area to a cache. Contemporary chips devote over half of the chip area to caches. And. typically, about three-quarters of the other half is for pipeline-related control and buffering.
Second, the instruction execution logic within a processor has become increasingly complex to enable parallel execution of instructions within the processor. Two noteworthy design approaches have been pipelining and superscalar. A pipeline works much as an assembly line in a manufacturing plan! enabling different stages of execution of different instructions to occur at the same time along the pipeline. A superscalar approach in essence allows multiple pipelines within a single processor, so (ha( instructions (ha( do not depend on one another can be executed in paralci.
By the mid to late 90s. both of these approaches were reaching a point of diminishing returns. The internal organization of contemporary processors is exceedingly complex and is able to squeeze a great deal of parallelism out of the instruction stream. It seems likely that further significant increases in this direction will be relatively modest [GIBB04]. With three levels of cache on the processor chip, cadi level providing substantial capacity, i( also seems that the benefits from the cache arc reaching a limit.
However. simply relying on increasing clock rate for increased performance runs into the power dissipation problem already referred to. The faster the clock rate, the greater the amount of power to be dissipated, and some fundamental physical limits arc being reached.
Figure 22 illustrates the concepts we have been discussing.2 The top line shows that, as per Moore's Law. the number of transistors on a single chip continues to grow exponentially.3 Meanwhile. the clock speed has leveled off. in order to prevent a further rise in power. To continue increasing performance, designers have had to find ways of exploiting the growing number of transistors other than simply building a more complex processor. The response in recent years has been the development of the multicorc computer chip.

image.png

2.2 MULTICORE, MICS, AND GPGPUS

With ail of the difficulties cited in the preceding section in mind, designers Eave tumedto a fundamentally new approach to improving performance: placing m ultiplc processors on the same chip, with a large shared cache. Hie use of multiple processors on the same chip, also referred to as multiple cores, or multicore, provides the potential to increase performance without increasing the clock rate. Studies indicate that. within a processor. the increase in performance is roughly proportional to the square root of the increase in complexity [BORK03]. But if the software can support the effective use of multiple processors, then doubling the number of processors almost doubles performance.Thus, the strategy is to use two simpler processors on the chip rather than one more complex processor.
In addition, with two processors, larger caches arc justified. This is important because the power consumption of memory logic on a chip is much less than that of processing logic.
As the logic density on chips continues to rise, the trend for both more cores and more cache on a single chip continues. Two-corc chips were quickly followed by four-core chips, then 8. then 16. and so on. As the caches became largcr.it rradc performance sense to create two and then three ievek of cache on a chip, with initially. the first-level cache dedicated to an individual processor and levels two and three being shared by ail the processors. It is now common for the seco nd-level cache to also be private to each core.
Chip manufacturers arc now in the process of making a huge leap forward in the number ofcorcs per chip, with more than 50 cores per chip. The leap in performance as well as the challenges in developing software to exploit such a large number ofcorcs has led to the introduction of a new term: many integrated core (MIC).
The multicorc and MIC strategy involves a homogeneous collection of general- purpose processors on a single chip. At the same time. chip manufacturers arc pursuing another design option: a chip with multiple general-purpose processors plus graphics processing units (GPUs) and specialized cores for video processing and other tasks. In broad terms, a GPU is a core designed to perform parallel operations on graphics data. Traditionally found on a plug-in graphics card (display adapter), it is used to encode and render 2D and 3D graphics as well as process video.
Since GPUs perform parallel operations on multiple sets of data, they arc increasingly being used as vector processors for a variety of applications that require repetitive computations. This blurs the line between the GPU and the CPU
[AROR12. FAT AOS. PROP11]. When a broad range of applications arc supported by such a processor, the term general-purpose computing on GPUs (GPGPU) is used.
We explore design characteristics of multicorc computers in Chapter 18 and GPGPUs in Chapter 19.

2.3 TWO LAWS THAT PROVIDE INSIGHT: AHMDAHUS LAW AND LITTLE'S LAW

In this section, we look at two equations, called “laws." The two laws arc unrehted but both provide insight into the performance of parallel systems and multicore systems.
Amdahl's Law
Computer system designers look for ways to improve system performance by advances in technology or change in design. Examples include the use of parallel processors, the use of a memory cache hierarchy, and speedup in memory access time and I/O transfer rate due to technology improvements. In ail of these cascs.it is important to note (hat a speedup in one aspect of the technology or design docsnot result in a corresponding improvement in performance.This limilation is succinctly expressed by Amdahl's law.
Amdahl's law was first proposed by Gene Amdahl in 1967 ([AMDA67], IAMDA13]) and deals with the potential speedup of a program using multiple processors compared to a single processor. Consider a program running on a single processor such that a fraction (1 — f) of the execution time involves code that is inherently sequential, and a fraction/that involves code that is infinitely parallelizable with no scheduling overhead. Lei『be the total execution time of the program using a single processor. Then the speedup using a parailci processor with N processors that fully exploits the parallel portion of the program is as follows:

image.png

This equation is illustrated in Figures 23 and 2.4. Two important conclusions can be drawn:
1.When f is small, the use of parallel processors has little effect.
2.As N approaches infinity. speedup is bound by 1/(1 - f), so that there arc diminishing returns for using more processors.
These conclusions arc too pessimistic, an assertion first put forward in [GUS1B8]. For example, a server can maintain multiple threads or multiple tasks to handle multiple clients and execute the threads or tasks in parallel up to the limit of the number of processors. Many database applications involve computations on massive amounts of data that can be split up into multiple parallel tasks.

image.png

Nevertheless, Amdahl's law illustrates the problems facing industry in the dcvclop- ment of multicorc machines with an ever-growing number of cores: The software that runs on such machines must be adapted to a highly parallel execution environment to exploit the power of parallel processing.
Amdahl's law can be generalized to evaluate any design or technical improvement in a computer system. Consider any enhancement to a feature of a system that results in a speedup. The speedup can be expressed as

image.png

Suppose (hat a feature of the system is used during execution a fraction of the time f. before enhancement. and that the speedup of (ha( feature after enhancement is SUf. Then the overall speedup of the system is

image.png

Little's Law
A fundamcnial and simple relation with broad applications is Lillie's Law [LITT61, LITT11].4 We can apply it to almost any system that is statistically in steady state, and in which there is no leakage. Specifically, we have a steady state system to which items arrive at an average rate of A items per unit time. The items stay in the system an average of W units of time. Finally, there is an average of L units in the system at any one time. Little's Law relates these three variables as L = AW.
Using queuing theory terminology. Little's Law applies to a queuing system. The central element of the system is a server, which provides some service to items. Items from some population of items arrive at the system to be served. If the server is idle. an item is served immediately. Otherwise, an arriving item joins a waiting line, or queue. There can be a single queue for a single server. a single queue for multiple servers, or multiples queues, one for each of multiple servers. When a server has completed serving an item, the item departs. If there arc items waiting in the queue, one is immediately dispatched to the server. The server in this model can represent anything that performs some function or service for a collection of items. Examples: A processor provides service to processes: a transmission line provides a transmission service to packets or frames of data; and an I/O device provides a read or write service for I/O requests.
To understand Little's formula. consider the following argument, which focuses on the experience of a single item. When the item arrives. i( will find on average L items ahead of it. one being serviced and the rest in the queue. When the item leaves the system after being serviced. i( will leave behind on average the same number of items in the system, namely L. because L is defined as the average number of items waiting. Further. the average time that the item was in the system was W. Since items arrive at a rate of λ. we can reason that in the time W. a total of λW items must have arrived. Thus w = λW.
To summarize, under steady state conditions, the average number of it errs in a queuing system equals the average rate at which items arrive multiplied by the average time that an item spends in the system. This relationship requires very few assumptions. We do not need to know what the service time distribution is. what the distribution of arrival times is. or the order or priority in which items arc scr/cd. Because of its simplicity and generality, Little's Law is extremely useful and has experienced somewhat of a revival due to the interest in performance problems related to multicorc computers.
A very simple example. from [LITT11], illustrates how Little's Law might be applied. Consider a multicorc system, with each core supporting multiple threads of execution. At some level, the cores share a common memory. The cores share a common main memory and typically share a common cache memory as well. In any ease. when a thread is executing, it may arrive al a point at which it must retrieve a piece of data from the common memory. The thread stops and sends out a request for that data. All such stopped threads arc in a queue. If the system is being used as a server, an analyst can determine the demand on the system in terms of the rate of user requests, and then translate that into the rate of requests for data from the threads generated to respond to an individual user request. For this purpose. each user request is broken down into subtasks that arc implemented as threads. We then have λ = the average rate of total thread processing required after all members' requests have been broken down into whatever detailed subtasks arc required. Define L as the average number of slopped threads waiting during some relevant time. Then W = average response lime. This simple model can serve as a guide to designers as to whether user requirements arc being met and. if not. provide a quantitative measure of the amount of improvement needed.

2.4 BASIC MEASURES OF COMPUTER PERFORMANCE

In evaluating processor hardware and setting requirements for new systems, performance is one of the key parameters to consider, along with cost. size. security, reliability, and. in some eases, power consumption.
It is difficult to make meaningful performance comparisons among different processors, even among processors in the same family. Raw speed is far less important than how a processor performs when executing a given application. Unfortunately. application performance depends not just on the raw speed of the processor but also on the instruction set. choice of implementation language. efficiency of the compiler. and skill of the programming done to implement the application.
In this section, we look at some traditional measures of processor speed. In the next section, we examine benchmarking, which is the most common approach to assessing processor and computer system performance. The following section discusses how to average results from multiple tests.
Clock Speed
Operations performed by a processor. such as fetching an instruction, decoding the instruction.performing an arithmetic operation, and so on. arc governed by a system clock. Typically, all operations begin with the pulse of the clock. Thus, at the most fundamental level, the speed of a processor is dictated by the puke frequency produced by the clock, measured in cycles per second, or Hertz (Hz).
Typically, clock signals arc generated by a quartz crystal, which generates a constant sine wave while power is applied. This wave is converted into a digital voltage pulse stream that is provided in a constant flow to the processor circuitry (Figure 25). For example, a 1-GHz processor receives 1 billion pulses per second. The rate of pulses is known as the clock rate, or clock speed. One increment, or pulse, of the clock is referred to as a clock cycle, or a clock tick. The time between pulses is the cycle time.
The clock rate is not arbitrary, but must be appropriate for the physical layout of the processor. Actions in the processor require signals to be sent from one processor element to another. When a signal is placed on a line inside the processor, it takes some finite amount of time for the voltage levels to settle down so that an accurate value (logical 1 or 0) is available. Furthermore. depending on the physical layout of the processor circuits, some signals may change more rapidly than others. Thus, operations must be synchronized and paced so that the proper electrical signal (voltage) values arc available for each operation.
The execution of an instruction involves a number of discrete steps, such as fetching the instruction from memory. decoding the various portions of the instruction. loading and storing data, and performing arithmetic and logical operations. Thus, most instructions on most processors require multiple clock cycles to complete. Some instructions may take only a few cycles, while others require dozens. In addition, when pipelining is used, multiple instructions arc being executed simultaneously. Thus, a straight comparison of clock speeds on different processors docs not tell the whole story about performance.

image.png

Instruction Execution Rate
A processor is driven by a clock with a constant frequency f or. equivalently, a constant cycle time t. where t = 1// Define the instruction count. Ic, for a program as the number of machine instructions executed for (ha( program until i( runs to completion or for some defined time interval. Note (ha( this is the number of instruction executions, not the number of instructions in the objecl code of the program. An important parameter is the average cycles per instruction (CP!) for a program. If ail instructions required the same number of clock cycles, then CPI would be a constant value for a processor. However, on any given processor, the number of clock cycles required varies for different types of instructions, such as load, store, branch, and so on. Let CPI, be the number of cycles required for instruction type i. and be the number of executed instructions of type i for a given program. Then we can calculate an overall CPI as follows:

image.png

The processor time T needed to execute a given program can be expressed as

image.png

We can refine this formulation by recognizing that during the execution of an instruction, part of the work is done by the processor. and part of the time a word is being transferred io or from memory. In this latter ease, the lime io transfer depends on the memory cycle time, which may be greater than the processor cycle time. We can rewrite the preceding equation as

image.png

where p is the number of processor cycles needed to decode and execute the instruction. ni is the number of memory references needed, and k is the ratio between memory cycle time and processor cycle time. The five performance factors in the preceding equation (/c, p, m. k. t) arc influenced by four system attributes: the design of the instruction set (known as mstniction set architecture); compiler technology (how effective the compiler is in producing an efficient machine language program from a high-level language program); processor implementation; and cache and memory hierarchy. Table 2.1 is a matrix in which one dimension shows the five performance factors and the other dimension shows the four system al tributes. An X in a cell indicates a system attribute that affects a performance factor.

image.png

A common measure of performance for a processor is the rate at which instructions arc executed, expressed as millions of instructions per second (MIPS), referred to as the MIPS rate. We can express the MIPS rate in terms of the clock rate and CPI as follows:

image.png
image.png

Another common performance measure deals only with floating-point instructions. These arc common in many scientific and game applications. Floating-point performance is expressed as millions of floating-point operations per second (MFLOPS). defined as follows:

image.png

2.5 CALCULATING THE MEAN

In evaluating some aspect of computer system pcrformancc.it is often the ease that a single number. such as execution time or memory consumed, is used to characterize performance and to compare systems. Gcarly. a single number can provide only a very simplified view of a system's capability. Nevertheless, and especially in the field of benchmarking, single numbers arc typically used for performance comparison [SMIT88].
As is discussed in Section 2.6. the use of benchmarks to compare systems involves caiculaling the mean value of a set of data points related to execution time. It turns out that there arc multiple alternative algorithms that can be used for calculating a mean value, and this has been the source of some controversy in
the benchmarking fickl. In this section, we define these alternative algorithms and comment on some of their properties. This prepares us for a discussion in the next section of mean calculation in benchmarking.
The three common formulas used for calculating a mean arc arithmetic, geometric. and harmonic. Given a set ofn real numbers (xbX2. the three means arc defined as follows:

image.png

It can be shown lhat the following inequality holds:

AM ≤ GM ≤ HM


The values arc equal only if X1 = x2 = ... xn.
We can get a useful insight into these alternative calculations by defining the functional mean. Let f(x) be a continuous monotonic function defined in the interval 0 ≤ y ≤ ∞. The functional mean with respect to the function f(x) for n positive real numbers (x1,x2,...,xn) is defined as

image.png

where image.png is the reverse of f(x),The mean values defined in Equations (2.1) through (2.3) arc special eases of the functional mean, as follows:

  • AM is the FM with respect to f(x) = x
  • GM is the FM wilh respect to f(x) = ln x
  • HM is the FM with respect to f(x) = 1/x

image.png
image.png

Let us now consider which of these means arc appropriate for a given performance measure. As a preface to these remarks, it shoukl be noted that a number of papers ([OTR06], [FLEM86], [GILA95], [JACO95], [JOHN(M], [MASH04], [SMIT88]) and books ([HENN12], [HWAN93], [JAIN91]. [LIU(X)]) over the years have argued the pros and cons of the three means for performance analysis and come to conflicting conclusions. To simplify a complex controversy. we just note (ha( the conclusions reached depend very much on the examples chosen and the way in which the objectives are stated.
Arithmetic Mean
An AM is an appropriate mcasuiu if the sum of all the measurements is a meaningful and intciusting valuc/Dic AM is a good candidate for comparing the execution time pcr- foimancu of several systems. For example, suppose we were interested in using a system for large-scale simulation studies and wanted to evaluate several alternative products. On each system we could run the simulation multiple times with different input values for each run. and then take the average execution time across all runs. The use of multiple runs with diffciunt inputs should cnsuiv that the results aiu not heavily biased by some unusual feature of a given input set. Hie AM of ail the runsis a good measure of the system's performance on simulations, and agood numbcrtousc for system comparison.
The AM used for a time-based variable (c.g.. seconds), such as program execution time. has the important property that it is directly proportional to the total time. So. if the total time doubles, the mean value doubles.
Harmonic Mean
For some situations, a system's execution rate may be viewed as a more useful measure of the value of the system. This could be cither the instruction execution rate, measured in MIPS or MFLOPS, or a program execution rate, which measures the rate at which a given type of program can be executed. Consider how we wish the calculated mean to behave. It makes no sense to say (hat we woukl like the rrcan rate to be proportional to the total rate, where the total rate is defined as the sum of the individual rates. The sum of the rates would be a meaningless statistic. Rather, we would like the mean to be inversely proportional to the total execution time. For example. if the (otai time to execute ail the benchmark programs in a suite of programs is twice as much for system C as for system D. we would want the mean value of the execution rate to be half as much for system C as for system D.
Let us look at a basic example and first examine how the AM performs. Suppose we have asci of/i benchmark programs and record the execution times of each program on a given system as t1,t2. .... tn. For simplicity. let us assume (hat each program executes the same number of operations Z; we could weight the individual programs and calculate accordingly but this would not change the condusion of our argument. The execution rate for each individual program is Ri = Z/ti. We use the AM to calculate the average execution rate.

image.png

We see that the AM execution rate is proportional to the sum of the inverse execution times, which is not the same as being inversely proportional to the sum of the execution times. Thus, the AM docs not have the desired property.
The HM yields the following result.

image.png

The HM is inversely proportional to the total execution time, which is the desired property.

image.png

The reader may wonder why go through all this effort. If we want to compare execution limes, we could simply compare the total execution times of the three systems. If we want to compare rates, we could simply take the inverse of lhe total execution time, as shown in the table. There arc two reasons for doing the individual calculations rather than only looking at the aggregate numbers:

image.png

  1. A customer or researcher may be interested not only in the overall average performance but also performance against different types of benchmark programs, such as business applications, scientific modeling, multimedia applications. and systems programs. Thus, a breakdown by type of benchmark is needed as well as a total.
  2. Usually. the different programs used for evaluation arc weighted differently. In Table 2.2. it is assumed that the two test programs execute the same number of operations. If that is not the ease, we may want to weight accordingly. Or different programs could be weighted differently to reflect importance or priority.

Let us see what the result is if test programs arc weighted proportional to the number of operations. Following the preceding notation, each program i executes Zi instructions in a time ti. Each rate is weighted by the instructions count. The weighted HM is therefore:

image.png

We see that the weighted HM is the quotient of the sum of the opcralion count divided by the sum of the execution times.
Geometric Mean
Looking at the equations for the three types of mcans.it is easier to get an intuitive sense of the behavior of the AM and the HM than that of the GM. Several observations. from [FEIT15].may be helpful in this regard. First, we note that with respect to changes in values, the GM gives equal weight to all of the values in the data set. For example, suppose the set of data values to be averaged includes a few large values and more small values. Here, the AM is dominated by the large values. A change of 10% in the largest value will have a noticeable effect, while a change in the smallest value by the same factor will have a negligible effect. In contrast. a change in value by 10% of any of the data values results in the same change in the GM: image.png.

image.png

A second observation is that for the GM of a ratio, the GM of the ratios equals the ratio of the GMs:

image.png

Compare this with Equation 2.4.
For use with execution times, as opposed to rates, one drawback of the GM is that i( may be non-monotonic relative to the more intuitive AM. In other words there may be eases where the AM of one data set is larger than (hat of another set. but the GM is smaller.

image.png

One property of the GM that has made it appealing for benchmark analysis is that it provides consistent results when measuring the relative performance of machines. This is in fact what benchmarks are primarily used for: to compare one machine with another in terms of performance metrics. The results, as we Eave seen, arc expressed in terms of values that arc normalized to a reference machine.

image.png

It is examples like this that have fueled the “benchmark means wars" in the citations listed earlier. It is safe to say that no single number can provide all the information that one needs for comparing performance across systems. However,despite the conflicting opinions in the literature. SPEC has chosen to use the GM. for several reasons:
1.As mentioned, the GM gives consistent results regardless of which system is used as a reference. Because benchmarking is primarily a comparison analysis, this is an important feature.

2.Asdocumcntcd in [MCMA93]. and confirmed in subsequent analyses by SPEC analysts [MASH04], the GM is less biased by outliers than the HM or AM.

3.[MASH04] demonstrates that distributions of performance ratios arc better modeled by lognormal distributions than by normal ones. because of the generally skewed distribution of the normalized numbers. This is confirmed in [CITR06], And. as shown in Equation (25). the GM can be described as the back-transformed average of a lognormal distribution.

image.png

2.6 BENCHMARKS AND SPEC

Benchmark Principles
Measures such as MIPS and MFLOPS have proven inadequate lo evaluating the performance of processors. Because of differences in instruction sei. the instruction execution rale is not a valid means of comparing the performance of different architectures.

image.png

Another consideration is that the performance of a given processor on a given program may not be useful in determining how that processor will perform on a very different type of application. Accordingly, beginning in the late 1980s and early 1990s, industry and academic interest shifted to measuring the performance of systems using a set of benchmark programs. The same set of programs can be run on different machines and the execution times compared. Benchmarks provide guidance to customers trying to decide which system to buy. and can be useful to vendors and designers in determining how to design systems to meet benchmark goals. [WEIC90] lists the foilouing as desirable characteristics of a benchmark program:
1.It is written in a high-level language, making it portable across different machines.

2.It is representative of a particular kind of programming domain or paradigm, such as systems programming, numerical programming, or commercial programrring.

3.It can be measured easily.

4.It has wide distribution.
SPEC Benchmarks
The common need in industry and academic and research communities for generally accepted computer performance measurements has led to the development of standardized benchmark suites. A benchmark suite is a collection of programs, defined in a high-level language. that together attempt to provide a representative test of a computer in a particular application or system programming area. The best known such collection of benchmark suites is defined and maintained by the Standard Performance Evaluation Corporation (SPEC). an industry consortium. This organization defines several benchmark suites aimed al evaluating computer systems. SPEC performance measurements arc widely used for comparison and research purposes.
The best known of the SPEC benchmark suites is SPEC CPU2006. This is the industry standard suite for processor-intensive applications. That is, SPEC CPU2006 is appropriate for measuring performance for applications that spend most of their time doing computation rather than I/O.
Other SPEC suites include the following:

  • SPECviewperf: Standard for measuring 3D graphics performance based on professional applications.
  • SPECwpc: benchmark to measure ail key aspects of workstation performance based on diverse professional applications, including media and entertainment. product development, life sciences, financial services, and energy.
  • SPECjvm2008: Intended to evaluate performance of the combined hardware and software aspects of the Java Virtual Machine (JVM) dient platform.
  • SPECjbb2013 (Java Business Benchmark): A benchmark for evaluating server-side Java-based cleclronic commerce applications.
  • SPECsfs2008: Designed to evaluate the speed and request-handling capabilities of file servers.
  • SPECvirt_sc2013: Performance evaluation of datacenter servers used in virtualized server consolidation. Measures the end-to-end performance of ail system components including the hardware. virtualization platform, and the virtualized guest operating system and application software. The benchmark supports hardware virtualization, operating system virtualization, and hardware partitioning schemes.

The CPU2006 suite is based on existing applications that have already been ported to a wide variety of platforms by SPEC industry members. In order to rrakc the benchmark results reliable and realistic, the CPU2006 benchmarks arc drawn from real-life applications, rather (han using artificial loop programs or synthetic benchmarks. The suite consists of 12 integer benchmarks uritten in C and C++.and 17 floating-point benchmarks written in C. C++, and Fortran (Tables 25 and 2.6). The suite contains over 3 million lines of code. This is the fifth generation of processor-intensive suites from SPEC replacing SPEC CPU2000, SPEC CPU95. SPEC CPU92. and SPEC CPU89 [HENN07].

image.png
image.png

To better understand published results of a system using CPU2006, we define the following terms used in the SPEC documentation:

  • Benchmark: A program written in a high-level language that can be compiied and executed on any computer that implements the compiler.
  • System under test: This is the system to be evaluated.
  • Reference machine: This is a system used by SPEC to establish a baseline performance for all benchmarks. Each benchmark is run and measured on this machine to establish a reference time for that benchmark. A system under test is evaluated by running the CPU2006 benchmarks and comparing the results for running the same programs on the reference machine.
  • Base metric: These arc required for al! reported results and have strict guidelines for compilation. In essence. the standard compiler with more or less default settings should be used on each system under test to achieve comparable results.
  • Peak metric: This enables users to attempt to optimize system performance by optimizing the compiler output. For example, different compiler oplions may be used on each benchmark, and feedback-directed optimization is allowed.
  • Speed metric: This is simply a m
  • Rate metric: This is a measurement of how many tasks a computer can accomplish in a certain amount of time; this is called a throughput, capacity. or rate measure. "Die rate metric allows the system under test to execute simultaneous tasks to take advantage of multiple processors.

SPEC uses a historical Sun system, the "Ultra Enterprise 2," which was introduced in 1997. as the reference machine. The reference machine uses a 296-MHz UltraSPARC II processor. It lakes about 12 days to do a rule-conforming run of the base metrics for CINT2006 and CFP2OO6 on the CPU2006 reference machine. Tables 25 and 2.6 show the amount of time to run each benchmark using the reference machine. The tables also show the dynamic instruction counts on the reference machine. as reported in [PHAN07]. These values arc the actual number of instructions executed during the run of each program.
We now consider the specific calculations (hat arc done to assess a system. We consider the integer benchmarks; the same procedures arc used to create a floatingpoint benchmark value. For the integer benchmarks, there arc 12 programs in the test suite. Calculation is a three-step process (Figure 2.7):
1.The first step in evaluating a system under test is to compile and run each program on the system three times. For each program, the runtime is measured and the median value is selected. The reason to use three runs and take the median value is to account for variations in execution time that arc not inlrin- sic to the program, such as disk access time variations, and OS kernel execution variations from one run to another.

image.png

2.Next. each of the 12 results is normalized by calculating the runtime ratio of the reference run time to the system run time. The ratio is calculated as follows:

image.png

where Trefi is the execution time of benchmark program i on the reference system and Tsuti is the execution time of benchmark program i on the system under test.Thus, ratios arc higher for faster machines.

3.Finally, the geometric mean of the 12 runtime ratios is calculated to yield the overall metric:

image.png

For the integer benchmarks, four separate metrics can be calculated:

  • SPECint2006: Die geometric mean of 12 normalized ratios when the benchmarks arc compiled with peak tuning.
  • SPECint_base2006: The geometric mean of 12 normalized ratios when the benchmarks arc compiled wilh base tuning.
  • SPECint_rate2006: The geometric mean of 12 normalized throughput ratios when the benchmarks arc compiled with peak tuning.
  • SPECint_rate_base2006: The geometric mean of 12 normalized throughput ratios when the benchmarks arc compiled with base tuning.

    image.png

The rate metrics take into account a system with multiple processors. To test a machine. a number ofcopicsN is selected—usually this is equal to the number of processors or the number of simultaneous threads of execution on the test system. Each individual test program's rate is determined by taking the median of three runs. Each run consists of N copies of the program running simultaneously on the test system. The execution time is the time it takes for ail the copies to finish (i.e., the lime from when the first copy starts until the last copy finishes). The rate metric for that program is calculated by the following formula:

image.png

The rate score for the system under test is determined from a geometric mean of
rales for each program in the test suite.

image.png
image.png
image.png

2.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS

Key Terms

image.png

Review Questions
2.1 List and briefly define some of the techniques used in contemporary processors to increase speed.

2.2 Explain the concept of performance balance.

2.3 Explain the differences among multicore systems. MICs. and GPGPUs.

2.4 Briefly characterize Amdahl's law.

2.5 Briefly characterize Little's law.

2.6 Difine MIPS and FLOPS.

2.7 List and define three methods for calculating a mean value of a set of data values.

2.8 List the desirable characteristics of a benchmark program.

2.9 What are the SPEC benchmarks?

2.10 What are the differences among base metric peak metric speed metric and rate metric?

Problems
2.1 A benchmark program is run on a 40 MHz processor, lhe executed program consists of 100.000 instruction executions. with the following instruction mix and clock cycle count:

image.png

Determine the effective CPA Ml PS rate, and execution time for this program.

2.2 Consider two different machines, with two different instruction sets, both of which have a clock rate of 200 MHz. The following measurements are recorded on the two machines running a given set of benchmark programs:

image.png

a.Determine the effective CPL MIPS rate, and execution lime for each machine.
b.Comment on the results.

2.3 Early examples of CISC and RISC design are the VAX 11/780 and the IBM RS/6000. respectively. Using a typical benchmark program, the following machine characteristics result:

image.png

The final column shows that the VAX required 12 times longer than the IBM measured in CPU time.
a.What is the relative size of the instruction count of the machine code for this benchmark program running on the two machines?
b.What are the CPI values for the two machines?

2.4 Four benchmark programs are executed on three computers with the following results:

image.png

The table shows the execution time in seconds, with 100.000.000 instructions executed in each of the four programs. Calculate the MIPS values for each computer for each program. Then calculate the arithmetic and harmonic means assuming equal weights for the four programs, and rank the computers based on arithmetic mean and harmonic mean.

2.5 The following table, based on data reported in the literature [HEAT84], show< the execution times, in seconds, for five different benchmark programs on three machines.

image.png

  1. Compute the speed metric for each processor for each benchmark. normalized to machine R. rrhat is. the ratio values for R are all 1.0. Other ratios are calculated
  2. Equation (2.5) with R treated as the reference system. Then compute the arithmetic mean value for each system using Equation (2.3). This is the approach taken in [HEAT84].

b.Repeat part(a) using M as the reference machine, This calculation was not tried in [HEAT84].
c.Which machine is the slowest based on each of the preceding two calculations?

  1. Repeat the calculations of parts (a) and (b) using the geometric mean, defined in Equation (2.6). Which machine is the slowest based on the two calculations?

2.6 To clarify the results of the preceding problem, we look at a simpler example.

image.png

a.Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine. Argue that intuitively the three machines have roughly equivalent performance and that the arithmetic mean gives misleading results.
b.Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine. Argue that the resulh are more realistic than with the arithmetic mean.

2.7 Consider the example in Section 2.5 for the calculation of average CP/and MIPS rate, which yielded the result of CPI = 2.24 and MIPS rate = 178. Now assume that the program can be executed in eight parallel tasks or threads with roughly equal number of instructions executed in each task. Execution is on an8-core system with each core (processor) having the same performance as the single processor originally used Coordination and synchronization between the parts adds an extra 25.000 instruction executions to each task. Assume the same instruction mix as in the example for each task, but increase the CPI for memory reference with cache miss to 12 cycles due to contention for memory.
a.Determine the average CPI.
b.Determine the corresponding MIPS rate
c.Calculate the speedup factor.
d.Compare the actual speedup factor with the theoretical speedup factor determined by Amdhal's law.

2.8 A processor accesses main memory with an average access lime of T、A smaller cache memory is interposed between the processor and main memory. rIhe cache has a significantly faster access time of T < T% The cache holds, at any lime,copies of some main memory words and is designed so that the words more likely to be accessed in the near future are in the cache. Assume that the probability that the next word accessed by the processor is in the cache is H. known as the hit ratio.
a.For any single memory access. what is the theoretical speedup of accessing the word in the cache rather than in main memory?
b.Let T be the average access time. Express Tasa function of T1,T2 and H. What is the overall speedup as a function of H?
c.In practice, a system may be designed so that the processor must first access the cache to determine if the word is in the cache and. if it is not. then access main memory, so that on a miss (opposite of a hit), memory access time is T1 + T2 Express T as a function of T1,T2 and H. Now calculate the speedup and compare to the result produced in part (b).

2.9 The owner of ashop observes that on average 18customers per hour arrive and there are typically 8 customers in the shop. What is the average length of time each customer spends in the shop?

2.10 We can gain more insight into Little's law by considering Figure 2.8a. Over a period of time T.a total of C items arrive at a system, wait for service, and complete service. The upper solid line shows the time sequence of arrivals, and the lower solid line shows the time sequence of departures, lhe shaded area bounded by the two lines represents the total "work" done by the system in units of job-seconds: let A be the total work. We wish to derive the relationship among L,W,and λ.
a.Figure 2.8b divides the total area into horizontal rectangles, each with a height of one job. Picture sliding all these rectangles to the left so that their left edges line up at t = 0. Develop an equation that relates A,C,and W.
b.Figure 2.8c divides the total area into vertical rectangles, defined by the vertical transition boundaries indicated by the dashed lines. Picture sliding ail these rectangles down so that their lower edges line up at N(t) = 0. Develop an equation that relates A,T,and L.
c.Finally.derive L = λW from the results of (a) and (b).

2.11 In Figure 2.8a,jobs arrive at times f = 0,1,1.5,3.25,5.25,and 7.75.The corresponding completion times are f = 2,3,3.5,4.25,8.25,and 8.75.
a.Determine the area of each of the six rectangles in Figure 2.8b and sum to ge; the total area A. Show your work.
b.Determine the area of each of the 10 rectangles in Figure 2.8c and sum to ge; the total area A. Show your work.

2.12 In Section 2.6. we specified that the base ratio used for comparing a system under test to a reference system is:

image.png


a.The preceding equation provides a measure of the speedup of the system under test compared to the reference system. Assume that the number of floating-point operations executed in the test program is/? Now show the speedup asa function of the instruction execution rate FLOPSi.
b.Another technique for normalizing performance is to express the performance of a system as a percent change relative to the performance of another sydem. Express this relative change first as a function of instruction execution rate, and then as a function of execution times.

2.13 Assume that a benchmark program executes in 480 seconds on a reference machine A. The same program executes on systems B,C,and D in 360,540,and 210 seconds, respectively.
a.Show the speedup of each of the three systems under test relative to A.
b.Now show the relative speedup of the three systems. Comment on the three ways of comparing machines (execution time, speedup, relative speedup).

2.14 Repeat the preceding problem using machine D as the reference machine. How does this affect the relative rankings of the four systems?

2.15 Recalculate the results in lable 2.2 using the computer time data of Table 2.4 and comment on the results.

2.16 Equation 2.5 shows two different formulations of the geometric mean, one using a product operator and one using a summation operator.
a.Show that the two formulas are equivalent.
b.Why would the summation formulation be preferred for calculating the geometric mean?

2.17 Project. Section 2.5 lists a number of references that document the "benchmark means ware". All of the referenced papers are available at box.com/COAIOe. Read these papers and summarize the case for and against the use of the geometric mean for SPEC calculations.

相关实践学习
部署Stable Diffusion玩转AI绘画(GPU云服务器)
本实验通过在ECS上从零开始部署Stable Diffusion来进行AI绘画创作,开启AIGC盲盒。
相关文章
|
存储 缓存 Shell
【深入理解操作系统】第一章:计算机系统漫游 | A tour of Computer Systems | 阅读笔记
【深入理解操作系统】第一章:计算机系统漫游 | A tour of Computer Systems | 阅读笔记
115 0
|
7月前
|
存储 算法 索引
【头歌·计组·自己动手画CPU】三、存储系统设计(HUST)(理论版) 【计算机硬件系统设计】
【头歌·计组·自己动手画CPU】三、存储系统设计(HUST)(理论版) 【计算机硬件系统设计】
692 1
|
存储 缓存 Java
The art of multipropcessor programming 读书笔记-硬件基础2
The art of multipropcessor programming 读书笔记-硬件基础2
The art of multipropcessor programming 读书笔记-硬件基础2
|
存储 缓存 算法
The art of multipropcessor programming 读书笔记-硬件基础1
The art of multipropcessor programming 读书笔记-硬件基础1
The art of multipropcessor programming 读书笔记-硬件基础1
|
SQL 存储 数据采集
【详谈 Delta Lake 】系列技术专题 之 基础和性能(Fundamentals and Performance)
本文翻译自大数据技术公司 Databricks 针对数据湖 Delta Lake 的系列技术文章。众所周知,Databricks 主导着开源大数据社区 Apache Spark、Delta Lake 以及 ML Flow 等众多热门技术,而 Delta Lake 作为数据湖核心存储引擎方案给企业带来诸多的优势。本系列技术文章,将详细展开介绍 Delta Lake。
【详谈 Delta Lake 】系列技术专题 之 基础和性能(Fundamentals and Performance)
|
存储 传感器 异构计算
带你读《数字设计和计算机体系结构(原书第2版·ARM版)》之三:时序逻辑设计
采用一种独特的现代数字设计方法,先介绍数字逻辑门,接着讲述组合电路和时序电路的设计,并以这些基本的数字逻辑设计概念为基础,重点介绍如何设计实际的MIPS处理器。另外,在全书的实例中运用SystemVerilog和VHDL展示基于CAD的电路设计方法和技术。通过《数字设计和计算机体系结构(原书第2版)》,读者能够构建自己的微处理器,并能够自顶向下地理解微处理器的工作原理。
|
网络架构 Java Go
带你读《计算机体系结构:量化研究方法(英文版·原书第6版)》之一:Fundamentals of Quantitative Design and Analysis
本书堪称计算机系统结构学科的“圣经”,是计算机设计领域学生和实践者的必读经典。本书系统地介绍了计算机系统的设计基础、存储器层次结构设计、指令级并行及其开发、数据级并行、GPU体系结构、线程级并行和仓库级计算机等。本书内容丰富,既介绍了当今计算机体系结构的研究成果,也引述了许多计算机系统设计开发方面的实践经验。另外,各章结尾还附有大量的习题和参考文献。
|
内存技术 Go Windows
带你读《计算机组成与体系结构:性能设计(英文版·原书第10版)》之一:Basic Concepts and Computer Evolution
本书以Intel x86体系结构和ARM两个处理器系列为例,将当代计算机系统性能设计问题与计算机组成的基本概念和原理紧密联系起来,介绍了当代计算机体系结构的主流技术和最新技术。本书作者曾13次获a得美国教材和学术专著作者协会颁发的年度最佳计算机科学教材奖。目前,他是一名独立顾问,为众多计算机和网络制造商、软件开发公司以及政府前沿研究机构提供服务。
|
Java Windows 内存技术
带你读《计算机组成与设计:硬件/软件接口(英文版原书第5版RISC-V版)》之二:Instructions:Language of the Computer
全书着眼于当前计算机设计中最基本的概念,展示了软硬件间的关系,并全面介绍当代计算机系统发展的主流技术和最新成就。书中逐条指令地列举了完整的MIPS指令集,并介绍了网络和多处理器结构的基本内容。将CPU性能和程序性能紧密地联系起来是本版的一个新增内容。另外,本版对软硬件的讨论更加深入,作者展示了软硬件部件如何影响程序的性能,并在光盘中为侧重硬件和侧重软件的读者分别提供了相关资料。
|
内存技术 网络架构 Go
带你读《计算机体系结构:量化研究方法(英文版·原书第6版)》之二: Memory Hierarchy Design
本书堪称计算机系统结构学科的“圣经”,是计算机设计领域学生和实践者的必读经典。本书系统地介绍了计算机系统的设计基础、存储器层次结构设计、指令级并行及其开发、数据级并行、GPU体系结构、线程级并行和仓库级计算机等。本书内容丰富,既介绍了当今计算机体系结构的研究成果,也引述了许多计算机系统设计开发方面的实践经验。另外,各章结尾还附有大量的习题和参考文献。