| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 596.1 |  | BEING::POSTPISCHIL | Always mount a scratch monkey. | Thu Oct 16 1986 17:23 | 11 | 
|  |     In spite of the formula, the geometric means seem to have been
    calculated correctly.
    
    I do not believe the error in the comparisons lies with the arithmetic
    mean, but with the "normalization".  The "normalization" the author
    uses is basically a way of assigning values to the various results, and
    it assigns the highest values to the benchmarks the chosen system did
    best in.
    
    
    				-- edp 
 | 
| 596.2 | Known problem | SQM::HALLYB | Free the quarks! | Fri Oct 17 1986 12:26 | 38 | 
|  |     If you look at the original data and compute the product of all
    3 "times" you get 24000 for all 4 systems.  Not at all surprising
    that the (correctly calculated) geometric mean is the same in all
    cases.
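
    As a quick sketch of why that happens (the times below are made up
    so that each system's product is 24000; they are not the original
    data from .0):

        # The geometric mean depends only on the product of the times, so
        # any set of times with the same product gets the same score.
        def geometric_mean(times):
            product = 1.0
            for t in times:
                product *= t
            return product ** (1.0 / len(times))

        systems = {
            "A": [60.0, 20.0, 20.0],    # product = 24000
            "B": [40.0, 30.0, 20.0],    # product = 24000
            "C": [24.0, 100.0, 10.0],   # product = 24000
        }
        for name, times in systems.items():
            print(name, geometric_mean(times))   # all print about 28.84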
    
    /* This kind of technique has been bouncing around the performance
    community for quite some time now.  The main problem is that there
    is no easy way to compare different CPUs on the basis of some number
    of benchmarks. */
    
    The need for "normalization" comes from a problem with the arithmetic
    mean.  If you run benchmarks x and y on systems P and Q, and get
    times that look like:
    
    			   P	  Q
    		x	100.	10.0
    		y	  0.1	 1.0
    
    you can see that benchmark x really dominates the whole set, and
    the contribution of y is irrelevant.  So in order to give equal
    weight to x and y, you normalize with respect to one of the CPUs,
    and end up with:
    
    			  P	  Q
    		x	 1.	 0.1
    		y	 1.	10.0
    
    Then the arithmetic mean of P is 1, while the mean of Q is > 5.
    Of course if you normalize with respect to Q the situation is reversed.
    It is not unusual to see this kind of raw data, owing to various
    compiler optimizations.
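
    A small sketch of the flip, using the numbers above (Python only
    for illustration):

        # Raw benchmark times from the table above (lower is better).
        P = {"x": 100.0, "y": 0.1}
        Q = {"x": 10.0, "y": 1.0}

        def normalized_mean(times, reference):
            # Arithmetic mean of the ratios to the reference machine.
            ratios = [times[b] / reference[b] for b in times]
            return sum(ratios) / len(ratios)

        # Normalized to P:  P scores 1.0, Q scores 5.05, so P looks better.
        print(normalized_mean(P, P), normalized_mean(Q, P))
        # Normalized to Q:  P scores 5.05, Q scores 1.0, so Q looks better.
        print(normalized_mean(P, Q), normalized_mean(Q, Q))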
    
    The geometric mean is independent of normalization since it is already
    a "multiplicative entity".  I believe the original article intended
    to point out that the arithmetic mean is a poor way to compare CPU
    times, and the geometric mean is more useful.
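
    Continuing the sketch above, the geometric mean of the normalized
    ratios gives the same verdict no matter which machine is used as
    the reference:

        import math

        P = {"x": 100.0, "y": 0.1}
        Q = {"x": 10.0, "y": 1.0}

        def normalized_gmean(times, reference):
            # Geometric mean of the ratios to the reference machine.
            ratios = [times[b] / reference[b] for b in times]
            return math.prod(ratios) ** (1.0 / len(ratios))

        # Both normalizations say P and Q are dead even (both score 1.0,
        # since their raw geometric means are equal), so the choice of
        # reference machine cannot flip the ranking.
        print(normalized_gmean(P, P), normalized_gmean(Q, P))
        print(normalized_gmean(P, Q), normalized_gmean(Q, Q))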
    
      John
 | 
| 596.3 |  | BEING::POSTPISCHIL | Always mount a scratch monkey. | Fri Oct 17 1986 20:03 | 15 | 
|  |     Re .2:
    
    If normalization is necessary, you certainly don't do it by reducing
    the effect of the bad or dominating benchmarks!
    
    A better way to do it is to figure out how relevant the various
    benchmarks are for your system.  For example, you might figure that
    sixty percent of your work will be somewhat like benchmark x and forty
    percent will be like benchmark y.  Use those figures to adjust the
    data.  If that still leaves one benchmark dominating the other,
    that is good, because, when you buy the systems, that portion of the
    work will dominate the other.
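
    For instance, a sketch of that kind of weighting (reusing the times
    from .2; the 60/40 split is just the illustrative figure above):

        # Assumed workload mix: 60% of the work looks like benchmark x,
        # 40% looks like benchmark y.
        weights = {"x": 0.6, "y": 0.4}
        times_P = {"x": 100.0, "y": 0.1}
        times_Q = {"x": 10.0, "y": 1.0}

        def weighted_time(times, weights):
            # Expected time per unit of work under the assumed mix.
            return sum(weights[b] * times[b] for b in times)

        # P: 0.6*100 + 0.4*0.1 = 60.04    Q: 0.6*10 + 0.4*1.0 = 6.4
        # Benchmark x still dominates, which is exactly right if 60% of
        # the real work looks like x.
        print(weighted_time(times_P, weights), weighted_time(times_Q, weights))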
    
    
    				-- edp 
 | 
| 596.4 | Up to 4.2 times more useful | SQM::HALLYB | Free the quarks! | Sat Oct 18 1986 00:51 | 29 | 
|  |     Re .3:
    
    Yes, that is the standard suggestion that is made at this point
    in the argument.  Unfortunately at this point we tend to stray from
    the MATH content and enter an unrelated topic.  So to keep it brief,
    suffice it to say that the benchmarks x, y, z, ... have little if
    anything to do with actual workloads.  They're just a random bunch
    of programs that get passed on from one young generation to another.
    Occasionally somebody will add in a program so as to contribute
    to the sum total of Human Knowledge, but almost invariably the programs
    added have the characteristic of being fairly easy to code and most
    importantly being very easy to run.  Hence they tend either to do
    no IO at all or to do IO exclusively; rarely is there ever an attempt
    made to actually model a workload, and even then it's a general
    workload, not anything site-specific.  Some exceptions exist.
    
    The next question usually is along the lines of "Well why run all
    these silly little benchmarks if they don't mean anything?"  There
    isn't much of an answer to this except that these little programs
    are about the only way to make any kind of comparisons across a
    wide variety of processors for a wide variety of customers, and
    even if the data is only vaguely useful it's better than comparing
    raw instruction timings and IO bus bandwidths.  Certainly better
    approaches exist but they involve a LOT of work to instrument an
    existing workload and then generate a synthetic workload to duplicate
    the observed one.  Most customers can't afford to do that, and at
    times the workload to be predicted doesn't yet exist.
      John
 | 
| 596.5 | yes | TOOK::APPELLOF | Carl J. Appellof | Mon Oct 20 1986 13:01 | 9 | 
|  |     I agree that the problem is in the "normalization".  
    There are really two components to this:  the first, as pointed
    out, is in weighting the benchmarks according to how important they
    are to YOUR workload.  The second, and only mathematical reason,
    is to reduce results of various benchmarks to some common scale
    so that an arithmetic mean can make sense.
	Obviously, the method of standardizing each benchmark against
    a different machine is not the way to do it.
    
 | 
| 596.6 | Doug Clark gave an excellent lecture on a related topic | EAGLE1::BEST | R D Best, Systems architecture, I/O | Sun Oct 26 1986 01:55 | 16 | 
|  | 
> In case anyone is interested, Doug Clark gave a very interesting
> (and amusing) talk on the rampant misuse of benchmarks about a year ago
> at an LTN technical forum.  I believe that it was entitled something
> like 'Ten Awful Ways to Measure Computer Performance'.  He discusses
> the effects of neglecting realistic cache hit ratios, compiler effects,
> why certain commonly used benchmarks are notoriously bad indicators of
> real life computer usage, AND the specious use of statistics and math
> by hardware manufacturers (including us) and other 'tricks of the trade'.
> I believe it was recorded and should be available on videotape from the LTN
> library.  I can almost guarantee that this talk will have you rolling on
> the floor.  I give it my vote for one of the all time best lectures I've
> attended.
>		/R Best
 | 
| 596.7 | Median or mean | AIWEST::DRAKE | Dave (Diskcrash) Drake 619-292-1818 | Sun Oct 26 1986 03:44 | 22 | 
|  |     A few thoughts:
    
    Re .0:  The arithmetic mean is not usually also the median. The median
    is the value that has 50% of the observations above and below it.
    In fact I have found that median-based figures of merit are very
    useful in a wide class of analysis problems. I have used them in
    image processing to "cast out" bad data rather than forming linear
    filters that include it. The median would in fact be a good comparison
    mechanism as it would help ignore benchmark extrema.
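
    A small sketch of that robustness (the scores are made up):

        import statistics

        # Hypothetical normalized benchmark scores with one extreme result.
        scores = [0.9, 1.0, 1.1, 1.2, 15.0]

        # The mean is dragged up by the single outlier; the median ignores it.
        print(statistics.mean(scores))     # 3.84
        print(statistics.median(scores))   # 1.1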
    
    No question, benchmarks are a pit. We try to quantify some simple
    "figure of merit" about a very complex system such as a 8800. I
    would think that it would be better to distill each processor into
    its component queueing mechanisms and provide quantitative data
    about the server time of each queue. (A queue in this case means
    any system resource that is consumed in common by processes.) Each
    processor would end up with say 5 to 10 values that would be used
    for comparison purposes. Someone would probably come along and find
    the norm of the 5 to 10 valued vector and call this the "performance".
    If we did this we could more accurately compare new applications
    against our systems. All I can say is MIPS are for DIPS.
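
    A sketch of what that norm-of-a-vector "performance" number might
    look like (the resources and service times below are made up):

        import math

        # Hypothetical per-resource service times (ms) for one processor:
        # CPU, memory, disk, network, lock manager.
        service_times = [0.8, 0.3, 12.0, 2.5, 0.1]

        # The single figure someone would inevitably derive: the Euclidean
        # norm of the vector, which throws away exactly the per-resource
        # detail the vector was meant to preserve.
        performance = math.sqrt(sum(t * t for t in service_times))
        print(performance)   # about 12.3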
    
 |