| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 596.1 |  | BEING::POSTPISCHIL | Always mount a scratch monkey. | Thu Oct 16 1986 17:23 | 11 | 
|  |     In spite of the formula, the geometric means seem to have been
    calculated correctly.
    
    I do not believe the error in the comparisons lies with the arithmetic
    mean, but with the "normalization".  The "normalization" the author
    uses is basically a way of assigning values to the various results, and
    it assigns the highest values to the benchmarks the chosen system did
    best in.
    
    
    				-- edp 
 | 
| 596.2 | Known problem | SQM::HALLYB | Free the quarks! | Fri Oct 17 1986 12:26 | 38 | 
|  |     If you look at the original data and compute the product of all
    3 "times" you get 24000 for all 4 systems.  Not at all surprising
    that the (correctly calculated) geometric mean is the same in all
    cases.
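
    As a quick sketch of why that happens (the times below are made up
    so that each system's product is 24000; they are not the original
    data from .0):

        # The geometric mean depends only on the product of the times, so
        # any set of times with the same product gets the same score.
        def geometric_mean(times):
            product = 1.0
            for t in times:
                product *= t
            return product ** (1.0 / len(times))

        systems = {
            "A": [60.0, 20.0, 20.0],    # product = 24000
            "B": [40.0, 30.0, 20.0],    # product = 24000
            "C": [24.0, 100.0, 10.0],   # product = 24000
        }
        for name, times in systems.items():
            print(name, geometric_mean(times))   # all print about 28.84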
    
    /* This kind of technique has been bouncing around the performance
    community for quite some time now.  The main problem is that there
    is no easy way to compare different CPUs on the basis of some number
    of benchmarks. */
    
    The need for "normalization" comes from a problem with the arithmetic
    mean.  If you run benchmarks x and y on systems P and Q, and get
    times that look like:
    
    			   P	  Q
    		x	100.	10.0
    		y	  0.1	 1.0
    
    you can see that benchmark x really dominates the whole set, and
    the contribution of y is irrelevant.  So in order to give equal
    weight to x and y, you normalize with respect to one of the CPUs,
    and end up with:
    
    			  P	  Q
    		x	 1.	 0.1
    		y	 1.	10.0
    
    Then the arithmetic mean of P is 1, while the mean of Q is > 5.
    Of course if you normalize with respect to Q the situation is reversed.
    It is not unusual to see this kind of raw data, owing to various
    compiler optimizations.
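
    A small sketch of the flip, using the numbers above (Python only
    for illustration):

        # Raw benchmark times from the table above (lower is better).
        P = {"x": 100.0, "y": 0.1}
        Q = {"x": 10.0, "y": 1.0}

        def normalized_mean(times, reference):
            # Arithmetic mean of the ratios to the reference machine.
            ratios = [times[b] / reference[b] for b in times]
            return sum(ratios) / len(ratios)

        # Normalized to P:  P scores 1.0, Q scores 5.05, so P looks better.
        print(normalized_mean(P, P), normalized_mean(Q, P))
        # Normalized to Q:  P scores 5.05, Q scores 1.0, so Q looks better.
        print(normalized_mean(P, Q), normalized_mean(Q, Q))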
    
    The geometric mean is independent of normalization since it is already
    a "multiplicative entity".  I believe the original article intended
    to point out that the arithmetic mean is a poor way to compare CPU
    times, and the geometric mean is more useful.
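
    Continuing the sketch above, the geometric mean of the normalized
    ratios gives the same verdict no matter which machine is used as
    the reference:

        import math

        P = {"x": 100.0, "y": 0.1}
        Q = {"x": 10.0, "y": 1.0}

        def normalized_gmean(times, reference):
            # Geometric mean of the ratios to the reference machine.
            ratios = [times[b] / reference[b] for b in times]
            return math.prod(ratios) ** (1.0 / len(ratios))

        # Both normalizations say P and Q are dead even (both score 1.0,
        # since their raw geometric means are equal), so the choice of
        # reference machine cannot flip the ranking.
        print(normalized_gmean(P, P), normalized_gmean(Q, P))
        print(normalized_gmean(P, Q), normalized_gmean(Q, Q))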
    
      John
 | 
| 596.3 |  | BEING::POSTPISCHIL | Always mount a scratch monkey. | Fri Oct 17 1986 20:03 | 15 | 
|  |     Re .2:
    
    If normalization is necessary, you certainly don't do it by reducing
    the effect of the bad or dominating benchmarks!
    
    A better way to do it is to figure out how relevant the various
    benchmarks are for your system.  For example, you might figure that
    sixty percent of your work will be somewhat like benchmark x and forty
    percent will be like benchmark y.  Use those figures to adjust the
    data.  If that still leaves one benchmark dominating the other,
    that is good, because, when you buy the systems, that portion of the
    work will dominate the other.
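
    For instance, a sketch of that kind of weighting (reusing the times
    from .2; the 60/40 split is just the illustrative figure above):

        # Assumed workload mix: 60% of the work looks like benchmark x,
        # 40% looks like benchmark y.
        weights = {"x": 0.6, "y": 0.4}
        times_P = {"x": 100.0, "y": 0.1}
        times_Q = {"x": 10.0, "y": 1.0}

        def weighted_time(times, weights):
            # Expected time per unit of work under the assumed mix.
            return sum(weights[b] * times[b] for b in times)

        # P: 0.6*100 + 0.4*0.1 = 60.04    Q: 0.6*10 + 0.4*1.0 = 6.4
        # Benchmark x still dominates, which is exactly right if 60% of
        # the real work looks like x.
        print(weighted_time(times_P, weights), weighted_time(times_Q, weights))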
    
    
    				-- edp 
 | 
| 596.4 | Up to 4.2 times more useful | SQM::HALLYB | Free the quarks! | Sat Oct 18 1986 00:51 | 29 | 
|  |     Re .3:
    
    Yes, that is the standard suggestion that is made at this point
    in the argument.  Unfortunately at this point we tend to stray from
    the MATH content and enter an unrelated topic.  So to keep it brief,
    suffice it to say that the benchmarks x, y, z, ... have little if
    anything to do with actual workloads.  They're just a random bunch
    of programs that get passed on from one young generation to another.
    Occasionally somebody will add in a program so as to contribute
    to the sum total of Human Knowledge, but almost invariably the programs
    added have the characteristic of being fairly easy to code and most
    importantly being very easy to run.  Hence they tend either to do
    no IO at all or to do IO exclusively; rarely is there ever an attempt
    made to actually model a workload, and even then it's a general
    workload, not anything site-specific.  Some exceptions exist.
    
    The next question usually is along the lines of "Well why run all
    these silly little benchmarks if they don't mean anything?"  There
    isn't much of an answer to this except that these little programs
    are about the only way to make any kind of comparisons across a
    wide variety of processors for a wide variety of customers, and
    even if the data is only vaguely useful it's better than comparing
    raw instruction timings and IO bus bandwidths.  Certainly better
    approaches exist but they involve a LOT of work to instrument an
    existing workload and then generate a synthetic workload to duplicate
    the observed one.  Most customers can't afford to do that, and at
    times the workload to be predicted doesn't yet exist.
      John
 | 
| 596.5 | yes | TOOK::APPELLOF | Carl J. Appellof | Mon Oct 20 1986 13:01 | 9 | 
|  |     I agree that the problem is in the "normalization".  
    There are really two components to this:  the first, as pointed
    out, is in weighting the benchmarks according to how important they
    are to YOUR workload.  The second, and only mathematical reason,
    is to reduce results of various benchmarks to some common scale
    so that an arithmetic mean can make sense.
	Obviously, the method of standardizing each benchmark against
    a different machine is not the way to do it.
    
 | 
| 596.6 | Doug Clark gave an excellent lecture on a related topic | EAGLE1::BEST | R D Best, Systems architecture, I/O | Sun Oct 26 1986 01:55 | 16 | 
|  | 
> In case anyone is interested, Doug Clark gave a very interesting
> (and amusing) talk on the rampant misuse of benchmarks about a year ago
> at an LTN technical forum.  I believe that it was entitled something
> like 'Ten Awful Ways to Measure Computer Performance'.  He discusses
> the effects of neglecting realistic cache hit ratios, compiler effects,
> why certain commonly used benchmarks are notoriously bad indicators of
> real life computer usage, AND the specious use of statistics and math
> by hardware manufacturers (including us) and other 'tricks of the trade'.
> I believe it was recorded and should be available on videotape from the LTN
> library.  I can almost guarantee that this talk will have you rolling on
> the floor.  I give it my vote for one of the all time best lectures I've
> attended.
>		/R Best
 | 
| 596.7 | Median or mean | AIWEST::DRAKE | Dave (Diskcrash) Drake 619-292-1818 | Sun Oct 26 1986 03:44 | 22 | 
|  |     A few thoughts:
    
    Re .0:  The arithmetic mean is not usually also the median. The median
    is the value that has 50% of the observations above and below it.
    In fact I have found that median-based figures of merit are very
    useful in a wide class of analysis problems. I have used them in
    image processing to "cast out" bad data rather than forming linear
    filters that include it. The median would in fact be a good comparison
    mechanism as it would help ignore benchmark extrema.
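
    A small sketch of that robustness (the scores are made up):

        import statistics

        # Hypothetical normalized benchmark scores with one extreme result.
        scores = [0.9, 1.0, 1.1, 1.2, 15.0]

        # The mean is dragged up by the single outlier; the median ignores it.
        print(statistics.mean(scores))     # 3.84
        print(statistics.median(scores))   # 1.1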
    
    No question, benchmarks are a pit. We try to quantify some simple
    "figure of merit" about a very complex system such as a 8800. I
    would think that it would be better to distill each processor into
    its component queueing mechanisms and provide quantitative data
    about the server time of each queue. (A queue in this case means
    any system resource that is consumed in common by processes.) Each
    processor would end up with say 5 to 10 values that would be used
    for comparison purposes. Someone would probably come along and find
    the norm of the 5 to 10 valued vector and call this the "performance".
    If we did this we could more accurately compare new applications
    against our systems. All I can say is MIPS are for DIPS.
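
    A sketch of what that norm-of-a-vector "performance" number might
    look like (the resources and service times below are made up):

        import math

        # Hypothetical per-resource service times (ms) for one processor:
        # CPU, memory, disk, network, lock manager.
        service_times = [0.8, 0.3, 12.0, 2.5, 0.1]

        # The single figure someone would inevitably derive: the Euclidean
        # norm of the vector, which throws away exactly the per-resource
        # detail the vector was meant to preserve.
        performance = math.sqrt(sum(t * t for t in service_times))
        print(performance)   # about 12.3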
    
 |