Beasley, The Mathematics of Games


From stevena@... Thu Dec 12 07:53:06 2002
Date: Thu, 12 Dec 2002 07:52:54 -0800 (Pacific Standard Time)
Subject: Ratings systems
From: Steven Alexander 

The ratings system's drawbacks will be overcome, whether within the current framework or creating a wholly new one, only with serious statistical study. While what's been posted here recently (which I'll read more thoroughly than already[1]) is valuable, anyone who really wants to think about ratings should read up on some very good published work.

Before the constructive work noted below, please absorb the attached excerpt from Chapter 5 of John D. Beasley's 1989 book "The Mathematics of Games" (Oxford Univ Pr, ISBN 0-19-286107-7), entitled "If A Beats B, and B Beats C ..." (Actually, the section of Ch. 5 about non-transitivity is the only one not reproduced.)

Though I am open to other possibilities, I currently consider the strongest ones (1) a system pairing (rating; uncertainty), where the rating component is similar to current ratings (perhaps enough to use the current scale) and the uncertainty measures how unsure the first component is to be close to the "true" rating; and (2) a score-based system, in the form (score, defense) [comma-separated because these two numbers are in the same units: points].

The first kind has been developed for the US Chess Federation, the "Glicko" system, and for table tennis, the Marcus system. While I am collecting references of essential reading for Rating Committee members and other arguers, for now, starting from and, should lead to extensive details on these two.

The second kind is, of course, examined in Robert Parker's writings. (I'll assemble these with other available links and some copies of papers for all concerned.)

While I very much like the Parker-type system, its adoption or other changes will depend on evidence -- of how good the new system would be, not how bad the old one was, if I have anything to say about it. This will involve both running historical data (win-loss for modifications of the current system, but as Joe Edley noted, score data not yet collected will be necessary for the Parker) to test at least how predictive a system would have been had it been in effect before the predictions to be made. Also to be evaluated are deserved stability of players' ratings and the degree of any undesired incentives; and with a Parker-type system, how much is gained by adding factors other than offense and defense (that arise not from the inherent meaning of the measures, but from imperfect match of the otherwise elegant system with reality).

Enjoy reading.

  Steven Alexander
    NSA Ratings Committee member

[1] Those who just criticize, many assuming that their desires are both consistent among themselves and consistent with others' priorities, might benefit most by reading and learning. Those publishing data and experiments here already are thinking concretely about the problem.


The Mathematics of Games

John D. Beasley
Oxford Univ Pr 1989
ISBN 0-19-286107-7

Chapter 5 If A Beats B, and B Beats C ...

[all but the last section of Ch. 5; pages 47-61]

In the previous chapter, we looked at some of the pseudo-random effects which appear to affect the results of games. We now attempt to measure the actual skill of performers. There is no difficulty in finding apparently suitable mathematical formulae; textbooks are full of them. Our primary aim here is to examine the circumstances in which a particular formula may be valid, and to note any difficulties which may attend its use.

The assessment of a single player in isolation

We start by considering games such as golf, in which each player records an independent score. In practice, of course, few competitive games are completely free from interactions between the players; a golfer believing himself to be two strokes behind the tournament leader may take risks that he would not take if he believed himself to be two strokes ahead of the field. But for present purposes, we assume that any such interactions can be ignored. We also ignore any effects that external circumstances may have on our data. In Chapter 4, we were able to adjust our scores to allow for the general conditions pertaining to each round, because the pooling of the scores of all the players allowed the effect of these conditions to be assessed with reasonable confidence. A sequence of scores from one player alone does not allow such assessments to be made, and we have little alternative but to accept the scores at face value.

To fix our ideas, let us suppose that a player has returned four separate scores, say 73, 71, 70, and 68 (Figure 5.1). If these scores were recorded at approximately the same time, we might conclude that a reasonable estimate of his skill is given by the unweighted mean 70.5 (U in Figure 5.1). This is effectively the basis on which tournament results are calculated. On the other hand, if the scores were returned over a long period, we might prefer to give greater weight to the more recent of them. For example, if we assign weights 1:2:3:4 in order, we obtain a weighted mean of 69.7 (W in Figure 5.1). More sophisticated weighting, taking account of the actual dates of the scores, is also possible.

                    73  *---------------
                        |    |    |    |
                        |    |    |    |
                    71  -----*----------
                        |    |    |    |   <-- U (70.5)
                    70  ----------*-----
                        |    |    |    |   <-- W (69.7)
                        |    |    |    |
                    68  ---------------*

            Figure 5.1  Weighted and unweighted means
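
As an illustration (not part of Beasley's text), the two means in Figure 5.1 can be checked with a few lines of Python. The function names are hypothetical, and the scores are assumed to be listed oldest first.

```python
def unweighted_mean(scores):
    """Plain average: every score counts equally."""
    return sum(scores) / len(scores)


def weighted_mean(scores, weights):
    """Weighted average; weights[i] applies to scores[i] (oldest first)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


scores = [73, 71, 70, 68]                 # the four scores of Figure 5.1
u = unweighted_mean(scores)               # U in Figure 5.1
w = weighted_mean(scores, [1, 2, 3, 4])   # W in Figure 5.1
print(f"U = {u:.1f}, W = {w:.1f}")        # U = 70.5, W = 69.7
```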

So, we see, right from the start, that our primary need is not a knowledge of abstruse formulae, but a commonsense understanding of the circumstances in which the data have been generated.

Now let us assume that we already have an estimate, and that the player returns an additional score. Specifically, let us suppose that our estimate has been based on n scores s_1, ..., s_n, and that the player has now returned an additional score s_{n+1}. If we are using an unweighted mean based on the n most recent scores, we must now replace our previous estimate
        (s_1+s_2+...+s_n)/n
by a new estimate
        (s_2+s_3+...+s_{n+1})/n;
the contribution of s_1 vanishes, the contributions from s_2,...,s_n remain unchanged, and a new contribution appears from s_{n+1}. In other words, the contribution of a particular score to an unweighted mean remains constant until n more scores have been recorded, and then suddenly vanishes. On the other hand, if we use a weighted mean with weights 1:2:...:n, the effect of a new score s_{n+1} is to replace the previous estimate
        2(s_1+2s_2+...+n s_n)/(n(n+1))
by a new estimate
        2(s_2+2s_3+...+n s_{n+1})/(n(n+1));
not only does the contribution from s_1 vanish, but the contributions from s_2,...,s_n are all decreased. This seems rather more satisfactory.

Nevertheless, anomalies may still arise. Let us go back to the scores in Figure 5.1, which yielded a mean of 69.7 using weights 1:2:3:4, and let us suppose that an additional score of 70 is recorded. If we form a new estimate by discarding the earliest score and applying the same weights 1:2:3:4 to the remainder, we obtain 69.5, which is less than either the previous estimate or the additional score. So we check our arithmetic, suspecting a mistake, but we find the value indeed to be correct. Such an anomaly is always possible when the mean of the previous scores differs from the mean of the contributions discarded. It is rarely large, but it may be disconcerting to the inexperienced.
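
The arithmetic of this anomaly is easy to reproduce. The following is a small Python sketch (an editorial illustration with hypothetical names, assuming a sliding window of the four most recent scores):

```python
def weighted_mean(scores, weights):
    """Weighted average; weights[i] applies to scores[i] (oldest first)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


weights = [1, 2, 3, 4]
history = [73, 71, 70, 68]                  # scores of Figure 5.1
previous = weighted_mean(history, weights)  # estimate before the new score

new_score = 70
window = history[1:] + [new_score]          # discard the earliest score
updated = weighted_mean(window, weights)    # estimate after the new score

# The updated estimate falls below both the old estimate and the new score.
anomalous = updated < min(previous, new_score)
print(f"{previous:.1f} -> {updated:.1f}, anomalous: {anomalous}")
```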

If we are to avoid anomalies of this kind, we must ensure that the updated estimate always lies between the previous estimate and the additional score. This is easily done; if E_n is the estimate after n scores s_1,...,s_n, all we need is to ensure that

        E_{n+1} = w_n E_n + (1-w_n)s_{n+1}
where w_n is some number satisfying 0<w_n<1. But there is a cost. If we calculate successive estimates E_1,E_2,..., we find
        E_1 = s_1,
        E_2 = w_1 s_1 + (1-w_1)s_2,
        E_3 = w_1 w_2 s_1 + w_2(1-w_1)s_2 + (1-w_2)s_3,
and so on; the contribution of each score gradually decreases, but it never vanishes altogether.

So we have a fundamental dilemma. If we want to ensure that an updated estimate always lies between the most recent score and the previous estimate, we must accept that even the most ancient of scores will continue to contribute its mite to our estimate. Conversely, if we exclude scores of greater than a certain antiquity, we must be prepared for occasions on which an updated estimate does not lie between the previous estimate and the most recent score.
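
Both horns of the dilemma can be seen in a short Python sketch of the recursive update. This is an editorial illustration only: the weight is held constant at 0.6 (the text allows a different w_n at each step), and the fifth score of 70 is the additional score used in the anomaly example.

```python
def smoothed_estimates(scores, w):
    """E_1 = s_1, then E_{n+1} = w*E_n + (1-w)*s_{n+1}, with constant 0 < w < 1."""
    estimates = [scores[0]]
    for s in scores[1:]:
        estimates.append(w * estimates[-1] + (1 - w) * s)
    return estimates


scores = [73, 71, 70, 68, 70]
estimates = smoothed_estimates(scores, w=0.6)

# Every update lies between the previous estimate and the new score ...
always_between = all(
    min(prev, s) <= new <= max(prev, s)
    for prev, s, new in zip(estimates, scores[1:], estimates[1:])
)

# ... but the first score still carries weight w**(n-1) in the n-th estimate.
first_score_weight = 0.6 ** (len(scores) - 1)
print(always_between, round(first_score_weight, 4))
```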

The estimation of trends

The estimates that we have discussed so far have assessed skill as it has been displayed in the past. If a player's skill has changed appreciably during the period under assessment, the estimate may not fully reflect the change. It is therefore natural to try to find estimates which do reflect such changes.

Such estimates can indeed be made. Figure 5.2(a) repeats the last two data values of Figure 5.1, and the dotted line shows the estimate E obtained by assuming that the change from 70 to 68 represents a genuine trend that may be expected to continue. More sophisticated estimates, taking account of more data values, can be found in textbooks on statistics and economics.

                   -----------    71 ^ *----------
                   |    |    |       | |    |    |
               70  *----------       | *----------
                   |    |    |         |    |    |
                   -----------         -----------
                   |    |    |         |    |    |
               68  -----*-----    68   -----*-----
                   |    |    |         |    |    |
                   -----------         -----------
                   |    |    |         |    |    |
               66  ----------*         ----------* |
                   |    |    |         |    |    | |
                   -----------    65   ----------* v
                       (a)                 (b)

         Figure 5.2  The behaviour of a forward estimate

But there are two problems. The first, which is largely a matter of common sense, is that the assumption of a trend is not to be made blindly. Golf enthusiasts may have recognized 73-71-70-68 as the sequence returned by Ben Hogan when winning the British Open in 1953, and it is doubtful if even Hogan, given a fifth round, would have gone round a course as difficult as Carnoustie in 66. On the other hand, there are circumstances in which the same figures might much more plausibly indicate a trend: if they represent the successive times of a twenty-kilometre runner as he gets into training, for example.

The second difficulty is a matter of mathematics. The extrapolation from 70 through 68 to 66 is an example of linear extrapolation from s_1 through s_2, the estimate being given by 2s_2 - s_1. In other words, we form a weighted mean of s_1 and s_2, but one of the weights is negative. A change in that score therefore has an inverse effect on the estimate. This is shown in Figure 5.2(b), where the score 70 has been changed to 71 and it is seen that the estimate has changed from 66 to 65. In particular, if a negatively weighted score was abnormally poor (for example, because the player was not fully fit on that occasion), the future estimate will be improved as a result.
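
The inverse effect can be verified directly. A minimal Python sketch (the function name is hypothetical):

```python
def linear_forecast(s1, s2):
    """Extrapolate the trend from s1 through s2: weights -1 and +2."""
    return 2 * s2 - s1


print(linear_forecast(70, 68))   # 66, as in Figure 5.2(a)
print(linear_forecast(71, 68))   # 65, as in Figure 5.2(b)
```

Worsening the earlier score from 70 to 71 improves the forecast from 66 to 65, because that score enters with a negative weight.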

This contravenes common sense, and suggests that we should confine our attention to estimates which respond conformably to all constituent scores: a decrease in any score should decrease the estimate, and an increase in any score should increase it. But it turns out that such an estimate cannot lie outside the bounds of the constituent scores, and this greatly reduces the scope for estimation of trends. The proof is simple and elegant. Let S be the largest of the constituent scores. If each score actually equals S, the estimate must equal S also. If any score s does not equal S and the estimating procedure is conformable, the replacement of S by s must cause a reduction in the estimate. So a conformable estimate cannot exceed the largest of the constituent scores; and similarly, it cannot be less than the smallest of them.\fn{1}

\fn{1} It follows that economic estimates which attempt to project current trends are in general not conformable; and while this is unlikely to be the whole reason for their apparent unreliability, it is not an encouraging thought.

In practice, therefore, we have little choice. Given that common sense demands conformable behaviour, we cannot use an estimating procedure which predicts a future score outside the bounds of previous scores; we can merely give the greatest weight to the most recent of them. If this is unwelcome news to improving youngsters, it is likely to gratify old stagers who do not like being reminded too forcibly of their declining prowess. In fact, the case which most commonly causes difficulty is that of the player who has recently entered top-class competition and whose first season's performance is appreciably below the standard which he subsequently establishes; and the best way to handle this case is not to use a clever formula to estimate the improvement, but to ignore the first year's results when calculating subsequent estimates.

Interactive games

We now turn to games in which the result is recorded only as a win for a particular player, or perhaps as a draw. These games present a much more difficult problem. The procedure usually adopted is to assume that the performance of a player can be represented by a single number, called his grade or rating, and to calculate this grade so as to reflect his actual results. For anything other than a trivial game, the assumption is a gross over-simplification, so anomalies are almost inevitable and controversy must be expected. In the case of chess, which is the game for which grading has been most widely adopted, a certain amount of controversy has indeed arisen; some players and commentators appear to regard grades with excessive reverence, most assume them to be tolerable approximations to the truth, a few question the detailed basis of the calculations, and a few regard them as a complete waste of ink. The resolution of such controversy is beyond the scope of this book, but at least we can illuminate the issues.

The basic computational procedure is to assume that the mean expected result of a game between two players is given by an 'expectation function' which depends only on their grades a and b, and then to calculate these grades so as to reflect the actual results. It might seem that the accuracy of the expectation function is crucial, but we shall see in due course that it is actually among the least of our worries; provided that the function is reasonably sensible, the errors introduced by its inaccuracy are likely to be small compared with those resulting from other sources. In particular, if the game offers no advantage to either player, it may be sufficient to calculate the grading difference d=a-b and to use a simple smooth function f(d) such as that shown in Figure 5.3. For a game such as chess, the function should be offset to allow for the first player's advantage, but this is a detail easily accommodated.\fn{2}

\fn{2} Figure 5.3 adopts the chess player's scaling of results: 1 for a win, 0 for a loss, and 0.5 for a draw. The scaling of the d-axis is arbitrary.

                            1.0  |                 -
                                 |       /
                                 |     /
                                 |   /
                                 | /
                            0.5  /
                               / |
                             /   |
                           /     |
             _           /  0.0  |
            -100      -50        0       50      100

            Figure 5.3  A typical expectation function
                     [showing S-shaped curve
                     from (-100,near 0) thru
                    (0,0.5) to (100,near 1.0)]

Once the function f(d) has been chosen, the calculation of grades is straightforward. Suppose for a moment that two players already have grades which differ by d, and that they now play another game, the player with the higher grade winning. Before the game, we assessed his expectation as f(d); after the game, we might reasonably assess it as a weighted mean of the previous expectation and the new result. Since a win has value 1, this suggests that his new expectation should be given by a formula such as

        w + (1-w)f(d)
where w is a weighting factor, and this is equivalent to
        f(d) + w(1-f(d)).

More generally, if the stronger player achieves a result of value r, the same argument suggests that his new expectation should be given by the formula

        f(d) + w(r-f(d)).

Now if the expectation function is scaled as in Figure 5.3 and the grading difference is small, we see that a change of \delta in d produces a change of approximately \delta/100 in f(d). It follows that approximately the required change in expectation can be obtained by increasing the grading difference by 100w(r-f(d)). As the grading difference becomes larger, the curve flattens, and a given change in the grading difference produces a smaller change in the expectation. In principle, this can be accomplished by increasing the scaling factor 100, but it is probably better to keep this factor constant, since always to make the same change in the expectation may demand excessive changes in the grades. The worst case occurs when a player unexpectedly fails to beat a much weaker opponent; the change in grading difference needed to reduce an expectation of 0.99 to 0.985 may be great indeed. To look at the matter another way, keeping the scaling factor constant amounts to giving reduced weight to games between opponents of widely differing ability, which is plainly reasonable since the ease with which a player beats a much weaker opponent does not necessarily say a great deal about his ability against his approximate peers.
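
A Python sketch of this update rule follows. It is an editorial illustration under stated assumptions: the text does not specify the curve of Figure 5.3, so the sketch uses the normal-distribution form N(d\sqrt(2\pi)/100), which has slope 1/100 at d=0; the weighting factor w=0.1 and the function names are hypothetical; and the whole change is applied to the grading difference rather than divided between the two players.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expectation(d, scale=100.0):
    """Assumed S-shaped f(d); a small change delta in d near 0 changes f by delta/scale."""
    return normal_cdf(d * math.sqrt(2.0 * math.pi) / scale)


def updated_difference(d, result, w=0.1, scale=100.0):
    """Grading difference after one game: d + scale * w * (r - f(d))."""
    return d + scale * w * (result - expectation(d, scale))


# Equal grades, stronger-listed player wins: the difference opens by 100*w*0.5.
print(round(updated_difference(0.0, 1.0), 3))    # 5.0

# With a large difference, an expected win changes little, a loss changes a lot.
print(round(updated_difference(80.0, 1.0) - 80.0, 3))
print(round(updated_difference(80.0, 0.0) - 80.0, 3))
```

Keeping the scaling factor at 100 caps the change from a single game at 100w, however unexpected the result.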

A simple modification of this procedure can be used to assign a grade to a previously ungraded player. Once he has played a reasonable number of games, he can be assigned that grade which would be left unchanged if adjusted according to his actual results. The same technique can also be used if it is desired to ignore ancient history and grade a player only on the basis of recent games.
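
One way to find such a grade is to search for the value at which the player's total result equals his total expectation, so that the adjustments from all his games cancel. The Python sketch below is an editorial illustration: the expectation function is the same assumed normal-distribution form, the bisection search and its bounds are a hypothetical choice, and the opponents' grades are invented for the example. A player who has won or lost every game has no finite stable grade, and the search then simply runs to one of its bounds.

```python
import math


def expectation(d, scale=100.0):
    """Assumed expectation function N(d*sqrt(2*pi)/scale)."""
    x = d * math.sqrt(2.0 * math.pi) / scale
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def stable_grade(opponent_grades, results, lo=-1000.0, hi=1000.0, tol=1e-6):
    """Grade g for which sum(r - f(g - b)) over all games is zero."""
    def surplus(g):
        return sum(r - expectation(g - b) for b, r in zip(opponent_grades, results))

    # surplus(g) falls as g rises, so bisection homes in on its zero.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if surplus(mid) > 0:
            lo = mid      # results better than expected: grade too low
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical newcomer: a win, two draws and a loss against these four grades.
grade = stable_grade([100.0, 120.0, 140.0, 160.0], [1.0, 0.5, 0.5, 0.0])
print(round(grade, 2))   # 130.0, by symmetry of the opposition about 130
```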

Grades calculated on this basis can be expected to provide at least a rough overall measure of each regular player's performance. However, certain practical matters must be decided by the grading administrator, and these may have a perceptible effect on the figures. Examples are the interval at which grades are updated, the value of the weighting parameter w, the relative division of an update between grades of the players (in particular, when one player is well established whereas the other is a relative newcomer), the criteria by which less than fully competitive games are excluded, and the circumstances in which a player's grade is recalculated to take account only of his most recent games. Grades are therefore not quite the objective measures that their more uncritical admirers like to maintain.

Grades as measures of ability

Although grading practitioners usually stress that their grades are merely measures of performance, players are interested in them primarily as measures of ability. A grading system defines an expectation between every pair of graded players, and the grades are of interest only in so far as these expectations correspond to reality.

A little thought suggests that this correspondence is unlikely to be exact. If two players A and B have the same grade, their expectations against any third player C are asserted to be exactly equal. Alternatively, suppose that A, B, Y, and Z have grades such that A's expectation against B is asserted to equal Y's against Z, and that expectations are calculated using a function which depends only on the grading difference. If these grades are a, b, y, and z, then they must satisfy a-b = y-z, from which it follows that a-y = b-z, and hence A's expectation against Y is asserted to equal B's against Z. Assertions as precise as this are unlikely to be true for other than very simple games, and it follows that grades cannot be expected to yield exact expectations; the most for which we can hope is that they form a reasonable average measure whose deficiencies are small compared with the effects of chance fluctuation.

These chance effects can easily be estimated. If A's expectation against B is p and there is a probability h that they draw, the standard deviation s of a single result is \sqrt(p(1-p) - h/4). If they now play a sufficiently long series of n games, the distribution of the discrepancy between mean result and expectation can be taken as a normal distribution with standard deviation s/\sqrt n, and a simple rule of thumb gives the approximate probability that any particular discrepancy would have arisen by chance: a discrepancy exceeding the standard deviation can be expected on about one trial in three, and a discrepancy exceeding twice the standard deviation on about one trial in twenty. What constitutes a sufficiently large value of n depends on the expectation p. If p lies between 0.4 and 0.6, n should be at least 10; if p is smaller than 0.4 or greater than 0.6, n should be at least 4/p or 4/(1-p) respectively. More detailed calculations, taking into account the incidence of each specific combination of results, are obviously possible, but they are unlikely to be worthwhile.
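
These formulae translate directly into code. The Python sketch below is an editorial illustration with hypothetical function names; rounding 4/p up to a whole number of games is an assumption, since the text says only "at least".

```python
import math


def single_result_sd(p, h):
    """Standard deviation of one result, scored 1 / 0.5 / 0, given
    expectation p and draw probability h."""
    return math.sqrt(p * (1.0 - p) - h / 4.0)


def series_sd(p, h, n):
    """Standard deviation of the mean result over n games."""
    return single_result_sd(p, h) / math.sqrt(n)


def minimum_games(p):
    """Smallest n for which the normal approximation is suggested."""
    if 0.4 <= p <= 0.6:
        return 10
    tail = p if p < 0.4 else 1.0 - p
    # round() guards against floating-point noise such as 20.000000000000004
    return math.ceil(round(4.0 / tail, 9))


print(single_result_sd(0.5, 0.0))          # 0.5 for an even game with no draws
print(round(series_sd(0.5, 0.0, 25), 3))   # 0.1 after 25 such games
print(minimum_games(0.5), minimum_games(0.2), minimum_games(0.8))   # 10 20 20
```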

A practicable testing procedure now suggests itself. Every time a new set of grades is calculated, the results used to calculate the new grades can be used also to test the old ones. If two particular opponents play each other sufficiently often, their results provide a particularly convenient test; otherwise, results must be grouped, though this must be done with care since the grouping of inhomogeneous results may lead to wrong conclusions. The mean of the new results can be compared with the expectation predicted by the previous grades, and large discrepancies can be highlighted: one star if the discrepancy exceeds the standard deviation, and two if it exceeds twice the standard deviation. The rule of thumb above gives the approximate frequency with which stars are to be expected if chance fluctuations are the sole source of error.
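
The starring rule might be sketched in Python as follows. This is an editorial illustration: the function name is hypothetical, the figures in the usage lines are invented, and the draw probability is assumed to be supplied by the administrator, since the grades alone do not determine it.

```python
import math


def stars(mean_result, expectation, draw_prob, n):
    """0, 1 or 2 stars for the discrepancy between an observed mean result
    over n games and the expectation predicted by the old grades."""
    sd = math.sqrt((expectation * (1.0 - expectation) - draw_prob / 4.0) / n)
    discrepancy = abs(mean_result - expectation)
    if discrepancy > 2.0 * sd:
        return 2
    if discrepancy > sd:
        return 1
    return 0


# Predicted expectation 0.6, draw probability 0.2, 20 games: sd is about 0.097.
print(stars(0.65, 0.6, 0.2, 20))   # 0
print(stars(0.72, 0.6, 0.2, 20))   # 1
print(stars(0.85, 0.6, 0.2, 20))   # 2
```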

In practice, of course, chance fluctuations are not the only source of error. Players improve when they are young, they decline as they approach old age, and they sometimes suffer temporary loss of form due to illness or domestic disturbance. The interpretation of stars therefore demands common sense. Nevertheless, if the proportions of stars and double stars greatly exceed those attributable to chance fluctuation, the usefulness of the grades is clearly limited.

If grades do indeed constitute acceptable measures of ability, regular testing such as this should satisfy all but the most extreme and blinkered of critics. However, grading administrator and critic alike must always remember that around one discrepancy in three should be starred, and around one in twenty doubly starred, on account of chance fluctuations, even if there is no other source of error. If a grading administrator performs a hundred tests without finding any doubly starred discrepancies, he should not congratulate himself on the success of his grading system; he should check the correctness of his testing.

The self-fulfilling nature of grading systems

We now come to one of the most interesting mathematical aspects of grading systems: their self-fulfilling nature. It might seem that a satisfactory expectation function must closely reflect the true nature of the game, but in fact this is not so. Regarded as measures of ability, grades are subject to errors from two sources: (i) discrepancies between ability and actual performance, and (ii) errors in the calculated expectations due to the use of an incorrect expectation function. In practice, the latter are likely to be much smaller than the former.

Table 5.1 illustrates this. It relates to a very simple game in which each player throws a single object at a target, scoring a win if he hits and his opponent misses, and the game being drawn if both hit or if both miss. If the probability that player j hits is p_j, the expectation of player j against player k can be shown to be (1+p_j-p_k)/2, so we can calculate expectations exactly by setting the grade of player j to 50p_j and using the expectation function f(d) = 0.5 + d/100. Now let us suppose that we have nine players whose probabilities p_1,...,p_9 range linearly from 0.1 to 0.9, that they play each other with equal frequency, and that we deliberately use the incorrect expectation function f(d) = N(d\sqrt(2\pi)/100) where N(x) is the normal distribution function. The first column of Table 5.1 shows the grades that are produced if the results of the games agree strictly with expectation, and the entries for each pair of players show (i) the discrepancy between the true and the calculated expectations, and (ii) the standard deviation of a single result between the players. The latter is always large compared with the former, which means that a large number of games are needed before the discrepancy can be detected against the background of chance fluctuation. The standard deviation of a mean result decreases only with the inverse square root of the number of games played, so we can expect to require well over a hundred sets of all-play-all results before even the worst discrepancy (player 1 against player 9) can be diagnosed with confidence.

Table 5.1  Throwing one object: the effect of an incorrect expectation function
                 -------------------- Opponent -------------------
Player Grade     1     2     3     4     5     6     7     8     9
1        5.5     - -.009 -.013 -.014 -.011 -.005  .004  .017  .032
                 -  .250  .274  .287  .292  .287  .274  .250  .212

2       17.3  .009     - -.006 -.009 -.009 -.007 -.002  .006  .017
              .250     -  .304  .316  .320  .316  .304  .283  .250

3       28.5  .013  .006     - -.004 -.006 -.007 -.005 -.002  .004
              .274  .304     -  .335  .339  .335  .324  .304  .274

4       39.3  .014  .009  .004     - -.003 -.006 -.007 -.007 -.005
              .287  .316  .335     -  .350  .346  .335  .316  .287

5       50.0  .011  .009  .006  .003     - -.003 -.006 -.009 -.011
              .292  .320  .339  .350     -  .350  .339  .320  .292

6       60.7  .005  .007  .007  .006  .003     - -.004 -.009 -.014
              .287  .316  .335  .346  .350     -  .335  .316  .287

7       71.5 -.004  .002  .005  .007  .006  .004     - -.006 -.013
              .274  .304  .324  .335  .339  .335     -  .304  .274

8       82.7 -.017 -.006  .002  .007  .009  .009  .006     - -.009
              .250  .283  .304  .316  .320  .316  .304     -  .250

9       94.5 -.032 -.017 -.004  .005  .011  .014  .013  .009     -
              .212  .250  .274  .287  .292  .287  .274  .250     -
The grades are calculated using an incorrect expectation function as described
in the text.  The tabular values show (i) the discrepancy between the calculated
and true expectations, and (ii) the standard deviation of a single result.
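
The standard deviations in row (ii) follow from the rules of the throwing game alone and can be recomputed independently of any grading. The Python sketch below is an editorial illustration with hypothetical names; it also checks that the win, loss and draw probabilities give the expectation (1+p_j-p_k)/2 stated in the text.

```python
import math


def throw_game(pj, pk):
    """Expectation, draw probability and single-result standard deviation
    for player j (hits with probability pj) against player k."""
    win = pj * (1.0 - pk)             # j hits, k misses
    loss = (1.0 - pj) * pk            # k hits, j misses
    draw = 1.0 - win - loss           # both hit or both miss
    expectation = win + draw / 2.0    # equals (1 + pj - pk) / 2
    sd = math.sqrt(expectation * (1.0 - expectation) - draw / 4.0)
    return expectation, draw, sd


hit = [k / 10 for k in range(1, 10)]   # p_1 .. p_9 = 0.1 .. 0.9

e12, h12, sd12 = throw_game(hit[0], hit[1])   # player 1 against player 2
e19, h19, sd19 = throw_game(hit[0], hit[8])   # player 1 against player 9
print(round(e12, 3), round(sd12, 3))   # 0.45 0.25
print(round(e19, 3), round(sd19, 3))   # 0.1 0.212
```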

Experiment bears this out. Table 5.2 records a computer simulation of a hundred sets of all-play-all results, the four rows for each player showing (i) his true expectation against each opponent, (ii) the mean of his actual results against each opponent, (iii) his grade as calculated from these results using the correct expectation function 0.5 + d/100, together with his expectation against each opponent as calculated from their respective grades, and (iv) the same as calculated using the incorrect expectation function N(d\sqrt(2\pi)/100). The differences between rows (i) and (iii) are caused by the differences between the theoretical expectations and the actual results, and the differences between rows (iii) and (iv) are caused by the difference between the expectation functions. In over half the cases, the former difference is greater than the latter, so on this occasion even a hundred sets of all-play-all results have not sufficed to betray the incorrect expectation function with reasonable certainty. Nor are the differences between actual results and theoretical expectations in Table 5.2 in any way abnormal. If the experiment were to be performed again, it is slightly more likely than not that the results in row (ii) would differ from expectation more widely than those which appear here.\fn{3}

\fn{3} In practice, of course, we do not know the true expectation function, so rows (i) and (iii) are hidden from us, and all we can do is assess whether the discrepancies between rows (ii) and (iv) might reasonably be attributable to chance. Such a test is far from sensitive; for example, the discrepancies in Table 5.2 are so close to the median value which can be expected from chance fluctuations alone that nothing untoward can be discerned in them. We omit the proof of this, because the analysis is not straightforward; the simple rules of thumb which we used in the previous section cannot be applied, because we are now looking at the spread of results around expectations to whose calculation they themselves have contributed (whereas the rules apply to the spread of results about independently calculated expectation) and we must take the dependence into account. Techniques exist for doing this, but the details are beyond the scope of this book.

Table 5.2  Throwing one object: grading systems compared
     Grade       -------------------------------------------------
Player           1     2     3     4     5     6     7     8     9
1                -  .450  .400  .350  .300  .250  .200  .150  .100
                 -  .455  .435  .350  .335  .230  .200  .150  .125
      11.8       -  .471  .400  .355  .314  .250  .197  .182  .110
       7.8       -  .466  .388  .342  .304  .247  .203  .191  .139

2             .550     -  .450  .400  .350  .300  .250  .200  .150
              .545     -  .395  .395  .330  .290  .245  .210  .130
      17.6    .529     -  .429  .384  .344  .280  .226  .211  .139
      14.6    .534     -  .422  .374  .334  .275  .228  .215  .159

3             .600  .550     -  .450  .400  .350  .300  .250  .200
              .565  .605     -  .450  .390  .380  .315  .285  .185
      31.7    .600  .570     -  .455  .414  .350  .297  .281  .209
      30.4    .612  .578     -  .451  .409  .344  .292  .277  .212

4             .650  .600  .550     -  .450  .400  .350  .300  .250
              .650  .605  .550     -  .435  .430  .365  .310  .240
      40.8    .645  .616  .546     -  .459  .395  .343  .327  .254
      40.2    .658  .626  .549     -  .457  .390  .336  .320  .249

5             .700  .650  .600  .550     -  .450  .400  .350  .300
              .665  .670  .610  .565     -  .370  .395  .370  .305
      48.9    .685  .657  .586  .540     -  .436  .383  .368  .295
      48.8    .696  .666  .591  .543     -  .432  .376  .360  .284

6             .750  .700  .650  .600  .550     -  .450  .400  .350
              .770  .710  .620  .570  .630     -  .395  .435  .395
      61.7    .750  .721  .650  .604  .564     -  .447  .432  .359
      62.4    .753  .725  .656  .610  .568     -  .442  .425  .345

7             .800  .750  .700  .650  .600  .550     -  .450  .400
              .800  .755  .685  .635  .605  .605     -  .520  .400
      72.3    .803  .773  .703  .657  .617  .553     -  .484  .412
      74.0    .797  .772  .708  .664  .624  .558     -  .483  .400

8             .850  .800  .750  .700  .650  .600  .550     -  .450
              .850  .790  .715  .690  .630  .565  .480     -  .425
      75.4    .818  .789  .718  .673  .633  .569  .516     -  .427
      77.5    .809  .785  .723  .680  .640  .575  .517     -  .417

9             .900  .850  .800  .750  .700  .650  .600  .550     -
              .875  .870  .815  .760  .695  .605  .600  .575     -
      89.9    .891  .861  .791  .745  .705  .641  .588  .575     -
      94.3    .861  .841  .788  .751  .716  .655  .600  .583     -
For each player, the four rows show (i) the true expectation against each
opponent; (ii) the average result of a hundred games against each opponent,
simulated by computer; (iii) the grade calculated from the simulated games,
using the correct expectation function, and the resulting expectations against
each opponent; and (iv) the same using an incorrect expectation function as
described in the text.
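
As a rough cross-check on the table, the row (iii) expectations for player 1 can be recomputed from the row (iii) grades. The sketch below assumes a linear expectation function, 0.5 plus the grade difference divided by 200; that form is inferred here from the table's own figures rather than quoted from the text, and the names are illustrative only.

```python
# Row (iii) grades for players 1..9, copied from Table 5.2.
GRADES_III = [11.8, 17.6, 31.7, 40.8, 48.9, 61.7, 72.3, 75.4, 89.9]

# Row (iii) expectations printed for player 1 against opponents 2..9.
PLAYER1_ROW_III = [0.471, 0.400, 0.355, 0.314, 0.250, 0.197, 0.182, 0.110]


def linear_expectation(own_grade, opponent_grade):
    """Assumed expectation function: linear in the grade difference."""
    return 0.5 + (own_grade - opponent_grade) / 200.0


# Recompute player 1's expectations against each opponent.
recomputed = [linear_expectation(GRADES_III[0], g) for g in GRADES_III[1:]]

# Largest gap between the recomputed and the printed figures.
max_gap = max(abs(r - t) for r, t in zip(recomputed, PLAYER1_ROW_III))

for opponent, (r, t) in enumerate(zip(recomputed, PLAYER1_ROW_III), start=2):
    print(f"vs player {opponent}: recomputed {r:.4f}, table {t:.3f}")
print(f"largest gap: {max_gap:.4f}")
```

The gaps that remain are of the size to be expected from the grades having been printed to only one decimal place.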

This is excellent news for grading secretaries, since it suggests that any reasonable expectation function can be used; the spacing of grades may differ from that which a correct expectation function would have generated, but the expectations will be adjusted in approximate compensation, and any residual errors will be small compared with the effects of chance fluctuation on the actual results. But there is an obvious corollary: the apparently successful calculation of expectations by a grading system throws no real light on the underlying nature of the game. Chess grades are currently calculated using a system, due to A. E. Elo, in which expectations are calculated by the normal distribution function, and the general acceptance of this system by chess players has fostered the belief that the normal distribution provides the most appropriate expectation function for chess. In fact it is by no means obvious that this is so. The normal distribution function is not a magic formula of universal applicability; its validity as an estimator of unknown chance effects depends on the Central Limit Theorem, which states that the sum of a large number of independent samples from the same distribution can be regarded as a sample from a normal distribution, and it can reasonably be adopted as a model for the behaviour of a game only if the chance factors affecting the result are equivalent to a large number of independent events which combine additively. Chess may well not satisfy this condition, since many a game appears to be decided not by an accumulation of small blunders but by a few large ones. But while the question is of some theoretical interest, it hardly matters from the viewpoint of practical grading. Chess gradings are of greatest interest at master level, and the great majority of games at this level are played within an expectation range of 0.3 to 0.7. Over this range, the normal distribution is almost linear, but so is any simple alternative candidate, and so in all probability is the unknown 'true' function which most closely approximates to the actual behaviour of the game. In such circumstances, the errors resulting from an incorrect choice of expectation function are likely to be even smaller than those which appear in Table 5.1.
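
The near-linearity claimed here is easy to check numerically. The sketch below compares the normal distribution function with its tangent line at the midpoint, and with a logistic curve of the same central slope, over the range where the normal expectation lies between roughly 0.3 and 0.7. The logistic curve is only an assumed example of a "simple alternative candidate"; the text does not name one.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Slope of the normal distribution function at its midpoint z = 0.
CENTRAL_SLOPE = 1.0 / math.sqrt(2.0 * math.pi)


def linear(z):
    """Tangent line to the normal distribution function at z = 0."""
    return 0.5 + CENTRAL_SLOPE * z


def logistic(z):
    """Logistic curve scaled to have the same slope at z = 0 (assumed alternative)."""
    return 1.0 / (1.0 + math.exp(-4.0 * CENTRAL_SLOPE * z))


# z from -0.52 to 0.52 covers normal expectations of roughly 0.30 to 0.70.
zs = [i / 100.0 for i in range(-52, 53)]

max_gap_linear = max(abs(normal_cdf(z) - linear(z)) for z in zs)
max_gap_logistic = max(abs(normal_cdf(z) - logistic(z)) for z in zs)

print(f"expectation range: {normal_cdf(zs[0]):.3f} to {normal_cdf(zs[-1]):.3f}")
print(f"largest gap, normal vs straight line: {max_gap_linear:.4f}")
print(f"largest gap, normal vs logistic:      {max_gap_logistic:.4f}")
```

Under these assumptions both gaps stay below 0.01 of expectation, which is small beside the chance fluctuations of a hundred-game sample seen in row (ii) of Table 5.2.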

The limitations of grading

Grades help tournament organizers to group players of approximately equal strength, and they provide the appropriate authorities with a convenient basis for the awarding of honorific titles such as 'master' and 'grandmaster'. However, it is very easy to become drunk with figures, and it is appropriate that this discussion should end with some cautionary remarks.

(a) Grades calculated from only a few results are unlikely to be reliable.

(b) The assumption underlying all grades is that a player's performance against one opponent casts light on his expectation against another. If this assumption is unjustified, no amount of mathematical sophistication will provide a remedy. In particular, a grade calculated only from results against much weaker opponents is unlikely to place a player accurately among his peers.

(c) There are circumstances in which grades are virtually meaningless. For an artificial but instructive example, suppose that we have a set of players in London and another in Moscow. If we try to calculate grades embracing both sets, the placing of players within each set may well be determined, but the placing of the sets as a whole will depend on the results of the few games between players in different cities. Furthermore, these games are likely to have been between the leading players in each city, and little can be inferred from them about the relative abilities of more modest performers. Grading administrators are well aware of these problems and refrain from publishing composite lists in such circumstances, but players sometimes try to make inferences by combining lists which administrators have been careful to keep separate.

(d) A grade is merely a general measure of a player's performance relative to that of certain other players over a particular period. It is not an absolute measure of anything at all. The average ability of a pool of players is always changing, through study, practice, and ageing, but grading provides no mechanism by which the average grade can be made to reflect these changes; indeed, if the pool of players remains constant and every game causes equal and opposite changes to the grades of the affected players, the average grade never changes at all. What does change the average grade of a pool is the arrival and departure of players, and if a player has a different grade when he leaves than he received when he arrived then his sojourn will have disturbed the average grade of the other players; but this change is merely an artificial consequence of the grading calculations, and it does not represent any change in average ability. It is of course open to a grading administrator to adjust the average grade of his pool to conform to any overall change in ability which he believes to have occurred, but the absence of an external standard of comparison means that any such adjustment is conjectural.
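
Point (d) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the update rule (a fixed multiple `k` of result minus expectation, added to one player and subtracted from the other), the linear expectation function, and the pool of five grades are all hypothetical choices made for illustration, not a description of any real grading system.

```python
import random


def expectation(grade_a, grade_b):
    """Assumed linear expectation function, clipped to the range 0..1."""
    return min(1.0, max(0.0, 0.5 + (grade_a - grade_b) / 200.0))


def play_games(grades, n_games, k=2.0, seed=0):
    """Return new grades after n_games equal-and-opposite updates."""
    rng = random.Random(seed)
    grades = list(grades)
    for _ in range(n_games):
        a, b = rng.sample(range(len(grades)), 2)
        expected = expectation(grades[a], grades[b])
        result = 1.0 if rng.random() < expected else 0.0
        change = k * (result - expected)
        grades[a] += change  # what one player gains...
        grades[b] -= change  # ...the other loses
    return grades


def mean(values):
    return sum(values) / len(values)


pool = [10.0, 30.0, 50.0, 70.0, 90.0]

# Closed pool: the average grade cannot move, whatever the results.
closed_drift = mean(play_games(pool, 1000)) - mean(pool)

# A newcomer arrives with one grade and leaves with another.
entry_grade = 50.0
after_visit = play_games(pool + [entry_grade], 1000, seed=1)
exit_grade = after_visit[-1]
others_drift = mean(after_visit[:-1]) - mean(pool)

print(f"closed pool drift in average grade: {closed_drift:.2e}")
print(f"newcomer entered at {entry_grade:.1f}, left at {exit_grade:.1f}")
print(f"drift in the others' average grade: {others_drift:.3f}")
```

Whatever the newcomer gains or loses during the visit is shared out, with opposite sign, among the other players, so their average moves by (entry grade minus exit grade) divided by their number, regardless of whether anyone's ability has changed.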

It is this last limitation that is most frequently overlooked. Students of all games like to imagine how players of different periods would have compared with each other, and long-term grading has been hailed as providing an answer. This is wishful thinking. Grades may appear to be pure numbers, but they are actually measures relative to ill-defined and changing reference levels, and they cannot answer questions about the relative abilities of players when the reference levels are not the same. The absolute level represented by a particular grade is not fixed: it is doubtful whether a player's grade ten years before his peak can properly be compared with that ten years after, and quite certain that his peak cannot be compared with somebody else's peak in a different era altogether. Morphy in 1857-8 and Fischer in 1970-2 were outstanding among their chess contemporaries, and it is natural to speculate how they would have fared against each other; but such speculations are not answered by calculating grades through chains of intermediaries spanning over a hundred years.\fn{4}

\fn{4} Chess enthusiasts may be surprised that the name of Elo has not figured more prominently in this discussion, since the Elo rating system has been in use internationally since 1970. However, Elo's work as described in his book The rating of chessplayers, past and present (Batsford 1978) is open to serious criticism. His statistical testing is unsatisfactory to the point of being meaningless; he calculates standard deviations without allowing for draws, he does not always appear to allow for the extent to which his tests have contributed to the ratings which they purport to be testing, and he fails to make the important distinction between proving a proposition true and merely failing to prove it false. In particular, an analysis of 4795 games from Milwaukee Open tournaments, which he represents as demonstrating the normal distribution function to be the appropriate expectation function for chess, is actually no more than an incorrect analysis of the variation within his data. He also appears not to realize that changes in the overall strength of a pool cannot be detected, and that his 'deflation control', which claims to stabilize the implied reference level, is a delusion. Administrators of other sports (for example tennis) currently publish only rankings. The limitations of those are obvious, but at least they do not encourage illusory comparisons between today's champions and those of the past.