less_retarded_wiki/elo.md
2024-12-10 20:41:11 +01:00

14 KiB

Elo

The Elo system (named after Arpad Elo, NOT an acronym) is a mathematical system for rating the relative strength of players in a certain competitive game, most notably and widely used in chess but also elsewhere (video games, table tennis, ...). { I have seen a cool video where someone computed the Elo of all NPC players in the Pokemon games. ~drummyfish } Based on number of wins, losses and draws against other Elo rated opponents, the system computes a number (rating) for each player that highly correlates with that player's current strength/skill; as games are played, ratings of players are constantly being updated to reflect changes in their strength. The numeric rating can then be used to predict the probability of a win, loss or draw of any two players in the system, as well as doing all kinds of other nice things such as tracking player's improvement over time, constructing ladders of current top players and matchmaking players of similar strength in online games. For example if player A has an Elo rating of 1700 and player B 1400, A is expected to win in a game with player B with the probability of 85%. The system is designed in a very clever way -- it uses the ability to estimate the outcome of a game between two players and then corrects the ratings of the players based on whether they do better or worse than expected. This way the ratings change and converge to stop near the value that reflects the player's true strength. Elo is a system designed in a smart way but still remains mathematically a pretty "keep it simple" one -- this means it has a few flaws and shortcomings (mentioned below), which keep being addressed by alternative rating systems such as Glicko (which further adds e.g. confidence intervals). However the simplicity of Elo has also shown to be a big advantage, it does a great job for a very small "price" and this quality to price ratio so far seems to be uncontested. Elo is good enough for most practical uses without requiring too complex mathematics or large amounts of data constantly being available. For this it remains in wide use despite other systems being objectively more accurate in predictions: usually the high complexity of the competing systems shows only diminishing returns.

What we call a "game" here need not always be a typical game, Elo rating may be used for example in a video sharing platform to help the recommendation algorithm by letting videos compete for attention and then assigning them rating. When the site recommends the user two videos at once, they are effectively playing a game to win attention: whichever gets clicked wins the game, and this way we may find out which videos are the most popular AND also how popular each one is relative to others.

The Elo system was created specifically for chess (even though it can be applied to other games as well, it doesn't rely on any chess specific rules) and described by Arpad Elo in his 1978 book called The Rating of Chessplayers, Past and Present, by which time it was already in use by FIDE. It replaced older rating systems, most notably the Harkness system.

Elo rates only RELATIVE performance, not absolute, i.e. the rating number of a player says nothing in itself, it is only the DIFFERENCE in rating points between two players that matters, so in an extreme case two players rated 300 and 1000 in one rating pool may in another one be rated 10300 and 11000 (the difference of 700 is the only thing that stays the same, mean value can change freely). This may be influenced by initial conditions and things such as rating inflation (or deflation) -- if for example a chess website assigns some start rating to new users which tends to overestimate an average newcomer's abilities, newcomers will come to the site, play a few games which they will lose, then they ragequit but they've already fed their points to the good players, causing the average rating of a good player to grow over time (it's basically like an economy where the rating points are the currency, new overrated players have the same effect as printing money). This is one of several issues the Elo system has to deal with. Other issues include for example magic constants: the constant K (change rate) and the initial rating of new players have to somehow be set, and the system itself doesn't say what the ideal values are.

Yet another shortcoming is that ratings (including relative differences) depend on the order of games. I.e. when several games are played between N players and we update the ratings after each game, then the ratings of all the players (and their differences, i.e. predictions the system will make) at the end will depend on the order in which the games were played -- playing the games with exact same results but in different order will generally result in different ratings. This also holds for grouping: we may update ratings after each game or group several games together and count them as one match, outcome of which will be the average outcome of all the games -- and this may affect ratings too. So the rating partially depends on something that has nothing to do with the player's skill. This may not be such a huge problem in practice, tiny differences and fluctuations are usually ignored, but eventually this IS an undesirable property of the system. Some other systems address this by always computing every player's rating based on whole history of games he ever played, which fixes the issue but also brings in more computational complexity (imagine having to recompute everything from scratch after every single game, AND having to keep the record of complete history of all games).

It also must be said that Elo is a simplification of reality, as is any attempt at capturing skill with a single number -- even though it is a very good predictor of something akin a "skill" and outcomes of games, trying to capture "skill" with a single number is similar to trying to capture such a multidimensional attribute as intelligence with a single dimensional IQ number. For example due to psychology, many different areas of the game to be mastered and different playstyles transitivity may be broken in reality: it may happen that player A mostly beats player B, player B mostly beats player C and player C mostly beats player A, which Elo won't capture. However this is not an issue of the Elo system specifically but rather of our simplified model of reality -- any other system that tries to capture skill as a one dimensional number, no matter how advanced, will suffer the same flaw.

Besides mathematical inaccuracies Elo (as well as other systems in general) also comes with more potential practical problems such as creating focus on grinding (players strategically choosing weaker opponents to maximize their rating), players refusing to play in order to not lose points, removing fun from games by implementing super effective matchmaking that just maximizes number of draws etcetc. Despite all the described flaws however it must be held that Elo is pretty nice and very useful, it's usually just its wrong application (for example in the mentioned matchmaking) where it starts to create trouble.

Elo rating can also be converted to (or from) alternative measures such as material or time advantage, i.e. given let's say two chess players with known ratings, we may be able to say how big of a handicap the stronger player must suffer in order for the two to be on par. However the relationship will probably not be simple, we can't say "this much Elo difference equals this many pawns in handicap" -- having a two pawn material advantage in a beginner game hardly makes a difference but on the absolute top level losing two pawns is decisively also a lost game (despite this some approximations were given, e.g. Fisher and Kannan estimated that in computer chess 100 Elo was roughly equivalent to one pawn).

How It Works

Initial rating of players is not specified by Elo, each rating organization applies its own method (e.g. assign an arbitrary value of let's say 1000 or letting the player play a few unrated games to estimate his skill).

Suppose we have two players, player 1 with rating A and player 2 with rating B. In a game between them player 1 can either win, i.e. score 1 point, lose, i.e. score 0 points, or draw, i.e. score 0.5 points. (Some games may allow to give more possible outcomes besides win/loss/draw, some wins may be "stronger" than others for example -- this is still compatible with Elo as long as we can map the outcome to the range between 0 and 1.)

The expected score E of a game between the two players is computed using a sigmoid function (400 is just a magic constant that's usually used, it makes it so that a positive difference of 400 points makes a player 10 times more likely to win):

E = 1 / (1 + 10^((B - A)/400))

For example if we set the ratings A = 1700 and B = 1400, we get a result E ~= 0.85, i.e in a series of many games player 1 will get an average of about 0.85 points per game, which can mean that out of 100 games he wins 85 times and loses 16 times (but it can also mean that out of 100 games he e.g. wins 70 times and draws 30). Computing the same formula from the player 2 perspective gives E ~= 0.15 which makes sense as the number of points expected to gain by the players have to add up to 1 (the formula says in what ratio the two players split the 1 point of the game).

After playing a game the ratings of the two players are adjusted depending on the actual outcome of the game. The winning player takes some amount of rating points from the loser (i.e. the loser loses the same amount of point the winner gains which means the total number of points in the system doesn't change as a result of games being played). The new rating of player 1, A2, is computed as:

A2 = A + K * (R - E)

where R is the outcome of the game (for player 1, i.e. 1 for a win, 0 for loss, 0.5 for a draw) and K is the change rate which affects how quickly the ratings will change (can be set to e.g. 30 but may be different e.g. for new or low rated players). So with e.g. K = 25 if for our two players the game ends up being a draw, player 2 takes 9 points from player 1 (A2 = 1691, B2 = 1409, note that drawing a weaker player is below the expected result).

How to compute Elo difference from a number of games? This is useful e.g. if we have a chess engine X with Elo EX and a new engine Y whose Elo we don't know: we may let these two engines play 1000 games, note the average result E and then compute the Elo difference of the new engine against the first engine from this formula (derived from the above formula by solving for Elo difference B - A):

B - A = log10(1 / E - 1) * 400

Some Code

Here is a C code that simulates players of different skills playing games and being rated with Elo. Keep in mind the example is simple, it uses the potentially imperfect rand function etc., but it shows the principle quite well. At the beginning each player is assigned an Elo of 1000 and a random skill which is normally distrubuted, a game between two players consists of each player drawing a random number in range from from 1 to his skill number, the player that draws a bigger number wins (i.e. a player with higher skill is more likely to win).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PLAYERS 101
#define GAMES 10000
#define K 25          // Elo K factor

typedef struct
{
  unsigned int skill;
  unsigned int elo;
} Player;

Player players[PLAYERS];

double eloExpectedScore(unsigned int elo1, unsigned int elo2)
{
  return 1.0 / (1.0 + pow(10.0,((((double) elo2) - ((double) elo1)) / 400.0)));
}

int eloPointGain(double expectedResult, double result)
{
  return K * (result - expectedResult);
}

int main(void)
{
  srand(100);

  for (int i = 0; i < PLAYERS; ++i)
  {
    players[i].elo = 1000; // give everyone initial Elo of 1000

    // normally distributed skill in range 0-99:
    players[i].skill = 0;

    for (int j = 0; j < 8; ++j)
      players[i].skill += rand() % 100;

    players[i].skill /= 8;
  }

  for (int i = 0; i < GAMES; ++i) // play games
  {
    unsigned int player1 = rand() % PLAYERS,
                 player2 = rand() % PLAYERS;

    // let players draw numbers, bigger number wins
    unsigned int number1 = rand() % (players[player1].skill + 1),
                 number2 = rand() % (players[player2].skill + 1);

    double gameResult = 0.5;

    if (number1 > number2)
      gameResult = 1.0;
    else if (number2 > number1)
      gameResult = 0.0;
  
    int pointGain = eloPointGain(eloExpectedScore(
      players[player1].elo,
      players[player2].elo),gameResult);

    players[player1].elo += pointGain;
    players[player2].elo -= pointGain;
  }

  for (int i = PLAYERS - 2; i >= 0; --i) // bubble-sort by Elo
    for (int j = 0; j <= i; ++j)
      if (players[j].elo < players[j + 1].elo)
      {
        Player tmp = players[j];
        players[j] = players[j + 1];
        players[j + 1] = tmp;
      }

  for (int i = 0; i < PLAYERS; i += 5) // print
    printf("#%d: Elo: %d (skill: %d\%)\n",i,players[i].elo,players[i].skill);

  return 0;
}

The code may output e.g.:

#0: Elo: 1134 (skill: 62%)
#5: Elo: 1117 (skill: 63%)
#10: Elo: 1102 (skill: 59%)
#15: Elo: 1082 (skill: 54%)
#20: Elo: 1069 (skill: 58%)
#25: Elo: 1054 (skill: 54%)
#30: Elo: 1039 (skill: 52%)
#35: Elo: 1026 (skill: 52%)
#40: Elo: 1017 (skill: 56%)
#45: Elo: 1016 (skill: 50%)
#50: Elo: 1006 (skill: 40%)
#55: Elo: 983 (skill: 50%)
#60: Elo: 974 (skill: 42%)
#65: Elo: 970 (skill: 41%)
#70: Elo: 954 (skill: 44%)
#75: Elo: 947 (skill: 47%)
#80: Elo: 936 (skill: 40%)
#85: Elo: 927 (skill: 48%)
#90: Elo: 912 (skill: 52%)
#95: Elo: 896 (skill: 35%)
#100: Elo: 788 (skill: 22%)

We can see that Elo quite nicely correlates with the player's real skill.