Elo vs Glicko

If you’ve ever done any matchmaking coding (…or played chess) you’re most likely familiar with the Elo rating system (PSA: it’s Elo, not ELO, it’s not an acronym, it’s named after a person - Arpad Elo). It’s the rating system used by most chess federations (most notably FIDE) and has been adopted by many video games. It’s trivial to implement and get up and running, but it suffers from a few issues:

  • it’s just one number, there’s no reliability (or deviation) information of any kind. Let’s assume your system gives a 1500 rating to a new player (no games played). Given 2 players with a 1500 rating, one with 100 games played, the other with 0, it’s obvious the rating of the first one is much more trustworthy. Elo doesn’t care, rating changes are controlled by a single system parameter (K) and the update is zero-sum: whatever the winner gains, the loser drops, and with equal ratings each player moves by K/2 points (see the sketch after this list).

  • it was created to measure individual ratings, so it doesn’t work for teams out of the box. There’s been some work done to alleviate that, but it’s tricky

  • Elo doesn’t care about activity. In a way, it rewards ‘retiring’ after achieving a very high rating, as at this point winning doesn’t give you many points, but any loss causes a big drop
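
For reference, the whole Elo update fits in a few lines. A minimal Python sketch of the standard formula (the function names and the K=20 default here are mine, picked for illustration):

    def elo_expected(rating_a, rating_b):
        """Expected score of player A against player B, between 0 and 1."""
        return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

    def elo_update(rating_a, rating_b, score_a, k=20.0):
        """Return both players' new ratings; score_a is 1 (win), 0.5 (tie) or 0 (loss).
        The update is zero-sum: whatever A gains, B loses."""
        delta = k * (score_a - elo_expected(rating_a, rating_b))
        return rating_a + delta, rating_b - delta

    # Two equally rated players: the winner gains K/2, the loser drops by K/2.
    print(elo_update(1500.0, 1500.0, 1.0))  # (1510.0, 1490.0)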

In 1995, Mark Glickman, a professor from Boston University, created an alternative rating system - Glicko (and later an improved version - Glicko2) - that’s supposed to deal with some of these issues. The most notable differences are:

  • each player is described by 3 parameters: rating, ratings deviation (RD) and - in Glicko2 - volatility. A greater rating deviation results in much greater rating changes (so, typically, it should quickly converge to your ‘real’ rating)

  • ratings are recalculated per ‘period’, not after each game. The rating period duration is at the discretion of the administrator, but the recommendation is for it to include 5-10 games per player. If a player didn’t play any games during a certain period, his rating stays the same, but his RD will increase, so the longer you rest, the less reliable your rating becomes and the more rapid the change will be once you come back (see the sketch after this list)
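
The ‘longer you rest, the less reliable your rating becomes’ part comes straight from the start-of-period RD update. A sketch using the original Glicko formula (Glicko2 gets the same effect via the volatility parameter); the c value below is made up for illustration - in practice you derive it from how quickly you want a rating to become uncertain again:

    import math

    def inflate_rd(rd, idle_periods, c=35.0, max_rd=350.0):
        # Glicko step 1: uncertainty grows with every rating period a player
        # sat out, capped at the RD of a brand-new player (350).
        return min(math.sqrt(rd * rd + c * c * idle_periods), max_rd)

    # The longer the break, the less certain the rating (and the bigger the
    # swing once the player comes back).
    print(inflate_rd(50.0, 1))    # ~61
    print(inflate_rd(50.0, 10))   # ~121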

Calculations are a little bit more complex, but shouldn’t be a problem for modern computers. Glicko has been adopted by a number of video games like CS:GO, Heroes of the Storm or Guild Wars 2. I decided to test how these 2 algorithms perform on real-world data. I knew Nate Silver and his crew had been calculating Elo ratings for NBA teams, so I decided to do something similar for the NHL. I’ve only included the 2015-2016 regular season here, so 82 games for each team. I used a vanilla win-tie-loss system (1/0.5/0), nothing fancy like home-ice advantage or the victory margin (home ice could make sense, margin probably not, hockey’s too volatile). Scores have been provided by The Hockey Reference. Random observations:

  • Elo took much less time to implement/tweak, it basically ‘just works’. There’s just a single parameter (K) and the usual rule of thumb is: set it to some high value for the first X games, use a lower value after that. 538 used 20 for their NBA ratings, my system uses 50 for the first 10 games and 20 afterwards (see the sketch after this list). K is also trivial to ‘get’ intuitively, it’s simply the biggest possible rating change per game. Glicko2 uses a system parameter ‘tau’, recommended range 0.3-1.2, but I couldn’t really observe a big impact in that range (opted for 0.5). Lower values are supposed to prevent big volatility changes. In my case volatility pretty much stayed at 0.06 (the starting value) anyway (which is a little bit surprising, as NHL results are fairly ‘chaotic’, much more than the NBA’s for sure, you don’t see teams going 73-9).

  • It took me a while to ‘convince’ Glicko that Pittsburgh should be the highest rated team, not Washington (cheating, I know, but I’m being objective here, I’m actually a Capitals fan). The recommended 5-10 games per period might work fine for chess or a ‘lifetime’ rating, but in a season of 82 games that seemed way too long. It’s definitely worth playing with the period length; for video games, where people play far more frequently, it might even make sense to include more than 10 games per period. For my experiment I ended up with a period of 4 days, which usually resulted in 2-3 games for every team (again, see the sketch after this list).

  • Both systems ‘agreed’ for the most part. They had exactly the same success rate in the playoffs (and actually got the same matchups wrong) - 10 out of 15 (ratings were not updated during the playoffs). The San Jose Sharks were the most unexpected team for both of them (I think they were a little bit underappreciated, mostly because of their slow start - they never had time to bounce back properly). Going just by gut feeling, I’d probably agree with Elo a little bit more (it rated SJS higher, for example), but that’s a personal thing, you could easily defend either classification.

  • Glicko changes much more rapidly, even with fairly small RDs. As mentioned, Elo rating changes are capped at K (=20 for the most part), while Glicko’s swings can be much greater, even after RD gets smaller (it was ~50 at the end of the regular season for most teams).

  • Neither system deals with team ratings, but then, they weren’t designed for that. FWIW, Microsoft’s TrueSkill builds on Glicko-like ideas (rating plus uncertainty) and does support teams.

  • Finally, there’s no nice way to say it, but both systems agree - the Maple Leafs were the worst team of 2015/16. Congratulations, Toronto.
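
To make the plumbing above concrete - the 1/0.5/0 scoring from the setup, the K schedule from the Elo bullet and the 4-day rating periods from the Glicko one - here’s roughly what it looks like. A sketch under my assumptions (the names and tuple layout are made up), not the actual script:

    from collections import defaultdict
    from datetime import date

    def game_outcome(home_goals, away_goals):
        # Vanilla chess-style scoring: 1 for a win, 0.5 for a tie, 0 for a loss.
        if home_goals > away_goals:
            return 1.0, 0.0
        if home_goals < away_goals:
            return 0.0, 1.0
        return 0.5, 0.5

    def k_factor(games_played, early_k=50, late_k=20, early_games=10):
        # Elo K schedule: bigger steps while a team's rating is still settling.
        return early_k if games_played < early_games else late_k

    def group_into_periods(games, period_days=4):
        # Bucket (game_date, home, away, home_goals, away_goals) tuples into
        # fixed-length rating periods for Glicko; 4 days ~ 2-3 games per team.
        if not games:
            return []
        start = min(game[0] for game in games)
        periods = defaultdict(list)
        for game in games:
            periods[(game[0] - start).days // period_days].append(game)
        return [periods[i] for i in sorted(periods)]

    # e.g. group_into_periods([(date(2015, 10, 7), 'PIT', 'WSH', 3, 2), ...])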

Python script + complete standings here.
