Elo - Roberson Rating System

Introduction / Abstract

I invented a modified Elo rating system while working for Intel Corp. in Hillsboro, OR around 1996. I started a chess club that met at lunch time and after hours in the cafeteria to play blitz. We grew to 60+ people quickly and were playing 100+ blitz games per day. Eventually, the group became competitive and I added a rating system to calm some of the who's-better-than-whom debates. I was using the standard Elo rating system with a provisional system for new people. Anybody with a USCF OTB (over the board) rating started with that rating. After a while, the rating system started exhibiting deflationary tendencies: some players were improving too quickly for the rating system to keep up, and in doing so they were lowering the ratings of everybody else. At first, I solved the problem by putting some people back on the provisional system. This became too manual, which motivated me to create a new rating system. After reviewing what was really happening, I decided to adjust the Elo system to deal with the problem. This worked very well and the need for periodic manual intervention disappeared. The fundamental problem with the standard USCF Elo system of the 1990s was that it is zero-sum and thus deflationary for everybody rated below 2100, which is the vast majority of chess players. Any rating system that is zero-sum will exhibit the same deflationary behavior and thus needs to be replaced.

What was happening at Intel

Many of the new people joining the club had played chess to varying degrees before, but had never played blitz. Some had played with a clock while others had not. Many hadn't played in years, or even a decade. So, these people had one or more learning curves to overcome, and some simple ones at that. It takes some time for a chess player to get used to a chess clock. It takes more time to get used to blitz time controls and pacing yourself. It takes more time to regularly realize you can win a lost game by moving fast when your opponent is low on time. It takes more time to realize you need to know your openings more deeply and to be able to play elementary endgames by rote. It takes even more time to develop an efficient, prioritized thinking process that finds good moves in a few seconds, and more beyond that.

So, we see several learning obstacles. When each is overcome, a player experiences a nonlinear increase in ability and performance. For some players, this series of obstacles is overcome in quick succession, but for most it happens in a nonlinear manner, each time the player has an epiphany about one of the obstacles. Experience allows you to notice repeated issues; then you come to a realization that leads to a jump in ability and performance. Add to these blitz-specific obstacles the general obstacles of learning to play better chess and you see the age-old adage of practice makes perfect: the more you play and pay attention, the more likely you are to have an epiphany and improve.

From a ratings perspective, while we were using the standard Elo rating system, the higher rated players were losing points despite the fact that some of them were improving. This occurred due to the relatively rapid improvement of the lower rated players combined with the fact that the base Elo system is zero-sum. The problems became noticeable quickly because we were playing and rating many games per day in a closed group.

Does the behavior observed at the Intel lunch blitz club exist in OTB classical time controls?

Yes, I believe it does. I have seen rating charts for individuals that have periods of high oscillation, then become gradually more consistent. At that point, it seems they experience an epiphany, because they get a sudden nonlinear rating jump. Just after the jump, their performance oscillates highly again. The process then repeats multiple times. Also, as the rating gets higher, the nonlinear jumps get smaller. Others may have a period of stability or small linear improvement, then a nonlinear jump. I reference the following USCF rating charts.

  1. Example exhibits early ease of improvement and a rating burst after oscillations.
  2. Example
  3. Example doesn't exhibit oscillations but does exhibit periodic sudden jumps in rating.
  4. Example doesn't exhibit oscillations but does exhibit periodic sudden jumps in rating.
  5. Example exhibits ease of improvement at lower rating levels
  6. Example

I'm not claiming that everybody experiences this cycle, but some do, and a good rating system must allow it to happen instead of forcing some predetermined, false model. When claiming that all people follow a model, one counterexample is enough to prove the model wrong. Also, notice the first graph relative to any of the others. You will see the learning curve is rather different from the rest. The first graph is for one of the world's elite in chess: Hikaru Nakamura.

What was off with the Elo system and why

Arpad Elo experienced the issues of chess ratings in the 1950s under the Harkness rating system. He believed he could improve on it and developed the Elo system, which was adopted by the USCF (United States Chess Federation) in 1960 and by FIDE (Fédération Internationale des Échecs) 10 years later in 1970. Elo thought (quite correctly) that a good and correct rating system needed to model human behavior and performance. So, he gathered lots of tournament data and set out to find out what really happens. He came to the conclusion that ratings should be adjusted based on performance vs expectation. Also, his model made the system zero-sum for competitors in the same rating group. His system assumes that humans perform within a 400 point standard deviation.

Arpad Elo's system was a breakthrough in performance modeling and rating systems. Amazingly, he developed the system in the late 1950s, before the invention of the PC. However, it lacks a certain concept: the Elo system didn't allow for nonlinear rapid improvement at the lower end of the spectrum. The K value, which decides the maximum number of points you can gain for winning a game, had only 3 values:

K Value | Rating Range
32      | rating < 2100
24      | 2100 <= rating < 2400
16      | rating >= 2400
For matches where both players have ratings in the same range, the system was zero-sum: if you gained X rating points, your opponent lost X rating points. Thus, a person or persons improving at a faster-than-average pace for the group would take points from others, who would in turn take them back from others, deflating the ratings of the higher rated players in that rating group. This was happening in our blitz chess club at Intel. The effect is more pronounced at blitz than at classical time controls due to the greater number of games that can be played per day or week.
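
To make the zero-sum property concrete, below is a minimal sketch of the old stepped-K Elo update in Python. The function names are my own; the expectation formula is the standard logistic Elo curve on the 400 point scale.

    def k_classic(rating: float) -> float:
        # Stepped K values of the old USCF Elo system.
        if rating < 2100:
            return 32.0
        if rating < 2400:
            return 24.0
        return 16.0

    def expected_score(r_a: float, r_b: float) -> float:
        # Standard Elo expectation for player A against player B.
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def classic_elo_update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
        # score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
        e_a = expected_score(r_a, r_b)
        delta_a = k_classic(r_a) * (score_a - e_a)
        delta_b = k_classic(r_b) * ((1.0 - score_a) - (1.0 - e_a))
        # When both players share a K value, delta_b == -delta_a: zero-sum.
        return r_a + delta_a, r_b + delta_b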

Why does this happen when Arpad Elo created the system based on real data from the USCF? I have a theory - the data was uncontrollably biased. In the 1950's, all USCF chess ratings were for classical time controls only. On top of that, the vast majority of competitors were adult males. These two issues introduced a data bias which kept Dr. Elo from noticing the problem.

What a rating system should do

  1. Model human behavior/performance: we have our good and bad days - we get worse, we improve
  2. The fundamental concept of competitive sports performance improvement: ease of improvement is inversely proportional to playing strength - the weaker you are the easier it is to improve.
  3. Allow nonlinear bursts due to epiphanies
  4. Allow lower rated players to improve quickly
  5. Rating changes based on performance vs expectation
  6. Experience is irrelevant or of minor relevance: people improve at different rates - a system shouldn't consider the number of games played except at the very beginning for a new player.
  7. We can perform weakly due to fatigue and drop our rating drastically - mostly an online marathon behavior - rating systems should allow the rating to drop or even increase instead of holding it relatively steady.
  8. Allow rapid rating gains for prolonged periods for players that start vastly underrated (e.g., an NM starts playing rapid chess with a rating of 1200).
  9. Online play is not a level playing field: home distractions exist, and upgrades in hardware (mice) can impact a rating - a rating system must be able to adjust to such changes.
  10. Modeled on real data
  11. The system as a whole must allow itself to gain and lose points dynamically with the changes in the abilities of the group.
  12. Nobody should gain rating points over a long term without improving their ability.

Elo - Roberson System

Based on what was happening at the Intel Oregon Blitz Chess Club, it was clear that the system shouldn't be zero-sum. Lower rated players must be allowed to improve faster than higher rated players, and not ultimately at the expense of the higher rated players. It should allow competitors of any rating to drop points and/or gain points (back).

At first, I put people experiencing rapid rating gains back on the provisional system and restored some of the rating points lost by their higher rated opponents. Of course, some of this was a manual process. I concluded that the easiest approach was to modify the Elo rating system by giving it a fully floating K value. The K value would be different for nearly every competitor and would be purely a function of the competitor's rating. The result is a system that is no longer zero-sum and that meets the goals listed above.

Now, the only question is what should the function K = f(R) look like? It should be inversely proportional to the rating, but exactly what?

I thought it should be nonlinear. However, it was the late 1990s and I didn't have access to enough data to be sure of it. I decided to think through the corner cases. The one that bothered me was somebody having a lucky (or unlucky) day. In a nonlinear system, such a player moves up fast but takes longer to fall back down. Such a case could lead to inflation, as other people will be gaining points from him while he is on his way down. Also, a higher rated player may have a bad day, artificially elevating the rating of a lower rated player. The higher rated player will have an easier time going back up than the lower rated player going down. This could be a problem if the K value is too large for lower ratings and tapers down too rapidly.

A linear system has the same problem, but at a much smaller magnitude, because the change in K value as the rating changes is smaller. Such a system is not zero-sum and thus less likely to be deflationary, but too large a gap in K values per rating could still cause problems. I opted for the linear system, where the K value starts large for a low rating and decays linearly as the rating increases. This worked quite well.

I never tried to improve on it again. It worked and that was it.

K values per rating

Once I decided on a linear system, it was just a matter of finding the correct value for C in the following equation.

K = (2100-R) x C + 24
At first, I thought the K value for a rating should be adjusted +10 for every 100 points below 2100, producing C = 0.1. I reviewed the rating changes that would have occurred in the club's history data. The players that I had put back on the provisional rating system would have gained too much, resulting in the entire system being inflationary. Using the idea of a binary search, I ran through the data again with C = 0.05, which equates to adding 10 to K for each 200 point gap between R and 2100. After reviewing the rating changes against the club's history data, I decided that this could be the correct value for C. I implemented the adjustments to my coded Elo rating system and we used it for years. It never exhibited noticeable inflationary or deflationary behavior. The table below shows K values based on various ratings.

K Value | Gain for E = 50% | Rating
124     | 62               | 100
114     | 57               | 300
104     | 52               | 500
94      | 47               | 700
84      | 42               | 900
74      | 37               | 1100
64      | 32               | 1300
54      | 27               | 1500
44      | 22               | 1700
34      | 17               | 1900
24      | 12               | 2100
24      | 12               | 2200
24      | 12               | 2300
16      | 8                | 2400
16      | 8                | 2500
16      | 8                | 2600+

The rows for ratings >= 2100 exhibit the same K values as the normal Elo system. Thus, the possibility of somebody climbing from 2100 to FM, IM or GM faster than was previously possible is not an issue. The K values for ratings between the 200 point rows are linearly interpolated; for example, the K value for a rating of 1760 is 41. An exception exists for ratings >= 2100: above 2100, the K value is a step function.
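
Below is a minimal sketch, in Python, of the K function described above, plus the replay-the-history calibration idea from the previous section. The function names and the (winner, loser) history format are my own inventions for illustration; expected_score is from the earlier sketch.

    def k_roberson(rating: float, c: float = 0.05) -> float:
        # Linear below 2100: K = (2100 - R) * C + 24. Stepped above 2100,
        # matching the normal Elo values (24 up to 2400, then 16).
        if rating < 2100:
            return (2100.0 - rating) * c + 24.0
        if rating < 2400:
            return 24.0
        return 16.0

    def roberson_update(r_a: float, r_b: float, score_a: float,
                        c: float = 0.05) -> tuple[float, float]:
        # Each player's delta uses his own K, so the update is not zero-sum
        # unless the two K values happen to be equal.
        e_a = expected_score(r_a, r_b)
        new_a = r_a + k_roberson(r_a, c) * (score_a - e_a)
        new_b = r_b + k_roberson(r_b, c) * ((1.0 - score_a) - (1.0 - e_a))
        return new_a, new_b

    def total_drift(games: list[tuple[str, str]], start_ratings: dict[str, float],
                    c: float) -> float:
        # Replay a (winner, loser) game history under a candidate C and report
        # the net change in the pool's total points: positive means inflation,
        # negative means deflation. This mirrors the calibration runs above.
        ratings = dict(start_ratings)
        start = sum(ratings.values())
        for winner, loser in games:
            ratings[winner], ratings[loser] = roberson_update(
                ratings[winner], ratings[loser], 1.0, c)
        return sum(ratings.values()) - start

    assert abs(k_roberson(1760) - 41.0) < 1e-9  # the interpolation example above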

Is the Elo-Roberson rating system inflationary or deflationary?

By itself, the Elo system is deflationary (not inflationary) for all people in the same K group due to it being a zero-sum system. Adjusting the system so that it can gain points as a whole is the way to stop it from being deflationary. Allowing it to gain too many points will make it inflationary and all ratings could go up. Too little gain in total system points from improving players will allow it to continue being deflationary, though not as much as before.
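
Using the roberson_update sketch from the previous section, here is a quick check (with illustrative ratings) of how the pool gains and loses points:

    # Upset: a 1500 beats a 1900. E(1500) = 1/11, so the winner gains
    # 54 * (10/11) ~= 49.1 while the loser drops only 34 * (10/11) ~= 30.9,
    # and the pool as a whole gains about 18 points.
    print(roberson_update(1500.0, 1900.0, 1.0))

    # Expected result: the 1900 beats the 1500. The winner gains 34 * (1/11)
    # ~= 3.1 while the loser drops 54 * (1/11) ~= 4.9, so the pool loses
    # about 1.8 points. The total floats with the group's results.
    print(roberson_update(1900.0, 1500.0, 1.0))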

In practice, the system didn't exhibit any signs of inflation or deflation: none of the symptoms of either problem appeared. Participation satisfaction went up for the higher rated members: they were no longer guaranteed to lose points to a slow drain. Due to reduced oscillation in individual ratings, individual rankings oscillated less. Individual rating gains were more gradual and plateaus were less harsh.


------------------ Fill in this Blank --------------------------------------

------------------ Fill in this Blank --------------------------------------

USCF Rating System as of April 2017

The USCF used a bonus point system to add points to a member's rating, and thus to the system as a whole, when a member outplayed his/her expectation in a tournament of 3 or more games. This was insufficient during the 1990s and 2000s. An updated system was implemented using a floating K value, recognizing the potential for rapid improvement in lower rated players. The following table shows K values for various rating levels.

K values per rating

K Value | Gain for E = 50% | Rating
80      | 40               | 700
72.727  | 36.36            | 900
61.538  | 30.7             | 1100
53.333  | 26.67            | 1300
47.058  | 23.529           | 1500
38.095  | 19.047           | 1700
30.769  | 15.3846          | 1900
23.5294 | 11.764           | 2100
20.5128 | 10.256           | 2200
17.39   | 8.6956           | 2300
15.68   | 7.843            | 2400
15.68   | 7.843            | 2500
15.68   | 7.843            | 2600+
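
For readers who want to experiment, below is a small helper that looks up a K value from the published table, linearly interpolating between rows. This is only an illustration built from the table above, not the USCF's actual formula; ratings below 700 are shown later in the comparison table as "P.E." (a special provisional formula) and are left out here.

    # (rating, K) points taken from the table above.
    USCF_K_TABLE = [
        (700, 80.0), (900, 72.727), (1100, 61.538), (1300, 53.333),
        (1500, 47.058), (1700, 38.095), (1900, 30.769), (2100, 23.5294),
        (2200, 20.5128), (2300, 17.39), (2400, 15.68),
    ]

    def uscf_k_approx(rating: float) -> float:
        # Clamp to the table's range, then interpolate between adjacent rows.
        if rating <= USCF_K_TABLE[0][0]:
            return USCF_K_TABLE[0][1]
        if rating >= USCF_K_TABLE[-1][0]:
            return USCF_K_TABLE[-1][1]
        for (r_lo, k_lo), (r_hi, k_hi) in zip(USCF_K_TABLE, USCF_K_TABLE[1:]):
            if r_lo <= rating <= r_hi:
                frac = (rating - r_lo) / (r_hi - r_lo)
                return k_lo + frac * (k_hi - k_lo)
        raise AssertionError("unreachable")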

Glicko system #1 and what is wrong with it

The primary problem is the fundamental concept behind its creation: "the more you play the more consistent you become". Another way of saying this is that the more you practice, the less likely you are to improve. Also, under Glicko, your ability to improve has nothing to do with your current level of play. This flies directly in the face of points 2, 3, 4 and 5 above. Of extreme importance is that it defies the age-old concept of "practice makes perfect".

The "95% confidence level" concept is untrue. My personal experience on chess.com is summarized in my recent bullet rating improvement of 400+ points in less than a month. Of course, I hit the lowest RD value I could by the time I made a 200 point gain. (I had been playing a while to start with). So, by the time I was at start+200 the system was 95% sure that my real rating was between start+150 and start+250. It continued to be 95% sure that I should not improve for the next 25 or more games resulting in another 200 rating point gain.

RD = min(sqrt(RD0^2 + c^2 * t), 350)

The variables c and t don't have anything to do with real variance. RD0 is the player's rating deviation at the end of the previous rating period. The variable t is the time since the last game: the longer the gap, the larger the values of t and RD. The constant c reflects the assumed uncertainty of a player's skill over a given amount of time.

In the above equation, we see that the RD value is in no way a function of the variance of your rating or performance. In statistics, a standard deviation is a function of the variance; thus, the RD value is not a real standard deviation.
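
A minimal sketch of that growth rule in Python, using the equation above; the value of c here is only an example (the value that would take an RD of 50 back to 350 after 100 rating periods), and real servers choose their own:

    import math

    def rd_after_idle(rd0: float, t: float, c: float = 34.64) -> float:
        # Glicko-1 pre-period RD update: uncertainty grows with idle time t,
        # capped at 350. Note that the player's actual result variance never
        # appears anywhere in this formula.
        return min(math.sqrt(rd0 ** 2 + (c ** 2) * t), 350.0)

    # A steady player at RD = 50 who goes idle: RD climbs back toward 350
    # purely as a function of elapsed time.
    print(rd_after_idle(50.0, 30.0))  # ~196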

The points you can gain in a match against somebody rated 100 points higher than you are the same no matter your rating, whether you are 1100 or 2000. This goes against the fundamental concept of competitive sports performance improvement: ease of improvement is inversely proportional to playing strength. In other words, the system believes your ability to improve is the same whether you are rated 1100 or 2100, instead of it being easier to improve at the lower end of the spectrum than at the upper end.
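
To check this claim, here is a sketch of the single-game Glicko-1 rating update, following Glickman's published formulas as best I can reproduce them; the RD values are assumptions for illustration. Because the update depends only on the rating difference (and the RDs), the two calls below print the same gain.

    import math

    Q = math.log(10) / 400.0

    def g(rd: float) -> float:
        # Attenuation factor for the opponent's rating uncertainty.
        return 1.0 / math.sqrt(1.0 + 3.0 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

    def glicko1_gain(r: float, rd: float, r_opp: float, rd_opp: float,
                     score: float) -> float:
        # Rating change for a single game under Glicko-1.
        e = 1.0 / (1.0 + 10.0 ** (-g(rd_opp) * (r - r_opp) / 400.0))
        d2 = 1.0 / ((Q ** 2) * (g(rd_opp) ** 2) * e * (1.0 - e))
        return (Q / (1.0 / rd ** 2 + 1.0 / d2)) * g(rd_opp) * (score - e)

    # Beating somebody 100 points higher yields the same gain at 1100 as at
    # 2000, because only the difference enters the formulas.
    print(glicko1_gain(1100, 50, 1200, 50, 1.0))
    print(glicko1_gain(2000, 50, 2100, 50, 1.0))  # identical to the above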

E=0%+ & E=50% Rating Gain for Elo, Elo-Roberson, USCF and Glicko

Rating | Old USCF Elo (E=0% / E=50%) | Elo-Roberson (E=0% / E=50%) | New USCF (E=0% / E=50%) | Glicko (E=0% / E=50%)
100    | 32 / 16 | 124 / 62 | P.E. / P.E.      | 16 / 8
300    | 32 / 16 | 114 / 57 | P.E. / P.E.      | 16 / 8
500    | 32 / 16 | 104 / 52 | P.E. / P.E.      | 16 / 8
700    | 32 / 16 | 94 / 47  | 80 / 40          | 16 / 8
900    | 32 / 16 | 84 / 42  | 72.727 / 36.36   | 16 / 8
1100   | 32 / 16 | 74 / 37  | 61.538 / 30.7    | 16 / 8
1300   | 32 / 16 | 64 / 32  | 53.333 / 26.67   | 16 / 8
1500   | 32 / 16 | 54 / 27  | 47.058 / 23.529  | 16 / 8
1700   | 32 / 16 | 44 / 22  | 38.095 / 19.047  | 16 / 8
1900   | 32 / 16 | 34 / 17  | 30.769 / 15.3846 | 16 / 8
2100   | 24 / 12 | 24 / 12  | 23.5294 / 11.764 | 16 / 8
2200   | 24 / 12 | 24 / 12  | 20.5128 / 10.256 | 16 / 8
2300   | 24 / 12 | 24 / 12  | 17.39 / 8.6956   | 16 / 8
2400   | 16 / 8  | 16 / 8   | 15.68 / 7.843    | 16 / 8
2500   | 16 / 8  | 16 / 8   | 15.68 / 7.843    | 16 / 8
2600+  | 16 / 8  | 16 / 8   | 15.68 / 7.843    | 16 / 8

In the USCF columns above, P.E. indicates the use of a special rating formula for those who are effectively provisionally rated. In the table, E stands for expectation: E = 50% happens when two people of equal rating are paired with each other, and E = 0% is the points won when the lower rated player has zero expectation of winning because he is severely outrated. Here is a list of noteworthy observations from the above table.

  1. The difference between the old USCF Elo system and the new one: the points gained vary with rating in the new one.
  2. The similarity between E = 50% in the current USCF system and the Elo-Roberson system.
  3. Glicko's E = 50% gain is the same regardless of rating, like the old USCF Elo system for those rated below 2100.
  4. Glicko points are half those of the old USCF system for those rated below 2100. The Glicko numbers are for people who play regularly.
  5. The old USCF system is zero-sum and thus deflationary for all rated below 2100.
  6. The Glicko system is zero-sum and thus deflationary for all players with high participation levels.
  7. The new USCF and Elo-Roberson systems are zero-sum only for pairings of two equally rated players and are therefore not deflationary.

Glicko system #2 and what is wrong with it


------------------ Fill in this Blank --------------------------------------

------------------ Fill in this Blank --------------------------------------

Special issues with online systems

Online systems are readily accessible, allowing far more games per day, let alone per week, than OTB play. Any system that degrades your chances of improvement as your number of games rises will become an issue faster online than in OTB play. Also, there are external factors influencing performance level:

  1. Online play is not a level playing field - there are too many variables from house to house.
  2. It is often easier for kids to get some peace and quiet to play online chess than it is for their parents.
  3. The probability of being interrupted during a game is higher for adults with kids than for adults without kids or for kids.
  4. Hardware: a weak computer can impede your ability to play at your best.
  5. A bad mouse can consume time and increase the probability of mouse slips.
  6. You've been playing consistently and your RD is low, then you get a coach and your playing strength improves faster than the low RD will allow.
  7. Internet outages: wifi router issues, power interruptions, and ISP interruptions can result in undue losses.

All of these and more produce variability beyond what the Glicko system assumes. To say that they all affect everybody the same is ludicrous. For example, some people use laptops and have a UPS connected to their modems and routers while others don't. Some people have a wife and two+ kids while others don't.

Suppose you have been playing a lot, your RD value has you at a low improvement rate, and you have an old, bad mouse. When you buy a new, much more responsive gaming mouse, the system will not notice or care. You will play better, but your number of games is high enough that you are not expected to improve.

This happened to me on chess.com. I had an old mouse with which I often had to make mouse movements twice to move from a1 to h8... My son bought himself a high end gaming mouse and handed me down his low end gaming mouse. Low end or not, it is amazingly better than what I was using. I tested the concept by playing several G/1 bullet games with the old mouse. There were several games where I lost and my opponent had at least 15 seconds left, sometimes 30 or more. After about 7 games, I switched to the low end gaming mouse. I started winning games. When I lost on time, the most my opponents had on me was around 10 seconds, and many times less than 5 seconds.

Obviously, my online playing ability had just improved, but the Glicko rating system didn't know that and assumed that the more I played, the less I should improve.

In another example, our chess team on chess.com has an NM, but the team matches use rapid ratings and he had not played rapid games on that server, so his rating was 1200. He tried for a week to improve it and managed to get it into the 1700s. It took many games because the system reduced the number of points he could gain per game with every game played. Sometimes (due to fatigue) he'd make a mistake and lose a game, but the number of possible points for a win still kept shrinking. It has been a month and his rapid rating is in the 1800s.

Conclusion

The Elo-Roberson and the new USCF rating systems are valid improvements over the standard Elo system. They are neither inflationary nor deflationary, and they adhere to the behavior characteristics that model human competitive sport performance, especially in chess. On the other hand, Glicko systems #1 and #2 violate some of these behavior characteristics. The Glicko systems erroneously assume that the more you play, the less likely you are to improve. They also erroneously assume that consistency can exist for everybody on the non-level playing field of internet chess.

The Elo-Roberson and new USCF systems are more capable of handling large improvements in ability, and of dealing with improvements from epiphanies during regular play, than the Glicko systems.

I suggest that any server using the Glicko system replace it with a system that more properly adheres to the behavior characteristics that model human competitive sport performance, such as the Elo-Roberson system or the new USCF system.

More Rating Charts for Reference

  1. Example
  2. Example
  3. Example
  4. Example
  5. Example
  6. Example
  7. Example
  8. Example
  9. Example
  10. Example
  11. Example
  12. Example
  13. Example
  14. Example
  15. Example
  16. Example

References

  1. USCF Rating System as of June 11, 2020

Potential reviewers:

Rachel Roberson, Evan Roberson, Patricia Roberson
John Timmel, Craig Jones, Walter High, Thad Rodgers, Kevin Hyde, Maurice Dana
Dr Joseph Graves, Dr Gene Tagliarini
David Scott, Dr Joseph Brandenburg, Tom Fletcher, Tom Crispin
Jesse Griffin, Wayne Jones, Anand Mahadevan, Joe Swindler
Peter Hornsby, Alloyis Lip