RoboRumble/RankingSystem


Ranking system

The way the ranking system works right now is as follows:

  1. All bots start at 1600 (just an arbitrary number).
  2. The performance of a bot against a given enemy is measured using the % of score it gets against that enemy. Results from previous matches are not simply added up but rolled together using an exponential smooth (alpha = 0.8). This reduces variance while still allowing old results to fade away.
  3. Every time a battle is uploaded, the % scores for both contenders are updated.
  4. Immediately after that, the rating of each contender is updated. To update a rating, the ratings of all bots other than the contender are held fixed. Then, for each of these bots, the contender's % of score is read, the expected % is calculated from the current ratings, and the contender's rating is adjusted using the formula 3 * (expected % - % of score). (A sketch of this update step follows below.)
  5. The formula used to calculate the expected results is the same as the one used in the ER.

-- Albert
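
A minimal sketch of the smoothing and rating update described above, with hypothetical names; the real servlet code may well differ. It assumes the conventional sign (a bot that scores above expectation gains rating points - the list above writes the difference the other way round) and an ER-style expected-score formula, since the exact formula is not quoted on this page.

 // Sketch only - hypothetical names, not the actual RoboRumble servlet code.
 class RatingSketch {
     // Roll the new battle result into the smoothed % score
     // (alpha here is the weight kept on the old value, as in the servlet line quoted further down).
     static double smooth(double oldPct, double battlePct, double alpha) {
         return alpha * oldPct + (1 - alpha) * battlePct;
     }

     // Assumed ER-style expected % score - an assumption, the exact formula is not given here.
     static double expectedPct(double rating, double enemyRating) {
         return 1.0 / (1.0 + Math.pow(20, (enemyRating - rating) / 800.0));
     }

     // Update the contender's rating against every enemy, holding the enemy ratings fixed.
     // Scores are fractions in [0, 1].
     static double updateRating(double rating, double[] enemyRatings, double[] smoothedPct) {
         final double K = 3.0;
         for (int i = 0; i < enemyRatings.length; i++) {
             // Conventional sign: scoring above expectation raises the rating.
             rating += K * (smoothedPct[i] - expectedPct(rating, enemyRatings[i]));
         }
         return rating;
     }
 }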


An easy way to have the ratings 'settle down' is to run through the set of results repeatedly, adjusting only the ratings on each pass; eventually the ratings will converge. Apart from the saved-data issue, all results are as valid as any other, so results from earlier games can be recycled in the rating system (if you do this the rating constant can be reduced so that the rating is more accurate). In other words, what I am saying is that you already have more than enough battles to get an accurate rating of each bot now. -- Paul Evans

Good idea. It's like running a training set through a NN several times. I'll try to implement it. -- Albert

If you find you have to merge result information it is possible to perform the same sort of operation on the merged results (this is what the eternal rumble does) - just let me know if you want more details (I've done a bit of work in this area). -- Paul Evans

By running the results through again, all you are really doing is making the ranking constant higher, no? -- Tango

No - this is not the case - because the rating system as it is being used here gives 'priority' to the latest results (which is normally what is desired with, say, chess ratings - to show the current 'form' of the players). Robots, on the other hand, are more consistent, and what is important is the average strength of a bot over all its battles. The fastest way of getting consistent results would be to run through the results first with a high ranking constant, then repeat through the results with a decreasing constant each time. -- Paul Evans

Two further things ought to be done to the rankings:

  1. Temporarily increase the rating constant so the ratings settle more quickly.
  2. Normalise the ratings against a reference bot, so that ratings stay comparable over time.

-- Paul Evans

Temporarily upping the constant has been suggested numerous times, so I'd assume it's coming soon. As for normalising the ratings, doesn't starting every bot at 1600 have the effect of creating that as a median rating anyway? -- Kuuran

I like the idea of bots evolving over time (even if most of them will not), so averaging all the matches seems a little bit static to me. What about rolling the results per enemy (the further back in time, the less important they become)? It should help us keep the data files small and provide more stability to the system, while at the same time allowing for some "evolution".

  1. for a given bot, at a certain moment (t), we keep an "estimate" of the score distribution for each enemy (i) Ei(t).
  2. for a given battle, we have a real sample of this estimate Si(t+1). Note that for the enemy, it will be 1-Si(t+1).
  3. We calculate the new estimate Ei(t+1) = alpha * Si(t+1) + (1-alpha) * Ei(t) - let's say alpha = 0.2 (a small worked example follows below the list).
  4. Finally, we use Ei(t+1), 1-Ei(t+1), the current robot rating and the current enemy rating to calculate the rating change.
  5. It would also stabilize the rankings, because the variance for Ei is lower than the one for Si.

-- Albert
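
A tiny worked example of step 3 above, with invented numbers, just to illustrate how gently the estimate moves with alpha = 0.2:

 // Hypothetical numbers, for illustration only.
 double alpha = 0.2;
 double oldEstimate = 0.55;    // Ei(t), the current smoothed estimate
 double battleResult = 0.70;   // Si(t+1), the % score from the new battle
 double newEstimate = alpha * battleResult + (1 - alpha) * oldEstimate;
 // newEstimate is 0.58: the estimate moves only a fifth of the way toward the new sample.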

Starting the ratings at 1600 means that the average rating will be 1600 - however, a sort of rating deflation will take place because, typically, most new bots enter at the top end of the performance range. By standardising the rating on a particular bot it is possible to see how the general standard improves over time (and it also does not matter if a cull of bots occurs). -- Paul Evans

If you like the idea of bot evolution then it is very important to allow data directories to travel to their @home clients and back. Note also, however, that the longer a bot version stays static, the less important the early battles are; I don't think a kind of rolling average is required for the results, since after several days a bot's total scores against its enemies will basically be close to its trained-up scores. Given a set of real scores, it is possible through a simple, fast, iterative process to create ratings with minimal error (this can be done at each half-hour interval if required) - all that is required is to choose the set of real scores - e.g. scores from the last day, all scores, all scores from bots within 200 rating points - the choice is yours. Data-wise all you need is a table of 'Bot A', 'Bot B', 'Total Score of Bot A', 'Total Score of Bot B', 'Total Rounds' for each pairing. The total rounds is required to weight the importance of each pairing. -- Paul Evans
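
A minimal sketch of the kind of iterative fit Paul describes, using the pairing table he lists; the names, the rounds-based weighting cap and the decreasing-constant schedule are invented for illustration, and the expected-score formula is only assumed to be an ER-style one.

 // Sketch only - not the actual rating code.
 class IterativeRatingSketch {
     static class Pairing {
         int a, b;               // indices of the two bots
         double scoreA, scoreB;  // total score of each bot in this pairing
         double rounds;          // total rounds, used to weight the pairing
     }

     // Assumed ER-style expected % score.
     static double expected(double ratingA, double ratingB) {
         return 1.0 / (1.0 + Math.pow(20, (ratingB - ratingA) / 800.0));
     }

     // Sweep the pairings repeatedly, shrinking the constant each sweep so the ratings settle.
     static double[] fit(Pairing[] pairings, int botCount) {
         double[] rating = new double[botCount];
         java.util.Arrays.fill(rating, 1600.0);
         double k = 30.0;                                          // start high (invented schedule)...
         for (int sweep = 0; sweep < 30; sweep++, k *= 0.8) {      // ...and decrease each sweep
             for (Pairing p : pairings) {
                 double total = p.scoreA + p.scoreB;
                 if (total <= 0) continue;                         // skip empty pairings
                 double actualA = p.scoreA / total;                // observed % score of bot A
                 double weight = Math.min(1.0, p.rounds / 350.0);  // more rounds = more trust (350 is arbitrary)
                 double delta = k * weight * (actualA - expected(rating[p.a], rating[p.b]));
                 rating[p.a] += delta;
                 rating[p.b] -= delta;
             }
         }
         return rating;
     }
 }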

I think that by "bot evolution" Albert meant new versions of the bot. I think we are already planning to keep that table you are talking about in the data files. Which would mean we have what's necessary to implement that scheme. The upload servlet could maybe do it every time it runs, if it hasn't been done for half an hour or so? Maybe you could see if you could implement it in that servlet? Or provide a class that could be used by the servlet? -- PEZ

What's wrong with that ranking deflation? I think it's kind of a good thing; that way a rating tells you how good a robot is against its contemporaries, instead of how good it is against bots in general. This allows for direct comparison across generational gaps. For example, SandboxDT 2.11 is a lot better than many previous versions of Sandbox, but the 1900+ rating shows that it also has a larger lead on the average bot than any before (as memory serves), whereas a direct fight against something like Yngwie or Fermat 1.6 wouldn't yield terribly useful results. This provides a more valid basis of comparison for questions like 'given the technologies evolved at the time, which bot was more significant?' or 'who was the better champion?'. It's a pity we don't have such ratings for older bots. We could make an oldies league using RoboRumble and generate them, but I don't think that's happening anytime soon. -- Kuuran

I'm not referring only to different versions. It's just that I don't feel comfortable with a system that uses the full history. Imagine one of my bots is very unlucky for some reason. I'd like its results to fade away with time, and get replaced by new ones. I agree with Paul that using all the history is mathematically correct, and that in the long run the results will return to "normality". But as Keynes said (I'm translating back to English, so these may not be the exact words): "In the long run we are all dead." -- Albert

Kuuran, I think you may have misunderstood - what I am suggesting is that you rate all the bots as you do at the moment - then take a look at the rating that, say, PrairieWolf has - let's say it is 1712.6 - then subtract 12.6 from all the ratings - this means that PW has a rating of 1700 - the differences in ratings would be preserved, but ratings from one month to the next would be approximately equivalent. If you do not do this, the absolute ratings are dependent on a) the start value, b) the rating value of bots removed from the league, and c) the rating of new bots added to the league - in other words the absolute rating value has no foundation. A rating of 1900 means nothing without a benchmark - in reality only the rating difference tells you how good a bot is against its contemporaries. -- Paul Evans

Right, but the start value of the league doesn't change, and this is the average score. So seeing how far a bot is above or below the modern average seems to me like a better method than seeing how far it is above a single bot, whose own abilities against its contemporaries will fluctuate. Grounding against the average just seems more relevant to me.

Yes, every bot affects it, but individually not terribly much, and in a sense this is what makes it a valid reflection of a given time. The way I see it, the difference is: do we (not having a reference bot) look at the difference from 1600 - which will be the score of the modern average bot - or do we (having a reference bot) look at the difference from 1700 - which is the score of a bot whose ability (and change in ability) may or may not reflect general trends. If we lock around one bot and my rating continues to go up, it tells me that I'm improving faster than that one bot; if we simply set the system so it orients itself around 1600, it tells me that I'm improving faster than my opponents, and in a good position. Also, if I rate 2000 (I wish) sometime in the future, it tells me that I'm further above that time's average than SandboxDT is above today's average. If at some time in the future I want to know how much better than PrairieWolf I am, I can just look at the difference of our scores then. I don't see the point in keeping a non-evolving bot as a reference in an evolving system.

This also restricts higher scores to innovators. Mogbot is nothing impressive today, but in his day he broke all sorts of ground, and would have rated very highly in this contest. Now that everyone has access to the data required to make their own Mogbot clones, and has had time to develop even better variations of those systems (or new systems), it won't do nearly as well. However, if I compare Mogbot's score against PrairieWolf's score in a number of battles, then repeat the process for an average modern patternmatcher, I would find Mogbot is the worse bot. Head to head this may be true, but you're comparing against a bot out of its time; a 'how good were they then' rating would put Mogbot miles above a modern clone. This way the only bots posting the big numbers are the ones who did something special, or had something their opponents didn't, instead of just the ones who have been participating in more recent competition.

Am I missing something? -- Kuuran

I agree, being able to compare bots from different times is important, so keep it like it is. Although why did you (whoever it was) pick 1600? It seems a fairly random number. Did you just think that was a roughly average score from the ER? I would prefer 0 to be the average; then -ve numbers are below average, and +ve above. If the ranking system wouldn't work with 0 (i.e. it would get divide-by-zero errors somewhere; I don't remember how the system works, so I'm not sure), then stick with it at 1600 and subtract 1600 from each ranking in the last stage of the process. -- Tango

Actually, 0 would work fine mathematically. The median is arbitrary. My impression is that the ER uses 1500, so you can expect our scores to converge on roughly 100 higher than the ER Ratings for each bot. The RobocodeLittleLeague will use a similar method centered on zero. -- Kawigi

There are logical reasons for using 0; 1600 is just an arbitrary value chosen because someone thought it seemed like a good idea. Are there limits on the rankings? Could you get -ve rankings on the current system? If what you say is true, you should be able to. If I remember the system correctly, there is a curve of rankings. Does that have an asymptote at some ranking? -- Tango

I don't believe there is a real asymptote in the sense that you are thinking. As the relative score of two bots approaches 100% or 0%, the difference in score should approach +infinity and -infinity. However, it appears that in order to get a negative score with the average set at 1600, this bot would have to have a score of (it appears) 1/400 of the opponent's score. Say in 10 rounds it lost 1200-3 or something. Does that look correct? -- Kawigi
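
A rough check of Kawigi's figure, assuming the expected-score formula is an ER-style one, expected % = 1 / (1 + 20^((enemyRating - rating) / 800)) - an assumption, since the exact formula isn't quoted on this page:

 // Hypothetical check: a bot rated 1600 points below its (average) opponent.
 double ratingDiff = -1600;
 double expectedShare = 1.0 / (1.0 + Math.pow(20, -ratingDiff / 800.0));
 // 20^(1600/800) = 400, so expectedShare = 1/401 of the total score,
 // i.e. roughly 1/400 of the opponent's score - which matches the 1200-3 example above.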

Add SittingDuck to the rumble a while and we will see. -- PEZ

I was going to say that, but SD gets 0 scores almost every round, and didn't you tell the server to ignore 0 scores because they usually crash everything? -- Tango

How about having a pluggable ranking system, rather than having only one? Then the battle results could be evaluated differently. -- SSO?

Well, a 1600-based and a 0-based system could be run together by just adding another column to the rankings page with the 1600 result minus 1600. Using a bot as a reference would be harder to do at the same time, but it would be quite possible, although I think there is only one person who wants that, so it wouldn't be prioritised. -- Tango

I've subtracted 1600 from the rating figures now. We can test it a little while and see if we like it. I would like to right justify the rating column, but I don't know the wiki syntax for it. -- PEZ

Well, the score when a bot is defaulting is something like 10 * (number of rounds) to 0, or 0 to 0, in all cases; other scores with a 0 in them are legitimate, and I assume the code could be modified to check for a legitimate 0 score and handle the math accordingly, so as not to produce a NaN score. -- Kuuran
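
A minimal sketch of the kind of check Kuuran suggests, with a hypothetical helper name; the real servlet code may look quite different.

 // Sketch only - hypothetical helper, not the actual servlet code.
 // Returns bot 1's share of the total score, or -1 if the battle carries no information
 // (a 0-0 total, e.g. from a crashed client, would otherwise give 0/0 = NaN).
 static double percentScore(double score1, double score2) {
     double total = score1 + score2;
     if (total <= 0) return -1;   // caller should skip this battle
     return score1 / total;       // a legitimate 0 score is still handled correctly
 }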

OK then - I give up on normalising the rating :( However, if possible I would like to have a look at the rating formula to ensure it gives the best estimate of score for a given rating difference - can anyone send me some, or all, of the results? A simple text file would be fine (ideally one record per battle, or one record per robot pair with total scores). -- Paul Evans

I can send that text file to you tonight Paul. I'll need to mangle the data files some. -- PEZ

Just wait for one or two days. The new server creates this file automatically. -- Albert

I'd like a copy of that too. :-D --David Alves

You'll get copies enough once we have got the mirroring working. =) -- PEZ

As you can see, the new rating system is FAST. It took SandboxDT 5 matches to get into the first place. -- Albert

I think I might stop posting new rankings now when Frankie is less than 2 points behind DT. =) -- PEZ

I think I liked it when the rankings were zero-based for a little while there. It's as simple for me as printing "rating - 1600". What's your vote? Less than four votes means I decide. =) -- PEZ

I vote for you to decide :-) --Albert

Considering the computing power we have now, we could (in the future) stop using this kind of mathematical engine to calculate the rankings and go for a real league, all fight all: you score 1 point when you win, 0 when you lose, etc. It would be the definitive ranking. -- Albert

Exactly where my line of thought wandered yesterday. I rewrote the ranking publishing script to fit the new file formats. When I had tested it a little I shut down Tomcat, wiped the rankings folder and started Tomcat again (with the new servlets). It took exactly 35 seconds before we had the first 20 or so battles uploaded! Plain amazing. Do we save the wins and losses count somewhere? If so I could maybe publish that ranking too. -- PEZ

Albert, I don't need the individual battle results - just the data on the Details page rolled up into one file if that helps. -- Paul Evans

It's your choice. I'll package such a file for you right away so that you can check its format and content, and then I can send you a new file in a few days when we have more data. Stay tuned. -- PEZ

I guess each pairing is represented twice in the data I have. Would you rather have both representations (bot1-vs-bot2 and bot2-vs-bot1) or should I filter out one of the records? -- PEZ

Either, it doesn't matter -- Paul Evans

I've just looked up Exponential Smoothing - it's the same as my Rolling Average: an Exponential Smoothing Alpha of 0.2 is the same as a Rolling Average 'n' of 4.0 with a weighting of 1.0 (n = (1.0/Alpha) - 1.0). Mind you, 'Exponential Smoothing' sounds much more impressive. -- Paul Evans

Now a datafile with all pairings is created at the same time as the current ranks are published. The file will be kept available at: http:/robocode/rumble/allpairings.zip It only contains one record for each pairing. (See a few lines above about bot1, bot2). -- PEZ

Thanks - I've had a look at the file, it does not look like the formula needs to be changed - it looks like I got it correct the first time :-) -- Paul Evans

There are 3 faulty records in the file; one of them reads "davidalves.net.DuelistMicro 1.11,ranking,1600", and the other two also have the second column set to 'ranking'. -- Paul Evans

I'll look into it, but I suspect it could have to do with a data file being updated while I read it. -- PEZ

The Exponential Smoothing has an issue with respect to a good or bad first result. For example, let's say the average result between two particular bots would be 50%, but the first result comes in at 70% (to 30%); there is not much to do about it, 70% should stand at this stage. The second (and subsequent) results come in at 50%: the smoothing gives 66% (Alpha = 0.2), after another 50% result 62.8%, etc. By this stage the plain average would be 56.6% - the oldest result has the most importance, which is not what you intended - in fact the first result has a five-fold weight with Alpha = 0.2. I think the overweighting of the first result can be removed by averaging the first 5 results, and after that using an exponential smooth with an Alpha of 0.2. -- Paul Evans
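
A minimal sketch of the hybrid Paul suggests, with a hypothetical helper name; alpha here is the weight given to the new result, as in Paul's usage above.

 // Sketch only - average the first few battles, then switch to exponential smoothing.
 static double updateEstimate(double oldEstimate, int battlesSoFar, double newResult, double alpha) {
     if (battlesSoFar < 5) {
         // Plain running average over the first five battles, so the first
         // result does not end up dominating the smoothed value.
         return (oldEstimate * battlesSoFar + newResult) / (battlesSoFar + 1);
     }
     // After that, the usual exponential smooth.
     return alpha * newResult + (1 - alpha) * oldEstimate;
 }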

What's wrong with averaging all the results? Why smooth it at all? -- Tango

I think that averaging all results means you have to have all results in store and that's something we want to avoid. Ask amarok why this can get in the way. =) -- PEZ

This technique is a well-known statistical tool. I agree it has drawbacks, but any method we use will have them (one of the advantages of this method is that you don't need the previous results, just the smoothed current value - moving averages, or the method you propose, need more data). If you don't feel comfortable with alpha = 0.8, we can reduce it (I defined it as a first approach, but maybe it makes for too slow a change). Note anyway that the scenario you mentioned above is not realistic at all. Let's say the real % score is 50% (i.e. the two bots are actually the same one) and the first result is 70%/30%; then the next ones will be both above and below 50% (e.g. 40%/60%), so the smoothed result will converge faster (in other words, if I had to give you an estimate of the % score in your example, I would bet on a % score higher than 50%). Statistics are tricky, and the only way to really validate a method is to compare its results to real things. Have you noticed any inconsistencies in the ratings? (Exclude mini to nano for now; they are not stable yet.) -- Albert

Yes, Marshmallow is not in top-10! =) Is it maybe possible to adjust the alpha smoothness parameter in the process? Maybe set it to something weighting old and new results the same for the first 5 or so updates and then move to 0.8? -- PEZ

Nano seems quite stable. The middle of the contest may not be entirely stable, but the middle bots never seem to stabilize entirely (as they're often inconsistent against different bots of similar ranking). Of the bots on top the only one with any momentum to speak of is FunkyChicken, and that's only -16. -- Kuuran

In order to calculate the mean of all the results you would not need to store all the previous results. You would just need the mean of all the previous results (call this pm) and the number of previous results (pn). That is the same amount of information as we keep now. Then, if you call the latest result "lr", you simply need to do (pm*pn+lr)/(pn+1). -- Tango
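
In code, the incremental mean Tango describes would look something like this (a sketch; the names are just illustrative):

 // Running mean without storing the individual results.
 // pm = previous mean, pn = number of previous results, lr = latest result.
 static double updateMean(double pm, int pn, double lr) {
     return (pm * pn + lr) / (pn + 1);
 }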

A momentum of 16 is very low. If you divide the momentum by 100, you know how much the bot will move when its next battle is uploaded, assuming the % score for that battle is equal to its history (smoothed % score). For FunkyChicken it means a +0.16.

About calculating the % score: I have made some tests, and here are my conclusions:

PEZ, are you changing the upload servlet? If so, please could you change the constants in the servlet? Just change lines 90 and 91 as follows:

wins1 = 0.8 * wins1 + 0.2 * real1; --> wins1 = 0.7 * wins1 + 0.3 * real1;
wins2 = 0.8 * wins2 + 0.2 * real2; --> wins2 = 0.7 * wins2 + 0.3 * real2;

-- Albert

OK, I have only nudged the servlet so far (just extracted the loading of the bot files to a separate, Singleton, object). But it behaves as before (with concurrent data access and all). I could install it with the new alpha values, but not right now. I'm in the middle of cooking a meal for me and my family, and I can really only do one thing at a time any good. =) No real hurry though, right? I'll install the new values later tonight. -- PEZ

No hurry at all. It's simply to avoid two people changing the servlet at the same time. -- Albert

Oh, I know 16 is low, that was my point :) Of the top five nanos that is by far the biggest momentum; NanoSatan has a -1.5, I believe it was.. :p So I'd say the nanos are fairly stable. -- Kuuran

You can do an average without storage - newAverage = (oldAverage*oldCount + thisAverage*thisCount)/(oldCount + thisCount) -- Paul Evans

Isn't that equivalent to the method I said to use to calculate means?

Bots do not currently get better with time. If we sorted out the saved-data thing, they might, but no one has done anything about that AFAIK. Without saved data, the only change is a complete new version, and I don't believe the current system uses battles from previous versions in the calculations, does it? -- Tango

No. The only thing that is inherited between versions is initial rating. -- Albert

That's what I thought. -- Tango

As I've stated numerous times, all the clients doing the bulk of the battles by now already have plentiful data directories locally; I don't see why those don't count as learning. -- Kuuran

I completely agree. It's clear that bots will eventually learn, because they will be running battles in the same clients again and again. -- Albert

And, as Kuuran and Jim have stated, this will increasingly be the case as bots start to deal with data persistence in new ways, to be able to store data on more enemies, or on the most important enemies, or to reuse data across enemies, and such. -- PEZ

OK then - there has been some confusion - the Alpha value in the code is in fact 0.8, not 0.2 as indicated at the top of the page - this means that 80% of your score is applicable to the last result. This does not produce a lot of smoothing - but it does honor the requirement of losing old data quite quickly (actually it pretty much disregards it!). An Alpha of 0.2 is not much better than an average - I would suggest an Alpha of 0.5 for the time being - it offers a bit of smoothing, and rapid response to improving bots - with a value any higher than 0.5 you may as well just use the latest battle result. -- Paul Evans

No. You were right. Alpha 0.8 means that 80% of the previous score is maintained, and the alpha at the top of the page is wrong (it is just 1 - alpha). Finally we will use alpha 0.7, which is roughly similar to a moving average of 5 matches (see previous posts). -- Albert

An Alpha of 0.2-ish would be equivalent to a moving average of 5 by my reckoning; an Alpha of 0.7 uses 70% of the last result, 21% of the previous result, 6.3% of the result before that ... IMHO the tail-off is too fast. -- Paul

I'm not sure where this belongs, but it'd be nice to get a league stability figure, just an average or sum of all the momentums in the league. -- Kuuran

The updated servlet is in operation now. With the change Albert wrote above. -- PEZ

Thanks PEZ. The confusion is growing. Forget about alphas and so on. The current formula used is the following one:

 new smoothed result = 0.7 * (old smoothed result) + 0.3 * (outcome of the last battle).

I hope it helps to clarify. -- Albert
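
For reference, a small illustration of the weights that the formula above effectively gives to past battles (a hypothetical helper, derived directly from the formula; k = 0 is the battle just uploaded):

 // new smoothed result = 0.7 * old + 0.3 * latest, applied repeatedly, gives
 // the k-th most recent battle a weight of 0.3 * 0.7^k.
 static double weightOfBattle(int k) {
     return 0.3 * Math.pow(0.7, k);
 }
 // weightOfBattle(0) = 0.300, (1) = 0.210, (2) = 0.147, (3) = 0.103, ...
 // (whatever weight is left over stays with the initial value)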

Just to clarify - the present formula does not smooth much, and I think it relies too much on the last result. -- Paul

Forget about alphas and so on. ... As I never really have had a clue about this alpha thingy I have nothing to forget. I just do as I am told to do. =) -- PEZ

I think the question is: is somebody uncomfortable with the current rankings because he thinks there are misplaced bots? -- Albert

