RoboRumble/OldRankingChat 031201


Thanks! Though there's still a larger gap between #1 and #2 than there is between #2 and #20-something, so maybe Paul is not in that big a hurry yet. -- PEZ

Wow. You guys have made a real powerhouse this time; Griffon is pretty darn good. You guys are doing something right with these wiki bots, because just about every one of them is a ProblemBot for Fractal. I have some good ideas to add to Fractal, and a whole new type of gun manager similar to VirtualGuns and BestPSpace to build and write up on; unfortunately I won't be able to work on it for like 2 weeks because I'm behind in all my classes. Fractal is still only like 1/10th complete; I just haven't found the time to make it good. Stupid university... -- Vuen

I've just had a look at Griffon's movement - according to DT's stats, at Griffon's preferred fighting distance (which is further out than DT's) DT has a hit rate of around 15.7% - by comparison, DT hits its own movement at a rate of 15% (and that at a closer fighting distance). Overall the movement is only slightly worse than DT's at longer distances, and perhaps 1% to 2% worse at mid distances. However, to achieve these hit rates DT has to use its most segmented gun, which usually only comes into play at around round 50! - the standard guess factor guns show no difference between DT's movement and Griffon's. Good movement, well done :). -- Paul Evans

Way cool. Thanks for the feedback. Your segmentations still keep some secrets. RoboGrapherBot (soon to be released) doesn't see where DT's movement is better than Griffon's. I wish you would release a grapher for DT. =) -- PEZ

Thanks very much Paul. This is very exciting news for me! I thought there was a chance that it was very solid. Now I think the difference is the gun. And for that, I think that more than 1500 bytes will be required to close the remaining 50 points or so. Getting closer. -- jim

Getting closer - yes, but don't forget a 45%-55% score represents a difference of about 50 rating points. On a separate note... a grapher for DT would, I think, take away the sense of achievement should you beat DT - I keep my segmentation secret so as not to spoil your enjoyment :) -- Paul Evans

You could always label the secret segments as "secret 1" and such. And, I can assure you, Paul, that I would enjoy beating DT either way. =) -- PEZ

Well, I have never been closer than 50 points before, so I am going to live in the moment for a few minutes. And I too would enjoy bringing you down, if only for a nanosecond. Especially if I managed to catch a screen capture. I remain convinced though that the way to do it is to focus on beating other bots first. When I look at SandboxDT's results page, there is no bot in the left hand column that SandboxDT does not score at least 50% against, with the exception of Griffon, which I know it can beat by at least 55%-45% from my own testing. That's a remarkable achievement and the secret to closing the gap. First beat them, then beat SandboxDT. At which point you will release a new one and we will fall behind by another 10,000 points =^> -- jim

How about you just make DT give hit rates for preferred distances (with a variety of guns) in the debug window at runtime? It would be good to compare... -- Tango

BlestPain strikes back! Now Griffon is #3. In Sweden we say "old is oldest". Dunno what the expression would be in English. -- PEZ

I don't think "oldest" has the same positive connotation here that it must have in Sweden. :) -- nano

Wow... I can't believe Fractal jumped so many positions with so few changes. I haven't built any of the concepts that are supposed to bring out its real power yet; all I really did was tweak its movement distancing to make it more survivable and less passive (while more predictable), and remove its bin decay. Here is the result I'm most proud of:

pe.SandboxDT_2.11	46.6	2	8-10-2003:21:39	32.5	14.1

Now NanoLaulectrik is 3rd. The nano pattern matchers are having big fun, going up and down and exchanging positions among themselves. It's interesting to see. -- Albert

I think that now would be a good time to print [the minibot rankings] and decorate your walls. =) -- PEZ


Um... After about two weeks with Fractal 0.32 in a stable position in the rumble, it just dropped almost 15 places over the course of a day and is still dropping. How come the sudden change in ranking? -- Vuen

Could someone without any data files have just started running battles? Does Fractal take a long time to learn? Does Fractal have a reference mode? -- Tango

I checked the detail records and there is nothing strange there (no low scores related to some client, nor 0 survival records, nor big failures against unexpected bots). The rating for Fractal evolved from 1693 (12/10) to 1691 (19/10) to 1676 (now). That makes a change of 17. I just made a quick check with two bots: MicroAspid (its rating changed down 13 points) and PrairieWolf (its rating changed up 11 points and then down again 11 points). It seems the oscillation is within normal variance. Note that the rating system behaves like a rubber band, and the rating for a bot with a fixed performance oscillates depending on the ratings of the other bots, in a kind of dynamic equilibrium. Oscillations should reduce if we implement the 70/30 rule, or if we raise the alpha constant (currently 0.7 - but raising it would have some undesired results to consider). -- Albert
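The rubber-band behaviour described above is characteristic of an exponentially smoothed update. A minimal sketch of that general shape, assuming alpha weights the old rating (only the alpha = 0.7 value is from the discussion; the actual servlet formula may differ):

  // Hypothetical smoothed rating update: after each uploaded battle the
  // rating is pulled toward the rating implied by that battle's result.
  // A higher alpha keeps more of the old rating, so oscillations shrink.
  static double updateRating(double oldRating, double battleImpliedRating) {
      final double alpha = 0.7; // current value, per the note above
      return alpha * oldRating + (1 - alpha) * battleImpliedRating;
  }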

Eek, edit conflict. Fractal doesn't save data; the only thing that might affect it is if its gl.txt file is on and it is trying to use RobocodeGLV014, but this would make it crash and the servlets don't accept scores of 0 so it shouldn't affect its ranking. I think the new servlets may be the problem; I hadn't actually realized that a new set of servlets had been installed until after I posted the above, because I noticed the 60%/40% wins/losses weren't being coloured on Fractal's details sheet, so I went looking in RoboRumble/ServerDevelopment to see what was up. Were there any rules changes in the latest update? If so, then that would explain it; it would then of course be just my dumb luck that any change would negatively affect Fractal's performance... After writing this I read Albert's post above. This makes sense now. Thanks :) -- Vuen

Here is something strange: I went to check Tron's result against BlestPain, but it is no longer there? According to the results pages Tron 2.02 has only fought around 70 bots in 1000 battles. The same happens with BlestPain, DT, CigaretBH, etc.; the only "correct" one I checked is Shadow 2.01. Was the ranking reset recently? I too have been noticing some strange oscillations lately... -- ABC

I think it is produced by the new filter. It excludes bots that have not fought for the last 5 days, but also excludes pairs that have not been executed for 5 days. It could also explain the oscillations (even if they will always be there with more or less strength). I'll take a closer look. In any case, it should be corrected when the new system to update participants in the rankings is in place. -- Albert

That must be it. The dance continues; the more "specialised" bots are jumping all over the place as their problem bots enter/leave the details page... -- ABC

Whose bot is ad.Neo, and when did it start doing so well? It hasn't had many battles yet, but still... that's a very good 2nd place... -- Tango

Looks good to me too - over 1000-plus rounds it looks like it loses to DT 46.5%/53.5%, so the ranking appears to be correct. The bad news is that it is claimed to be a 'test robot' in the repository! It would be nice to see what the history is for this bot/robocoder. -- Paul Evans

It also says it is derived from "many many bots". I could be wrong, but it looks to me like he did some good tweaks to some open sourced bots (probably Iiley's movement and Kawigi's gun). A very good bot anyway, would be very nice to hear some comments from the author. -- ABC

Tron and BlestPain are the only two bots that beat it. I can't help but smile :). Tron has always been one of my favorite bots... -- Vuen

Tron kicks ass! And it also happens to be one of the few top bots that fall into the trap of HypoLeach. =) -- PEZ

That's cool (beating ad.neo, not losing to Hypo ;)), I think Tron is a bit outside the "normal curve", it can give top10 bots a very good fight but loses to some unexpected low ranked ones... -- ABC

Believe it or not, if we used a league ranking system based on wins/losses (like the ones used in soccer or basketball) instead of score difference, right now Neo would rule over DT. The first only loses against Tron, but DT loses against Neo, Jekyl and Griffon. I'm maliciously tempted to create this ranking system just to force Paul to release DT 2.2 :-) In any case, it points out the inherent difficulty of rating bots. -- Albert

My prediction: the next version of SandboxDT will come with pre-loaded data for at least some bots. I know from my own personal testing that Jekyl, Griffon, and Neo will not beat DT if DT is given enough rounds to learn. The RoboRumble's distributed nature almost guarantees that SandboxDT will not get this chance in a timely manner. SandboxDT is still the king. I am starting to think it always will be. -- jim

I'd be surprised if DT came with preloaded data, unless it proves really, utterly necessary. It's one of the truly cool things about DT, Tron and some other top bots that they can rank so high without the advantage of preloaded data, fighting bots that have spent lots of training against them. DT is so much king that it's hard to fathom. -- PEZ

Back to a yet unresolved topic. Maybe the new VertiLeach is not as strong as the previous version, but it still doesn't lose to a single minibot. Yet it ranks #3, and the #1 ranked bot loses against 6 bots, including VertiLeach. How about Premier League rules instead? 3 points for a win, 0 for a loss and 1 point for a tie. We could define a tie as being 50% +/- 1% or something. The current ranking system is great for leagues where not all bots can fight all others, but with the distributed power of RoboRumble@Home we don't have to give the bots an estimated ranking, I think. I'm thinking that a win/loss should be determined by the accumulated % score share in each pairing. Now that the system both cleans out bots that are no longer participants and prioritises bots with few battles, the PL rules should work pretty well. -- PEZ

Completely agree. Why use theoretical estimates when we can have the real ranking? Also, the current infrastructure allows it. We could add a servlet that executes periodically (e.g. at 12.00PM) and just uses the existing data to build the classification. Later, we could modify the client/servlet applets to prioritize unfought matches. -- Albert

Cool! Maybe the rating of a bot should be an index like "100 * (points / points_possible)". -- PEZ

In other words, percentage of possible score. Sounds good. -- Tango
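A minimal sketch of such an index, combining the Premier League points proposed above (3 for a win, 1 for a tie within 50% +/- 1%, 0 for a loss) with the "100 * (points / points_possible)" formula; the method name and input representation are illustrative assumptions:

  // Premier League index from accumulated pairing results.
  // scoreShares[i] is this bot's accumulated %score share against opponent i.
  static double premierLeagueIndex(double[] scoreShares) {
      int points = 0;
      for (double share : scoreShares) {
          if (share > 51) points += 3;       // win
          else if (share >= 49) points += 1; // tie: 50% +/- 1%
      }                                      // loss: 0 points
      return 100.0 * points / (3 * scoreShares.length);
  }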

I'm not a big fan of 'Premier League' rules. Neo only loses to Tron, while SandboxDT currently loses to Neo, Jekyl, and Griffon; these rules would have Neo sitting on top, while in my opinion SandboxDT is the better bot. I like the regular league the way it is, but it's a good idea to create a separate premier league scoresheet that uses the same data as the regular league. -- Vuen

I think that's the preloaded-data advantage. Griffon wouldn't beat DT if it hadn't been trained on DT before it was uploaded. What if we ban preloaded data? It could be enforced by the client wiping any data directory in a bot's jar file after downloading. -- PEZ

I don't like the idea of banning preloaded data. It reduces design options (and everyone has the possibility to preload data). It is like real life: you can go to a competition without any information, thinking you are good enough to win, or you can take time to analyze your opponents so you have a better chance of winning. -- Albert

I certainly am one that has tried to explore the design paths of the preloaded data strategy. But nevertheless, DT is the best bot and it should somehow be identified as such by any league rules. We risk getting "fake" updates of the robots, where Paul maybe changes the movement slightly and preloads DT with data on the enemies that preload data on DT, and then the authors of those bots load them with data on this new DT, and it gets like a cat chasing its own tail. But I agree that banning preloaded data constrains the design options a bit too much. What about we set the battles to 100 rounds each? That would at least limit the benefits of preloaded data some. -- PEZ

I don't see why people don't just all have preloaded data. If Paul wants to prove he has a good bot without data then he can add another mode to the properties file that doesn't use preloaded data. It would undoubtedly improve DT's rating to have preloaded data. -- Tango

The thing is that with a preloaded data strategy the timing of your entry becomes important. You will need to keep training your bot and send up new versions whenever new bots or new versions of bots are entered. That's a bit pathetic I think. Paul doesn't need to prove that DT is king to me. I know it all too well. It keeps me awake at nights. -- PEZ

Preloaded data is useful for weight-restricted bots that don't wish to include learning code, and for bots that take many, many rounds to learn. Preloaded data is often static and, because it is not the basis for learning, it is very small, allowing many hundreds of bots to be stored. DT's main problem is that the learning data is so large it can only hold data for some bots - what DT needs to do is convert that large statistical data to 'preloaded'-style data just prior to deleting the stats to make more space. I have no problem with preloaded data - there are defences, such as adaptive movement. I'm also happy with 35 rounds - it makes data saving an important element of a bot. Finally, I'm happy with the rating system: it means all battles are important, not just those against bots that may beat you. -- Paul Evans
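A sketch of the kind of compaction Paul describes, assuming a guess-factor-style gun: instead of persisting whole visit-count buffers, keep only the best firing index per segment. The file layout and method are made up for illustration; only the Robocode file API calls are real:

  import java.io.IOException;
  import robocode.AdvancedRobot;
  import robocode.RobocodeFileOutputStream;

  // Hypothetical compaction of large gun stats into tiny 'preloaded' data:
  // one byte per segment, holding the index of the most-visited guess factor.
  static void saveCompact(AdvancedRobot bot, String enemy, int[][] stats) throws IOException {
      RobocodeFileOutputStream out =
          new RobocodeFileOutputStream(bot.getDataFile(enemy + ".gf"));
      for (int[] segment : stats) {
          int best = 0;
          for (int i = 1; i < segment.length; i++)
              if (segment[i] > segment[best]) best = i;
          out.write(best); // one byte per segment
      }
      out.close();
  }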

I don't like it :-( I insist: the more you think about it, the more difficult it is to rate a bot. Anyway, I think the problem here is that there are two different paradigms for rating the bots: one inherited from RoboLeague and the EternalRumble, which is fundamentally based on score difference, and another one derived from "real world" sports, where only wins/losses are considered (regardless of how big the differences were). In order to make everyone happy, I plan to have both (the second one implemented using the /PremierLeague system). It is also more interesting to have both ... believe me. -- Albert

I think both is best. /PremierLeague rules will be easier to understand etc, and the current rules are traditional, and mean everyone has a chance to affect everyone else. With the current system, you don't have to beat DT to drag down its rating. In fact, a bot that checked the enemy's name, and if it is DT fought the best way it knows, and if it isn't just acted like SittingDuck (with a little change to make sure the score isn't 0, because 0's are ignored), would seriously damage DT's rating. Could be fun... -- Tango

I don't need to change my data saving - just change the delete criteria to keep data on the best opponents. I don't need to improve learning speed, and I only need to tune my movement to 3 or 4 bots - there is no challenge here. The existing rating rules are intuitive - all battles count - the better you do against each and every opponent, the better your rating. -- Paul Evans

I promise to make it a challenge for you. =) -- PEZ


My opinion about the ER-type rating systems is that you can better tell if a robot will stand the test of time. A good, stable robot should be invincible to weak bots and be able to compete with the top bots. In other words, one with low specialization should be the goal. With this rating system, we can project how well they may do against bots they're not fighting against, which is necessary in some leagues, because it's not feasible for them to fight all bots. But more importantly, you can project how good they will probably be against bots that haven't even been written yet. Some bots are just fundamentally good, not just taking advantage of temporary 'trends' in Robocode, but basing their strategy on good, sound principles of AI, Machine Learning, and so forth. That's the difference between robots like Yngwie who just work well and stuff and robots like HaikuTrogdor who just do a good job against Linear and Head-on aim, because that's all they usually have to worry about beating. At the level that bots fight at right now, it seems that if a bot can be trivially defeated somehow, it will be. -- Kawigi

The thing that no sports league in the world accounts for is a "quality win" - a performance where the last-placed team does much better than expected. In the real world there is no way to account for this. In the Robocode world we have a method to say quantitatively that a bot performed better than expected, and that should influence the standing of the bot whether it won or not. We can say definitively that Bot A, by outperforming its expected result, has exposed some weaknesses that need to be addressed in Bot B. We can say definitively that the king outperforms every other robot, in one-on-one competition against all other robots, better than any other robot under the same circumstances. What sports league in the world can say that? Taking the EPL example: if Arsenal (currently #1) beats Chelsea (currently #2) 3-2, and Arsenal beats Leicester (currently #20) 3-2, which is the more impressive result? They both amount to 3 points for Arsenal, but how does the table account for the unexpected performance of Leicester? The answer is that it cannot. So the weakness exposed by the Leicester team does not get acknowledged in any way. A Robocode-like system would still award marks to both teams, but it would not award full marks to Arsenal, as it did not perform up to expected results.

One last thing: if you go with simple percentage of score, what's to stop me from calculating the score that I currently have, deciding that I have enough to win, and going into a while (true) { ar.getX(); } loop? That would make 10,000 calls to a get method and set my bot's energy to 0, denying my opponent all bullet damage going forward and allowing them only survival and a minimal kill bonus. If, in a 35 round match, I determine that I am up by 350 or so with 15 rounds to go, why risk it? My opponent will only get 150 points in survival bonuses, and I doubt the rest of the bonuses will put them in a position to win. Under the current ER system, this would only be rewarded if the opponent was ranked above me, and I would somehow have to know that in a dynamically evolving system. In the PremierLeague it simply becomes a valid strategy. -- jim
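For reference, a sketch of the lock-in trick jim describes; Robocode disables a robot that makes 10,000 getter calls without calling execute() (see the SYSTEM messages further down this page), which zeroes its energy. The lead-checking helper is hypothetical:

  // Hypothetical 'lock in the win' exploit: once the lead looks safe, spin on
  // a getter without calling execute(). After 10000 such calls Robocode
  // disables the robot, denying the opponent any further bullet damage.
  if (leadLooksSafe()) {   // made-up helper comparing accumulated scores
      while (true) {
          getX();          // any getXX method will do
      }
  }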

That's why both tables are good. You have outlined all the benefits of the ER system, so we definitely shouldn't get rid of that, but there are also benefits in the premier system, even if it's just interesting, so we should have both. Simple. -- Tango

Yeah, if you want to keep the old school system for reference, let it be so. I don't see why it's interesting to project a bot's future performance. The future will come and tell eventually anyway. I say "winner takes it all" =). -- PEZ

And now the bot with the wrong package has the /PremierLeague crown! That's pretty cool I think. And iiley's coming bot will be a definite throne contender, while also helping AdNeo keep its edge over DT. Now I think it is becoming more important to quickly see to it that new and updated bots get all their pairings.

And what about we keep weekly snapshots of the /PremierLeague and give that servlet page a drop box where you can choose to view these snapshots? (When viewing a weekly snapshot there is no point ...) We don't need the current code trying to give new bots their initial 500 rounds. As long as the clients try to make each bot fight all other bots an equal number of times, new and updated bots will get duly exercised. -- PEZ

Yikes. Check out this rating:

RATING DETAILS FOR sgs.DogManSPE 1.1 IN GAME roborumble

Noran.CornersReborn_1.0	6.1	1	5-11-2003:4:1	91.5	-85.4

That is like, the biggest problem bot index ever. I'm curious.

SYSTEM: You have made 10000 calls to getXX methods without calling execute()
SYSTEM: Robot disabled: Too many calls to getXX methods

So I think the reason is not RoboRumble, but the bots themselves. For Ender, it can happen that the errors only occur on certain clients where it has written lots of information.

-- Albert

Not good. Now there is a mini that can somewhat clearly beat VertiLeach in the RR@H. Tityus! What to do? -- PEZ

Well, just add to Tityus an "if (VertiLeach) don't shoot" statement :-) -- Albert

It would be like team orders in car racing. With the benefit that the bots don't have egos to match dudes like Kenny Bräck. =) In fact, in my tests Verti beats both Tityus and Fhqwhgads (the latter quite comfortably); I think Verti just needs some more battles in some pairings for this to show. -- PEZ

November 7 2003: http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=minirumble Sweeet! =) -- PEZ

Looking at the PremierLeague ranking table it's striking how similar the rankings are. Some bots are much stronger in one game than the other of course, but it's still quite similar. Same kings in the megabot and minibot games for instance. I find the PL ranking table much more interesting to read since I can so easily understand it while the ELO-based ranking is opaque to me. I think two issues have been mixed and confused in the "debate" we have had about the choice of ranking system.

  1. "Winner takes all" versus "relative strength"
  2. "Ease of understanding" versus "magic"

It has been a bit like we had to choose between 1 or 2 here, while it is actually possible to choose the best from each. The most important thing for me is "ease of understanding". I don't like at all having an opaque magic function decide the ranking when we have all pairings actually fought and the infrastructure makes a new bot get all its pairings in a jiffy. "Winner takes all" is cool for me, but I can see how "relative strength" measures something important too. And I think that is what most of you opposing the PL ranking feel strongest about.

What about we make a ranking where a bot is measured on its average share of the score in all its pairings? That's very easy to understand. And the resulting ranking table would be what the ELO-based ranking is trying to predict, if I have understood that much about it correctly. The ranking table could have all three figures in it:

  1. average %share
  2. wins/losses count
  3. the ELO-based rating estimate

The table would be sorted on "average %share" by default, but we could make it sortable by the other figures as well. Sorting it on the ELO-based estimate should produce a very similar table to the default sort, or the ELO magic is not doing what it should. If the tables are very similar then the ELO-based figure could be removed out of redundancy. If the tables are very different then the ELO figure is of little importance anyway and could be removed for that reason.

I think I can produce a script on the server that produces a current "real relative strength" ranking table. It will take me a good share of the time I would otherwise spend on making VertiLeach stronger, though. So someone besides me should think that table is of interest before I go ahead and hack it together.

-- PEZ

To me, the PremierLeague is OK as it is. I prefer "winner takes all" systems to "percentage score" systems. I think I said it before, but using this rating system would be like deciding the soccer league winner by using a formula like (goals scored / (goals scored + goals received)). The information is there anyway for anybody who wants to know it. Just divide the %wins column by the #matches one and you get it.

I agree the proposed rating system is clearer than the ELO one, but if we move to this new rating system, I think two conditions should be fulfilled: (a) we should remove the ELO rating system (it would create a lot of confusion to have 2 similar systems in place), and (b) there should be a strong consensus, since the ELO rating system is the standard %score rating system today and the new one should become the standard too.

So if everybody agrees in changing the current ELO rating system by the new one, then it is OK for me.

-- Albert

I don't agree -- Paul Evans

For some reason that doesn't surprise me... -- Tango (BTW, I don't agree either; I see no problem having both)

I would like to see the continuance of the ELO based system too. You need look no farther than cx.micro.Smoke to see the difference in the two systems. As I type this, Smoke is #6 in the PL and #19 in the traditional ELO based system. -- jim

I don't agree either. I haven't a clue how the ELO ranking system works, but I don't really care; I just know that it's designed by lots and lots of people who are much smarter than me, and that it takes into account the amount by which you thrash a bot, while the PremierLeague doesn't. I don't think your #2 is an issue at all; relative strength should be entirely how it is decided. If people are curious about how the ranking works, just make the scoring piece of the servlet open source so that they can see how their bot is ranked. -- Vuen

Or better yet, modify the output of the details page to include the solved equation in a new column for the bot pairing in question. Then you will get to see the formula in action. -- jim

I think both Vuen and Jim misunderstood the proposition. I am not suggesting we scrap the ELO based system for the current PL one (even though I wouldn't mind that either). What's proposed is that we use the ELO-based way of considering relative strengths for the rankings, but skip the obfuscation with the magic formula. The ELO-based system is a great system for estimating ratings when there's no chance all pairings can be run. But now that we are running all pairings (over and over again) it borders on the ridiculous to continue with an estimate. I also suggest we build the table including all three scorings to begin with, but that we probably will remove the ELO column once we see that it's about the same as the "real strength" one. -- PEZ

The servlets ARE open source. -- Albert

And I have read the sources. I have also tried hard to figure out the ELO-based ranking system. I don't understand it anyway. And I refuse to just lean back and trust that others are smarter than me. I know they are, but I would much rather have a ranking system that's transparent even for non-statisticians like me. Everywhere I look where these kinds of ranking systems are used (chess and tennis are two visible examples) it is a means to give all players a relative ranking without having to play all pairings - something that is impossible in those games. But we (Albert) have solved that problem, and thus there's no need to obfuscate the rankings with voodoo. Even if it's damn cool voodoo. -- PEZ

OK, to give us a more complete picture from which to make a decision, I have hacked the server classes generating the PL results a bit. Now the server produces both types of PL rankings. The "real relative strength" one looks like this for the general category:

If you study this table and compare it to the ELO-based ranking you'll see that they are about as similar as I had thought they would be. The only real difference is that one contains a very easily understood score (DT collects 72.9% of the score in all the pairings it has participated in) while the other contains an arbitrary voodoo score (DT has 1892.65). Only where ratings are really close can you see a difference in ranking (like between BlestPain and VertiLeach). I'd much rather have the ranking decided by a score I can easily understand than by one that's opaque to me.

I'd say we only need two rankings;

  1. Real (measured) relative strength
  2. Winner takes all

If you're curious about how the other games look from a "real relative strength" perspective:

I haven't checked all the tables, but the table for minis shows identical rankings to the ELO-based table. Identical meaning each and every bot gets exactly the same rank. Now tell me why we should obfuscate these rankings with statistical formulas.

-- PEZ

The ELO estimates are more than just giving rankings for pairings that haven't happened, it is comparing the estimates to the real result, to see if a bot is a ProblemBot or not. If we just use your new system, we won't have the ProblemBot ratings, which are very useful. -- Tango

I'm pretty sure you can produce ProblemBot ratings also without the voodoo. But since it's mainly a tool for helping us spot where we might have room for improvement in our bots let's keep the ProblemBot ratings as they are. No need to base the rankings on the same voodoo. -- PEZ

To have the current ProblemBot ratings, you need to have the current Rankings, because that's what they are based on. You don't have to display them, but if you have them, you may as well. -- Tango

But there's no real need to use the current Rankings for the PBI. A non-voodoo way would be to just calculate a bot's PBI as a simple difference:

  expected = 50 + myStrength - opponentStrength
  PBI = real - expected

This gives the following PBIs for a selection of VertiLeach's opponents:

Opponent   Strength  Expected  Real   PBI     ELO-PBI  Difference
DT         72.90     44.81     29.70  -15.11  -13.80   -1.31
Tron       63.45     54.26     47.70  -6.56   -7.00    0.44
LostLion   33.61     84.10     68.70  -15.40  -13.00   -2.40
Nibbler    61.64     56.07     56.90  0.83    0.00     0.83
FloodMini  67.22     50.49     57.40  6.91    6.80     0.11
Tityus     65.77     51.94     49.90  -2.04   -2.30    0.26
Griffon    67.54     50.17     59.40  9.23    9.10     0.13

Not exactly the same PBI, but still just as useful. Maybe even more useful since it's easier to understand.

-- PEZ
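In code form, the proposed non-voodoo PBI is just the difference above; strengths are each bot's average %score over all its pairings:

  // PEZ's proposed PBI from measured average %scores ('strength').
  // Example from the table: VertiLeach (strength 67.71) vs DT (strength 72.90),
  // real score 29.70: expected = 50 + 67.71 - 72.90 = 44.81, PBI = -15.11.
  static double problemBotIndex(double myStrength, double opponentStrength, double realScore) {
      double expected = 50 + myStrength - opponentStrength;
      return realScore - expected;
  }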

For me, what's more important than understanding the system is continuity. The RR@H's ELO based system is the closest thing left to the Eternal Rumble. I have spent way too much time getting this *close* to the #1 position to want to willingly change the scoring system. If I ever manage to become #1, I don't want any lingering doubt that it's tainted in some way. What you propose, PEZ, is a different view into the same data. Code it up, put a link there and see who uses it. Darwin will decide if it is better or not. That's one of the strengths of the RR@H as it is now. As resistant as I am to the idea, maybe I will like it better too. I do not know. But if you are telling me it is an either/or situation then I am for the status quo. -- jim

I don't think there's a point in keeping the ELO-based rankings. It's just confusing with two so similar tables. We can keep the figures there a while. RR@H is so far away from ER anyway, keeping the current Rankings doesn't bring it closer. -- PEZ

Wow, for Nanos the ranking is exactly the same! -- Albert

Yup, and for minis as well (maybe I have said that?). I guess it shows that the ELO-thingy works. At least when you are doing the estimate from the full population. =) -- PEZ

For me ELO-style gives much more information. Like problem bots. Like seeing ratings skewed by bots entered with pre-learned enemy info, and what it takes to learn their 'real' strength (I'd say some 1500-2000 rounds). Btw, I see no point in doing that - the system can't learn a bot's true rating quickly; you just see the bot going up and then steadily down.

PEZ, above you gave this example table with 'Strength' in it as the base for calculations. Where does that strength come from? -- Frakir

I feel like a real DonQuijote here. =) I can't see where the ELO figure says more than the strength figure. As the example above shows, the ProblemBot index can be calculated just as easily from the "real strength". In the above ProblemBot calculation, "strength" is the average %score a bot has collected in all its pairings. It's what the "real strength" ranking is based on, i.e. the "score" column in the ranking table (http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=roborumble&table=2). For VertiLeach this score was 67.71% at the time of the calculation above. -- PEZ

Well, maybe since I am an addicted chess player I don't find ELO to be voodoo :) (OT) Just noticed something peculiar: have a look at http://rumble.robowiki.dyndns.org/servlet/RatingDetails?game=roborumble&name=Noran.CornersReborn_1.0 That is one crazy bot... Kills Ender (problem index 89.3!), wins against Wolverine, loses to almost all others with some REALLY bad scores. -- Frakir

Ender and Wolverine have bugs in them so they crash against some bots, so get really low scores. -- Tango

Maybe it's not voodoo to you, but it's a bit unnecessary to massage the results like that just to end up with the same ranking, isn't it? I think I have seen a mention of CornersReborn elsewhere. In fact, on my machine it doesn't even run. I think something is wrong with it. -- PEZ

Time for a vote (BTW have you thought how you would do Melee without ELO :)) -- Paul Evans

Name        | Votes     | Notes  (vote options were: ELO Only, %Score, ELO & Premier, %Score & Premier, All three)
Paul Evans  | X x       | Uppercase X = prefer, lowercase x = can live with
Sparafucil3 | X x       | Ditto
Vuen        | X x x x x | I suggest all 3, but mainly sorted by ELO
Albert      | x x X     | I would prefer to have Premier and one of the other two
PEZ         | X         | and rank by the two non-voodoo methods
Tango       | X         | and rank by whatever the viewer chooses
ABC         | x X       | ELO main ranking, PL "just for fun" :)
Kawigi      | x X x     | ditto on ABC. Vuen's idea to let the user click on a heading to sort it differently is also a good idea, but might not lend itself to using wiki pages for the results (easy enough on the dynamic results)
Alcatraz    | X         | I like rankings. Lots of them. I think there should be tons of ways of measuring bots. Like, more than three.
SSO         | X         | good to see different performances

I see... I understand now what you mean, PEZ. I still trust the ELO rankings more, however. A statistical estimate will certainly not be inaccurate compared to a 'real relative strength' when there are full pairings; thinking otherwise is counter-intuitive to the concept of statistics. Now that I see the %Score column, though, I do not mind it, and if you choose to go this way that's fine by me; it would still be nice to keep the ELO rankings anyway. The ELO ranking will still be a better ranking until all bots have full pairings, and since people are constantly changing versions or adding new bots, it provides a better ranking while they attempt to achieve full pairings. Bots rarely keep full pairings anyway; look at the premier league. The top few spots oscillate like crazy every time a bot is swapped in or out. Anyway, my suggestion is to have just one ranking table that has ELO, %Score, and Premier League on it, and we can just click the table headings to sort it how we like. -- Vuen

Paul, no one has suggested we do Melee without ELO. =) I prefer to solve new problems as they arrive, not before. Vuen, your suggestion is excellent, especially since it's identical to my original suggestion up there. =) (Even if I think it's a bit silly to have two sorts producing identical results.) I don't think bots would oscillate much more by %score than by ELO, even when one or more pairings are not run. Where the score is tight, bots oscillate already as it is. And when bots don't have all pairings they don't tend to have a very correct ELO rating either. It would not be like the PL ranks of course, which are much more real-world. -- PEZ

I've changed the "premier and any other method" heading to "all three" because it makes more sense. -- Tango

OK, since I think your vote, with your changes to the options, is about spot on to my prefs; I changed my vote to reflect that. -- PEZ

From what I understand about the ELO ranking, the %score method is exactly the same with a "linear" expected score. We already agreed that a 1% score difference between two closely ranked bots should be more significant than a 1% difference between a top bot and a low ranked one. That is, imho, the big advantage of the ELO system over the %score (even if the results are very close). There is no magic/voodoo involved; Paul just adjusted the relation between ranking and score difference (the famous S-curve) to better reflect the real world, resulting in a system where, even with few pairings, you can better predict your final ranking based on partial results. You can do that with the %score system too, it's just that ELO should work slightly better/faster. About the PL, I like it too, Tron goes up 10 places... :) -- ABC

Agreed, it's just that the prediction seems unnecessary when we have the answer already. Besides, the discriminating feature of ELO seems to make little, or no, difference in the end. I think the PL is here to stay anyway, it's such a straight-forward ranking. Tron is such a great bot; I think the PL reflects its quality better than the %score systems (ELO included). -- PEZ

Someone added a note to mine that I never said... I will assume it was a mistake. I think all the rankings should be shown on one page, with both the %/rating/score *and* the position for each shown. It can then be sorted by whatever method the viewer wants. That way you can easily tell if a bot is doing better in one system than another. -- Tango

It was me. The table looked funny after your edit and I thought I fixed it. Didn't mean to make you say stuff. =) Everybody, look over the table and the options we now have and make sure your votes are where you want them. Here's a conversation to help you choose:

-- PEZ

I'd say that was more your inability to explain it well, rather than it being overly complex. Point your friend at the explanation on Paul's (or whoever it was's) website. -- Tango

My inability to explain it well stems from my inability to understand it well. I have those web pages printed and gave them to my friend. He's a math head so he understood it. But there's no way I can ever look at rating 1856.7 and say it tells me anything at all. My friend, after reading the papers, asked the, to me obvious, question: "But why do you use an estimate when you have the real thing?". I think it's like not using the % of votes in a political election, but instead using some statistical calculation that tries to estimate the outcome. That's a recipe for major protests from your citizens. =) -- PEZ

In an 'ELO'-type rating system you assume (quite sensibly) a normal distribution of participants, then you force it into ratings. So when you tell me 'avg rating is set to 1600' I can tell you that a bot rated 1856.7 is supposed to get 73% of the points versus a 1600 bot. The next nice thing about ELO is normalization - that means a 1300 bot should get the same score over a 1200 bot as a 2000 bot over a 1900. In other systems I can not reliably predict a match outcome. And contrary to what you posted above, usually very few matches are enough to get a good estimate of how a new bot rates in the pool. -- Frakir

The ELO ratings on their own aren't meant to tell you anything; it is how the ratings compare to the ratings of other bots that matters. I know that when DT is 50 points ahead of its nearest rival, it is doing very well, and I know that when my bot is 200 points below 2nd to last place, I really need to do something about it. The actual rating is irrelevant; that's why it doesn't make any difference what you set the average to. -- Tango

Maybe I've missed all the arguments for and against, but it seems to me that you all are trying to get a ranking system that is stable - i.e. a bot ranked 4th will beat a bot ranked 50th every time. But due to the nature of Robocode, matches are inherently unstable thanks to randomness in bots and starting positions. I think the aim of any ranking system isn't to show that one bot is better than another, but rather that one bot is better than another bot for that particular round of matches. We should embrace the randomness of Robocode, not make it stable. Look at DT standing at the top after all these months; wouldn't it be nice if it was knocked off the top, if only as the result of a lucky match? The way I see it, luck should play a part. Wouldn't it be boring to watch football if you could predict the result of a game between Manchester United and Luton Town? But you can't; Luton could beat ManU in a one-off match as a freak result. That's what makes the game fun. That's what should make any Robocode competition fun. Bring back some kind of ladder or knock-out competition instead of your stable rankings! Anyway, rant over. I'm sure a lot (or all) of you disagree with me about this, so go ahead, tell me why I'm wrong. I won't listen anyway. :D

--wolfman

I agree that one-off matches are fun, but they don't help you make a good bot. The aim of RR@H is to get stable rankings so it is easy to tell if your bot is any good. I think it would be great fun to have a league that only ran 1 round for each pairing each season, and judged the entire table on that. -- Tango

What I ask is: what does 1823.7 tell you that 67.78% doesn't? 67.78% predicts that this bot should beat a bot with strength 50% by 17.78%. It's just as reliable as the ELO figure (which I think you'll have to tell me how you arrived at). What the %Score based ranking provides is transparency. And, you _can't_ reliably predict the outcome of a particular pairing using ELO or any other system. If you could, the PBI column on the Details page would not be what it is. My observation that the ELO type ranking produces some instability while a bot collects all its pairings is just that, an observation. But that's not to say that the ELO based system we use in RR@H isn't good at predicting a bot's ranking. I am one of the first in the line of people who are amazed at how reliably it can do this. What I am saying is that we don't need the predictive qualities of ELO in RR@H. In other leagues it's needed (if you want stable rankings) but not here. It takes a bot less than a day to collect enough pairings to get a stable ranking using measured %Score strength. And that's with the few clients running that we have today (which I think do not exceed 5). Once we really push GO with RR@H we might have 100 clients running, and then it will take less than an hour.

Wolfman, RR@H is about producing a stable ranking (read the project goals somewhere on the RR@H part of the wiki). I agree fully with you that other leagues, providing enjoyable combat, are needed too. Particularly I miss the face2face competition. But I think we could use the RR@H framework for running that kind of competition too, either by making the clients switch modes or by making the server filter the uploaded battles into different bins. Feel welcome to look at this. The source is always included in the RR@H zip packages.

-- PEZ

OK, I now understand what you are missing with percentages... :) While percentage works fine with bots of roughly equal strength, it stops working when the differences are huge. Example: bot A beats bot B 1000-90 (91.74% score); A also beats C 1000-60 (94.34%). What is the _predicted_ relative strength of B to C? If you say B is better by 2.6% you will be way off! ELO predicts accurately here: B performed better than C by almost 110 rating points and should get a 60% score versus C. You are missing the whole normalization thing (the 'S'-curve). As a result, bots will always 'lose' rating playing bots far below their rating, and perform 'better' versus bots of similar or better strength. -- Frakir

P.S. Which brings the possibility of doctoring ratings by choosing higher-rated opponents and playing selected matches on your RR@Home machine... or seeding Sandbox to play the bottom of the pack, or whatever. -- Frakir
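Frakir's prediction falls out of a logistic rating curve. A sketch, assuming a curve of the form E = 1 / (1 + 10^(-d/D)); chess uses D = 400, and the point values quoted in this discussion fit RR@H's S-curve better with D around 600, so treat that constant as an assumption:

  // Rating difference implied by a %score share, and the expected share
  // implied by a rating difference, under a logistic ('ELO'-style) curve.
  static final double D = 600; // assumed scale constant, see above

  static double ratingDiff(double score) {   // score as a fraction, e.g. 0.9174
      return D * Math.log10(score / (1 - score));
  }
  static double expectedScore(double diff) {
      return 1 / (1 + Math.pow(10, -diff / D));
  }

  // Frakir's example: ratingDiff(0.9434) - ratingDiff(0.9174) is about 106
  // points, and expectedScore(106) is about 0.60 - B should take roughly 60%
  // off C, not the 2.6% a naive percentage comparison suggests.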

The point PEZ was trying to make is that ALL pairings between ALL bots are run. It's not possible to exclude bots from being played, because they are guaranteed to be played on other people's RR@H. It doesn't matter what the predicted relative score of B against C is, because the match B vs C would have been played. I still feel though that ELO is the way to go, but I can live with relative strength if you really want to change it PEZ : ). It's your server anyway, so it's up to you... -- Vuen

I hope we can have them all in one table; maybe PEZ will like how ELO normalizes things when compared to percentages. I am fairly sure the current percentage order != ELO order of the bots. -- Frakir

Agree with PEZ and Vuen. "ELO is better because it allows me to predict the expected outcome of A against B" is a weak argument, since you have the REAL outcome of A against B. It is OK for me to keep ELO, since it has a long tradition in Robocode, but right now I think it is just overkill (a complex system used when no longer needed) since ALL pairings are there (note that in "real life" ELO is used only in sports where not all pairings can be played; no sport where all pairings are played uses a system like this). -- Albert

(Edit conflict * 2!!!) I wouldn't take this time to argue my point if I thought it was up to me. =) Here I have finally got someone to give me one possible advantage of ELO versus raw %score. Thanks Frakir. I really appreciate that you keep trying to point out just what makes ELO preferable. It was driving me a bit nuts to just have "don't agree" and "we should keep ELO" thrown at me. But the fact remains we do not need to predict the relative strength of bot B to C; we just wait a few hours and the answer will arrive. I'm not sure which ranking system you feel is weakest against manipulation. I think raw %score is more robust here. You can play DT vs VertiLeach all you want; the raw %score between these bots will just get more exact. Nothing at all will happen to the rest of the table. With the ELO type of rating I have no clue what would happen. Which is very much why I feel so strongly about getting rid of it. We saw at the start of the RR@H that those kinds of manipulation attempts (focused pairings) disturbed the rankings, but I'm not sure that would be the case any longer. As Vuen points out, we now have a client which ensures that all pairings will be fought. Albert has succeeded very well in one of his major design goals of making the system robust.

Frakir, do you mean the current %score order is way off from the ELO order? And, if so, could it be that the ELO order (which we build dynamically) is just more recent than the %score one (which we build every 12th hour at the moment)? -- PEZ

I have no idea... but if by some sheer chance some bot played more games versus low rated opponents than it statistically should, then it is 'percentage-wise' overrated, but its ELO is fine. A similar argument can be used against ELO (more games vs problem bots), but it would affect both the ELO and the percentage one. One more tiny point here: a percentage rating can possibly fluctuate more (I play against the last bot in the pack, I get 99.3%, my rating goes considerably up). I think we can have a table with both, at least for some time. -- Frakir

Your rating will only increase considerably if you haven't played that pairing before. I too think we can have both figures in the table, at least for a while. -- PEZ

What we'll lose with percentages is that nice normalization. Suppose SandboxXP gets 99.9% while VertiLeach30 has 99.8%. Just a tiny bit off, almost no difference... In fact the strength difference is as huge as between 50% and 66.6%, and translates to the same 185 ELO points (which means Sandbox is supposed to get 66.6% versus VL to justify that 0.1% difference). -- Frakir
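Using the same assumed curve as in the sketch above, the two gaps Frakir names really are identical; both correspond to doubling the odds:

  // 0.999/0.001 vs 0.998/0.002 is 999:499, about 2:1 - exactly like going
  // from 50% (1:1) to 66.6% (2:1) - so both gaps map to the same rating
  // difference, D * log10(2): about 180 points with D = 600 (Frakir's 185
  // figure corresponds to D of about 615).
  double topGap = ratingDiff(0.999) - ratingDiff(0.998);
  double midGap = ratingDiff(0.666) - ratingDiff(0.500);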

I agree, normalisation is probably the biggest issue in favour of ELO. And as a consequence of that, problem bots. Your problem bot rating against the top and bottom bots will be different in ELO and %score (at least I think it will, I haven't actually checked). The ELO one will be more accurate/useful. -- Tango

My tests indicate that the difference is small, and I can't see that it goes in any particular direction. But since the PBI is mainly a nifty extra feature it needn't be all that exact. If my bot underperforms significantly against a particular opponent, both ranking systems will show it. When DT and Verti reach the level of 99.8% strength we can maybe discuss whether ELO would show the 0.1% differences better. =) -- PEZ

Why not put both side by side and see what the actual PBIs are for the current data? Your tests are likely not accurate enough to notice the problems. If you draw a curve of the ELO rankings for the RR@H it is very linear except for the top and bottom 3 or 4 bots. If your tests didn't include the very top, and very bottom, you would not have seen the problem. (NB I haven't actually drawn such a curve for some time, so the rankings may have changed, I don't know.) -- Tango

Aw, crappy; I was just about to put a comparison between %score (table=2) and ELO, and the premier league current rankings page just died. http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=roborumble There's only 9 bots in it! Was there a problem in the page generation? -- Vuen

No, there's something other going on. Look at the /ReportedProblems page... But now the rankings are rebuilt again. -- PEZ

Yes, something strange was going on earlier. I looked at ad.Neo's results a few hours ago and it had 0.6% against Noran.RandomT_0.1 (one battle then). -- Frakir

For the %score method to work correctly (I still trust ELO better), we are assuming that complete results for all pairings are fast to generate. That is true today, but we are still only scratching the surface of RR@H's potential. How about increasing the rounds per match to 100 (or even 500)? That would be very cool. Also, I miss melee competition! Imho we are wasting time discussing small details of a time-proven method of bot ranking instead of moving on to bigger and better things... (much like when you have a good bot and tweak it endlessly instead of trying new theories ;)) -- ABC

Agree about moving forward. Disagree about increasing the number of rounds. 35 rounds are enough. This way it gives some advantage to "smart" bots (the ones which learn fast) over the "wise" bots (the ones that can learn a lot about the enemy, but take a long time to do it). I never understood why people say "my bot is better because it is able to beat the other one after 1000 rounds". If a bot beats another one over 35 rounds again and again, then it is clear to me that the first one is better, no matter what would happen over 500+ rounds. -- Albert

I partially disagree here. You may write your bot optimised for an 'unknown' opponent: e.g. with a stat gun there are interesting methods for selecting the shooting offset when there is very little data, or when there is not enough data to sort out the random noise (not the same thing). Energy management can also be optimised for low gun hit rates (I'll post something about it soon) or for high hit rates (a trained gun), and the differences here can be really big. In fact my current test bot is somewhat 'optimised' for unknown opponents (leagues like RR@Home) but also knows how to take some advantage of a trained gun... Anyway, those 2 things are really different, and I value 1000+ round battles more even when I optimise for short ones (because it is easier). -- Frakir

It takes us again to the ontological question of "what is the best bot?", and I think there is no answer to it, because there are as many answers as there are robocoders. Of course we could say "the best bot is the one that is able to cope with any criteria anyone proposes", but we can not implement all possible criteria (also, if we did, we would have to weight them, and we would be stuck again). So my proposal is to leave it as it is. I think it gives a picture good enough for everybody to decide which is the best bot. -- Albert

My idea of the best way to find out which is the best bot is an "everyone fights everybody else once over 1000 rounds without saved data" league. Sure, the current setup gives a pretty good picture of the relative strength of all bots, but there is still a significant "luck factor" involved, especially between closely ranked opponents. After 1000 rounds there is a much smaller error margin, and both the short and long term learning have been used. -- ABC

I think the luck factor is almost negligible. But a 1000 rounds rumble with no saved data would still be cool. It could be held as a one-off shoot-out now and then. -- PEZ

I like that! System limit of 200kb saved data (and RR pool of 200 bots) is another design consideration to make your bot slightly weaker in short battles but stronger on average (less segments, or just saving partial data) -- Frakir

