RoboRumble/OldRankingChat 031201


Difference (from prior major revision) (no other diffs)

Changed: 1,11c1
Ah, now the rankings got a bit more sensible much quicker! But look at the harmless Piranha! The only difference between 1.4 and 1.5 is that the latter sports Tityus movement, which is clearly inferior to the good ol' Gouldingi movement. I'm curious about where Tityus would rank with that movement. Can't fit it into the mini size together with that gun though... -- PEZ

Can someone look at gg.Wolverine 2.0 (it's open source) and work out why it is getting loads of "10000 getXX methods without execute" errors against some bots (e.g. mine, Recrimpo), but not others? It seems like a serious bug in the bot. Should we remove it from the rumble, as it is just going to destabilise the rankings? -- Tango

Can't fit Gouldingi's old movement in with Tityus's gun? Maybe you just don't have the hackeresque experience to do it like us! How did I fit FhqwhgadsMicro's gun into a Micro and still have it move at all? ;-) On a side note, Sedan is in 2nd currently, where it belongs (Although I wasn't complaining before I went to bed and FloodMini was in 2nd).-- Kawigi

Cooooool... Haven't touched it in 6 months yet DuelistMini is still #12 :-D --David Alves

Kawigi; Maybe you have forgotten that my minis refuse to sacrifice coding principles for codesize? =) Now that the MinibotChallenge? is dead I might let my minis grow out of their size constraints altogether. -- PEZ

An interesting finding with the RR@H setup is that I think Marshmallow will find it almost impossible to enter the top-10. With the ER it only fought bots in its neighbourhood, which meant it could compensate for its slow learning with persistent data. But now that it fights all bots it has no room for data on all enemies... It might very well mean it keeps updating its stats on really low-ranked bots and has no room for stats on DT, Sedan, BlestPain and such bots where it would really need it. =) Expect my next bot targeting the top-10 to learn faster than good ol' M. -- PEZ
Thanks! Though there's still a larger gap between #1 and #2 than there is between #2 and #20-something, so maybe Paul is not in that big a hurry yet. -- PEZ

Changed: 13c3
PEZ, have you considered intelligently selecting which results to save based on score? (ie: close score means save as much data as available, solid win means save some data, blowout means save little to no data) -- Kuuran
Wow. You guys have made a real powerhouse this time; Griffon is pretty darn good. You guys are doing something right with these wiki bots, because just about every one of them is a ProblemBot for Fractal. I have some good ideas to add to Fractal, and a whole new type of gun manager similar to VirtualGuns and BestPSpace to build and write up on; unfortunately I won't be able to work on it for like 2 weeks because I'm behind in all my classes. Fractal is still only like 1/10th complete; I just haven't found the time to make it good. Stupid university... -- Vuen

Changed: 15c5
Yes I have. But the situation is new and I think the way to go really is to make sure the bot can perform well even without saved data. If Marshmallow skipped saving data every time it got severely beaten over 35 rounds against a new bot it would seldom save data. =) -- PEZ
I've just had a look at Griffon's movement - according to DT's stats, at Griffon's preferred fighting distance (which is further than DT's) DT has a hit rate of around 15.7% - by comparison DT will hit itself with a hit rate of 15% (and this is at a closer fighting distance). Overall the movement is only slightly worse than DT's at longer distances, and perhaps 1% to 2% at mid distances. However, to achieve these hit rates DT has to use its most segmented gun, which usually comes into play at around round 50! - the standard guess factor guns show no difference between DT's movement and Griffon's. Good movement, well done :). -- Paul Evans

Removed: 17d6
Heh, obviously learning faster is better :) But what I meant is that if you win by a lot then don't save data, on the idea that when you're winning by a large margin you either don't need data to trash that bot (and thus won't waste data on low-rankers) or you have enough already (in this case storing more probably wouldn't hurt you, but if it won't help too much not storing won't hurt you either). -- Kuuran
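A minimal sketch of what Kuuran's margin-based saving could look like. The class, thresholds and Detail levels below are illustrative assumptions, not from any actual bot:

<pre>
/** Sketch of Kuuran's idea: decide how much gun data to persist from the
 *  final score margin. All names and thresholds here are illustrative. */
public class SelectiveSaver {
    enum Detail { FULL, SUMMARY, NONE }

    static Detail detailFor(double myScore, double enemyScore) {
        double share = myScore / (myScore + enemyScore);
        if (share < 0.55) return Detail.FULL;    // close fight: keep everything
        if (share < 0.70) return Detail.SUMMARY; // solid win: keep a digest
        return Detail.NONE;                      // blowout: nothing needed
    }

    public static void main(String[] args) {
        System.out.println(detailFor(52, 48)); // FULL
        System.out.println(detailFor(80, 20)); // NONE
    }
}
</pre>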

Changed: 19c8
I think good guns are now a factor - in the Eternal Rumble you could get a good rating by being a good mover; good guns against good movers don't make much difference to your score, good guns against bad movers make a big difference. To get a good rating you need to thrash a lower-ranking bot by rapidly seeking out its weak movement and hitting it. I don't think data saving is that important - DT got well clear at the top before it had fought most opponents more than once. -- Paul Evans
Way cool. Thanks for the feedback. Your segmentations still keep some secrets. RoboGrapherBot (soon to be released) doesn't see where DT's movement is better than Griffon's. I wish you would release a grapher for DT. =) -- PEZ

Changed: 21c10
It seems NanoLauLectrik got the place it deserves :-) among the nanos, in front of FunkyChicken, Moebius, and Kakuru. It makes me happy. -- Albert
Thanks very much Paul. This is very exciting news for me! I thought there was a chance that it was very solid. Now I think the difference is the gun. And for that, I think that more than 1500 bytes will be required to close the remaining 50 points or so. Getting closer. -- jim

Changed: 23c12
Apoptygma bothers me. Its contents are mostly things I said 'wouldn't it be cool if I could fit this into a micro?' about, so I didn't expect it to win any awards in competition, but I still expected it to be around 40-50. I guess I'll have to make a competitive version that ditches the VirtualGun? array and has a stronger movement (call it Berzerk, maybe? ;). -- Kuuran
Getting closer - yes, but don't forget a 45%-55% score represents a difference of about 50 rating points. On a separate note... a grapher for DT would, I think, take away the sense of achievement should you beat DT - I keep my segmentation secret so as not to spoil your enjoyment :) -- Paul Evans

Changed: 25c14
Good guns are certainly the major factor now. Marshmallow's guns aren't too bad I think, once they have taken 500 rounds or so to gather data. =) I can't get my movement together now at all. (I have spent two months on it and I still get slaughtered in the movement challenge....) But when I do I'll start working with the Tityus guns and see if I can climb the targeting challenge ladder. Don't expect me to stay out of the top-10 for too long. =) -- PEZ
You could always label the secret segments as "secret 1" and such. And, I can assure you, Paul, that I would enjoy beating DT either way. =) -- PEZ

Changed: 27c16
"I don't think data saving is that important - DT got well clear at the top before it had fought most opponents more than once. " Well, I used the same robocode installation for both beta 1 and beta 2, so my copy of DT had hundreds of rounds of data saved from the first beta when the new rankings were started. :-P --David Alves
Well I have never been closer than 50 points before so I am going to live in the moment for a few minutes. And I too would enjoy bringing you down, if only for a nanosecond. Especially if I managed to catch a screen capture. I remain convinced though that the way to do it is to focus on beating other bots first. When I look at SandboxDT's results page, there is no bot in the left hand column that SandboxDT does not score at least 50% against, with the exception of Griffon, which I know it can beat by at least 55%-45% from my own testing. That's a remarkable achievement and the secret to closing the gap. First beat them, then beat SandboxDT. At which point you will release a new one and we will fall behind by another 10,000 points =^> -- jim

Changed: 29c18
I forgot about that - I think DT can hold data on about 70 opponents (the 70 most recently fought) - saved data for DT will at best be used in 2 in 5 of the battles once it has trained up - with more opponents even less - I wonder if it will keep its lead after that :) -- Paul Evans
How about you just make DT give hit rates for preferred distances (with a variety of guns) in the debug window at runtime? It would be good to compare... -- Tango

Changed: 31c20
DT obviously has guns that can perform very well without saved data. That hardly is an issue for debate. =) -- PEZ
BlestPain strikes back! Now Griffon is #3. In Sweden we say "old is oldest". Dunno what the expression would be in English. -- PEZ

Changed: 33c22,23
DT also takes about 500 rounds of saved data to hit. (ok, maybe a slight exaggeration, or maybe it still can't be hit at that point) -- Kawigi
I don't think "oldest" has the same positive connotation here that it must have in Sweden. :) -- nano
* Maybe not. But what about "age"? =) -- PEZ

Changed: 35c25,27
I think that Kuuran hit the right answer (or at least I agree with him =^> ). I think that the move to this format will force people to do a couple of things they have never thought about before. For one I think that it will become more important to selectively save data about bots than ever before. It was the first thing that struck me about this format. I also think that people will need to start adding algorithms to trim their data directory as well. It will do me no good to recognize that I should save data on this opponent if I have no mechanism for removing data that I could do without. I am also wondering if it would be possible to figure out from my stats buffer that the movement in one bot (Cigaret for instance) is the same as, or close to, the movement in another bot (Sedan for instance) and simply reuse the data from one for the other. -- jim
Wow... I can't believe Fractal jumped so many positions with so few changes. I haven't built any of the concepts that are supposed to bring out its real power yet; all I really did was tweak its movement distancing to make it more survivable and less passive (while more predictable), and remove its bin decay. Here is the result I'm most proud of:
pe.SandboxDT_2.11	46.6	2	8-10-2003:21:39	32.5	14.1

*grin* getting closer... -- Vuen

Changed: 37c29
I've also thought before that it would be an interesting test to try using data from an old version of a bot on a new version (in a VG sort of setup), and even extending it to try saved guns from other robots of the same package on new robots. It occurred to me, though, in doing some version-type stuff, that if the new movement is different, it could prevent my gun from becoming really great against that opponent (with non-rolling stats), but it would still do better in the beginning because of KentuckyWindage. -- Kawigi
Now NanoLaulectrik? is 3rd. The nano pattern matchers are having big fun, going up and down and exchanging positions among themselves. It's interesting to see. -- Albert

Changed: 39c31
PEZ and Albert, I like the new Rankings Page! I think this is very good stuff. Thanks! -- jim
I think that now would be a good time to print [the minibot rankings] and decorate your walls. =) -- PEZ

Changed: 41c33
I guess the page that is new is the detailed ratings page. I like it too! But I have nothing to do with it. It's Albert's work with some good suggestions from Paul. About sorting. If someone knows a good cross-browser way of sorting HTML tables we could add the sorting on the client side. In line with the whole @Home thought. =) -- PEZ



Changed: 43c35
I can write a script to do sorting on the client side if you like (I do Javascript + CSS for a living :-P) Only problem is that it won't work in older browsers, only recent versions of Mozilla, IE, Opera, and other browsers that support the w3c DOM. In particular, Netscape 4 is hopeless. --David Alves
Um... After about two weeks with Fractal 0.32 in a stable position in the rumble, it just dropped almost 15 places over the course of a day and is still dropping. How come the sudden change in ranking? -- Vuen

Changed: 45c37
Netscape 4 has been hopeless for a very, very long time. It only needs to work in modern browsers. If you can make it so that it works in my browser (Safari, the Konqueror-based MacOSX browser) it's a bonus. I've only found a few solutions out there that work here. Most work on IE5+, some work on IE5+ and Mozilla, very few on a broader range of modern browsers. The problem with the solutions I have found Googling around is that they are either huge or commercial or both. What do you say Albert? Would a client side sort be desirable? -- PEZ
Could someone without any data files have just started running battles? Does Fractal take a long time to learn? Does Fractal have a reference mode? -- Tango

Changed: 47c39
I figured out how to do a quicksort in Scheme... (Just thought I'd add that as an irrelevant comment) Since we're technically chatting about rankings here, though, I'm noticing a sort of division here. There are bots which do extremely well against really bad bots and bots which beat the good bots well, but don't beat the less competitive bots by as much as they should (I tend to think the former are primarily pattern-matchers and the latter are primarily statistical variants, or even robots with bad guns and good movement). Whatever the reason for this, it appears that some robots do better when only faced with bots of their own caliber, and others do better against just everyone. My question is which is better? Is it better to include SandboxDT vs. SpareParts in the final ranking, or to focus on how SandboxDT does against Wilson, Iiley and I (and PEZ or whoever else pokes their heads in the top 6)? I'm curious what the opinions are, because at the moment, I'm running tests for the RobocodeLittleLeague using completely random pairings, but I suspect there are some advantages for stability in battling bots against those close to them. -- Kawigi
I checked the detail records and there is nothing strange there (no low scores related to some client, no 0-survival records, no big failures against unexpected bots). The rating for Fractal evolved from 1693 (12/10) to 1691 (19/10) to 1676 (now). That makes a change of 17. I just made a quick check with two bots: MicroAspid (changed its rating down 13 points) and PrairieWolf (changed its rating up 11 points and then down again 11 points). It seems the oscillation is within the normal variance parameters. Note that the rating system behaves like a rubber band, and the rating for a bot with a fixed performance oscillates depending on the rating of the other bots, in a kind of dynamic equilibrium. Oscillations should reduce if we implement the 70/30 rule, or if we raise the alpha constant (currently 0.7 - but raising it would have some undesired results to consider). -- Albert

Changed: 49c41
Both variants have their merits. But I strongly believe that a really good bot should trash bots at the bottom of the rankings and play well against top ranked bots. That's one of the things I really like about the RR@H, and I think the reason it hasn't been done before is mostly the lack of computing power. The RoboRumble rankings tell a truer story than any league preceding it, I would say. -- PEZ
Eek, edit conflict. Fractal doesn't save data; the only thing that might affect it is if its gl.txt file is on and it is trying to use RobocodeGLV014, but this would make it crash and the servlets don't accept scores of 0 so it shouldn't affect its ranking. I think the new servlets may be the problem; I hadn't actually realized that a new set of servlets had been installed until after I posted the above, because I noticed the 60%/40% wins/losses weren't being coloured on Fractal's details sheet, so I went looking in RoboRumble/ServerDevelopment to see what was up. Were there any rules changes in the latest update? If so, then that would explain it; it would then of course be just my dumb luck that any change would negatively affect Fractal's performance... After writing this I read Albert's post above. This makes sense now. Thanks :) -- Vuen

Changed: 51c43
I keep thinking the best ranking system is one similar to a football league (i.e. all play against all, and get points by winning). In this context, I like the idea of random pairings (because at the end everyone will play against everyone) but I don't like using the scores to determine which bot is better (Can you imagine a football league where the winner is the team that scores the most goals during the league, regardless of whether it wins or loses the matches?). For me the best ranking would be one with random pairings, but that doesn't use the score but the battle win/lose ratio. -- Albert
Here is something strange: I went to check Tron's result against BlestPain, but it is no longer there? According to the results pages Tron 2.02 has only fought around 70 bots in 1000 battles. The same happens with BlestPain, DT, CigaretBH, etc. The only "correct" one I checked is Shadow 2.01. Was the ranking reset recently? I too have been noticing some strange oscillations lately... -- ABC

Changed: 53c45
About sorting: everything that we can do in the client, we should do in the client. So I agree with that script (even if I can't imagine how it works). -- Albert
I think it is produced by the new filter. It excludes bots that have not fought for the last 5 days, but also excludes pairs that have not been executed for 5 days. It could explain the oscillations too (even if they will always be there with more or less strength). I'll take a closer look. In any case, it should be corrected when the new system to update participants in the rankings is in place. -- Albert

Changed: 55c47
About scores versus wins/losses. There are different goals with RR@H and a soccer league. The former aims to show how the bots rank against each other and the latter how the teams fare in the league. With this I mean sports leagues have much more room for randomness and injuries and day-to-day circumstances and such. A bot is a bot, and unless we get much more computing power, score is the way to answer the question asked. But, is it correct to calculate the number of wins/losses from the "% score" and the number of battles fought? If so I could maybe also publish that sort of league too. -- PEZ
That must be it, the dance continues, the more "specialised" bots are jumping all over the place as their problem bots enter/leave the details page... -- ABC

Changed: 57c49
I know you don't want to add load to the server, but the simple way to provide a sort is with an optional "sort=" option on the RankingDetails? page (used by links on the headers of the details table); the servlet can sort the table data in a jiffy. -- Paul Evans (I can have a stab at writing it if you wish.)
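A rough sketch of what that "sort=" handling could look like. The row layout, class and values are illustrative assumptions, not the actual servlet code; only the sorting step is shown:

<pre>
import java.util.*;

/** Sketch of the "sort=" idea: sort the details table by the requested
 *  column before rendering. Row layout and values are illustrative. */
public class DetailsSorter {
    static void sortByColumn(List<String[]> rows, int col) {
        // numeric columns sort descending, so the highest value comes first
        rows.sort((a, b) -> Double.compare(
                Double.parseDouble(b[col]), Double.parseDouble(a[col])));
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>(List.of(
                new String[] {"SandboxDT", "2059.9"},
                new String[] {"BlestPain", "2057.3"}));
        sortByColumn(rows, 1);
        System.out.println(rows.get(0)[0]); // SandboxDT
    }
}
</pre>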
Whose bot is ad.Neo, and when did it start doing so well? It hasn't had many battles yet, but still... that's a very good 2nd place... -- Tango

Changed: 59c51
Yes, I don't think it will add too much load to the server (we are a rather small community after all). The reason I suggested client side was to honour the @Home philosophy and so that Albert wouldn't have to do it on the servlet side (since we obviously have more pressing matters to fix there). But awaiting David's client side solution, please feel free to add that sort to the servlet. I have a CVS server here if you and Albert (and whoever else starts hacking at the same files) would want that help to sync your changes. -- PEZ
Looks good to me too - over 1000-plus rounds it looks like it loses to DT 46.5%/53.5%, so the ranking appears to be correct - the bad news is that it is claimed to be a 'test robot' in the repository! It would be nice to see what the history is for this bot/robocoder. -- Paul Evans

Changed: 61c53
I can't look at it today, I have a golf match; if it looks like no one else is looking at it come Sunday/Monday I will have a go - but anyone else is free to do the job, as I have no servlet experience and the only packages I have ever written start with pe. -- Paul Evans
It also says it is derived from "many many bots". I could be wrong, but it looks to me like he did some good tweaks to some open sourced bots (probably Iiley's movement and Kawigi's gun). A very good bot anyway, would be very nice to hear some comments from the author. -- ABC

Changed: 63c55
You don't need any servlet experience to sort some lists. I've seen in the source code you have published that you know how to use the API. =) -- PEZ
Tron and BlestPain are the only two bots that beat it. I can't help but smile :). Tron has always been one of my favorite bots... -- Vuen

Changed: 65c57
How are the pairings chosen on the client side? Could it actually be a pseudo-random pairing as opposed to a truly random pairing? I ask because [Jekyl] has only faced 130 of 185 possible participants through 255 battles. There are 50+ bots that it has never fought before. I am sure that others are in the same situation. How much could this affect a bot's overall rating, if at all (especially given that a bot may compare well to some of the top 25 which it has yet to face and poorly vs some of the bottom bots that it may have faced multiple times)? Is there any way to tell a client to look for matches that have not been fought? Is this outside the design goals of RoboRumble? -- jim
Tron kicks ass! And it also happens to be one of the few top bots that fall into the trap of HypoLeach. =) -- PEZ

Changed: 67c59
Why would the pairings be pseudo-random? Doesn't it sound quite likely that after 255 battles you have not been paired against some 50+ bots out of 180? And it shouldn't affect your ranking, I think. Others have fought those 50 bots and you have fought those others. -- PEZ
That's cool (beating ad.neo, not losing to Hypo ;)), I think Tron is a bit outside the "normal curve", it can give top10 bots a very good fight but loses to some unexpected low ranked ones... -- ABC

Changed: 69c61
Heh, on the subject of facing the entire spectrum, overall performance is what is measured here. Look at Apoptygma, it scored near even against (or even beat) many bots 50+ or even more positions above it, but lost by blowouts to bots ranked around 1100. Comparing well to some better bots might find it ranked around 50 instead of around 100, but in fact it's far too unreliable to be ranked higher than it is once you consider the whole picture (as I've come to realize). In that sense this is a great system. On the subject of sorting, I was going to say something along the lines of what Paul said, there are plenty of blazing fast sorts for numbers. I have no servlet experience either, but I could write the sort class itself. -- Kuuran
Believe it or not, if we used a league ranking system based on wins/losses (like the ones used in soccer or basketball) instead of score difference, right now Neo would rule over DT. The former only loses against Tron, but DT loses against Neo, Jekyl and Griffon. I'm maliciously tempted to create this ranking system just to force Paul to release DT 2.2 :-) In any case, it points out the inherent difficulty of rating bots. -- Albert

Changed: 71c63
Thanks Kuuran, that's mostly what I was interested in knowing. -- jim
My prediction: The next version of SandboxDT will come with pre-loaded data for at least some bots. I know from my own personal testing that Jekyl, Griffon, and Neo will not beat DT if DT is given enough rounds to learn. The RoboRumble's distributed nature almost guarantees that SandboxDT will not get this chance in a timely manner. SandboxDT is still the king. I am starting to think it always will be. -- jim

Changed: 73,80c65
What would you think about outputting the data as a javascript array? The trouble I'm having isn't building a sorted table, it's trying to get the data from the original table into an array. Something like:

var results = new Array();
results = [
['header1', 'header2', 'header3'],
['data1', 'data2', 'data3' ],
etc...

I'd be surprised if DT came with preloaded data, unless it proves really, utterly necessary. It's one of the truly cool things about DT, Tron and some other top bots. That they can rank so high without the advantage of preloaded data, fighting bots that have spent lots of training against them. DT is so much king that it's hard to fathom. -- PEZ

Changed: 82c67
--David Alves
Back to a yet unresolved topic. Maybe the new VertiLeach is not as strong as the previous version, but it still doesn't lose to a single minibot. Yet it ranks #3, and the #1 ranked bot loses against 6 bots, including VertiLeach. How about Premier League rules instead? 3 points for a win, 0 for a loss and 1 point for a tie. We could define a tie as being 50% +/- 1% or something. The current ranking system is great for leagues where not all bots can fight all others, but with the distributed power of RoboRumble@Home we don't have to give the bots an estimated ranking, I think. I'm thinking that a win/loss should be determined by the accumulated % score share in each pairing. Now that the system both cleans out bots that are no longer participants and prioritises bots with few battles, the PL rules should work pretty well. -- PEZ
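A sketch of the proposed scoring, together with the "percentage of possible points" index PEZ suggests further down. The class name and thresholds simply encode the rules as stated above:

<pre>
/** PEZ's proposed Premier League scoring: 3 points for a win, 1 for a tie
 *  (a 50% +/- 1% share of the accumulated pairing score), 0 for a loss,
 *  plus the "100 * (points / points_possible)" index. Illustrative only. */
public class PremierPoints {
    static int points(double shareOfPairing) {
        if (shareOfPairing > 0.51) return 3; // win
        if (shareOfPairing < 0.49) return 0; // loss
        return 1;                            // tie
    }

    static double index(int totalPoints, int pairings) {
        return 100.0 * totalPoints / (3 * pairings); // 100 * points / possible
    }

    public static void main(String[] args) {
        System.out.println(points(0.505)); // 1 - inside the tie band
        System.out.println(index(9, 4));   // 75.0
    }
}
</pre>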

Changed: 84c69
Not a problem, if you tell me which syntax I should use. -- Albert
Completely agree. Why use theoretical estimates when we can have the real ranking? Also, the current infrastructure allows it. We should put up a servlet that executes periodically (i.e. at 12.00PM) and just uses the existing data to build the classification. Later, we could modify the client/servlets to prioritize unfought matches. -- Albert

Changed: 86,93c71
Use this syntax:
<pre>
var results = new Array();
results = [
['header1', 'header2', 'header3'],
['data1', 'data2', 'data3' ],
['data1', 'data2', 'data3' ],
['data1', 'data2', 'data3' ]];
Cool! Maybe the rating of a bot should be an index like "100 * (points / points_possible)". -- PEZ

Changed: 96,104c74
Give me a div with an id of "resultsTable" like so:

<div id="resultsTable">
    ...
</div>
</pre>
--David Alves
In other words, percentage of possible score. Sounds good. -- Tango

Added: 105a76
* Thanks! That's exactly what I intended, though I couldn't express it in words at the moment. -- PEZ

Added: 106a78
I'm not a big fan of 'Premier League' rules. Neo only loses to Tron, while SandboxDT currently loses to Neo, Jekyl, and Griffon; these rules would have Neo sitting on top, while in my opinion SandboxDT is the better bot. I like the regular league the way it is, but it's a good idea to create a side premier-league scoresheet that uses the same data as the regular league. -- Vuen

Changed: 108c80
Heh, someone's been running only nanos, which is good for their stability; however, until micros and minis get as many matches (and can move further from 1600) those divisions are hilarious. Not that I mind having a top-3 bot in every mini weight, but somehow I don't think NanoLauLectrik, Smog, NanoSatan and FunkyChicken occupy the top 4 in micros and minis quite as legitimately as they do in nanos ;) -- Kuuran
* Yikes - I just realized that I resaid almost exactly what Albert said above. Wow. Remind me to read a page before speaking next time... -- Vuen

Changed: 110c82
How do I run only nanos/micros/minis? --David Alves
I think that's the preloaded-data advantage. Griffon wouldn't beat DT if it wasn't for the fact that it was trained on DT before it was uploaded. What if we ban preloaded data? It could be enforced by the client wiping any data directory in the bot's jar file after downloading. -- PEZ
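A sketch of how a client could enforce such a ban, assuming Robocode's usual convention that saved data lives under a ".data/" directory inside the jar. The class itself is illustrative, not part of the actual client:

<pre>
import java.io.*;
import java.util.zip.*;

/** Sketch of PEZ's enforcement idea: copy a downloaded bot jar, dropping
 *  any entries under a data directory, so preloaded data never ships. */
public class DataStripper {
    public static void strip(File in, File out) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream(in));
             ZipOutputStream zout = new ZipOutputStream(new FileOutputStream(out))) {
            ZipEntry e;
            byte[] buf = new byte[8192];
            while ((e = zin.getNextEntry()) != null) {
                if (e.getName().contains(".data/")) continue; // skip saved data
                zout.putNextEntry(new ZipEntry(e.getName()));
                int n;
                while ((n = zin.read(buf)) > 0) zout.write(buf, 0, n);
                zout.closeEntry();
            }
        }
    }
}
</pre>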

Changed: 112c84
Nevermind, I did it the brute-force way by changing particip1v1.txt to only contain bots with codesize < 1500. --David Alves
I don't like the idea of banning preloaded data. It reduces design options (and everyone has the possibility to preload data). It is like real life: you can go to a competition without any information, thinking you are good enough to win, or you can take some time to analyze your opponents so you have a better chance to win. -- Albert

Changed: 114c86
Why can't you just wait a few days? Doesn't the nano-only run show that it destabilizes the rankings for the other games? What do you think mini-only does? Yes, it can't do very much since the roborumble game has quite a few battles in it. But still, this is completely unnecessary. It only complicates the system if we have to introduce filters against all sorts of scenarios. Run the client as-is is my suggestion. In due time we will have the rankings. If you're more curious than that, install Tomcat and the servlets locally. As it stands I'm considering whether we should wipe the current rankings files to clean out these experiments. Someone please tell me that's not necessary? -- PEZ
I certainly am one that has tried to explore the design paths of the preloaded data strategy. But nevertheless, DT is the best bot and it should somehow be identified as such by any league rules. We risk getting "fake" updates of the robots, where Paul maybe changes the movement slightly and preloads DT with data on the enemies that preload data on DT, and then the authors of these bots load their bots with data on this new DT, and it gets like a cat chasing its own tail. But, I agree that banning preloaded data constrains the design options a bit too much. What about setting the battles to be 100 rounds each? That would at least limit the benefits of preloaded data some. -- PEZ

Changed: 116c88
It was me. I was trying the new client functionality that allows you to select the category to run. Today I'll run minis :-) And there is a reason there... the number of nano matches represents a percentage equal to (NANOS/TOTAL)^2, where NANOS and TOTAL are the number of nanobots and the total number of bots. If they don't get some help, this league will starve. -- Albert
I don't see why people don't just all have preloaded data. If Paul wants to prove he has a good bot without data then he can add another mode to the properties file that doesn't use preloaded data. It would undoubtedly improve DT's rating to have preloaded data. -- Tango

Changed: 118c90
Umm, I don't quite follow. But I'm sure you know what you're doing. =) Maybe you should post some rules on how the client should be used against this server. (If people start other servers they can pick and choose rules of their own, I mean.) Distributed computing is powerful, but if more than one person starts manipulating the results (here "manipulate" is not necessarily negative) we will start getting problems. One way to somewhat enforce rules is to place the settings on the server, as we have discussed before. (That means all settings except server URL and name, I think.) Maybe with that scheme of "if no server available when I start, use previous settings and sort the settings out when I upload" so as not to impair the robustness of the system we have today. -- PEZ
The thing is that with a preloaded data strategy the timing of your entry becomes important. You will need to keep training your bot and send up new versions whenever new bots or new versions of bots are entered. That's a bit pathetic I think. Paul doesn't need to prove that DT is king to me. I know it all too well. It keeps me awake at nights. -- PEZ

Changed: 120c92
I think PEZ has an important point in the paragraph 3 spots above this one (Just got my first edit conflict :-P). I wanted to get more accurate minibot ratings soon, and it sounds like Albert wanted to get more accurate nanobot rankings, but there is a potential for mini/micro/nano-only clients to destabilize the overall rankings. Imagine the following scenario: a nanobot, SuperNano?, easily beats all other nanobots, but, like most nanobots, doesn't do very well against large bots. If someone is running a nano-only client, then SuperNano? will have a higher % of matches against other nanos than it should have for calculating its rating. I'm talking about its overall rating here, not its nanobot rating. Since matches against nanos are wins, it will have a higher rating than it should. A good workaround might be to have mini/micro/nano-only clients inform the server that they are nanobot-only, then have the server use those matches only in the calculation of ratings in the nano league, and not incorporate them into the general league. Or just don't run mini/micro/nano-only clients. :-p --David Alves
Preloaded data is useful for weight-restricted bots that don't wish to include learning code and for bots that take many, many rounds to learn. Preloaded data is often static and, because it is not the basis for learning, it is very small, allowing data on many hundreds of bots to be stored. DT's main problem is that the learning data is so large it can only hold data for some bots - what DT needs to do is convert that large statistical data to 'preloaded' style data just prior to deleting the stats to make more space. I have no problem with preloaded data - there are defences, such as adaptive movement. I'm also happy with 35 rounds - it makes data saving an important element of a bot. Finally, I'm happy with the rating system. It means all battles are important - not just those against bots that may beat you. -- Paul Evans
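A sketch of the kind of compaction Paul describes. The buffer layout is an assumption for illustration, not DT's real format:

<pre>
/** Sketch of Paul's compaction idea: before deleting a full stat buffer to
 *  free quota, distil it into tiny "preloaded-style" data by keeping only
 *  the most visited guess-factor bin per segment. Structures illustrative. */
public class StatCompactor {
    static byte[] compact(double[][] visits) { // [segment][guess-factor bin]
        byte[] best = new byte[visits.length];
        for (int seg = 0; seg < visits.length; seg++) {
            int top = 0;
            for (int bin = 1; bin < visits[seg].length; bin++)
                if (visits[seg][bin] > visits[seg][top]) top = bin;
            best[seg] = (byte) top; // one byte per segment instead of N doubles
        }
        return best;
    }
}
</pre>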

Changed: 122c94
Nanos represent approx. 30/200 bots in the competition. It means that only 30^2 of 200^2 battles are battles between two nanos that can count for that competition (approx. 1 battle in each 40). Of course there are fewer nanos, so you don't need the same number of battles as in the general competition. But in any case, the nano competition moves 6 times slower than the normal one (i.e. it took 2 days for the general competition to settle; it would take 12 for nanos to do it). So I thought we need a system to speed up these leagues, and that's what I'm testing. About rules to run the clients: with the current system, and once the ratings are stabilized (right now micros and minis are not), there shouldn't be any problem in running more battles here or there (well, I can think of some obscure tricks to benefit a bot, but I won't post them here for now :-)). About the server sending information, I have been thinking about that also, and maybe it is a good solution, as long as it is compatible with the current system. -- Albert
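For reference, the exact pairing-count version of Albert's estimate - pure arithmetic on the counts he quotes, nothing RoboRumble-specific:

<pre>
/** Albert's back-of-envelope estimate of how rarely two nanos meet under
 *  random pairings, using his counts (30 nanos out of 200 bots). */
public class NanoShare {
    public static void main(String[] args) {
        double nanos = 30, total = 200;
        double share = (nanos * (nanos - 1)) / (total * (total - 1));
        System.out.printf("%.4f (about 1 in %.0f battles)%n", share, 1 / share);
        // prints 0.0219 (about 1 in 46 battles) - roughly Albert's "1 in 40"
    }
}
</pre>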

Changed: 124c96
We wrote at the same time :-) I like the idea. I'll change the client so that if you decide to focus on a category, results are only uploaded to that (and lower-size) competitions. -- Albert
I think both is best. /PremierLeague rules will be easier to understand etc, and the current rules are traditional, and mean everyone has a chance to affect everyone else. With the current system, you don't have to beat DT to drag down its rating. In fact, a bot that checked the enemy's name, and if it is DT fights the best way it knows, and if it isn't DT just acts like SittingDuck (with a little change to make sure the score isn't 0, because 0s are ignored), would seriously damage DT's rating. Could be fun... -- Tango

I don't need to change my data saving - just change the delete criteria to keep data on the best opponents. I don't need to improve learning speed, and I only need to tune my movement to 3 or 4 bots - there is no challenge here. The existing rating rules are intuitive - all battles count - the better you do against each and every opponent, the better your rating. -- Paul Evans

Changed: 126c98
I think Paul's idea with setting percentages in the client for the various games should be considered. While we have a known set of games these percentages could be hardcoded into the client. I don't follow the math up there but it seems like you could keep the same pace in all games by setting these percentages right. (Provided you also implement David's scheme of course). -- PEZ
I promise to make it a challenge for you. =) -- PEZ

Changed: 128c100

Here's the math for a competition with 10 nanos, 20 micros, 30 minis, and 40 megabots.



This proposed PremierLeague will be run in [...]

Removed: 130,142d101
A client in megabot mode sends in results as follows:

(10 * 9) / (100 * 99) = 0.9091 % nanobot matches
(30 * 29) / (100 * 99) - 0.9091 % = 7.8788 % microbot matches
(60 * 59) / (100 * 99) - 7.8788 % - 0.9091 % = 26.970 % minibot matches
1 - (60 * 59) / (100 * 99) = 64.242 % megabot matches

A client in minibot mode will send in:

(10 * 9) / (60 * 59) = 2.5424 % nanobot matches
(30 * 29) / (60 * 59) - 2.5424 % = 22.034 % microbot matches
1 - (30 * 29) / (60 * 59) = 75.424 % minibot matches

A client in microbot mode will send in:

(10 * 9) / (30 * 29) = 10.3448 % nanobot matches
1 - (10 * 9) / (30 * 29) = 89.6552 % microbot matches

A client in nanobot mode will send in 100% nanobot matches

Removed: 144,149d102
From there you can calculate what % of the time the client should be in each mode to balance out the 4 types as follows:

0.25 / 0.64242 = 38.917 % of matches should be in megabot mode
(0.25 - 0.10496) / 0.75424 = 19.230 % of matches should be in minibot mode
(0.25 - 0.07303) / 0.89655 = 19.739 % of matches should be in microbot mode
the remaining 22.115 % of matches should be in nanobot mode

I may be off by a little due to rounding but I think that the math is correct. Now all you need to do is apply the same method to the real numbers for mega/mini/micro/nano. :-P
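David's procedure can be generalized. A sketch that reproduces his numbers from the cumulative category counts; the class is illustrative, not part of the client:

<pre>
/** Generalizes David's balancing math: given cumulative bot counts per
 *  category, find what fraction of client runs each mode needs so that
 *  every game gains battles at the same rate. Illustrative sketch only. */
public class ModeBalance {
    static double pairs(double n) { return n * (n - 1); }

    // fraction of battles in mode m that count only for game g (g <= m)
    static double frac(double[] cum, int m, int g) {
        double smaller = (g == 0) ? 0 : pairs(cum[g - 1]);
        return (pairs(cum[g]) - smaller) / pairs(cum[m]);
    }

    public static void main(String[] args) {
        double[] cum = {10, 30, 60, 100}; // nano, micro, mini, mega (cumulative)
        int k = cum.length;
        double[] x = new double[k]; // fraction of runs per mode, nano..mega
        for (int m = k - 1; m >= 0; m--) {
            double need = 1.0 / k; // each game should get an equal share
            for (int above = m + 1; above < k; above++)
                need -= x[above] * frac(cum, above, m);
            x[m] = need / frac(cum, m, m);
        }
        // prints roughly 0.2212, 0.1974, 0.1923, 0.3892 - David's numbers
        for (double v : x) System.out.printf("%.4f%n", v);
    }
}
</pre>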

Removed: 151d103
--David Alves

Removed: 153d104
It looks like NanoSatan2 is doing just a /bit/ better!!!! :) -- Kuuran

Removed: 156d106
Why are results only posted weekly now? I mean, sure, every half hour was excessive, but what about once per day? --David Alves

Changed: 158c108
There is no need to wait to see the results. You can see them in real time. The weekly ones are there to have some "stable" rankings. Go to RoboRumble/CurrentRankings to find the link to the real time rankings. -- Albert

Changed: 160c110
The best part about the dynamic ones is that they show who the real best nanobot is. -- Kawigi
I insist: the more you think about it, the more difficult it is to rate a bot. Anyway, I think that the problem here is that there are two different paradigms for rating the bots: one inherited from RoboLeague and the EternalRumble, which is fundamentally based on the score difference, and another one derived from "real world" sports, where only wins/losses are considered (regardless of how big the differences were). In order to make everyone happy, I plan to have both (the second one implemented using the /PremierLeague system). It is also more interesting to have both ... believe me. -- Albert

Changed: 162c112
Scary. I just got an impulse to write a nano. But I managed to fight the impulse back. =) -- PEZ

Changed: 164c114
How come Marshmallow hasn't had any battles in over 30 years? 1970 looks like a default start date...
My opinion about the ER-type rating systems is that you can better tell if a robot will stand the test of time. A good, stable robot should be invincible to weak bots and be able to compete with the top bots. In other words, one with low specialization should be the goal. With this rating system, we can project how well they may do against bots they're not fighting against, which is necessary in some leagues, because it's not feasible for them to fight all bots. But more importantly, you can project how good they will probably be against bots that haven't even been written yet. Some bots are just fundamentally good, not just taking advantage of temporary 'trends' in Robocode, but basing their strategy on good, sound principles of AI, Machine Learning, and so forth. That's the difference between robots like Yngwie who just work well and stuff and robots like HaikuTrogdor who just do a good job against Linear and Head-on aim, because that's all they usually have to worry about beating. At the level that bots fight at right now, it seems that if a bot can be trivially defeated somehow, it will be. -- Kawigi

Changed: 166c116
21 pez.Marshmallow 1.9 1748.87 1154 1-1-1970:1:0
The thing that no sports league in the world accounts for is a "quality win" - a performance where the last-placed team does much better than expected. In the real world there is no way to account for this. In the Robocode world we have a method to quantitatively say that a bot performed better than expected, and that should influence the standing of the bot whether it won or not. We can say definitively that Bot A, by outperforming its expected result, has exposed some weaknesses that need to be addressed in Bot B. We can say definitively that the king outperforms every other robot, in one-on-one competition against all other robots, better than any other robot under the same circumstances. What sports league in the world can say that? Taking the EPL example, if Arsenal (currently #1) beats Chelsea (currently #2) 3-2 and Arsenal beats Leicester (currently #20) 3-2, which is the more impressive result? They both amount to 3 points for Arsenal, but how does the table account for the unexpected performance of Leicester? The answer is that it cannot. So the weakness exposed by the Leicester team does not get acknowledged in any way. A Robocode-like system would still award marks to both teams, but it would not award full marks to Arsenal as it did not perform up to expected results.

Changed: 168c118
The details page says the latest battle was a few minutes ago. (I just uploaded a few hundred, so chances are Marshmallow was in at least one of them.) That seems correct, so why isn't the info coming from the same place? -- Tango
One last thing: if you go with a simple percentage of score, what's to stop me from calculating the score that I currently have, deciding that I have enough to win now, and going into a while (true) { ar.getX(); } loop? That would make 10000 calls to a get method and set my bot's energy to 0, denying my opponent all bullet damage going forward and allowing them to only get survival and a minimal kill bonus. If, in a 35-round match, I determine that I am up by 350 or so with 15 rounds to go, why risk it? My opponent will only get 150 points in survival bonuses and I doubt the rest of the bonuses will put them in a position to win. Under the current ER system, this would only be rewarded if the opponent was ranked above me, and I would somehow have to know that in a dynamically evolving system. In the PremierLeague it simply becomes a valid strategy. -- jim
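For concreteness, the loophole jim describes would look something like this - a deliberately bad-mannered illustration, not a recommendation; farEnoughAhead() is a placeholder, and the score tracking is left out:

<pre>
import robocode.AdvancedRobot;

/** Illustration of jim's hypothetical "lock in the win" exploit. */
public class SoreWinner extends AdvancedRobot {
    public void run() {
        while (true) {
            if (getRoundNum() >= 20 && farEnoughAhead()) {
                while (true) { getX(); } // 10000 getXX calls -> disabled
            }
            turnRadarRight(360); // otherwise fight on as normal (omitted)
        }
    }
    private boolean farEnoughAhead() { return false; } // placeholder
}
</pre>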

Changed: 170c120
The Rankings info comes from a summary file - a file that is always updated when new results are uploaded. This makes it extra vulnerable to the concurrent updates we still have. While that is so, Rankings info will always be a bit strange. The details pages are not nearly as likely to be updated concurrently though, so that info will mostly be correct. However, the concurrent update problem will be fixed eventually. It's just that I haven't found a time when I feel alert enough to do it. If someone else feels like doing it, the DataManager class is where to do it, so it won't conflict with updates to the functionality of the servlets. There's some outline code in there that is commented out which you might or might not want to follow. Whoever accepts this task (including myself), please state so on the todo page. -- PEZ
That's why both tables are good. You have outlined all the benefits of the ER system, so we definitely shouldn't get rid of that, but there are also benefits in the premier system, even if it's just interesting, so we should have both. Simple. -- Tango

Changed: 172c122
The ratings file is damaged and only 70 bots appear in it!!! Could it be another concurrency issue? -- Albert
Yeah, if you want to keep the old school system for reference, let it be so. I don't see why it's interesting to project a bot's future performance. The future will come and tell eventually anyway. I say "winner takes it all" =). -- PEZ

Changed: 174c124
It probably is. I'll give it a quick check and see if I can fix the temporary problem. And on Saturday I'll try to find enough time to fix the concurrent update problem. If I can't fix the 70-bots problem quickly, it'll have to wait until at least late tonight (CET). I'm going to a late night football game (AIK vs Valencia, UEFA Cup). -- PEZ
And now the bot with the wrong package has the /PremierLeague crown! That's pretty cool I think. And iiley's coming bot will be a definitive throne contender while also helping AdNeo keep its edge over DT. Now I think it is becoming more important to quickly see to it that new and updated bots get all their pairings.

Changed: 176c126
The file appears to be truncated. But it grows quite quickly again. Now it's 102 bots in the file. And it seems that the bots that were temporarily away come back with their rating intact. I'll leave it be for now. The upload clients will rebuild the rankings by themselves I think. And on Sat I'll see if I can write that update queue handler. -- PEZ
And what about keeping the /PremierLeague snapshots once a week and giving that servlet page a drop box where you can choose to view these snapshots? (When viewing a weekly snapshot there is no point [...]) We don't need the current code trying to give new bots their initial 500 rounds. As long as the clients try to make each bot fight all other bots an equal number of times, new and updated bots will get duly exercised. -- PEZ

Changed: 178c128
Just one more observation. It seems this was a good way to get rid of those old versions of the bots that lingered around. =) -- PEZ
Yikes. Check out this rating:

Changed: 180c130
In fact, there should be only small fluctuations, as the basic data is in the details files, and bots are added to the rating files when they fight a battle. No need to fix it (it will correct itself) - it would just be good to avoid it happening at all. -- Albert
RATING DETAILS FOR sgs.DogManSPE? 1.1 IN GAME roborumble

Changed: 182c132
There must be something wrong with the way the ranking is calculated - how do you explain vuen.cake's situation? It started by beating that Trinity guy with a score that would, if I understood it right, place it around 20th-30th, yet it went straight to 2nd place. Since then it has got 3 more below-average results, but hasn't moved down a single spot... even with a small number of battles the ranking system should already know where to place it... -- ABC
|Noran.CornersReborn_1.0|6.1|1|5-11-2003:4:1|91.5|-85.4|

Changed: 184c134,140
I had a small panic when I looked at the active rating table - but it seems I do not have to worry too much. It looks like Cake started with a ranking far too high - possibly DT's ranking. Was there a previous version? If not, it should have started at a rating of 1600. -- Paul Evans
That is like, the biggest problem bot index ever. I'm curious :)
* Wolverine crashes with the following message:

SYSTEM: You have made 10000 calls to getXX methods without calling execute()
SYSTEM: Robot disabled: Too many calls to getXX methods
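For reference, the usual cause of that message is a busy-wait loop that never takes a turn. A generic sketch of the fix (not Wolverine's actual code):

<pre>
import robocode.AdvancedRobot;

/** The "10000 getXX calls" error comes from polling without acting;
 *  every busy-wait loop must advance time. Illustrative sketch only. */
public class PatientBot extends AdvancedRobot {
    public void run() {
        setTurnGunRight(90);
        while (getGunTurnRemaining() != 0) {
            execute(); // take a turn; without this, the bot gets disabled
        }
    }
}
</pre>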


Changed: 186c142
Yep, that's probably what happened. But still, it should have gone down much faster, imho. Everything seems to happen a bit too slowly with this ranking; wouldn't that "multi-pass" suggestion of yours make it a bit more responsive? Or maybe make a bot fighting its current neighbouring adversaries a little more probable? I like the fact that everybody fights everybody else, unlike the ER, but the cost seems to be a very slowly stabilising table... -- ABC
So I think the reason is not RoboRumble, but the bots themselves. For Ender, it can happen that the errors only occur on certain clients where it has written lots of information.

Changed: 188c144
Maybe it will be able to stabilize faster when more people start running the software; I've basically been running it for 10 hours a day since I got it, because I can leave my computer on all day while I'm away or in class. Still, there has to be a bug in the ratings somewhere to explain Cake's situation; it's nowhere NEAR that good, and shouldn't even be in the top 100 (probably not even the top 150). I thought bots were supposed to start really low on the list and move up as they fight, rather than the other way around. If the average index is rolled, is the rolling initialized with the first value given? The rolling average should start initialized at zero, and have to roll its way up; initializing it with a non-zero value will make the first battle fought take like 50 other battles to wear off. I'm just guessing anyway, maybe the bug is something completely unrelated. meh. -- Vuen
-- Albert

Changed: 190c146
The problem is not the speed at which battles are run; after over 133000(!) battles fought, most bots are still going up/down too often. -- ABC
Not good. Now there is a mini that can somewhat clearly beat VertiLeach in the RR@H. Tityus! What to do? -- PEZ

Changed: 192c148
Wow, Cake currently holds a momentum of -1192.5 in 36th place. Heh. Looks like its ranking was short-lived :) -- Vuen
Well, just add Tityus an "if (VertiLeach) don't shoot" statement :-) -- Albert

Changed: 194c150
The problem with Cake's initial rating is probably related to the /ConcurrentUpdateProblem. And the issue with how slowly a bot finds its rating neighbourhood will be fixed soon, when the new client arrives that prioritizes battles where new or updated bots are involved. With that client in place it will only be a matter of hours at the current battle speed of the rumble. However, I also think that at some places in the ranking tables the bots change places a bit too often. I suggested somewhere on this wiki that we should maybe consider lowering the impact the result of a battle has on an older bot when it meets a new/updated bot. Currently new bots start with a ranking of 1600 (well, all except Cake) and updated bots start with the ranking of the previous version. Say this initial ranking is way off. Say it's way too low (like if it was Paul Evans' new MegaBot), for the sake of argument. Then bots fighting this bot will have their rating adjusted wrongly, won't they? The system should favour age before beauty, or something like that, I think. -- PEZ
It would be like a team order in car racing. With the benefit that the bots don't have egos to match dudes like Kenny Bräck. =) In fact in my tests Verti beats both Tityus and Fhqwhgads (the latter quite comfortably); I think Verti just needs some more battles in some pairings for this to show. -- PEZ

Changed: 196c152
I thought it already worked like that. I thought the first 20 battles of any bot only affected its own ranking, not its opponent's. Was that changed at some point? -- Tango
November 7 2003: http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=minirumble Sweeet! =) -- PEZ

Changed: 198c154,157
I wasn't aware of that. -- PEZ
Looking at the PremierLeague ranking table it's striking how similar the rankings are. Some bots are much stronger in one game than the other of course, but it's still quite similar. Same kings in the megabot and minibot games for instance. I find the PL ranking table much more interesting to read since I can so easily understand it while the ELO-based ranking is opaque to me. I think two issues have been mixed and confused in the "debate" we have had about the choice of ranking system.
# "Winner takes all" versus "relative strength"
# "Ease of understanding" versus "magic"
It has been a bit like we had to choose between 1 or 2 here, while it is actually possible to choose the best from each. The most important thing for me is "ease of understanding". I don't like at all having an opaque magic function decide the ranking when we have all pairings actually fought and the infrastructure makes a new bot get all its pairings in a jiffy. "Winner takes all" is cool for me, but I can see how "relative strength" measures something important too. And I think it is what most of you opposing the PL ranking feel strongest about.

Changed: 200c159,163
About that panic, Paul. Do you feel it creeping up on you now that BlestPain is only 2.6 points behind DT?
What about making a ranking where a bot is measured on its average share of the score in all its pairings? That's very easy to understand. And the resulting ranking table would be what the ELO-based ranking is trying to predict, if I have understood that much about it correctly. The ranking table could have all three figures in it:
# average %share
# wins/losses count
# the ELO-based rating estimate
The table would be sorted on "average %share" by default, but we could make it sortable by the other figures as well. Sorting it on the ELO-based estimate should produce a very similar table to the default sort, or the ELO magic is not doing what it should. If the tables are very similar then the ELO-based figure could be removed out of redundancy. If the tables are very different then the ELO figure is of little importance anyway and could be removed for that reason.
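A sketch of how the "average %share" figure could be computed from the accumulated pairing scores. The types and the {myScore, enemyScore} layout are illustrative:

<pre>
import java.util.*;

/** Sketch of PEZ's "average %share": a bot's mean share of the
 *  accumulated score over all its pairings, on a 0-100 scale. */
public class AverageShare {
    static double averageShare(Collection<double[]> pairings) {
        double sum = 0;
        for (double[] p : pairings)   // p = {myScore, enemyScore} per opponent
            sum += p[0] / (p[0] + p[1]);
        return 100 * sum / pairings.size();
    }

    public static void main(String[] args) {
        List<double[]> pairings = List.of(
                new double[] {60, 40}, new double[] {45, 55});
        System.out.println(averageShare(pairings)); // 52.5
    }
}
</pre>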

Changed: 202c165
http:/robocode/uploads/pez/BPclosinginonDT.png
I think I can produce a script on the server that produces a current "real relative strength" ranking table. It will take me a good share of the time I would otherwise spend on making VertiLeach stronger though. So someone besides me should think that table is of interest before I go ahead and hack it together.

Changed: 206,222c169
I wouldn't worry too much, that small difference must also be a consequence of the strange way this ranking is being calculated. There is no way BlestPain could be considered "almost as good as" DT, imho. There are many bots out there that can defeat BP, even if by a small margin, but I don't know of a single one that comes close to winning against DT. And, in this case, both have fought over 1000 battles, it is not a case of lack of results, there must be something wrong with the expected score formula or with the way the rankings are being recalculated. -- ABC

It's true that in 1000 matches it's hard to beat DT. But don't forget that the rumble acts as if there were no saved data. The bot should learn as quickly as possible in 35 rounds (more opportunities against DT). Saving data does not help too much because the battles run on different clients. This is the new constraint of the rumble. --SSO

The first 20 battles do affect the enemy's ranking (I removed the restriction when I changed the way ratings are calculated and never added it back - I'll do it for the next release). -- Albert
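A sketch of how such a rule fits into a rating update. The expected-share curve and K constant here are generic ELO-style assumptions, not RoboRumble's exact formula:

<pre>
/** Sketch of the rule Tango and Albert discuss: a newcomer's first 20
 *  battles move only its own rating, not the opponent's. Illustrative. */
public class ProvisionalElo {
    static class Bot { double rating = 1600; int battles = 0; }

    static double expectedShare(double ra, double rb) {
        return 1.0 / (1 + Math.pow(10, (rb - ra) / 400));
    }

    static void update(Bot a, Bot b, double aShare) {
        double e = expectedShare(a.rating, b.rating);
        // an established bot's rating only moves against non-provisional foes
        if (b.battles >= 20) a.rating += 3.0 * (aShare - e);
        if (a.battles >= 20) b.rating += 3.0 * ((1 - aShare) - (1 - e));
        a.battles++;
        b.battles++;
    }
}
</pre>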

It's true you have a small chance against DT if your battle is the first one on that particular client and you are lucky enough to win those 35 rounds, but that also affects everybody else, including BlestPain. Even with 35 round battles there is currently no bot that comes close to DT's performance, and this ranking doesn't seem to reflect that, imo. -- ABC

Note that currently, DT has been beaten by Tron, Smoke, Teancum, SandboxLump and Chamaleon, and ties with Sonda (6 bots). BlestPain has been beaten by Tron, CigaretBH, Cigaret, SandboxDT and Sedan, and ties with Smoke (6 bots). So it is not performing much better than BlestPain (it just beats bots by a wider margin). The conclusion to me is that the ranking system is OK. Note that the ER was especially favourable to DT, with short 10-round matches, where movement and saved data were the key. Now, with longer games and many clients running battles, guns and fast learning play a bigger role, because (1) you play against many more enemies, so you play fewer rounds against a given enemy, and (2) battles are executed on different clients. -- Albert

You're right of course, I didn't check BlestPain's (impressive!) record, and was basing my thoughts on the fact that Tron's current development version can beat BP 60-40 and gets crushed by DT 30-70 in survival; it just barely wins in total score against BP and loses 60-40 against DT... I assumed that the general trend against other bots would be something similar; I forgot that Tron has a somewhat "different" way of dodging. ;) -- ABC

Btw, BlestPain just took 1st place away from DT, I still find it a little suspicious... ;) -- ABC

Heheh, I was just coming to mention that ;) and GARB, already almost 200 battles and Fractal still hasn't faced SandboxDT... -- Vuen

SandboxDT is now in 4th place and going down! :O -- ABC
To me, the PremierLeague is OK as it is. I prefer "winner takes all" systems to "percentage score" systems. I think I said it before, but using this rating system would be like deciding the soccer league winner by using a formula like (goals scored / (goals scored + goals received)). The information is there anyway for anybody who wants to know it. Just divide the %wins column by the #matches one and you get it.

Changed: 224c171
5th now. Something's obviously wrong. I suppose it's kind of redundant of me to post this, but, meh. It's 11 am, and I have 30 pages to read (<- and understand!) and 5 online physics assignments to do before I go to bed. In other words, I'm wasting time here =) -- Vuen
I agree the proposed rating system is clearer than the ELO one, but if we move to this new rating system, I think two conditions should be fulfilled: (a) we should remove the ELO rating system (it would create a lot of confusion to have two similar systems in place); (b) there should be a strong consensus, since the ELO rating system is the standard %score rating system, and the new one should become the standard also.

Changed: 226,234c173
It's not necessarily the rating system that's fluking here. It could be DT itself that has some bug (someone mentioned an array index out of bounds). The count of bots that DT loses to is now 8. And it ties against a few more. Interestingly, all this stir is happening above my bots. Tityus's and Gloomy's rankings are rock solid. Until they meet DT maybe. =) -- PEZ

I'll take a closer look, but I agree with PEZ that the problem is probably not in the rankings (maybe DT has some bug on some new client?). All my bots are quite stable now, and their rankings are solid and logical. -- Albert

I am beginning to suspect the /CuncurrentUpdateProblem? again. Think about the Cake incident. It looked like Cake got initialized with DT's rating, right? Possibly something is wrong with DT's record. -- PEZ

I have been analyzing the individual battle records, and there are some strange ones for DT. My theory is that some clients are running DT in challenge mode:
@people running the clients: please check that DT is running in normal mode (not challenge mode).
@people with bots that can run in challenge mode (like DT and Tron): Please consider releasing a version that runs only in battle mode.
So if everybody agrees to change the current ELO rating system to the new one, then it is OK for me.

Changed: 238,299c177
That's it! And it's probably my fault too. I have been leaving my work computer on all night running battles, and I have RR@H installed in the same folder I use for testing... That explains Tron's inconsistency too. :-\

Just checked it, I was running DT in reference mode. Is there maybe a way of deleting all the battles run by me since yesterday? -- ABC

Not an easy one, but as long as the problem is solved, the ratings should fix themselves fast enough. -- Albert

Eek. Perhaps we should warn people who install the client to install it in a separate folder... -- Vuen

We will do. In any case, I insist on releasing "safe" versions that do NOT run in challenge/reference mode (and, when possible, compatible with Java 1.3). -- Albert

It should be made a requirement rather than a suggestion that rr@h be installed in a new blank folder. So many robots now come with configuration changes that affect their performance that not only is it impossible to remember which ones you've set properties on, but it's a complete hassle starting up rr@h because you have to reset everything. While this is the user's fault, it still degrades the credibility of the 'ease of use' of the rr@h software. Plus, bots that have been downloaded before others, or that are used for testing more than others, gather much more data. Suppose you've been watching SandboxDT fight against all sorts of opponents to watch its movement, then download Tron and run the rr@h client. SandboxDT will have information on everyone while Tron will be blank, giving SandboxDT a huge advantage in the rumble. This should not be allowed; a fresh installation will keep everything separate from your test robocode. When I first installed rr@h I created a folder called rrhome, installed robocode in it in c:\rrhome\robocode, then unzipped the rr@h software into c:\rrhome. This way anything I do in robocode has no effect on my rumble client. I'll check the license agreement for RC, and if we can package it with it I'll make an easy-install zip file containing robocode that automatically unzips to c:\rrhome\robocode. -- Vuen

Yup. That's the way to go. If you do it with something like rrsetup.jar you can make it copy an existing robocode installation. That way you don't need to worry about any license stuff, I think. -- PEZ

Any initiative that helps reduce uncertainty about the rankings is great. So please go ahead with it. Please remember to package codesize.jar also, since the client needs it to evaluate bots' codesize.

On the other hand, I keep thinking that the ultimate responsibility for this kind of error lies with the developer of the bot, not the one running the client. People must be conscious that the new environment is not as tightly controlled as the ER or the MinibotChallenge? was. If someone feels his bot needs data, then he is responsible for packaging it into the .jar file he delivers. The same applies to parameters (if someone wants to make sure it doesn't happen again, just remove them from the bot). And of course, we will do whatever is possible to prevent it from happening. -- Albert

;Kawigi said (way above): ... Is it better to include SandboxDT vs. SpareParts in the final ranking, or to focus on how SandboxDT does against Wilson, Iiley and I (and PEZ or whoever else pokes their heads in the top 6). ...

I am a bit unsure whether "top 6" was a completely arbitrary set. Would you choose that set today too? =) -- PEZ

It seems DT is recovering slowly. Albert, is there anything I can do with the data files manually to reset DT back to where it belongs? It is unfair that the snapshot archives for Sept 2003 (which are settled tomorrow morning, CET) should list DT anywhere other than at ranking #1, even considering RR@H is still in the testing phase. -- PEZ

I'll try to run DT focused on the bots with "wrong" results. It should correct its %score and speed up DT's recovery.

It's fixed now. Again: @people running the clients: please make sure you don't have bots in reference mode. @people with bots that can run in reference mode: seriously consider releasing a "battle mode only" version for the RR@H. -- Albert

Or rather, release "reference mode only" versions and keep the vanilla version in "battle mode". -- PEZ

Hmm... I still think it's unfair to those who provide these tools that they would have to release multiple versions. It's a real pain to have to package and upload two versions of your bot, and it will clutter the repository with duplicates of every bot. For example, right now Fractal has a properties file that allows it to graph its opponent's movement curve directly onto the battlefield as it fights. This makes Fractal far more vulnerable to the above problems than even the reference bots; the RR@H client will run robocode.jar rather than RobocodeGLV014, so if Fractal is left in GL mode, even if the RobocodeGLV014 installation is there, Fractal throws an AccessControlException? trying to access the GL classes and effectively gets a score of 0. If it's not made a requirement to have RR@H installed in a separate folder, in the next version of Fractal I will simply remove this graphing ability altogether and not bother releasing a GL version; and I think other bot developers may do the same. -- Vuen

Of course it should be made a requirement. But it's not easily enforced. The packaging you suggest will help, and from there there shouldn't be too much of a problem any more. But still, if you want to be sure: make sure your battle bot only does battle. -- PEZ

On the contrary, it can be easily enforced; you can make the RoboRumble client, for example, delete any of the .properties files that Robocode creates, such as window.properties, after Robocode creates them. Then when RR@H is started up again, it can check if window.properties exists before starting up Robocode. If it does exist, that means the Robocode installation was run separately from the RR@H client, and the client should take some appropriate action. It could perhaps simply warn the user that it is being run on an unisolated installation, or it could refuse to start up, or it could delete the folder in .robotcache for every bot in the rumble and refresh the bot list. The first and last are what I would consider appropriate: warn the user with a Swing message box, and refresh all rumble bots back to their original .jar contents. -- Vuen
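
A rough sketch of that check, assuming the client knows its Robocode folder; the class and method names are invented, and the real client may organize startup differently:

    import java.io.File;
    import javax.swing.JOptionPane;

    class IsolationCheck {
        // window.properties only exists if Robocode was started since the client
        // last deleted it, i.e. the installation was used outside RR@H.
        static void warnIfUnisolated(File robocodeDir) {
            File props = new File(robocodeDir, "window.properties");
            if (props.exists()) {
                JOptionPane.showMessageDialog(null,
                    "This Robocode installation has been used outside RR@H.\n"
                    + "Rumble results from it may be unreliable.");
            }
        }
    }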

Robocode refreshes it for you if you just "touch" all the bot jars. But I think this would be too tough. For non-rr@h purposes people might want their robotcache left as it is. Appropriate action would rather be to just quit. In any case, you still risk your bot running in a non-battle mode. The best way to prevent that is to release a bot that concentrates on battle only. -- PEZ

Don't worry about Fractal. The RR@H server refuses any result with a score of 0, because it assumes the bot crashed and the battle is not valid. The same will happen, for example, for a bot not compatible with Java 1.3 when someone runs RR@H under Java 1.3. The real problem is bots that run normally but can behave differently depending on the set-up. -- Albert

Ah. Well that's good to know :). Thanks Albert. On a side note, VertiLeach 0.2 is in 3rd place! -- Vuen

Now it has dropped to 6th. Its performance is a bit too arbitrary for a top-3 bot maybe. On the other hand, it's designed to kick MiniBot ass, and it does. Let's see if it can cling on to that #1 spot. I'm incredibly proud of it winning the RobocodeLittleLeague 1v1 mini division too. =) Now I might let it grow into a megabot, which can maybe perform a bit more stably. -- PEZ

It's no fun watching the development of the minibot rankings! Fight, Verti, fight! -- PEZ

Can someone see a reason why GlowBlowAPM suddenly wakes up and goes up to #1 in the minibot rankings? Looking at its details it seems to lose against more bots than FloodMini and certainly more bots than VertiLeach. I know GlowBlowAPM is a strong mini; it's just surprising to me that this change in ranking happens after 700+ battles. -- PEZ

I don't see this as all-of-a-sudden at all. GlowBlowAPM has been the #1 minibot before, and it has battled back and forth with FloodMini for the spot ever since the minirumble started. It isn't really made to beat FloodMini and VertiLeach; it was meant to trash robots that use pattern-matching and head-on aim. Note that HumblePieLite is one of its significant problem bots, too. Not only is GlowBlowAPM's movement optimized against itself, it dodges itself. I agree, though, that it may be more significant that FloodMini and Sedan only have a losing score against something like 5 robots, and VertiLeach still only has a losing score against FhqwhgadsMicro. -- Kawigi

Well, I think the variation of +/- 30 points in the gap between these two bots feels a bit on the unstable side. Don't expect to see VertiLeach lose against more bots. In my tests it wins clearly against all minis. Which made me assume it would rank #1, but there's obviously something about the ranking that I don't understand. I'm currently uploading version 0.2.1 of VertiLeach, which according to my tests (which are now based on RR@H) is marginally better than v0.2 against minis and clearly worse than v0.2 against all-size bots. This is a bit surprising, because it loses against fewer bots than the last version in the all-size field and [wins more clearly against all minis]. Which is why I upload it. RR@H can say what it wants, I regard this as being the best minibot. =) I would welcome a slight change to the ratings calculations where wins are favoured. That is, I think there should be a bigger rating-change gap between 47.5% and 52.5% than between, for instance, 40% and 45% or 55% and 60%. -- PEZ

I agree. Once a bot clearly beats an enemy (let's say 66% of the score) it should not matter whether it gets a higher or lower percentage. The same should apply in the other direction (once a bot loses and is unable to get more than 33%, it shouldn't matter whether it gets 10% or 30%). It would also mean that for any pair of bots with a big rating difference, the results would only affect their ratings if they are "unexpected" (that is, a bot that is expected to get 33% or less would only affect the enemy's rating if it gets more than 33%). Let me know if you like this approach so I can change the ratings calculations. -- Albert
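
A tiny sketch of the bounded-score idea, with the 66%/33% from the example above as the caps (the helper name is invented):

    class BoundedScore {
        // Clamp a %score into [lo, hi] before it feeds the rating update;
        // everything outside the window counts the same as a clear win or loss.
        static double clamp(double percentScore, double lo, double hi) {
            return Math.max(lo, Math.min(hi, percentScore));
        }
    }
    // e.g. clamp(92.0, 33.0, 66.0) == 66.0 -- a 92% thrashing counts as 66%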

I thought that was why that "S-curve" was used; I agree it seems a bit maladjusted(?). Paul Evans is probably the guy to talk with about changing it... -- ABC

I'm not sure I understand how this scheme would favour wins. Can you elaborate? -- PEZ

If I understand it correctly, it doesn't favour wins; it has an effect similar to what Albert described: the difference between winning by 70% or 90% is much smaller than between 40% and 60%. It probably can be "skewed" to favour wins; I am not the guy to do it though, it was Paul who adjusted it for the ER. The problem here is that he did it for a much smaller data set, and for a competition where a bot only fought its neighbours. -- ABC

In my opinion, the current rating system (both in RR@H and the ER) has two disadvantages:

# It scores bots according to their percentage of the score. But bots are not designed to score the maximum number of points; they are designed to beat the maximum number of enemies (and that's the intuitive criterion for evaluating a bot). In other words, when somebody designs a bot, once it clearly beats an enemy he doesn't care about beating it by 75% or by 95%. Because the current system rates the bot according to this 75%/95%, there is a mismatch between the expected rating and the real one.
I don't agree -- Paul Evans

Changed: 301c179
# The ELO rating system (and this one is just a variation) expects a "normal" distribution of the score according to the rating difference, but we are not sure this condition is fulfilled. Things get worse because when two enemies are close in rating the function has a nearly linear behaviour, but when two enemies are far apart, any error in the adjustment of the function or in the hypothesis can have a bigger impact.
For some reason that doesn't surprise me... -- Tango (BTW, I don't agree either, I see no problem having both)
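
For reference, the classic ELO expectation is a logistic S-curve like the sketch below; Paul tuned the rumble's actual curve separately, so the constants here (the chess-standard 400) illustrate the shape only:

    class EloCurve {
        // Expected share of the score for a rating difference
        // d = myRating - opponentRating, as a fraction of the total.
        static double expectedShare(double d) {
            return 1.0 / (1.0 + Math.pow(10.0, -d / 400.0));
        }
    }
    // expectedShare(0) == 0.5, expectedShare(200) ~ 0.76, expectedShare(-200) ~ 0.24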

Changed: 303c181
The proposed approach provides a solution for the problems mentioned:
I would like to see the continuance of the ELO based system too. You need look no farther than cx.micro.Smoke to see the difference in the two systems. As I type this, Smoke is #6 in the PL and #19 in the traditional ELO based system. -- jim

Changed: 305c183
# Once a bot clearly beats an enemy, or is clearly beaten by it (e.g. 66% to 70% of the score), any bigger difference will not be taken into account. So the rating system will behave according to expectations and will be aligned with bots' behaviour (that is, to beat the maximum number of enemies, not to get the maximum score).
I don't agree either. I haven't a clue how the ELO ranking system works, but I don't really care; I just know that it's designed by lots and lots of people who are much smarter than me, and that it takes into account the amount by which you thrash a bot, while the PremierLeague doesn't. I don't think your #2 is an issue at all; relative strength should be entirely how it is decided. If people are curious about how the ranking works, just make the scoring piece of the servlet open source so that they can see how their bot is ranked. -- Vuen

Changed: 307c185
# Setting the maximum/minimum score to 70%/30% means that (according to the current curve) you don't expect an enemy with a rating difference of 200+ (approx.) to beat or tie you. So you will not be affected by it unless it gets a good "unexpected" result. This way, you are restricting yourself to a segment of the "expected rating line" of about +/-200 points, in the most linear part of the curve, so you minimize any misadjustment the curve could have.
Or better yet, modify the output of the details page to include the solved equation in a new column for the bot pairing in question. Then you will get to see the formula in action. -- jim

Changed: 309c187
The global behaviour of the new rating system would be to disregard results against enemies with distant ratings (unless they get an unexpected result) and to move from a points-based rating system to a win/lose rating system, which is globally more intuitive and more robust. -- Albert
I think both Vuen and Jim misunderstood the proposition. I am not suggesting we scrap the ELO-based system for the current PL one (even though I wouldn't mind that either). What's proposed is that we use the ELO-based way of considering relative strengths for the rankings, but skip the obfuscation of the magic formula. The ELO-based system is a great system for estimating ratings when there's no chance all pairings can be run. But now that we are running all pairings (over and over again) it borders on the ridiculous to continue with an estimate. I also suggest we build the table including all three scorings to begin with, but that we will probably remove the ELO column once we see that it's about the same as the "real strength" one. -- PEZ

Changed: 311c189
I like it. But wouldn't it be better to make the client behave a bit more like the ER and mostly fight bots against their neighbours? It would make the ranking evolve much faster, that's for sure. Wouldn't that have a similar effect to, as you describe it, "ignoring" results of battles between bots with very distant ratings? I sure would like a more "ladder-like" system. -- ABC
The servlets ARE open source. -- Albert

Changed: 313c191
Well, one of the things I like about RR@H is that it makes sure a top bot really is a top bot, just because it faces everyone. My impression is that bots tend to get overspecialized in killing their neighbours, and quite frequently you get nasty surprises when they face bots that should be easily beaten. The proposed system would be a kind of league, with %scores lower than 30% being a loss, higher than 70% being a win, and percentages between 30% and 70% being a kind of graded tie. Also, by limiting the fights to a "local range" the problem with %scores would persist: is a bot that beats all enemies by 70% better, or a bot that loses against one enemy and beats the rest by 90%? To me, the first one is better, but an ER rating system would probably say the second is best... Of course you can think the other way (to me, what makes DT the best bot is that it beats almost all bots, not its %score). -- Albert
And I have read the sources. I have also tried hard to figure out the ELO-based ranking system. I don't understand it anyway. And I refuse to just lean back and trust that others are smarter than me. I know they are, but I would much rather have a ranking system that's transparent even for non-statisticians like me. Everywhere I look where these kinds of ranking systems are used (chess and tennis are two visible examples) they are a means to give all players a relative ranking without having to play all pairings, something that is impossible in those games. But we (Albert) have solved that problem, and thus there's no need to obfuscate the rankings with voodoo. Even if it's damn cool voodoo. -- PEZ

Changed: 315c193
I like it too. And I like the quality of bots having to face all enemies and not just those in the neighbourhood. The speed with which a bot gets a correct rating will increase greatly once the clients try to even out the number of battles each bot fights. Though I don't see 55% as a tie. It's a win. -- PEZ
OK, to give us a more complete picture from which to make a decision, I have hacked the server classes generating the PL results a bit. Now the server produces both types of PL rankings. The "real relative strength" one looks like this for the general category:

Changed: 317c195
I think it would be interesting to have both ranking systems side by side using the same data. Some bots are designed to get high scores, some to win; both are valid. Although I would make it a simple win/lose, and not take any notice of the actual %. I guess you would need a draw for actual 50/50 games, but that is very unlikely to ever actually happen. -- Tango
* http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=roborumble&table=2

Changed: 319c197
I agree with Tango. Would it be possible to see this side by side, to see what the effects of the proposed changes are? I think Albert's proposed 70%/30% scores also make a lot of sense. I think there needs to be some differentiation between bots, but at some level it is overkill. Seeing SandboxDT have cf.C8_1.1 listed as a problem bot after beating it with 70%+ of the score is not logical. Would it be possible to simply say that no bot should be expected to win by a larger margin than some limit (i.e. 70%)? What effect would that have on the rankings? -- jim
If you study this table and compare it to the ELO-based ranking you'll see that they are about as similar as I had thought they would be. The only real difference is that one contains a very easily understood score (DT collects 72.9% of the score in all the pairings it has participated in) while the other contains an arbitrary voodoo score (DT has 1892.65). Only where ratings are really close can you see a difference in ranking (like between BlestPain and VertiLeach). I'd much rather have the ranking decided by a score I can easily understand than by one that's opaque to me.

Changed: 321c199,201
I guess we can put the new rating system in place and see what happens (and revert if necessary). We cannot have both at the same time (but we all know how the current one works, so it should not be a problem). Please vote for your preferred min/max percentages:
I'd say we only need two rankings;
# Real (measured) relative strength
# Winner takes all

Changed: 323,327c203,207
* none - Vuen
* 30%/70%
* 35%/65% - Albert
* 40%/60% - PEZ, ABC
* 45%/55%
If you're curious about how the other games look from a "real relative strength" perspective:
* http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=minirumble&table=2
* http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=microrumble&table=2
* http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=nanorumble&table=2
* http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=teamrumble&table=2

Changed: 329c209
Note that the closer the percentages are to 50%, the slower the ratings will change. Also note that a narrow minimum/maximum cannot properly deal with uncertain results. -- Albert
I haven't checked all the tables, but the table for minis shows identical rankings to the ELO-based table. Identical meaning each and every bot gets exactly the same rank. Now tell me why we should obfuscate these rankings with statistical formulas.

Changed: 331,433c211
Intuitively I would have said 45/55, but since Albert says it has significant weaknesses I'll make a compromise and vote for 40/60. -- PEZ

45/55 might be a bit too radical. On the other hand, if we want to quickly see the effect of this change, I think we should go for the lower values: 40/60 -- ABC

Hmmm, I don't know if I see the purpose in this. DT can probably beat most top-50 bots about 70%-30%, so I think if cf.C8_1.1 gets 30% of the total score against DT with its general rating, he is a legitimate problem bot for DT. FunkyChicken beats most bots in that rating region with between 70% and 80% of the total score. Statistical targeting sort of assumes you can't generally hit your opponent 100% of the time, and some robots' strength is that they hit some other robots nearly 100% of the time. I see this as a legitimate weakness of such a targeting algorithm, basically that it is less specifically adapted toward bad movement. -- Kawigi

Yeah, it's quite a philosophical question. But using a raw percentage score would be like deciding the winner of the football league based on the number of goals scored by each team, not the number of matches won. It's probably a legitimate method, but it is quite counterintuitive to me... -- Albert

The football analogy is quite good. Think about a football team (we're talking "soccer" here, but I would guess the same goes for American football) that wins all its matches during the season and still doesn't top the league table. I, and many with me, would object. Just as I object when my test runs show VertiLeach winning against all minis and it still isn't the #1 minibot. It's just not intuitive. For the record, I think I would take the same stand if it was Fhqwhgads winning against all the other minis. -- PEZ
* Winning against all minis implies battling all minis enough to really be able to tell. -- Kawigi

It does not worry me what rules are used - instead of setting a 70/30 rule, just have the system select battles between bots within 200 rating points (or whatever the selected value is) (it won't make much difference). However, I am concerned that some of you are writing bots which 'give up' once they have 70% of the score - perhaps, like football players, they are saving themselves for the next match. (What you will find is that by the time you have written a bot that beats DT it will automatically thrash bots further down the rankings.) I had a look at the distribution of scores for large rating-point differences in the RR@H scores, and the curve looked good to me (others can check - the data is available). For those that wish to count a win as a win and ignore score, why not go the whole hog and count rounds won, or even bullets hit - I'm up for it :). p.s. If you have a bot that is never thrashed but has a low ranking, have a look at the gun - good movement against a good gun or a bad gun ensures a near draw; poor movement against a good gun ensures a thrashing. A good mover can get near a draw against DT even if it fires randomly (without firing at all it can get 30% of the score), but movement is only half the story, and the present system reflects this truth. -- Paul Evans

I agree with Paul; I think there must certainly be a difference between a 70% win and a 90% win. With the roborumble matching every bot against every bot, there is much more room for bot improvement. Take data management, for example: bots now have to manage and maintain data on 200 bots rather than just 10 or so. If a system is implemented where anything above 70% is ignored, bots can simply ignore data on bots they already thrash without any more intelligent decision. This is bad. Bots should have to intelligently decide which data to keep and throw away; the way it is now opens the floor to all sorts of impressive data optimization algorithms that use rating comparisons to figure out what data is best to keep and throw away. Other examples are using APM and anti-gunning curves; against some of the lower bots, predicted percentages of even 90% of the score should be attainable. Movement algorithms with accurate data can produce bots that almost never get hit by some of the lower-end bots, reducing the enemy score to near zero; against lower bots, gunning patterns can be written to wait for high-chance shots and fire with almost guaranteed success rates. The floor right now is wide open for making a bot that's totally lethal to the lower-end bots; this should certainly matter quite a bit in the score. If we rewrite the ranking system to bound scores by 70%-30%, we lose all sorts of conceptual possibilities that have only begun to develop with the current ranking system of the RR@H. My vote is to leave it the way it is. -- Vuen

Note that I have only asked for intuitiveness in the rankings. I suggested giving an extra bonus for a win (defined as > 50% score). In all other respects I am quite satisfied with the current system. "Winning against all minis implies battling all minis enough to really be able to tell." I have. I can tell. You can too if you like; just set up a RoboLeague with all minis and run 20 seasons of 35-round 1v1 battles with VertiLeach as the focused opponent. I can also tell that even so my bot doesn't rank #1 among minis. I think that's wrong. I don't see the point in taking it to other extremes, like the pure survival or pure hit rates that Paul suggests. Sure, we can design a survivalist league with this infrastructure. But that's something else entirely. VertiLeach sure isn't satisfied with winning 70%. It fights as best it can to the end. This shows in its [details]. -- PEZ

One of the advantages of RR@H is that it is becoming the most tested/analyzed/discussed robocode league :-) I don't like the idea of having a rating system and wondering "what if" it was different. Considering we are in the testing phase, my proposal is to change to a bounded-limits rating system, see what happens (e.g. leave it for one week) and discuss again to decide the final rating system. -- Albert

Of course we are in the testing phase. And of course, since it only takes a week, we should test your proposed approach. -- PEZ

A week's test sounds like a good plan. I don't see why it couldn't be run side by side though. Couldn't you have two versions of RR@H running on the server, and give the data for each battle to both? -- Tango
* What would be the point of this? We already have the answer for how the current rating system behaves. -- PEZ
* I don't want to know how it behaves, I want to know how it *compares*. You could just compare the dynamic rankings for the new system with the last "official" ranking from the old, but it would be better if we know they are the same battles. -- Tango

I don't think you have to do anything to the league to test a new rating system - I think you have all the necessary information in data files to investigate the effects of different rating systems already. -- Paul Evans
* Good point. If I pack a fresh all-pairings file, would that be enough for someone knowledgeable to produce the alternative ratings? -- PEZ

Please do, I can surely try. Maybe you could make the server package it automatically at certain intervals? I made a RR@H server version integrated with a Lotus Domino server (I'm not very comfortable with Java servlets); I could maybe even maintain some alternative ratings if the client was changed to submit battle results to multiple servers... I can't promise anything about an online server, but at least with a regularly updated all-pairings file I could manually publish some 70/30/wins/survival/whatever ratings. -- ABC

Yeah. I can create a small program that uploads the packaged results to a local server installation (with the modified rating system) and "rebuilds" the ratings. -- Albert

The all-pairings file has been created weekly for quite a while now. Now I have created fresh files. I've changed it so that it now creates an allpairings_game.zip for each game we are currently running, which means roborumble, teamrumble, minirumble, microrumble and nanorumble. I think we may have some broken records left in the files. I might find those and clean them out later; hope you can work with the files anyway. I have separated the building of the wiki-published rankings and the allpairings files now, so that we can have different schedules. Please let me know how frequently you would like these files built. The roborumble game file is here: http:/robocode/rumble/allpairings_roborumble.zip I think you can work out how to reach the other files. =) -- PEZ

Cool, I suspected it was being updated, but didn't know it was weekly. I'll do some more experiments with "fresh" data, but leave it to Albert to publish any alternative ratings. -- ABC

On the matter of speed, couldn't the rating constant be set to a higher value for the first 10 battles fought? I was thinking something like this: every bot starts at 1600 (even new versions). The first few battles (10 is probably good) are run with an increased constant (5x?) and do not influence the enemy's rating. Those battles are "fixed" so that the enemy is always the closest one in the ranking table. For example: first battle against a "median" bot (1600); the result says it should go up 100 points; 2nd battle against a 1700 bot; it goes down, but not so much this time; etc. It would be something like a binary search for a bot's approximate position before it starts the real test of fighting everybody else... -- ABC
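
A sketch of the variable-constant part of this proposal; the numbers (10 battles, 5x) come straight from the paragraph above, while the names and the base factor of 3 (quoted further down as "rankingdelta = scoredelta * 3") are illustrative:

    class ProvisionalRating {
        static final double BASE_CONSTANT = 3.0; // the server's usual factor

        // first 10 battles: 5x the constant, per the proposal above
        static double constantFor(int battlesFought) {
            return battlesFought < 10 ? BASE_CONSTANT * 5 : BASE_CONSTANT;
        }

        // a new bot's first battles should not touch its opponents' ratings
        static boolean updatesOpponent(int battlesFought) {
            return battlesFought >= 10;
        }
    }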

I agree with the idea of not updating the enemy ratings when a new bot fights its first battles, and that increasing the constant during these battles would be good. But I can't agree with the idea of new versions starting at 1600. The best guess for a new version is its previous rating. As for the binary search, it presents some concurrency issues (which client would run the battles? How do they synchronize? What happens if the assigned client fails to upload them?) that would require too much programming. -- Albert

Oh man... I can't believe I submitted ProtoStar?... -- Vuen

I say it again: when the clients favour battles with new/updated bots, the speed with which bots find their neighbourhood will be fast enough. -- PEZ

Yes, it would be hard to do with multiple clients; the only way I see it being possible is if the author is required to do a qualifying run for a new candidate. Scratch that, just upping the constant should be good.

About that constant (sorry about the brainstorming): I thought it just made things faster by sacrificing some precision, but I got some strange results after some testing. First I did some tests by processing all the pairings with the current formula (rankingdelta = scoredelta * 3); it takes a long time to stabilise, around 20 iterations, and by iteration 14 there are still some bots changing places significantly, namely the ones with fewer pairings run, like VertiLeach, which comes into the top 30 in iteration 14 and climbs to the top 15 by iteration 18. No problem, these are the correct positions if we take these battle results as the absolute truth, I thought. But then I upped the constant by 10x (ranking = score * 30); as I expected, it stabilised much faster, 2 iterations for DT to get its 1860 points and no significant place changes after that. But I got a quite different ranking this way! Chameleon took 2nd place (from 3rd), BlestPain got 6th (from 2nd), and everybody else got shuffled around by as much as 5 positions... How do you explain this? Maybe the increased constant favours less specialised bots? Or does it just help bots with fewer results? More/less resistance to flukes? I don't know, but it sure seems it can make the table stabilise much sooner... -- ABC
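
For readers following along, a condensed sketch of the iteration being described; the names are invented and expected() is a stand-in for the server's S-curve, not Paul's exact formula:

    import java.util.*;

    class RatingIteration {
        Map<String, Double> rating = new HashMap<String, Double>();

        // stand-in S-curve: expected %score from the current rating difference
        double expected(String a, String b) {
            double d = rating.getOrDefault(a, 1600.0) - rating.getOrDefault(b, 1600.0);
            return 100.0 / (1.0 + Math.pow(10.0, -d / 400.0));
        }

        // one battle result: nudge a's rating by c * (observed - expected).
        // c = 3 matches the quoted "rankingdelta = scoredelta * 3";
        // c = 30 is the 10x experiment that converged faster but settled differently.
        void feed(String a, String b, double observedShareOfA, double c) {
            double delta = c * (observedShareOfA - expected(a, b));
            rating.put(a, rating.getOrDefault(a, 1600.0) + delta);
        }
    }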

Just remember that a wiki is a brainstorming tool first and foremost. Storm on. I think the way to go would be to "quarantine" new/updated bots for some 50 battles or so before they start to impact the rankings of the opponent bots. This, combined with clients that try to level out the number of battles for the bots in the rumble, should do the trick. Interesting that you got different rankings altogether with the factor of 30. What is a specialised bot anyway? -- PEZ

I think it is a bot that doesn't follow that S-curve like it should; it wins against some bots it should lose to and loses against others it should beat, according to the current ranking. -- ABC

Like VertiLeach then? (Well, it doesn't lose against many bots, but it has lots of red and green boxes in the ProblemBot index column.) -- PEZ

Yep, unexpected scores (in both directions), not wins/losses. A specialized bot takes longer to find its spot in the ranking table. -- ABC

You can find a simulated ranking for 70%/30% limits in http://www.geocities.com/albert_pv/Rankings7030.htm -- Albert

About the rankings: I suspect (the problem is not new to me, I felt the same about the ER) that there are many equilibrium points in the rankings (i.e. there are many combinations of ratings that make them stable). Sometimes the equilibrium is broken and all bots jump to another equilibrium point (an example could be GlowBlow in the minis competition just a couple of days ago). In this case, the higher the rating-change constant, the more easily bots move from one state to another. Also, if the constant is too high, it can happen that ratings never settle into a stable state, and cycle around. -- Albert

Like Paul said, the 70/30 rule doesn't change much... I'm now convinced that the lack of stability in the ranking is related to the way the ranking is updated and to the constant used in those updates. If you feed the server the same battles over and over you get a stable ranking at some point; the speed at which that equilibrium is reached depends on the constant you use, and a higher value will make it faster. In my experience, when the top bot (DT) stops going up you have a stable table. What I discovered (and wasn't expecting) is that the position of most of the bots varies a lot depending on the constant used... I'll publish my results later today. I have 20 iterations made with c=3 and 3 with c=30; both seem to have reached a stable equilibrium, but they have some major differences in the positions of the top 30 bots. -- ABC

Doesn't change much? It makes worlds of difference for VertiLeach. I know I'm heavily biased, but the 70/30 boundaries make the rankings much more intuitive in my opinion. DT only loses to Chameleon. BlestPain only loses to DT, Smoke, Cigaret and VertiLeach. And VertiLeach is the next bot in the row of bots with few bots that can beat it. A #3 ranking for Verti is logical. If you don't think it changes much, I guess that means you don't mind if we switch to this ranking? My vote is obvious, I guess. =) -- PEZ

Hehe, I didn't notice Verti was so high up; Tron also went up from 20th to 14th. Kawigi and Rozu might not be too happy about it though, FloodMini and GlowBlowAPM went down like crazy... It changes a lot, but so do successive iterations of the same formula, so I'm not really convinced that those are stable results. Was that a single iteration through the results, Albert? -- ABC

I simulated how the rankings would be calculated on the server. I picked a random row, fed it to a local server as a battle, then picked another one, fed it into the server, and so on. I fed it a total of 168.000 "simulated" battles. BTW, are you feeding the file to your simulation in a linear way? Maybe the order of the battles is affecting the ratings? -- Albert

I agree that the rankings change a lot with a 70/30 rule. All the Aspids are down with the new system :-( -- Albert

Yes, I'm processing the allpairings file sorted by ascending time of last battle. I do around 19.000 "battles" per iteration, so your method should be roughly equivalent to me doing 8-9 iterations, assuming that the order of the battles doesn't influence the final "equilibrium state". I think I observed a very stable equilibrium at around 400.000 battles (20 iterations), with the normal formula (no 70/30 rule) and c=3. If I have the time I'll try it with the 70/30 formula. I don't think we should change it before we reflect on which bots go up/down and why... Unless of course PEZ refuses to play if he doesn't see VertiLeach in the 3rd spot. ;) -- ABC

=) I can wait. But for the record, what I find hardest to see is a sub #1 rank among the minibots. Maybe you can run some simulations on that game? -- PEZ

OK, I'll do it tonight. -- Albert

Here are the results of my tests using the normal rule: http://robocode.aclsi.pt/ranking/Ranking_pass1.html

If you replace the 1 in that url with numbers from 1 to 20 you'll see the evolution of each pass through the allpairings file. -- ABC

I tested the minirumble with my simulator: with the current result set (VertiLeach has fewer battles than all the others) and no 70/30 rule, it places 2nd after a lot of passes. With the 70/30 rule it gets to 1st place after a while, with GlowBlowAPM slowly dropping to 6th place and FloodMini holding on to a close 2nd. It seems GlowBlowAPM is a typical bot that suffers from the 70/30 rule, maybe because it is a "generalist" bot? If so, the change to 70/30 has the intended result. I'm not convinced that this is a "good" change, though, even if Tron benefits from it... -- ABC

I have a question. Are a bot's results continually re-evaluated relative to its current position and all other bots' current positions? By that I mean: as the new Jekyl 0.55 (hopefully) rises through the ranks, are its current results re-evaluated to say "that was good enough when you were at #30, but it's no longer a very good result considering you are now #20"? What about when a result against a bot that was #30 was OK, but now that bot has fallen to #50 and it no longer is? -- jim

Nope, the current setup is similar to a rolling average: your results are evaluated after each fight, influence your rating by some factor (the famous constant ;)), and are forgotten. -- ABC

Thanks ABC! -- jim

ABC, can you publish those results too please? If you have the time and kindness. Since the file you have contains quite few VertiLeach matches, maybe you could test with this file: http:/robocode/uploads/pez/allpairings_test_minirumble.zip It's the result of 400+ RR@H battles with VertiLeach as the focused opponent. Verti still has fewer battles than the rest, but 400+ should be enough, shouldn't it? -- PEZ

Here you go: http://robocode.aclsi.pt/ranking/miniranking7030.html
I'll try it with your test file. -- ABC


About re-evaluating results, I have to contradict ABC :-( Results are added to the previous ones (against the same enemy) using a kind of rolling average. That is, a %score is maintained against every enemy. But every time a bot fights a battle, ALL its results (stored as an "average" %score) are re-evaluated against the current ratings of ALL the enemies it has ever fought, and a momentum (up or down) is calculated to move the bot up or down. Take a look at the Details page of any bot to see how it works. -- Albert
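
A sketch of that scheme as just described; all the names are invented, and the server's source is the authority on the real calculation:

    import java.util.*;

    class MomentumRating {
        Map<String, Double> avgShare = new HashMap<String, Double>(); // rolling %score per enemy

        // fold a new battle into the rolling average for that enemy
        void recordBattle(String enemy, double share, double weight) {
            double old = avgShare.getOrDefault(enemy, share);
            avgShare.put(enemy, old + weight * (share - old));
        }

        // momentum: sum over ALL enemies ever fought of (stored %score - expected %score),
        // re-evaluated against everyone's current ratings
        double momentum(Map<String, Double> ratings, double myRating) {
            double sum = 0;
            for (Map.Entry<String, Double> e : avgShare.entrySet()) {
                double d = myRating - ratings.getOrDefault(e.getKey(), 1600.0);
                double expected = 100.0 / (1.0 + Math.pow(10.0, -d / 400.0));
                sum += e.getValue() - expected;
            }
            return sum;
        }
    }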

BTW, I ran some simulations and there is a single solution (well... it can be shifted, but the difference between ratings is constant) that makes ALL the momentums 0. So we can be happy now about the robustness of the ratings (changes occur just because of convergence speed and variance in the results uploaded). -- Albert

Well, I'm glad I was wrong then. :) I got the wrong impression when I read the server's source code. That should make it much more stable. -- ABC

PEZ, your results using the normal rules: http://robocode.aclsi.pt/ranking/testranking_normal.html

Thanks dude! Nice rankings considering they are with the old school rules. =) -- PEZ
-- PEZ

Changed: 435c213
Would anyone object if I set my client to run battles with Jekyl as a focused competitor? I can't see that it should skew the ratings any; it would just speed up the process of the new Jekyl getting its proper ranking. -- PEZ
The ELO estimates do more than just give rankings for pairings that haven't happened; comparing the estimates to the real results is how we see whether a bot is a ProblemBot or not. If we just use your new system, we won't have the ProblemBot ratings, which are very useful. -- Tango

Changed: 437c215
I certainly would not =^> But I am willing to wait. The suspense is half the fun. -- jim
I'm pretty sure you can produce ProblemBot ratings without the voodoo too. But since it's mainly a tool for helping us spot where we might have room for improvement in our bots, let's keep the ProblemBot ratings as they are. No need to base the rankings on the same voodoo. -- PEZ

Changed: 439c217
But painful. Like a BlestPain. =) -- PEZ
To have the current ProblemBot ratings, you need to have the current rankings, because that's what they are based on. You don't have to display them, but if you have them, you may as well. -- Tango

Changed: 441c219,223
PEZ, the same test with "VertiLeach to the throne" rules :), 60/40 this time: http://robocode.aclsi.pt/ranking/ranking_test6040.html, I'm ok with some focused Jekyl fights... -- ABC
But there's no real need to use the current rankings for the PBI. A non-voodoo way would be to just calculate a bot's PBI by a simple difference:

expected = 50 + myStrength - opponentStrength
PBI = real - expected

This gives the following PBIs for a selection of VertiLeach's opponents:

Changed: 443c225,234
Thanks. That's a really nice ranking. I think I will print it on an A2 sheet and put it on the wall. =) -- PEZ
|Opponent |Strength |Expected |Real |PBI |ELO-PBI |Difference
|DT |72,90 |44,81 |29,70 |-15,11 |-13,80 |-1,31
|Tron |63,45 |54,26 |47,70 |-6,56 |-7,00 |0,44
|LostLion? |33,61 |84,10 |68,70 |-15,40 |-13,00 |-2,40
|Nibbler |61,64 |56,07 |56,90 |0,83 |0,00 |0,83
|FloodMini |67,22 |50,49 |57,40 |6,91 |6,80 |0,11
|Tityus |65,77 |51,94 |49,90 |-2,04 |-2,30 |0,26
|Griffon |67,54 |50,17 |59,40 |9,23 |9,10 |0,13


Not exactly the same PBI, but still just as useful. Maybe even more useful since it's easier to understand.
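
A worked check of the DT row, using VertiLeach's own strength of 67.71% (the figure PEZ quotes further down):

    class PbiExample {
        // PEZ's two lines: expected = 50 + myStrength - opponentStrength; PBI = real - expected
        static double pbi(double myStrength, double oppStrength, double realShare) {
            double expected = 50 + myStrength - oppStrength;
            return realShare - expected;
        }

        public static void main(String[] args) {
            // DT row: expected = 50 + 67.71 - 72.90 = 44.81; PBI = 29.70 - 44.81 = -15.11
            System.out.println(pbi(67.71, 72.90, 29.70));
        }
    }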


Changed: 445c236
Ha! I think I'll print it on an A2 sheet and BURN IT! ;-) -- Kawigi
-- PEZ

Changed: 447c238
I calculated the "solution" for the mini ratings (using the unlimited rating system) including only the top 20 bots, and VertiLeach gets the first spot. It means that GlowBlow and Flood are 1st and 2nd just because they perform better against the bots below the top 20. -- Albert
For me, what's more important than understanding the system is continuity. The RR@H's ELO-based system is the closest thing left to the Eternal Rumble. I have spent way too much time getting this *close* to the #1 position to want to willingly change the scoring system. If I ever manage to become #1 I want there to be no lingering suspicion that it's tainted in any way. What you propose, PEZ, is a different view into the same data. Code it up, put a link there and see who uses it. Darwin will decide if it is better or not. That's one of the strengths of the RR@H as it is now. As resistant as I am to the idea, maybe I will like it better too. I do not know. But if you are telling me it is an either/or situation then I am for the status quo. -- jim

Changed: 449c240
Maybe it also means that VertiLeach's movement depends on having a good movement to "leech" from? That can be seen as a design flaw. ;) -- ABC
I don't think there's a point in keeping the ELO-based rankings. It's just confusing with two such similar tables. We can keep the figures there a while. RR@H is so far away from the ER anyway that keeping the current rankings doesn't bring it closer. -- PEZ

Changed: 451c242
Indeed. Though version 0.2.1 tries to fix that somewhat. Still, Verti manages to win against most bots even if their movement sucks. What it finds hardest is multi-mode bots like Ares and PrairieWolf (and, as a consequence of its design, bots with excellent guns like DT and Tron). VertiLeach aside, wouldn't you agree that a rating system that selects a bot that beats all other bots as #1 is what we should opt for? -- PEZ
Wow, for Nanos the ranking is exactly the same! -- Albert

Changed: 453c244
Yes, I agree. But I also agree that the best bot is the one that performs best in all situations. DT beats all bots and is #1. VertiLeach's case is somewhat special; it seems that it exploits certain weaknesses of the top bots (a bit like I did with DT 1.91 and mirror movement) but does not perform as well as the others at killing the rest of the pack. That also means it has an increased probability (even if a small one) of losing a battle against an "average" bot. So, in the long run, is it really a better bot than FloodMini? It's a tough question... -- ABC
Yup, and for minis as well (maybe I said that already?). I guess it shows that the ELO-thingy works. At least when you are doing the estimate from the full population. =) -- PEZ

Changed: 455c246,248
Of course DT is #1. And as you have seen, it's #1 any way we slice things. Now follow along... VertiLeach doesn't lose against any minibot. Not to FloodMini, not to any minibot. Exploits some weakness of the top bots? Not of DT, not of Tron. -- PEZ
For me, ELO-style gives much more information. Like problem bots. Like I can see ratings skewed by bots entered with pre-learned enemy info, and what it takes to learn their 'real' strength (I'd say some 1500-2000 rounds). Btw, I see no point in doing that - one can't gain a true bot rating quickly, you just see the bot going up then steadily down.

PEZ, above you gave this example table with 'Strength' in it as the base for calculations. Where does it (strength) come from? -- Frakir

Changed: 457c250
The RR@H results say that it so far is really close to FloodMini and (at least with version 0.2, since it's seen more battles) it loses to one or two minibots and one or two microbots. Although I don't doubt that under certain conditions, it can beat about anything.
I feel like a real DonQuijote here. =) I can't see where the ELO figure says more than the strength figure. As the example above shows, the ProblemBot index can be calculated just as easily from the "real strength". In the above ProblemBot calculation, "strength" is the average %score a bot has collected in all its pairings. It's what the "real strength" ranking is based on, i.e. the "score" column in the ranking table (http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=roborumble&table=2). For VertiLeach this score was 67,71% at the time of the calculation above. -- PEZ

Changed: 459c252,253
Well, my opinion is still that a bot that can score higher against other bots, even if it loses to VertiLeach, should occupy the #1 position. Just because VertiLeach wins against all minis shouldn't make it the best; consider two bots, one that beats everybody 55% to 45%, the other that loses to that bot and beats everybody else 80% to 20%. I would consider the second bot MUCH more deserving of the top position. I'm not a big fan of this rating cap; just being able to beat everyone denies so many concepts that provide high-scoring points against lower-level bots. Verti's current movement is such that even if it can win, it has a lot of trouble stomping on bots that others like FloodMini can tear to pieces. As much as I admire VertiLeach's movement, I still feel that FloodMini is the top bot. -- Vuen
Well, maybe since I am an addicted chess player I don't find ELO to be voodoo :)
(OT) Just noticed something peculiar: have a look at http://rumble.robowiki.dyndns.org/servlet/RatingDetails?game=roborumble&name=Noran.CornersReborn_1.0 That is one crazy bot... It kills Ender (problem index 89.3!), wins against Wolverine, and loses to almost all others with some REALLY bad scores. -- Frakir

Changed: 461c255
That's the first time someone has explained the rationale behind the current ranking system in a way I can almost agree with. But I still think wins should count a little more than they currently do. It would make the rankings more immediately intuitive. -- PEZ
Ender and Wolverine have bugs in them, so they crash against some bots and get really low scores. -- Tango

Changed: 463c257
The other problem with just using wins is that over one or two 40-round battles, winning 55%/45% may not be a true victory at all, but a little luck in one way or another. Likewise losing 45%/55%. -- Kawigi
Maybe it's not voodoo to you, but it's a bit unnecessary to massage the results like that just to end up with the same ranking, innit? I think I have seen a mention of CornersReborn? elsewhere. In fact, on my machine it doesn't even run. I think something is wrong with it. -- PEZ

Changed: 465c259
But what if instead of 55%/45% it's 70%/30% (or whatever you need to feel comfortable that it's not just luck)? Which bot would be better? I think the first one. -- Albert
Time for a vote. (BTW, have you thought about how you would do Melee without ELO? :)) -- Paul Evans

Changed: 467c261,272
To be clear, I think that in Vuen's contrived example bot 1 should be ranked #1. "So, you beat everyone else and so do I; let's decide this in combat, just you and I." When it comes to luck or no luck: when you meet 180+ enemies, luck and bad luck tend to even out, and even if not, they will once a bot has been in the league for a couple of weeks. And, as Albert says, no one (but me) has been proposing 55/45. Let's use 70/30; it should both iron out luck and still leave some room for proving your strength against bots with a low ranking. -- PEZ
|Name |ELO Only|%Score|ELO & Premier|%Score & Premier|All three |Notes
|Paul Evans |X | |x | | |Uppercase X = prefer, lowercase x = Can Live With
|Sparafucil3 |X | |x | | |Ditto
|Vuen |X |x |x |x |x |I suggest all 3, but mainly sorted by ELO
|Albert |x |x |X | | |I would prefer to have Premier and one of the other two
|PEZ | | | | |X | and rank by the two non-voodoo methods
|Tango | | | | |X | and rank by whatever viewer chooses
|ABC |x | |X | | |ELO main ranking, PL "just for fun" :)
|Kawigi |x | |X | |x |ditto on ABC. Vuen's idea to let the user click on a heading to sort it differently is also a good idea, but might not lend itself to using wiki pages for the results (easy enough on the dynamic results)
|Alcatraz | | | | |X |I like rankings. Lots of them. I think there should be tons of ways of measuring bots. Like, more than three.
|SSO? | | | | |X | good to see different performances
I see... I understand now what you mean, PEZ. I still trust the ELO rankings more, however. A statistical estimate will certainly not deviate from the 'real relative strength' when there are full pairings; that would run counter to the whole concept of statistics. Now that I see the %Score column, however, I do not mind it, and if you choose to go this way that's fine by me; it would still be nice to keep the ELO rankings anyway. The ELO ranking will still be a better ranking until all bots have full pairings, and since people are constantly changing versions or adding new bots, it provides a better ranking while they attempt to achieve full pairings. Bots rarely keep full pairings anyway; look at the premier league. The top few spots oscillate like crazy every time a bot is swapped in or out. Anyway, my suggestion is to have just one ranking table that has ELO, %Score, and Premier League on it, and we can just click the table headings to sort it how we like. -- Vuen

Changed: 469c274
...But RoboCode has never been the kind of game where a placing is decided in a single match; robots next to each other in the placings don't fight each other to see who is ranked first. Scoring in RoboCode is not like bowling, for example, where your score is completely independent of who you're playing against. Bots will lose to certain bots, but win against others that the bots they lose to can't beat. In other words, it's not the type of game where 'if I can beat you, then I can beat everyone you can beat'. Just because two bots beat everyone else doesn't mean they are equal up to that point, and their ranking should not be decided by the one-on-one matchup of the two. The amount by which they beat everyone else should matter much more than how they do against each other. In my example up there I still feel that the 2nd bot should by far be ranked #1. -- Vuen
Paul, no one has suggested we do Melee without ELO. =) I prefer to solve new problems as they arrive, not before. Vuen, your suggestion is excellent, especially since it's identical to my original suggestion up there. =) (Even if I think it's a bit silly to have two sorts producing identical results.) I don't think bots would oscillate much more with the %score than with ELO, even when one or more pairings are not run. Where the score is tight, bots oscillate already as it is. And when bots don't have all pairings they don't tend to have a very correct ELO rating either. It would not be like the PL ranks of course, which are much more real-world. -- PEZ

Changed: 471c276
*Pats Vuen on the back* PEZ claims to have tested against quite a large range of minis (and I assume he would have tested against the top micros too, but it appears F-Micro and Smoke give him fits). But it's true. When I finished FloodMini 1.3, the reason I predicted he was top-5 material was not because he could beat the #5 at the time (who, come to think of it, may have been Marshmallow). It was because I'd run about 200 or 250 rounds with the entire top 40 and it only lost to 5 of them. And, I suspect, I would find that he wouldn't lose to a lot more among all bots, as I didn't tweak FloodMini to beat anyone in particular. That version was just a general improvement I thought up and implemented. FloodMini is an all-around strong bot, and if you look at specialization indices, it's one of the most generally dependable bots there is. I don't admire VertiLeach's movement as much; it looks curiously like mine. -- Kawigi
I've changed the "premier and any other method" heading to "all three" because it makes more sense. -- Tango

Changed: 473c278
In all my testing Smoke and FhqwhgadsMicro get beaten by VertiLeach. Not by much, but some 55%+ of the score. Has your testing reached a different result? It sounds like you agree that a bot beating more bots than any other bot is intuitively a #1 candidate? Verti's movement is nothing to be impressed by. It's static distancing if there ever was one. What Verti forces is a battle of the guns, and it has a specialized gun for the fighting distance it has chosen. So it is kind of DynamicDistancing too. =) By not allowing FloodMini to choose the distance it gains an edge against FloodMini that few bots can say they have. -- PEZ
OK, since I think your vote, with your changes to the options, is about spot on to my prefs, I changed my vote to reflect that. -- PEZ

Changed: 475c280
Now, I seem to have messed up the wiki published ranking tables badly... Hope I can fix it today. -- PEZ
From what I understand about the ELO ranking, the %score method is exactly the same with a "linear" expected score. We already agreed that a 1% score difference between two closely ranked bots should be more significant than a 1% difference between a top bot and a low ranked one. That is, imho, the big advantage of the ELO system over the %score (even if the results are very close). There is no magic/voodoo involved, Paul just adjusted the relation between ranking and score difference (the famous S-curve) to better reflect the real world, resulting in a system where, even with few pairings, you can better predict your final ranking based on partial results. You can do that with the %score system too, it's just that the ELO one should work slightly better/faster. About the PL, I like it too, Tron goes up 10 places... :) -- ABC

Changed: 477c282
Well, you're probably right. You guys know way more about this ranking stuff than I do. Whatever you guys agree on is fine by me. -On a lighter note, I'm finally one of SandboxDT's problem bots! :D -- Vuen
Agreed, it's just that the prediction seems unnecessary when we have the answer already. Besides, the discriminating feature of the ELO seems to make little, or no, difference in the end. I think the PL is here to stay anyway, it's such a straight-forward ranking. Tron is such a great bot, I think the PL better reflects its quality than the %Score systems (ELO included). -- PEZ

Changed: 479c284
This page was getting very long so i have moved some of the older stuff to /OldRankingChat? -- Tango
Someone added a note to mine, that i never said... i will assume it was a mistake. I think all the rankings should be shown on one page, with both the %/rating/score *and* the position for each shown. It can then be sorted by whatever method the viewer wants. That way you can easily tell if a bot is doing better in one system than another. -- Tango

Changed: 481c286,294
Good initiative. If someone has the time it would also be good if redundant information on these and other pages got cleaned away. -- PEZ
It was me. The table looked funny after your edit and I thought I fixed it. Didn't mean to make you say stuff. =) Everybody, look over the table and the options we now have and make sure your votes are where you want them. Here's a conversation to help you choose:
* Friend of mine (F): What does that rating figure mean?
* Me: It's some statistical calculation based on the %score my bot has got against its opponents. ... plus 1600.
* F: Plus 1600?
* Me: Yeah, plus 1600.
* F: So, if I subtract 1600 from your rating I get 223 something... What does that say?
* Me: I have no idea. It's something about an S-curve. ... But look at this table instead; it lists the average %score and says my bot averages 67.78% of the score against all other bots.
* F: Yeah, and those bots also get about 67% ... But look at this one! (pointing at DT) It gets almost 3% more than any competitor!
* Me: Righto! It's the bot we've all been chasing for more than a year now, it almost seems impossible to build a stronger bot.

Changed: 483c296
Question about the rating / rankings. I thought all battles between minis would count in the minirumble. But comparing Griffon's details for the minirumble and roborumble games this is clearly not so. Should it? If not, how is it supposed to work? -- PEZ
-- PEZ

Changed: 485c298
If you are running the client with RUNONLY = MINIBOTS, the battles will only count for MINI/MICRO/NANO ratings. Also if for some reason a client was unable to check the codesize of the bot, it will upload the battles only to the general ranking. -- Albert
I'd say that was more your inability to explain it well, rather than it being overly complex. Point your friend at the explanation on Paul's (or whoever it was's) website. -- Tango

Changed: 487c300
Wow! Griffon is now second and beats SandboxDT (by a very small margin). -- Albert
My inability to explain it well stems from my inability to understand it well. I have those web pages printed and gave them to my friend. He's a math head so he understood it. But there's no way I can ever look at rating 1856.7 and say it tells me anything at all. My friend, after reading the papers, asked the, to me obvious, question: "But why do you use an estimate when you have the real thing?". I think it's like not using the % of votes in a political election, but instead using some statistical calculation that tries to estimate the outcome. There you have a recipe for major protests from your citizens. =) -- PEZ

Changed: 489c302
And VertiLeach seems to have tired of waiting for the rating rules to change. =) #1 in the minibot game! I'm pretty pleased to say the least. Griffon's success shows how extremely important a good movement is. -- PEZ
In an 'ELO' type of rating system you assume (quite sensibly) a normal distribution of participants, then you force it into ratings. So when you tell me 'Avg rating is set to 1600' I can tell you that a bot rated 1856.7 is supposed to get 73% of the points versus a 1600 bot. The next nice thing about ELO is normalization - that means a 1300 bot should get the same score against a 1200 as a 2000 against a 1900. In other systems I can not reliably predict match outcomes. And contrary to what you posted above, usually very few matches are enough to get a good estimate of how a new bot rates in the pool. -- Frakir
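For the curious, here is a minimal Java sketch of the S-curve Frakir is describing. The constants (base 20, divisor 800) are an assumption on my part - they reproduce his numbers, but check the server source for the real ones:

    public class EloCurve {
        // Expected share of the score for a bot rated `rating`
        // against one rated `opponent`, using the assumed S-curve.
        static double expectedScore(double rating, double opponent) {
            return 1.0 / (1.0 + Math.pow(20.0, (opponent - rating) / 800.0));
        }

        public static void main(String[] args) {
            // A 1856.7 bot versus the 1600 average: ~0.73, i.e. Frakir's 73%.
            System.out.println(expectedScore(1856.7, 1600.0));
            // Normalization: 1300 vs 1200 and 2000 vs 1900 both give ~0.59.
            System.out.println(expectedScore(1300.0, 1200.0));
            System.out.println(expectedScore(2000.0, 1900.0));
        }
    }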

Changed: 491c304
My question for Paul is how does SandboxDT see the movement of this bot? Paul I know you do not take requests too often for this, but if you are in the mood I would be curious to know what DT thinks. I spent a lot of time on this and would be curious to know if I am on the right track. -- jim
The ELO ratings on their own aren't meant to tell you anything, it is how the ratings compare to the ratings of other bots that matters. I know that when DT is 50 points ahead of its nearest rival, it is doing very well, and i know that when my bot is 200 points below 2nd to last place, i really need to do something about it. The actual rating is irrelevant, that's why it doesn't make any difference what you set the average to. -- Tango

Changed: 493c306
In my 1000 round test DT beat Griffon 55% to 45% (by total score), i think it might be time for Paul to release another update. Well done jim and PEZ! :) -- Tango
Maybe I've missed all the arguments for and against, but it seems to me that you are all trying to get a ranking system that is stable - i.e. a bot ranked 4th will beat a bot ranked 50th every time. But due to the nature of robocode, matches are inherently unstable due to randomness in bots and starting positions. I think the aim of any ranking system isn't to show that one bot is better than another, but rather that one bot is better than another bot for that particular round of matches. We should embrace the randomness of robocode, not make it stable. Look at DT standing at the top after all these months, wouldn't it be nice if he was knocked off the top, if only as the result of a lucky match? The way I see it, luck should play a part. Wouldn't it be boring to watch football if you could predict the result of a game between Manchester United and Luton Town? But you can't, Luton could beat ManU? in a one-off match as a freak result. That's what makes the game fun. That's what should make any robocode competition fun. Bring back some kind of ladder or knock-out competition instead of your stable rankings! Anyway, rant over, I'm sure a lot (or all) of you disagree with me about this, so go ahead, tell me why I'm wrong. I won't listen anyway. :D

Changed: 495c308
Thanks! Though there's still a larger gap between #1 and #2 than it is between #2 and #20 something, so maybe Paul is not in that big hurry yet. -- PEZ
--wolfman

Changed: 497c310
Wow. You guys have made a real powerhouse this time; Griffon is pretty darn good. You guys are doing something right with these wiki bots, because just about every one of them is a ProblemBot for Fractal. I have some good ideas to add to Fractal, and a whole new type of gun manager similar to VirtualGuns and BestPSpace to build and write up on; unfortunately I won't be able to work on it for like 2 weeks because I'm behind in all my classes. Fractal is still only like 1/10th complete; I just haven't found the time to make it good. Stupid university... -- Vuen
I agree that one-off matches are fun, but they don't help you make a good bot. The aim of RR@H is to get stable rankings so it is easy to tell if your bot is any good. I think it would be great fun to have a league that only ran 1 round for each pairing each season, and judged the entire table on that. -- Tango

Changed: 499c312
I've just had a look at Griffon's movement - according to DT's stats, at Griffon's preferred fighting distance (which is further than DT's) DT has a hit rate of around 15.7% - by comparison DT will hit itself with a hit rate of 15% (and this is at a closer fighting distance). Overall the movement is only slightly worse than DT's at longer distances, and perhaps 1% to 2% at mid distances. However, to achieve these hit rates DT has to use its most segmented gun, which usually comes into play at around round 50! - the standard guess factor guns show no difference between DT's movement and Griffon's. Good movement, well done :). -- Paul Evans
What I ask is what does 1823.7 tell that 67.78% doesn't? 67.78% predicts that this bot should beat a bot with strength 50% by 17.78%. It's just as reliable as the ELO figure (which I think you'll have to tell me how you arrived at). What the %Score based ranking provides is transparency. And, you _can't_ reliably predict the outcome of a particular pairing using ELO or any other system. If you could, the PBI column on the Details page would not be what it is. My observation that the ELO type ranking produces some instability while a bot collects all its pairings is just that, an observation. But that's not to say that the ELO based system we use in RR@H isn't good at predicting a bot's ranking. I am one of the first in the line of people who are amazed at how reliably it can do this. What I am saying is that we don't need the predictive qualities of ELO in RR@H. In other leagues it's needed (if you want stable rankings) but not here. It takes a bot less than a day to collect enough pairings to get a stable ranking using measured %Score strength. And that's with the few clients running that we have today (which I think does not exceed 5). Once we really push GO with RR@H we might have 100 clients running and then it will take less than an hour.

Added: 500a314
Wolfman, RR@H is about producing a stable ranking (read the project goals somewhere on the RR@H part of the wiki). I fully agree with you that other leagues, providing enjoyable combat, are needed too. In particular I miss the face2face competition. But I think we could use the RR@H framework for running that kind of competition too. Either by making the clients switch modes or by making the server filter the battles uploaded into different bins. Feel welcome to look at this. The source is always included in the RR@H zip packages.

Changed: 502c316
Way cool. Thanks for the feedback. Your segmentations still keep some secrets. RoboGrapherBot (soon to be released) doesn't see where DT's movement is better than Griffon's. I wish you would release a grapher for DT. =) -- PEZ
-- PEZ

Changed: 504c318,319
Thanks very much Paul. This is very exciting news for me! I thought there was a chance that it was very solid. Now I think the difference is the gun. And for that, I think that more than 1500 bytes will be required to close the remaining 50 points or so. Getting closer. -- jim
Ok, I now understand what you are missing with percentage.. :) While percentage works fine with bots of around equal strength, it stops working when differences are huge. Example: Bot A beats bot B 1000-90 (91.74% score); A also beats C 1000-60 (94.34%). What is the _predicted_ relative strength of B to C?
If you say B is better by 2.6% you will be way off! Elo predicts accurately here: B performed better than C by almost 110 rating points and should get a 60% score versus C. You are missing the whole normalization thing (the 'S'-curve). As a result bots will always 'lose' rating playing bots far below their rating and perform 'better' versus bots of similar or better strength -- Frakir
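Frakir's B-versus-C example can be reproduced by inverting the same assumed curve as in the sketch above (again, the constants are my assumption, not confirmed server code):

    public class EloInverse {
        // Expected %score for a rating edge of `diff` points.
        static double expected(double diff) {
            return 1.0 / (1.0 + Math.pow(20.0, -diff / 800.0));
        }
        // Inverse: the rating edge implied by an observed %score.
        static double ratingDiff(double score) {
            return 800.0 * Math.log(score / (1.0 - score)) / Math.log(20.0);
        }

        public static void main(String[] args) {
            double aOverB = ratingDiff(0.9174); // A beats B 1000-90 -> ~643 points
            double aOverC = ratingDiff(0.9434); // A beats C 1000-60 -> ~751 points
            double bOverC = aOverC - aOverB;    // ~108 points, the "almost 110"
            System.out.println(bOverC + " -> " + expected(bOverC)); // ~0.60 for B
        }
    }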

Changed: 506c321,322
Getting closer - yes, but don't forget a 45%-55% score represents a difference of about 50 rating points. On a separate note... a grapher for DT would, I think, take away the sense of achievement should you beat DT - I keep my segmentation secret so as not to spoil your enjoyment :) -- Paul Evans
P.S. This can bring the possibility of doctoring ratings by choosing higher rated opponents and playing selected matches on your RR@home machine... or seeding Sandbox to play the bottom of the pack, or whatever -- Frakir
* Are you saying by using the relative percentages thing to attempt to weight that bot's result more than others? Not sure if it would work the way it's currently set up, but you are right in the respect that it's more reliable where people don't think about it. -- Kawigi

Changed: 508c324
You could always label the secret segments as "secret 1" and such. And, I can assure you, Paul, that I would enjoy beating DT either way. =) -- PEZ
The point PEZ was trying to make is that ALL pairings between ALL bots are run. It's not possible to exclude bots from being played, because they are guaranteed to be played on other people's RR@H. It doesn't matter what the predicted relative score of B against C is, because the match B vs C would have been played. I still feel though that ELO is the way to go, but I can live with relative strength if you really want to change it PEZ : ). It's your server anyway, so it's up to you... -- Vuen

Changed: 510c326
Well, I have never been closer than 50 points before so I am going to live in the moment for a few minutes. And I too would enjoy bringing you down, if only for a nanosecond. Especially if I managed to catch a screen capture. I remain convinced though that the way to do it is to focus on beating other bots first. When I look at SandboxDT's results page, there is no bot in the left hand column that SandboxDT does not score at least 50% against, with the exception of Griffon, which I know it can beat by at least 55%-45% from my own testing. That's a remarkable achievement and the secret to closing the gap. First beat them, then beat SandboxDT. At which point you will release a new one and we will fall behind by another 10,000 points =^> -- jim
I hope we can have them all in one table, maybe PEZ will like how elo normalizes things when compared to percentages; I am almost fairly sure currently percentage order != elo order of bots. --Frakir

Changed: 512c328
How about you just make DT give hit rates for preferred distances (with a variety of guns) in the debug window at runtime? It would be good to compare... -- Tango
Agree with PEZ and Vuen. "ELO is better because it allows me to predict the expected outcome of A against B" is a weak argument, since you have the REAL outcome of A against B. It is OK for me to keep ELO, since it has a long tradition in Robocode, but right now I think it is just overkill (a complex system used when no longer needed) since ALL pairings are there (note that in "real life" ELO is used only in sports where not all pairings can be played. No sport where all pairings are played uses a system like this). -- Albert

Changed: 514c330
BlestPain strikes back! Now Griffon is #3. In Sweden we say "old is oldest". Dunno what the expression would be in English. -- PEZ
(Edit conflict * 2!!!) I wouldn't take this time to argue my point if I thought it was up to me. =) Here I have finally got someone to try to give me one possible advantage of the ELO versus raw %Score. Thanks Frakir. I really appreciate that you keep trying to point out just what makes ELO preferable. It was driving me a bit nuts to just have "don't agree" and "we should keep ELO" thrown at me. But the fact remains we do not need to predict the relative strength of bot B to C, we just wait a few hours and the answer will arrive. I'm not sure what ranking system you feel is weakest against manipulation. I think raw %score is more robust here. You can play DT vs VertiLeach all you want. The raw %score between these bots will just get more exact. Nothing at all will happen to the rest of the table. With the ELO type of rating I have no clue what would happen. Which is very much why I feel so strongly about getting rid of it. We saw at the start of RR@H that those kinds of manipulation attempts (focused pairings) disturbed the rankings, but I'm not sure that would be the case any longer. As Vuen points out we now have a client which enforces that all pairings will be fought. Albert has succeeded very well in one of his major design goals of making the system robust.

Changed: 516,517c332
I don't think "oldest" has the same positive connotation here that it must have in Sweden. :) -- nano
* Maybe not. But what about "age"? =) -- PEZ
Frakir, do you mean the current %score order is way off from the ELO order? And, if so, could it be that the ELO order (which we build dynamically) is just more recent than the %score one (which we build every 12th hour at the moment)? -- PEZ

Changed: 519,521c334
Wow... I can't believe Fractal jumped so many positions with so few changes. I haven't built any of the concepts that are supposed to bring out its real power yet; all I really did was tweak its movement distancing to make it more survivable and less passive (while more predictable), and remove its bin decay. Here is the result I'm most proud of:
pe.SandboxDT_2.11	46.6	2	8-10-2003:21:39	32.5	14.1

*grin* getting closer... -- Vuen
I have no idea... but if by some sheer chance some bot played more games versus low rated opponents than it statistically should, then it is 'percentage-wise' overrated, but its ELO is fine. A similar argument can be used against ELO (more games vs problem bots), but that would affect both the ELO and the percentage one. One more tiny point here: a percentage rating can possibly fluctuate more (I play against the last bot in the pack, I get 99.3%, my rating goes considerably up). I think we can have a table with both, at least for some time. -- Frakir

Changed: 523c336
Now NanoLaulectrik? is 3rd. The nano pattern matchers are having big fun, going up and down and exchanging positions among themselves. It's interesting to see. -- Albert
Your rating will only increase considerably if you haven't played that pairing before. I too think we can have both figures in the table, at least for a while. -- PEZ

Changed: 525c338
I think that now would be a good time to print [the minibot rankings] and decorate your walls. =) -- PEZ
What we'll lose with percentages is that nice normalization: Suppose SandboxXP? gets 99.9% while VertiLeach30? has 99.8%. Just a tiny bit off, almost no difference... In fact the strength difference is as huge as between 50% and 66.6%, and translates to the same 185 ELO points... (which means Sandbox is supposed to get 66.6% versus VL to justify that 0.1% difference) -- Frakir
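Frakir's top-end numbers check out against the same assumed curve as the sketches above:

    public class TopEndGap {
        // Rating edge implied by a given %score (same assumed constants).
        static double ratingDiff(double s) {
            return 800.0 * Math.log(s / (1.0 - s)) / Math.log(20.0);
        }
        public static void main(String[] args) {
            System.out.println(ratingDiff(0.999) - ratingDiff(0.998)); // ~185 points
            System.out.println(ratingDiff(2.0 / 3.0));                 // ~185 points too
        }
    }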

Changed: 527c340


I agree, normalisation is probably the biggest issue in favour of ELO. And as a consequence of that, problem bots. Your problem bot rating against the top and bottom bots will be different in ELO and %score (at least i think it will, i haven't actually checked). The ELO one will be more accurate/useful. -- Tango

Changed: 529c342
Um... After about two weeks with Fractal 0.32 in a stable position in the rumble, it just dropped almost 15 places over the course of a day and is still dropping. How come the sudden change in ranking? -- Vuen
My tests indicate that the difference is small and I can't see that it goes in any particular direction. But since the PBI is mainly a nifty extra feature it needn't be all that exact. If my bot underperforms significantly against a particular opponent, both ranking systems will show it. When DT and Verti reach the levels of 99.8% strength we could maybe discuss whether ELO would show the 0.1% differences better. =) -- PEZ

Removed: 531d343
Could someone without any data files have just started running battles? Does Fractal take a long time to learn? Does Fractal have a reference mode? -- Tango

Changed: 533c345
I checked the detail records and there is nothing strange there (no low scores related to some client, no 0-survival records, no big failures against unexpected bots). The rating for Fractal evolved from 1693 (12/10) to 1691 (19/10) to 1676 (now). That makes a change of 17. I just made a quick check with two bots: MicroAspid (its rating changed down 13 points) and PrairieWolf (its rating changed up 11 points and then down again 11 points). It seems the oscillation is within the normal variance parameters. Note that the rating system behaves like a rubber band, and the rating for a bot with a fixed performance oscillates depending on the rating of the other bots in a kind of dynamic equilibrium. Oscillations should reduce if we implement the 70/30 rule, or if we raise the alpha constant (currently 0.7 - but raising it would have some undesired results to consider). -- Albert
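To make the "rubber band" picture concrete, here is a toy relaxation loop - emphatically NOT Albert's servlet code. Each pass nudges every rating toward the value that would explain the bot's observed %scores, so one bot's correction tugs on everybody else until the pool settles into a dynamic equilibrium. The 0.7 mirrors the alpha he mentions; the curve and scaling are guesses for illustration only:

    public class RubberBand {
        static void relax(double[] rating, double[][] score, double alpha) {
            int n = rating.length;
            for (int a = 0; a < n; a++) {
                double err = 0;
                for (int b = 0; b < n; b++) {
                    if (a == b) continue;
                    double expected = 1.0 / (1.0 + Math.pow(20.0, (rating[b] - rating[a]) / 800.0));
                    err += score[a][b] - expected; // positive if A beats the prediction
                }
                rating[a] += alpha * 100.0 * err / (n - 1); // arbitrary toy scaling
            }
        }
        public static void main(String[] args) {
            double[] r = { 1600, 1600 };
            double[][] s = { { 0, 0.65 }, { 0.35, 0 } }; // bot 0 takes 65% off bot 1
            for (int i = 0; i < 200; i++) relax(r, s, 0.7);
            System.out.println(r[0] + " / " + r[1]); // drifts apart, settles ~165 points apart
        }
    }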
Why not put both side by side and see what the actual PBIs are for the current data? Your tests are likely not accurate enough to notice the problems. If you draw a curve of the ELO rankings for the RR@H it is very linear except for the top and bottom 3 or 4 bots. If your tests didn't include the very top and very bottom, you would not have seen the problem. (NB I haven't actually drawn such a curve for some time, so the rankings may have changed, i don't know) -- Tango

Changed: 535c347,349
Eek, edit conflict. Fractal doesn't save data; the only thing that might affect it is if its gl.txt file is on and it is trying to use RobocodeGLV014, but this would make it crash and the servlets don't accept scores of 0 so it shouldn't affect its ranking. I think the new servlets may be the problem; I hadn't actually realized that a new set of servlets had been installed until after I posted the above, because I noticed the 60%/40% wins/losses weren't being coloured on Fractal's details sheet, so I went looking in RoboRumble/ServerDevelopment to see what was up. Were there any rules changes in the latest update? If so, then that would explain it; it would then of course be just my dumb luck that any change would negatively affect Fractal's performance... After writing this I read Albert's post above. This makes sense now. Thanks :) -- Vuen
Aw, crappy; I was just about to put a comparison between %score (table=2) and ELO, and the premier league current rankings page just died.
http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=roborumble
There's only 9 bots in it! Was there a problem in the page generation? -- Vuen

Changed: 537c351
Here is something strange: I went to check Tron's result against BlestPain, but it is no longer there? According to the results pages Tron 2.02 has only fought around 70 bots in 1000 battles. The same happens with BlestPain, DT, CigaretBH, etc. The only "correct" one I checked is Shadow 2.01. Was the ranking reset recently? I too have been noticing some strange oscillations lately... -- ABC
No, there's something other going on. Look at the /ReportedProblems page... But now the rankings are rebuilt again. -- PEZ

Changed: 539c353
I think it is produced by the new filter. It excludes bots that have not fought for the last 5 days, but also excludes pairs that have not been executed for 5 days. It could explain the oscillations too (even if they will always be there with more or less strength). I'll take a closer look. In any case, it should be corrected when the new system to update participants in the rankings is in place. -- Albert
Yes, something strange was going on earlier. I looked at ad.Neo's results a few hours ago and it had 0.6% against Noran.RandomT_0.1 (one battle then). -- Frakir

Changed: 541c355
That must be it, the dance continues, the more "specialised" bots are jumping all over the place as their problem bots enter/leave the details page... -- ABC
For the %score method to work correctly (I still trust ELO better), we are assuming that complete results for all pairings are fast to generate. That is true today, but we are still only scratching the surface of RR@H's potential. How about increasing the rounds per match to 100 (or even 500), that would be very cool. Also, I miss melee competition! Imho we are wasting time discussing small details of a time-proven method of bot ranking instead of moving on to bigger and better things... (much like when you have a good bot and tweak it endlessly instead of trying new theories ;)) -- ABC

Changed: 543c357
Whose bot is ad.Neo, and when did it start doing so well? It hasn't had many battles yet, but still... that's a very good 2nd place... -- Tango
Agree about moving forward. Disagree about increasing the number of rounds. 35 rounds are enough. This way it gives some advantage to "smart" bots (the ones which learn fast) over the "wise" bots (the ones that can learn a lot about the enemy, but take a long time to do it). I never understood why people say "my bot is better because it is able to beat the other one after 1000 rounds". If a bot beats another one over 35 rounds time and again, then it is clear to me that the first one is better; it doesn't matter what would happen over 500+ rounds. -- Albert

Removed: 545d358
Looks good to me too - over 1000 plus rounds it looks like it loses to DT 46.5%/53.5%, so the ranking appears to be correct - the bad news is that it is claimed to be a 'test robot' in the repository! It would be nice to see what the history is for this bot/robocoder. -- Paul Evans

Changed: 547c360
It also says it is derived from "many many bots". I could be wrong, but it looks to me like he did some good tweaks to some open sourced bots (probably Iiley's movement and Kawigi's gun). A very good bot anyway, would be very nice to hear some comments from the author. -- ABC
I partially disagree here. You may write your bot optimised for an 'unknown' opponent: e.g. with a stat gun there are interesting methods to select the shooting offset when there is very little data, or when there is not enough data to filter out random noise (not the same thing). Energy management can also be optimised for low gun hit rates (I'll post something about it soon) or for high hit rates (a trained gun), and the differences here can be really big. In fact my current test bot is somewhat 'optimised' for unknown opponents (leagues like RR@home) but also knows how to take some advantage of a trained gun... Anyway, those 2 things are really different and I value 1000+ round battles more even when I optimise things for short ones (because it is easier). -- Frakir

Changed: 549c362
Tron and BlestPain are the only two bots that beat it. I can't help but smile :). Tron has always been one of my favorite bots... -- Vuen
It takes us again to the ontological question of "what is the best bot?". And I think there is no answer to it, because there are as many answers as robocoders. Of course we could say "the best bot is the one that is able to cope with any criteria anyone proposes", but we can not implement all possible criteria (also, if we did, we would have to weight them, and we would be stuck again). So my proposal is to leave it as it is. I think it gives a picture good enough for everybody to decide which is the best bot. -- Albert

Changed: 551c364
Tron kicks ass! And it also happens to be one of the few top bots that fall into the trap of HypoLeach. =) -- PEZ
My idea of the best way to find out which is the best bot is an "everyone fights everybody else once over 1000 rounds without saved data" league. Sure, the current setup gives a pretty good picture of the relative strength of all bots, but there is still a significant "luck factor" involved, especially between close ranked opponents. After 1000 rounds there is a much smaller error margin, and both the short and long term learning have been used. -- ABC

Changed: 553c366
That's cool (beating ad.neo, not losing to Hypo ;)), I think Tron is a bit outside the "normal curve", it can give top10 bots a very good fight but loses to some unexpected low ranked ones... -- ABC
I think the luck factor is almost negligible. But a 1000 rounds rumble with no saved data would still be cool. It could be held as a one-off shoot-out now and then. -- PEZ

Removed: 555d367
Believe it or not, if we used a league ranking system based on wins/losses (like the ones used in soccer or basketball) instead of score difference, right now Neo would rule over DT. The former only loses against Tron, but DT loses against Neo, Jekyl and Griffon. I'm maliciously tempted to create this ranking system just to force Paul to release DT 2.2 :-) In any case, it points out the inherent difficulty of rating bots. -- Albert

Changed: 557c369
My prediction: The next version of SandboxDT will come with pre-loaded data for at least some bots. I know from my own personal testing that Jekyl, Griffon, and Neo will not beat DT if DT is given enough rounds to learn. The RoboRumble's distributed nature almost guarantees that SandboxDT will not get this chance in a timely manner. SandboxDT is still the king. I am starting to think it always will be -- jim
I like that! The system limit of 200kb saved data (and an RR pool of 200 bots) is another design consideration that makes your bot slightly weaker in short battles but stronger on average (fewer segments, or just saving partial data) -- Frakir

Removed: 559d370
I'd be surprised if DT came with preloaded data, unless it proves really, utterly necessary. It's one of the truly cool things about DT, Tron and some other top bots. That they can rank so high without the advantage of preloaded data, fighting bots that have spent lots of training against them. DT is so much king that it's hard to fathom. -- PEZ

Back to a yet unresolved topic. Maybe the new VertiLeach is not as strong as the previous version, but it still doesn't lose to a single minibot. Yet it ranks #3, and the #1 ranked bot loses against 6 bots, including VertiLeach. How about Premier League rules instead? 3 points for a win, 0 for a loss and 1 point for a tie. We could define a tie as being 50% +/- 1% or something. The current ranking system is great for leagues where not all bots can fight all others, but with the distributed power of RoboRumble@Home we don't have to give the bots an estimated ranking I think. I'm thinking that a win/loss should be determined by the accumulated % score share in each pairing. Now that the system both cleans out bots that are no longer participants and prioritises bots with few battles, the PL rules should work pretty well. -- PEZ

Completely agree. Why use theoretical estimates when we can have the real ranking? Also, the current infrastructure allows it. We should put up a servlet that executes periodically (ie. 12.00PM) and just uses the existing data to build the classification. Later, we could modify the client/servlet apps to prioritize unfought matches. -- Albert

Cool! Maybe the rating of a bot should be an index like "100 * (points / points_possible)". -- PEZ

In other words, percentage of possible score. Sounds good. -- Tango
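A minimal sketch of the scoring just described - 3 points for a win, 1 for a tie (50% +/- 1%), 0 for a loss, reported as a percentage of the points possible. The class and method names are mine, not anything from the servlets:

    public class PremierLeagueIndex {
        static int points(double scoreShare) {
            if (scoreShare > 0.51) return 3; // win
            if (scoreShare < 0.49) return 0; // loss
            return 1;                        // tie, per the 50% +/- 1% suggestion
        }
        // scoreShares[i] = this bot's share of the total score in pairing i
        static double index(double[] scoreShares) {
            int total = 0;
            for (double s : scoreShares) total += points(s);
            return 100.0 * total / (3.0 * scoreShares.length);
        }
        public static void main(String[] args) {
            // One win, one tie, one loss -> 4 of 9 possible points -> ~44.4
            System.out.println(index(new double[] { 0.729, 0.495, 0.31 }));
        }
    }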

I'm not a big fan of 'Premier League' rules. Neo only loses to Tron, while SandboxDT currently loses to Neo, Jekyl, and Griffon; these rules would have Neo sitting on top, while in my opinion SandboxDT is the better bot. I like the regular league the way it is, but it's a good idea to create a separate premier league scoresheet that uses the same data as the regular league. -- Vuen

I think that's the preloaded-data advantage. Griffon wouldn't beat DT if it hadn't been trained on DT before it was uploaded. What if we ban preloaded data? It could be enforced by the client wiping any data directory in the bot's jar file after downloading. -- PEZ

I don't like the idea of banning preloaded data. It reduces design options (and everyone has the possibility to preload data). It is like real life: you can go to a competition without any information, thinking you are good enough to win, or you can take time to analyze your opponents so you have a better chance to win. -- Albert

I certainly am one that has tried to explore the design paths of the preloaded data strategy. But nevertheless, DT is the best bot and it should somehow be identified as such by any league rules. We risk getting "fake" updates of the robots, where Paul maybe changes the movement slightly and preloads DT with data on the enemies that preload data on DT, and then the authors of these bots load their bots with data on this new DT, and it gets like a cat chasing its own tail. But, I agree that banning preloaded data constrains the design options a bit too much. What about setting the battles to be 100 rounds each? That would at least limit the benefits of preloaded data some. -- PEZ

I don't see why people don't just all have preloaded data. If Paul wants to prove he has a good bot without data then he can add another mode to the properties file that doesn't use preloaded data. It would undoubtedly improve DT's rating to have preloaded data. -- Tango

The thing is that with a preloaded data strategy the timing of your entry becomes important. You will need to keep training your bot and send up new versions whenever new bots or new versions of bots are entered. That's a bit pathetic I think. Paul doesn't need to prove that DT is king to me. I know it all too well. It keeps me awake at nights. -- PEZ

Preloaded data is useful for weight restricted bots that don't wish to include learning code and for bots that take many many rounds to learn. Preloaded data is often static and, because it is not the basis for learning, it is very small, allowing many hundreds of bots to be stored. DT's main problem is that the learning data is so large it can only hold data for some bots - what DT needs to do is convert that large statistical data to 'preloaded' style data just prior to deleting the stats to make more space. I have no problem with preloaded data - there are defences such as adaptive movement. I'm also happy with 35 rounds - it makes data saving an important element of a bot. Finally, I'm happy with the rating system - it means all battles are important, not just those against bots that may beat you. -- Paul Evans

I insist: the more you think about it, the more difficult it is to rate a bot. Anyway, I think that the problem here is that there are two different paradigms for rating the bots: one inherited from RoboLeague and the EternalRumble, fundamentally based on the score difference, and another one derived from "real world" sports, where only wins/losses are considered (regardless of how big the differences were). In order to make everyone happy, I plan to have both (the second one implemented using the /PremierLeague system). It is also more interesting to have both ... believe me. -- Albert

I think both is best. /PremierLeague rules will be easier to understand etc, and the current rules are traditional, and mean everyone has a chance to affect everyone else. With the current system, you don't have to beat DT to drag down its rating. In fact, a bot that checked the enemy's name, and if it is DT fights the best way it knows, and if it isn't DT just acts like SittingDuck (with a little change to make sure the score isn't 0, because 0's are ignored), would seriously damage DT's rating. Could be fun... -- Tango

My opinion about the ER-type rating systems is that you can better tell if a robot will stand the test of time. A good, stable robot should be invincible to weak bots and be able to compete with the top bots. In other words, one with low specialization should be the goal. With this rating system, we can project how well they may do against bots they're not fighting against, which is necessary in some leagues, because it's not feasible for them to fight all bots. But more importantly, you can project how good they will probably be against bots that haven't even been written yet. Some bots are just fundamentally good, not just taking advantage of temporary 'trends' in Robocode, but basing their strategy on good, sound principles of AI, Machine Learning, and so forth. That's the difference between robots like Yngwie who just work well and stuff and robots like HaikuTrogdor who just do a good job against Linear and Head-on aim, because that's all they usually have to worry about beating. At the level that bots fight at right now, it seems that if a bot can be trivially defeated somehow, it will be. -- Kawigi

I don't need to change my data saving - just change the delete criteria to keep data on the best opponents. I don't need to improve learning speed, and I only need to tune my movement to 3 or 4 bots - there is no challenge here. The existing rating rules are intuitive - all battles count - the better you do against each and every opponent, the better your rating. -- Paul Evans

I promise to make it a challenge for you. =) -- PEZ

The thing that any sports league in the world does not account for is a "quality win" - a performance where the last placed team does much better than expected. In the real world there is no way to account for this. In the Robocode world we have a method to say quantitatively that a bot performed better than expected, and this should influence the standing of the bot whether it won or not. We can say definitively that Bot A outperforming its expected result has exposed some weaknesses that need to be addressed in Bot B. We can say definitively that the king outperforms every other robot, in one on one competition against all other robots, better than any other robot under the same circumstances. What sports league in the world can say that? Taking the EPL example, if Arsenal (currently #1) beats Chelsea (currently #2) 3-2 and Arsenal beats Leicester (currently #20) 3-2, which is the more impressive result? They both amount to 3 points for Arsenal, but how does the table account for the unexpected performance of Leicester? The answer is that it can not. So the weakness exposed by the Leicester team does not get acknowledged in any way. A Robocode-like system would still award marks to both teams, but it would not award full marks to Arsenal as it did not perform up to expected results.

One last thing: if you go with a simple percentage of score, what's to stop me from calculating the score that I currently have, deciding that I have enough to win now, and going into a while (true) { ar.getX(); } loop? That would make 10,000 calls to a get method and set my bot's energy to 0, denying my opponent all bullet damage going forward and allowing them to only get survival and minimal kill bonus. If, in a 35 round match, I determine that I am up by 350 or so with 15 rounds to go, why risk it? My opponent will only get 150 points in survival bonuses and I doubt the rest of the bonuses will put them in a position to win. Under the current ER system, this would only be rewarded if the opponent was ranked above me, and I would somehow have to know that in a dynamically evolving system. In the PremierLeague it simply becomes a valid strategy. -- jim
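To make jim's loophole concrete, here is an illustrative Robocode sketch of the trick - calling getter after getter without execute() trips the engine's 10,000-call limit (the same SYSTEM message quoted further down this page) and zeroes the bot's energy. The coasting test is hypothetical; a real exploit would track accumulated score against rounds left:

    import robocode.AdvancedRobot;

    public class EarlyRetirement extends AdvancedRobot {
        public void run() {
            while (true) {
                if (haveEnoughPointsToCoast()) {
                    while (true) getX(); // never calls execute(): the bot gets disabled
                }
                // ... normal fighting would go here ...
                execute();
            }
        }
        // Hypothetical stand-in for "am I far enough ahead to coast?"
        private boolean haveEnoughPointsToCoast() {
            return false;
        }
    }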

That's why both tables are good. You have outlined all the benefits of the ER system, so we definitely shouldn't get rid of that, but there are also benefits in the premier system, even if it's just interesting, so we should have both. Simple. -- Tango

Yeah, if you want to keep the old school system for reference, let it be so. I don't see why it's interesting to project a bots future performance. The future will come and tell eventually anyway. I say "winner takes it all" =). -- PEZ

And now the bot with the wrong package has the /PremierLeague crown! That's pretty cool I think. And iiley's coming bot will be a definitive throne contender while also helping AdNeo keep its edge over DT. Now I think it is becoming more important to quickly see that new and updated bots get all their pairings.

And what about keeping /PremierLeague snapshots once a week and giving that servlet page a drop box where you can choose to view these snapshots? (When viewing a weekly snapshot there is no point ...) We wouldn't need the current code trying to give new bots their initial 500 rounds. As long as the clients try to make each bot fight all other bots an equal number of times, new and updated bots will get duly exercised. -- PEZ

Yikes. Check out this rating:

RATING DETAILS FOR sgs.DogManSPE? 1.1 IN GAME roborumble

Noran.CornersReborn_1.06.1	15-11-2003:4:19	1.5	-85.4

That is like, the biggest problem bot index ever. I'm curious ...

SYSTEM: You have made 10000 calls to getXX methods without calling execute()
SYSTEM: Robot disabled: Too many calls to getXX methods

So I think the reason is not RoboRumble, but the bots themselves. For Ender, it can happen that the errors only occur on certain clients where it has written lots of information.

-- Albert

Not good. Now there is a mini that can somewhat clearly beat VertiLeach in the RR@H. Tityus! What to do? -- PEZ

Well, just add to Tityus an "if (VertiLeach) don't shoot" statement :-) -- Albert

It would be like team orders in car racing. With the benefit that the bots don't have egos to match dudes like Kenny Bräck. =) In fact, in my tests Verti beats both Tityus and Fhqwhgads (the latter quite comfortably); I think Verti just needs some more battles in some pairings for this to show. -- PEZ

November 7 2003: http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=minirumble Sweeet! =) -- PEZ

Looking at the PremierLeague ranking table it's striking how similar the rankings are. Some bots are much stronger in one game than the other of course, but it's still quite similar. Same kings in the megabot and minibot games for instance. I find the PL ranking table much more interesting to read since I can so easily understand it while the ELO-based ranking is opaque to me. I think two issues have been mixed and confused in the "debate" we have had about the choice of ranking system.

  1. "Winner takes all" versus "relative strength"
  2. "Ease of understanding" versus "magic"
It has been a bit like we had to choose between 1 and 2 here, while it is actually possible to choose the best from each. The most important thing for me is "ease of understanding". I don't at all like having an opaque magic function decide the ranking when we have all pairings actually fought and the infrastructure makes a new bot get all its pairings in a jiffy. "Winner takes all" is cool for me, but I can see how "relative strength" measures something important too. And I think that is what most of you opposing the PL ranking feel strongest about.

What about we make a ranking where a bot is measured on its average share of the score it has in all its pairings? That's very easy to understand. And the resulting ranking table would be what the ELO-based ranking is trying to predict, if I have understood that much about it correctly. The ranking table could have all three figures in it;

  1. average %share
  2. wins/losses count
  3. the ELO-based rating estimate
The table would be sorted on "average %share" by default, but we could make it sortable by the other figures as well. Sorting it on the ELO-based estimate should produce a table very similar to the default sort, or the ELO magic is not doing what it should. If the tables are very similar then the ELO-based figure could be removed as redundant. If the tables are very different then the ELO figure is of little importance anyway and could be removed for that reason.

I think I can produce a script on the server that produces a current "real relative strength" rankings table. It will take me a good share of the time I would otherwise spend on making VertiLeach stronger though. So someone besides me should think that table is of interest before I go ahead and hack it together.

-- PEZ
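A rough sketch of what such a script might compute, straight from pairing results - the class and field names are mine, the numbers are toy values, and a real servlet would read the rumble's result files rather than a hand-built map:

    import java.util.*;

    public class StrengthTable {
        public static void main(String[] args) {
            // bot -> its %score share in each of its pairings (toy numbers)
            Map<String, double[]> pairings = new LinkedHashMap<>();
            pairings.put("SandboxDT", new double[] { 0.729, 0.68, 0.71 });
            pairings.put("VertiLeach", new double[] { 0.271, 0.55, 0.62 });
            for (Map.Entry<String, double[]> e : pairings.entrySet()) {
                double sum = 0;
                int wins = 0, losses = 0;
                for (double s : e.getValue()) {
                    sum += s;
                    if (s > 0.5) wins++; else losses++;
                }
                System.out.printf("%-12s avg %%share: %.1f  W/L: %d/%d%n",
                        e.getKey(), 100 * sum / e.getValue().length, wins, losses);
            }
        }
    }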

To me, the PremierLeague is OK as it is. I prefer "winner takes all" systems to "percentage score" systems. I think I said it before, but using this rating system would be like deciding the soccer league winner by using a formula like (goals scored / (goals scored + goals received)). The information is there anyway for anybody who wants to know it. Just divide the %wins column by the #matches one and you get it.

I agree the proposed rating system is clearer than the ELO one, but if we move to this new rating system, I think two conditions should be fulfilled: (a) We should remove the ELO rating system (it would create a lot of confusion to have 2 similar systems in place). (b) There should be a strong consensus, since the ELO rating system is the standard %score rating system, and the new one should become the standard also.

So if everybody agrees to replace the current ELO rating system with the new one, then it is OK for me.

-- Albert

I don't agree -- Paul Evans

For some reason that doesn't surprise me... -- Tango (BTW, I don't agree either, I see no problem having both)

I would like to see the continuance of the ELO based system too. You need look no farther than cx.micro.Smoke to see the difference in the two systems. As I type this, Smoke is #6 in the PL and #19 in the traditional ELO based system. -- jim

I don't agree either. I haven't a clue how the ELO ranking system works, but I don't really care; I just know that it's designed by lots and lots of people who are much smarter than me, and that it takes into account the amount by which you thrash a bot, while the PremierLeague doesn't. I don't think your #2 is an issue at all; relative strength should be entirely how it is decided. If people are curious about how the ranking works, just make the scoring piece of the servlet open source so that they can see how their bot is ranked. -- Vuen

Or better yet, modify the output of the details page to include the solved equation in a new column for the bot pairing in question. Then you will get to see the formula in action. -- jim

I think both Vuen and Jim misunderstood the proposition. I am not suggesting we scrap the ELO-based system for the current PL one (even though I wouldn't mind that either). What's proposed is that we use the ELO-based way of considering relative strengths for the rankings, but skip the obfuscation with the magic formula. The ELO-based system is a great system for estimating ratings when there's no chance all pairings can be run. But now, when we are running all pairings (over and over again), it borders on the ridiculous to continue with an estimate. I also suggest we build the table including all three scores to begin with, but that we will probably remove the ELO column once we see that it's about the same as the "real strength" one. -- PEZ

The servlets ARE open source. -- Albert

And I have read the sources. I have also tried hard to figure out the ELO-based ranking system. I don't understand it anyway. And I refuse to just lean back and trust that others are smarter than me. I know they are, but I would much rather have a ranking system that's transparent even for non-statisticians like me. Everywhere I look where these kinds of ranking systems are used (chess and tennis are two visible examples), it is a means to give all players a relative ranking without having to play all pairings. Something that is impossible in those games. But we (Albert) have solved that problem, and thus there's no need to obfuscate the rankings with voodoo. Even if it's damn cool voodoo. -- PEZ

OK, to give us a more complete picture from which to make a decision, I have hacked the server classes generating the PL results some. Now the server produces both types of PL rankings. The "real relative strength" one looks like this for the general category:

If you study this table and compare it to the ELO-based ranking you'll see that they are about as similar as I had thought they would be. The only real difference is that one contains a very easily understood score (DT collects 72.9% of the score in all the pairings it has participated in) while the other contains an arbitrary voodoo score (DT has 1892.65). Only where ratings are really close can you see a difference in ranking (like between BlestPain and VertiLeach). I'd much rather have the ranking decided by a score I can easily understand than by one that's opaque to me.

I'd say we only need two rankings:

  1. Real (measured) relative strength
  2. Winner takes all

If you're curious about how the other games look from a "real relative strength" perspective:

I haven't checked all tables, but the table for minis shows identical rankings to the ELO-based table. Identical meaning each and every bot gets exactly the same rank. Now tell me why we should obfuscate these rankings with statistical formulas.

-- PEZ

The ELO estimates are about more than just giving rankings for pairings that haven't happened; they are compared to the real results to see if a bot is a ProblemBot or not. If we just use your new system, we won't have the ProblemBot ratings, which are very useful. -- Tango

I'm pretty sure you can produce ProblemBot ratings without the voodoo too. But since it's mainly a tool for helping us spot where we might have room for improvement in our bots, let's keep the ProblemBot ratings as they are. No need to base the rankings on the same voodoo. -- PEZ

To have the current ProblemBot ratings, you need to have the current Rankings, because that's what they are based on. You don't have to display them, but if you have them, you may as well. -- Tango

But there's no real need to use the current Rankings for the PBI. A non-voodoo way would be to just calculate a bot's PBI as a simple difference:

expected = 50 + myStrength - opponentStrength
PBI = real - expected

This gives the following PBIs for a selection of VertiLeach's opponents:

  Opponent    Strength  Expected  Real    PBI     ELO-PBI  Difference
  DT          72.90     44.81     29.70   -15.11  -13.80   -1.31
  Tron        63.45     54.26     47.70    -6.56   -7.00    0.44
  LostLion?   33.61     84.10     68.70   -15.40  -13.00   -2.40
  Nibbler     61.64     56.07     56.90     0.83    0.00    0.83
  FloodMini   67.22     50.49     57.40     6.91    6.80    0.11
  Tityus      65.77     51.94     49.90    -2.04   -2.30    0.26
  Griffon     67.54     50.17     59.40     9.23    9.10    0.13

Not exactly the same PBI, but still just as useful. Maybe even more useful since it's easier to understand.

-- PEZ
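
In code, the two lines above amount to something like this sketch (the class and method names are made up; the sample figures are VertiLeach's from the table):

  public class SimplePbi {
      // Strengths are average %scores over all pairings; "real" is the %score
      // actually collected in this particular pairing.
      static double pbi(double real, double myStrength, double opponentStrength) {
          double expected = 50.0 + myStrength - opponentStrength;
          return real - expected;
      }

      public static void main(String[] args) {
          // VertiLeach (strength 67.71) vs DT (strength 72.90), real score 29.70
          System.out.printf("%.2f%n", pbi(29.70, 67.71, 72.90)); // prints -15.11
      }
  }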

For me, what's more important than understanding the system is continuity. The RR@H's ELO-based system is the closest thing left to the Eternal Rumble. I have spent way too much time getting this *close* to the #1 position to willingly change the scoring system. If I ever manage to become #1 I want no lingering suspicion that it's tainted in any way. What you propose, PEZ, is a different view into the same data. Code it up, put a link there and see who uses it. Darwin will decide if it is better or not. That's one of the strengths of the RR@H as it is now. As resistant as I am to the idea, maybe I will like it better too. I do not know. But if you are telling me it is an either/or situation, then I am for the status quo. -- jim

I don't think there's a point in keeping the ELO-based rankings. It's just confusing to have two such similar tables. We can keep the figures there for a while. RR@H is so far away from the ER anyway, keeping the current Rankings doesn't bring it closer. -- PEZ

Wow, for Nanos the ranking is exactly the same! -- Albert

Yup, and for minis as well (maybe I have said that?). I guess it shows that the ELO-thingy works. At least when you are doing the estimate from the full population. =) -- PEZ

For me ELO-style gives much more information. Like problem bots. Like being able to see ratings skewed by bots entered with pre-learned enemy info, and what it takes to learn their 'real' strength (I'd say some 1500-2000 rounds). Btw, I see no point in doing that - one can't learn a true bot rating quickly, you just see the bot going up and then steadily down.

PEZ, above you gave this example table with 'Strength' in it as the base for calculations. Where does it (strength) come from? -- Frakir

I feel like a real DonQuijote here. =) I can't see where the ELO figure says more than the strength figure. As the example above shows, the ProblemBot index can be calculated just as easily from the "real strength". In the above ProblemBot calculation, "strength" is the average %score a bot has collected in all its pairings. It's what the "real strength" ranking is based on, i.e. the "score" column in the ranking table (http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=roborumble&table=2). For VertiLeach this score was 67.71% at the time of the calculation above. -- PEZ

Well, maybe since I am an addicted chess player I don't find ELO to be voodoo :) (OT) Just noticed something peculiar: have a look at http://rumble.robowiki.dyndns.org/servlet/RatingDetails?game=roborumble&name=Noran.CornersReborn_1.0 That is one crazy bot... Kills Ender (problem index 89.3!), wins against Wolverine, loses to almost all others with some REALLY bad scores. -- Frakir

Ender and Wolverine have bugs in them that make them crash against some bots, so they get really low scores. -- Tango

Maybe it's not voodoo to you, but it's a bit unnecessary to massage the results like that just to end up with the same ranking, isn't it? I think I have seen a mention of CornersReborn? elsewhere. In fact, on my machine it doesn't even run. I think something is wrong with it. -- PEZ

Time for a vote (BTW, have you thought about how you would do Melee without ELO? :)) -- Paul Evans

Options: ELO only | %Score | ELO & Premier | %Score & Premier | All three

  Paul Evans    X x         Uppercase X = prefer, lowercase x = can live with
  Sparafucil3   X x         Ditto
  Vuen          X x x x x   I suggest all 3, but mainly sorted by ELO
  Albert        x x X       I would prefer to have Premier and one of the other two
  PEZ           X           and rank by the two non-voodoo methods
  Tango         X           and rank by whatever the viewer chooses
  ABC           x X         ELO main ranking, PL "just for fun" :)
  Kawigi        x X x       ditto on ABC. Vuen's idea to let the user click on a heading to sort it differently is also a good idea, but might not lend itself to using wiki pages for the results (easy enough on the dynamic results)
  Alcatraz      X           I like rankings. Lots of them. I think there should be tons of ways of measuring bots. Like, more than three.
  SSO?          X           good to see different performances

I see... I understand now what you mean, PEZ. I still however trust the ELO rankings more. A statistical estimate will certainly not deviate from the 'real relative strength' when there are full pairings; that would be counter-intuitive to the concept of statistics. Now that I see the %Score column, however, I do not mind it, and if you choose to go this way that's fine by me; it would still be nice to keep the ELO rankings anyway. The ELO rankings will still be a better ranking until all bots have full pairings, and since people are constantly changing versions or adding new bots, they provide a better ranking while bots attempt to achieve full pairings. Bots rarely keep full pairings anyway; look at the premier league. The top few spots oscillate like crazy every time a bot is swapped in or out. Anyway, my suggestion is to have just one ranking table that has ELO, %Score, and Premier League on it, and we can just click the table headings to sort it how we like. -- Vuen

Paul, no one has suggested we do Melee without ELO. =) I prefer to solve new problems as they arrive, not before. Vuen, your suggestion is excellent, especially since it's identical to my original suggestion up there. =) (Even if I think it's a bit silly to have two sorts producing identical results.) I don't think bots would oscillate much more by %score than by ELO, even when one or more pairings are not run. Where the score is tight, bots oscillate already as it is. And when bots don't have all pairings they don't tend to have a very correct ELO rating either. It would not be like the PL ranks of course, which are much more real-world. -- PEZ

I've changed the "premier and any other method" heading to "all three" because it makes more sense. -- Tango

OK, since I think your vote, with your changes to the options, is about spot on to my prefs, I changed my vote to reflect that. -- PEZ

From what I understand about the ELO ranking, the %score method is exactly the same thing with a "linear" expected score. We already agreed that a 1% score difference between two closely ranked bots should be more significant than a 1% difference between a top bot and a low-ranked one. That is, imho, the big advantage of the ELO system over %score (even if the results are very close). There is no magic/voodoo involved, Paul just adjusted the relation between ranking and score difference (the famous S-curve) to better reflect the real world, resulting in a system where, even with few pairings, you can better predict your final ranking based on partial results. You can do that with the %score system too, it's just that ELO should work slightly better/faster. About the PL, I like it too, Tron goes up 10 places... :) -- ABC
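
For reference, here is the "linear" versus S-curve distinction as a sketch. The constants are the classic chess logistic, not necessarily Paul's tuned RR@H curve, and the names are made up:

  public class ExpectedScore {
      // %score model: expectation is linear in the strength difference.
      static double linear(double myStrength, double oppStrength) {
          return 50.0 + myStrength - oppStrength;
      }

      // ELO model: expectation follows an S-curve in the rating difference.
      static double sCurve(double myRating, double oppRating) {
          return 100.0 / (1.0 + Math.pow(10.0, (oppRating - myRating) / 400.0));
      }

      public static void main(String[] args) {
          // Near equal strength the two models agree closely (52.0% vs 52.0%)...
          System.out.printf("%.1f%% vs %.1f%%%n", linear(52, 50), sCurve(1614, 1600));
          // ...but far out on the tails the linear model can predict an
          // impossible 109.9% while the S-curve saturates just below 100%.
          System.out.printf("%.1f%% vs %.1f%%%n", linear(99.9, 40), sCurve(2800, 1600));
      }
  }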

Agreed, it's just that the prediction seems unnecessary when we already have the answer. Besides, the discriminating feature of ELO seems to make little or no difference in the end. I think the PL is here to stay anyway, it's such a straightforward ranking. Tron is such a great bot, I think the PL reflects its quality better than the %Score systems (ELO included) do. -- PEZ

Someone added a note to mine that I never said... I will assume it was a mistake. I think all the rankings should be shown on one page, with both the %/rating/score *and* the position for each shown. It can then be sorted by whatever method the viewer wants. That way you can easily tell if a bot is doing better in one system than another. -- Tango

It was me. The table looked funny after your edit and I thought I'd fixed it. Didn't mean to put words in your mouth. =) Everybody, look over the table and the options we now have and make sure your votes are where you want them. Here's a conversation to help you choose:

-- PEZ

I'd say that was more your inability to explain it well, rather than it being overly complex. Point your friend at the explanation on Paul's (or whoever it was's) website. -- Tango

My inability to explain it well stems from my inability to understand it well. I have those web pages printed and gave them to my friend. He's a math head, so he understood it. But there's no way I can ever look at a rating of 1856.7 and say it tells me anything at all. My friend, after reading the papers, asked the, to me, obvious question: "But why do you use an estimate when you have the real thing?". I think it's like not using the % of votes in a political election, but instead using some statistical calculation that tries to estimate the outcome. There you have a recipe for major protests from your citizens. =) -- PEZ

In an 'ELO' type of rating system you assume (quite sensibly) a normal distribution of participants, then you force it into ratings. So when you tell me 'the average rating is set to 1600' I can tell you that a bot rated 1856.7 is supposed to get 73% of the points versus a 1600 bot. The next nice thing about ELO is normalization - that means a 1300 bot should get the same score against a 1200 bot as a 2000 bot against a 1900 one. In other systems I cannot reliably predict a match outcome. And contrary to what you posted above, usually very few matches are enough to get a good estimate of how a new bot rates in the pool. -- Frakir

The ELO ratings on their own aren't meant to tell you anything; it is how the ratings compare to the ratings of other bots that matters. I know that when DT is 50 points ahead of its nearest rival, it is doing very well, and I know that when my bot is 200 points below 2nd to last place, I really need to do something about it. The actual rating is irrelevant, that's why it doesn't make any difference what you set the average to. -- Tango

Maybe I've missed all the arguments for and against, but it seems to me that you are all trying to get a ranking system that is stable - i.e. a bot ranked 4th will beat a bot ranked 50th every time. But due to the nature of Robocode, matches are inherently unstable due to randomness in bots and starting positions. I think the aim of any ranking system isn't to show that one bot is better than another, but rather that one bot is better than another bot for that particular round of matches. We should embrace the randomness of Robocode, not make it stable. Look at DT standing at the top after all these months; wouldn't it be nice if it was knocked off the top, if only as the result of a lucky match? The way I see it, luck should play a part. Wouldn't it be boring to watch football if you could predict the result of a game between Manchester United and Luton Town? But you can't; Luton could beat ManU? in a one-off match as a freak result. That's what makes the game fun. That's what should make any Robocode competition fun. Bring back some kind of ladder or knock-out competition instead of your stable rankings! Anyway, rant over. I'm sure a lot (or all) of you disagree with me about this, so go ahead, tell me why I'm wrong. I won't listen anyway. :D

--wolfman

I agree that one-off matches are fun, but they don't help you make a good bot. The aim of RR@H is to get stable rankings so it is easy to tell if your bot is any good. I think it would be great fun to have a league that only ran 1 round for each pairing each season, and judged the entire table on that. -- Tango

What I ask is: what does 1823.7 tell you that 67.78% doesn't? 67.78% predicts that this bot should beat a bot with strength 50% by 17.78%. It's just as reliable as the ELO figure (which I think you'll have to tell me how you arrived at). What the %Score-based ranking provides is transparency. And you _can't_ reliably predict the outcome of a particular pairing using ELO or any other system. If you could, the PBI column on the Details page would not be what it is. My observation that the ELO type ranking produces some instability while a bot collects all its pairings is just that, an observation. But that's not to say that the ELO-based system we use in RR@H isn't good at predicting a bot's ranking. I am one of the first in the line of people who are amazed at how reliably it can do this. What I am saying is that we don't need the predictive qualities of ELO in RR@H. In other leagues it's needed (if you want stable rankings), but not here. It takes a bot less than a day to collect enough pairings to get a stable ranking using measured %Score strength. And that's with the few clients running that we have today (which I think does not exceed 5). Once we really push GO with RR@H we might have 100 clients running, and then it will take less than an hour.

Wolfman, RR@H is about producing a stable ranking (read the project goals somewhere on the RR@H part of the wiki). I agree fully with you that other leagues, providing enjoyable combat, are needed too. In particular I miss the face2face competition. But I think we could use the RR@H framework for running that kind of competition too. Either by making the clients switch modes or by making the server filter the uploaded battles into different bins. Feel welcome to look at this. The source is always included in the RR@H zip packages.

-- PEZ

Ok, I now understand what you are missing with percentages... :) While percentage works fine for bots of roughly equal strength, it stops working when the differences are huge. Example: bot A beats bot B 1000-90 (91.74% score); A also beats C 1000-60 (94.34%). What is the _predicted_ relative strength of B to C? If you say B is better by 2.6% you will be way off! ELO predicts accurately here: B performed better than C by almost 110 rating points and should get 60% score versus C. You are missing the whole normalization thing (the 'S'-curve). As a result, bots will always 'lose' rating playing bots far below their rating and perform 'better' versus bots of similar or better strength. -- Frakir

P.S. This also opens the possibility of doctoring ratings by choosing higher-rated opponents and playing selected matches on your RR@Home machine... or seeding Sandbox to play the bottom of the pack, or whatever. -- Frakir
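
Frakir's example can be checked in a few lines. This sketch uses the classic chess logistic (E = 1 / (1 + 10^(-d/400))) rather than Paul's adjusted RR@H curve, so the point count comes out near 70 instead of 110, but the predicted score versus C is the same ~60%:

  public class EloPrediction {
      // Rating difference implied by fractional score s against a fixed opponent.
      static double ratingDiff(double s) {
          return 400.0 * Math.log10(s / (1.0 - s));
      }

      // Expected fractional score for a rating difference d.
      static double expectedScore(double d) {
          return 1.0 / (1.0 + Math.pow(10.0, -d / 400.0));
      }

      public static void main(String[] args) {
          double sB = 90.0 / 1090.0;   // B's share against A: 8.26%
          double sC = 60.0 / 1060.0;   // C's share against A: 5.66%
          double d = ratingDiff(sB) - ratingDiff(sC);
          System.out.printf("B is %.0f points above C, expected %.0f%% vs C%n",
                  d, 100.0 * expectedScore(d));
          // prints: B is 70 points above C, expected 60% vs C
      }
  }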

The point PEZ was trying to make is that ALL pairings between ALL bots are run. It's not possible to exclude bots from being played, because they are guaranteed to be played on other people's RR@H clients. It doesn't matter what the predicted relative score of B against C is, because the match B vs C will actually have been played. I still feel, though, that ELO is the way to go, but I can live with relative strength if you really want to change it, PEZ : ). It's your server anyway, so it's up to you... -- Vuen

I hope we can have them all in one table; maybe PEZ will like how ELO normalizes things when compared to percentages. I am fairly sure the current percentage order != ELO order of bots. -- Frakir

Agree with PEZ and Vuen. "ELO is better because it allows me to predict the expected outcome of A against B" is a weak argument, since you have the REAL outcome of A against B. It is OK for me to keep ELO, since it has a long tradition in Robocode, but right now I think it is just overkill (a complex system used when it is no longer needed) since ALL pairings are there (note that in "real life" ELO is used only in sports where not all pairings can be played. No sport where all pairings are played uses a system like this). -- Albert

(Edit conflict * 2!!!) I wouldn't take this time to argue my point if I thought it was up to me. =) Here I have finally got someone to try to give me one possible advantage of ELO versus raw %Score. Thanks Frakir. I really appreciate that you keep trying to point out just what makes ELO preferable. It was driving me a bit nuts to just have "don't agree" and "we should keep ELO" thrown at me. But the fact remains that we do not need to predict the relative strength of bot B to C, we just wait a few hours and the answer will arrive. I'm not sure which ranking system you feel is weakest against manipulation. I think raw %score is more robust here. You can play DT vs VertiLeach all you want. The raw %score between these bots will just get more exact. Nothing at all will happen to the rest of the table. With the ELO type of rating I have no clue what would happen. Which is very much why I feel so strongly about getting rid of it. We saw at the start of RR@H that those kinds of manipulation attempts (focused pairings) disturbed the rankings, but I'm not sure that would be the case any longer. As Vuen points out, we now have a client which enforces that all pairings will be fought. Albert has succeeded very well in one of his major design goals of making the system robust.

Frakir, do you mean the current %score order is way off from the ELO order? And, if so, could it be that the ELO order (which we build dynamically) is just more recent than the %score one (which we currently rebuild every 12 hours)? -- PEZ

I have no idea... but if by sheer chance some bot played more games versus low-rated opponents than it statistically should, then it is 'percentage-wise' overrated, but its ELO is fine. A similar argument can be used against ELO (more games vs problem bots), but that would affect both the ELO and the percentage ratings. One more tiny point here: a percentage rating can possibly fluctuate more (I play against the last bot in the pack, I get 99.3%, my rating goes up considerably). I think we can have a table with both, at least for some time. -- Frakir

Your rating will only increase considerably if you haven't played that pairing before. I too think we can have both figures in the table, at least for a while. -- PEZ

What we'll lose with percentages is that nice normalization: Suppose SandboxXP? gets 99.9% while VertiLeach30? has 99.8%. Just a tiny bit off, almost no difference... In fact the strength difference is as huge as between 50% and 66.6%, and translates to the same 185 ELO points... (which means Sandbox is supposed to get 66.6% versus VL to justify that 0.1% difference). -- Frakir
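
The same check as before makes this concrete. Under the classic chess logistic (an assumption again; Paul's tuned RR@H curve presumably gives the 185 figure quoted above, while the classic curve gives about 120), the rating difference implied by a score s is

  ratingDiff(s) = 400 * log10(s / (1 - s))
  ratingDiff(99.9%) - ratingDiff(99.8%) = 1199.8 - 1079.2 ≈ 120 points
  ratingDiff(66.6%) - ratingDiff(50.0%) =  119.9 -    0.0 ≈ 120 points

so a 0.1% gap at the very top really is worth as much as the whole gap between 50% and 66.6%, exactly as Frakir says; only the point count depends on which curve is used.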

I agree, normalisation is probably the biggest issue in favour of ELO. And, as a consequence of that, problem bots. Your problem bot rating against the top and bottom bots will be different in ELO and %score (at least I think it will, I haven't actually checked). The ELO one will be more accurate/useful. -- Tango

My tests indicate that the difference is small, and I can't see that it goes in any particular direction. But since the PBI is mainly a nifty extra feature it needn't be all that exact. If my bot underperforms significantly against a particular opponent, both ranking systems will show it. When DT and Verti reach the levels of 99.8% strength we can maybe discuss whether ELO would show the 0.1% differences better. =) -- PEZ

Why not put both side by side and see what the actual PBIs are for the current data? Your tests are likely not accurate enough to notice the problems. If you draw a curve of the ELO rankings for the RR@H it is very linear except for the top and bottom 3 or 4 bots. If your tests didn't include the very top and very bottom, you would not have seen the problem. (NB: I haven't actually drawn such a curve for some time, so the rankings may have changed, I don't know.) -- Tango

Aw, crappy; I was just about to put up a comparison between %score (table=2) and ELO, and the premier league current rankings page just died. http://rumble.robowiki.dyndns.org/servlet/PremierLeague?game=roborumble There are only 9 bots in it! Was there a problem in the page generation? -- Vuen

No, there's something else going on. Look at the /ReportedProblems page... But now the rankings are rebuilt again. -- PEZ

Yes, something strange was going on earlier. I looked at ad.Neo's results a few hours ago and it had 0.6% against Noran.RandomT_0.1 (one battle then). -- Frakir

For the %score method to work correctly (I still trust ELO more), we are assuming that complete results for all pairings are fast to generate. That is true today, but we are still only scratching the surface of RR@H's potential. How about increasing the rounds per match to 100 (or even 500)? That would be very cool. Also, I miss melee competition! Imho we are wasting time discussing small details of a time-proven method of bot ranking instead of moving on to bigger and better things... (much like when you have a good bot and tweak it endlessly instead of trying new theories ;)) -- ABC

Agree about moving forward. Disagree about increasing the number of rounds. 35 rounds are enough. This way it gives some advantage to "smart" bots (the ones which learn fast) over the "wise" bots (the ones that can learn a lot about the enemy, but take a long time to do it). I never understood why people say "my bot is better because it is able to beat the other one after 1000 rounds". If a bot beats another one over 35 rounds again and again, then it is clear to me that the first one is better, no matter what would happen over 500+ rounds. -- Albert

I partially disagree here. You may write your bot optimised for an 'unknown' opponent: e.g. with a stat gun there are interesting methods to select the shooting offset when there is very little data, or when there is not enough data to shell out random noise (not the same thing). Energy management can also be optimised for low gun hit rates (I'll post something about it soon) or for high gun hit rates (a trained gun), and the differences here can be really big. In fact my current test bot is somewhat 'optimised' for unknown opponents (leagues like RR@Home) but also knows something about taking advantage of a trained gun... Anyway, those 2 things are really different, and I value 1000+ round battles more even when I optimise for short ones (because it is easier). -- Frakir

It takes us again to the ontological question of "what is the best bot?", and I think there is no answer to it, because there are as many answers as there are robocoders. Of course we could say "the best bot is the one that is able to cope with any criteria anyone proposes", but we cannot implement all possible criteria (also, if we did, we would have to weight them, and we would be stuck again). So my proposal is to leave it as it is. I think it gives a picture good enough for everybody to decide which is the best bot. -- Albert

My idea of the best way to find out which is the best bot is an "everyone fights everybody else once over 1000 rounds without saved data" league. Sure, the current setup gives a pretty good picture of the relative strength of all bots, but there is still a significant "luck factor" involved, especially between closely ranked opponents. After 1000 rounds there is a much smaller error margin, and both short and long term learning have been used. -- ABC

I think the luck factor is almost negligible. But a 1000-round rumble with no saved data would still be cool. It could be held as a one-off shoot now and then. -- PEZ

I like that! The system limit of 200kb saved data (and an RR pool of 200 bots) is another design consideration that makes your bot slightly weaker in short battles but stronger on average (fewer segments, or just saving partial data). -- Frakir

