The approach here is to make a simple statistical test to see if we can say the scores are significantly different or not. Note that if you use it and it says that there is no significant difference, it does not mean that the bots are equal, but that you have to run more battles :-)
Here it goes, a formula to check if a bot is better than another:
if ((Winner Score - Loser Score)/Number? of rounds)/(120/sqrt(Number of rounds)) > 1,28 then you can say Winner is better than Loser. if not, there is not enough data to conclude that ... żmay be you must run more battles?
You can also reverse the formula and say that a bot must win by 1968 points in a 100 rounds battle (or by 6223 in a 1000 rounds battle, or 768 in a 25 rounds battle) to say it is clearly better than another.
It is a simple test that can be used, and can avoid running long 1000 battles.
Some hypotheses I made: a) Confidence level is 90% (assuming Winner can be equal or better than loser, but not worse). You can use 1,64 instead of 1,28 for a 95% test. b) There is no clear way to calculate variance, because Robocode gives accumulated results. But if you think about it, you will see that the standard deviation is between 60 and 180, so I used the mean point (120). c) I assume that bots perform ideally at the same level in all rounds, but this may not be true (ie. learning bots that store results from round to round).
One example: SandboxDT vs. Duelist.
There are 2 battles:
a) The first one is the 100 round battle on the Duels site, where Duelist beat Sandbox 1.2 8121-7537.
Applying the formula: ((8121-7537)/100)/(120/10) = 0.487 -> You cannot conclude Duelist is better than SandboxDT 1.2.
b) The second one is the result posted yesterday, where SandoxDT? 1.3 beat Duelist 79205-78326 in a 1000 rounds battle.
Applying the formula: ((79205-78326)/1000)/(120/1000^0.5) = 0.23 -> You cannot conclude SandoxDT? is better than Duelist.
I'm having trouble remembering some of the statistical basis for this stuff, but I'd like to apply it elsewhere. Could someone explain maybe how you can 'rate' the validity of data given the sample size, standard deviation, and other basic stuff like that? Basically I want to know something like a confidence interval. -- Kawigi
The margin of error formula looks like this:
M = 1.96 * sqrt((percent * (1 - percent)) / number of samples)
That is the margin of error with a 95% confidence interval. I won't go into details about where the 1.96 comes from, as it becomes complicated very quickly, but suffice to say that the number required is provided by this equation:
n = sqrt(2) * aerf(CI)
Where aerf is the inverse of the so-called error function (usually denoted "erf"), and CI is the confidence interval (i.e. "0.95" in our case). Some other solutions of this equation are as follows:
sqrt(2) * aerf(0.90) = 1.645 sqrt(2) * aerf(0.99) = 2.576 sqrt(2) * aerf(0.999) = 3.291
A 95% confidence interval, as applied to margin of error, means that 19 out of every 20 samples will lie within the margin of error. However, 1 out of every 20 will be totally and completely whack. Basically, you can be 95% sure that the correct answer is somewhere inside your margin of error, but there is the distinct possibility that your estimate is completely off -- the true answer has a 5% chance of being ENTIRELY different from your estimate.
I hope this is at least a good summary. I learned all of this stuff while researching for a little utility I wrote that fights two bots together for only the number of battles necessary for one of them to come out as the winner with 99.9% confidence (or some other confidence that I specify). It works very well, and I can even specify a minimum number of battles, but it has the downside that the answer is a little bit questionable when a lot of bot learning is going on.
Within which interval can the margin of error be expected for a successful test? -- PEZ
Is there a way to figure out what my confidence interval is that my estimate is even valid? -- Kawigi
In my utility, I run margin of error on both bots' scores after every round, and if their margins do not overlap, I declare the test to be complete. That is how you know if your test is successful. For instance, if a bot has an estimated win percentage of 0.65 with a margin of error of 0.10, and the other bot has an estimated win percentage of 0.35 (it will have the same margin of error), then the lowest the first bot's percentage could be is 0.55, and the highest the second bot's percentage could be is 0.45. Since they do not overlap, the first bot is better than the second bot (according to the estimates).
As to Kawigi's question: you don't figure out a confidence interval, and there is no way for an estimate to be 100% valid (unless it samples the entire population, which in Robocode is analogous to running every possible battle). You set your confidence interval a priori. The margin of error returned with that confidence interval, say 0.95, is valid 95% of the time. One in every twenty samples will lie entirely outside the interval, though.
I hope this makes sense. It's a pretty hairy topic to discuss since it seems to involve abstract levels of error.
As an aside, I use a system like this to control my virtual guns. The amount I use each gun is based on the amount that its margin of error overlaps with the best (estimated) gun.