By Vic Stewart.
The original Zoom Targeting concept was a good algorithm for Dynamic Segmentation. But, it had one major drawback: unacceptable memory consumption. I discarded the idea because I could not solve this problem. However, a year later I think I have found a solution:
A Node contains pointers to a Leaf class and a Splitter class Only one of the two pointers points to an actual object, the other one is a nullpointer This defines whether a Node is a Leaf or Splitter(Branch) Node.
The first Node starts its life as a Leaf Node.
----------------- | Leaf | | Node | -----------------
In Leaf mode it collects wave data in the form of Observations. The Observations are all kept in the Leaf Node for later use. They are also rolled up into GF arrays as in other GF guns. Obviously this GF array is used for the actual aiming. When the number of Observations grows beyond a certain number, the Node becomes eligable to be split into two children Leaf Nodes.
When splitting is triggered, the Node will turn itself into a Splitter Node. It now has to choose one Dimension/parameter by which it can seperate its data. The Node can choose between any number of Dimensions. All Dimensions are individually tested to find the most useful splitting Dimension. The Dimension which comes out on top is used for the actual splitting. For example, the Node may choose DISTANCE as its most useful splitting criterium. Observations with a distance < 500 will be sent to the first child, Observations with a distance >= 500 will be sent to the second child.
The Parent Leaf Node now becomes a Splitter (Branch) Node.
----------------- ----------------- | Leaf | | Leaf | | Node | | Node | ----------------- ----------------- \ / \ / ----------------- | Splitter | | Node | -----------------
Waves are now sent, through the Splitter Node, to the two Leaf Nodes, until one of the Leaf Nodes will again be triggered to Split.
----------------- ----------------- | Leaf | | Leaf | | Node | | Node | ----------------- ----------------- \ / \ / ----------------- ----------------- | Leaf | | Splitter | | Node | | Node | ----------------- ----------------- \ / \ / ----------------- | Splitter | | Node | -----------------
This process is repeated as the match progresses.
About memory consumption:
Suppose the Splitting threshold is 40. That means the average Leaf Node will have 30 Observations (20 for the brand new ones, 39 for the ones just about to be split) In an average 35 round match, 30000 waves will be collected. That means after 35 rounds we will have approx. 30000/30 = 1000 Leaf Nodes. For 1000 Leaf Nodes there are 999 Splitter Nodes. That means memory consumption will be not a problem. A lot of memory will be consumed by the Leaf Nodes, but nothing compared to the amount of memory the original ZoomTargeting algorithm could devour. Of course, there will be more memory consumption than your average GF gun has, that is because all wave data/Observations? are held in memory, next to the GF arrays.
In theory, this algorithm has some interesting properties:
There is an interesting spinoff for this algorithm:
The tree structure can be serialized by only 4 bits per Splitter Node in the case of a maximum of 16 Dimensions. This is true because the only thing a Splitter node has to know to be able to do its job, is the Dimension with which it must split the data.
The filesize for the above example would become: 999 Splitter Nodes * 4 bits = 3.6 kB unzipped. The resulting file could be read by the robot in a next match, which will dramatically increase its learning speed. Note that no guessfactors are actually saved, only the tree structure.
The Serializing could be limited to a subset of Splitter Nodes, to fit into 1kB files. That would make it possible to save data on all RoboRumble participants. Despite the fact that not the entire tree structure is saved, it may give the bot enough of a headstart to outperform its opponents. Off course the robot will expand the tree itself using waves it collects during the match.
Secondly: after splitting on distance > 500 how would this alorithm allow for the distance to be split more finely? Could it somehow determine that it is necessary to split the > 500 into > 250 and < 250? If so, how does it determine that this second split is more significant than say acceleration observations?
This certainly is an intriguing idea and I will take a much more deep look at it when I get home (still in CA until WED this week). -- jim
suppose we have a Node that needs to be split, containing the following GF bins: (The numbers represent visits, the ^ represents bins used for targeting, taking botwidth into account)
[######## ] [########## ] [############ ] [############## ] [###############] [###############] [###############] [###############] [###############]  ^^^This node is obviously very flat. Suppose botwidth covers 3 bins. VisitsCovered? will be 9+9+9=27, which in this case is not very good. Now suppose our algorithm finds that one of the many possible splitting parameters results in these two child nodes:
[ ] [ # ] [ ### ] [ ##### ] [ ####### ] [ ######### ] [ ########### ] [ ############# ] [###############]  ^^^ [ ] [# ] [## ] [### ] [#### ] [##### ##] [###### ####] [####### ######] [###############]  ^^^Now we have one VisitsCovered? number of 7+8+7 = 22, and one of 8+7+6 = 21, so we have a total of 22+21 = 43. The algorithm will have found that 43 is much better than 27 (the VisitsCovered? of the parent node). Maybe some other splits have resulted in, say, total VisitsCovered? of 29, 35, 28, 30 etc. Now the algorithm will decide that our example split is the most efficient one, since it will cover most visits.
So there's your holy grail, if I understand you correctly.
Splitting more finely is already covered in the algorithm. The class RecursionLevel? takes care of that. An instance of that class is always sent up the tree together with the observation to be processed. When a splitting node is encountered this node will alter a variable in the RecursionLevel? object to indicate that a certain parameter has been split upon. Actually, the RecursionLevel? class contain two variables per dimension: a max, and a min value. In the Distance case, the RL Object will start life with the distance numbers 0 and 1000. After the first split it will have the numbers 0-500 respectively. After the third split it may have the values 250-500, etc. Note that this is only an example! It is not carved in stone that the algorithm will start out by splitting on distance three times in a row from the start.
I really don't think this is rocket science. If it sounds overly complex then I most likely failed to explain it well. Kawigi, Jamougha and probably others, have in the past been working on the true rocket science variant: binary decision trees based on entropy. Check out Segmentation/Prioritizing. This WikiTargeting/DynamicSegmentation algorithm is my attempt to simplify that theory. So if you think my explanation is crummy, please let me know. Then I will try to explain it more clearly. --Vic
Yes he did (although I don't think he actually tried it). Anyway, I abandoned the entropy stuff because I just thought is was too complex. I think my above solution of visits covered is much simpler and much more to the point when working with arrays of GF bins. --Vic
Jamougha just showed that the normal entropy calculation (square of the difference) wouldn't always do intuitive things as far as improving your hit rate. And also, decision tree learning doesn't have to be binary. -- Kawigi
I don't know if Jamougha's doubts apply to my variation of the algorithm. I must also admit that I didn't fully understand Jamougha's 'overfitting' argument. Does this have to do with splitting up segments when there is not enough data to justify this? The thing about my algorithm I most worry about is the forming of the first couple of nodes. The first 40 waves collected will have a crucial impact on the tree since it will define the first (base) split. These 40 waves may see the enemy travelling from the spawn point in a corner towards the middle of the battlefield. I don't know if this will cause the algorithm to fail. Testing must show that. --Vic
I imagine the argument about overfitting would involve splitting up segments because of data that incidently shows some difference due to the random nature of the data, rather than because of actual tendencies. In theory, after, say, 500 rounds, the same tree (give or take) should be formulated as every other 500-round match, hopefully without the additional need to remove too many splits along the way from realizing that they weren't good afterall.
It occurs to me that a threaded approach to decision tree learning mixed with guess-factor targeting might be an interesting way to go (periodically re-generate the tree based on entropy or something like it, maybe at the beginning of every round). Or maybe I'm on crack. The real usefulness of d-tree learning may just be to analyze deeply segmented data after a long match to decide which static segments to leave in your bot :-p -- Kawigi
From experience I know that after only 5 or 6 rounds performance becomes an issue when re-arranging the entire graph. I also think that when a segment has enough data (say 300 waves collected) re-arranging becomes unnecessary. The stability of the segment should outweigh the performance penalty of re-arranging. Of course the real boundary here would have to be determined in practise. Maybe it could be as low as 50 or something. So maybe there should be a middle type node (containing between 40 -splitting minimum- and 300 observations) that gets to be rearranged after every round. That should solve my above concern about the formation of the first nodes. Then again, as the tree broadens, more and more of these middle type nodes will be present and performance will become an issue once again. Maybe it is a valid assumption that nodes become more stable higher up in the tree? I mean, they will become spikier and more predictable, so there is less chance of splitting on random spikes. In that case the middle type node upper boundary could decrease as we go higher up in the tree. In the case of a binary tree this 'stability threshold' should be halved at every level increment to keep the time needed for re-arranging constant. Since I am no math wizard, does that make any sense mathematically, Kawigi? --Vic
There's also a good chance that once more data comes in (particularly against good movers, like Raiko or something), it won't be spiky anymore. How much data do you need to be "sure" about a split? You might be right about it depending on the depth, but probably more importantly the amount of data in that area (which will naturally be more in some parts of the tree than others, for instance, you will have roughly twice as much in an acceleration segment than a deceleration segment, and depending on the bot, you could have even more in a constant speed segment. Most bots would have more data in a top-speed segment than a medium-speed segment, too. And so on. I guess the bottom line is that I don't know how the "stability threshold" should be implemented at all, but I think that data should rule over depth, and another thing that could influence it is the variance or entropy of the data in that segment. Maybe entropy needs to be "adjusted" to account differently for data that isn't in the same bucket but is really close (the normal entropy function is used for classification, where the classes aren't necessarily similar). Anyways, lots of musings there.
The performance hit on relearning the tree need only increase as the depth of the final tree increases, but in either case, it isn't something you want to do every tick, nor is it something worth doing that often, but doing it in the background occasionally seems reasonable. If the data is stored in the form of tiny segments, those can each be processed in basically constant time (not increasing with the amount of data actually in there). There's still plenty of thinking and playing that should be done with this, keep it up ;-)
Random note - Entropy seems to me to be related to statistical variance, except that it is built better for classification problems, but variance might work better when your stats follow a normal distribution. Why doesn't someone just invent something built for us? --Kawigi
Your first question is the same one we have been asking on the WikiTargeting page. Isn't there some mathematical definition of this in statistics? And, oh yes, I will keep it up. I actually find this an intriguing concept. I'll be back in about three weeks time. First up: vacation :-D --Vic
A few things that I have been thinking about. First, your visit count method has the problem of, given a flat distribution, you could have your spike being the max (ie a 99999999 you could have 99990000 and 00009999). What I am contimplating is a combination of the visit counts, an entropy calculation, and a maximum hit probability. Also, you dont need to perform the recursive divisions, because if the first split is not the most useful, the profiles will just make a useful split the next time. They will just start off the same. I am also working on a system to see how similar two arrays are, which should be useful both on selecting the most appropriate split, and for comparing on fire behavior to general behavior (if they can be shown to be the same it will increase the learning speed 10 fold. -- Jokester
I'm thinking that my gun for DrussGT will be something down these lines. And, from what I can reason out, it shouldn't matter which split comes first. If it is a 'bad' choice, then the data afterwards should make *each* of the subnodes do a good split when we have more data. It may slow down our learning time, but that shouldn't matter with this gun anyways. If it is a problem, one way to stabilize the initial splits would be to add a few simple segments as default (eg. lateral velocity, walls) and let the dynamic segmentation add in all the fancy stuff (distance, accel, decel, time since decel, distance last x ticks, time since direction change, gunheat, headingchange, advancing velocity, etc) individually for each bot. If you kept a counter it would even be possible to do rolling averages, weighting each entry by how long ago it was collected. As mentioned above, one of the requirements for a node is that it splits the data so that there is a balance of data to each side (we ARE using a binary tree here), that there are enough entries in the segment beforehand, and that the resultant nodes have a higher Entropy than the initial node. Additionally, by using this arrangement, you will never have an empty node when you want to fire your data, and each subnode is optimised for its conditions: for example, it may be useful to make a decel split inside the nearwall segment, but not otherwise. Anyways, if anybody else wants to check my code when I've got something working, feel free =) -- Skilgannon