


While the Hit Percentage method is the most promising, a combination of the methods could yield the best results 
While the Hit Percentage method is the most promising, a combination of the methods could yield the best results 
Entropy  Kawigi's description
I'm reading about learning decision trees and the ID3 algorithm for doing it in my machine learning class right now at school, and I saw a lot of ideas that seem relevant to guessfactor targeting, flattening and segmentation. First is the idea of Entropy. A set of data has an entropy from zero to log2(c) where c is the number of possible values. It is zero if all the data has the same value, and it is log2(c) if all the possible values are evenly used. In other words, entropy is like a measurement for flatness. If we find the entropy of a given set of data and divide it by the max entropy, we can get a sort of normalized entropy, where our goal in movement would be to have an entropy of 1 in every segment (SittingDuck has an entropy of zero on its profile). The entropy of a set of samples is the sum of p*log2(p) for each possible output, where p is the probability of that output within the set. A code example, samples could be thought of as a segment:
public static double entropy(int[] samples) { double sum = 0; for (int i=0; i<samples.length; i++) sum += samples[i]; double entropy = 0; for (int i=0; i<samples.length; i++) if (samples[i] != 0) entropy += samples[i]/sum*Math.log(samples[i]/sum)/Math.log(2); return entropy; } public static double maxEntropy(int possibleValues) { return Math.log(possibleValues)/Math.log(2); } //I may have just made this up, but it should return 1 for a segment that's flat with flat juice on top public static double normalizedEntropy(int[] samples) { return entropy(samples)/maxEntropy(samples.length); }
The next major concept is information gain. The idea is that you can figure out what the order of importance is of your segmentations (assuming you have sufficient data I suppose) by figuring out which segmentation gives you the best information gain from before. So say I start with unsegmented data (it's segmented somewhere, but I want to start by figuring out the entropy of the unsegmented data) and then I try applying each potential segmentation and use the one that returns the biggest information gain (basically reduces the total entropy by the most in the end). Information gain is calculated by the following:
Gain(S, A) = Entropy(S)  sum(% of total samples that fall in Sv * Entropy(Sv))
Here, S is the complete unsegmented data, A is the segmentation, and Sv is the data in each segment. The idea here is that we have nearly zero information gain when this segmentation yields a bunch of segments with the same general distribution as we had with no segmentation, or if it perfectly divides the data so that we know 100% what to do for each segment, we'll approach an information gain of log2(c) (where c is the number of guess factors). Here's some code that can do the simplest part of calculating the information gain on a certain segmentation, where samples is the unsegmented data, and segmentedSamples is the same set of data, but segmented:
public static double informationGain(int[] samples, int[][] segmentedSamples) { int totalSamples = 0; for (int i=0; i<samples.length; i++) totalSamples += samples[i]; double entropyS = entropy(samples); for (int i=0; i<segmentedSamples.length; i++) { int segmentSamples = 0; for (int j=0; j<segmentedSamples[i].length; j++) segmentSamples += segmentedSamples[i][j]; entropyS = (double)segmentSamples/totalSamples*entropy(segmentedSamples[i]); } return entropyS; }
If you apply this recursively on some overly complex set of data, you should be able to figure out which segmentations are more expendable (due to low information gain) and which ones you want to include in your next GF nano. And, of course, I need to do something like this tonight and write a report about it. Blah.
Upside: counts the overall spikiness of the segment Downside: does not always select the segment with the best hit percentage
The decision is a modified spikyness algorithm (as seen on the WikiTargeting page), that returns a VisitsCovered? number.
suppose we have a Node that needs to be split, containing the following GF bins: (The numbers represent visits, the ^ represents bins used for targeting, taking botwidth into account)
[######## ] [########## ] [############ ] [############## ] [###############] [###############] [###############] [###############] [###############] [999999998877665] ^^^This node is obviously very flat. Suppose botwidth covers 3 bins. VisitsCovered? will be 9+9+9=27, which in this case is not very good. Now suppose our algorithm finds that one of the many possible splitting parameters results in these two child nodes:
[ ] [ # ] [ ### ] [ ##### ] [ ####### ] [ ######### ] [ ########### ] [ ############# ] [###############] [123456787654321] ^^^ [ ] [# ] [## ] [### ] [#### ] [##### ##] [###### ####] [####### ######] [###############] [876543211223344] ^^^Now we have one VisitsCovered? number of 7+8+7 = 22, and one of 8+7+6 = 21, so we have a total of 22+21 = 43. The algorithm will have found that 43 is much better than 27 (the VisitsCovered? of the parent node). Maybe some other splits have resulted in, say, total VisitsCovered? of 29, 35, 28, 30 etc. Now the algorithm will decide that our example split is the most efficient one, since it will cover most visits.
Upside: finds the highest probability selection Downside: Not very effective against flat profiles
Hit Percentage
Finds the percent of data in the highest bin.
Upside: Finds the segment with the best spike Downside: Population sensitive
While the Hit Percentage method is the most promising, a combination of the methods could yield the best results