public static double entropy(int[] samples)
{
    double sum = 0;
    for (int i=0; i<samples.length; i++)
        sum += samples[i];
    double entropy = 0;
    for (int i=0; i<samples.length; i++)
        if (samples[i] != 0)
            entropy += -samples[i]/sum*Math.log(samples[i]/sum)/Math.log(2);
    return entropy;
}

public static double maxEntropy(int possibleValues)
{
    return Math.log(possibleValues)/Math.log(2);
}

//I may have just made this up, but it should return 1 for a segment that's flat with flat juice on top
public static double normalizedEntropy(int[] samples)
{
    return entropy(samples)/maxEntropy(samples.length);
}
The next major concept is information gain. The idea is that you can rank your candidate segmentations by importance (assuming you have sufficient data, I suppose) by figuring out which one gives you the best information gain over what you had before. So say I start with unsegmented data (it's segmented somewhere, but I want to start by figuring out the entropy of the unsegmented data), then I try applying each potential segmentation and use the one that returns the biggest information gain (basically, the one that reduces the total entropy the most). Information gain is calculated as follows:
Gain(S, A) = Entropy(S) - sum(% of total samples that fall in Sv * Entropy(Sv))
Here, S is the complete unsegmented data, A is the segmentation, and Sv is the data in each segment. The idea here is that we have nearly zero information gain when the segmentation yields a bunch of segments with the same general distribution as we had with no segmentation. If instead it perfectly divides the data, so that we know 100% what to do for each segment, the information gain equals Entropy(S), which is at most log2(c) (where c is the number of guess factors). Here's some code that can do the simplest part of calculating the information gain on a certain segmentation, where samples is the unsegmented data and segmentedSamples is the same set of data, but segmented:
public static double informationGain(int[] samples, int[][] segmentedSamples)
{
    int totalSamples = 0;
    for (int i=0; i<samples.length; i++)
        totalSamples += samples[i];
    double entropyS = entropy(samples);
    for (int i=0; i<segmentedSamples.length; i++)
    {
        int segmentSamples = 0;
        for (int j=0; j<segmentedSamples[i].length; j++)
            segmentSamples += segmentedSamples[i][j];
        entropyS -= (double)segmentSamples/totalSamples*entropy(segmentedSamples[i]);
    }
    return entropyS;
}
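To make the two extremes described above concrete, here is a minimal, self-contained sketch that repeats the two methods and runs them on made-up bucket counts. The class name InfoGainExample and the sample arrays are assumptions for illustration only; they are not real battle data. A split that separates the two buckets completely should come out with a gain of about 1 bit (the full entropy of the unsegmented data), while a split whose segments look just like the unsegmented data should come out at about 0.

```java
// Hypothetical demo class; the counts below are invented for illustration.
class InfoGainExample {
    // Shannon entropy (in bits) of a histogram of bucket counts.
    static double entropy(int[] samples) {
        double sum = 0;
        for (int count : samples) sum += count;
        double entropy = 0;
        for (int count : samples)
            if (count != 0)
                entropy -= count / sum * Math.log(count / sum) / Math.log(2);
        return entropy;
    }

    // Entropy of the unsegmented data minus the sample-weighted
    // entropy of each segment.
    static double informationGain(int[] samples, int[][] segmentedSamples) {
        int totalSamples = 0;
        for (int count : samples) totalSamples += count;
        double gain = entropy(samples);
        for (int[] segment : segmentedSamples) {
            int segmentSamples = 0;
            for (int count : segment) segmentSamples += count;
            gain -= (double) segmentSamples / totalSamples * entropy(segment);
        }
        return gain;
    }

    public static void main(String[] args) {
        int[] unsegmented = {4, 4};              // two buckets, evenly visited
        int[][] perfectSplit = {{4, 0}, {0, 4}}; // each segment uses one bucket only
        int[][] uselessSplit = {{2, 2}, {2, 2}}; // each segment mirrors the whole
        System.out.println("perfect split gain: " + informationGain(unsegmented, perfectSplit));
        System.out.println("useless split gain: " + informationGain(unsegmented, uselessSplit));
    }
}
```

Running the same comparison over each candidate segmentation of real data and keeping the highest gain is the selection step described above.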
If you apply this recursively on some overly complex set of data, you should be able to figure out which segmentations are more expendable (due to low information gain) and which ones you want to include in your next GF nano. And, of course, I need to do something like this tonight and write a report about it. Blah.
-- Kawigi
I feel for ya. :-) I'm trying to gird myself up for college work too. Looks like we're doing the same stuff.
On the topic, it always seemed to me like the whole principle of using information gain was totally wrong. Say your buckets in one segment give you '0 0 0 0 0 0 3 3 3 3 3 3' and in another '2 2 2 2 2 5 2 2 2 2 2' - well obviously the latter is a better segment, but information gain should say that the former is better. I suspect that the reason for using 'information gain' in AI is that it sounds more academic than 'well we picked the one which does the best job of classification, duh.' My tests with algorithms like PRISM and J42, which in effect try to use information gain to do rule-based segmentation of data, have been flipping horrible. They didn't even perform as well as an unsegmented GF gun. -- Jamougha
Interesting that the second set you gave has a higher entropy, even though it corresponds to a 20% estimated hit rate in the middle. Maybe it would be worth trying an information-gain formula that is just based on the same idea but with estimated hit rates instead of entropies? i.e - for each segmentation, try to figure out the sum of (% of total samples that go in this segment) * (estimated hit rate). -- Kawigi
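A rough sketch of that hit-rate idea, applied to the two bucket sets quoted above. Everything here is an assumption made for illustration: the class name HitRateScore, the method names, and above all the definition of "estimated hit rate" as the share of a segment's samples that landed in its most-visited bucket (i.e. you fire at the peak and only that exact bucket counts as a hit, ignoring bot width).

```java
// Hypothetical sketch; the hit-rate definition is an assumption, not an agreed formula.
class HitRateScore {
    // Shannon entropy (in bits), same formula as the entropy method earlier on this page.
    static double entropy(int[] samples) {
        double sum = 0;
        for (int count : samples) sum += count;
        double entropy = 0;
        for (int count : samples)
            if (count != 0)
                entropy -= count / sum * Math.log(count / sum) / Math.log(2);
        return entropy;
    }

    // Assumed estimate: fraction of the segment's samples in its fullest bucket.
    static double estimatedHitRate(int[] segment) {
        int total = 0, best = 0;
        for (int count : segment) {
            total += count;
            best = Math.max(best, count);
        }
        return total == 0 ? 0 : (double) best / total;
    }

    // Sum over segments of (share of all samples in the segment) * (estimated hit rate).
    static double weightedHitRate(int[][] segmentedSamples) {
        int total = 0;
        for (int[] segment : segmentedSamples)
            for (int count : segment) total += count;
        if (total == 0) return 0;
        double score = 0;
        for (int[] segment : segmentedSamples) {
            int segmentTotal = 0;
            for (int count : segment) segmentTotal += count;
            score += (double) segmentTotal / total * estimatedHitRate(segment);
        }
        return score;
    }

    public static void main(String[] args) {
        int[] first = {0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3};
        int[] second = {2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2};
        System.out.println("first:  entropy " + entropy(first) + ", hit rate " + estimatedHitRate(first));
        System.out.println("second: entropy " + entropy(second) + ", hit rate " + estimatedHitRate(second));
        System.out.println("both as one segmentation: " + weightedHitRate(new int[][]{first, second}));
    }
}
```

Under this assumed definition the second set scores 5/25 = 20% against 3/18 for the first, even though its entropy is higher, which is the disagreement between the two measures being discussed here.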
Yes, I think that could be a pretty good approach; I've thought about doing VG selection along those lines, or even building new guns or segmentations dynamically using it. You would need some sort of weighting to help keep the segments full, though, otherwise you will end up just matching the noisiest segments. One of the problems with the above algorithms is vast overfitting - they can match 40%+ of the training data but fail miserably on separate test data. -- Jamougha
Jamougha's contention that the 2/5 segment is better than the 0/3 segment isn't so obvious to me. Perhaps it comes down to the simplicity of your factor selection. -- Martin