[Home]Segmentation/Prioritizing

Robo Home | Segmentation | Changes | Preferences | AllPages

Showing revision 11
I'm reading about learning decision trees and the ID3 algorithm for doing it in my machine learning class right now at school, and I saw a lot of ideas that seem relevant to guess-factor targeting, flattening and segmentation. First is the idea of Entropy. A set of data has an entropy from zero to log2(c) where c is the number of possible values. It is zero if all the data has the same value, and it is log2(c) if all the possible values are evenly used. In other words, entropy is like a measurement for flatness. If we find the entropy of a given set of data and divide it by the max entropy, we can get a sort of normalized entropy, where our goal in movement would be to have an entropy of 1 in every segment (SittingDuck has an entropy of zero on its profile). The entropy of a set of samples is the sum of -p*log2(p) for each possible output, where p is the probability of that output within the set. A code example, samples could be thought of as a segment:

	public static double entropy(int[] samples)
	{
		double sum = 0;
		for (int i=0; i<samples.length; i++)
			sum += samples[i];
		double entropy = 0;
		for (int i=0; i<samples.length; i++)
			if (samples[i] != 0)
				entropy += -samples[i]/sum*Math.log(samples[i]/sum)/Math.log(2);
		return entropy;
	}

	public static double maxEntropy(int possibleValues)
	{
		return Math.log(possibleValues)/Math.log(2);
	}
	
	//I may have just made this up, but it should return 1 for a segment that's flat with flat juice on top
	public static double normalizedEntropy(int[] samples)
	{
		return entropy(samples)/maxEntropy(samples.length);
	}

The next major concept is information gain. The idea is that you can figure out what the order of importance is of your segmentations (assuming you have sufficient data I suppose) by figuring out which segmentation gives you the best information gain from before. So say I start with unsegmented data (it's segmented somewhere, but I want to start by figuring out the entropy of the unsegmented data) and then I try applying each potential segmentation and use the one that returns the biggest information gain (basically reduces the total entropy by the most in the end). Information gain is calculated by the following:

	Gain(S, A) = Entropy(S) - sum(% of total samples that fall in Sv * Entropy(Sv))

Here, S is the complete unsegmented data, A is the segmentation, and Sv is the data in each segment. The idea here is that we have nearly zero information gain when this segmentation yields a bunch of segments with the same general distribution as we had with no segmentation, or if it perfectly divides the data so that we know 100% what to do for each segment, we'll approach an information gain of log2(c) (where c is the number of guess factors). Here's some code that can do the simplest part of calculating the information gain on a certain segmentation, where samples is the unsegmented data, and segmentedSamples is the same set of data, but segmented:

	public static double informationGain(int[] samples, int[][] segmentedSamples)
	{
		int totalSamples = 0;
		for (int i=0; i<samples.length; i++)
			totalSamples += samples[i];
		double entropyS = entropy(samples);
		for (int i=0; i<segmentedSamples.length; i++)
		{
			int segmentSamples = 0;
			for (int j=0; j<segmentedSamples[i].length; j++)
				segmentSamples += segmentedSamples[i][j];
			entropyS -= (double)segmentSamples/totalSamples*entropy(segmentedSamples[i]);
		}
		return entropyS;
	}

If you apply this recursively on some overly complex set of data, you should be able to figure out which segmentations are more expendable (due to low information gain) and which ones you want to include in your next GF nano. And, of course, I need to do something like this tonight and write a report about it. Blah.

-- Kawigi

I feel for ya. :-) I'm trying to gird myself up for college work too. Looks like we're doing the same stuff.

On the topic, it always seemed to me like the whole principle of using information gain was totally wrong. Say your buckets in one segment give you '0 0 0 0 0 0 3 3 3 3 3 3' and in another '2 2 2 2 2 5 2 2 2 2 2' - well obviously the latter is a better segment, but information gain should say that the former is better. I suspect that the reason for using 'information gain' in AI is that it sounds more academic than 'well we picked the one which does the best job of classification, duh.' My tests with algoithms like PRISM and J42, which in effect try to use information gain to do rule-based segmentation of data, have been flipping horrible. They didn't even perform as well as an unsegmented GF gun. -- Jamougha

Interesting that the second set you gave has a higher entropy, even though it corresponds to a 20% estimated hit rate in the middle. Maybe it would be worth trying an information-gain formula that is just based on the same idea but with estimated hit rates instead of entropies? i.e - for each segmentation, try to figure out the sum of (% of total samples that go in this segment) * (estimated hit rate). -- Kawigi

Yes, I think that could be a pretty good approach; I've thought about doing VG selection along those lines, or even building new guns or segmentations dynamically using it. You would need some sort of weighting to help keep the segments full, though, otherwise you will end up just matching the noisiest segments. One of the problems with the above algorithms is vast overfitting - they can match 40%+ of the training data but fail miserably on seperate test data. -- Jamougha

Jamougha's contention that the 2/5 segment is better than the 0/3 segment isn't so obvious to me. Perhaps it comes down to the simplicity of your factor selection. -- Martin

I was recently thinking about this entropy model, and I just came up with an alternative model that's a hybrid of entropy and hit% that I think would produce better results:
- For each segmentation division:
--- Find the peak GuessFactor's value
--- Take the hit % of the peak GuessFactor (probably best to account for botwidth if you want to be really accurate)
- Find the average peak GuessFactor
- Subtract the average peak from each peak, multiply these values by the hit % already calculated
- Take the entropy of these new values
Essentially it's taking the entropy of the peaks (how much the peaks vary across the segment), with artificially reduced entropy where hit rates are low. I'm not sure how to explain exactly how I came up with this model, however I think it's an advantageous one, because it has the following properties:
- It takes hit rates accurately into account, as it would prefer the 2/5 graph above over the 0/3 graph.
- And it also prefers segments where the peak GuessFactor is not the same all of the time (a segment that always produced the identical 2/5 graph got nearly all values would not be as useful as one where the peak varied)
So what to other people think of this model here? -- Rednaxela


Robo Home | Segmentation | Changes | Preferences | AllPages
Edit revision 11 of this page | View other revisions | View current revision
Edited March 19, 2008 17:31 EST by Rednaxela (diff)
Search: