213 Decision Tree Learning


How now Rubyists,

This week’s quiz is about decision tree learning. Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications.

For a practical example let’s examine the plight of our friend David. David is the manager of a famous golf club. Sadly, he is having some trouble with his customer attendance. There are days when everyone wants to play golf and the staff are overworked. On other days, for no apparent reason, no one plays golf and staff are idle. David’s objective is to optimise staff time by predicting when people will play golf. To accomplish that he needs to understand the reasons people decide to play. He assumes that weather must be an important underlying factor, so he decides to use the weather forecast for the upcoming week. So during two weeks he has been recording:

* The outlook, whether it was sunny, overcast or raining. * The temperature (in degrees Fahrenheit). * The relative humidity in percent. * Whether it was windy or not. * Whether people attended the golf club on that day.

David compiled this data as shown:

# Outlook, temperature, humidity, windy, play
data = [
  [:sunny,    85, 85, false, false],
  [:sunny,    80, 90, true,  false],
  [:overcast, 83, 78, false, true ],
  [:rain,     70, 96, false, true ],
  [:rain,     68, 80, false, true ],
  [:rain,     65, 70, true,  false],
  [:overcast, 64, 65, true,  true ],
  [:sunny,    72, 95, false, false],
  [:sunny,    69, 70, false, true ],
  [:rain,     75, 80, false, true ],
  [:sunny,    75, 70, true,  true ],
  [:overcast, 72, 90, true,  true ],
  [:overcast, 81, 75, false, true ],
  [:rain,     71, 80, true,  true ],

Our job is to write a Ruby program that will construct a decision tree from this data. The algorithms that are used for constructing decision trees work by choosing a variable at each step that is the next best variable to use in splitting the set of items. “Best” is defined by how well the variable splits the set into subsets that have the same value of the target variable. Different algorithms use different formulae for measuring “best”, the techniques you use are up to you.

Have Fun!

Sunday, July 19, 2009