## Introduction

The decision tree is one of the most commonly used classification techniques. The kNN algorithm in Part 1 did a great job of classifying, but it didn't lead to any major insights about the data. One of the best things about decision trees is that humans can easily understand the knowledge they extract from the data.

## Background

I’ll follow the ID3 algorithm, which tells us how to split the data and when to stop splitting it. The basic idea is to construct the decision tree by employing a top-down, greedy search through the given sets to test each feature at every tree node.

Two of the most important concepts in the ID3 algorithm are **entropy** and *information gain*. Both come from information theory. You can read more about ID3 and these information-theoretic concepts on Wikipedia, or refer to a good brief introduction; in this tip, I will use the formulas from that short paper.
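In general (these are the standard ID3 definitions, written here in the same notation used for the concrete calculations below), for a set S with classes c and a feature F with values v:

**Entropy(S) = - Σc p(c)Log2 p(c)**

**Gain(S, F) = Entropy(S) - Σv (|Sv|/|S|)Entropy(Sv)**

where p(c) is the proportion of examples in S belonging to class c, and Sv is the subset of S for which feature F has value v.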

Assume that we want to predict which player will receive the Golden Ball award in 2018. For a player to receive the Golden Ball award, he must receive the Golden Shoe award and/or be the Champions League Winner. Data is collected in the following table (reconstructed here from the dataset used in the code below) to build a decision tree:

| Has Golden Shoe | Is Champions League Winner | Receives Golden Ball |
|---|---|---|
| Yes | Yes | Yes |
| Yes | Yes | Yes |
| No | Yes | No |
| Yes | No | No |
| Yes | No | No |

The target classification can be **Yes** or **No**. The first feature is **'Has Golden Shoe'** and the second feature is **'Is Champions League Winner'**. The values of these features are Yes or No. For simplicity, I will assign labels to our features as follows:

A -> Has Golden Shoe

B -> Is Champions League Winner

For convenience, I also convert the values of the features from **Yes/No** to **1/0** correspondingly.

## Using the code

In Python, we can write a *dataSet* based on the data table above by using a function called **createDataSet**:

```python
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [0, 1, 'no'],
               [1, 0, 'no'],
               [1, 0, 'no']]
    labels = ['A', 'B']
    return dataSet, labels
```

The entropy of our *dataSet* can be calculated as:

**Entropy(dataSet) = -(2/5)Log2(2/5) - (3/5)Log2(3/5) = 0.9709505944546686 ~ 0.971**

Our dataSet is a collection of 5 examples with 2 YES and 3 NO examples.

In Python, first we need to count the number of 'yes' and 'no' values in the final column of *dataSet*:

```python
# number of data instances (rows) in our dataSet
numInstances = len(dataSet)
# create a dictionary whose keys are the values in the final column
keyCounts = {}
# for each instance (row)
for instance in dataSet:
    # the value in the final column
    currentKey = instance[-1]
    # if a key was not encountered previously, create it
    if currentKey not in keyCounts:
        keyCounts[currentKey] = 0
    # for each key, keep track of how many times it occurs
    keyCounts[currentKey] += 1
```

If we run this code, **keyCounts** will look like this:

`{'yes': 2, 'no': 3}`

With **keyCounts**, we can calculate the entropy of our *dataSet*:

```python
from math import log

# our entropy
entropy = 0.0
# for each key
for key in keyCounts:
    # calculate the probability of this key
    prob = float(keyCounts[key]) / numInstances
    # accumulate the entropy
    entropy -= prob * log(prob, 2)
```

And now, we can create a function called **calcEntropy** to calculate the entropy of a *dataSet*:

```python
from math import log

def calcEntropy(dataSet):
    # number of data instances (rows) in our dataSet
    numInstances = len(dataSet)
    # create a dictionary whose keys are the values in the final column
    keyCounts = {}
    # for each instance (row)
    for instance in dataSet:
        # the value in the final column
        currentKey = instance[-1]
        # if a key was not encountered previously, create it
        if currentKey not in keyCounts:
            keyCounts[currentKey] = 0
        # for each key, keep track of how many times it occurs
        keyCounts[currentKey] += 1
    # our entropy
    entropy = 0.0
    for key in keyCounts:
        # the probability of this key
        prob = float(keyCounts[key]) / numInstances
        entropy -= prob * log(prob, 2)
    return entropy
```

Now, we want to decide whether we should split the data based on the first feature or the second feature; in other words, we need to find which feature (A or B) will be the root node in our decision tree. To do this, we need to calculate the information gain of each feature.

For feature A, there are 4 occurrences of A = 1 and 1 occurrence of A = 0. For A = 1, 2 of the examples are Yes and 2 are No. For A = 0, 1 is No. So, the information gain of A is:

**Gain(S, A) = Entropy(S) - (4/5)Entropy(S1) - (1/5)Entropy(S0) = 0.971 - (4/5)(1) - (1/5)(0) ~ 0.171**

Entropy(S0) = 0 because all members of S0 belong to the same class (No)

Entropy(S1) = 1 because the collection contains an equal number of positive (Yes) and negative (No) examples

For feature B, there are 3 occurrences of B = 1 and 2 occurrences of B = 0. For B = 1, 2 of the examples are Yes and 1 is No. For B = 0, both are No. So, the information gain of B is:

**Gain(S, B) = Entropy(S) - (3/5)Entropy(S1) - (2/5)Entropy(S0)**

Entropy(S1) = -(2/3)Log2(2/3) - (1/3)Log2(1/3) = 0.9182958340544896 ~ 0.918

Entropy(S0) = 0 because all members of S0 belong to the same class (No)

**Gain(S, B) = 0.971 - (3/5)(0.918) ~ 0.420**
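As a quick sanity check, the hand calculations above can be reproduced with a few lines of plain Python. This snippet is only an illustration and is not part of the tip's code; `H` is a hypothetical helper that computes the entropy of a set from its Yes/No counts:

```python
from math import log

def H(pos, neg):
    """Entropy of a set with `pos` positive (Yes) and `neg` negative (No) examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # skip empty classes: 0 * log(0) is taken as 0
            p = count / total
            h -= p * log(p, 2)
    return h

entropy_S = H(2, 3)  # Entropy(S) ~ 0.971
gain_A = entropy_S - (4/5) * H(2, 2) - (1/5) * H(0, 1)
gain_B = entropy_S - (3/5) * H(2, 1) - (2/5) * H(0, 2)
print(round(gain_A, 3))  # 0.171
print(round(gain_B, 3))  # 0.42
```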

Clearly, the information gain of the B feature is higher than the information gain of the A feature. Therefore, it is used as the root node in our decision tree. The next question is, should the A feature be tested at the '1' branch node or the '0' branch node?

Because all examples in S0 belong to the same class, Gain(S0, A) = 0, so the '0' branch of B becomes a leaf node 'no', and the A feature is tested at the '1' branch.

This process goes on until all data is classified perfectly or we run out of features.

My tree tests B at the root: the '0' branch is a leaf ('no'), and the '1' branch tests A, whose '0' branch is 'no' and whose '1' branch is 'yes'.

The decision tree can also be expressed in rule format:

- IF a player is the Champions League Winner (B = 1) AND he has the Golden Shoe (A = 1), THEN this player will receive the Golden Ball
- IF a player is the Champions League Winner (B = 1) AND he does NOT have the Golden Shoe (A = 0), THEN this player will NOT receive the Golden Ball
- IF a player is NOT the Champions League Winner (B = 0), THEN he will NOT receive the Golden Ball

You can try predicting which player will receive the 2018 Golden Ball: Messi? Salah? Or CR7?

Now, let's look at our *dataSet* again.

In Python, the A feature is the first column of *dataSet* (index 0), the B feature is the second column (index 1), and the prediction result is the final column. In the first column of *dataSet*, there are 4 occurrences of '1' (2 Yes and 2 No) and 1 occurrence of '0' (No). How do we split the *dataSet* based on a specific feature and its value? The following **splitDataSet** function is a solution:

```python
def splitDataSet(dataSet, feature, value):
    subDataSet = []
    for row in dataSet:
        if row[feature] == value:
            # remove the column of the feature we split on
            reducedRow = row[:feature]
            reducedRow.extend(row[feature + 1:])
            subDataSet.append(reducedRow)
    return subDataSet
```

If feature = 0 and value = 1, then:

`subDataSet = [[1, 'yes'], [1, 'yes'], [0, 'no'], [0, 'no']]`

If you are a Python beginner, the code above may be confusing. If so, you can refer here to pick up a little Python knowledge.

With the **splitDataSet** function above and the entropy of our *dataSet*, we can calculate the information gain of a feature (such as A) with the following **infoGain** function:

```python
def infoGain(dataSet, feature):
    # calculate the entropy of the dataset
    baseEntropy = calcEntropy(dataSet)
    # entropy of the sub-datasets
    newEntropy = 0.0
    featList = [example[feature] for example in dataSet]
    # get the unique values, e.g. [0, 0, 1, 1, 1] -> {0, 1}
    uniqueVals = set(featList)
    # iterate over all the unique values of this feature and split the data for each value
    for value in uniqueVals:
        # split the dataset based on the feature and its value
        subDataSet = splitDataSet(dataSet, feature, value)
        # the new entropy is calculated and summed up for all the
        # unique values of this feature
        prob = len(subDataSet) / float(len(dataSet))
        newEntropy += prob * calcEntropy(subDataSet)
    # the information gain of the feature is the reduction in entropy
    gain = baseEntropy - newEntropy
    return gain
```

Try running the code:

**Gain(S, A) = infoGain(dataSet, 0) = 0.17095059445466854 ~ 0.171**

**Gain(S, B) = infoGain(dataSet, 1) = 0.4199730940219749 ~ 0.420**

When we have the information gain of features, it’s easy to find the best feature based on the best gain:

```python
def chooseBestFeature(dataSet):
    numFeatures = len(dataSet[0]) - 1
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        gain = infoGain(dataSet, i)
        if gain > bestInfoGain:
            bestInfoGain = gain
            bestFeature = i
    return bestFeature
```
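To double-check that feature B (index 1) wins, here is a condensed, self-contained sketch of the pipeline so far. It is compressed for brevity (for example, it simply returns the index of the largest gain); the article's versions above are the reference implementations:

```python
from math import log

def calcEntropy(dataSet):
    # count class occurrences in the final column, then sum -p*log2(p)
    counts = {}
    for row in dataSet:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    n = len(dataSet)
    return -sum((c / n) * log(c / n, 2) for c in counts.values())

def splitDataSet(dataSet, feature, value):
    # keep rows where the feature matches, dropping that feature's column
    return [row[:feature] + row[feature + 1:] for row in dataSet if row[feature] == value]

def infoGain(dataSet, feature):
    gain = calcEntropy(dataSet)
    for value in set(row[feature] for row in dataSet):
        sub = splitDataSet(dataSet, feature, value)
        gain -= len(sub) / len(dataSet) * calcEntropy(sub)
    return gain

def chooseBestFeature(dataSet):
    gains = [infoGain(dataSet, i) for i in range(len(dataSet[0]) - 1)]
    return gains.index(max(gains))

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [0, 1, 'no'], [1, 0, 'no'], [1, 0, 'no']]
print(chooseBestFeature(dataSet))  # 1, i.e. feature B
```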

Now, we can create our decision tree. I build it with the following Python code:

```python
import operator

# find the majority result class ('yes' or 'no'), i.e. the one that occurs with the greatest frequency
def majorityClass(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# create my decision tree
def createTree(dataSet, labels, level):
    classList = [example[-1] for example in dataSet]
    # stop when all result classes are equal
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # when there are no more features, return the majority class
    if len(dataSet[0]) == 1:
        return majorityClass(classList)
    bestFeat = chooseBestFeature(dataSet)
    bestFeatLabel = labels[bestFeat]
    if level == 0:
        myTree = "Root => " + bestFeatLabel + '\n\n'
    else:
        myTree = bestFeatLabel + '\n\n'
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    # get the unique values, e.g. [0, 0, 1, 1, 1] -> {0, 1}
    uniqueVals = set(featValues)
    # next level of the tree
    level = level + 1
    myTree += "Level " + str(level) + " => "
    # iterate over all the unique values of our chosen feature
    for value in uniqueVals:
        subLabels = labels[:]
        myTree += 'Branch ' + str(value) + ' of ' + bestFeatLabel + ' : '
        # recursively call createTree() for each split of the dataset
        myTree += createTree(splitDataSet(dataSet, bestFeat, value), subLabels, level) + ' '
    return myTree
```

If we run the code above, my decision tree will look like this:

```
Root => B

Level 1 => Branch 0 of B : no Branch 1 of B : A

Level 2 => Branch 0 of A : no Branch 1 of A : yes
```
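The tree above is stored as a plain string. For illustration only, a hypothetical `classify` helper (not part of the tip's code) could apply the same learned rules to a new example:

```python
def classify(a, b):
    """Apply the learned tree: B is tested at the root, then A.
    `a` = Has Golden Shoe (1/0), `b` = Is Champions League Winner (1/0).
    This helper is an illustration, not part of decisionTree.py."""
    if b == 0:
        # the '0' branch of B is a leaf: no
        return 'no'
    # the '1' branch of B tests A
    return 'yes' if a == 1 else 'no'

print(classify(1, 1))  # yes
print(classify(0, 1))  # no
print(classify(1, 0))  # no
```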

You can try plotting the decision tree for yourself using matplotlib.

So far, we can collect and order all of our functions above and put them together into a file called **decisionTree.py**. The content of this file may look like this:

```python
from math import log
import operator

def createDataSet(): …
def calcEntropy(dataSet): …
def splitDataSet(dataSet, feature, value): …
def infoGain(dataSet, feature): …
def chooseBestFeature(dataSet): …
def majorityClass(classList): …
def createTree(dataSet, labels, level): …
```

You can get the complete version of this file here.

Next, we can create a file called **Test.py** to test our functions as follows:

```python
import decisionTree

# create a dataset and labels
myDat, labels = decisionTree.createDataSet()
print(myDat)
# calculate the entropy of the dataset
entropy = decisionTree.calcEntropy(myDat)
print(entropy)
# calculate the information gain of the A feature
gainA = decisionTree.infoGain(myDat, 0)
print(gainA)
# calculate the information gain of the B feature
gainB = decisionTree.infoGain(myDat, 1)
print(gainB)
# split the dataset based on the A feature and its value 1
splitA = decisionTree.splitDataSet(myDat, 0, 1)
print(splitA)
# choose the best feature
best = decisionTree.chooseBestFeature(myDat)
print(best)
# create my decision tree
tree = decisionTree.createTree(myDat, labels, 0)
print(tree)
```

## Points of Interest

In Part 1 and Part 2 (this tip) of my series, we have drawn hard conclusions about data such as "This data instance is in this class!" The next question will be: "What if we assign a probability to a data instance belonging to a given class?" The answer will be discussed in the next tip.
