
An introduction to basic algorithms of Machine Learning (Part 1)


Introduction

In machine learning, you look at instances of data. Each instance of data is composed of a number of features. Classification, one of the most popular and essential tasks of machine learning, is used to place an unknown piece of data into a known group. In this tip, I'll discuss our first machine-learning algorithm: k-Nearest Neighbors (kNN), which is easy to grasp and very effective.

I have also installed Anaconda and noted a few points about running Python code from the Anaconda Prompt. You can read more here.

Background

Before going to the k-Nearest Neighbors (kNN for short) algorithm, I will explain a little bit about the idea behind kNN, about data structures, about Python code, and about linear algebra.

Assume that I have four neighbors A, B, C, and D as follows:

A (1.0, 1.1)

B (1.0, 1.0)

C (0.0, 0.0)

D (0.0, 0.1)

And me:

O (0.0, 0.0)

Our problem is to find the nearest neighbors to me. The solution is to calculate the distances between the neighbors and me, and then pick the lowest distances.

We can calculate the distances between O and A, B, C, or D using the Euclidean distance. For instance, if we have O(xO, yO) and A(xA, yA), then

OA = √((xA − xO)² + (yA − yO)²)
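As a minimal sketch of this formula (plain Python using only the math module; O and A are taken from the list above):

import math

def euclidean(o, a):
    # straight-line distance between two 2-D points o and a
    return math.sqrt((a[0] - o[0])**2 + (a[1] - o[1])**2)

print(euclidean((0.0, 0.0), (1.0, 1.1)))   # OA is about 1.4866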

For simplicity of calculation and programming, we are going to organize A, B, C, and D in a matrix neighbors:

neighbors =
| 1.0  1.1 |
| 1.0  1.0 |
| 0.0  0.0 |
| 0.0  0.1 |

In Python, we can write neighbors as an array:

neighbors = array([[1.0,1.1], [1.0,1.0], [0.0,0.0], [0.0,0.1]])

And we also organize O in a matrix me:

me = | 0.0  0.0 |

We will implement some calculations on neighbors and me, so neighbors and me must have the same size. We can convert the me matrix into a newMe matrix:

newMe =
| 0.0  0.0 |
| 0.0  0.0 |
| 0.0  0.0 |
| 0.0  0.0 |

In Python, we can write me as an array (like neighbors) and we don’t need newMe:

me = array([[0,0], [0,0], [0,0],[0,0]]) 

but we can also write me as a list:

me = [0,0]

and then use the tile function to convert it:

newMe = tile(me, (4,1))
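To see what tile does, here is a quick, hypothetical check (assuming NumPy is available as np):

import numpy as np

me = [0, 0]
newMe = np.tile(me, (4, 1))   # repeat the row [0, 0] four times
print(newMe)
# [[0 0]
#  [0 0]
#  [0 0]
#  [0 0]]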

We need to implement some intermediate calculations:

diffMat = neighbors − newMe =
| 1.0  1.1 |
| 1.0  1.0 |
| 0.0  0.0 |
| 0.0  0.1 |

sqDiffMat =
| 1.00  1.21 |
| 1.00  1.00 |
| 0.00  0.00 |
| 0.00  0.01 |

In Python, we can write:

sqDiffMat = diffMat**2

sqDistances = [ 2.21  2.00  0.00  0.01 ]

In Python, we can use the sum function to calculate sqDistances from sqDiffMat:

sqDistances = sqDiffMat.sum(axis=1)

As the final result, we have a matrix that contains the distances:

distancesMat = [ 1.4866  1.4142  0.0000  0.1000 ]

In Python, we can write:

distancesMat = sqDistances**0.5

Now it's easy to find the nearest neighbors to me based on OA, OB, OC, and OD.
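Putting the intermediate steps together, here is a small self-contained sketch of the whole calculation (the values follow the Background example; the variable order is just a hypothetical name for the sorted indices):

import numpy as np

neighbors = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
newMe = np.tile([0.0, 0.0], (4, 1))

diffMat = neighbors - newMe
sqDiffMat = diffMat**2
sqDistances = sqDiffMat.sum(axis=1)   # one squared distance per neighbor
distancesMat = sqDistances**0.5       # [1.4866, 1.4142, 0.0, 0.1]

order = distancesMat.argsort()        # indices sorted by distance: [2 3 1 0]
print('Nearest neighbor:', 'ABCD'[order[0]])   # C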

Using the code

We will create a kNN.py file that includes two functions. First, I write a function that creates a dataset with a given number of neighbors at random positions and assigns a name to each neighbor:

def createDataSet(numofNeighbors):
    neighbors = np.random.rand(numofNeighbors, 2)
    # assign A0 to the first neighbor, A1 to the second neighbor, ...
    names = [''] * numofNeighbors
    for i in range(0, numofNeighbors):
        names[i] = 'A' + str(i)
    return neighbors, names
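As a quick, hypothetical usage check (assuming the imports listed below are in place; the positions are random, so your values will differ):

neighbors, names = createDataSet(5)
print(neighbors.shape)   # (5, 2) -- five random points in the unit square
print(names)             # ['A0', 'A1', 'A2', 'A3', 'A4']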

The second function is my kNN algorithm:

# For every point in our dataset:
def classify(me, neighbors, names, k):
    numofNeighbors = neighbors.shape[0]
    # calculate the distance between me and every neighbor
    diffMat = tile(me, (numofNeighbors, 1)) - neighbors
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distancesMat = sqDistances**0.5
    # sort the distances in increasing order
    sortedDistIndicies = distancesMat.argsort()
    # take the k items with the lowest distances to me
    classCount = {}
    for i in range(k):
        voteIname = names[sortedDistIndicies[i]]
        classCount[voteIname] = classCount.get(voteIname, 0) + 1
    # find the majority class among these items
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the majority class as our prediction for the class of me
    return sortedClassCount
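As a small, hypothetical sanity check, we can feed classify the four neighbors A, B, C, and D from the Background section (assuming kNN.py's imports, shown below, are in place):

import numpy as np

neighbors = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
names = ['A', 'B', 'C', 'D']
print(classify([0, 0], neighbors, names, 2))
# expected: [('C', 1), ('D', 1)] -- C and D are the two nearest neighbors to me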

At the top of kNN.py, we need a few imports:

from numpy import *
import operator
import numpy as np

Note that you need to put the kNN.py file in the folder where you installed Anaconda.

To test the algorithm above, we create a Test.py file that contains the following Python code:

from numpy import *
import matplotlib.pyplot as plt
import kNN

# create 5 neighbors
neighbors, names = kNN.createDataSet(5)
# find the two nearest neighbors to me
result = kNN.classify([0, 0], neighbors, names, 2)
# x and y save the positions of my neighbors
x = [0] * neighbors.shape[0]
y = [0] * neighbors.shape[0]
for i in range(0, neighbors.shape[0]):
    x[i] = neighbors[i][0]
    y[i] = neighbors[i][1]
# display my neighbors in blue
plt.plot(x, y, 'bo')
plt.axis([-0.2, 1.2, -0.2, 1.2])
# label each neighbor with its name
for i, name in enumerate(names):
    plt.annotate(name, (x[i], y[i]), (x[i] - 0.08, y[i] + 0.01))
# display me in red
plt.plot([0], [0], 'ro')
# display the two nearest neighbors with messages and yellow color
for i, name in enumerate(names):
    for r in result:
        if name == r[0]:
            plt.plot([x[i]], [y[i]], 'yo')
            plt.annotate('I am here', (x[i], y[i]), (x[i] + 0.01, y[i] - 0.05))
plt.show()

We run Test.py using the Anaconda Prompt; the result can look like this:

[Figure: five neighbors plotted in blue, me in red, and the two nearest neighbors highlighted in yellow]

You can test with 10 neighbors:

[Figure: the same test with 10 neighbors]

And you can also find the 4 nearest neighbors:

[Figure: the 4 nearest neighbors highlighted]

Points of Interest

The k-Nearest Neighbors algorithm is a simple and effective way to classify data, but it has to keep the whole dataset around and compute a distance to every instance, which can be slow on large datasets. An additional drawback is that kNN doesn't give you any idea of the underlying structure of the data. In the next tip, I'll address this issue by exploring ways in which probability measurements can help you do classification.
