## Introduction

In machine learning, you look at instances of data. Each instance of data is composed of a number of features. Classification, one of the most popular and essential tasks in machine learning, is used to place an unknown piece of data into a known group. In this tip, I’ll discuss our first machine-learning algorithm: k-Nearest Neighbors. k-Nearest Neighbors is easy to grasp and very effective.

I have also installed Anaconda, and I will run the Python code in this tip from Anaconda Prompt.

## Background

Before getting to the k-Nearest Neighbors (kNN for short) algorithm, I will explain a little about the idea behind kNN, the data structures involved, the Python code, and a bit of linear algebra.

Assume that I have four neighbors A, B, C, and D as follows:

A (1.0, 1.1)

B (1.0, 1.0)

C (0.0, 0.0)

D (0.0, 0.1)

And me:

O (0.0, 0.0)

Our problem is to find the nearest neighbors to me. The solution is to calculate the distances between the neighbors and me, and then pick the lowest distances.

We can calculate the distance between O and A, B, C, or D by using the Euclidean distance. For instance, if we have O(x_{O}, y_{O}) and A(x_{A}, y_{A}), then

OA = \sqrt{(x_{A} - x_{O})^{2} + (y_{A} - y_{O})^{2}}
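For example, the distance OA can be computed directly in Python. This is just a small illustration; the `distance` helper name is mine, not part of the kNN code below:

```python
import math

def distance(p, q):
    # Euclidean distance between two 2-D points
    return math.sqrt((q[0] - p[0])**2 + (q[1] - p[1])**2)

O = (0.0, 0.0)
A = (1.0, 1.1)
print(distance(O, A))  # sqrt(1.0 + 1.21) = sqrt(2.21) ≈ 1.4866
```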

For simplicity of calculation and programming, we are going to organize A, B, C, and D as the rows of a matrix **neighbors**.

In Python, we can write **neighbors** as an array:

**neighbors** = array([[1.0,1.1], [1.0,1.0], [0.0,0.0], [0.0,0.1]])

And we also organize O in a matrix **me**.

We will implement some calculations on **neighbors** and **me**, so **neighbors** and **me** must have the same size. We can convert the **me** matrix to a **newMe** matrix that repeats O once for each neighbor.

In Python, we can write **me** as an array (like **neighbors**), and then we don’t need **newMe**:

**me** = array([[0,0], [0,0], [0,0],[0,0]])

but we can also write **me** as a list:

**me** = [0,0]

and then use the tile function to convert it:

`newMe = tile(me, (4,1))`
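To see what tile produces, here is a quick sketch using NumPy’s `np.tile`:

```python
import numpy as np

me = [0, 0]
# repeat the row [0, 0] four times (once per neighbor)
newMe = np.tile(me, (4, 1))
print(newMe.shape)  # (4, 2)
```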

We need to implement some intermediate calculations:

**diffMat** = **neighbors** − **newMe** = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]

**sqDiffMat** = [[1.0, 1.21], [1.0, 1.0], [0.0, 0.0], [0.0, 0.01]] (each element of **diffMat** squared)

In Python, we can write:

`sqDiffMat = diffMat**2`

**sqDistances** = [2.21, 2.0, 0.0, 0.01] (the sum of each row of **sqDiffMat**)

In Python, we can use the sum function to calculate **sqDistances **from **sqDiffMat**:

`sqDistances = sqDiffMat.sum(axis=1)`

As the final result, we have a matrix that contains the distances OA, OB, OC, and OD:

**distancesMat** ≈ [1.4866, 1.4142, 0.0, 0.1]

In Python, we can write:

`distancesMat = sqDistances**0.5`

Now it’s easy to find the nearest neighbors to me based on OA, OB, OC, and OD.
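Putting the intermediate steps together, here is a small sketch of the whole distance calculation with NumPy:

```python
import numpy as np

neighbors = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
me = [0, 0]
newMe = np.tile(me, (4, 1))          # repeat me once per neighbor
diffMat = neighbors - newMe          # coordinate differences
sqDiffMat = diffMat**2               # squared differences
sqDistances = sqDiffMat.sum(axis=1)  # sum over x and y per row
distancesMat = sqDistances**0.5      # the distances OA, OB, OC, OD
print(distancesMat)                  # approximately [1.4866, 1.4142, 0.0, 0.1]
```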

## Using the code

We will create the **kNN.py** file that includes two functions. First, I write a function to create a dataset with a given number of randomly placed neighbors, assigning a name to each neighbor:

```python
def createDataSet(numofNeighbors):
    neighbors = np.random.rand(numofNeighbors, 2)
    # assign A0 to the first neighbor, A1 to the second neighbor, ...
    names = [''] * numofNeighbors
    for i in range(0, numofNeighbors):
        names[i] = 'A' + str(i)
    return neighbors, names
```
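For example, calling the function for three neighbors looks like this (the definition is repeated here so the snippet runs on its own; the positions are random, so yours will differ):

```python
import numpy as np

# createDataSet repeated so the snippet is self-contained
def createDataSet(numofNeighbors):
    neighbors = np.random.rand(numofNeighbors, 2)
    names = [''] * numofNeighbors
    for i in range(0, numofNeighbors):
        names[i] = 'A' + str(i)
    return neighbors, names

neighbors, names = createDataSet(3)
print(neighbors.shape)  # (3, 2)
print(names)            # ['A0', 'A1', 'A2']
```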

The second function is my kNN algorithm:

```python
def classify(me, neighbors, names, k):
    numofNeighbors = neighbors.shape[0]
    # calculate the distance between me and every neighbor
    diffMat = tile(me, (numofNeighbors, 1)) - neighbors
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distancesMat = sqDistances**0.5
    # sort the distances in increasing order
    sortedDistIndicies = distancesMat.argsort()
    # take the k items with the lowest distances to me
    classCount = {}
    for i in range(k):
        voteIname = names[sortedDistIndicies[i]]
        classCount[voteIname] = classCount.get(voteIname, 0) + 1
    # sort the classes by vote count in decreasing order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the sorted votes; the first entry is the majority class
    return sortedClassCount
```
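To see the function at work, we can run it on the four fixed neighbors from the Background section (the function is repeated here so the snippet is self-contained). With k = 2, the nearest neighbors C and D each get one vote:

```python
import operator
import numpy as np
from numpy import tile

# classify repeated so the snippet is self-contained
def classify(me, neighbors, names, k):
    numofNeighbors = neighbors.shape[0]
    diffMat = tile(me, (numofNeighbors, 1)) - neighbors
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distancesMat = sqDistances**0.5
    sortedDistIndicies = distancesMat.argsort()
    classCount = {}
    for i in range(k):
        voteIname = names[sortedDistIndicies[i]]
        classCount[voteIname] = classCount.get(voteIname, 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

neighbors = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
names = ['A', 'B', 'C', 'D']
print(classify([0, 0], neighbors, names, 2))  # [('C', 1), ('D', 1)]
```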

At the top of **kNN.py**, we need to import a few things:

```python
from numpy import *
import operator
import numpy as np
```

Note that you need to put the **kNN.py** file in the folder from which you run Python (for example, your **Anaconda Prompt** working folder), so that it can be imported.

To test the algorithm above, we create the **Test.py** file that contains some Python code as follows:

```python
from numpy import *
import matplotlib.pyplot as plt
import kNN

# create 5 neighbors
neighbors, names = kNN.createDataSet(5)
# find the two nearest neighbors to me
result = kNN.classify([0, 0], neighbors, names, 2)
# x and y save the positions of my neighbors
x = [0] * neighbors.shape[0]
y = [0] * neighbors.shape[0]
for i in range(0, neighbors.shape[0]):
    x[i] = neighbors[i][0]
    y[i] = neighbors[i][1]
# display my neighbors in blue
plt.plot(x, y, 'bo')
plt.axis([-0.2, 1.2, -0.2, 1.2])
# assign names to neighbors
for i, name in enumerate(names):
    plt.annotate(name, (x[i], y[i]), (x[i] - 0.08, y[i] + 0.01))
# display me in red
plt.plot([0], [0], 'ro')
# display the two nearest neighbors with messages and yellow color
for i, name in enumerate(names):
    for r in result:
        if name == r[0]:
            plt.plot([x[i]], [y[i]], 'yo')
            plt.annotate('I am here', (x[i], y[i]), (x[i] + 0.01, y[i] - 0.05))
plt.show()
```

We run **Test.py** using **Anaconda Prompt**, and the result can look like this:

You can test with 10 neighbors:

And you can also find the 4 nearest neighbors:

## Points of Interest

The k-Nearest Neighbors algorithm is a simple and effective way to classify data. A drawback is that kNN doesn’t give you any idea of the underlying structure of the data. In the next tip, I’ll address this issue by exploring ways in which probability measurements can help you do classification.
