K-Nearest Neighbor (KNN)
- KNN is a simple supervised learning algorithm used for both regression and classification problems.
- KNN stores all available cases and classifies new cases based on their similarity to the stored cases.
Concept: KNN works on similarity measurement. For example, a mango is more similar to an apple than to a dog or a cat, so KNN puts it in the category of fruits, not in the category of animals.
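To make "similarity" concrete, here is a minimal sketch that measures it as Euclidean distance between feature vectors. The text does not name a distance metric, so Euclidean distance is an assumption here, and the feature values below are invented purely for illustration.

```python
import numpy as np

# Hypothetical feature vectors, invented for illustration:
# [sweetness (0-10), number of legs, grows on a tree (0 or 1)]
items = {
    "apple": np.array([7.0, 0.0, 1.0]),
    "dog":   np.array([0.0, 4.0, 0.0]),
    "cat":   np.array([0.0, 4.0, 0.0]),
}
mango = np.array([9.0, 0.0, 1.0])

# Euclidean distance: the smaller the distance, the more similar the items.
distances = {name: float(np.linalg.norm(mango - vec)) for name, vec in items.items()}
closest = min(distances, key=distances.get)

print(distances)
print("Mango is most similar to:", closest)
```

With these made-up numbers the mango lands closest to the apple, which is why it would be grouped with fruits.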
What is K in KNN
After training the model, we want to test it, that is, classify new data (test data). To do this, KNN looks at the K training points nearest to the new point and assigns the most common class among them to the test data.
K: the number of nearest neighbors
K=1 means the test point is given the same label as the closest example in the training set.
K=4 means the labels of the four closest training examples are checked and the most common class is assigned to the test point.
How does KNN work?
Let's understand it with the diagram given above.
- In this diagram we have two classes: a blue class and a red class.
- Now we have a new green point, and we have to find out whether it belongs to the red class or the blue class.
- For this, we first define the value of K.
- At K=1, we compute the distance from the green point to the training points, select the point with the lowest distance, and assign the green point to that point's class, here red.
- At K=5, we compute the distance from the green point to the training points, select the five points with the lowest distances, and assign the green point to the most common class among them, here red.
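- How to choose the value of K? There is no single correct value of K; it depends on the dataset. A small K is sensitive to noisy points, while a large K smooths over local structure, so K is usually picked by trying several values and comparing accuracy on held-out data.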
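The steps above can be sketched from scratch in a few lines. The coordinates below are made up to stand in for the diagram, the helper name `knn_predict` is hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import numpy as np
from collections import Counter

# Made-up 2D training points standing in for the diagram
X_train = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5],   # red points
    [4.5, 5.0], [5.5, 4.5], [6.0, 6.0],               # blue points
])
y_train = np.array(["red", "red", "red", "red", "blue", "blue", "blue"])
green = np.array([3.0, 3.0])   # the new point to classify


def knn_predict(X, y, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X - point, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]             # indices of the k smallest distances
    votes = Counter(y[nearest].tolist())        # count the labels of those neighbors
    return votes.most_common(1)[0][0]


pred_k1 = knn_predict(X_train, y_train, green, k=1)
pred_k5 = knn_predict(X_train, y_train, green, k=5)
print("K=1:", pred_k1)
print("K=5:", pred_k5)
```

With these points the green point is classified as red for both K=1 and K=5, matching the walkthrough.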
Lazy Learner
- KNN is a simple algorithm for classification, but that is not why it is called lazy.
- KNN is a lazy learner because it doesn't learn a discriminative function from the training data; it memorizes the training dataset instead.
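A small sketch can make the "lazy" part visible. The class `LazyKNN` below is hypothetical and written only for illustration (it is not how sklearn implements KNN): its `fit` does nothing except store the data, and all the real work is deferred to `predict`.

```python
import numpy as np


class LazyKNN:
    """Hypothetical minimal KNN classifier, for illustration only."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only memorizes the data; no function or parameters are learned.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for point in np.asarray(X, dtype=float):
            # All computation happens here, at prediction time.
            dists = np.linalg.norm(self.X_ - point, axis=1)
            nearest = np.argsort(dists)[: self.k]
            labels, counts = np.unique(self.y_[nearest], return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)


model = LazyKNN(k=3).fit(
    [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]],
    [0, 0, 0, 1, 1, 1],
)
result = model.predict([[0.5, 0.5], [5.5, 5.5]])
print(result)
```

The trade-off is that fitting is cheap, but every prediction has to scan the stored training set.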
KNN Algorithm
Let's understand the KNN algorithm with the iris flower problem.
Data: The dataset consists of 150 instances (samples), 4 features, and 3 classes (targets).
Problem: Using the four features, we have to classify which category each flower belongs to.
Importing Data-set
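```python
import sklearn
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()                 # load the built-in iris dataset
print(iris.keys())                 # fields available in the dataset
df = pd.DataFrame(iris['data'])    # feature values as a DataFrame
print(df)
print(iris['target_names'])        # class names
print(iris['feature_names'])       # feature names
```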
Output
Note:
- Now we need a target and data so that we can train the model.
- We have to predict the class from the features we have.
- Following this logic, our target is the classes (0, 1, 2), stored in iris['target'], and our data is in df.
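As a quick check of how the features and the target line up, the sketch below puts them side by side. The variable names `features`, `target`, and `species` are illustrative choices, not names used elsewhere in this post.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature table with readable column names
features = pd.DataFrame(iris['data'], columns=iris['feature_names'])

# Integer class labels (0, 1, 2) and their readable names
target = pd.Series(iris['target'], name='target')
names = [str(n) for n in iris['target_names']]
species = target.map(dict(enumerate(names)))

print(features.head())
print(species.value_counts())   # number of samples per class
```

Each row of `features` is one flower, and the entry at the same position in `target` is the class we want the model to predict.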
Splitting Data
- The data is split so that we can train the model on one part and test it on the remaining part to check how well the model performs.
- sklearn has a built-in function for this.
```python
from sklearn.model_selection import train_test_split

X = df               # features
y = iris['target']   # class labels (0, 1, 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
Note: This holds out 33% of the data as testing data; the remaining 67% is the training data.
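To see what the split produces, the following sketch prints the sizes of the two parts and checks that fixing `random_state` gives the same split on every run. It reloads the data so that it can be run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris['data'], iris['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print("train:", X_train.shape, "test:", X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
same_split = bool((X_test == X_test_again).all())
print("same split:", same_split)
```

Without a fixed `random_state`, the rows are shuffled differently each time, so the accuracy reported later can vary from run to run.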
KNN Classifier and Training of the Model
```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X_train, y_train)                   # train on the training split
```
Note:
- KNeighborsClassifier implements the KNN concept. Here we have taken the number of neighbors (K) = 3.
- First, it calculates the distance from the test point to all the training points and then selects the three points with the lowest distances.
- The test data point is then assigned to the most common class among those three.
Note: The model is trained by calling knn.fit(X_train, y_train) with the feature values (data) and the target values (target).
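The neighbor selection described above can be inspected directly with the classifier's `kneighbors` method. This is a self-contained sketch that rebuilds the same split; the choice of the first test row as the example point is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

test_point = X_test[:1]                               # one test sample, kept 2D
distances, indices = knn.kneighbors(test_point)       # 3 nearest training points
neighbor_labels = y_train[indices[0]]                 # their class labels
vote = int(np.bincount(neighbor_labels).argmax())     # most common label
predicted = int(knn.predict(test_point)[0])

print("distances:", distances[0])
print("neighbor labels:", neighbor_labels)
print("majority vote:", vote, "predicted:", predicted)
```

The majority label among the three neighbors is the class the model predicts for that point.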
Prediction and Accuracy
Demo:
- Here we show a prediction for a single data point.
- We have a data point x_new.
```python
import numpy as np
x_new = np.array([[5, 2.9, 1, 0.2]])
```
Now we want to see the class or category of this point.
```python
prediction = knn.predict(x_new)
iris['target_names'][prediction]
```
Output
Note: As we can see, our point belongs to class 0 (setosa). This demo is just for understanding.
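Besides the predicted class, it can be useful to see how the neighbors voted. A sketch using `predict_proba`, which with the default uniform weights reports the share of the K neighbors belonging to each class; the setup is repeated so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

x_new = np.array([[5, 2.9, 1, 0.2]])
proba = knn.predict_proba(x_new)   # one row per sample, one column per class

for name, share in zip(iris['target_names'], proba[0]):
    print(name, share)
```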
```python
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

y_pred = knn.predict(X_test)             # predict classes for the test split
cm = confusion_matrix(y_test, y_pred)    # rows: true class, columns: predicted class
print(cm)
print("correct prediction", accuracy_score(y_test, y_pred))
print("wrong prediction", 1 - accuracy_score(y_test, y_pred))
```
Output
Note: As you can see in the confusion matrix, only one prediction is wrong, and our accuracy is 0.98 (98%).
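The accuracy above was obtained with K=3. Since the best K depends on the data, one common way to compare several values is cross-validation. The sketch below is one possible approach, not the only one; the range of K values and the 5 folds are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris['data'], iris['target']

scores = {}
for k in range(1, 16, 2):                        # odd values of K from 1 to 15
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores[k] = float(cross_val_score(model, X, y, cv=5).mean())

best_k = max(scores, key=scores.get)
for k, acc in scores.items():
    print(f"K={k:2d}  mean accuracy={acc:.3f}")
print("best K:", best_k)
```

Odd values are used here only to cut down on tied votes. With three classes, ties are still possible.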
Practice KNN - We have a dataset containing information about social network users and whether or not they are interested in buying an SUV.
Dataset: user_data.csv