**k-Nearest Neighbors (KNN)** uses a distance metric to find the k most similar instances (the neighbors) in the training data and takes their **mean** outcome (for a **regression problem**) or their **mode** (for a **classification problem**) as the prediction.

Note: **k-Nearest Neighbors** is a **non-linear** machine learning (ML) algorithm. **KNN** can require a lot of memory because it keeps the training data around to compute distances at prediction time, so it is advisable to include only the most relevant input variables.
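To make the description above concrete, here is a minimal NumPy sketch of the idea. It is not how scikit-learn implements KNN internally; the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_query, k=3, task="regression"):
    """Predict for one query point from its k nearest training points."""
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "regression":
        # regression: mean of the neighbors' target values
        return float(neighbor_targets.mean())
    # classification: most common label among the neighbors (the mode)
    return Counter(neighbor_targets.tolist()).most_common(1)[0][0]


# toy data (made up for illustration): one feature, five training points
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_values = np.array([1.0, 2.0, 3.0, 10.0, 11.0])  # regression targets
y_labels = np.array([0, 0, 0, 1, 1])              # class labels

query = np.array([2.5])
print(knn_predict(X_train, y_values, query, k=3, task="regression"))      # 2.0
print(knn_predict(X_train, y_labels, query, k=3, task="classification"))  # 0
```

The three nearest points to 2.5 are 2.0, 3.0 and 1.0, so the regression prediction is their mean and the classification prediction is their shared label. Note that every training row takes part in the distance calculation, which is where the memory cost comes from.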

Medium Post: Top 10 algorithms for ML newbies

This **recipe** includes the following topics:

- Load the **regression problem** Boston house price dataset from GitHub
- Split columns into the usual feature columns (X) and target/prediction column (Y)
- Split data using the **KFold()** class with **kFold**: 10, **seed**: 7
- Instantiate a regression model **(KNeighborsRegressor)**
- Set scoring parameter to **'neg_mean_squared_error'**
- Call **cross_val_score()** to run cross validation
- Calculate **Mean Squared Error** from scores returned by **cross_val_score()**

**Caveat:** **cross_val_score()** follows the scikit-learn convention that a larger score is better. **MSE** works the other way around (the smallest value is best). Thus we use **'neg_mean_squared_error'**, which negates the MSE so that larger is still better. As a result the reported score is negative, even though MSE itself can never be negative.
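The sign convention can be checked with a small sketch. The data below is synthetic and made up for illustration, and the names `X_demo`, `y_demo`, `neg_mse` and `manual_mse` are hypothetical. The sketch computes the MSE per fold by hand and compares it with the negated scores from **cross_val_score()**.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# synthetic regression data, made up for illustration only
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(60)

cv = KFold(n_splits=5)  # unshuffled, so the folds are deterministic
neg_mse = cross_val_score(KNeighborsRegressor(), X_demo, y_demo,
                          cv=cv, scoring='neg_mean_squared_error')

# recompute the plain (positive) MSE for each fold by hand
manual_mse = []
for train_idx, test_idx in cv.split(X_demo):
    knn = KNeighborsRegressor().fit(X_demo[train_idx], y_demo[train_idx])
    pred = knn.predict(X_demo[test_idx])
    manual_mse.append(mean_squared_error(y_demo[test_idx], pred))
manual_mse = np.array(manual_mse)

print(neg_mse)    # every entry is <= 0
print(-neg_mse)   # flipping the sign gives the usual MSE per fold
print(np.allclose(-neg_mse, manual_mse))
```

If the two arrays agree, flipping the sign of the reported scores is all that is needed to read them as ordinary MSE values.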

```
# import modules
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# read data file from github
# dataframe: houseDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/housing.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
houseDf = pd.read_csv(gitFileURL, sep=r'\s+', names=cols)
# convert into numpy array for scikit-learn
houseArr = houseDf.values
# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'MEDV' column
# MEDV: median value of owner-occupied homes in $1000s
X = houseArr[:, 0:13]
Y = houseArr[:, 13]
# set k-fold count
folds = 10
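# seed for reproducibility; it only takes effect when KFold shuffles (shuffle=True)
seed = 7
# split data using KFold without shuffling, so the folds are deterministic;
# recent scikit-learn versions raise an error if random_state is passed with shuffle=False
kfold = KFold(n_splits=folds)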
# instantiate a regression model
model = KNeighborsRegressor()
# set scoring parameter to 'neg_mean_squared_error'
scoring = 'neg_mean_squared_error'
# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
# calculate mean of scores for all folds
mse = resultArr.mean()
# display Mean Squared Error
# the score is the negated MSE (scikit-learn treats larger scores as better), so it prints as a negative number
print("Mean Squared Error: %.3f" % mse)
```

```
Mean Squared Error: -107.287
```
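To read the result in more familiar terms, the sign can be flipped and a square root taken. This is a small follow-up sketch, not part of the original recipe; the variable names are hypothetical and the value is copied from the output above.

```python
import math

neg_mse = -107.287       # mean score printed above
mse = -neg_mse           # flip the sign to get the usual MSE
rmse = math.sqrt(mse)    # same units as MEDV, i.e. $1000s

print("MSE:  %.3f" % mse)   # MSE:  107.287
print("RMSE: %.3f" % rmse)  # RMSE: 10.358
```

Note that this is the square root of the averaged MSE, which is generally not the same as the average of the per-fold RMSE values.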