2.1.13.2.KNN with Python

1. Import the basic libraries

  import pandas as pd
  import numpy as np
  import matplotlib.pyplot as plt
  import seaborn as sns

Render plots directly inside the notebook
```
  %matplotlib inline
```

2. Load and inspect the data

  # Read the dataset, using its first column as the index
  df = pd.read_csv('Classified Data', index_col=0)
  df.head()

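3. Standardize the data

KNN classifies a point by its distance to other points, so a variable measured on a large scale can dominate the result. For that reason the features are usually brought onto a common scale before fitting a KNN classifier.

Use StandardScaler

StandardScaler subtracts the mean from each feature and divides by its standard deviation: (X - mean) / std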
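
To make the formula concrete, here is a minimal sketch in plain Python rather than scikit-learn. The helpers `standardize` and `nearest_to_first` and the age/income numbers are hypothetical, made up for illustration. `pstdev` is the population standard deviation (dividing by n), which matches what StandardScaler divides by.

```python
from math import dist
from statistics import mean, pstdev

def standardize(column):
    """Return (x - mean) / std for one feature column (population std)."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

def nearest_to_first(points):
    """Index of the point closest to points[0] by Euclidean distance."""
    return min(range(1, len(points)), key=lambda i: dist(points[0], points[i]))

# Hypothetical observations: age in years, income in dollars
ages = [25.0, 60.0, 26.0]
incomes = [50000.0, 50500.0, 50600.0]

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

print("nearest to row 0 (raw):   ", nearest_to_first(raw))
print("nearest to row 0 (scaled):", nearest_to_first(scaled))
```

On this toy data the raw distances pick the 60-year-old (row 1) as the nearest neighbor of the 25-year-old, because the income gap swamps the age gap. After standardizing, the 26-year-old (row 2) is nearest.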

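Drop the TARGET CLASS column from the original data, fit the scaler on the remaining features, then transform them

  from sklearn.preprocessing import StandardScaler
  scaler = StandardScaler()
  # Learn each feature's mean and standard deviation (label column excluded)
  scaler.fit(df.drop("TARGET CLASS", axis=1))
  # Apply (X - mean) / std to every feature
  scaled_features = scaler.transform(df.drop("TARGET CLASS", axis=1))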

Convert the standardized data back into a DataFrame

  # df.columns[:-1] assumes TARGET CLASS is the last column
  df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
  df_feat.head()

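4. Use the scikit-learn library

Start with train_test_split, which randomly splits the data into a training set and a test set

  from sklearn.model_selection import train_test_split
  X = df_feat
  Y = df["TARGET CLASS"]
  # Hold out 40% of the rows for testing; random_state makes the split reproducible
  X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)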
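
As a rough picture of what a random split does, the sketch below shuffles row indices with a seeded generator and holds out a fixed share. `simple_train_test_split` is a hypothetical helper written for illustration. It is not scikit-learn's implementation and will not reproduce the exact rows that train_test_split selects.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.4, seed=101):
    """Shuffle row indices with a seeded RNG and hold out a test_size share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Ten hypothetical rows with alternating labels
rows = [[float(i), float(i) * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = simple_train_test_split(rows, labels)
print(len(X_train), len(X_test))
```

Reusing the same seed returns the same split, which is the role random_state plays in the call above.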

5. Predict with a KNN classifier

Set n_neighbors (K) to 1

  from sklearn.neighbors import KNeighborsClassifier
  knn = KNeighborsClassifier(n_neighbors = 1)
  knn.fit(X_train, y_train)
  predictions = knn.predict(X_test)
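
For intuition about what the classifier does at prediction time, here is a from-scratch sketch: sort the training points by distance to the query and take a majority vote among the K closest. `knn_predict` and the toy points are hypothetical. KNeighborsClassifier is more elaborate than this sketch (it can use tree-based neighbor search, for example).

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=1):
    """Majority vote among the k training points closest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two hypothetical clusters: class 0 near (1, 1), class 1 near (3, 3)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=1))
print(knn_predict(train_points, train_labels, (3.0, 3.0), k=3))
```

With K = 1 the prediction is simply the label of the single closest training point, which is why a very small K tends to be sensitive to noisy points.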

6. Evaluate the model's accuracy

Use classification_report

  from sklearn.metrics import classification_report
  print(classification_report(y_test, predictions))

Use confusion_matrix

  from sklearn.metrics import confusion_matrix
  print(confusion_matrix(y_test, predictions))
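
The numbers in both reports come from four counts. The sketch below computes them by hand for a binary problem, using the same layout as confusion_matrix (rows are actual classes, columns are predicted classes). `confusion_counts` and the label lists are hypothetical, made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the predicted 1s, how many were right
recall = tp / (tp + fn)      # of the actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(confusion_counts(y_true, y_pred))
print(round(precision, 3), round(recall, 3), round(f1, 3), accuracy)
```

classification_report lists precision, recall and f1-score per class; the values here are for class 1 only.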

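7. Tune the value of K

For each K, fit the model and record the error rate: the mean of the elementwise mismatches between the predictions and y_test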

  error_rate = []
  for i in range(1, 40):
      knn = KNeighborsClassifier(n_neighbors = i)
      knn.fit(X_train, y_train)
      pred_i = knn.predict(X_test)
      error_rate.append(np.mean(pred_i != y_test))

  plt.figure(figsize=(10, 6))
  plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed',
           marker='o', markerfacecolor='red', markersize=10)
  plt.title("Error Rate vs K value")
  plt.xlabel('K')
  plt.ylabel('Error Rate')
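
The error rate is just the share of mismatched predictions, and the plot can be complemented by reading the lowest point off programmatically. This is a minimal sketch with made-up numbers. `error_rate_of` is a hypothetical helper, and the five error rates are not results from the dataset used here.

```python
def error_rate_of(y_true, y_pred):
    """Share of predictions that differ from the true labels."""
    mismatches = [actual != predicted
                  for actual, predicted in zip(y_true, y_pred)]
    return sum(mismatches) / len(mismatches)  # True counts as 1, False as 0

print(error_rate_of([0, 1, 1, 0], [0, 1, 0, 1]))  # 2 of 4 predictions are wrong

# Hypothetical error rates for K = 1..5
error_rate = [0.10, 0.08, 0.06, 0.07, 0.065]

# K values start at 1, so list position i corresponds to K = i + 1
best_k = min(range(1, len(error_rate) + 1), key=lambda k: error_rate[k - 1])
print(best_k)
```

The lowest point of a noisy curve is not always the best choice; a K in a region where the curve stays flat and low may be preferable.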

Retrain and re-evaluate with a new K value

  knn = KNeighborsClassifier(n_neighbors = 37)
  knn.fit(X_train, y_train)
  predictions = knn.predict(X_test)
  print(classification_report(y_test, predictions))
  print(confusion_matrix(y_test, predictions))


