2.1.13.2. KNN with Python

1. Import the basic libraries

  • pandas, numpy, matplotlib, seaborn

      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
  • Embed plots directly in the notebook

      %matplotlib inline

2. Load the data and take a first look

      df = pd.read_csv('Classified Data', index_col = 0)
      df.head()
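
Beyond df.head(), a few more pandas calls help in understanding the data before modelling. The following is a minimal sketch, assuming a DataFrame with numeric feature columns plus a binary TARGET CLASS column. The synthetic frame and its column names FEAT_A and FEAT_B are hypothetical stand-ins so that the snippet runs without the CSV file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: two numeric features and a binary target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "FEAT_A": rng.normal(size=100),
    "FEAT_B": rng.normal(loc=50, scale=10, size=100),
    "TARGET CLASS": rng.integers(0, 2, size=100),
})

demo.info()                                          # column dtypes and non-null counts
summary = demo.describe()                            # per-column mean, std, min, max, quartiles
missing = demo.isnull().sum()                        # number of missing values per column
class_counts = demo["TARGET CLASS"].value_counts()   # class balance

print(summary)
print(missing)
print(class_counts)
```

Very different means and standard deviations across columns in the describe() output are a sign that the features should be put on a common scale before running KNN.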

3. Standardize the data

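  • The scale of the variables usually has a large effect on the result. Because KNN classifies by distance, the features are normally put on a common scale before a KNN classifier is used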

  • Use StandardScaler

    • StandardScaler subtracts the mean from each feature and divides by its standard deviation (not the variance): (X - mean) / std

    • Drop the TARGET CLASS column from the original data, fit the scaler on the remaining features, then transform them

      from sklearn.preprocessing import StandardScaler
      scaler  = StandardScaler()
      scaler.fit(df.drop("TARGET CLASS", axis = 1))
      scaled_features = scaler.transform(df.drop("TARGET CLASS", axis = 1))
  • Convert the standardized data into a DataFrame

      df_feat = pd.DataFrame(scaled_features, columns = df.columns[:-1])
      df_feat.head()
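
The formula (X - mean) / std can be checked by hand against StandardScaler. The following is a minimal sketch on a hypothetical toy matrix (X_demo is not part of the dataset above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 observations, 2 features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [4.0, 400.0]])

scaled_sklearn = StandardScaler().fit_transform(X_demo)

# Manual version: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), NumPy's default.
scaled_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled_sklearn, scaled_manual))
print(scaled_manual.mean(axis=0), scaled_manual.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1. Note that pandas' DataFrame.std() defaults to ddof=1, so a manual check written with pandas needs std(ddof=0) to match StandardScaler.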

4. Use the scikit-learn library

  • Start with train_test_split, a function that randomly splits the data into a training set and a test set

      # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
      from sklearn.model_selection import train_test_split
      X = df_feat
      Y = df["TARGET CLASS"]
      X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
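
To see what test_size and random_state do, the split can be tried on a small array. The following is a sketch on hypothetical synthetic data (X_demo and y_demo are made up for illustration). The stratify argument is an optional addition that the call above does not use; it keeps the class ratio identical in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 observations, 3 features, balanced binary labels.
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.4,       # 40% of the rows go to the test set
    random_state=101,    # fixed seed so the split is reproducible
    stratify=y_demo,     # keep the class ratio the same in both parts
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```

With 100 rows and test_size=0.4, 60 rows end up in the training set and 40 in the test set.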

5. Predict with a KNN classifier

  • Set n_neighbors (K) to 1

      from sklearn.neighbors import KNeighborsClassifier
      knn = KNeighborsClassifier(n_neighbors = 1)
      knn.fit(X_train, y_train)
      predictions = knn.predict(X_test)
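
To make clear what the classifier does with n_neighbors, here is a from-scratch sketch of the KNN idea in NumPy: measure the Euclidean distance to every training point, take the K closest, and let them vote. The function knn_predict and the toy arrays are hypothetical illustrations, not scikit-learn's implementation, which uses faster neighbor-search structures and may break ties differently:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # majority vote; on a tie the smallest label wins (np.unique returns sorted labels)
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, [[0.05, 0.05], [4.8, 5.0]], k=3))
```

Because every prediction depends directly on distances, a feature with a large numeric range would dominate the vote, which is why the data is standardized first.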

6. Evaluate the model's accuracy

  • Use classification_report

      from sklearn.metrics import classification_report
      print(classification_report(y_test, predictions))

  • confusion_matrix

      from sklearn.metrics import confusion_matrix
      print(confusion_matrix(y_test, predictions))
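
The numbers in classification_report can be derived from the confusion matrix. The following is a small worked sketch with hypothetical labels (y_true and y_pred are made up and are not results from the dataset above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not results from the dataset above.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels {0, 1}: rows are true classes, columns are predicted classes,
# so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the samples predicted as class 1, how many are class 1
recall = tp / (tp + fn)      # of the true class-1 samples, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

Here tn=3, fp=1, fn=2, tp=4, which gives an accuracy of 0.7, a precision of 0.8 and a recall of about 0.67 for class 1.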

7. Tune the value of K

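  • For each K from 1 to 39, fit the classifier and record its error rate: the mean of the boolean array pred_i != y_test, i.e. the fraction of misclassified test samples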

      error_rate = []
      for i in range(1, 40):
          knn = KNeighborsClassifier(n_neighbors = i)
          knn.fit(X_train, y_train)
          pred_i = knn.predict(X_test)
          error_rate.append(np.mean(pred_i != y_test))
    
      plt.figure(figsize = (10, 6))
      plt.plot(range(1, 40), error_rate, color = 'blue', linestyle = 'dashed',
               marker = 'o', markerfacecolor = 'red', markersize = 10)
      plt.title("Error Rate vs K value")
      plt.xlabel('K')
      plt.ylabel('Error Rate')

  • Refit the model with a new K (37 here) and evaluate it again

      knn = KNeighborsClassifier(n_neighbors = 37)
      knn.fit(X_train, y_train)
      predictions = knn.predict(X_test)
      print(classification_report(y_test, predictions))
      print(confusion_matrix(y_test, predictions))
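
The loop above picks K by looking at the test set, so the final test-set score tends to be optimistic. One possible alternative, which is not the method used above, is to choose K by cross-validation on the training data only and to keep the test set for a single final check. The following is a sketch under those assumptions; the synthetic data from make_classification stands in for the real features, and the Pipeline refits the scaler inside each fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the features and TARGET CLASS.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=101)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    # The pipeline refits the scaler inside each fold, so held-out folds never influence it.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")
    cv_error.append(1 - scores.mean())

# K with the lowest mean cross-validated error (the smallest such K if several tie)
best_k = k_values[int(np.argmin(cv_error))]
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_tr, y_tr)

print("best K:", best_k)
print("test accuracy:", final_model.score(X_te, y_te))
```

The cv_error list can be plotted against k_values in the same way as error_rate above.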
