# 2.1.13.2.KNN with Python

## 1. Import the basic libraries

* [pandas](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas.html), [numpy](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/213python-for-data-analysis-numpy.html), [matplotlib](https://jenhsuan.gitbooks.io/python/content/test.html), [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html)

  ```
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
  ```
* Embed plots directly in the Notebook

  ```
    %matplotlib inline
  ```

## 2. Read and inspect the data

```
df = pd.read_csv('Classified Data', index_col = 0)
df.head()
```

![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-26%20%E4%B8%8B%E5%8D%8810.46.37.png)

## 3. Standardize the data

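* KNN classifies by distance, so the scale of each variable strongly affects the result; features are usually brought to a common scale before fitting a KNN classifier
* Use StandardScaler
  * [StandardScaler](https://www.cnblogs.com/chaosimple/p/4153167.html) subtracts the mean from each feature and divides by the standard deviation: (X - mean) / std
  * Drop the TARGET CLASS column from the original data, then [fit and transform](https://blog.csdn.net/lz_peter/article/details/78237094)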

    ```
    from sklearn.preprocessing import StandardScaler
    scaler  = StandardScaler()
    scaler.fit(df.drop("TARGET CLASS", axis = 1))
    scaled_features = scaler.transform(df.drop("TARGET CLASS", axis = 1))
    ```
* Convert the standardized data into a DataFrame

  ```
    df_feat = pd.DataFrame(scaled_features, columns = df.columns[:-1])
    df_feat.head()
  ```
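
To make the (X - mean) / std formula concrete, here is a minimal sketch in plain Python that standardizes one column by hand. The `standardize` helper and the `heights_cm` values are hypothetical and are not part of the Classified Data file; the sketch uses the population standard deviation, which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return (x - mean) / std for every value, using the population std (ddof = 0)."""
    mean = fmean(column)
    std = pstdev(column)  # a constant column gives std == 0 and is not handled in this sketch
    return [(x - mean) / std for x in column]

# Hypothetical feature column, used only for illustration.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights_cm)
print([round(v, 3) for v in scaled])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After standardizing, the column has mean 0 and standard deviation 1, so no single feature dominates the distance computation.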

## 4. Use the scikit-learn library

* Start with train\_test\_split, a function that randomly splits the data into a training set and a test set

  ```
    from sklearn.model_selection import train_test_split
    X = df_feat
    Y = df["TARGET CLASS"]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
  ```
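
The idea behind a random split can be sketched in plain Python as follows. This is an illustration under stated assumptions, not scikit-learn's implementation: `simple_train_test_split` is a hypothetical helper, and the same seed will not reproduce the rows that train\_test\_split selects.

```python
import math
import random

def simple_train_test_split(rows, labels, test_size=0.4, seed=101):
    """Shuffle row indices with a fixed seed, then hold out a test_size share as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index features and labels with the same positions so they stay aligned.
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: 10 rows with alternating labels.
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = simple_train_test_split(rows, labels)
print(len(X_train), len(X_test))  # 6 4
```

Fixing the seed plays the same role as random\_state: the split is random but repeatable.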

## 5. Predict with a KNN classifier

* Set n\_neighbors (K) to 1

  ```
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 1)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
  ```
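
What the classifier does with K can be sketched from scratch: measure the distance from a new point to every training point, keep the K closest, and take a majority vote. The `knn_predict` helper and the toy data below are hypothetical; KNeighborsClassifier uses faster neighbor searches, so treat this as an illustration of the idea only.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, x, k=1):
    """Label x by majority vote among its k nearest training points (Euclidean distance)."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: math.dist(pair[0], x))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two points near the origin (class 0), two near (1, 1) (class 1).
X_toy = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y_toy = [0, 0, 1, 1]
print(knn_predict(X_toy, y_toy, [0.2, 0.1], k=1))  # 0
print(knn_predict(X_toy, y_toy, [0.8, 0.9], k=3))  # 1
```

With K = 1 the prediction copies the single nearest neighbor, which makes it sensitive to noisy training points; larger K smooths the decision.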

## 6. Evaluate the model's accuracy

* Use classification\_report

  ```
    from sklearn.metrics import classification_report
    print(classification_report(y_test, predictions))
  ```

  ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-26%20%E4%B8%8B%E5%8D%8810.43.33.png)
* confusion\_matrix

  ```
    from sklearn.metrics import confusion_matrix
    print(confusion_matrix(y_test, predictions))
  ```

  ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-26%20%E4%B8%8B%E5%8D%8810.45.32.png)
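
To show how the two reports relate, the sketch below counts the four cells of a binary confusion matrix by hand and derives precision, recall, and F1 from them. The `confusion_counts` helper and the toy labels are hypothetical; the matrix layout (rows are true classes, columns are predicted classes) follows confusion\_matrix.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true negatives, false positives, false negatives, true positives."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tn, fp, fn, tp

# Hypothetical labels, used only for illustration.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 1, 0, 0, 0]
tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)

matrix = [[tn, fp], [fn, tp]]       # rows: true class, columns: predicted class
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
print(matrix)                       # [[3, 1], [2, 2]]
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```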

## 7. Tune the value of K

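* For each K, [take the mean of the boolean mismatch array between the predictions and y\_test](https://stackoverflow.com/questions/41419864/understand-np-mean-in-python), which gives the error rate for that K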

  ```
    error_rate = []
    for i in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors = i)
        knn.fit(X_train, y_train)
        pred_i = knn.predict(X_test)
        error_rate.append(np.mean(pred_i != y_test))

    plt.figure(figsize = (10, 6))
    plt.plot(range(1, 40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o',
             markerfacecolor = 'red', markersize = 10)
    plt.title("Error Rate vs K value")
    plt.xlabel('K')
    plt.ylabel('Error Rate')
  ```

  ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-26%20%E4%B8%8B%E5%8D%8810.50.09.png)
* Refit with a K chosen from the plot (37 here) and re-evaluate

  ```
    knn = KNeighborsClassifier(n_neighbors = 37)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
    print(classification_report(y_test, predictions))
    print(confusion_matrix(y_test, predictions))
  ```
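
The error-rate computation and the choice of K can be sketched without NumPy. Taking the mean of a boolean mismatch array is the same as dividing the number of wrong predictions by the total. The `error_rate` helper and the per-K predictions below are hypothetical stand-ins for the output of knn.predict.

```python
def error_rate(y_true, y_pred):
    """Share of predictions that differ from the true labels (mean of the mismatch flags)."""
    mismatches = [pred != truth for pred, truth in zip(y_pred, y_true)]
    return sum(mismatches) / len(mismatches)

y_true_toy = [1, 0, 1, 1, 0]
# Hypothetical predictions for three values of K.
preds_by_k = {
    1: [1, 1, 0, 1, 0],
    3: [1, 0, 0, 1, 0],
    5: [1, 0, 1, 1, 0],
}
rates = {k: error_rate(y_true_toy, preds) for k, preds in preds_by_k.items()}
best_k = min(rates, key=rates.get)  # K with the lowest error rate
print(rates)   # {1: 0.4, 3: 0.2, 5: 0.0}
print(best_k)  # 5
```

Note that picking K on the same test set used for the final report makes that report optimistic; a separate validation split or cross-validation avoids this.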

