2.1.13.2.KNN with Python


1. Import the basic libraries

  • pandas, numpy, matplotlib, seaborn

      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
  • Embed plots directly in the notebook

      %matplotlib inline

2. Read and explore the data

df = pd.read_csv('Classified Data', index_col = 0)
df.head()
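
  • Beyond head(), a quick look at the shape, dtypes, and class balance helps with exploring the data (a minimal sketch added here; it only assumes df was loaded as above and contains a TARGET CLASS column)

      # Rows/columns, column dtypes, summary statistics, and class balance
      print(df.shape)
      df.info()
      print(df.describe())
      print(df["TARGET CLASS"].value_counts())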

3. Standardize the data

  • The scale of the variables usually has a large effect on the result, so when using a KNN classifier the observations are normally rescaled to a common scale

  • Use StandardScaler, which subtracts the mean and divides by the standard deviation, i.e. (X - mean) / std. Drop TARGET CLASS from the original data, then fit and transform

      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      scaler.fit(df.drop("TARGET CLASS", axis = 1))
      scaled_features = scaler.transform(df.drop("TARGET CLASS", axis = 1))
  • Convert the standardized data into a DataFrame

      df_feat = pd.DataFrame(scaled_features, columns = df.columns[:-1])
      df_feat.head()
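
  • As a sanity check (a sketch added here, not part of the original notes), the same values can be computed by hand with the (X - mean) / std formula; StandardScaler applies exactly this column-wise, using the population standard deviation (ddof = 0). This assumes, as the code above does, that TARGET CLASS is the last column

      # Standardize the first feature column manually and compare with the scaler's output
      col = df.columns[0]
      manual = (df[col] - df[col].mean()) / df[col].std(ddof = 0)
      print(np.allclose(manual.values, scaled_features[:, 0]))   # expect True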

4. Use the Scikit-learn library

  • Start with train_test_split, a function that randomly splits the data into a training set and a test set

      # sklearn.cross_validation was removed in newer scikit-learn releases; use model_selection instead
      from sklearn.model_selection import train_test_split
      X = df_feat
      Y = df["TARGET CLASS"]
      X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
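
  • A quick shape check (my addition) confirms the 60/40 split produced by test_size = 0.4

      # X_train/X_test row counts should be about 60% / 40% of the data
      print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)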

5. Make predictions with the KNN classifier

  • Set n_neighbors (K) to 1

      from sklearn.neighbors import KNeighborsClassifier
      knn = KNeighborsClassifier(n_neighbors = 1)
      knn.fit(X_train, y_train)
      predictions = knn.predict(X_test)
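
  • As a quick sanity check before the fuller evaluation in the next step (my addition), KNeighborsClassifier.score returns the mean accuracy on the given test data

      # Fraction of test samples classified correctly with K = 1
      print(knn.score(X_test, y_test))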

6. Evaluate the model's accuracy

  • Use classification_report

      from sklearn.metrics import classification_report
      print(classification_report(y_test, predictions))
  • Use confusion_matrix

      from sklearn.metrics import confusion_matrix
      print(confusion_matrix(y_test, predictions))
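
  • For reference (a sketch added here, not from the original notes), the overall accuracy can be read off the confusion matrix as the diagonal divided by the total, and it should agree with sklearn's accuracy_score

      from sklearn.metrics import accuracy_score
      cm = confusion_matrix(y_test, predictions)
      manual_acc = cm.trace() / cm.sum()          # correctly classified / all samples
      print(manual_acc, accuracy_score(y_test, predictions))   # the two values should match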

7. Tune the K value

  • Compute the error rate for K = 1 to 39 by taking the mean of the mismatches between each set of predictions and y_test, then plot the error rate against K

      error_rate = []
      for i in range(1, 40):
          knn = KNeighborsClassifier(n_neighbors = i)
          knn.fit(X_train, y_train)
          pred_i = knn.predict(X_test)
          error_rate.append(np.mean(pred_i != y_test))

      plt.figure(figsize = (10, 6))
      plt.plot(range(1, 40), error_rate, color = 'blue', linestyle = 'dashed',
               marker = 'o', markerfacecolor = 'red', markersize = 10)
      plt.title("Error Rate vs K value")
      plt.xlabel('K')
  • Retrain with the chosen K value (here K = 37) and re-evaluate

      knn = KNeighborsClassifier(n_neighbors = 37)
      knn.fit(X_train, y_train)
      predictions = knn.predict(X_test)
      print(classification_report(y_test, predictions))
      print(confusion_matrix(y_test, predictions))
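
  • An alternative way to choose K (not covered in the original notes, sketched here with the same variable names) is cross-validation on the training set, which avoids tuning K against the test set

      from sklearn.model_selection import cross_val_score

      cv_scores = []
      for i in range(1, 40):
          knn = KNeighborsClassifier(n_neighbors = i)
          # 5-fold cross-validated accuracy on the training data only
          scores = cross_val_score(knn, X_train, y_train, cv = 5)
          cv_scores.append(scores.mean())
      best_k = int(np.argmax(cv_scores)) + 1   # K values start at 1
      print(best_k)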
