2.1.14.2. Decision Trees and Random Forests with Python

1. Import the basic libraries

  • pandas, numpy, matplotlib, seaborn

      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
  • Embed plots directly in the notebook

      %matplotlib inline

2. Read and understand the data

  • Read the data

    • The output of df.head() shows three feature columns, Age, Number, and Start, and the target column Kyphosis

       df = pd.read_csv('kyphosis.csv')
       df.head()

  • Check the data type of each column first. df.info() shows that this dataset has 4 columns: 3 are int64 and 1 is object

     df.info()
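As a complement to df.info(), the dtypes attribute and value_counts() give a quick view of the column types and of how many rows fall into each class. The sketch below is only an illustration: info_df is a hypothetical stand-in with made-up values and the same four column names, so that it runs without kyphosis.csv. Text columns are typically reported as object, although the exact label can depend on the pandas version.

```python
import pandas as pd

# Synthetic stand-in rows (made-up values) with the same four column names
info_df = pd.DataFrame({
    "Kyphosis": ["absent", "present", "absent"],
    "Age": [71, 128, 2],
    "Number": [3, 4, 5],
    "Start": [5, 5, 1],
})

# dtypes lists the type of each column; the integer columns show up as int64
print(info_df.dtypes)

# Count how many rows fall into each class of the target column
counts = info_df["Kyphosis"].value_counts()
print(counts)
```

On the real data, the same two calls on df show whether the classes are balanced, which matters when reading the accuracy figures later.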

  • Visualize the data to see how the features relate to one another

3. Use the scikit-learn library

  • Start with train_test_split, a function that randomly splits the data into a training set and a test set

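      from sklearn.model_selection import train_test_split
      # Features: every column except the target
      X = df.drop('Kyphosis', axis=1)
      # Target: the Kyphosis column
      y = df['Kyphosis']
      # Hold out 40% of the rows for testing; random_state makes the split reproducible
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)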
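To make the effect of test_size and random_state concrete, here is a minimal sketch on a tiny made-up table. The names demo, X_demo, and y_demo are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, one feature column and one label column
demo = pd.DataFrame({"feature": np.arange(10), "label": [0, 1] * 5})
X_demo = demo[["feature"]]
y_demo = demo["label"]

# test_size=0.4 sends 40% of the rows to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=101)
print(len(X_tr), len(X_te))  # 6 4

# The same random_state reproduces exactly the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.4, random_state=101)
print(X_te.index.equals(X_te2.index))  # True
```

Leaving random_state unset gives a different split on each run, so the scores reported later would vary between runs.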

4. Use the decision tree classifier

  • Import the decision tree classifier

      from sklearn.tree import DecisionTreeClassifier
      dtree = DecisionTreeClassifier()
  • Train the model

     dtree.fit(X_train, y_train)
  • Predict

      pred = dtree.predict(X_test)
  • Evaluate the model's accuracy

    • confusion_matrix, classification_report

      from sklearn.metrics import classification_report, confusion_matrix
      print(classification_report(y_test, pred))
      print(confusion_matrix(y_test, pred))
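The layout of the confusion matrix is easier to read on a small example. The sketch below uses hand-made labels, which are illustrative only and not results from the kyphosis data, to show how the matrix relates to accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only (not results from the kyphosis data)
y_true = ["absent", "absent", "absent", "present", "present", "absent"]
y_hat = ["absent", "present", "absent", "present", "absent", "absent"]

# Rows are true classes and columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=["absent", "present"])
print(cm)

# Correct predictions sit on the diagonal, so accuracy = diagonal sum / total
accuracy = cm.trace() / cm.sum()
print(round(accuracy, 3), round(accuracy_score(y_true, y_hat), 3))
```

Here 3 of the 4 true "absent" rows and 1 of the 2 true "present" rows are predicted correctly, so 4 of 6 predictions are right.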

5. Use the random forest

  • Import the random forest classifier

      from sklearn.ensemble import RandomForestClassifier
  • Train the model

      rfc = RandomForestClassifier(n_estimators = 200)
      rfc.fit(X_train, y_train)
  • Predict

      rfc_pred = rfc.predict(X_test)
  • Evaluate the model's accuracy

      print(classification_report(y_test, rfc_pred))
      print(confusion_matrix(y_test, rfc_pred))
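To compare the single tree with the forest, it helps to score both on the same test split. The following is a minimal sketch under stated assumptions: cmp_df is a hypothetical synthetic table with made-up values and an arbitrary labeling rule, so the scores it prints say nothing about the real kyphosis data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for kyphosis.csv (made-up values; the label rule is arbitrary)
rng = np.random.default_rng(42)
n = 200
cmp_df = pd.DataFrame({
    "Age": rng.integers(1, 200, size=n),
    "Number": rng.integers(2, 10, size=n),
    "Start": rng.integers(1, 18, size=n),
})
cmp_df["Kyphosis"] = np.where(cmp_df["Start"] < 9, "present", "absent")

X_cmp = cmp_df.drop("Kyphosis", axis=1)
y_cmp = cmp_df["Kyphosis"]
X_a, X_b, y_a, y_b = train_test_split(X_cmp, y_cmp, test_size=0.4, random_state=101)

# Fit both models on the same training rows; random_state fixes their internal randomness
models = {
    "decision tree": DecisionTreeClassifier(random_state=101),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=101),
}
scores = {}
for name, model in models.items():
    model.fit(X_a, y_a)
    scores[name] = accuracy_score(y_b, model.predict(X_b))
    print(name, scores[name])
```

On a small dataset the gap between the two models can change from one split to another, so a single comparison should be read with caution.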

6. Visualize the decision tree
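One way to draw a fitted tree is scikit-learn's plot_tree, with export_text as a plain-text alternative. This is a sketch of one possible approach and may not be the method the original notebook used. viz_df is a hypothetical synthetic table with made-up values and an arbitrary labeling rule, and max_depth=3 is chosen here only to keep the figure readable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

# Synthetic stand-in for kyphosis.csv (made-up values; the label rule is arbitrary)
rng = np.random.default_rng(7)
n = 60
viz_df = pd.DataFrame({
    "Age": rng.integers(1, 200, size=n),
    "Number": rng.integers(2, 10, size=n),
    "Start": rng.integers(1, 18, size=n),
})
viz_df["Kyphosis"] = np.where(viz_df["Start"] < 9, "present", "absent")

features = ["Age", "Number", "Start"]
viz_tree = DecisionTreeClassifier(max_depth=3, random_state=101)
viz_tree.fit(viz_df[features], viz_df["Kyphosis"])

# Text view: one line per split or leaf
rules = export_text(viz_tree, feature_names=features)
print(rules)

# Graphical view: one box per node, filled with a color that reflects the majority class
fig, ax = plt.subplots(figsize=(8, 5))
nodes = plot_tree(viz_tree, feature_names=features,
                  class_names=list(viz_tree.classes_), filled=True, ax=ax)
```

With the models from the earlier steps, the same calls would take dtree and list(X.columns) in place of viz_tree and features.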
