# 2.1.14.2. Decision Trees and Random Forests with Python

## 1. Import the basic libraries

* [pandas](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas.html), [numpy](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/213python-for-data-analysis-numpy.html), [matplotlib](https://jenhsuan.gitbooks.io/python/content/test.html), [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html)

  ```
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
  ```
* Render plots inline in the notebook

  ```
    %matplotlib inline
  ```

## 2. Load and explore the data

* Load the data
  * `df.head()` shows that the features are Age, Number, and Start, and that the target is Kyphosis

    ```
     df = pd.read_csv('kyphosis.csv')
     df.head()
    ```

    ![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0TJ-4nQgBKRDkAIN%2F%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-29%20%E4%B8%8B%E5%8D%8810.04.11.png?generation=1586302958088396\&alt=media)
* Check the column types first. `df.info()` shows that the dataset has 4 columns: 3 are int64 and 1 is object

  ```
   df.info()
  ```

  ![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0TJ1ftfEE2agnj9Y%2F%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-29%20%E4%B8%8B%E5%8D%8810.10.28.png?generation=1586302958282075\&alt=media)
* Visualize the data to see how the features relate to each other
  * [seaborn pairplot](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2172distribution-plot.html)

    ```
    sns.pairplot(data = df, hue='Kyphosis')
    ```

    ![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0TJ3-JbQGgLMcdOz%2F%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-29%20%E4%B8%8B%E5%8D%8810.06.45.png?generation=1586302957760351\&alt=media)
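
The exploration steps above can also be checked numerically. The following is a minimal sketch that assumes a small stand-in frame named `toy` with the same column names as `kyphosis.csv`; every value in it is made up for illustration and is not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in rows (invented values) with the tutorial's column names.
toy = pd.DataFrame({
    "Kyphosis": ["absent", "absent", "present", "absent", "present", "absent"],
    "Age": [10, 20, 30, 40, 50, 60],
    "Number": [2, 3, 4, 5, 6, 7],
    "Start": [16, 14, 5, 12, 3, 15],
})

# Column dtypes: one text label column and three integer feature columns.
print(toy.dtypes)

# Class balance of the target; a skewed split changes how accuracy should be read.
class_counts = toy["Kyphosis"].value_counts()
print(class_counts)

# Per-class feature means, a numeric counterpart to the pairplot's hue split.
class_means = toy.groupby("Kyphosis")[["Age", "Number", "Start"]].mean()
print(class_means)
```

With the real data, replacing `toy` with `df` should give the same kind of summary.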

## 3. Use the scikit-learn library

* Start with train\_test\_split, a function that randomly splits the data into a training set and a test set

  ```
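    # train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
    from sklearn.model_selection import train_test_split
    # Features are every column except the target; the target is the Kyphosis label
    X = df.drop('Kyphosis', axis=1)
    y = df['Kyphosis']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)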
  ```
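
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on made-up arrays. The names `X_demo` and `y_demo` are hypothetical and the data has no connection to the kyphosis dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 3 features, alternating binary labels.
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.array([0, 1] * 5)

# test_size=0.4 holds out 40% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)

# stratify keeps the class proportions equal in both parts of the split.
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101, stratify=y_demo
)
print(np.bincount(y_te_strat))
```

Stratifying is worth considering when one class is much rarer than the other, since a purely random split can leave very few positive cases in the test set.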

## 4. Use a decision tree classifier

* Import the decision tree classifier

  ```
    from sklearn.tree import DecisionTreeClassifier
    dtree = DecisionTreeClassifier()
  ```
* Train the model

  ```
   dtree.fit(X_train, y_train)
  ```
* Predict

  ```
    pred = dtree.predict(X_test)
  ```
* Evaluate the model's accuracy
  * confusion\_matrix, classification\_report

    ```
    from sklearn.metrics import classification_report, confusion_matrix
    print(classification_report(y_test, pred))
    print(confusion_matrix(y_test, pred))
    ```

    ![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0TJ5s0zfvMfvVGh1%2F%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-29%20%E4%B8%8B%E5%8D%8811.10.44.png?generation=1586302958555937\&alt=media)
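
To show how the confusion matrix relates to the precision and recall figures in the classification report, here is a small worked sketch. The label lists `y_true` and `y_hat` are hand-written for illustration and are not output from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written labels for illustration only.
y_true = ["absent", "absent", "absent", "absent", "present", "present"]
y_hat = ["absent", "absent", "absent", "present", "present", "absent"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=["absent", "present"])
print(cm)

# Treating "present" as the positive class, unpack the four cells.
tn, fp, fn, tp = cm.ravel()

# precision = TP / (TP + FP), recall = TP / (TP + FN)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)

# The same numbers come from the scikit-learn helpers.
print(precision_score(y_true, y_hat, pos_label="present"))
print(recall_score(y_true, y_hat, pos_label="present"))
```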

## 5. Use a random forest

* Import the random forest classifier

  ```
    from sklearn.ensemble import RandomForestClassifier
  ```
* Train the model

  ```
    rfc = RandomForestClassifier(n_estimators = 200)
    rfc.fit(X_train, y_train)
  ```
* Predict

  ```
    rfc_pred = rfc.predict(X_test)
  ```
* Evaluate the model's accuracy

  ```
    print(classification_report(y_test, rfc_pred))
    print(confusion_matrix(y_test, rfc_pred))
  ```

  ![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0TJ70-Hidwc7-A9J%2F%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-29%20%E4%B8%8B%E5%8D%8811.33.23.png?generation=1586302957892537\&alt=media)
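
For a side-by-side comparison of the two models, the sketch below fits a single tree and a 200-tree forest on synthetic data from `make_classification`. The data, the variable names, and the choice of three features are assumptions made for illustration; the scores it prints say nothing about the kyphosis results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 3 features, 2 classes.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=1, random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.4, random_state=101
)

# One tree versus an ensemble of 200 trees, each grown on a bootstrap sample.
tree = DecisionTreeClassifier(random_state=101).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=101).fit(X_tr, y_tr)

# Mean accuracy on the held-out rows.
tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print(f"tree: {tree_acc:.3f}  forest: {forest_acc:.3f}")

# Impurity-based importances, averaged over the trees; they sum to 1.
importances = forest.feature_importances_
print(importances)
```

A forest averages many decorrelated trees, so it tends to be less sensitive to the particular training split than a single tree, although on a dataset this small either model can come out ahead.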

## 6. Visualize the decision tree
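
One simple way to inspect a fitted tree is a text rendering of its split rules. The sketch below uses `sklearn.tree.export_text` on a shallow tree fitted to synthetic data; the data, the `max_depth=2` limit, and the reuse of the feature names Age, Number, and Start as column labels are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Labels borrowed from the tutorial's columns; the data itself is synthetic.
feature_names = ["Age", "Number", "Start"]
X_syn, y_syn = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=1, random_state=101
)

# A shallow tree keeps the printed rules short enough to read.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=101)
small_tree.fit(X_syn, y_syn)

# Each indented line is a split condition; leaves show the predicted class.
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

For a graphical version, `sklearn.tree.plot_tree` draws the same structure with matplotlib.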
