# 2.1.11.2. Linear regression with Python

## 1. Import the basic libraries

* [pandas](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas.html), [numpy](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/213python-for-data-analysis-numpy.html), [matplotlib](https://jenhsuan.gitbooks.io/python/content/test.html), [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html)

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

* Render plots inline in the notebook

```
%matplotlib inline
```

## 2. Load and inspect the data

* Load the data

  ```
    df = pd.read_csv('USA_Housing.csv')
  ```
* Inspect the first few rows, for example the first 10

  ```
    df.head(10)
  ```
* Check the dtype and non-null count of each column

  ```
    df.info()
  ```
* Get basic summary statistics such as count, mean, standard deviation, and quartiles

  ```
    df.describe()
  ```
* Get the column names of df

  ```
    df.columns

    output:
    Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
  ```
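
The inspection calls above can be tried without the CSV file. The following is a minimal sketch on a tiny hand-made DataFrame: the column names mirror the dataset, but the values and the `toy` variable are invented for illustration. It shows that `describe()` summarizes only numeric columns by default, and that `select_dtypes` picks out the columns that could serve as regression features.

```python
import pandas as pd

# Hypothetical toy data: values are invented, only the column names follow USA_Housing.csv
toy = pd.DataFrame({
    'Avg. Area Income': [60000.0, 75000.0, 50000.0],
    'Area Population': [30000.0, 40000.0, 25000.0],
    'Price': [1000000.0, 1500000.0, 800000.0],
    'Address': ['1 First St', '2 Second St', '3 Third St'],
})

# describe() skips the text column 'Address' by default
summary = toy.describe()
print(summary.columns.tolist())

# Keep only the numeric columns, e.g. as candidates for features and target
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)

# Count missing values per column before modeling
missing = toy.isnull().sum()
print(missing)
```

Checking for missing values at this stage is worthwhile because `LinearRegression` does not accept NaN inputs.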

## 3. Plot the data for analysis

* Use [seaborn to draw a pair plot (pairplot)](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2172distribution-plot.html). The input can be a dataset loaded with sns.load\_dataset() or a DataFrame read with pd.read\_csv

  ```
    sns.pairplot(df)
  ```

  ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8A%E5%8D%8810.04.20.png)
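* Use [seaborn to draw a distribution plot (distplot)](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2172distribution-plot.html) of the target column. Note that distplot has been deprecated since seaborn 0.11, where histplot and displot are the suggested replacements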

  ```
    sns.distplot(df['Price'])
  ```

  ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8A%E5%8D%8810.04.29.png)
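* Use [seaborn to draw a heatmap of the correlation coefficients between the numeric columns](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2174matrix-plots.html)
  * df.corr() builds the correlation matrix from the numeric columns. pandas 2.0 and later no longer drop text columns such as Address automatically, so numeric\_only=True is passed explicitly

    ```
    sns.heatmap(df.corr(numeric_only=True), annot = True)
    ```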

    ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8A%E5%8D%8810.34.16.png)
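
To read the heatmap numerically, the same correlation matrix can be sorted by its `Price` column. This is a minimal sketch, assuming a small invented DataFrame (`toy`, with made-up values) in place of the CSV. `select_dtypes` is used here to drop the text column explicitly, which should give the same matrix as `corr(numeric_only=True)`.

```python
import pandas as pd

# Hypothetical toy data standing in for USA_Housing.csv (values are made up)
toy = pd.DataFrame({
    'Avg. Area Income': [50000.0, 60000.0, 70000.0, 80000.0],
    'Avg. Area House Age': [5.0, 7.0, 4.0, 6.0],
    'Price': [900000.0, 1150000.0, 1250000.0, 1500000.0],
    'Address': ['a', 'b', 'c', 'd'],
})

# Correlation matrix over the numeric columns only
corr = toy.select_dtypes(include='number').corr()

# Correlation of every feature with Price, strongest first
price_corr = corr['Price'].drop('Price').sort_values(ascending=False)
print(price_corr)
```

Features at the top of this ranking are the ones whose heatmap cells in the `Price` row are closest to 1.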

## 5. Using the scikit-learn library

* scikit-learn models are always imported in the form from sklearn.family import model

  ```
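    # from sklearn.family import model
    # train_test_split lives in sklearn.model_selection
    # (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
    from sklearn.model_selection import train_test_split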
  ```
* First, train\_test\_split: this function randomly splits the data into a training set and a test set

  ```
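    # Randomly split the samples into train data and test data by the given proportion
    # X: features of the data (the feature set to split)
    # Y: labels of the data (the target values to split)
    # test_size: fraction of samples placed in the test set; an integer means an absolute number of test samples
    # random_state: seed for the random number generator, so the split is reproducible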
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
  ```

  * For example, to predict drink sales from daily temperature, X would be the daily temperature data and Y the drink sales data
* Build a multiple regression model
  * In this example, we want to use **Avg. Area Income, Avg. Area House Age, Avg. Area Number of Rooms, Avg. Area Number of Bedrooms, and Area Population** to predict **Price**. A regression model with more than one explanatory variable is called a **multiple regression model**

    ```
    X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
    Y = df['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
    ```
  * Fit the linear regression model

    ```
    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(X_train, y_train)
    # Print the regression coefficients (one per feature, in the column order of X)
    print(lm.coef_)
    # Print the intercept
    print(lm.intercept_)
    ```
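
The temperature and drink-sales example mentioned above can be turned into a small end-to-end sketch. All data here are synthetic, and the names `temperature` and `sales` as well as the true slope and intercept (20 and 100) are assumptions chosen for illustration. The point is only to show how `train_test_split` sizes the two sets and how `coef_` and `intercept_` relate to the underlying line.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(101)

# Synthetic data: sales = 100 + 20 * temperature + noise (made-up relationship)
temperature = rng.uniform(15, 35, size=200)
sales = 100 + 20 * temperature + rng.normal(0, 5, size=200)

# scikit-learn expects a 2-D feature matrix, hence the reshape
X = temperature.reshape(-1, 1)
y = sales

# test_size=0.4 keeps 60% of the rows for training and 40% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)
print(X_train.shape, X_test.shape)

lm = LinearRegression()
lm.fit(X_train, y_train)

# With this little noise the fitted values should land close to 20 and 100
print(lm.coef_, lm.intercept_)
```

Each entry of `coef_` is the expected change in the target for a one-unit increase in that feature with the other features held fixed, which is how the five housing coefficients can be read as well.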

## 6. Making predictions with scikit-learn

* Predict on the test set

  ```
    predictions = lm.predict(X_test)
  ```
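* Plot the predictions against the test set to check how well they agree
  * Scatter plot: points lying close to a straight diagonal line indicate good predictions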

    ```
    plt.scatter(y_test, predictions)
    ```

    ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8B%E5%8D%882.32.16.png)
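  * Histogram of the residuals (y\_test minus predictions): a roughly bell-shaped distribution centered near zero suggests the linear model is a reasonable fit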

    ```
    sns.distplot((y_test-predictions))
    ```

    ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8B%E5%8D%882.32.21.png)
* Regression evaluation metrics for judging the quality of the linear regression model
  1. Mean Absolute Error (MAE): the mean of the absolute errors\
     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8B%E5%8D%881.54.52.png)
  2. Mean Squared Error (MSE): the mean of the squared errors\
     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8B%E5%8D%881.54.58.png)
  3. Root Mean Squared Error (RMSE): the square root of the MSE\
     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-16%20%E4%B8%8B%E5%8D%881.55.03.png)

```
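from sklearn import metrics
# MAE: mean of |y_test - predictions|
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
# MSE: mean of (y_test - predictions)**2
print('MSE:', metrics.mean_squared_error(y_test, predictions))
# RMSE: square root of the MSE, in the same units as Price
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))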
```
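
To make the three definitions concrete, they can be computed by hand with NumPy. This is a minimal sketch with made-up numbers (`y_true` and `y_pred` are hypothetical and not taken from the housing data): MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, which brings the error back to the units of the target.

```python
import numpy as np

# Hypothetical true values and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true - y_pred          # residuals: [1, 0, -1, -3]

mae = np.mean(np.abs(errors))     # (1 + 0 + 1 + 3) / 4 = 1.25
mse = np.mean(errors ** 2)        # (1 + 0 + 1 + 9) / 4 = 2.75
rmse = np.sqrt(mse)               # square root of 2.75

print(mae, mse, rmse)
```

MSE and RMSE penalize the single large error (-3) more heavily than MAE does. On the same arrays, `metrics.mean_absolute_error` and `metrics.mean_squared_error` should return the same MAE and MSE.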
