# 2.1.11.2.Linear regression with Python

## 1. Import the basic libraries

* [pandas](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas.html), [numpy](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/213python-for-data-analysis-numpy.html), [matplotlib](https://jenhsuan.gitbooks.io/python/content/test.html), [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html)

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

* Embed plots directly in the notebook

```
%matplotlib inline
```

## 2. Read and understand the data

* Read the data

  ```
    df = pd.read_csv('USA_Housing.csv')
  ```
* Check what the first few rows look like, for example the first 10 rows

  ```
    df.head(10)
  ```
* Inspect the columns and their data types

  ```
    df.info()
  ```
* Get basic summary statistics: count, mean, standard deviation, quartiles, and so on

  ```
    df.describe()
  ```
* Get the column names of `df`

  ```
    df.columns

    output:
    Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
  ```

## 3. Plot charts to analyze the data

* Use [seaborn to draw a pair plot (pairplot)](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2172distribution-plot.html). The input can be a dataset loaded with `sns.load_dataset()` or a DataFrame read with `pd.read_csv`

  ```
    sns.pairplot(df)
  ```

  ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20上午10.04.20.png)
* Use [seaborn to draw a distribution plot (distplot)](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2172distribution-plot.html)

  ```
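    # Note: distplot is deprecated since seaborn 0.11; sns.histplot(df['Price'], kde=True) is the suggested replacement
    sns.distplot(df['Price'])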
  ```

  ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20上午10.04.29.png)
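* Use [seaborn to draw a heatmap of the correlation coefficients between the numeric columns](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2174matrix-plots.html)
  * `df.corr()` builds the correlation matrix from the numeric columns. In pandas 2.0 and later, pass `numeric_only=True` so that the text column `Address` is skipped instead of raising an error

    ```
    sns.heatmap(df.corr(numeric_only=True), annot=True)
    ```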

    ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20上午10.34.16.png)
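
The heatmap is a picture of a correlation matrix, so the same information can also be read numerically. The sketch below is a minimal example on a small synthetic DataFrame. The column names `income`, `rooms`, `price`, and `address` and the generated values are hypothetical stand-ins, not the `USA_Housing.csv` data. It assumes pandas 1.5 or later, where `corr()` accepts `numeric_only=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; not the USA_Housing.csv file.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(60_000, 10_000, n)
rooms = rng.normal(6, 1, n)
noise = rng.normal(0, 5_000, n)
toy = pd.DataFrame({
    "income": income,
    "rooms": rooms,
    "price": 3 * income + 20_000 * rooms + noise,
    "address": [f"house {i}" for i in range(n)],  # text column, like 'Address'
})

# Correlation matrix over the numeric columns only; 'address' is skipped.
corr = toy.corr(numeric_only=True)

# Correlation of each numeric feature with the target, strongest first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Sorting one column of the matrix like this is a quick way to see which features move most closely with the target before fitting a model.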

## 5. Using the Scikit-learn library

* Scikit-learn models are all imported in the form `from sklearn.family import model`

  ```
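    # from sklearn.family import model
    # train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
    from sklearn.model_selection import train_test_split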
  ```
* First, `train_test_split`: this function randomly splits the data into a training set and a test set

  ```
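    # Randomly select train data and test data from the samples in the given proportion
    # X: features of data, i.e. the feature set to be split
    # Y: labels of data, i.e. the target values to be split
    # test_size: fraction of the samples used for the test set; if an integer, the number of test samples
    # random_state: seed for the random number generator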
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
  ```

  * For example, to predict beverage sales from daily temperature data, pass the daily temperatures as X and the beverage sales as Y
* Build a multiple regression model
  * In this example, we want to use **Avg. Area Income, Avg. Area House Age, Avg. Area Number of Rooms, Avg. Area Number of Bedrooms, and Area Population** to estimate **Price**. A regression model with more than one predictor variable is called a **multiple regression model**

    ```
    X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
    Y = df['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
    ```
  * Build the linear regression model

    ```
    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(X_train, y_train)
    # Print the regression coefficients (one per feature)
    print(lm.coef_)
    # Print the intercept
    print(lm.intercept_)
    ```
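
The split-then-fit steps above can be tried end to end without the CSV file. The following is a minimal sketch on synthetic data, under stated assumptions: the feature names `temperature` and `humidity`, the target `y`, and the coefficients used to generate it are all hypothetical, loosely echoing the temperature and beverage-sales example. It shows how `test_size=0.4` divides the rows and how `coef_` can be paired with the feature names for easier reading.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: y = 5 * temperature - 2 * humidity + 10 + small noise.
rng = np.random.default_rng(42)
n = 100
X = pd.DataFrame({
    "temperature": rng.uniform(10, 35, n),
    "humidity": rng.uniform(30, 90, n),
})
y = 5 * X["temperature"] - 2 * X["humidity"] + 10 + rng.normal(0, 0.5, n)

# 40% of the rows go to the test set; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)
print(len(X_train), len(X_test))  # 60 40

lm = LinearRegression()
lm.fit(X_train, y_train)

# One regression coefficient per feature, labelled with the column names.
coef_table = pd.DataFrame(lm.coef_, index=X.columns, columns=["Coefficient"])
print(coef_table)
print("intercept:", lm.intercept_)
```

Each coefficient estimates the change in the target for a one-unit increase in that feature while the other features are held fixed. Here the fitted values should land close to the 5 and -2 used to generate the data.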

## 6. Making predictions with Scikit-learn

* Predict on the test set

  ```
    predictions = lm.predict(X_test)
  ```
* Plot the results to see how well the predictions agree with the test set
  * Scatter plot

    ```
    plt.scatter(y_test, predictions)
    ```

    ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20下午2.32.16.png)
  * Histogram of the residuals

    ```
    sns.distplot((y_test-predictions))
    ```

    ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20下午2.32.21.png)
* Regression evaluation metrics, used to assess the quality of a linear regression model:
  1. Mean Absolute Error (MAE): the mean of the absolute errors\
     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20下午1.54.52.png)
  2. Mean Squared Error (MSE): the mean of the squared errors\
     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20下午1.54.58.png)
  3. Root Mean Squared Error (RMSE): the square root of the MSE\
     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-16%20下午1.55.03.png)
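
For reference, the three metrics are conventionally defined as follows, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of test samples. This notation is an assumption added here and may differ from the symbols in the formula images.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$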

```
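    from sklearn import metrics
    # A notebook cell only displays its last expression, so print each metric
    print('MAE:', metrics.mean_absolute_error(y_test, predictions))
    print('MSE:', metrics.mean_squared_error(y_test, predictions))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))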
```
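
The three metrics can also be computed directly with NumPy, which makes their definitions concrete. This is a minimal sketch on a tiny hand-made example: the arrays `y_true` and `y_pred` are hypothetical values, not output of the housing model. The results are checked against `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions, chosen so the arithmetic is easy.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred            # [ 1.  0. -2. -1.]

mae = np.mean(np.abs(errors))       # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)          # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                 # sqrt(1.5), about 1.2247

# The hand-computed values should agree with scikit-learn's implementations.
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_pred))

print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize large errors more heavily than MAE does. RMSE is expressed in the same units as the target, which makes it easier to interpret than MSE.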

