# 2.2. Machine Learning Crash Course Jam

![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0UY7lI0xo8-rr6eK%2Ftimeline_20180512_125839.jpg?generation=1586302962347592\&alt=media)

## 1.Basic information

* [Shared notes](https://docs.google.com/document/d/1VearZ0uRzaDodKua7ALpfX2Y10D30hukriAkUPiZJF0/edit)
* [Google MLCC course material](https://developers.google.com/machine-learning/crash-course/ml-intro)
* [Tensorflow colab](https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb?utm_source=mlcc\&utm_campaign=colab-external\&utm_medium=referral\&utm_content=firststeps-colab\&hl=zh-tw#scrollTo=7G12E76-339G)
* [Validation colab](https://colab.research.google.com/notebooks/mlcc/validation.ipynb?utm_source=mlcc\&utm_campaign=colab-external\&utm_medium=referral\&utm_content=validation-colab\&hl=zh-tw)
* [Logistic regression colab](https://colab.research.google.com/notebooks/mlcc/logistic_regression.ipynb?hl=zh-tw)

## 2.Google Machine Learning Course

* 1.[Supervised learning](https://developers.google.com/machine-learning/crash-course/framing/check-your-understanding)
* 2.[Regression](https://developers.google.com/machine-learning/crash-course/descending-into-ml/check-your-understanding): predicts continuous values (e.g., house prices)
  * (1)Algebraic : y = wx + b -> y = w1x1 + w2x2 + b
  * (2)Loss : measures how well the model predicts; the ideal case is a loss of 0
    * Mean squared error (MSE)
    * Root mean squared error (RMSE)
* 3.Reducing loss
  * <https://developers.google.com/machine-learning/crash-course/reducing-loss/check-your-understanding>
  * <https://developers.google.com/machine-learning/crash-course/fitter/graph>
  * The best learning rate depends on the data
    * step size: another name for the learning rate; it can also be adjusted after a number of iterations
    * Stochastic gradient descent (SGD) computes each gradient from a randomly chosen example, whereas gradient descent (GD) uses the whole dataset; SGD is usually more efficient
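
The loss and learning-rate notes above can be illustrated with a small NumPy sketch. It is not taken from the course material: the toy data, the function names `mse` and `gradient_descent`, and the two learning rates are assumptions chosen for illustration. It runs full-batch gradient descent on y = wx + b; an SGD variant would compute each gradient from one randomly chosen example instead.

```python
import numpy as np

# Hypothetical toy data drawn from y = 2x + 1 (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0


def mse(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return float(np.mean((w * x + b - y) ** 2))


def gradient_descent(learning_rate, steps):
    """Full-batch gradient descent on the MSE, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = w * x + b - y
        w -= learning_rate * 2.0 * np.mean(error * x)  # d(MSE)/dw
        b -= learning_rate * 2.0 * np.mean(error)      # d(MSE)/db
    return float(w), float(b)


# A moderate learning rate converges towards w = 2, b = 1.
w, b = gradient_descent(learning_rate=0.05, steps=500)
loss = mse(w, b)
rmse = loss ** 0.5
print("w=%.3f b=%.3f MSE=%.6f RMSE=%.6f" % (w, b, loss, rmse))

# A learning rate that is too large overshoots, so the loss grows instead.
w_bad, b_bad = gradient_descent(learning_rate=0.2, steps=20)
diverged_loss = mse(w_bad, b_bad)
print("initial MSE=%.1f, MSE after 20 large steps=%.1f" % (mse(0.0, 0.0), diverged_loss))
```

On this toy problem the smaller rate recovers the true line while the larger one diverges, which is one way to see why the best learning rate depends on the data.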

## 3.Review Pandas

* [Review Pandas slides](https://drive.google.com/file/d/1oM2FIihYNN8HClrCX7ZTXUpjZoava-Gm/view)
* [Colab](https://colab.research.google.com/notebooks/welcome.ipynb) is an environment that works much like a Jupyter notebook
* Pandas Structure
  * Series
    * Load external data
      * Use head to check that the data was loaded

        ```
        import pandas as pd

        accidents = pd.read_csv('https://raw.githubusercontent.com/kristenchan/Sharing/master/104_accident.csv')
        accidents.head(5)
        ```
      * Check the data types

        ```
        accidents.dtypes
        ```
      * Summary statistics for numeric columns (count, min, max, std, mean)

        ```
        accidents.describe()
        ```
      * Data types in statistics
        * Categorical data
        * Continuous data
          * Convert a column's type

            ```
            accidents['Year'] = accidents.Year.astype('str')
            ```
    * Select data

      ```
      accidents['Age']
      accidents.iloc[:, 9]
      ```
    * Reshuffle
    * Add a column
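
The last two items above (reshuffling and adding a column) have no snippet in the notes. A minimal sketch of one common way to do both in pandas follows. The small table is a hypothetical stand-in for the accidents data, and the `IsSenior` column and the 65 cutoff are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the accidents table; values are made up.
accidents = pd.DataFrame({
    "Year": [104, 104, 104, 104],
    "Age": [23, 45, 31, 67],
})

# Reshuffle: sample every row once in random order, then renumber the index.
shuffled = accidents.sample(frac=1, random_state=0).reset_index(drop=True)

# Add a column derived from an existing one (the cutoff of 65 is an assumption).
shuffled["IsSenior"] = shuffled["Age"] >= 65

print(shuffled)
```

`sample(frac=1)` returns all rows in random order; fixing `random_state` only makes the shuffle repeatable.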

## 4.Tensorflow

* 1.Import libraries

```
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
```

* 2.Load external data

```
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
```

* 3.Randomize the data so that no pathological ordering effects arise (these could harm the performance of stochastic gradient descent)

```
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
```

* 4.Rescale large values (change the unit)

```
california_housing_dataframe["median_house_value"] /= 1000.0
```

* 5.Summary statistics for numeric data

```
california_housing_dataframe.describe()
```

* 6.Build the first model (Linear regression)
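  * (1)Plot the data first to get a feel for it

    ```
    import seaborn as sns
    # Plot the relationship between total rooms and median house value.
    # x and y are passed as keywords; recent seaborn versions require this.
    sns.lmplot(x="total_rooms", y="median_house_value", data=california_housing_dataframe, fit_reg=False)
    ```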
  * (2)Define the input feature (x)

    ```
    # Define the input feature: total_rooms.
    my_feature = california_housing_dataframe[["total_rooms"]]

    # Configure a numeric feature column for total_rooms.
    feature_columns = [tf.feature_column.numeric_column("total_rooms")]
    ```
  * (3)Define the label (y)

    ```
    # Define the label.
    targets = california_housing_dataframe["median_house_value"]
    ```
  * (4)Configure the LinearRegressor

    ```
    # Use gradient descent as the optimizer for training the model.
    # Set a learning rate of 0.0000001 for Gradient Descent.
    my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
    # Clip gradients to a maximum norm of 5.0.
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
    # Configure the linear regression model with our feature columns and optimizer.
    linear_regressor = tf.estimator.LinearRegressor(
      feature_columns=feature_columns,
      optimizer=my_optimizer
    )
    ```
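  * (5)Train the model

    ```
    # my_input_fn is the input function defined in the linked MLCC colab;
    # it is not reproduced in these notes.
    _ = linear_regressor.train(
      input_fn=lambda: my_input_fn(my_feature, targets),
      steps=100
    )
    ```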
  * (6)Predict : check what the model predicts

    ```
    # Create an input function for predictions.
    # Note: Since we're making just one prediction for each example, we don't 
    # need to repeat or shuffle the data here.
    prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

    # Call predict() on the linear_regressor to make predictions.
    predictions = linear_regressor.predict(input_fn=prediction_input_fn)
    ```
  * (7)Compute MSE and RMSE

    ```
    # Format predictions as a NumPy array, so we can calculate error metrics.
    predictions = np.array([item['predictions'][0] for item in predictions])

    # Print Mean Squared Error and Root Mean Squared Error.
    mean_squared_error = metrics.mean_squared_error(predictions, targets)
    root_mean_squared_error = math.sqrt(mean_squared_error)
    print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
    print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)

    min_house_value = california_housing_dataframe["median_house_value"].min()
    max_house_value = california_housing_dataframe["median_house_value"].max()
    min_max_difference = max_house_value - min_house_value

    print("Min. Median House Value: %0.3f" % min_house_value)
    print("Max. Median House Value: %0.3f" % max_house_value)
    print("Difference between Min. and Max.: %0.3f" % min_max_difference)
    print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)
    ```
  * (8)Compare the predictions with the targets

    ```
    calibration_data = pd.DataFrame()
    calibration_data["predictions"] = pd.Series(predictions)
    calibration_data["targets"] = pd.Series(targets)
    calibration_data.describe()
    ```
  * (9)Draw a uniform random sample

    ```
    sample = california_housing_dataframe.sample(n=300)
    ```
  * (10)Draw a scatter plot

    ```
    # Get the min and max total_rooms values.
    x_0 = sample["total_rooms"].min()
    x_1 = sample["total_rooms"].max()

    # Retrieve the final weight and bias generated during training.
    weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
    bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

    # Get the predicted median_house_values for the min and max total_rooms values.
    y_0 = weight * x_0 + bias 
    y_1 = weight * x_1 + bias

    # Plot our regression line from (x_0, y_0) to (x_1, y_1).
    plt.plot([x_0, x_1], [y_0, y_1], c='r')

    # Label the graph axes.
    plt.ylabel("median_house_value")
    plt.xlabel("total_rooms")

    # Plot a scatter plot from our data sample.
    plt.scatter(sample["total_rooms"], sample["median_house_value"])

    # Display graph.
    plt.show()
    ```
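
Step (7) above compares the RMSE with the spread of the targets to judge how large the error is. As a rough sketch of that arithmetic without TensorFlow, the snippet below uses a handful of made-up predictions and targets; the numbers and the name `relative_rmse` are assumptions, not output from the colab.

```python
import math

import numpy as np

# Hypothetical targets and predictions, in thousands of dollars.
targets = np.array([100.0, 200.0, 300.0, 400.0])
predictions = np.array([110.0, 190.0, 330.0, 380.0])

# MSE is the mean of the squared errors; RMSE is its square root,
# which puts the error back in the same unit as the targets.
errors = predictions - targets
mean_squared_error = float(np.mean(errors ** 2))
root_mean_squared_error = math.sqrt(mean_squared_error)

# Compare the RMSE with the range of the targets to judge its size.
target_range = float(targets.max() - targets.min())
relative_rmse = root_mean_squared_error / target_range

print("MSE: %0.3f" % mean_squared_error)
print("RMSE: %0.3f" % root_mean_squared_error)
print("RMSE as a fraction of the target range: %0.3f" % relative_rmse)
```

An RMSE that is a large fraction of the target range suggests the model has learned little yet, which is typically what happens after only 100 steps with a tiny learning rate.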

## 5.Build a predictable model

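* [Build a predictable model slides](https://drive.google.com/file/d/1HzlVbhutRfyMHpGWaqgMzQ9WZQtRm31v/view)
  * 1.How do we fit a model?
    * It should predict unseen data accurately enough
    * Possible states of a model
      * Overfitting
      * Optimum
      * Underfitting
    * Where does future data come from?
      * Split the data into a training set and a testing set
      * How do we carve out the testing set?
        * 80/20 rule: hold out 1/5 of the data as the testing set
      * validation set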
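  * 2.How do we choose suitable features?
    * (1)Feature engineering
      * Pick features that suit the problem
    * (2)Unit conversion
    * (3)Features should not take on "magic values"
      * Do not use -1 to mark unusable data
      * Add a separate flag column instead
    * (4)Do not use opaque codes to record a feature's properties
    * (5)Understand the data and remove outliers
    * (6)Variables that are not continuous can be represented as ranges (bins)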
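  * 3.Classification
    * Probabilistic model
      * Logistic regression: turns a linear model into a probability model
        * y = exp(b + w1x1 + ... + wpxp) > 0
        * p = f(b + w1x1 + ... + wpxp), with 0 < p < 1, where f is the sigmoid f(z) = 1 / (1 + exp(-z))
    * Logistic regression is the simplest neural network (NN)
    * The choice of threshold depends on how much risk is acceptable
      * For the same model, different scenarios call for different thresholds, and the choice affects decisions
      * True Positive (TP): the model predicts positive and the actual label is also positive
      * Besides precision, TP/(TP + FP), recall also has to be considered: TP/(TP + FN)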
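
To make the threshold trade-off concrete, here is a small sketch that turns linear scores into probabilities with the sigmoid and then computes precision, TP/(TP + FP), and recall, TP/(TP + FN), at several thresholds. The scores, labels, and the helper name `precision_recall` are made up for illustration and do not come from the slides.

```python
import numpy as np


def sigmoid(z):
    """Map a linear score z = b + w1*x1 + ... + wp*xp to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def precision_recall(labels, probabilities, threshold):
    """Return (precision, recall) when predicting positive at p >= threshold."""
    predicted = probabilities >= threshold
    actual = labels == 1
    tp = int(np.sum(predicted & actual))
    fp = int(np.sum(predicted & ~actual))
    fn = int(np.sum(~predicted & actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical linear scores and true labels for six examples.
scores = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 3.0])
labels = np.array([0, 0, 1, 0, 1, 1])
probabilities = sigmoid(scores)

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(labels, probabilities, threshold)
    print("threshold=%.1f precision=%.2f recall=%.2f" % (threshold, p, r))
```

On this toy data a low threshold catches every positive at the cost of a false alarm, while a high threshold removes the false alarm but misses one positive. Which side to favor depends on the cost of each kind of mistake.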

### Validation colab

* 1.Import libraries

  ```
  import math

  from IPython import display
  from matplotlib import cm
  from matplotlib import gridspec
  from matplotlib import pyplot as plt
  import numpy as np
  import pandas as pd
  from sklearn import metrics
  import tensorflow as tf
  from tensorflow.python.data import Dataset

  tf.logging.set_verbosity(tf.logging.ERROR)
  pd.options.display.max_rows = 10
  pd.options.display.float_format = '{:.1f}'.format
  ```
* 2.Load external data
* 3.Preprocess the data

```
def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets
```

* 4.Select the training set

```
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()
```

* 5.Select the validation set

```
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()
```
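
Taking the head as training data and the tail as validation data only gives comparable sets if the row order is random. The sketch below illustrates this on a hypothetical table that is assumed to be sorted by longitude; the column values, the sizes, and the helper `head_tail_split` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing table, assumed sorted by longitude.
rng = np.random.RandomState(0)
frame = pd.DataFrame({"longitude": np.sort(rng.uniform(-124.0, -114.0, size=1000))})


def head_tail_split(df, n_train, n_validation):
    """Take the first rows for training and the last rows for validation."""
    return df.head(n_train), df.tail(n_validation)


# Without shuffling, the two sets cover different longitude ranges.
train_sorted, valid_sorted = head_tail_split(frame, 700, 300)
gap_sorted = abs(train_sorted["longitude"].mean() - valid_sorted["longitude"].mean())

# After shuffling the rows, both sets are drawn from the same distribution.
shuffled = frame.reindex(rng.permutation(frame.index))
train_shuffled, valid_shuffled = head_tail_split(shuffled, 700, 300)
gap_shuffled = abs(train_shuffled["longitude"].mean() - valid_shuffled["longitude"].mean())

print("mean longitude gap, sorted:   %.2f" % gap_sorted)
print("mean longitude gap, shuffled: %.2f" % gap_shuffled)
```

Comparing `describe()` output for the two sets, as the colab does, is a quick way to spot this kind of mismatch.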

### Key observations from the colab

* 1.Training data and validation data should have similar characteristics
  * So remember to shuffle first, so that the data is evenly mixed

    ```
    from sklearn.utils import shuffle

    california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

    california_housing_dataframe = shuffle(california_housing_dataframe)
    ```
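  * Overfitting
    * Is it caused by using too many features?
    * Goal: make the loss as small as possible without letting the model become too complex
  * How do we define model complexity?
    * y = b + w1x1 + ... + wpxp
      * The larger w or p is, the more complex the model
    * Model complexity
      * ||w||^2 = w1^2 + ... + wp^2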
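
The complexity measure above can be computed directly. As a hedged sketch, the snippet below also shows one common way to combine it with the data loss, L2 regularization. The notes do not spell this step out, so the `regularized_loss` helper, the weight `lam`, and the example values are all assumptions.

```python
import numpy as np


def l2_complexity(weights):
    """Model complexity as the squared L2 norm: w1^2 + ... + wp^2."""
    return float(np.sum(np.square(weights)))


def regularized_loss(data_loss, weights, lam):
    """Data loss plus lam times the L2 complexity (lam is a hypothetical knob)."""
    return data_loss + lam * l2_complexity(weights)


# Two hypothetical weight vectors with the same number of features.
small_weights = np.array([0.2, -0.5, 0.1])
large_weights = np.array([5.0, -3.0, 4.0])

print("complexity (small weights): %.2f" % l2_complexity(small_weights))
print("complexity (large weights): %.2f" % l2_complexity(large_weights))
print("regularized loss (large weights): %.2f" % regularized_loss(1.0, large_weights, lam=0.1))
```

With the same data loss, the model with larger weights is penalized more, which matches the goal of keeping loss small without an overly complex model.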
