2.2. Machine Learning Crash Course Jam

1.Basic information

2.Google Machine Learning Course

3.Review Pandas

  • Colab is an environment that works much like a Jupyter notebook

  • Pandas Structure

    • Series

      • Read external data

        • Use head to check that the data was read in

          accidents = pd.read_csv('https://raw.githubusercontent.com/kristenchan/Sharing/master/104_accident.csv')
          accidents.head(5)
        • Check the data types

          accidents.dtypes
        • Summary of numeric data (count, min, max, std, mean)

          accidents.describe()
        • Data types in statistics

          • Categorical data

          • Continuous data

            • Convert the data type

              accidents['Year'] = accidents.Year.astype('str')
      • Select data

        accidents['Age']
        accidents.iloc[:, 9]
      • Reshuffle the rows

      • Add a new column
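
The last two items above (reshuffling the rows and adding a column) come without code in the notes. A minimal sketch follows, assuming a small made-up DataFrame in place of `accidents`: the column names `Year` and `Age` mirror the ones used earlier, while the values and the `IsSenior` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the accidents data (values are made up).
df = pd.DataFrame({
    "Year": ["104", "104", "104", "104"],
    "Age": [17, 25, 42, 68],
})

# Reshuffle: reorder the rows with a random permutation of the index.
# A fixed seed keeps the example reproducible.
rng = np.random.RandomState(0)
shuffled = df.reindex(rng.permutation(df.index))

# Add a column: derive a new (hypothetical) column from an existing one.
shuffled["IsSenior"] = shuffled["Age"] >= 65
```

`df.sample(frac=1)` is another common way to shuffle all rows; the `reindex` form is used here because it matches the pattern in the TensorFlow section of these notes.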

4.TensorFlow

  • 1.Import libraries

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
  • 2.Read external data

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
  • 3.Randomize the data so that no pathological ordering remains (which could hurt the performance of stochastic gradient descent)

california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
  • 4.Rescale large values (change the unit to thousands)

california_housing_dataframe["median_house_value"] /= 1000.0
  • 5.Summary statistics for numeric data

california_housing_dataframe.describe()
  • 6.Build the first model (Linear regression)

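    • (1)Plot the data first to get a feel for it

      import seaborn as sns
      # Plot the relationship between total rooms and median house value.
      sns.lmplot(x="total_rooms", y="median_house_value", data=california_housing_dataframe, fit_reg=False)
    • (2)Define the x-axis data (input feature)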

      # Define the input feature: total_rooms.
      my_feature = california_housing_dataframe[["total_rooms"]]
      
      # Configure a numeric feature column for total_rooms.
      feature_columns = [tf.feature_column.numeric_column("total_rooms")]
    • (3)Define the y data (label)

      # Define the label.
      targets = california_housing_dataframe["median_house_value"]
    • (4)Configure the LinearRegressor

      # Use gradient descent as the optimizer for training the model.
      # Set a learning rate of 0.0000001 for Gradient Descent.
      my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
      my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
      # Configure the linear regression model with our feature columns and optimizer.
      linear_regressor = tf.estimator.LinearRegressor(
        feature_columns=feature_columns,
        optimizer=my_optimizer
      )
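    • (5)Train the model

      # my_input_fn is the input function from the course notebook;
      # its definition is not reproduced in these notes.
      _ = linear_regressor.train(
        input_fn=lambda: my_input_fn(my_feature, targets),
        steps=100
      )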
    • (6)Predict: see how the model's predictions look

      # Create an input function for predictions.
      # Note: Since we're making just one prediction for each example, we don't
      # need to repeat or shuffle the data here.
      prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
      
      # Call predict() on the linear_regressor to make predictions.
      predictions = linear_regressor.predict(input_fn=prediction_input_fn)
    • (7)Compute MSE and RMSE

      # Format predictions as a NumPy array, so we can calculate error metrics.
      predictions = np.array([item['predictions'][0] for item in predictions])
      
      # Print Mean Squared Error and Root Mean Squared Error.
      mean_squared_error = metrics.mean_squared_error(predictions, targets)
      root_mean_squared_error = math.sqrt(mean_squared_error)
      print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
      print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)
      
      min_house_value = california_housing_dataframe["median_house_value"].min()
      max_house_value = california_housing_dataframe["median_house_value"].max()
      min_max_difference = max_house_value - min_house_value
      
      print("Min. Median House Value: %0.3f" % min_house_value)
      print("Max. Median House Value: %0.3f" % max_house_value)
      print("Difference between Min. and Max.: %0.3f" % min_max_difference)
      print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)
    • (8)Compare the predictions with the targets

      calibration_data = pd.DataFrame()
      calibration_data["predictions"] = pd.Series(predictions)
      calibration_data["targets"] = pd.Series(targets)
      calibration_data.describe()
    • (9)Draw a uniform random sample

      sample = california_housing_dataframe.sample(n=300)
    • (10)Draw the scatter plot

      # Get the min and max total_rooms values.
      x_0 = sample["total_rooms"].min()
      x_1 = sample["total_rooms"].max()
      
      # Retrieve the final weight and bias generated during training.
      weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
      bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')
      
      # Get the predicted median_house_values for the min and max total_rooms values.
      y_0 = weight * x_0 + bias 
      y_1 = weight * x_1 + bias
      
      # Plot our regression line from (x_0, y_0) to (x_1, y_1).
      plt.plot([x_0, x_1], [y_0, y_1], c='r')
      
      # Label the graph axes.
      plt.ylabel("median_house_value")
      plt.xlabel("total_rooms")
      
      # Plot a scatter plot from our data sample.
      plt.scatter(sample["total_rooms"], sample["median_house_value"])
      
      # Display graph.
      plt.show()
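
To make the configure, train, predict, and RMSE steps above concrete without TensorFlow, here is a minimal pure-Python sketch of fitting one weight and one bias by gradient descent. It is not the `LinearRegressor` implementation: it uses full-batch updates instead of the course's input function, and the data points, learning rate, and step count are made up for illustration.

```python
import math


def train_linear(xs, ys, learning_rate, steps):
    """Fit y = weight * x + bias by full-batch gradient descent on MSE."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error w.r.t. weight and bias.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    return math.sqrt(mse)


# Made-up data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

weight, bias = train_linear(xs, ys, learning_rate=0.05, steps=2000)
predictions = [weight * x + bias for x in xs]
error = rmse(predictions, ys)
```

In this sketch the largest stable learning rate shrinks as the feature values grow, which is one plausible reason the notes use a very small rate (0.0000001) for `total_rooms`, whose values run into the thousands.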

5.Build a predictable model

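  • Build a predictable model slides

    • 1.How do we fit a model?

      • Predictions on unseen data should be accurate enough

      • Model states

        • Overfitting

        • Optimum

        • Underfitting

      • Where does future data come from?

        • Split the data into a training set and a testing set

        • How should the testing set be split off?

          • 80/20 rule: hold out 1/5 of the data as the testing set

        • validation set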

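    • 2.How do we choose suitable features?

      • (1)Feature engineering

        • Choose suitable features

      • (2)Unit conversion

      • (3)Features should not take on "magic values"

        • Do not use -1 to represent unavailable data

        • Add a separate flag column instead

      • (4)Do not record a feature's properties as opaque codes

      • (5)Understand the data and remove outliers

      • (6)For variables that are not continuous, value ranges (bins) can be used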

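    • 3.Classification

      • Probabilistic model

        • Logistic regression: converts a linear model into a probabilistic model

          • y = exp(b + w1x1 + ... + wpxp) > 0

          • p = f(b + w1x1 + ... + wpxp) = y / (1 + y), so 0 < p < 1

      • Logistic regression is the simplest neural network (NN)

      • The choice of threshold depends on how much risk can be tolerated

        • The same model needs different thresholds in different scenarios, and the threshold affects the decision

        • True Positive (TP): the model says true and the actual value is also true

        • Besides precision, TP/(TP + FP), recall must also be considered: TP/(TP + FN)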
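
The classification notes above can be tried out in a few lines of plain Python. This is a minimal sketch under stated assumptions: the linear scores and labels are made up, and `precision_recall` is a hypothetical helper written for this example (precision is TP/(TP + FP), recall is TP/(TP + FN)).

```python
import math


def sigmoid(z):
    """Map any real-valued linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def precision_recall(probabilities, labels, threshold):
    """Count TP, FP, FN at the given threshold; return (precision, recall)."""
    tp = fp = fn = 0
    for p, label in zip(probabilities, labels):
        predicted = p >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up linear scores (b + w1x1 + ... + wpxp) and true labels.
scores = [-2.0, -0.5, 0.3, 1.0, 2.5]
labels = [0, 1, 0, 1, 1]
probabilities = [sigmoid(z) for z in scores]

# The same model evaluated at two different thresholds.
low = precision_recall(probabilities, labels, threshold=0.3)
high = precision_recall(probabilities, labels, threshold=0.7)
```

With these made-up numbers, the lower threshold catches every positive example at the cost of one false positive, while the higher threshold produces no false positives but misses one positive. That is the trade-off behind choosing a threshold according to risk.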

Validation Colab

  • 1.Import libraries

      import math
    
      from IPython import display
      from matplotlib import cm
      from matplotlib import gridspec
      from matplotlib import pyplot as plt
      import numpy as np
      import pandas as pd
      from sklearn import metrics
      import tensorflow as tf
      from tensorflow.python.data import Dataset
    
      tf.logging.set_verbosity(tf.logging.ERROR)
      pd.options.display.max_rows = 10
      pd.options.display.float_format = '{:.1f}'.format
  • 2.Read external data

  • 3.Preprocess the data

def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets

  • 4.Select the training set

training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()

  • 5.Select the validation set

validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()
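
Taking the training set with `head` and the validation set with `tail` only gives comparable sets if the row order is random. The sketch below illustrates this on made-up data that is deliberately sorted by one column; the 100-row DataFrame, the 80/20 split sizes, and the fixed seed are assumptions for the example, not values from the course data.

```python
import numpy as np
import pandas as pd

# Made-up data, sorted by the feature on purpose.
df = pd.DataFrame({"longitude": np.linspace(-124.0, -114.0, 100)})

# Split without shuffling: head for training, tail for validation.
train_sorted = df.head(80)
valid_sorted = df.tail(20)

# Shuffle first (fixed seed for reproducibility), then split the same way.
rng = np.random.RandomState(0)
shuffled = df.reindex(rng.permutation(df.index))
train_shuffled = shuffled.head(80)
valid_shuffled = shuffled.tail(20)

# Distance between the training mean and the validation mean in each case.
gap_sorted = abs(train_sorted["longitude"].mean()
                 - valid_sorted["longitude"].mean())
gap_shuffled = abs(train_shuffled["longitude"].mean()
                   - valid_shuffled["longitude"].mean())
```

Comparing `gap_sorted` with `gap_shuffled` (or the two `describe()` outputs) is a quick check that the training and validation sets come from the same distribution.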

Key observations from the Colab

  • 1.Training data and validation data should have similar characteristics

    • So remember to shuffle first so that the data is evenly mixed

      from sklearn.utils import shuffle
      
      california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
      
      california_housing_dataframe = shuffle(california_housing_dataframe)
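    • overfitting

      • Is it the result of including too many features?

      • The goal is to make the loss as small as possible without making the model too complex

    • How do we define model complexity?

      • y = b + w1x1 + ... + wpxp

        • The larger w or p is, the more complex the model

      • Model complexity

        • ||w||^2 = w1^2 + ... + wp^2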
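
The complexity measure above is easy to compute directly. The sketch below also adds it to a data loss with a weighting factor `lam`, which is the usual L2 regularization form; the notes imply this trade-off between loss and complexity but do not write it out, so `regularized_loss`, `lam`, and the example weights are assumptions for illustration.

```python
def l2_complexity(weights):
    """Return ||w||^2 = w1^2 + ... + wp^2."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Add the complexity penalty, scaled by lam, to the data loss."""
    return data_loss + lam * l2_complexity(weights)


# Two hypothetical models with the same data loss but different weights.
simple_weights = [0.5, 0.0, 0.0]
complex_weights = [3.0, -2.0, 1.0]

simple_total = regularized_loss(1.0, simple_weights, lam=0.1)
complex_total = regularized_loss(1.0, complex_weights, lam=0.1)
```

With equal data loss, the model with the larger weights ends up with the larger total, so minimizing the combined value favors the simpler model.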
