2.2. Machine Learning Crash Course Jam

1.Basic information

2.Google Machine Learning Course

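1.Supervised learning
2.Regression: predicts continuous values (e.g., house prices)
- (1)Algebraic form: y = wx + b -> y = w1x1 + w2x2 + b
- (2)Loss: measures how far the model's predictions are from the true values; the ideal loss is 0
  - Mean squared error (MSE)
  - Root mean squared error (RMSE)
3.Reducing loss
- https://developers.google.com/machine-learning/crash-course/reducing-loss/check-your-understanding
- https://developers.google.com/machine-learning/crash-course/fitter/graph
- The best learning rate depends on the data
  - Step size: another name for the learning rate; it scales how far each update moves along the negative gradient
  - Stochastic gradient descent (SGD) vs. gradient descent (GD): SGD estimates the gradient from a single randomly chosen example per step, so each step is much cheaper than a full pass over the data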
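The ideas in this list (a linear model, MSE/RMSE as the loss, and gradient descent with a learning rate) can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation: the toy data, the learning rate of 0.05, and the helper names (predict, mse, gradient, train) are all assumptions made for the example.

```python
import math
import random


def predict(w, b, x):
    """Linear model y = w*x + b."""
    return w * x + b


def mse(w, b, xs, ys):
    """Mean squared error over all examples."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, b, xs, ys):
    """Gradient of the MSE with respect to w and b on the given examples."""
    n = len(xs)
    dw = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return dw, db


def train(xs, ys, learning_rate, steps, stochastic=False, seed=0):
    """Run GD (full dataset per step) or SGD (one random example per step)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        if stochastic:
            i = rng.randrange(len(xs))
            dw, db = gradient(w, b, [xs[i]], [ys[i]])
        else:
            dw, db = gradient(w, b, xs, ys)
        # The learning rate (step size) scales each update.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w_gd, b_gd = train(xs, ys, learning_rate=0.05, steps=2000)
w_sgd, b_sgd = train(xs, ys, learning_rate=0.05, steps=2000, stochastic=True)

rmse_gd = math.sqrt(mse(w_gd, b_gd, xs, ys))
rmse_sgd = math.sqrt(mse(w_sgd, b_sgd, xs, ys))
print("GD :", w_gd, b_gd, rmse_gd)
print("SGD:", w_sgd, b_sgd, rmse_sgd)
```

Both variants should recover a weight near 2 and a bias near 1 on this noise-free data. On real data the usable learning rate has to be found by experiment, as noted above.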

3.Review Pandas

Review Pandas slides
Colab is an environment much like Jupyter Notebook
Pandas Structure
- Series
  - Reading external data
    Use head() to check that the data loaded
    accidents = pd.read_csv('https://raw.githubusercontent.com/kristenchan/Sharing/master/104_accident.csv')
    accidents.head(5)
    Check the data types
    accidents.dtypes
    Summarize numeric columns (count, min, max, std, mean)
    accidents.describe()
    Statistical data types
    Categorical data
    Continuous data
    Convert a column's type
    accidents['Year'] = accidents.Year.astype('str')
  - Selecting data
    accidents['Age']
    accidents.iloc[:, 9]
  - Shuffling rows
  - Adding columns

4.Tensorflow

1.Import libraries

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

2.Load external data

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

3.Randomize the row order so that no pathological ordering remains (ordered data can hurt stochastic gradient descent)

california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

4.Rescale large values (change the unit to thousands of dollars)

california_housing_dataframe["median_house_value"] /= 1000.0

5.Summary statistics for the numeric columns

california_housing_dataframe.describe()

6.Build the first model (Linear regression)

(1)Plot the data first to get a feel for it

import seaborn as sns
# Plot total rooms against median house value.
sns.lmplot(x="total_rooms", y="median_house_value", data=california_housing_dataframe, fit_reg=False)

(2)Define the input feature (x)

# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

(3)Define the label (y)

# Define the label.
targets = california_housing_dataframe["median_house_value"]

(4)Configure the LinearRegressor

# Use gradient descent as the optimizer for training the model,
# with a learning rate of 0.0000001.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Clip gradients to a maximum norm of 5.0 so that a single update cannot blow up.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
  feature_columns=feature_columns,
  optimizer=my_optimizer
)

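(5)Train the model

# my_input_fn is the input function from the course exercise; its definition
# is not reproduced in these notes.
_ = linear_regressor.train(
  input_fn=lambda: my_input_fn(my_feature, targets),
  steps=100
)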

(6)Predict: see what the model predicts

# Create an input function for predictions.
# Note: Since we're making just one prediction for each example, we don't
# need to repeat or shuffle the data here.
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)

(7)Compute MSE and RMSE

# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])

# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)

min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value

print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)

(8)Compare the predictions with the targets

calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
calibration_data.describe()

(9)Draw a uniform random sample

sample = california_housing_dataframe.sample(n=300)

(10)Draw the scatter plot

# Get the min and max total_rooms values.
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()

# Retrieve the final weight and bias generated during training.
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

# Get the predicted median_house_values for the min and max total_rooms values.
y_0 = weight * x_0 + bias 
y_1 = weight * x_1 + bias

# Plot our regression line from (x_0, y_0) to (x_1, y_1).
plt.plot([x_0, x_1], [y_0, y_1], c='r')

# Label the graph axes.
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")

# Plot a scatter plot from our data sample.
plt.scatter(sample["total_rooms"], sample["median_house_value"])

# Display graph.
plt.show()

5.Build a predictable model

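Build a predictable model slides
- 1.How do we fit a model?
  - The model should predict unseen data accurately enough
  - Possible model states
    Overfitting
    Optimum
    Underfitting
  - Where does future data come from?
    Split the data into a training set and a testing set
    How should the testing set be cut?
    80/20 rule: hold out 1/5 of the data as the testing set
    Validation set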
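- 2.How do we choose suitable features?
  - (1)Feature engineering
    Pick features that suit the problem
  - (2)Unit conversion
  - (3)Features should not take on "magic values"
    Do not use -1 to mark unusable data
    Add a separate flag column instead
  - (4)Do not encode a feature's characteristics as opaque codes
  - (5)Understand the data and remove outliers
  - (6)Variables that are not continuous can be grouped into value ranges (bins)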
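- 3.Classification
  - Probabilistic model
    Logistic regression: turns a linear model into a probability model
    z = b + w1x1 + ... + wpxp, and exp(z) > 0
    p = f(z) = 1 / (1 + exp(-z)), so 0 < p < 1
  - Logistic regression is the simplest neural network (NN)
  - The choice of threshold depends on how much risk you can accept
    The same model needs different thresholds in different scenarios, and the threshold changes the decisions made
    True Positive (TP): the model predicts positive and the example is actually positive
    Besides precision, TP/(TP + FP), also consider recall: TP/(TP + FN)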
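The classification notes above can be illustrated with a small sketch: a sigmoid turns the linear score into a probability, and moving the threshold trades precision against recall. The probabilities, labels, thresholds, and function names below are hypothetical values chosen for illustration only.

```python
import math


def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Logistic regression: sigmoid applied to b + w1*x1 + ... + wp*xp."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def precision_recall(probabilities, labels, threshold):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN) at the given threshold."""
    tp = fp = fn = 0
    for p, y in zip(probabilities, labels):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model outputs and true labels (1 = positive).
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

low = precision_recall(probs, labels, threshold=0.35)
high = precision_recall(probs, labels, threshold=0.70)
print("threshold 0.35 -> precision, recall:", low)
print("threshold 0.70 -> precision, recall:", high)
```

On this toy data the lower threshold catches every positive at the cost of one false alarm, while the higher threshold raises no false alarms but misses one positive. Which trade-off is acceptable depends on the scenario.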

Validation Colab

1.Import libraries

  import math

  from IPython import display
  from matplotlib import cm
  from matplotlib import gridspec
  from matplotlib import pyplot as plt
  import numpy as np
  import pandas as pd
  from sklearn import metrics
  import tensorflow as tf
  from tensorflow.python.data import Dataset

  tf.logging.set_verbosity(tf.logging.ERROR)
  pd.options.display.max_rows = 10
  pd.options.display.float_format = '{:.1f}'.format

2.Load external data
3.Preprocess the data

def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets

4.Select the training set

training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()

5.Select the validation set

validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()
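Taking the first 12000 rows for training and the last 5000 for validation only gives comparable sets if the row order carries no pattern. The sketch below uses made-up, deliberately sorted numbers (not the housing data) to show how far apart the two sets can drift without shuffling; split_head_tail is a hypothetical helper that mimics head()/tail().

```python
import random
import statistics


def split_head_tail(rows, n_train, n_valid):
    """Mimic DataFrame.head(n_train) and DataFrame.tail(n_valid) on a list."""
    return rows[:n_train], rows[-n_valid:]


# Hypothetical feature column whose values grow with the row index,
# standing in for a file that is sorted by one of its columns.
rows = [float(i) for i in range(17000)]

# Split without shuffling: the two sets cover different value ranges.
train_sorted, valid_sorted = split_head_tail(rows, 12000, 5000)

# Shuffle first, then split: both sets sample the whole range.
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)
train_shuf, valid_shuf = split_head_tail(shuffled, 12000, 5000)

gap_sorted = abs(statistics.mean(train_sorted) - statistics.mean(valid_sorted))
gap_shuffled = abs(statistics.mean(train_shuf) - statistics.mean(valid_shuf))
print("mean gap without shuffling:", gap_sorted)
print("mean gap with shuffling:   ", gap_shuffled)
```

Comparing describe() output for the training and validation sets, as done above, is the same kind of check on the real data.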

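Key observations from the Colab

1.Training data and validation data should have similar characteristics
- So remember to shuffle the rows first so that the data is evenly mixed
  from sklearn.utils import shuffle
  california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
  california_housing_dataframe = shuffle(california_housing_dataframe)
- Overfitting
  - Is it caused by including too many features?
  - Goal: keep the loss small without letting the model become too complex
- How is model complexity defined?
  - y = b + w1x1 + ... + wpxp
    The larger the weights w or the number of features p, the more complex the model
  - Model complexity
    ||w||^2 = w1^2 + ... + wp^2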
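The complexity measure ||w||^2 is easy to compute directly. The sketch below also adds it to a data loss with a weighting factor lam, which is one common way to express "small loss, but not too complex". The factor lam, the function names, and the example weights and losses are assumptions for illustration and do not come from the notes.

```python
def l2_penalty(weights):
    """Model complexity as the squared L2 norm: w1^2 + ... + wp^2."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Data loss plus lam times the complexity penalty (lam is assumed)."""
    return data_loss + lam * l2_penalty(weights)


# Two hypothetical models: few small weights vs. more and larger weights.
simple_weights = [0.5, -0.2]
complex_weights = [3.0, -4.0, 2.5]

print("complexity (simple): ", l2_penalty(simple_weights))
print("complexity (complex):", l2_penalty(complex_weights))

# Even with a slightly lower data loss, the complex model scores worse
# once the penalty is included.
print(regularized_loss(1.2, simple_weights, lam=0.1))
print(regularized_loss(1.0, complex_weights, lam=0.1))
```

Both larger weights and more features increase the penalty, matching the note that a larger w or p means a more complex model.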
